Model selection and hypothesis testing for large-scale network models with overlapping groups


Tiago P. Peixoto tiago@itp.uni-bremen.de Institut für Theoretische Physik, Universität Bremen, Hochschulring 18, D-28359 Bremen, Germany
Abstract

The effort to understand network systems in increasing detail has resulted in a diversity of methods designed to extract their large-scale structure from data. Unfortunately, many of these methods yield diverging descriptions of the same network, making both the comparison and understanding of their results a difficult challenge. A possible solution to this outstanding issue is to shift the focus away from ad hoc methods and move towards more principled approaches based on statistical inference of generative models. As a result, we face instead the more well-defined task of selecting between competing generative processes, which can be done under a unified probabilistic framework. Here, we consider the comparison between a variety of generative models including features such as degree correction, where nodes with arbitrary degrees can belong to the same group, and community overlap, where nodes are allowed to belong to more than one group. Because such model variants possess an increasing number of parameters, they become prone to overfitting. In this work, we present a method of model selection based on the minimum description length criterion and posterior odds ratios that is capable of fully accounting for the increased degrees of freedom of the larger models, and selects the best one according to the statistical evidence available in the data. In applying this method to many empirical unweighted networks from different fields, we observe that community overlap is very often not supported by statistical evidence and is selected as a better model only for a minority of them. On the other hand, we find that degree correction tends to be almost universally favored by the available data, implying that intrinsic node properties (as opposed to group properties) are often an essential ingredient of network formation.

PACS numbers: 89.75.Hc, 02.50.Tt, 89.70.Cf

I Introduction

Many networks possess nontrivial large-scale structures such as communities Newman (2011); Fortunato (2010), core-peripheries Holme (2005); Rombach et al. (2012), bipartitions Larremore et al. (2014) and hierarchies Clauset et al. (2008); Peixoto (2014a). These structures presumably reflect the organizational principles behind network formation. Furthermore, their detection can be used to predict missing links Clauset et al. (2008); Guimerà and Sales-Pardo (2009) or detect spurious ones Guimerà and Sales-Pardo (2009), as well as to determine the robustness of the system to failure or intentional damage Buldyrev et al. (2010), the outcome of the spread of epidemics Apolloni et al. (2014) and functional classification Guimerà and Nunes Amaral (2005), among many other applications. The detail with which such modular features are both represented and detected reflects directly on the quality of these tasks. However, the methods proposed so far for uncovering such structures in empirical data differ widely in their suitability to the aforementioned tasks. Many authors have constructed algorithms which attempt to divide the network into groups according to some metric devised specifically for this purpose. Examples of this include modularity Newman (2006a), betweenness Girvan and Newman (2002), link similarity Ahn et al. (2010), clique percolation Palla et al. (2005), encoding of random walks Rosvall and Bergstrom (2008), and many more Fortunato (2010). Unfortunately, many of these methods will result in diverging descriptions for the same network. Furthermore, the information they obtain cannot be easily used to generalize the data and make predictions Clauset et al. (2008); Guimerà and Sales-Pardo (2009). Alternatively, other authors have focused on constructing generative models that encode the large-scale structure as parameters, which can then be inferred from empirical data (e.g. Holland et al. (1983); Airoldi et al. (2008); Karrer and Newman (2011); Ball et al. (2011)). These methods not only represent a more principled and rigorous stance, but they can also demonstrably overcome inherent limitations of more ad hoc methods Peixoto (2014a). Furthermore, they can be used to generalize the data and make predictions Clauset et al. (2008); Guimerà and Sales-Pardo (2009). Both approaches, however, suffer from a common fundamental problem, namely the difficulty in deciding which detection method or generative model provides a more appropriate description of a given network. This issue tends to escalate as more elaborate models and methods are developed, including features such as degree correction Karrer and Newman (2011), community overlap Palla et al. (2005); Airoldi et al. (2008); Ahn et al. (2010); Ball et al. (2011), hierarchical structure Clauset et al. (2008); Lancichinetti et al. (2009); Peixoto (2014a), self-similarity Palla et al. (2010); Leskovec et al. (2010a), bipartiteness Larremore et al. (2014), edge and node correlates Mariadassou et al. (2010); Aicher et al. (2014), social tiers Ball and Newman (2013), multilayer structure Kivelä et al. (2014), and temporal information Fu et al. (2009), to name only a few. Although such developments are essential, they should be made with care, since increasing the complexity of the network description may lead to artificial results caused by overfitting.
While this is a well-understood phenomenon when dealing with independent data or time series, open problems remain when the empirical data are a network, for which many common assumptions no longer hold and the usual methods perform very poorly Yan et al. (2014). This problem is significantly exacerbated when methods are used which make no attempt to assess the statistical significance of the results. Unfortunately, most methods that are not based on generative models fall into this class. Although for certain specially constructed examples some direct connections between statistical inference and ad hoc methods can be made Newman (2013a, b), and in the case of some spectral methods a much deeper connection seems to exist Nadakuditi and Newman (2012); Krzakala et al. (2013), they still inherently lack the capacity to reliably distinguish signal from noise. Furthermore — what is perhaps even more important — these different methods cannot easily be compared to each other. For example, suppose that for the same network a nonoverlapping partition is found by compressing random walks, another overlapping partition is obtained with clique percolation, and yet another with a local method based on link similarity (all of which are methods not based on generative models). Most of the time, these three partitions will be very different, and yet there is no obvious way to decide which one is a more faithful representation of the network. Although methods such as network benchmarks Lancichinetti et al. (2008); Lancichinetti and Fortunato (2009a, b) and perturbation analysis Karrer et al. (2007) have been developed in order to alleviate this issue, they have only limited applicability to the larger problem. Namely, network benchmarks cannot be used when an appropriate representation of an empirical network is not known, and if one wants to decide, for instance, if the network possesses overlapping groups or not. In a similar vein, perturbation analysis provides information about the significance of results originating from a single algorithm, which cannot be directly used to compare two very different ones.

On the other hand, the situation is different if one focuses on generative models alone. Since in this context the same problem is posed in a probabilistic framework, comparison between models is possible, even if the models are very different. And since models can be designed to accommodate arbitrary topological features, we lose no explanatory power when comparing to the ad hoc approaches. We show in this work that this central issue can be tackled in a consistent and principled manner by performing model selection based on statistical evidence. In particular, we employ the minimum description length principle (MDL) Grünwald (2007); Rissanen (2010), which seeks to minimize the total information necessary to describe the observed data as well as the model parameters. This can be equivalently formulated as the maximization of a Bayesian posterior likelihood which includes noninformative priors on the parameters, from which a posterior odds ratio between different hypotheses can be computed, yielding a degree of confidence for a model to be rejected in favor of another. We focus on the stochastic block model as the underlying generative model, as well as variants that include degree correction and mixed memberships. We show that with these models MDL can be used to produce a very efficient algorithm that scales well for very large networks and with an arbitrarily large number of groups. Furthermore, we apply the method to a wide variety of empirical network data sets, and we show that community overlaps are seldom selected as the most appropriate model. This casts doubt on the claimed pervasiveness of group overlaps Palla et al. (2005); Ahn et al. (2010), obtained predominantly with nonstatistical methods, which should perhaps be interpreted as an artifact of using methods with more degrees of freedom, instead of an underlying property of many systems — at least as long as there is a lack of corroborating evidence supporting the overlap (such as, potentially, edge weights Aicher et al. (2014); Rosvall et al. (2014) or multilayer structure Kivelä et al. (2014), which we do not consider here). On the other hand, we find that degree correction tends to be selected for a significant majority of systems, implying that individual node “fitness” that is not uniformly inherited by group membership is a fundamental aspect of network formation.

This paper is divided as follows. In Sec. II we present the generative models considered, and in Sec. III we describe the model selection procedure based on MDL. In Sec. IV we present the results for a variety of empirical networks. In Sec. V we analyze the general identifiability limits of the overlapping models, and in Sec. VI we describe in detail the inference algorithm used. In Sec. VII we conclude with a discussion.

II Generative models for network structure

A generative model is one which attributes to each possible graph G a probability P(G|θ) for it to be observed, conditioned on some set of parameters θ. Here we will be restricted to discrete uniform models, where specific choices of θ prohibit some graphs from occurring, but those which are allowed to occur have the same probability. For these models we can write P(G|θ) = 1/Ω(θ) ≡ exp(-S), with Ω(θ) being the total number of possible graphs compatible with a given choice of parameters, and S = ln Ω(θ) is the entropy of this constrained ensemble Bianconi (2009); Peixoto (2012). In order to infer the parameters via maximum likelihood, we need to maximize P(G|θ), or equivalently, minimize S. This approach, however, cannot be used if the order of the model is unknown, i.e. the number of degrees of freedom in the parameter set θ, since choices with higher order will almost always increase the likelihood, resulting in overfitting. For the same reason, maximum likelihood cannot be used to distinguish between models belonging to different classes, since models with larger degrees of freedom will inherently lead to larger likelihoods. In order to avoid overfitting, one needs to maximize instead the Bayesian posterior probability P(θ|G) = P(G|θ)P(θ)/P(G), with P(G) being a normalizing constant. The prior probability P(θ), which encodes our a priori knowledge of the parameters (if any), should inherently become smaller if the number of degrees of freedom increases. We will also be restricted to discrete parameters with uniform prior probabilities, so that P(θ) = exp(-ℒ), with ℒ being the entropy of the ensemble of possible parameter choices. We can thus write the total posterior likelihood as P(θ|G) ∝ exp(-Σ), with Σ = S + ℒ. The value Σ is the description length of the data Grünwald (2007); Rissanen (2010), i.e. the total amount of information required to describe the observed data conditioned on a set of parameters as well as the parameter set itself Rosvall and Bergstrom (2007). Hence, if we maximize P(θ|G) we are automatically finding the parameter choice that compresses the data most, since it will also minimize its description length Σ. Because of this, there is no difference between specifying probabilistic models for both G and θ, or encoding schemes that quantify the amount of information necessary to describe both. In the following, we will make use of both terminologies interchangeably, whenever most appropriate.
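To make the selection criterion concrete, the following is a minimal illustrative sketch (not the inference algorithm of this paper) of how the description length Σ = S + ℒ would be used to compare already-fitted candidate models; the model names and all numbers are hypothetical.

```python
import math

# Hypothetical candidate fits of the same network: each entry holds the ensemble
# entropy S = -ln P(G|theta) and the parameter information L = -ln P(theta), in nats.
candidates = {
    "nonoverlapping, non-degree-corrected": {"S": 11800.0, "L": 350.0},
    "nonoverlapping, degree-corrected":     {"S": 11350.0, "L": 640.0},
    "overlapping, degree-corrected":        {"S": 11300.0, "L": 780.0},
}

# Description length Sigma = S + L; maximizing the posterior P(theta|G) ~ exp(-Sigma)
# is the same as picking the candidate with the smallest Sigma.
sigma = {name: v["S"] + v["L"] for name, v in candidates.items()}
best = min(sigma, key=sigma.get)
print("selected model:", best)
for name, s in sorted(sigma.items(), key=lambda x: x[1]):
    print(f"{name}: Sigma = {s:.1f}, posterior odds vs. best = {math.exp(-(s - sigma[best])):.2e}")
```

The point of the sketch is only that a larger model must "pay" for its extra parameters through ℒ, so it is selected only when the corresponding decrease in S more than compensates for this cost.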

II.1 Overlapping model without degree correction

The main feature we want to consider in our generative model is the existence of well-defined groups of nodes, which are connected to other groups with arbitrary probabilities, such that nodes belonging to the same group play a similar role in the large-scale network structure. We also want to include the possibility of nodes belonging to more than one group, and in so doing inheriting the topological properties of all groups to which they belong. In order to implement this, we consider a simple variation of the stochastic block model Holland et al. (1983); Fienberg et al. (1985); Faust and Wasserman (1992); Anderson et al. (1992) with N nodes and E edges, where the nodes can belong to B different groups. Hence, to each node i we attribute a binary mixture vector b_i with B entries, where a given entry b_i^r specifies whether or not the node belongs to block r. In addition to this overlapping partition, we simply define the edge-count matrix e_rs, which specifies how many edges are placed between nodes belonging to blocks r and s (or twice that number for r = s, for convenience of notation), so that Σ_rs e_rs = 2E. This simple definition allows one to generate a broad variety of overlapping patterns, which are not confined to purely assortative structures, and the nonoverlapping model can be recovered as a special case, simply by putting each node in a single group.

The likelihood of observing a given graph G with the above constraints is simply P(G | {b_i}, e_rs) = 1/Ω, where Ω is the number of possible graphs, and S = ln Ω is the associated ensemble entropy. In this construction, the existence of multiple edges is allowed. However, multiple edges between any given pair of nodes of blocks r and s become increasingly improbable as the ratio e_rs/(n_r n_s) decreases, where n_r is the number of nodes which belong to block r, i.e. n_r = Σ_i b_i^r (note that Σ_r n_r ≥ N). Since here we are predominantly interested in the sparse situation where e_rs ≪ n_r n_s, the probability of observing parallel edges vanishes in the limit of large networks, and hence they can be neglected. Making use of this simplification, we may approximately count all possible graphs generated by the parameters as the number of graphs where each distinct membership of a single node is considered to be a different node with a single membership. This corresponds to an augmented graph generated via a nonoverlapping block model with Σ_r n_r ≥ N nodes, but with the same matrix e_rs, for which the entropy is Peixoto (2012)

\mathcal{S}_t \simeq E - \frac{1}{2}\sum_{rs} e_{rs} \ln\frac{e_{rs}}{n_r n_s},    (1)

where n_r ≫ 1 was assumed. Under this formulation, we recover trivially the single-membership case simply by assigning each node to a single group, since Eq. 1 remains the same in that special case. It is possible to remove the approximation that no parallel edges occur by defining the model somewhat differently, as shown in Appendix B.1, in which case Eq. 1 holds exactly as long as no parallel edges are observed.
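As an illustration, the sparse entropy of Eq. 1 can be evaluated directly from the membership vectors and the edge list. The sketch below is a simplified numerical reading of that formula, not the paper's implementation: node memberships are given as Python sets, and the latent group label of each half-edge is assigned uniformly at random as a stand-in for an inferred labeling.

```python
import numpy as np

def overlap_entropy(edges, memberships, B):
    """Approximate S_t of Eq. (1): E - 1/2 sum_rs e_rs ln(e_rs / (n_r n_s)).

    edges       : iterable of (i, j) node pairs
    memberships : list of sets, memberships[i] = groups of node i
    """
    rng = np.random.default_rng(42)
    n = np.zeros(B)                          # n_r = sum_i b_i^r
    for bs in memberships:
        for r in bs:
            n[r] += 1
    e = np.zeros((B, B))                     # edge counts e_rs (diagonal counted twice)
    E = 0
    for i, j in edges:
        r = rng.choice(list(memberships[i])) # stand-in for the latent half-edge label
        s = rng.choice(list(memberships[j]))
        e[r, s] += 1
        e[s, r] += 1
        E += 1
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(e > 0, e * np.log(e / np.outer(n, n)), 0.0)
    return E - 0.5 * terms.sum()
```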

Like its nonoverlapping counterpart, the block model without degree correction assumes that nodes belonging to the same group will receive approximately the same number of edges of each type. Hence, when applied to empirical data, the modules discovered will also tend to have this property. This means that if the graph possesses large degree variability, the groups inferred will tend to correspond to different degree classes Karrer and Newman (2011). In a similar vein, if a node belongs to more than one group, it will also tend to have a total degree that is larger than nodes that belong to either group alone, since it will receive edges of each type in an independent fashion. In other words, the group intersections are expected to be strictly denser than the nonoverlapping portions of each group. Note that, in this respect, this model differs from other popular ones, such as the mixed membership stochastic block model (MMSBM) Airoldi et al. (2008), where the density at the intersections is the weighted average of the groups (see Appendix B.1).

II.2 Overlapping model with degree correction

In the preceding model, nodes that belong to the same group mixture receive, on average, the same number of connections. This means that the group membership is the only factor regulating the propensity of a given node to receive links. An alternative possibility, formulated in Ref. Karrer and Newman (2011), is to consider that the nodes have individual propensities to connect themselves, which are not necessarily correlated with their group memberships. Therefore, in this “degree-corrected” model, nodes of the same group are allowed to possess very different degrees. It has been demonstrated in Ref. Karrer and Newman (2011) that this model yields more intuitive partitions for many empirical networks, suggesting that these intrinsic propensities may be a better model for these systems. In an analogous manner, a multiple membership version of the stochastic block model with degree correction can be defined. This can be achieved simply by specifying, in addition to the overlapping partition {b_i}, the number of half-edges incident on a given node i which belong to group r, i.e. the labeled degree k_i^r. The combined labeled degree of a node is denoted k_i = {k_i^r}. Given this labeled degree sequence, one can simply use the same edge count matrix e_rs as before to generate the graph. If we again make the assumption that the occurrence of parallel edges can be neglected, the total number of graphs fulfilling these constraints is approximately equal to the nonoverlapping ensemble where each set of half-edges incident on any given node that belongs to the same group r is considered as an individual node with degree k_i^r, for which the ensemble entropy is Peixoto (2012)

\mathcal{S}_c \simeq -E - \sum_{ir} \ln k_i^r! - \frac{1}{2}\sum_{rs} e_{rs} \ln\frac{e_{rs}}{e_r e_s},    (2)

where again the sparse limit with negligible parallel edges has been assumed, and e_r = Σ_s e_rs is the number of half-edges incident on group r. Similarly to the non-degree-corrected case, it is possible to remove the approximation that no parallel edges occur by using a “Poisson” version of the model, as is shown in Appendix B.2. Under this formulation, it can be shown that this model is equivalent to the one proposed by Ball et al Ball et al. (2011), although here we keep track of the individual labels on the half-edges as latent variables, instead of their probabilities.

Since we incorporate the labeled degree sequence as model parameters, nodes that belong to the same group can have arbitrary degrees. Furthermore, since the same applies to nodes that belong simultaneously to more than one group, the overlaps between groups are neither preferably dense nor sparse; it all depends on the labeled degrees k_i^r.
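The degree-corrected entropy of Eq. 2 admits an equally direct numerical reading. The sketch below simply evaluates that expression, assuming the edge counts and labeled degrees have already been tallied; it is illustrative only.

```python
import numpy as np
from scipy.special import gammaln

def dc_overlap_entropy(e, k):
    """Approximate S_c of Eq. (2):
    -E - sum_{i,r} ln(k_i^r!) - 1/2 sum_rs e_rs ln(e_rs / (e_r e_s)).

    e : (B, B) symmetric array of edge counts e_rs (diagonal counted twice)
    k : (N, B) array of labeled degrees k_i^r (zero where node i is not in group r)
    """
    E = e.sum() / 2
    er = e.sum(axis=1)                                   # half-edges per group, e_r
    with np.errstate(divide="ignore", invalid="ignore"):
        mix = np.where(e > 0, e * np.log(e / np.outer(er, er)), 0.0)
    return -E - gammaln(k + 1).sum() - 0.5 * mix.sum()
```

Consistency requires e_r = Σ_i k_i^r, which is a useful sanity check when assembling the inputs.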

III Model selection

As discussed previously, in order to perform model selection, it is necessary to include the information needed to describe the model parameters, in addition to the data. The parameters which need to be described are the overlapping partition {b_i}, the edge counts e_rs, and in the case of the degree-corrected model we also need to describe the labeled degree sequence {k_i^r}.

When choosing an encoding for the parameters (via a particular generative process) we need to avoid redundancy, and describe them as parsimoniously as possible, while at the same time averting biases by being noninformative. In the following, we systematically employ two-level Bayesian hierarchies, where discrete prior distributions are parametrized via generic counts, which are themselves sampled from uniform nonparametric hyperpriors.

III.1 Overlapping partition, {b_i}

In order to specify the partition {b_i}, we assume that all different mixtures are not necessarily equally likely, and furthermore the sizes of the mixtures are also not a priori assumed to follow any specific distribution. More specifically, we consider the mixtures to be the outcome of a generative process with two steps. We first generate the local mixture sizes d_i (the number of groups to which node i belongs) from a nonparametric distribution. Then, given the mixture sizes, we generate the actual mixtures for each corresponding subset of the nodes, again using a nonparametric distribution, conditioned on the mixture size.

The mixture sizes are sampled uniformly from the distribution with fixed counts {n_d}, where n_d is the number of nodes with a mixture of size d, with a likelihood

P(\{d_i\} \mid \{n_d\}) = \frac{\prod_d n_d!}{N!}.    (3)

For the counts we assume a flat prior P({n_d}) = C(N+D−1, N)^{−1}, where D is the maximum value of d_i, and the denominator counts the total number of different choices of {n_d} with Σ_d n_d = N, i.e. the total number of N-combinations with repetitions from a set of size D.

Then, for all nodes with the same value of d, we sample a sequence of mixtures from a distribution with support on the mixtures of size d and with fixed counts {n_b}, where n_b is the number of nodes belonging to a specific mixture b of size d,

P(\{b_i\} \mid \{n_b\}) = \prod_d \frac{\prod_{|b|=d} n_b!}{n_d!}.    (4)

For the counts themselves, we again assume a flat prior P({n_b}) = Π_d C(C(B,d)+n_d−1, n_d)^{−1}, where the denominator enumerates the total number of count choices compatible with Σ_{|b|=d} n_b = n_d, given that there are C(B, d) possible mixtures of size d.

The full posterior for the overlapping partition then becomes

P(\{b_i\}) = P(\{d_i\} \mid \{n_d\})\, P(\{n_d\})\, P(\{b_i\} \mid \{n_b\})\, P(\{n_b\}),    (5)

which corresponds to a description length ℒ_b = −ln P({b_i}),

\mathcal{L}_b = \ln\binom{N+D-1}{N} + \sum_d \ln\binom{\binom{B}{d}+n_d-1}{n_d} + \ln N! - \sum_b \ln n_b!.    (6)

Although it is possible to encode the partition in different ways (e.g. by sampling the membership to each group independently Latouche et al. (2014)), this choice makes no assumptions regarding the types of overlaps that are more likely to occur, either according to the number of groups to which a node may belong, or the actual combination of groups — it is all left to be learned from the data. In particular, it is not a priori assumed that if many nodes belong to two specific groups then the overlap between these same groups will also contain many nodes. As desired, if the observed partition deviates from such a pattern, this will be used to compress it further. Only if the observed partition falls squarely into this pattern will further compression be impossible, and we would have an overhead describing it using Eq. 6, when compared to an encoding that expects it a priori. However, one can also see that in the limit N → ∞, as the first two terms in Eq. 6 grow only logarithmically with N, the whole description length becomes ℒ_b ≈ N H({b_i}), where H({b_i}) = −Σ_b (n_b/N) ln(n_b/N) is the entropy of the mixture distribution, which is the optimal limit. Hence if we have a prior that better matches the observed overlap, the difference in description length compared to Eq. 6 will disappear asymptotically for large systems. Another advantage of this encoding is that it incurs no overhead when there are no overlaps at all (i.e. d_i = 1 for every node), and in this case the description length is identical to the nonoverlapping case,

\mathcal{L}_b = \ln\binom{N+B-1}{N} + \ln N! - \sum_r \ln n_r!,    (7)

as defined in Ref. Peixoto (2014a).
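For concreteness, the partition description length of Eq. 6 can be evaluated with a few lines of code. The sketch below follows the expression as written above (node mixtures are given as frozensets, and all logarithms are natural); it is an illustration of the formula, not the paper's code.

```python
from collections import Counter
from math import comb
import numpy as np
from scipy.special import gammaln

def ln_binom(n, m):
    """Natural log of the binomial coefficient C(n, m)."""
    return gammaln(n + 1) - gammaln(m + 1) - gammaln(n - m + 1)

def partition_dl(mixtures, B):
    """Description length of an overlapping partition, as in Eq. (6).

    mixtures : list of frozensets, mixtures[i] = set of groups of node i
    B        : total number of groups
    """
    N = len(mixtures)
    d = np.array([len(m) for m in mixtures])            # mixture sizes d_i
    D = int(d.max())
    n_d = np.bincount(d, minlength=D + 1)[1:]           # counts n_d for d = 1..D
    n_b = Counter(mixtures)                             # counts n_b for each mixture
    L = ln_binom(N + D - 1, N)                          # flat prior for {n_d}
    for dd, nd in enumerate(n_d, start=1):              # flat priors for {n_b}
        L += ln_binom(comb(B, dd) + nd - 1, nd)
    L += gammaln(N + 1) - sum(gammaln(c + 1) for c in n_b.values())
    return L

# Example: three groups, with two nodes overlapping between groups 0 and 1.
mix = [frozenset({0})] * 5 + [frozenset({1})] * 5 + [frozenset({0, 1})] * 2
print(partition_dl(mix, B=3))
```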

III.2 Labeled degree sequence, {k_i^r}

For the degree-corrected model, we need to describe the labeled degree sequence {k_i^r}. We need to do so in a way which is compatible with the partition {b_i} described so far, and with the edge counts e_rs, which restrict the average degrees of each type.

In order to fully utilize the partition {b_i}, we describe for each distinct mixture b its individual degree sequence, via the counts of how many nodes with mixture b possess a given labeled degree. We do so in order to preserve the lack of preference for patterns involving the degrees in the overlaps between groups. Since the model itself is agnostic with respect to the density of the overlaps, not only does this choice remain consistent with this indifference, but also any existing pattern in the degree sequence in the overlaps will be used to construct a shorter description.

In addition, we must also consider the total number e_b^r of half-edges of a given type r incident on the nodes of a given mixture b (defined only for the groups r contained in b), which must be compatible with the edge counts via Σ_b e_b^r = e_r.

An overview of the generative process is as follows: We first consider the half-edges of each type r and the nonempty (n_b > 0) mixtures which contain the same group r. We then distribute the labeled half-edges among these mixtures, obtaining the total number of labeled half-edges e_b^r incident on each mixture. This placement constrains the average degree of each type inside each mixture. Finally, given {e_b^r}, we sample the actual labeled degree sequence on the nodes of each mixture.

We begin by distributing all e_r half-edges of type r among the bins corresponding to each nonempty mixture that contains the label r. If m_r is the number of such mixtures, the total number of such placements is simply C(e_r + m_r − 1, e_r), and hence the likelihood for {e_b^r} becomes

P(\{e_b^r\}) = \prod_r \binom{e_r + m_r - 1}{e_r}^{-1}.    (8)

Given {e_b^r}, we need to distribute the labeled half-edges inside each mixture to obtain each degree sequence. If we sample uniformly from all possible degree sequences fulfilling all necessary constraints, we have a likelihood for the degree sequence inside a mixture given by

P(\{k_i^r\}_{i\in b}) = \prod_{r \in b} \Xi(n_b, e_b^r)^{-1},    (9)

where Ξ(n, m) is the total number of (unlabeled) degree sequences with a total of m half-edges incident on n nodes. The corresponding description length would then be

\mathcal{L}_k = \sum_b \sum_{r \in b} \ln \Xi(n_b, e_b^r).    (10)

However, most degree sequences sampled this way will result in nodes with very similar degrees. Since we want to profit from degree variability, it is better to condition the description on the degree counts n_k^b, i.e. how many nodes with mixture b possess labeled degree k = {k^r}. This alternative distribution is given by

P(\{k_i^r\}_{i\in b} \mid \{n_k^b\}) = \frac{\prod_k n_k^b!}{n_b!}.    (11)

For the degree counts themselves, we choose a uniform prior P({n_k^b}), where the normalization is the enumeration of all possible counts that fulfill the constraints Σ_k n_k^b = n_b and Σ_k k^r n_k^b = e_b^r. Unfortunately, this enumeration cannot be done easily in closed form. However, the maximum entropy ensemble where these constraints are enforced only on average is analytically tractable, and as we show in Appendix C, it can be well approximated in closed form, as given by

(12)

where ζ(x) denotes the Riemann zeta function. The alternative description length therefore becomes

(13)

This approximation with “soft” constraints should become asymptotically exact as the number of nodes becomes large, but otherwise will deviate from the actual entropy. On the other hand, if the number of nodes is very small, describing the degree sequence via Eq. 13 may not provide a shorter description, even if computed exactly. In this situation, Eq. 10 may actually provide a shorter description of the degree sequence. We therefore compute both Eq. 10 and Eq. 13 and choose whichever is shorter. Putting it all together, the complete posterior for the whole labeled degree sequence is

P(\{k_i^r\}) = P(\{e_b^r\}) \prod_b P_b,    (14)

with P_b being the largest choice between Eq. 9 and the combination of Eq. 11 with its prior, i.e. whichever yields the shorter description for the degree sequence inside mixture b. Therefore, the description length for the labeled degree sequence becomes

\mathcal{L}_k = -\ln P(\{k_i^r\}).    (15)

In the limit of large mixtures, the conditioned description approaches the entropy of the degree distribution inside each mixture, and hence the degree sequences in each mixture are described close to the optimal limit.

For the nonoverlapping case with d_i = 1 for every node, the description length simplifies to

(16)

with

(17)
(18)

For large groups this recovers the simpler asymptotic form that was used a priori in Ref. Peixoto (2014a); Eq. 16, however, is a more complete description length of the nonoverlapping degree sequence, and its use should be preferred. Hence, like the description length of the overlapping partition, the encoding above incurs no overhead when the partition is nonoverlapping.

III.3 Edge counts, e_rs

The final piece that needs to be described is the matrix of edge counts e_rs. We may view this set as the adjacency matrix of a multigraph with B nodes and E edges. The total number of such matrices is C(B(B+1)/2 + E − 1, E), and if we assume that they are all equally likely, the logarithm of this number can be used as the description length Peixoto (2013). There are, however, two problems with this approach. First, this uniform distribution is unlikely to be valid, since most observed networks still possess structure at the block level. Second, this assumption leads to a limit in the detection of small groups, with a maximum detectable number of groups scaling as √N Peixoto (2013). Similarly to what we did for the node partition and the degree sequence, this can be solved by considering a generative model for the edge counts themselves, with its own set of hyperparameters. Since they correspond to a multigraph, a natural choice is the stochastic block model itself, which has its own set of edge counts, which can themselves be modeled by another stochastic block model with fewer nodes, and so on, recursively, until one reaches a model with only one node and one group at the top. This nested stochastic block model was proposed in Ref. Peixoto (2014a), where it has been shown to greatly reduce the resolution limit, making it often significantly less relevant in practice. Furthermore, since the number of levels and the topology at each level are obtained by minimizing the overall description length, it corresponds to a fully nonparametric way of inferring the multilevel structure of networks. As shown in Ref. Peixoto (2014a), if we denote the observed network to be at the level l = 0 of the hierarchy, then the total description length is

\Sigma = \mathcal{S}_{l=0} + \sum_{l=1}^{L} \mathcal{L}_l,    (19)

with ℒ_l = S_l + ℒ_b^{(l)} describing the block model at level l, where

\mathcal{S}_l = \sum_{r<s} \ln\binom{n_r^l n_s^l + e_{rs}^l - 1}{e_{rs}^l} + \sum_r \ln\binom{n_r^l(n_r^l+1)/2 + e_{rr}^l/2 - 1}{e_{rr}^l/2}    (20)

is the entropy of the corresponding multigraph ensemble and

\mathcal{L}_b^{(l)} = \ln\binom{B_{l-1} + B_l - 1}{B_{l-1}} + \ln B_{l-1}! - \sum_r \ln n_r^l!    (21)

is the description length of the node partition at level l, analogous to Eq. 7 with the B_{l-1} groups of the level below playing the role of the nodes. For the level l = 1 we have the partition description length given by Eq. 6, or by Eqs. 6 and 15 for the degree-corrected model.

Note that here we use the single-membership non-degree-corrected model in the upper layers. This method could be modified to include arbitrary mixtures of degree correction and multiple membership, but we stick with this formulation for simplicity.
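A direct numerical reading of Eq. 20, as written above, is sketched below; it assumes the group sizes and edge counts of a single hierarchy level have already been tallied, and is not the paper's inference code.

```python
import numpy as np
from scipy.special import gammaln

def ln_binom(n, m):
    """Natural log of the binomial coefficient C(n, m)."""
    return gammaln(n + 1) - gammaln(m + 1) - gammaln(n - m + 1)

def multigraph_entropy(e, n):
    """Log-number of multigraphs compatible with the edge counts, as in Eq. (20).

    e : (B, B) symmetric integer edge-count matrix at a given level (diagonal counted twice)
    n : length-B sequence of group sizes at that level
    """
    B = len(n)
    S = 0.0
    for r in range(B):
        S += ln_binom(n[r] * (n[r] + 1) // 2 + e[r, r] // 2 - 1, e[r, r] // 2)  # within-group term
        for s in range(r + 1, B):
            S += ln_binom(n[r] * n[s] + e[r, s] - 1, e[r, s])                   # between-group term
    return S
```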

III.4 Significance levels

By minimizing the description length Σ, we select the model that is most favored given the evidence in the data. But in some situations, one is not merely interested in a binary answer regarding which of two model choices is best; instead, one would like to be able to rule out alternative models with some degree of confidence. In this case, a level of significance can be obtained by performing a Bayesian hypothesis test based on the ratio of posterior likelihoods. In this context, there are different hypotheses which can be tested. For instance, one could ask whether the entire class of non-degree-corrected overlapping models (NDCO) is favored in comparison to the class of nonoverlapping degree-corrected models (DC). This can be done by computing the posterior distribution of each model class 𝓗,

P(\mathcal{H} \mid G) = \frac{\sum_\theta P(G \mid \theta, \mathcal{H})\, P(\theta \mid \mathcal{H})\, P(\mathcal{H})}{P(G)},    (22)

where θ is shorthand for the entire set of model parameters [i.e. the overlapping partition and edge counts for NDCO, and the nonoverlapping partition, edge counts, and degree sequence for DC], with P(𝓗) being the prior belief we have supporting a given hypothesis, and P(G) is a normalizing constant. The standard way in Bayesian statistics to evaluate the relative evidence supporting (or rejecting) hypothesis 𝓗_1 over 𝓗_2 is via the posterior odds ratio Jaynes (2003)

\Lambda = \frac{P(\mathcal{H}_1 \mid G)}{P(\mathcal{H}_2 \mid G)}.    (23)

However, there are two issues with this approach. First, computing the sum over all parameter choices is intractable in this context, since it involves summing over all possible overlapping or nonoverlapping partitions. Second, and more importantly, this might not be the most relevant answer. If one obtains two model parametrizations by minimizing the description length as described in the previous section, with the two results belonging to different model classes, one would be more interested in selecting or rejecting between these two particular choices, not necessarily the overall classes to which they belong. Although the description length itself already provides a means to select the best alternative, one would be interested in obtaining a confidence level for this particular decision. This is a different sort of hypothesis test than the one above, but it can be performed analogously. Since the result of the minimization of the description length is the (possibly overlapping) partition {b_i} of the network, our hypothesis is a combination of the model class 𝓒 which we were using and the particular partition that was found. The posterior probability attributed to this hypothesis is therefore

P(\{b_i\}, \mathcal{C} \mid G) = \frac{P(G, \{b_i\} \mid \mathcal{C})\, P(\mathcal{C})}{P(G)},    (24)

where again P(G) is a normalization constant. The marginal likelihood P(G, {b_i} | 𝓒) is obtained by summing over the remaining model parameters. In the case of the overlapping degree-corrected model they are the matrix e_rs and the labeled degree sequence {k_i^r} (the latter being omitted for the non-degree-corrected model),

P(G, \{b_i\} \mid \mathcal{C}) = \sum_{\{e_{rs}\}, \{k_i^r\}} P(G, \{b_i\}, \{e_{rs}\}, \{k_i^r\} \mid \mathcal{C}) = \mathrm{e}^{-\Sigma},    (25)

where the sum trivially contains only one term, since for the same graph G and partition {b_i}, there is only one possible choice for the matrix e_rs and degree sequence {k_i^r} with nonzero probability, which is a convenient feature of the microcanonical model formulation considered here [the same holds for the non-degree-corrected model, with the degree sequence omitted]. Now if we want to compare two competing partitions {b_i}_1 and {b_i}_2, possibly obtained with different model classes 𝓒_1 and 𝓒_2, this can be done again via the posterior odds ratio Λ,

\Lambda = \frac{P(\{b_i\}_1, \mathcal{C}_1 \mid G)}{P(\{b_i\}_2, \mathcal{C}_2 \mid G)}    (26)
        = \frac{P(G, \{b_i\}_1 \mid \mathcal{C}_1)\, P(\mathcal{C}_1)}{P(G, \{b_i\}_2 \mid \mathcal{C}_2)\, P(\mathcal{C}_2)}    (27)
        = \mathrm{e}^{-\Delta\Sigma},    (28)

with ΔΣ = Σ_1 − Σ_2 being the difference in the description length, and in Eq. 28 it was assumed that P(𝓒_1) = P(𝓒_2), corresponding to a lack of a priori preference for either model variant (which, in fact, makes Λ identical to the Bayes factor Jeffreys (1998)). This is a simple result, which enables us to use the difference in the description length directly in the computation of confidence levels. Being a ratio of probabilities, the value of Λ has a straightforward interpretation: For a value of Λ ≅ 1, both models explain the data equally well, and for values of Λ < 1 the model in the numerator is rejected in favor of the one in the denominator, with a confidence increasing as Λ diminishes. In order to simplify its interpretation, the values of Λ are usually divided into regions corresponding to a subjective assessment of the evidence strength. A common classification is as follows Jeffreys (1998): Values of Λ in the intervals [10^{-1/2}, 1], [10^{-1}, 10^{-1/2}], [10^{-3/2}, 10^{-1}], [10^{-2}, 10^{-3/2}] and [0, 10^{-2}] are considered to be very weak, substantial, strong, very strong and decisive evidence, respectively, in favor of the model in the denominator. In the following, when comparing different models, we will always put the preferred model in the denominator of Eq. 27, such that Λ ≤ 1.
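The following sketch turns a pair of description lengths into the posterior odds ratio of Eq. 28 and the verbal scale above; the numerical description lengths in the example are made up purely for illustration.

```python
import math

def posterior_odds(sigma_a, sigma_b):
    """Posterior odds ratio of Eq. (28): Lambda = exp(-(Sigma_a - Sigma_b)),
    assuming equal prior preference for both model variants."""
    return math.exp(-(sigma_a - sigma_b))

def evidence_strength(lam):
    """Verbal scale for Lambda <= 1 (preferred model in the denominator)."""
    if lam > 10**-0.5:
        return "very weak"
    if lam > 10**-1:
        return "substantial"
    if lam > 10**-1.5:
        return "strong"
    if lam > 10**-2:
        return "very strong"
    return "decisive"

# Example with made-up description lengths (in nats) of two fits of the same network.
sigma_overlap, sigma_nonoverlap = 10250.0, 10238.5
lam = posterior_odds(sigma_overlap, sigma_nonoverlap)
print(lam, evidence_strength(lam))   # Lambda << 1: overlap rejected with decisive evidence
```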

Figure 1: Left: Values of the posterior odds ratio Λ for the network of co-appearances of characters in the novel “Les Misérables”, for all model variations (overlapping models are indicated as such; “DC” denotes a degree-corrected model and “NDC” a non-degree-corrected one). The models with the best and second-best fits are shown at the bottom. Right: Same as on the left, but for the American college football network.

Using the posterior odds ratio is more practical than some alternative model selection approaches, such as likelihood ratios. As has been recently shown Yan et al. (2014), the likelihood distribution for the stochastic block model does not follow a χ² distribution asymptotically for sparse networks, and hence the calculation of a p-value must be done via an empirical computation of the likelihood distribution, which is computationally costly, and prohibitively so for very large networks. In contrast, computing Λ can be done easily, and it properly accounts for the increased complexity of models with more parameters, protecting against overfitting. However, it should be emphasized that these different model selection approaches are designed to answer similar, but not identical questions. Therefore the most appropriate method should be the one that more closely matches the questions raised.


Figure 2: The network of political blogs by Adamic et al Adamic and Glance (2005). The left panel shows the best model with an overlapping partition, and the right shows the best nonoverlapping one. Nodes with a blue halo belong to the Republican faction, as determined in Ref. Adamic and Glance (2005). For the visualization, the hierarchical edge-bundles algorithm Holten (2006) was used.

IV Empirical networks


Figure 3: Ego network of Facebook contacts Mcauley and Leskovec (2014). Left: The best model fit across all model variations, which puts the ego node in its own group. Right: The alternative hypothesis where the ego node is split among several groups. Below each network are shown the degree distributions inside each group. The arrow marks the degree of the ego node.

The method outlined in the previous section allows one to determine the best model from the various available choices. Here we analyze some empirical examples, determine the most appropriate model, and examine the consequences of the balance struck between model complexity and quality of fit. We start with two small networks, the co-appearance of characters in the Victor Hugo novel “Les Misérables” Knuth (1993), and a network of American college football games Girvan and Newman (2002); Evans (2012). For both networks, we obtain the best partition according to all model variations and for different numbers of groups B, and we compute the value of Λ relative to the best model, as shown in Fig. 1. For the “Les Misérables” network, the best fit is a non-degree-corrected overlapping model that puts the most central characters in more than one group. All other partitions, for different values of B and model types, result in values of Λ significantly below the plausibility line, indicating that the overlapping model offers a better explanation for the data with a large degree of confidence. In particular, it offers a better description than the nonoverlapping model with degree correction. For the football network, on the other hand, the preferred model is nonoverlapping and without degree correction, with a number of groups that matches very well the assumed correct partition into conferences. The groups are relatively homogeneous, with most nodes having similar degrees, such that degree correction becomes an extra burden, with very little added explanatory power. For this network, however, there are alternative fits with values of Λ within the plausibility region, which means that the communities are not very strongly defined, and they admit alternative partitions with slightly different numbers of groups which cannot be confidently discarded given the evidence in the data.

Degree correction tends to become a better choice for larger data sets, which display stronger degree variability. One example of this is the network of political blogs obtained in Ref. Adamic and Glance (2005). For this network, the best model is a degree-corrected, overlapping partition, shown in Fig. 2. Compared to this partition, the best alternative model without overlap (in Ref. Peixoto (2014a), using the same nonoverlapping model, a slightly different number of groups was found; this is due to the difference in the description length for the degree sequence, where here we use a more complete estimation, which results in this slight difference) has a posterior odds ratio significantly below the plausibility region. It should be observed that the nonoverlapping version captures well the segregation into two groups (Republicans and Democrats) at the topmost level of the hierarchy. The overlapping version, on the other hand, tends to classify half-edges belonging to different camps into different groups, which is compatible with the accepted division, but the upper layers of the hierarchy do not reflect this, and instead merge together groups that belong to different factions but that have otherwise similar roles in the topology.

Overlapping partitions, however, do not always provide better descriptions, even in situations where they might be considered more intuitive. One of the contexts where overlapping communities are often considered to be better explanations is in social networks, where different social circles could be represented as different groups (e.g. family, co-workers, friends, etc.), and one could belong to more than one of these groups. This is illustrated well by so-called “ego networks,” where one examines only the immediate neighbors of a node, and their mutual connections. One such network, extracted from the Facebook online social network Mcauley and Leskovec (2014), is shown in Fig. 3. The common interpretation of networks such as these is shown on the right in Fig. 3, and corresponds to a partition where the central “ego” node belongs to all of the different circles. Under this interpretation, the ego node is only special in the sense that it belongs to all groups, but inside each group it is just a common member. However, among all model variants, the best fit turns out to be the one where the ego node is put separately in its own group, as shown on the left in Fig. 3. In this example it is easy to see why this is the case: If we observe the degree distribution inside each group for the network on the left, we see that there is no strong degree variation. On the right, as the ego is included in each group, it becomes systematically the most connected node. This is simply by construction, since the ego must connect to every other node. The only situation where the ego would not stand out inside each group would be if the communities were cliques. Hence, since the ego is not a typical member of any group, it is simpler to classify it separately in its own group, which is selected by the method as being a more plausible hypothesis. Note that degree correction is not selected as the most plausible solution, since it is burdened with the individual description of every degree in the network, which is fairly uniform with the exception of the ego. One can imagine a different situation where there would be other very well connected nodes inside each group, so that the ego could be described as a common member of each group, but this is not observed in any other network obtained in Ref. Mcauley and Leskovec (2014). Naturally, if one considers the complete network, of which the ego neighborhood is only a small part, the situation may change, since there may be members of each group to which the ego does not have a direct connection.

Karate Club Zachary (1977)
Dolphins Lusseau et al. (2003)
Les Misérables Knuth (1993)
Political Books (V. Krebs, retrieved from http://www-personal.umich.edu/~mejn/netdata/)
American football Girvan and Newman (2002); Evans (2012)
C. elegans Neurons Watts and Strogatz (1998) (directed)
Coauthorships in network science Newman (2006b)
Disease Genes Goh et al. (2007)
Yeast protein interactions (CCSB-YI11) Yu et al. (2008)
Political Blogs Adamic and Glance (2005) (directed)
Yeast protein interactions (LC) Reguly et al. (2006)
Yeast protein interactions (Combined AP/MS) Collins et al. (2007)
E. coli gene regulation Salgado et al. (2013) (directed)
Yeast protein interactions (Y2H union) Yu et al. (2008)
Facebook egos Mcauley and Leskovec (2014)
Power Grid Watts and Strogatz (1998)
Airport routes (retrieved from http://openflights.org/) (directed)
Airport routes
Wikipedia Votes Leskovec et al. (2010b, c) (directed)
Human protein interactions (HPRD r9) Prasad et al. (2009)
arXiv Co-Authors (gr-qc) Leskovec et al. (2007)
Enron emails Leskovec et al. (2008); Klimt and Yang (2004)
PGP Richters and Peixoto (2011) (directed)
Internet AS (Caida, retrieved from http://www.caida.org) (directed)
Brightkite social network Cho et al. (2011)
netflix-pruned-smaller-u
arXiv Co-Authors (hep-th) Leskovec et al. (2007)
Epinions.com trust network Richardson et al. (2003) (directed)
arXiv Co-Authors (hep-ph) Leskovec et al. (2007)
arXiv Co-Authors (cond-mat) Leskovec et al. (2007)
arXiv Co-Authors (astro-ph) Leskovec et al. (2007)
Gowalla social network Cho et al. (2011)
EU email Leskovec et al. (2007) (directed)
Flickr McAuley and Leskovec (2012)
Web graph of stanford.edu Leskovec et al. (2009) (directed)
DBLP collaboration Yang and Leskovec (2012a)
Web graph of nd.edu Leskovec et al. (2009) (directed)
WWW Albert et al. (1999) (directed)
Amazon product network Yang and Leskovec (2012a)
IMDB film-actor (retrieved from http://www.imdb.com/interfaces) Peixoto (2013)
APS citations (retrieved from http://publish.aps.org/dataset) (directed)
Youtube social network Yang and Leskovec (2012a)
Table 1: Comparison of different models for many empirical networks. The columns of the top table correspond to the dataset number (with the name given in the bottom table), the number of nodes N, the average degree ⟨k⟩, and the posterior odds ratios Λ relative to the best model for the degree-corrected overlapping (