Growing networks of overlapping communities with internal structure
Abstract
We introduce an intuitive model that describes both the emergence of community structure and the evolution of the internal structure of communities in growing social networks. The model comprises two complementary mechanisms: One mechanism accounts for the evolution of the internal link structure of a single community, and the second mechanism coordinates the growth of multiple overlapping communities. The first mechanism is based on the assumption that each node establishes links with its neighbors and introduces new nodes to the community at different rates. We demonstrate that this simple mechanism gives rise to an effective maximal degree within communities. This observation is related to the anthropological theory known as Dunbar’s number, i.e., the empirical observation of a maximal number of ties which an average individual can sustain within its social groups. The second mechanism is based on a recently proposed generalization of preferential attachment to community structure, appropriately called structural preferential attachment (SPA). The combination of these two mechanisms into a single model (SPA+) allows us to reproduce a number of the global statistics of real networks: The distribution of community sizes, of node memberships and of degrees. The SPA+ model also predicts (a) three qualitative regimes for the degree distribution within overlapping communities and (b) strong correlations between the number of communities to which a node belongs and its number of connections within each community. We present empirical evidence that supports our findings in real complex networks.
I Introduction
Networks are at the center of the quantitative analysis of social systems Wasserman (1994). They encode the social ties among different individuals within a mathematical construct that allows a quantitative assessment of the role of individuals in social networks through various measures, and the analysis of correlations among them Wasserman (1994); Newman (2010). One instance of these correlations, the similarity between the neighborhoods of different nodes (individuals), has received particular attention since links tend to be clustered in tightly connected groups Watts and Strogatz (1998); Girvan and Newman (2002). Networks are often expressed as a superposition of such densely connected groups, and we refer to this decomposition as the community structure of a network Fortunato (2010); Xie et al. (2013).
We consider the problem of modeling both the emergence of community structure in social networks and the growth of the internal structure of these communities. Many community detection algorithms and community modeling efforts consider a fully random, or Erdős-Rényi (ER), internal structure Peixoto (2012); Clauset et al. (2008); Guimerà and Sales-Pardo (2009); Seshadhri et al. (2012); Yang and Leskovec (2012). This is a principled approach, in the sense that it relies on minimal a priori information, but it is unfortunately incompatible with most common growth processes in two respects. One, it ignores the temporal aspect of community growth Hébert-Dufresne et al. (2016). Two, it ignores the fact that nodes can have very heterogeneous structural roles in complex networks Barabási and Albert (1999).
The preferential attachment mechanism (PA) Simon (1955); de Solla Price (1976); Barabási and Albert (1999) offers a simple way to include the temporal and heterogeneous aspects of complex networks in growth processes. PA is based on the assumption that a node’s current state is a good indicator of its future behavior. We take inspiration from the PA model Barabási and Albert (1999) and its recent extension to community structure Hébert-Dufresne et al. (2011, 2012). We combine heterogeneous PA at the level of communities with minimal a priori information for the internal structure of communities. That is, we postulate simple rules for the growth of the internal structure of communities. In so doing, we provide a new growth process that reproduces a number of important properties of overlapping community structures and complex networks.
The structure of the paper is as follows. In Sec. II, we describe a process by which a single community and its structure may grow. We find an upper bound on how many connections an average individual may maintain as the community grows. This finding is discussed in relation to the anthropological theory known as Dunbar’s number. In Sec. III, we incorporate this internal community growth process within a preferential attachment model at the community structure level and provide a recipe for its implementation. This yields a general model for the concurrent growth of overlapping and heterogeneous communities. In Sec. IV, we compare our model to empirical data and investigate its implications. We find that our model generates networks whose global statistics are comparable to those of real networks, and that their internal community structure contains correlations also present in empirical datasets. We close with a short conclusion in Sec. V, and relegate some of the technical details to two Appendices.
II Growth of a single community
In this first section, we introduce a simple model that describes the growth of a single community, independently of the rest of the network. The model builds on the recent observation that the rate of growth of a community is predicted by preferential attachment Hébert-Dufresne et al. (2011, 2012, 2015). This hypothesis is known to reproduce some of the statistical properties of the community structure of real networks Hébert-Dufresne et al. (2011, 2012). It can be interpreted as if each node in a community introduces new nodes at a fixed rate: The more nodes, the faster the community grows with respect to other competing communities in the same network. In what follows, we combine this node creation mechanism with an elementary link creation mechanism, and obtain a reasonable model for the growth of a single community.
II.1 Description of the model and mean-field analysis
We model the growth of a single community with a continuous-time Markov process. The model is simply stated. A community is initially represented by a small graph, e.g. a triad or a single node. Each of these nodes recruits new nodes at a constant rate ; at time , the growth rate of the community is therefore proportional to , where is the size of the community. Whenever a new node is recruited, it is at first only connected to the node that recruited it (its degree , i.e. number of neighbors, therefore equals 1 within the community). To allow for denser communities, we introduce another mechanism whereby each node initiates the creation of an undirected link at a constant rate (unless it is already connected to every node). A second node is randomly selected to complete the link (note that we exclude self-loops and multiple links).
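The growth rules above can be sketched as an event-driven (Gillespie-type) simulation. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `recruit_rate`, `link_rate`, and `grow_community` are hypothetical, the community starts from a single node, and the second endpoint of a new link is drawn uniformly among the members not yet connected to the initiating node.

```python
import random


def grow_community(recruit_rate, link_rate, max_size, seed=None):
    """Simulate one community until it contains max_size nodes.

    Returns (adjacency, elapsed_time), where adjacency[v] is the set of
    neighbors of node v. All parameter names are illustrative assumptions.
    """
    rng = random.Random(seed)
    adjacency = [set()]  # assumed initial condition: a single isolated node
    elapsed = 0.0
    while len(adjacency) < max_size:
        n = len(adjacency)
        # Only nodes not yet connected to everyone can initiate a link.
        active = [v for v in range(n) if len(adjacency[v]) < n - 1]
        total_rate = recruit_rate * n + link_rate * len(active)
        # Waiting time until the next event is exponentially distributed.
        elapsed += rng.expovariate(total_rate)
        if rng.random() < recruit_rate * n / total_rate:
            # Recruitment: a uniformly chosen member brings in a new node,
            # which is at first connected only to its recruiter.
            recruiter = rng.randrange(n)
            adjacency.append({recruiter})
            adjacency[recruiter].add(n)
        else:
            # Link creation: an active node links to a random non-neighbor,
            # which excludes self-loops and multiple links by construction.
            source = rng.choice(active)
            candidates = [v for v in range(n)
                          if v != source and v not in adjacency[source]]
            target = rng.choice(candidates)
            adjacency[source].add(target)
            adjacency[target].add(source)
    return adjacency, elapsed
```

For example, calling `grow_community(1.0, 3.0, 200, seed=1)` and then computing `sum(len(s) for s in adjacency) / len(adjacency)` gives the mean internal degree of one realization. The list scans make each event cost linear time in the community size, which is acceptable for a sketch but not for large-scale runs.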
The average number of nodes with degree within an average community of size can be followed through continuous time with the interdependent set of rate equations
(1a)  
where is the integer part of , where is the number of nodes that can initiate link creation events, and where is the total number of potential links. The first term accounts for the arrival of new nodes: Each node recruits at rate and gains new connections accordingly. This creates a flow that brings a node of degree to degree [positive effect on ] and a node of degree to degree [negative effect]. The second term is due to the creation of new links: Each node initiates the creation of a new link at rate , and the net effect on is identical to that of the node creation mechanism. The third term accounts for the increase in degree incurred by a node randomly selected to complete new links. Events of this type occur at rate and affect nodes of degree with probability .
Equation (1) is only valid when for two reasons. One, nodes of degrees cannot initiate or receive new links. Two, node creation only involves nodes of degree . Another set of rate equations is therefore needed to handle the limit cases. We find  
(1b)  
(1c)  
(1d)  
(1e) 
Note that the set of Eqs. (1) becomes inconsistent when , since we obtain different equations for the same compartment . Fortunately, we do not need Eqs. (1) to track the evolution of the community when , because this evolution is deterministic. A community that contains a single node must first grow: The only node is already of maximal degree, and link creation events never occur. The same reasoning applies to the case of . Therefore, whenever , we can instead use the initial condition , i.e. track the community starting from the point where randomness plays a role. If temporal information is important, then one can compute the expected amount of time spent in configurations of sizes , and correct the prediction a posteriori (the delay is an exponentially distributed random variable).
Summing Eqs. (1)–(1e), one finds
(2) 
This last equation, together with the observation that allows us to describe the system in terms of the average community size rather than as a function of time. We find
(3) 
and limit cases similar to the expressions listed in Eqs. (1). This formulation has the added benefit of highlighting the dependence on the relative ratio of events . We validate Eq. (3) in Fig. 1, where we show that the numerical solutions of this system of differential equations capture the important features of the growth dynamics (a Python implementation of the integrator is available online at https://github.com/spanetworks/spa). Agreement is, however, not perfect. Discrepancies between simulations and the solutions of Eq. (3) can be traced back to the continuous approximation involved in writing differential mean-field equations for discrete quantities, as well as the absence of structural correlation in this type of model. The net effect is a shift of the prediction toward higher degrees for the bulk of the distribution.
Figure 1 shows that small and medium communities are highly homogeneous, while the degree distributions in larger communities are heavily skewed. This heterogeneity arises from the history of the community; the few nodes that join early, when growth is slower, can create more links than the many nodes that join the community as growth accelerates. The separation into three regimes holds for arbitrary values of , with the transition from homogeneous to heterogeneous degree distributions occurring at higher community sizes for larger values of (see the scaling arguments in Appendix A).
II.2 Approximate average degree
A simpler point of view can be adopted to gain further insights into the relation between the average degree of a node and the size of its community [ is the number of links in the community at time ].
As previously stated, a node will not initiate the creation of new links if its degree equals (see Sec. II.1), while the rest of the nodes create new links at a rate . The total link creation rate is therefore given by
(4) 
where is the contribution of the node recruiting process, and where the second term merely states that only nodes of degree contribute to the creation of new links within the community at a rate .
If we assume a uniform and uncorrelated distribution of links among nodes, and define —the maximal number of links in a community of size —then , the probability that a randomly selected node is of maximal degree , can be approximated by
(5) 
Using Eqs. (2) and (5), we express the rate of change of as a function of the average size at time :
(6) 
While the actual link distribution is neither uniform nor uncorrelated in the model (see Fig. 1), we will see that our approximation is robust enough, and that Eq. (6) accurately reproduces the average degree (see Sec. II.3).
A simple analysis of Eq. (6) highlights an interesting feature of the model. For large sizes , the factor goes rapidly to zero, such that a maximal link creation rate
(7) 
is attained. Hence, the intensive quantity converges toward a constant that depends on the parametrization of the model alone. Considering that one link equals two stubs (or degree), the asymptotic average degree is directly related to the parameter through:
(8) 
This indicates a maximal average number of connections in a social group.
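The plateau can be made explicit with a short worked limit. The notation here is an assumption introduced only for illustration and may differ from the symbols of Eqs. (4)–(8): $a$ denotes the recruitment rate per node, $b$ the link creation rate per node, $n$ the community size, $L$ the number of links, and $\rho$ the probability that a randomly selected node has maximal degree. The sketch uses only the ingredients stated above, namely that each recruitment adds one link, that only non-saturated nodes initiate links, and that $\rho$ vanishes for large communities.

```latex
\begin{align}
\frac{dL}{dt} &= a\,n + b\,n\,(1-\rho), \qquad \frac{dn}{dt} = a\,n, \\
\frac{dL}{dn} &= 1 + \frac{b}{a}\,(1-\rho)
  \;\xrightarrow[\rho \to 0]{}\; 1 + \frac{b}{a}, \\
\langle k \rangle_{\infty} &= \lim_{n\to\infty} \frac{2L}{n}
  = 2\left(1 + \frac{b}{a}\right).
\end{align}
```

Under these assumptions the asymptotic average degree depends only on the ratio $b/a$ of the two rates, which is consistent with the statement that the plateau is set by the parametrization of the model alone.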
II.3 Relation with Dunbar’s number
The results shown in Fig. 2 highlight two different behaviors of the average number of links per individual in relation to the size of a social group. For low average sizes , the mean degree scales linearly with the community size . In other words, our model captures the fact that everybody knows everybody within small groups (e.g., family or close friends). At larger sizes , reaches the plateau given by Eq. (8). From this point onwards, an average individual will not gain new connections when the potential number of connections is increased. So, while there is no maximal community size per se, there is a maximal number of connections that an average individual might possess within a given group (e.g., large companies or online communities).
Interestingly, this upper bound on the average activity of an individual is related to an anthropological theory known as Dunbar’s number Dunbar (1992). This theory is based on the observed relation between neocortical size in primates and the average size of their social groups. Its interpretation usually involves information constraints related to the quality of interpersonal connections, and the ability of individuals to maintain such relationships. While the importance of neocortical sizes de Ruiter et al. (2011) and the generality of the results Shultz and Dunbar (2007) are both disputable, the fact remains that empirical evidence supports the existence of an upper bound in the absolute number of active relationships for an average individual, in a given activity (e.g., Ref. Gonçalves et al. (2011) for activities on Twitter). In fact, more recent work on social network sizes in humans focuses on the progressively higher bounds on average internal degree observed at different social levels or activities: e.g., neighbors, relatives, workplace, and friend circles Hill and Dunbar (2003); Dunbar (2012). These different social levels can be modeled as different communities around one individual (this is the subject of the next section).
In our model, this upper bound naturally emerges and is solely dependent on the parameter . This parameter can be interpreted as the ratio between the involvement of an individual in a community, in the sense of bonding with other members, and its contribution to the growth rate of the community. Note that we do not interpret the plateau as an absolute upper bound, but rather as a bound on the maximal number of connections that an average individual can maintain. For low (or large communities), the rate of change in the population is higher than an individual’s involvement such that the maximal degree stagnates, whereas for high (or small communities), the individual is able to follow the population changes and hence create relationships with most of its members. Different types of social organizations will feature different and, consequently, different values of “Dunbar’s number” (an online social network, where relationships are easily maintained, will entail higher values of than a coauthorship network for example): Different types of activities (networks) should also be modeled using different values of .
In this interpretation, the upper bound on the degree is due to the fact that connections and introduction of new members have linear requirements for individuals, but exponential consequences for the group. Other mathematical models describe Dunbar’s number (e.g., Ref. Gonçalves et al. (2011)), usually with arguments of priority and/or time and resource management Dunbar and Shultz (2007). However, our model is based on the observed structure of the communities of real networks and consequently, parsimoniously explains Dunbar’s number in terms of its two basic units, individuals and groups, and the ratio of their respective characteristic growth rates. The consequence of this result for the complete community structure of social networks is discussed in Sec. IV.4. Before that, we must move from a description of the evolution of a single community to a description of the evolution of a superposition of many communities.
III A growth model for networks with both inter- and intracommunity structure
The model of the previous section is concerned with the growth of an isolated community, such as a group of friends, a company, or a nascent research group. Most complex networks, however, comprise more than a single overlapping community Xie et al. (2013). To use the model of Sec. II on a larger scale, one therefore needs a mechanism to track multiple, concurrently growing, and overlapping communities. As we will see shortly, the structural preferential attachment (SPA) model of Refs. Hébert-Dufresne et al. (2011) and Hébert-Dufresne et al. (2012) is both a suitable and practical candidate.
In a nutshell, SPA builds on the popular idea that networks can be interpreted as the projections of abstract structures such as communities Hébert-Dufresne et al. (2015). The network is not modeled explicitly: instead, SPA generates an assignment of nodes to overlapping communities, and one instantiates a network based on the community assignments, e.g., by assuming that communities are ER graphs. SPA therefore lacks an explicit growth mechanism for links.
In what follows, we show how to use the community assignments of SPA jointly with the community growth process of Sec. II. Specifically, we construct a model in which the history of each community is described by the model of Sec. II, and the history of the community structure is described by SPA. In this growth model, both facets of the system (the internal structure and the community structure) evolve simultaneously. But before we introduce the coupled growth model (in Sec. III.3), we first review the key ideas behind SPA.
III.1 Structural preferential attachment
The essence of SPA can be summarized as follows Hébert-Dufresne et al. (2011). At every discrete time step, a growth event occurs. An event marks the birth of a new node with probability , and the creation of a new fully connected community of nodes, with probability . When an existing node or community is involved (with complementary probabilities and respectively), it is chosen preferentially with respect to its past activity: A node with memberships or a community of size is times more likely to be chosen than a node (or community) with membership (or node). This process ensures that both the membership and size distributions converge to a power-law distribution in the limit of large system sizes. The probability controls how interconnected communities are, the probability controls the distribution of community sizes, and the basic size allows one to enforce minimal connectivity in the full system. In SPA, links can only exist between nodes belonging to the same community, and a large-scale connectivity of the network is achieved through overlapping node assignments.
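The preferential choices described above can be sketched with two "urns": a node appears in the node urn once per membership, and a community appears in the community urn once per member, so that a uniform draw from an urn is a preferential draw. This is an illustrative sketch, not the reference implementation: the names `p_new_community`, `q_new_node`, `base_size`, and `spa_assignments` are hypothetical, the initial condition (a single community of new nodes) is an assumption, and a node may appear more than once in the same community.

```python
import random


def spa_assignments(p_new_community, q_new_node, base_size, steps, seed=None):
    """Return a list of communities, each a list of node labels.

    Parameter names and the initial condition are illustrative assumptions.
    """
    rng = random.Random(seed)
    communities = [list(range(base_size))]   # assumed seed community
    node_urn = list(range(base_size))        # one entry per membership
    community_urn = [0] * base_size          # one entry per community member
    n_nodes = base_size

    def pick_node():
        """New node with probability q_new_node, else preferential choice."""
        nonlocal n_nodes
        if rng.random() < q_new_node:
            n_nodes += 1
            return n_nodes - 1
        return rng.choice(node_urn)

    for _ in range(steps):
        if rng.random() < p_new_community:
            # Community birth: one node may be new, the others are existing
            # nodes drawn preferentially with respect to their memberships.
            members = [pick_node()]
            members += [rng.choice(node_urn) for _ in range(base_size - 1)]
            cid = len(communities)
            communities.append([])
        else:
            # Community growth: the community is drawn preferentially
            # with respect to its current size.
            cid = rng.choice(community_urn)
            members = [pick_node()]
        for node in members:
            communities[cid].append(node)
            node_urn.append(node)
            community_urn.append(cid)
    return communities
```

The urn representation trades memory for simplicity: each draw is constant time, and no explicit weights have to be maintained.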
We can write rate equations to follow the numbers of nodes belonging to groups and the number of groups with nodes. These equations are similar to most linear preferential attachment equations
(9)  
(10)  
and , can be shown to scale as power laws, i.e., and , with exponents Hébert-Dufresne et al. (2012)
(11)  
(12) 
Because the growth rules are time independent and since and are probabilities in the interval, the average number of memberships per node and average community size converge in time. Therefore, and are always and SPA is not expected to reproduce distributions whose asymptotic decay exponent is smaller than 2. The interested reader is directed to Refs. Hébert-Dufresne et al. (2011, 2012) for a complete derivation of these results.
III.2 Coupling a discrete and a continuous process
Recall that our goal is to couple the mechanism of Sec. II (hereafter the local model) and SPA. To do so, we must first determine the relation between the time scales of the local model and that of SPA, thereby allowing a concurrent simulation of both processes. This is not a simple matter, since one must reconcile the continuous nature of the local growth mechanism with the discrete nature of SPA.
In SPA, time is measured in number of events. Without loss of generality and for reasons that will become apparent shortly, let us define a rescaled discrete time scale in which a fraction of the time steps lead to SPA events, such that . The community structure does not change during the remaining time steps. Because a time step marks the birth of a new community (of size ) with probability , or the growth of an existing one with complementary probability , we can write the time dependent sum of the sizes of all communities as
(13) 
The average size of community in discrete time is then governed by a rate equation
(14) 
where we have defined . Equation (14) merely states that growth events affect community with probability (i.e. preferentially with respect to its size). In the limit of large , Eq. (14) is equivalent to
(15) 
III.3 The coupled growth model: SPA+
Equation (17) tells us how fast a community evolves in comparison with the community structure; we can use this information to formulate an algorithm that simulates both processes concurrently. We choose to describe the local link creation process of Sec. II in time . As such, the backbone of the algorithm will be the SPA process, to which we now must add details pertaining to the local model of Sec. II.
The first part of the local model (nodes are recruited at rate ) is easily accounted for: Whenever a node joins a new community, we simply choose a recruiting node uniformly among the current members of that community and form a new link. The exponential growth of communities in SPA ensures that this process is consistent with the model of Sec. II.
The second part of the local model (links are created at rate ) entails a more involved analysis. Let us define as the effective size of community , i.e., the number of nodes that are allowed to create links [number of nodes of degree links within community ]. Then, in the local model, the number of links in a community of effective size grows at a rate
(18) 
such that links are introduced in the community at the rate
(19) 
The purpose of the time transformation is then apparent: It can be adjusted to bound to the interval for all . Since is an arbitrary fraction which also lies in , we adopt the simplest choice, i.e.,
(20) 
Equation (19) can then be interpreted in two ways. Straightforwardly, we may say that at each time step of the SPA process, a new link is created between the existing members of a community of effective size with a probability given by the right-hand side of Eq. (19) for all . Alternatively, we may say that at each time step of the SPA process, a new link is created with probability in a community selected with a probability proportional to its effective size . Equation (20) ensures that this interpretation is always sensible. In both interpretations, if a link must be created, we choose two nodes of degree at random and connect them.
Note that the ratio is not normalized.
In the context of the second interpretation, this implies that at each time step , there is a probability that no link creation event will occur.
Alternatively, we may select the community in which the link creation event occurs proportionally to its actual size and connect two nodes chosen uniformly among all the nodes of that community.
The ratio will then be effectively respected if we consider that a link creation simply “fails” whenever the first randomly selected node has the maximal number of connections.
The above analysis yields a straightforward algorithm for the modified version of SPA (hereafter SPA+; a C++11 implementation of SPA+ is available online at https://github.com/spanetworks/spa). Starting with disjoint and fully connected communities of size , at each discrete time step :

a new community of size is created with probability or an existing one (chosen preferentially with respect to its size) grows with probability ;

if a community birth event occurs, one of the involved nodes is a new one with probability or an existing one (chosen preferentially with respect to its current number of memberships) with complementary probability . The other nodes are chosen preferentially with respect to their current number of memberships among existing nodes;

if a community growth event occurs, the involved node is a new one with probability or an existing one (chosen preferentially with respect to its current number of memberships) with probability . Once the node is added to the community, we randomly select another node in the community (uniformly) and create a link;


with probability , a new link is created in a community chosen preferentially with respect to its size. It connects a uniformly chosen node and a uniformly chosen potential neighbor, provided that the source node is not already connected to every node in the community.
If , link creation occurs on a slower time scale than community structure related events, whereas the converse is true if .
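The last step of the algorithm above, the link creation attempt, can be sketched as follows. This is a minimal sketch under stated assumptions rather than the published C++11 code: the function name `try_create_link`, the parameter `link_probability`, and the dictionary-based data layout are hypothetical, and each member list is assumed to contain every node at most once. An attempt "fails" silently when the uniformly chosen source node is already connected to every other member.

```python
import random


def try_create_link(members, links, link_probability, rng):
    """Perform at most one link creation for a single time step.

    members: dict mapping community id -> list of member nodes
    links:   dict mapping community id -> set of frozenset({u, v})
    Returns (community, source, target) or None if no link was created.
    All names and the data layout are illustrative assumptions.
    """
    if rng.random() >= link_probability:
        return None  # no link creation event at this time step
    # Choose a community with probability proportional to its actual size.
    ids = list(members)
    cid = rng.choices(ids, weights=[len(members[c]) for c in ids])[0]
    nodes = members[cid]
    source = rng.choice(nodes)
    # Potential neighbors: members not yet linked to the source.
    candidates = [v for v in nodes
                  if v != source and frozenset((source, v)) not in links[cid]]
    if not candidates:
        return None  # the source is saturated, so the attempt fails
    target = rng.choice(candidates)
    links[cid].add(frozenset((source, target)))
    return cid, source, target
```

Selecting by actual size and letting saturated sources fail avoids having to track the effective size of every community explicitly.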
III.4 Redundant memberships, multiple links and self-loops
In SPA, one assumes that a community grows on its own, and that new members are drawn from an infinite reservoir of indistinguishable nodes Hébert-Dufresne et al. (2012). In practice, the reservoir is finite and each node therein is tagged; when the system is small and the parameters and take extreme values ( for any , the worst case being and ), there is a significant probability that a node will appear more than once in a community. To respect the relative rates of all events and preserve the mean-field mapping of Sec. III.2, we consider that these duplicate nodes are effectively new. The implications of this observation for the community structure are discussed at length in Ref. Hébert-Dufresne et al. (2012). There are additional implications for the combined SPA+ model.
The fact that the same node can (and will) join the same community more than once implies that we will create parallel links and self-loops, because a node can become connected to copies of itself. Because these types of links are seldom considered in empirical datasets, we collapse the redundant memberships into a single membership at the end of the growth process, i.e., we merge nodes with all their duplicates within each community. This (a) skews the tail of the membership and size distributions and (b) removes multiple self-loops from the system. The net effect is that communities become denser on average. We note that these redundant memberships are known to account for a vanishingly small fraction of all memberships when the number of communities is large and the parameters are not too small Hébert-Dufresne et al. (2012). The consequences of redundant memberships should therefore subside in large networks.
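The collapsing step can be illustrated on a single community. In this sketch, which assumes a representation not specified in the text, each appearance of a node in the community occupies a "slot", and links are stored between slots; `collapse_community`, `slot_labels`, and `slot_links` are hypothetical names.

```python
def collapse_community(slot_labels, slot_links):
    """Merge duplicate memberships within one community.

    slot_labels[i] is the node occupying membership slot i (assumed layout);
    slot_links is an iterable of (i, j) pairs of slot indices.
    Returns (sorted unique nodes, set of simple undirected links).
    """
    nodes = sorted(set(slot_labels))
    links = set()
    for i, j in slot_links:
        u, v = slot_labels[i], slot_labels[j]
        if u != v:  # a link between two copies of one node is a self-loop
            # Storing ordered pairs in a set removes parallel links.
            links.add((min(u, v), max(u, v)))
    return nodes, links


# Node "a" occupies slots 0 and 2 of a four-slot community.
nodes, links = collapse_community(
    ["a", "b", "a", "c"], [(0, 1), (1, 2), (0, 2), (2, 3)]
)
```

In this small example the two links between "a" and "b" merge into one, and the link between the two copies of "a" disappears.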
IV Results and Discussion
The SPA model has previously been shown to capture many properties of the community structure of real networks Hébert-Dufresne et al. (2011, 2012), such as the distribution of community sizes, of node memberships, and of community degrees. We now investigate these properties anew by modeling three social networks: Two coauthorship networks obtained from the arXiv circa 2005 Palla et al. (2005) and from MathSciNet circa 2008 Palla et al. (2008), as well as the email exchange network of Enron Klimt and Yang (2004). We detect their community structure with five different algorithms: A link clustering algorithm Ahn et al. (2010) (LCA), a greedy clique expansion algorithm Lee et al. (2010) (GCE), the order statistics local optimization method Lancichinetti et al. (2011) (OSLOM), a greedy modularity optimization algorithm on line graphs Evans and Lambiotte (2009) (LG), and a modified version of the classical clique percolation algorithm Young et al. (2015); Palla et al. (2005) (CCPA). This provides us with a total of 15 systems, from which we have selected 5 representative examples: arXiv as described by both the CCPA and LCA, Enron as described by the GCE and OSLOM algorithms, and MathSciNet as described by the GCE algorithm. Note that three of the above algorithms (LCA, LG, CCPA) identify link partitions, while the other two directly find overlapping node communities. We translate link partitions into node communities to analyze every algorithm on a common basis, where the true community of a link is unknown.
We model a real network by estimating a value for the tuple of parameters . The details of the parameter estimation procedure are gathered in Appendix B. In a nutshell, we use the community structure of the real network to first estimate and (yielding and ). We then obtain an estimate of by fitting the model of Sec. II to the internal degree distributions of each community. The final number of nodes and the basic community size are both fixed by the empirical dataset. is trivially the number of nodes in the real network, and we select in all cases, because it leads to networks with more than one component, a feature of the empirical datasets listed above.
The SPA+ model is, in some sense, minimal. One parameter controls the amount of overlap (), one parameter controls the distribution of community sizes (), and one parameter controls the density of these communities ().
IV.1 Global statistics
In Fig. 3 we compare the statistical properties of SPA+ networks with their empirical counterparts. In this respect, the new contribution of the present study is the global degree distribution: SPA models the distribution of community sizes and node memberships, while the growth mechanism of Sec. II models the degree distribution within each community. The degree distribution of the network is an emergent property of the SPA+ model, since it is not modeled directly. It is necessarily fat tailed, because it arises from the convolution of two fat tailed distributions (memberships and sizes) Hébert-Dufresne et al. (2012). The parameter [and thus the local model of Sec. II] controls the speed of the decay of the degree distribution, through its effect on the relation between community size and average degree.
Figure 3 shows that SPA+ can reproduce the degree distribution of the real dataset, if the overlapping communities decomposition of the network is in line with our modeling hypotheses. That is, SPA+ can generate degree distributions with the correct shape only if the detected community structure is heterogeneous. By heterogeneous, we mean that the distributions of community sizes and node memberships are either power laws, or power laws with an exponential cutoff. As long as we consider such systems, we can fit both the size and membership distribution robustly Clauset et al. (2009); Hébert-Dufresne et al. (2012) [see Figs. 3(a)–3(c)]. Due to the nature of our model, the quality of the predicted degree distribution is inherently connected to the quality of the predicted size and membership distributions. SPA+ does poorly in two cases [see Figs. 3(d)–3(e)], and since the membership distributions are well represented in all cases studied, the culprits lie mainly with the size distributions. In Fig. 3(d), we diverge from the data at low community sizes and fail to account for an extremely large community (of size ). In Fig. 3(e), the empirical size distribution decays asymptotically slower than the behavior accessible to the model, i.e., [Eq. (11)]. We also note that the statistics of real datasets do contain kinks and bumps (real or spurious) that cannot be reproduced by simple growth models like SPA+, although the average behavior can be well captured [see Figs. 3(a) and 3(c)].
IV.2 Local statistics
In Sec. II, we have established that according to our model, the internal degree distributions of growing communities could display three different regimes: A highly homogeneous regime where every node is nearly of the maximal degree, a homogeneous regime where the bulk of the nodes has similar degrees, and a heterogeneous regime where the majority of the nodes have low degrees (while a few nodes are highly connected). These regimes can be observed in a number of real networks, once their community structure is uncovered by algorithms designed for the detection of overlapping communities. In Fig. 4, we present the three regimes in the arXiv coauthorship network, as detected by the CCPA algorithm. The figure illustrates two important facts. On the one hand, it puts the internal model of Sec. II on firmer empirical ground: it confirms that the evolution of the internal degree distribution of arXiv is captured by the model. On the other hand, it emphasizes that the internal degree distributions of the uncovered communities can be quite distinct from random Erdős-Rényi graphs, as is often implicitly assumed. This further supports the recent shift towards principled community detection algorithms which explicitly allow for arbitrary degree distributions within communities Peixoto (2012); Karrer and Newman (2011); Peixoto (2015).
The results of Fig. 4 must, however, be taken with some caution. We have not performed an exhaustive search, instead we have selected a network well reproduced by SPA+ (see Fig. 3), and have averaged the distribution not only over all communities of the same size , but also over many community sizes. This procedure was necessary since there is only a handful of communities at any given size . A more thorough study of real (large) networks will be able to tell us just how prevalent the separation in three regimes actually is.
iv.3 Correlations between the global community structure and the local structure of communities
An additional property is also captured by SPA+. The results shown in Fig. 5 investigate correlations between the organization within communities and the overarching community structure. We obtain the relation between the average internal degree of a node within communities of size (i.e., the “social involvement” of an individual within a group), and its membership number , in empirical datasets and the corresponding simulated networks. We quantify this relationship by the ratio .
Generally, all algorithms except GCE find that nodes active in the community structure (high number of memberships) also tend to be active within communities (high average internal degree ). Even though the agreement is not perfect, our model reproduces this effect through age-membership and age-degree correlations. While the available data do not tell whether these correlations are indeed age related, it is natural to assume that authors or employees who have been active for a longer time on the arXiv or in a company tend to have both more social groups and more relations within them. To the best of our knowledge, these correlations are not considered in other growth models, but emerge naturally here from our link creation mechanism. In essence, this means that individuals acting as hubs in the community structure (many memberships) also tend to act as hubs within the structure of their communities.
These remarks bear some relation to the hub dichotomy, first introduced in the literature of protein-protein interaction networks Han et al. (2004); Bertin et al. (2007); Agarwal et al. (2010), namely the distinction between date hubs (nodes with many links in different communities) and party hubs (nodes with many links from a given community). What we see in social networks is that there also exists a different and important class of hubs with many links from many different communities. This stresses anew the importance of nodes that act as social bridges by connecting different communities Nepusz et al. (2008); Wang et al. (2011). While these hubs have long been recognized as important Granovetter (1973), they are now also a focus of immunization methods on networks Masuda (2009); Hébert-Dufresne et al. (2013).
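The measurement discussed above can be illustrated with a minimal sketch. It assumes the network is given as an edge list and the detected communities as a list of node sets; the function name and the toy data are ours, and the exact ratio used in Fig. 5 is not reproduced here. The sketch only computes the mean internal degree of nodes grouped by their number of memberships.

```python
from collections import defaultdict


def internal_degree_by_membership(edges, communities):
    """Return {m: mean internal degree} over nodes with m memberships.

    edges: iterable of (u, v) pairs (undirected, no self-loops).
    communities: list of sets of nodes (overlap is allowed).
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    # One internal degree per (node, community) pair: the number of
    # neighbors of the node that belong to the same community.
    internal = defaultdict(list)
    for members in communities:
        for node in members:
            internal[node].append(len(neighbors[node] & members))

    # Average over the communities of each node, then group the nodes
    # by their membership number m.
    grouped = defaultdict(list)
    for node, degrees in internal.items():
        grouped[len(degrees)].append(sum(degrees) / len(degrees))
    return {m: sum(vals) / len(vals) for m, vals in grouped.items()}


# Toy example (hypothetical data): a triangle {0, 1, 2} and a star
# centered on node 2 with leaves 3 and 4; node 2 belongs to both groups.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (2, 4)]
communities = [{0, 1, 2}, {2, 3, 4}]
result = internal_degree_by_membership(edges, communities)
```

In this toy case the only node with two memberships (node 2) also has the largest average internal degree, which is the qualitative trend described in the text.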
iv.4 Some implications of Dunbar’s number
In Sec. II.3, we have discussed the theoretical relation between our model for the internal structure of communities and a cognitive limit in an individual’s social relationships known as Dunbar’s number. In our model, this limit stems from the ratio of the effort put into building new connections to that put into increasing group size , which constrains the average internal degree in large groups. In Fig. 6, we observe a similar behavior in our social network datasets. The empirical results are also compared with our model using the least-squares estimator (see Appendix B for details).
In the context of overlapping communities, we wish to emphasize three important caveats on the connection between our work and recent studies on Dunbar’s number. First, most work on bounds of active relationships in different communities is concerned with nested social levels Zhou et al. (2005); Hamilton et al. (2007). While our communities overlap, they are not in any way nested. Second, on a related issue, if we wish to interpret different communities as a node’s family, friends, or workplace, we should allow nodes to have different involvement in different communities. Third, if on the other hand we wish to interpret an entire network as one level of activity, Dunbar’s number then implies a bound on a node’s total degree. While both the internal average degree per community and the number of communities per node are bounded, we have shown strong correlations between these two quantities. Actually, one can easily infer from the algorithmic description of the model (see Sec. III.3) that the average degree converges to and is thus also bounded.
Finally, the observed plateau in internal degree implies a vanishing average density , i.e., the fraction of potential links that exist, for large communities. Regardless of the nature of the network, of the community detection algorithm and of the parameters , the simple existence of the plateau implies that community density vanishes as . This is obviously true in our model, and observed for our datasets in Fig. 6. Only the community structure of Enron as detected by LG stands out from the prediction. Further empirical studies would, however, be required to support this finding.
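The argument can be made explicit under the standard definition of density for a simple undirected community of $n$ nodes and $L$ internal links. The symbols $L$, $\langle k \rangle_n$ and $k^{*}$ below are ours and may differ from the notation used in the main text; $k^{*}$ stands for the height of the plateau.

```latex
\rho(n) = \frac{2L}{n(n-1)} = \frac{\langle k \rangle_n}{n-1},
\qquad
\langle k \rangle_n \to k^{*}
\;\Longrightarrow\;
\rho(n) \simeq \frac{k^{*}}{n} \to 0
\quad (n \to \infty).
```

Any bounded average internal degree therefore yields a density that decays as the inverse of the community size, independently of how the plateau arises.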
V Conclusion
We have introduced a simple model for the growth of a community and focused on its connection with a model that describes the growth of overlapping community structures (SPA). In so doing, we have shown that the local model is consistent with empirical observations (vanishing density and varying heterogeneity in communities). We have then explored a number of properties of the combined model (SPA+) and investigated the same properties in empirical networks. These properties came in three categories: global statistics (distributions of sizes, memberships and degrees), correlations between a node’s activities within communities and within the overarching community structure, and the vanishing density (as ) of large communities. In all cases, we have found that SPA+ behaves much like its empirical counterpart. We have also shown that our model is consistent with the theory of Dunbar’s number, both within communities and at the level of the complete network. The presentation of shortcomings and successes of the SPA+ principle (in terms of predictive value) shows the importance of, and the need for, further study of stochastic growth models.
Acknowledgments
We are grateful to anonymous referees for their suggestions in improving our presentation. The authors thank Calcul Québec for the computing facilities. This work has been supported by the Instituts de recherche en santé du Canada, the Conseil de recherches en sciences naturelles et en génie du Canada, the Fonds de recherche du Québec – Nature et technologies, and the James S. McDonnell Foundation Postdoctoral Fellowship. J.G.Y. and L.H.D. contributed equally to this work.
Appendix A Peloton dynamics
This Appendix presents our preliminary analysis of the results of Fig. 1, which are reminiscent of the peloton dynamics studied in Ref. Hébert-Dufresne et al. (2012). It is a finite-size effect related to the dynamics of the leaders; groups of highly connected individuals result in a clearly identifiable bulge in the degree distribution. Averaging over multiple realizations of the growth of a community leads to the creation of a peloton where one is significantly more likely to find entities than predicted by the asymptotic distribution. Because the same peloton evolves with growing , it is expected to retain its shape across a large range of community sizes. The simplest scaling ansatz takes the form
(21) 
where is a universal function. The construction is clear: takes care of the power law decreases and the scaled variable aligns all curves together. This exercise is carried out in Fig. 7 for the case . The procedure is inspired by Ref. Christensen and Moloney (2005) and is called quite appropriately data collapse.
Although we have not investigated the exact form of , its general behavior is characteristic of a number of self-organized critical systems observed thus far (Ref. Christensen and Moloney (2005)): A flat curve sharply rising to a well-defined maximum followed by a rapid exponential decrease as a function of the rescaled variable. The scaling information is captured by the exponents and . They can be extracted numerically from the positions of the maxima of the bulges of the individual probability distributions, together with the values of the probabilities at these maxima [see Fig. 7(a)] and the scaling ansatz of Eq. (21). The search for the best scaling exponents and is done separately under the assumption that they are independent. This is coherent with our scaling ansatz. In practice, one obtains from the asymptotic slope of the distributions (i.e., the initial dependence on before the peloton) and from a power law fit to versus . Our initial findings, based only on two values of , reveal that the exponents have only a mild dependence on and in particular that seems to be close to . In view of our small datasets, it is not expected that the numerical values of used in Fig. 7 are the absolute best scaling exponents. A complete analytical justification of our scaling ansatz and a derivation of the expected values of the exponents are still lacking. However, the mere existence of a scaling behavior provides useful estimates of how the degrees of the leaders scale with network size. This is crucial information when one is interested in the statistics of the extremes, both in theory Krapivsky and Redner (2002) and application Albert et al. (2000). This calls for a more extensive study beyond the scope of the present contribution.
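The data collapse procedure can be sketched as follows. Since the body of Eq. (21) is not reproduced here, this example assumes the generic finite-size scaling form P(k; n) = k^(-a) F(k / n^b), as found in standard treatments of data collapse; the exponent names a and b, the scaling function exp(-x), and the synthetic data are all hypothetical and are not the quantities fitted in Fig. 7.

```python
import math


def collapse(ks, probs, n, a, b):
    """Map one distribution P(k; n) onto the axes (k / n**b, k**a * P)."""
    xs = [k / n ** b for k in ks]
    ys = [k ** a * p for k, p in zip(ks, probs)]
    return xs, ys


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) versus log(x), i.e., a power-law fit."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    cov = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    var = sum((u - mx) ** 2 for u in lx)
    return cov / var


# Synthetic distributions obeying the assumed ansatz (hypothetical values).
A_TRUE, B_TRUE = 1.5, 0.5


def synthetic(k, n):
    return k ** (-A_TRUE) * math.exp(-k / n ** B_TRUE)


sizes = (100, 400, 1600)
rescaled = (0.5, 1.0, 2.0)  # same rescaled abscissas for every size
curves = {}
for n in sizes:
    ks = [x * n ** B_TRUE for x in rescaled]
    probs = [synthetic(k, n) for k in ks]
    curves[n] = collapse(ks, probs, n, A_TRUE, B_TRUE)

# Stand-in for the measured positions of the bulge maxima, which scale as
# n**b under the assumed ansatz; the fitted slope recovers b.
peaks = [n ** B_TRUE for n in sizes]
b_fit = loglog_slope(sizes, peaks)
```

With the correct exponents, the rescaled curves for all sizes fall on the same function of the rescaled variable; with real data the agreement is only approximate and serves as a visual or numerical criterion for choosing the exponents.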
Appendix B Parameter estimation
This Appendix presents our parameter estimation method. The problem is simply stated: We are given an empirical network of nodes, and an assignment of its nodes in overlapping communities. A number of statistics are associated with the network–communities pair: the node membership distribution, the community size distribution, and the internal degree distribution of each community. Our task is to identify the parameters which will generate synthetic network–communities pairs whose statistics are as close as possible to the statistics of the empirical dataset. Because the final network size and basic community size are both automatically determined by the empirical dataset, this amounts to identifying the optimal value of three free parameters: (size of communities), (memberships of nodes), and (density of communities).
One could be tempted to fit these three free parameters simultaneously, especially since the algorithm of Sec. III.2 integrates the local growth model (parametrized by ) and SPA (parametrized by ). Three observations indicate that this is not necessary. First, it is clear that only determines the number of communities to which an average node belongs, a quantity that has no bearing on the internal connectivity of a community. We can therefore fit this parameter independently of . Second, the introduction of [see Eq. (20)] allows us to treat and independently, even though both parameters are related to the rate of growth of communities. That is, we can always obtain a distribution of community sizes of exponent and simultaneously generate communities of average asymptotic degree . Only the value of (a nonphysical parameter) changes from one set of parameters to the other. Third, the coupling between and is already understood: Changes in the value of mostly affect the size distribution, and changes in the value of mostly affect the membership distribution Hébert-Dufresne et al. (2011, 2012). We use the word “mostly” because these parameters are independent if (always the case in this study), but there exists a weak coupling if ; the interplay between the two parameters is then prescribed by Eqs. (11) and (12). In what follows, we will explain how to fit and independently from one another, starting with and . Note that all estimated values of the parameters will be affixed with a caret: , , and .
b.1 Community structure estimators
The estimates and are obtained directly from the membership and size distributions of the empirical data. We first assume that these distributions are pure power laws and use a systematic method to extract their exponents by likelihood maximization Clauset et al. (2009). We then find a first set of values for and by inverting Eqs. (11) and (12). Because neither the empirical nor the modeled distribution is a pure power law, these values act as first approximations; small perturbations () to and can increase the quality of the fit. We select the estimates and that minimize the difference between the CCDF of the empirical and the simulated distributions.
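The two-step procedure above can be sketched as follows, under stated assumptions. The exponent is estimated with the approximate discrete maximum-likelihood formula of Clauset et al. (2009); the difference between CCDFs is measured here by the maximum absolute deviation, which is our choice since the text does not specify the metric. The `simulate` argument is a hypothetical callable standing in for a model simulation at a given parameter value, and the sketch refines a single parameter at a time.

```python
import math


def powerlaw_exponent(samples, x_min=1):
    """Approximate discrete power-law MLE (Clauset et al. 2009)."""
    tail = [x for x in samples if x >= x_min]
    log_sum = sum(math.log(x / (x_min - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum


def ccdf_at(samples, x):
    """Empirical complementary CDF, P(X >= x)."""
    return sum(1 for s in samples if s >= x) / len(samples)


def ccdf_distance(empirical, simulated):
    """Maximum absolute difference between two empirical CCDFs."""
    support = set(empirical) | set(simulated)
    return max(abs(ccdf_at(empirical, x) - ccdf_at(simulated, x))
               for x in support)


def refine(first_guess, simulate, empirical, deltas=(-0.05, 0.0, 0.05)):
    """Pick the perturbed parameter whose simulated CCDF is closest.

    simulate(p) is a hypothetical function returning a sample of sizes
    (or memberships) generated by the model with parameter value p.
    """
    candidates = [first_guess + d for d in deltas]
    return min(candidates,
               key=lambda p: ccdf_distance(empirical, simulate(p)))


# Toy usage with hypothetical data.
sizes = [1, 1, 1, 1, 2, 2, 4]
alpha = powerlaw_exponent(sizes)
same = ccdf_distance([1, 1, 2, 3], [1, 1, 2, 3])
far = ccdf_distance([1, 1, 1, 1], [2, 2, 2, 2])
best = refine(0.95, lambda p: [1] * 4 if p > 0.97 else [2] * 4, [1] * 4)
```

In practice the simulated sample would come from runs of the growth model, and the perturbation grid would be two dimensional since both parameters are adjusted.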
b.2 Density estimators
There exist many methods to fit to the empirical data. We will focus on two simple ones. The first is a straightforward least-squares estimator (LSE); it compares the distance between the observed average degree in groups of size and the analytical prediction of Eq. (6) for . By minimizing the distance over all , one obtains the estimate . The other method is a simple likelihood maximization (MLE) which relies on the results of the rate equations of Sec. II.1, i.e., on the internal degree distribution, parametrized by the community size . Let be the sequence of internal degrees of a real network, where refers to node and is the index of a community of node . Assuming uncorrelated communities, the log-likelihood that was used to generate the sequence is then
(22) 
where is the probability of finding a node of degree in a community of size , if the growth ratio equals . This probability is obtained by integrating Eq. (II.1), with the initial condition . We select the estimate that maximizes Eq. (22).
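Both estimators can be sketched with a simple grid search. The sketch below does not implement Eq. (6) or the integrated rate equations: `predicted_mean_degree(n, r)` and `degree_probability(k, n, r)` are hypothetical callables that a reader would replace with those results, and the toy stand-ins used in the demonstration (a saturating curve and a Poisson distribution) are chosen only so that the example runs.

```python
import math


def lse_estimate(observed_mean_degree, predicted_mean_degree, grid):
    """Least-squares estimate over a grid of candidate values.

    observed_mean_degree: {n: observed mean internal degree at size n}.
    predicted_mean_degree(n, r): model prediction (hypothetical callable).
    """
    def cost(r):
        return sum((k_obs - predicted_mean_degree(n, r)) ** 2
                   for n, k_obs in observed_mean_degree.items())
    return min(grid, key=cost)


def mle_estimate(internal_degrees, degree_probability, grid):
    """Maximum-likelihood estimate over a grid of candidate values.

    internal_degrees: list of (k, n) pairs, one per node-community pair.
    degree_probability(k, n, r): probability of degree k in a community
    of size n (hypothetical callable). Communities are assumed
    uncorrelated, so the log-likelihood is a plain sum.
    """
    def log_likelihood(r):
        total = 0.0
        for k, n in internal_degrees:
            p = degree_probability(k, n, r)
            if p <= 0.0:
                return -math.inf
            total += math.log(p)
        return total
    return max(grid, key=log_likelihood)


# Toy stand-ins (NOT the model's actual predictions).
def toy_mean(n, r):
    return r * n / (n + r)  # saturates at r for large n


def toy_probability(k, n, r):
    return math.exp(-r) * r ** k / math.factorial(k)  # Poisson, mean r


grid = [1.0 + 0.5 * i for i in range(15)]  # 1.0, 1.5, ..., 8.0
observed = {n: toy_mean(n, 4.0) for n in (5, 10, 50, 200)}
r_lse = lse_estimate(observed, toy_mean, grid)
r_mle = mle_estimate([(2, 10), (3, 10), (4, 10)], toy_probability, grid)
```

A grid search is used here for transparency; any one-dimensional optimizer would serve the same purpose once the search range is fixed.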
b.3 Bias of the density estimators
There exist three sources of bias for : the distribution of community sizes, the redundant memberships discussed in Sec. III.4, and the presence of overlap in empirical networks. In this section, we delineate these effects and introduce a simple correction mask that circumvents the bias. We use the following procedure to quantify this bias: We construct a number of SPA+ networks and obtain clean matrices of internal degrees . By “clean”, we mean that we do not collapse redundant memberships into single memberships (see Sec. III.4), and we do not take overlap into account. Then, we gradually introduce effects which are present in real systems, and establish how each effect influences the estimate . The numerical results of this investigation are displayed in Fig. 8 and Tables 2–2.
Case                             LSE                   MLE
Pure                             0.75  0.78  0.84      1.01  0.99  1.01
Collapsed                        0.80  0.81  0.86      1.09  1.05  1.05
Collapsed and overlapping (a)    2.02  1.50  1.23      1.11  1.03  1.05

Case                             LSE                   MLE
Pure                             0.77  0.80  0.85      1.01  0.99  1.00
Collapsed                        0.82  0.83  0.88      1.09  1.06  1.04
Collapsed and overlapping (a)    2.30  1.65  1.32      1.12  1.05  1.04

(a) We excluded some points from the average , because the estimates lay outside of the search ranges (LSE) (MLE) for and . The excluded points are (LSE) those that satisfy [a small lower right triangle in the space] and (MLE) those that satisfy or [left or bottom edge in the space].
b.3.1 Effect of community size
The estimators are first calibrated on pure internal structures [Figs. 8(a) and 8(b)]. In this regime, we do not transform the matrices of internal degrees. It corresponds to the case where communities are directly generated by the model of Sec. II. The quality of the estimate depends on , through its effect on the distribution of community sizes. As increases, the inference task becomes harder, because communities are smaller and mostly live in the fully connected regime, where there are few discriminating features (large ranges of yield similar internal degree distributions). The LSE performs best when : SPA+ generates only a few extremely large communities, and the internal degree within these communities falls neatly on the plateau of . The MLE performs relatively well across a wide range of values of , but we nonetheless observe a positive bias when : For these values of , SPA+ generates many communities in the intermediate size range, where the mean-field description of the local model is known to be numerically inaccurate (see Fig. 1). We also note that there is a noticeable variation of the ratio for fixed values of in the case of the LSE. This variation is due to changes in the maximum community size: If is large, then the network quickly reaches the target number of nodes, and the largest communities fall in the linear regime of .
b.3.2 Effect of redundant memberships
The next case of interest is that of the collapsed internal structures [Figs. 8(c) and 8(d)]. It is obtained by merging redundant memberships into single entities, and then removing the resulting self-loops and parallel links (see Sec. III.4). As a result of this procedure, communities that contain redundant copies of the same node decrease both in size and number of links. This leads to denser communities on average. These effects are only significant at very low values of and , i.e., for parameters that yield highly redundant communities. Redundant memberships have been shown to account for a vanishingly small fraction of all memberships when the number of communities is large Hébert-Dufresne et al. (2012). However, our numerical experiments show that, for the LSE, the effects of this source of bias do not decrease with system size for extreme values of ; in fact, they increase slightly (see Tables 2 and 2). This is because the effect of redundant memberships is more prominent in large communities (the most valuable communities for estimating ), which are more frequent when the network is larger.
b.3.3 Effect of overlap
A significant bias is introduced when one does not assign links to specific communities. This is what we call the collapsed and overlapping structures, where links increase the density of all the communities to which they belong, rather than a single one. This final case encompasses all the biases, and makes use of the information that should be recovered by means of a perfect community detection algorithm. As shown by our results, the bias is more pronounced in the significantly overlapping regime , where communities grow slower than the node reservoir. Again, our numerical experiments show that the effects of this source of bias increase slightly with community size (see Tables 2 and 2).
b.3.4 Bias removal mask
Since most overlapping community detection algorithms do not explicitly assign links, we are often placed in the “collapsed and overlapping” case. We use the following modeling procedure to account for the bias: (i) obtain the parameters that best model the community structure, (ii) compute an initial estimate of the strength of the internal connectivity of communities, and (iii) finally obtain a corrected estimate , as . The correction is the value of the bias removal mask for networks of nodes at point . Since the mask depends on the network size (see Tables 2 and 2), it is computed for each network separately. In practice, we obtain by first generating a number of SPA+ networks of nodes with fixed parameters . We have found that the final results are almost independent of the precise value of ; we have used but is an equally good choice. We then extract from the collapsed and overlapping communities (averaged over the number of SPA+ network realizations) and take . This bias removal mask allows us to generate networks with mean internal degrees on a curve that resembles the empirical data. We use the MLE because it is more stable with respect to changes in .
References
 Wasserman (1994) S. Wasserman, Social Network Analysis: Methods and Applications (Cambridge University Press, Cambridge, 1994).
 Newman (2010) M. E. J. Newman, Networks: An Introduction (Oxford University Press, Oxford, 2010).
 Watts and Strogatz (1998) D. J. Watts and S. H. Strogatz, “Collective dynamics of ‘small-world’ networks,” Nature 393, 440–442 (1998).
 Girvan and Newman (2002) M. Girvan and M. E. J. Newman, “Community structure in social and biological networks,” Proc. Natl. Acad. Sci. U.S.A. 99, 7821–7826 (2002).
 Fortunato (2010) S. Fortunato, “Community detection in graphs,” Phys. Rep. 486, 75–174 (2010).
 Xie et al. (2013) J. Xie, S. Kelley, and B. K. Szymanski, “Overlapping community detection in networks: The state-of-the-art and comparative study,” ACM Comput. Surv. 45, 43 (2013).
 Peixoto (2012) T. P. Peixoto, “Entropy of stochastic blockmodel ensembles,” Phys. Rev. E 85, 056122 (2012).
 Clauset et al. (2008) A. Clauset, C. Moore, and M. E. J. Newman, “Hierarchical structure and the prediction of missing links in networks,” Nature 453, 98–101 (2008).
 Guimerà and Sales-Pardo (2009) R. Guimerà and M. Sales-Pardo, “Missing and spurious interactions and the reconstruction of complex networks,” Proc. Natl. Acad. Sci. U.S.A. 106, 22073–22078 (2009).
 Seshadhri et al. (2012) C. Seshadhri, T. G. Kolda, and A. Pinar, “Community structure and scale-free collections of Erdős-Rényi graphs,” Phys. Rev. E 85, 056109 (2012).
 Yang and Leskovec (2012) J. Yang and J. Leskovec, “Community-affiliation graph model for overlapping network community detection,” in IEEE 12th International Conference on Data Mining (IEEE, Los Alamitos, CA, 2012) pp. 1170–1175.
 Hébert-Dufresne et al. (2016) L. Hébert-Dufresne, A. Allard, J.-G. Young, and L. J. Dubé, “Constrained growth of complex scale-independent systems,” Phys. Rev. E 93, 032304 (2016).
 Barabási and Albert (1999) A.-L. Barabási and R. Albert, “Emergence of scaling in random networks,” Science 286, 509–512 (1999).
 Simon (1955) H. A. Simon, “On a class of skew distribution functions,” Biometrika 42, 425–440 (1955).
 de Solla Price (1976) D. de Solla Price, “A general theory of bibliometric and other cumulative advantage processes,” J. Am. Soc. Inf. Sci. 27, 292–306 (1976).
 Hébert-Dufresne et al. (2011) L. Hébert-Dufresne, A. Allard, V. Marceau, P.-A. Noël, and L. J. Dubé, “Structural preferential attachment: Network organization beyond the link,” Phys. Rev. Lett. 107, 158702 (2011).
 Hébert-Dufresne et al. (2012) L. Hébert-Dufresne, A. Allard, V. Marceau, P.-A. Noël, and L. J. Dubé, “Structural preferential attachment: Stochastic process for the growth of scale-free, modular, and self-similar systems,” Phys. Rev. E 85, 026108 (2012).
 Hébert-Dufresne et al. (2015) L. Hébert-Dufresne, E. Laurence, A. Allard, J.-G. Young, and L. J. Dubé, “Complex networks as an emerging property of hierarchical preferential attachment,” Phys. Rev. E 92, 062809 (2015).
 (19) A Python implementation of the integrator is available online at https://github.com/spanetworks/spa.
 Dunbar (1992) R. I. M. Dunbar, “Neocortex size as a constraint on group size in primates,” J. Hum. Evol. 22, 469–493 (1992).
 de Ruiter et al. (2011) J. de Ruiter, G. Weston, and S. M. Lyon, “Dunbar’s number: Group size and brain physiology in humans reexamined,” Am. Anthropol. 113, 557–568 (2011).
 Shultz and Dunbar (2007) S. Shultz and R. I. M. Dunbar, “The evolution of the social brain: anthropoid primates contrast with other vertebrates,” Proc. R. Soc. Lond. B 274, 2429–2436 (2007).
 Gonçalves et al. (2011) B. Gonçalves, N. Perra, and A. Vespignani, “Modeling users’ activity on Twitter networks: Validation of Dunbar’s number,” PLoS ONE 6, e22656 (2011).
 Hill and Dunbar (2003) R. A. Hill and R. I. M. Dunbar, “Social network size in humans,” Hum. Nat. 14, 53–72 (2003).
 Dunbar (2012) R. I. M. Dunbar, “Social cognition on the internet: testing constraints on social network size,” Philos. Trans. R. Soc. Lond. Ser. B 367, 2192–2201 (2012).
 Dunbar and Shultz (2007) R. I. M. Dunbar and S. Shultz, “Evolution in the social brain,” Science 317, 1344–1347 (2007).
 (27) A C++11 implementation of SPA+ is available online at https://github.com/spanetworks/spa.
 Palla et al. (2005) G. Palla, I. Derényi, I. Farkas, and T. Vicsek, “Uncovering the overlapping community structure of complex networks in nature and society,” Nature 435, 814–818 (2005).
 Palla et al. (2008) G. Palla, I. J. Farkas, P. Pollner, I. Derényi, and T. Vicsek, “Fundamental statistical features and self-similar properties of tagged networks,” New J. Phys. 10, 123026 (2008).
 Klimt and Yang (2004) B. Klimt and Y. Yang, “The Enron corpus: A new dataset for email classification research,” in European Conference on Machine Learning (Springer, Berlin, 2004).
 Ahn et al. (2010) Y.-Y. Ahn, J. P. Bagrow, and S. Lehmann, “Link communities reveal multiscale complexity in networks,” Nature 466, 761 (2010).
 Lee et al. (2010) C. Lee, F. Reid, A. McDaid, and N. Hurley, “Detecting highly overlapping community structure by greedy clique expansion,” arXiv:1002.1827 (2010).
 Lancichinetti et al. (2011) A. Lancichinetti, F. Radicchi, J. J. Ramasco, and S. Fortunato, “Finding statistically significant communities in networks,” PLoS ONE 6, e18961 (2011).
 Evans and Lambiotte (2009) T. S. Evans and R. Lambiotte, “Line graphs, link partitions, and overlapping communities,” Phys. Rev. E 80, 016105 (2009).
 Young et al. (2015) J.-G. Young, A. Allard, L. Hébert-Dufresne, and L. J. Dubé, “A shadowing problem in the detection of overlapping communities: Lifting the resolution limit through a cascading procedure,” PLoS ONE 10, e0140133 (2015).
 Clauset et al. (2009) A. Clauset, C. R. Shalizi, and M. E. J. Newman, “Power-law distributions in empirical data,” SIAM Rev. 51, 661–703 (2009).
 Karrer and Newman (2011) B. Karrer and M. E. J. Newman, “Stochastic blockmodels and community structure in networks,” Phys. Rev. E 83, 016107 (2011).
 Peixoto (2015) T. P. Peixoto, “Model selection and hypothesis testing for large-scale network models with overlapping groups,” Phys. Rev. X 5, 011033 (2015).
 Han et al. (2004) J.-D. J. Han, N. Bertin, T. Hao, D. S. Goldberg, G. F. Berriz, L. V. Zhang, D. Dupuy, A. J. M. Walhout, M. E. Cusick, F. P. Roth, et al., “Evidence for dynamically organized modularity in the yeast protein–protein interaction network,” Nature 430, 88–93 (2004).
 Bertin et al. (2007) N. Bertin, N. Simonis, D. Dupuy, M. E. Cusick, J.-D. J. Han, H. B. Fraser, F. P. Roth, and M. Vidal, “Confirmation of organized modularity in the yeast interactome,” PLoS Biol. 5, e153 (2007).
 Agarwal et al. (2010) S. Agarwal, C. M. Deane, M. A. Porter, and N. S. Jones, “Revisiting date and party hubs: novel approaches to role assignment in protein interaction networks,” PLoS Comput. Biol. 6, e1000817 (2010).
 Nepusz et al. (2008) T. Nepusz, A. Petróczi, L. Négyessy, and F. Bazsó, “Fuzzy communities and the concept of bridgeness in complex networks,” Phys. Rev. E 77, 016107 (2008).
 Wang et al. (2011) Y. Wang, Z. Di, and Y. Fan, “Identifying and characterizing nodes important to community structure using the spectrum of the graph,” PLoS ONE 6, e27418 (2011).
 Granovetter (1973) M. S. Granovetter, “The strength of weak ties,” Am. J. Sociol. 78, 1360–1380 (1973).
 Masuda (2009) N. Masuda, “Immunization of networks with community structure,” New J. Phys. 11, 123018 (2009).
 Hébert-Dufresne et al. (2013) L. Hébert-Dufresne, A. Allard, J.-G. Young, and L. J. Dubé, “Global efficiency of local immunization on complex networks,” Sci. Rep. 3, 2171 (2013).
 Zhou et al. (2005) W.-X. Zhou, D. Sornette, R. A. Hill, and R. I. M. Dunbar, “Discrete hierarchical organization of social group sizes,” Proc. R. Soc. Lond. B 272, 439–444 (2005).
 Hamilton et al. (2007) M. J. Hamilton, B. T. Milne, R. S. Walker, O. Burger, and J. H. Brown, “The complex structure of hunter–gatherer social networks,” Proc. R. Soc. Lond. B 274, 2195–2203 (2007).
 Christensen and Moloney (2005) K. Christensen and N. R. Moloney, Complexity and Criticality (Imperial College Press, London, 2005).
 Krapivsky and Redner (2002) P. L. Krapivsky and S. Redner, “Statistics of changes in lead node in connectivity-driven networks,” Phys. Rev. Lett. 89, 258703 (2002).
 Albert et al. (2000) R. Albert, H. Jeong, and A.-L. Barabási, “Error and attack tolerance of complex networks,” Nature 406, 378–382 (2000).