Benchmark model to assess community structure in evolving networks
Detecting the time evolution of the community structure of networks is crucial to identify major changes in the internal organization of many complex systems, which may undergo important endogenous or exogenous events. This analysis can be done in two ways: considering each snapshot as an independent community detection problem or taking into account the whole evolution of the network. In the first case, one can apply static methods on the temporal snapshots, which correspond to configurations of the system in short time windows, and match afterwards the communities across layers. Alternatively, one can develop dedicated dynamic procedures, so that multiple snapshots are simultaneously taken into account while detecting communities, which allows us to keep memory of the flow. To check how well a method of any kind could capture the evolution of communities, suitable benchmarks are needed. Here we propose a model for generating simple dynamic benchmark graphs, based on stochastic block models. In them, the time evolution consists of a periodic oscillation of the system’s structure between configurations with built-in community structure. We also propose the extension of quality comparison indices to the dynamic scenario.
The analysis and modeling of temporal networks has received a great deal of attention lately, mainly due to the increasing availability of time-stamped network datasets Kovanen et al. (2011); Holme and Saramäki (2012); Perra et al. (2012); Starnini et al. (2012); Barrat et al. (2013). A relevant issue is whether and how the community structure of networks Fortunato (2010) changes in time. Communities reveal how networks are organized and function, hence major changes in their configuration might signal important turns in the evolution of the system as a whole, possibly anticipating dramatic developments such as rapid growth or disruption.
Indeed, there has been a great deal of activity around this topic in recent years Hopcroft et al. (2004); Chakrabarti et al. (2006); Palla et al. (2007); Ferlez et al. (2008); Ronhovde and Nussinov (2009); Mucha et al. (2010); Granell et al. (2011); Ronhovde et al. (2012); Bassett et al. (2013); Bródka et al. (2013); De Domenico et al. (2015). However, most investigations lack strength on the validation part, which typically consists in checking whether the results of the algorithm “make sense” in one or more real networks whose community structure is usually unknown. Actually, it is not obvious what exactly it means to test an algorithm for detecting evolving communities. One idea could be that of correctly identifying the community structure of the system at each time stamp. However, during the evolution of the system several events that affect the network structure may occur, such as the creation or deletion of nodes or links, or link rewiring. These events cannot be detected by observing a single time-stamped network; understanding them properly requires taking the whole evolution into account.
To explicitly keep track of the history of the system, an option is to consider multiple snapshots at once. For instance, in the evolutionary clustering approach Chakrabarti et al. (2006) the goal is to find a partition that is descriptive of the structure of a given snapshot as well as correlated to the structure of the previous snapshots. Furthermore, the added value of any approach should be the ability to promptly detect changes in the community structure of the network. It would be possible to verify this if there were suitable benchmark graphs with evolving clusters, but those are still missing. This paper aims at filling this gap. We propose a model, derived from the classic stochastic block models Holland et al. (1983); Girvan and Newman (2002); Lancichinetti et al. (2008); Guimerà and Sales-Pardo (2009), that generates three classes of dynamic benchmark graphs. The objective is to provide time-evolving networks, such that at each snapshot the partition into communities is well defined according to the model. To keep things simple we consider a periodic evolution such that the same history repeats itself in cycles and is invariant under time reversal. The analysis of the community structure evolution for the designed benchmarks reveals that approaches exploiting the flow of system configurations might be more accurate in detecting the evolving community structure than methods that consider the snapshots independently. Note that in real data sets this evolution can be sharp and bursty. In these cases, however, the challenge of finding the community structure is not well defined, because the range of timescales makes the mesoscopic structure clearly disconnected.
The paper is structured as follows. In Section 2 we describe the model to generate the benchmark networks. Section 3 introduces measures of comparison between dynamic clusterings, and Section 4 shows an example of the application of a dynamic multislice algorithm to the proposed benchmarks. Section 5 gives a summary and reports our conclusions.
2 Model description
The model we propose for generating networks with evolving community structure is based on the classic stochastic block model (SBM) Holland et al. (1983). It works as follows. A network is divided into a number of subgraphs and the nodes of the same subgraph are linked with a probability $p_{\mathrm{in}}$, whereas nodes of different subgraphs are linked with a probability $p_{\mathrm{out}}$. Such probabilities match the link densities within and between subgraphs. Supposing subgraphs of equal size, if $p_{\mathrm{in}} > p_{\mathrm{out}}$ the resulting subgraphs are communities, as the (expected) link density within subgraphs exceeds their connectivity to the rest of the graph. The generation of samples from this model has a built-in efficiency: if there are $M$ pairs of nodes, each linked with probability $p$, the actual number of edges is drawn from a binomial distribution with parameters $M$ and $p$. Then, we simply place this number of edges randomly among the pairs to generate a sample from our ensemble.
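The sampling shortcut described above can be sketched in a few lines of Python. This is a minimal illustration, not the released benchmark code: the function names `draw_edges` and `sample_sbm` and the parameter names are assumptions made for the example, and the binomial draw is written as a sum of Bernoulli trials only to stay within the standard library.

```python
import itertools
import random


def draw_edges(pairs, p, rng):
    """Pick Binomial(len(pairs), p) edges uniformly at random from `pairs`."""
    # Binomial draw written as a sum of Bernoulli trials (illustrative only;
    # a dedicated binomial sampler would be faster for large networks).
    m = sum(rng.random() < p for _ in pairs)
    # Place exactly m edges on distinct, randomly chosen node pairs.
    return set(rng.sample(pairs, m))


def sample_sbm(sizes, p_in, p_out, seed=None):
    """Return (membership, edges) for one SBM sample with the given block sizes."""
    rng = random.Random(seed)
    # membership[i] is the block index of node i.
    membership = [c for c, size in enumerate(sizes) for _ in range(size)]
    intra, inter = [], []
    for u, v in itertools.combinations(range(len(membership)), 2):
        (intra if membership[u] == membership[v] else inter).append((u, v))
    # Intra-block and inter-block pairs are sampled with their own densities.
    return membership, draw_edges(intra, p_in, rng) | draw_edges(inter, p_out, rng)
```

Calling `sample_sbm([32, 32], 0.5, 0.05, seed=1)` would return one sample with two blocks of 32 nodes; the specific values here are only an illustration.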
The model implements the two fundamental classes of dynamic processes: growing or shrinking and merging or splitting of communities. By combining these two reversible types of processes one can capture the most common behaviors of dynamic communities in real systems. We are then able to generate three standardized benchmarks: One consists in communities that grow and shrink in size (keeping fixed the total number of nodes of the network), while the second considers communities that merge and split. The third one is a mixed version of the previous two, and consists of a combination of the last four operations.
2.1 Grow-shrink benchmark
This process models the movement of nodes from one community to another. At all times, two communities are kept in a SBM ensemble with intracommunity link density $p_{\mathrm{in}}$ and intercommunity link density $p_{\mathrm{out}}$. However, the number of nodes in the two communities changes over time. In the basic process, we have a total of $2n$ nodes in two communities. In the balanced state, these are split into two equal communities of $n$ nodes, which we call $A$ and $B$. At the extremes, a fraction $f$ of nodes in community $A$ will switch to community $B$. If we take $n_A$ as the size of community $A$, then the number of nodes in the community is $n(1-f) \le n_A \le n(1+f)$. Then, at time $t$ the number of nodes in community $A$ is

$$n_A(t) = n - n f \left[ 2\, x\!\left(t + \tfrac{\tau}{4}\right) - 1 \right], \qquad n_B(t) = 2n - n_A(t), \tag{1}$$

with the phase factor $\tau/4$ specifying equal sized communities at $t = 0$. The function $x(t)$ is the triangular waveform

$$x(t) = \begin{cases} 2t^{*}, & 0 \le t^{*} < 1/2, \\ 2 - 2t^{*}, & 1/2 \le t^{*} < 1, \end{cases} \tag{2}$$

(with $t^{*} = (t/\tau + \phi) \bmod 1$), which controls the time periodicity. The constant $\phi$ is a phase factor with $\phi = 0$ for the case $q = 2$ and specified otherwise in the case of $q > 2$. With this formulation, we get communities of sizes $n$, $n - nf$, $n$, and $n + nf$ at $t = 0$, $\tau/4$, $\tau/2$, and $3\tau/4$, respectively. In practice, all nodes are sorted in some arbitrary order, and the first $n_A$ nodes are put into community $A$, and the others into community $B$. Say these nodes are to .
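As an illustration of the size schedule, the sketch below evaluates a triangular waveform and the resulting community sizes. It assumes one form consistent with the description: a waveform $x(t)$ of period $\tau$ that rises linearly from 0 to 1 and back, and a size $n_A(t) = n - nf[2x(t+\tau/4) - 1]$. The function names and the rounding to integer sizes are assumptions of this example.

```python
def triangular_wave(t, tau, phi=0.0):
    """Triangular waveform with period tau: 0 at t*=0, 1 at t*=1/2 (assumed form)."""
    t_star = (t / tau + phi) % 1.0
    return 2.0 * t_star if t_star < 0.5 else 2.0 - 2.0 * t_star


def community_sizes(t, n, f, tau, phi=0.0):
    """Sizes (n_A, n_B) of the two communities at time t, with 2n nodes in total."""
    # The tau/4 shift makes the two communities equal in size at t = 0.
    x = triangular_wave(t + tau / 4.0, tau, phi)
    # Rounding to an integer number of nodes is an assumption of this sketch.
    n_a = round(n - n * f * (2.0 * x - 1.0))
    return n_a, 2 * n - n_a
```

With the hypothetical values `n=32`, `f=0.5`, and `tau=100`, the sizes of community $A$ at `t = 0, 25, 50, 75` are 32, 16, 32, and 48, which matches the sequence balanced, minimum, balanced, maximum.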
After the community sizes are decided, the edges must be placed so that the two communities remain in the proper SBM ensemble, with equal and independent link probabilities, at all times. The independence of pairs provides a hint on how to do this. When a node $i$ is moved from community $A$ to $B$, all the existing edges of node $i$ are removed. Then an edge is added between $i$ and each node in the destination community $B$ with equal and independent probability $p_{\mathrm{in}}$, and between $i$ and each node in community $A$ with equal and independent probability $p_{\mathrm{out}}$; thus the ensemble is maintained. Conveniently, all edges can be pre-computed and stored to allow a strictly repeating process, with the state at time $t$ being identical to the state at time $t + \tau$, in analogy to the merging process.
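The node-move step can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `move_node` and its arguments are hypothetical, edges are stored as sorted tuples in a set, and the pre-computation that makes the process strictly periodic is not included.

```python
import random


def move_node(i, dest, membership, edges, p_in, p_out, rng):
    """Move node i to community `dest` and redraw every edge incident to i."""
    # 1. Remove all existing edges of node i.
    edges.difference_update([e for e in edges if i in e])
    # 2. Reassign the node to its destination community.
    membership[i] = dest
    # 3. Redraw each possible edge of i independently: p_in towards nodes of
    #    the destination community, p_out towards all other nodes.
    for j in range(len(membership)):
        if j == i:
            continue
        p = p_in if membership[j] == dest else p_out
        if rng.random() < p:
            edges.add((min(i, j), max(i, j)))
```

Because each pair involving the moved node is redrawn independently with the density appropriate to its new community, the graph after the move is again a sample of the two-community ensemble.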
A special case that we need to cope with is the situation where is very high and is very low. When this happens, a community shrinks too much and it may become disconnected. In order to preserve the ensemble, we do not take actions to totally eliminate this possibility, but we ensure that to reduce the probability of disconnection. However, if a disconnection occurs, the process is aborted and re-run. Figure 1(a) is a sketch of the grow-shrink benchmark for the case .
2.2 Merge-split benchmark
This process models the merging of two communities. In this setup, we have a set of $2n$ nodes, divided into two communities of $n$ nodes each. Each of the two initial communities has a link density of $p_{\mathrm{in}}$, where those links are placed at initialization and kept unmodified over time. There are two extreme states: the unmerged and the merged state. In the unmerged state, all possible pairs of nodes between the two communities have an edge with probability $p_{\mathrm{out}}$. This means that the network still has a connected component, but the nodes form two communities. In the merged state, all possible pairs of nodes between these two communities have an edge with probability $p_{\mathrm{in}}$, which implies that all pairs of nodes in the network have the same link density $p_{\mathrm{in}}$; the previous two communities are now indistinguishable, and thus we have one large community with $2n$ nodes.
The merge-split process is a periodic interpolation of the merged and unmerged states. The numbers of intercommunity edges in the unmerged state, $m_{\mathrm{out}}$, and in the merged state, $m_{\mathrm{in}}$, are first drawn from binomial distributions with parameters $n^2$ (the number of intercommunity node pairs) and $p_{\mathrm{out}}$ or $p_{\mathrm{in}}$, respectively. All possible intercommunity edges are placed in some arbitrary but random order, and the first

$$m(t) = \left[1 - x(t)\right] m_{\mathrm{out}} + x(t)\, m_{\mathrm{in}} \tag{3}$$

edges are selected to be active at time $t$. The effective intercommunity link density is $p^{*}_{\mathrm{out}}(t) = m(t)/n^2$. The parameter $x(t)$ is the triangular waveform from Eq. (2). In practice, this means that at time $t = 0$ the communities are unmerged and at $t = \tau/2$ the communities are merged, with linear interpolation (of the number of edges) between these points. Since the possible edges are ordered only at initialization, the process is strictly periodic, that is, the edges present at time $t$ are identical to those present at time $t + \tau$.
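The interpolation between the unmerged and merged states can be sketched as below. This is an illustrative reading of the procedure, not the released code: the class name `MergeSplit`, the attribute names, the rounding of the active edge count, and the triangular waveform with value 0 at $t = 0$ and 1 at half the period are assumptions of the example.

```python
import itertools
import random


def triangular_wave(t, tau):
    """Triangular waveform with period tau: 0 at t = 0, 1 at t = tau/2 (assumed form)."""
    t_star = (t / tau) % 1.0
    return 2.0 * t_star if t_star < 0.5 else 2.0 - 2.0 * t_star


class MergeSplit:
    """Two communities of n nodes that periodically merge and split."""

    def __init__(self, n, p_in, p_out, tau, seed=None):
        rng = random.Random(seed)
        self.tau = tau
        a, b = range(n), range(n, 2 * n)
        # Intracommunity edges are drawn once and never modified.
        self.intra = {pair for block in (a, b)
                      for pair in itertools.combinations(block, 2)
                      if rng.random() < p_in}
        # All intercommunity pairs are put in a random order only once,
        # which is what makes the process strictly periodic.
        self.inter = [(u, v) for u in a for v in b]
        rng.shuffle(self.inter)
        # Binomial edge counts of the two extreme states (Bernoulli sums).
        self.m_unmerged = sum(rng.random() < p_out for _ in self.inter)
        self.m_merged = sum(rng.random() < p_in for _ in self.inter)

    def edges(self, t):
        """Edge set at time t: fixed intra edges plus the first m(t) inter edges."""
        x = triangular_wave(t, self.tau)
        m = round((1.0 - x) * self.m_unmerged + x * self.m_merged)
        return self.intra | set(self.inter[:m])
```

Since the order of the candidate intercommunity edges is fixed at construction, `edges(t)` and `edges(t + tau)` return the same set, and the evolution is symmetric around `t = tau / 2`.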
One may think that the communities are fully merged only at the extreme of this process, where the intercommunity link density is $p_{\mathrm{in}}$ (at $t = \tau/2$). However, due to the detectability limit of communities in stochastic block models, this is not the case Decelle et al. (2011). Even when the effective intercommunity link density $p^{*}_{\mathrm{out}}$ is below $p_{\mathrm{in}}$, it can be that the configuration is indistinguishable from one large community. Following Decelle et al. (2011), at the point

$$n \left( p_{\mathrm{in}} - p^{*}_{\mathrm{out}} \right) = \sqrt{ n \left( p_{\mathrm{in}} + p^{*}_{\mathrm{out}} \right) } \tag{4}$$

we consider the communities to be merged into one for all practical purposes. While this limit is strictly speaking only accurate in the sparse and infinite-size limit, it is an adequate approximation. A schematic representation of the merge-split benchmark is shown in Fig. 1(b).
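To make the merging point concrete, the sketch below solves for the intercommunity density at which two equal-sized groups become undetectable. It assumes the two-group detectability condition of Decelle et al. (2011) written as $n(p_{\mathrm{in}} - p^{*}_{\mathrm{out}}) = \sqrt{n(p_{\mathrm{in}} + p^{*}_{\mathrm{out}})}$ for communities of $n$ nodes; the function name is hypothetical.

```python
import math


def merged_threshold(n, p_in):
    """Intercommunity density above which two groups of n nodes count as merged.

    Assumes the condition n * (p_in - p) = sqrt(n * (p_in + p)). Writing
    d = p_in - p and squaring gives n * d**2 + d - 2 * p_in = 0, whose
    positive root is used below.
    """
    d = (-1.0 + math.sqrt(1.0 + 8.0 * n * p_in)) / (2.0 * n)
    return p_in - d
```

For the hypothetical values `n = 32` and `p_in = 0.5`, the function gives a threshold of about 0.338, so under this assumption the communities would count as merged well before the intercommunity density reaches `p_in`.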
2.3 Mixed benchmark
This process is a combination of the merging and growing processes. In this process, there is a total of nodes with two merging-splitting communities ( nodes) and two growing-shrinking communities ( nodes). The intra-community links are managed with the same processes as above with phase factors of for both. If there are total communities, then the pairs of communities involved in merging and growing process have phase factors . Between the pairs of nodes that belong to different processes, an edge exists with a probability of . Figure 1(c) exemplifies the mixed benchmark when .
3 Time-dependent comparison measures
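The assessment of the performance of any clustering algorithm requires the use of measures to define the distance or similarity between any pair of partitions. The list of available measures is long, including e.g. the Jaccard index Jaccard (1912), the Rand index Rand (1971), the adjusted Rand index Hubert and Arabie (1985), the normalized mutual information Strehl and Ghosh (2002), the van Dongen metric Dongen (2000) and the normalized variation of information metric Meilă (2007). All of them can be expressed in terms of the elements of the so-called confusion matrix or contingency table, thus we focus first on its calculation. Let $\mathcal{C} = \{C_\alpha\}$ and $\mathcal{C}' = \{C'_{\alpha'}\}$ be two partitions of the data in $r$ and $r'$ disjoint clusters. The $\alpha\alpha'$th component of the contingency table accounts for the number of elements in the intersection of clusters $C_\alpha$ and $C'_{\alpha'}$,

$$n_{\alpha\alpha'} = \left| C_\alpha \cap C'_{\alpha'} \right|. \tag{5}$$

The sizes of the clusters simply read $n_{\alpha\cdot} = \sum_{\alpha'} n_{\alpha\alpha'}$ and $n_{\cdot\alpha'} = \sum_{\alpha} n_{\alpha\alpha'}$, and the total number of elements is $N = \sum_{\alpha}\sum_{\alpha'} n_{\alpha\alpha'}$. With these definitions at hand, one can calculate the Jaccard index,

$$J = \frac{\sum_{\alpha}\sum_{\alpha'} \binom{n_{\alpha\alpha'}}{2}}{\sum_{\alpha} \binom{n_{\alpha\cdot}}{2} + \sum_{\alpha'} \binom{n_{\cdot\alpha'}}{2} - \sum_{\alpha}\sum_{\alpha'} \binom{n_{\alpha\alpha'}}{2}}, \tag{6}$$

the normalized mutual information index,

$$\mathrm{NMI} = \frac{-2 \sum_{\alpha}\sum_{\alpha'} n_{\alpha\alpha'} \log \dfrac{N\, n_{\alpha\alpha'}}{n_{\alpha\cdot}\, n_{\cdot\alpha'}}}{\sum_{\alpha} n_{\alpha\cdot} \log \dfrac{n_{\alpha\cdot}}{N} + \sum_{\alpha'} n_{\cdot\alpha'} \log \dfrac{n_{\cdot\alpha'}}{N}}, \tag{7}$$

and the normalized variation of information metric,

$$\mathrm{NVI} = -\frac{1}{\log N} \sum_{\alpha}\sum_{\alpha'} \frac{n_{\alpha\alpha'}}{N} \log \frac{n_{\alpha\alpha'}^{2}}{n_{\alpha\cdot}\, n_{\cdot\alpha'}}, \tag{8}$$

where, by convention, $0 \log 0 = 0$.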
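As a small illustration of how such measures follow from the contingency table, the sketch below builds the table from two label lists and computes the Jaccard index in its standard pair-counting form (pairs of elements grouped together in both partitions, divided by pairs grouped together in at least one). The function names and the label-list input format are assumptions of this example.

```python
from collections import Counter


def contingency_table(labels_a, labels_b):
    """Map (cluster in A, cluster in B) -> number of shared elements."""
    return Counter(zip(labels_a, labels_b))


def pairs(k):
    """Number of unordered pairs among k elements."""
    return k * (k - 1) // 2


def jaccard_index(labels_a, labels_b):
    """Pair-counting Jaccard index between two partitions of the same elements."""
    table = contingency_table(labels_a, labels_b)
    together_in_both = sum(pairs(v) for v in table.values())
    together_in_a = sum(pairs(v) for v in Counter(labels_a).values())
    together_in_b = sum(pairs(v) for v in Counter(labels_b).values())
    denominator = together_in_a + together_in_b - together_in_both
    # Two partitions made only of singletons are treated as identical.
    return together_in_both / denominator if denominator else 1.0
```

For example, `jaccard_index([0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 0, 1, 1, 1, 1, 1])` evaluates to `9 / 16 = 0.5625`: nine pairs are together in both partitions, out of sixteen pairs that are together in at least one.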
In the case of evolving networks we have to compare two sequences of partitions $\{\mathcal{C}_t\}$ and $\{\mathcal{C}'_t\}$, a task that can be performed in different ways. The simplest solution is the independent comparison of partitions at each time step, by measuring the similarity or distance between $\mathcal{C}_t$ and $\mathcal{C}'_t$ for each value of $t$, thus obtaining, e.g., a Jaccard index for each snapshot; see Fig. 2(a). However, this procedure discards the evolutionary nature of the communities: we would like to quantify not only the static resemblance of the communities but also whether they evolve in a similar way.
Our proposal consists in the definition of windowed forms of the different indices and metrics, obtained by considering sequences of consecutive partitions, i.e. time windows of a predefined duration . In Fig. 2(b) we show the comparison between individual snapshots and sequences of length 2. For example, let us consider the time window formed by time steps from to . Every node belongs to a different cluster at each snapshot, and this evolution can be identified as one of the items in for the first sequence of partitions, and for the second one, where the multiplication sign denotes the Cartesian product of sets. Since the number of nodes is , there are at most different nonvoid sets and the same for . For example, in Fig. 2(b), the combinations of partitions (excluding empty sets) are and . Next, we may define the elements of the contingency table for this time window as
which accounts for the number of nodes following the same cluster evolutions and . Likewise, we have
Finally, we may use Eqs. (6)–(8) to calculate the corresponding windowed Jaccard index , windowed normalized mutual information index , and windowed normalized variation of information metric , respectively. Of course, the windowed measures reduce to the standard static ones when , and are able to capture differences in the evolution of communities that cannot be distinguished using their classical versions (see the Appendix).
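The windowed construction can be sketched by labelling every node with the tuple of clusters it visits inside the window and then applying a static measure to those tuples. The example below does this for the NVI, assumed here to be the variation of information divided by $\log N$ with natural logarithms. The function names and the input format (one list of node labels per snapshot) are assumptions of this sketch.

```python
import math
from collections import Counter


def nvi(labels_a, labels_b):
    """Normalized variation of information between two labelings (assumed form)."""
    n_total = len(labels_a)
    if n_total < 2:
        return 0.0
    table = Counter(zip(labels_a, labels_b))
    rows, cols = Counter(labels_a), Counter(labels_b)
    # Only nonzero table entries are summed, so 0 log 0 never appears.
    vi = -sum(v / n_total * math.log(v * v / (rows[a] * cols[b]))
              for (a, b), v in table.items())
    return vi / math.log(n_total)


def windowed_nvi(seq_a, seq_b, window):
    """NVI between cluster evolutions for every window of `window` snapshots.

    seq_a[t][i] is the cluster label of node i at snapshot t in the first
    sequence, and likewise for seq_b.
    """
    values = []
    for t in range(len(seq_a) - window + 1):
        # Each node is labelled by the tuple of clusters it visits in the window.
        evolutions_a = list(zip(*seq_a[t:t + window]))
        evolutions_b = list(zip(*seq_b[t:t + window]))
        values.append(nvi(evolutions_a, evolutions_b))
    return values
```

With `window=1` the tuples have a single entry and the function returns the static NVI of each snapshot. Larger windows also penalize nodes whose sequence of clusters differs between the two sequences of partitions.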
We will see in the next section how the plots of are valuable to compare different algorithms and to detect in which moments of the time evolution they differ. Nevertheless, it is also convenient to have a single number to quantify the overall deviation. A simple solution is the use of the average squared errors, which is expressed as follows:
For simplicity and for its superior mathematical properties (see Meilă (2007)) we have chosen to use only the NVI metric in the rest of this article. See the Supplemental Material for the results using the normalized mutual information and the Jaccard index.
Here we show an example of the application of a community detection algorithm, designed to take into account the evolution of complex networks, to reveal the community structure in our benchmarks. The chosen method is the multislice algorithm of Mucha et al. (2010), which extends the definition of modularity to multilayer networks. In their representation, each layer (slice) consists of a single network at a particular time. The slices are connected to one another by joining each node with its counterpart in the next and previous layer, and this link has a specified weight $\omega$, equal for all links of this kind, which acts as a tuning parameter. For $\omega = 0$, no connection between slices is considered and the algorithm is performed statically. As this value increases, more consideration is given to the communities across layers. The formulation includes an additional parameter $\gamma$, which tunes the resolution at which communities are found, in the manner of Reichardt and Bornholdt (2004). In this work, we have used the GenLouvain code available at http://netwiki.amath.unc.edu/GenLouvain/GenLouvain, setting the resolution parameter $\gamma$ to 1 and varying the interslice coupling $\omega$.
The benchmarks used to put to test this algorithm are generated using the model proposed in this paper. For the sake of simplicity, we generate three simple standard benchmarks, one for each basic procedure: grow-shrink, merge-split and mixed. The grow-shrink benchmark consists in a network with communities, where each community has initially nodes (therefore the total size of the network is ), with , , , and time steps. The merge-split test has a variable number of communities; in this paper we use the parameters communities of size each, with , , and . The mixed benchmark, a combination of the previous two, has communities of nodes each, and the other parameters are set as in the previous cases.
Figure 3 shows the planted partitions for the three benchmarks and the results from the multislice algorithm at three different interslice couplings: In the extreme case slices are considered independently, is an intermediate value that provides good results, and provides an example of the partitioning obtained when using strong coupling between layers. It can be seen that for we obtain a different partition for each time step, and the results are mostly correct, except for those configurations of the sizes of the communities where the preference of modularity for equal-sized communities hampers the process (see the first column of Fig. 3). Higher values of request higher consistency through time, which implies that the number of misclassified individual snapshots is reduced. We have also compared the multislice method with a temporal stability approach Petri and Expert (2014) and the results obtained are very similar to the results of the multislice algorithm obtained at .
To quantitatively evaluate the results, we use the windowed measures introduced in the previous section. We calculate the measures between the partitions obtained by the algorithm and the planted ones, for three values of the time window. When the time window is of size 1 (), each snapshot is considered independently, that is, we have computed the measure between the planted partition at and the algorithm’s result at , repeating this process until . Instead, with the time window of size 2 (), we evaluate the evolution of the partitions during two consecutive time steps, following the same process but comparing the planted partitions at with the algorithm’s results at . This formulation is more restrictive, as we impose, in addition to the condition that the nodes must belong to the same community, that their evolution during two consecutive time steps is also the same. Similarly, we have also analyzed time windows of size 5 () to check the quality of the detected community evolutions at longer ranges.
Figure 4 shows the results for the NVI. We observe that, for the grow-shrink benchmark, the error is large for , but becomes almost zero at . Moreover, the values of the NVI increase with the size of the time window for and , but in a larger amount when the parameter corresponds to the static version of the multislice algorithm. This means that the interslice weight is helping to find the persistence of nodes in their communities, as expected. The merge-split benchmark shows an almost identical bad performance for the three values of at windows of size 1, but does not make it worse when the size of the window increases, unlike the other two. The mixed benchmark is quite neutral, with just a small difference from . Finally, the NVI squared errors reported in Table 1 and calculated using Eq. (15) are in perfect agreement with this analysis. The results using the NMI and Jaccard indices (see the Supplemental material) also support these observations. Thus, we may conclude that, in this case, the use of memory to track the evolution of communities is convenient, but the trade-off between the continuity of the community structure and its static relevance must be carefully adjusted.
[Table 1: NVI squared error. Columns: Time, NVI squared error.]
We have presented a simple model based on the stochastic block model that allows for the construction of time-dependent networks with evolving community structure. It is useful for benchmarking purposes in testing the ability of community detection algorithms to track properly the structural evolution. We have also introduced extended time-dependent measures for the comparison of different partitions in the dynamic case, which allow for the observation of differences between the outcome of the algorithms and the planted partitions through time.
Our code for benchmark generation and the time-dependent comparison indices is available at http://rkd.zgib.net/proj/multiplex/ and released under the GNU General Public License.
Acknowledgements. This work was partially supported by MINECO through Grant No. FIS2012-38266 and by the EC FET-Proactive Project PLEXMATH (Grant No. 317614). A.A. also acknowledges partial financial support from the ICREA Academia and the James S. McDonnell Foundation. R.K.D. and S.F. gratefully acknowledge MULTIPLEX, Grant No. 317532 of the European Commission, and the computational resources provided by Aalto University Science-IT project.
Appendix: Distinguishing community evolutions with windowed measures
Figure 5 shows an example in which, according to the planted partitions, the eight nodes of a network are divided into two communities of four nodes each, and these partitions remain constant throughout the three time steps of the network evolution. Two different community detection algorithms find the community evolutions represented in Figs. 5(a) and 5(b), which are characterized by the assignment of just one node to the wrong community at each time step. In Fig. 5(a) this node is the fourth one during the three time steps, while in Fig. 5(b) the misassigned nodes are the second, the third, and the sixth, respectively. Since the nature of the mistake is the same at all time steps, the comparison of the planted and algorithm partitions with a time window of size 1 generates equivalent contingency tables; thus the standard comparison measures do not change in time, with a constant value of the NVI equal to 0.2856. However, if we take into account a time window of size 3, the two evolving community structures detected by the algorithms are different, yielding structurally different contingency tables and values of the NVI equal to 0.2856 and 0.3852, respectively. Therefore, windowed measures give complementary information for the comparison of time-evolving community structures, because they take several snapshots into account at the same time.
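The two window-3 values can be checked by hand. The following worked sketch assumes that the NVI is the variation of information divided by $\log N$ with natural logarithms, that nodes 1 to 4 form the first planted community (labelled $A$) and nodes 5 to 8 the second (labelled $B$), and that a cluster evolution is written as the string of labels a node takes over the three time steps; these labels are introduced here only for illustration. Rows are the planted evolutions $AAA$ and $BBB$, and columns are the detected evolutions.

$$\text{Fig. 5(a):}\quad \begin{array}{c|cc} & AAA & BBB \\ \hline AAA & 3 & 1 \\ BBB & 0 & 4 \end{array} \qquad \mathrm{NVI} = -\frac{1}{\log 8}\left[ \frac{3}{8}\log\frac{9}{4\cdot 3} + \frac{1}{8}\log\frac{1}{4\cdot 5} + \frac{4}{8}\log\frac{16}{4\cdot 5} \right] \approx \frac{0.5939}{2.0794} \approx 0.2856$$

$$\text{Fig. 5(b):}\quad \begin{array}{c|ccccc} & AAA & BAA & ABA & BBA & BBB \\ \hline AAA & 2 & 1 & 1 & 0 & 0 \\ BBB & 0 & 0 & 0 & 1 & 3 \end{array} \qquad \mathrm{NVI} = -\frac{1}{\log 8}\left[ \frac{2}{8}\log\frac{4}{4\cdot 2} + 3\cdot\frac{1}{8}\log\frac{1}{4\cdot 1} + \frac{3}{8}\log\frac{9}{4\cdot 3} \right] \approx \frac{0.8010}{2.0794} \approx 0.3852$$

In case (a) the misplaced node follows a single wrong evolution, so the table has the same shape as for one snapshot. In case (b) three different nodes each follow their own evolution, which spreads the counts over more columns and increases the NVI.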
[Table: Jaccard squared error. Columns: Time, Jaccard squared error.]
[Table: NMI squared error. Columns: Time, NMI squared error.]
- L. Kovanen, M. Karsai, K. Kaski, J. Kertész, and J. Saramäki, J. Stat. Mech. (2011) P11005.
- P. Holme and J. Saramäki, Phys. Rep. 519, 97 (2012).
- N. Perra, A. Baronchelli, D. Mocanu, B. Gonçalves, R. Pastor-Satorras, and A. Vespignani, Phys. Rev. Lett. 109, 238701 (2012).
- M. Starnini, A. Baronchelli, A. Barrat, and R. Pastor-Satorras, Phys. Rev. E 85, 056115 (2012).
- A. Barrat, B. Fernandez, K. K. Lin, and L.-S. Young, Phys. Rev. Lett. 110, 158702 (2013).
- S. Fortunato, Phys. Rep. 486, 75 (2010).
- J. Hopcroft, O. Khan, B. Kulis, and B. Selman, Proc. Natl. Acad. Sci. USA 101, 5249 (2004).
- D. Chakrabarti, R. Kumar, and A. Tomkins, KDD ’06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, New York, 2006), pp. 554–560.
- G. Palla, A.-L. Barabási, and T. Vicsek, Nature (London) 446, 664 (2007).
- J. Ferlez, C. Faloutsos, J. Leskovec, D. Mladenic, and M. Grobelnik, Proceedings of the 2008 IEEE 24th International Conference on Data Engineering (IEEE Computer Society, Washington, DC, 2008), pp. 1328–1330.
- P. Ronhovde and Z. Nussinov, Phys. Rev. E 80, 016109 (2009).
- P. J. Mucha, T. Richardson, K. Macon, M. A. Porter, and J. P. Onnela, Science 328, 876 (2010).
- C. Granell, S. Gómez, and A. Arenas, Chaos 21, 016102 (2011).
- P. Ronhovde, S. Chakrabarty, D. Hu, M. Sahu, K. K. Sahu, K. F. Kelton, N. A. Mauro, and Z. Nussinov, Sci. Rep. 2, 329 (2012), ISSN 2045-2322, URL http://dx.doi.org/10.1038/srep00329.
- D. S. Bassett, M. A. Porter, N. F. Wymbs, S. T. Grafton, J. M. Carlson, and P. J. Mucha, Chaos 23, 013142 (2013).
- P. Bródka, S. Saganowski, and P. Kazienko, Soc. Network Anal. Min. 3, 1 (2013), ISSN 1869-5450.
- M. De Domenico, A. Lancichinetti, A. Arenas, and M. Rosvall, Phys. Rev. X 5, 011027 (2015).
- P. Holland, K. B. Laskey, and S. Leinhardt, Soc. Networks 5, 109 (1983).
- M. Girvan and M. E. Newman, Proc. Natl. Acad. Sci. USA 99, 7821 (2002).
- A. Lancichinetti, S. Fortunato, and F. Radicchi, Phys. Rev. E 78, 046110 (2008).
- R. Guimerà and M. Sales-Pardo, Proc. Natl. Acad. Sci. USA 106, 22073 (2009).
- A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová, Phys. Rev. Lett. 107, 065701 (2011).
- P. Jaccard, New Phytol. 11, 37 (1912).
- W. M. Rand, J. Am. Stat. Assoc. 66, 846 (1971).
- L. Hubert and P. Arabie, J. Classif. 2, 193 (1985), ISSN 0176-4268.
- A. Strehl and J. Ghosh, J. Mach. Learn. Res. 3, 583 (2002), ISSN 1532-4435.
- S. V. Dongen, Ph.D. thesis, Dutch National Research Institute for Mathematics and Computer Science, University of Utrecht (2000).
- M. Meilă, J. Multivar. Anal. 98, 873 (2007).
- J. Reichardt and S. Bornholdt, Phys. Rev. Lett. 93, 218701 (2004).
- URL http://netwiki.amath.unc.edu/GenLouvain/GenLouvain.
- G. Petri and P. Expert, Phys. Rev. E 90, 022813 (2014).
- URL http://rkd.zgib.net/proj/multiplex/.