An inferential procedure for community structure validation in networks
‘Community structure’ is a commonly observed feature of real networks. The term refers to the presence in a network of groups of nodes (communities) that feature high internal connectivity and are poorly connected to each other. Whereas the issue of community detection has been addressed in several works, the problem of validating a partition of nodes as a good community structure for a network has received little attention and remains an open issue. We propose an inferential procedure for community structure validation of network partitions, which relies on concepts from network enrichment analysis. The proposed procedure makes it possible to compare the adequacy of different partitions of nodes as community structures. Moreover, it can be employed to assess whether two networks share the same community structure, and to compare the performance of different network clustering algorithms.
Keywords: community structure; NEAT; network clustering; network enrichment analysis; networks.
1 Introduction
The growing availability of data on real world networks has inspired the study of complex networks in the multidisciplinary fields of social, technological and biological networks. What makes networks so attractive? We are constantly dealing with networks: supermarkets use networks of customers to propose specific deals to targeted groups; banks orchestrate a complex system of transactions between them and clients; terrorists are connected in networks all over the world; media networks dominate our lives, and inside each living being genes express and co-regulate themselves via complex networks, even when we sleep. Networks constitute a mathematical model of complex systems, and the study of their structure is typically a challenging task.
Often, the study of the structure of a network is achieved by decomposing it into its constituent modules or communities. Girvan and Newman (2002) address the concept of community structure as a network property. Indeed networked systems can be described via main statistical properties such as small-world property, power-law degree distributions, network transitivity and clustering coefficient. Girvan and Newman (2002) highlight that the property of community structure is found in most real networks. This essentially means that nodes within a network are connected together in tightly joined groups, while between those groups connections are looser.
Detecting communities in a network is highly relevant, as it makes it possible to reveal the presence of an internal network structure at a very preliminary analysis step. A significant effort has recently been devoted to the development of community detection algorithms, with a strong focus on the scalability of these methods to large networks. Recently, Yang et al. (2016) analysed eight different state-of-the-art clustering algorithms available in R. In Sections 4 and 5, we will use four of these to explain our work and general approach.
In applied network analysis, the communities that constitute a network are usually unknown. A network may not have any property of community structure; or, even if it does have a community structure, its communities have to be estimated via a community detection algorithm. After the communities have been estimated, one is left with questions on their adequacy. As a matter of fact, the estimated communities may be subject to misclassification, or they could correspond to a graph with no true underlying community structure.
The question that motivates our work is thus: how can we evaluate when a partition of a given network is meaningful?
In this paper, we propose a method for the validation of network partitions as community structures. We start from the observation that a network partition validation method should not only take into account the nodes that form the groups in the partition, but should primarily focus on the distribution of links between the groups. Indeed, when assessing the quality of a partition of a network, intuitively we would like to rate more highly a partition showing a high density of links within communities and a low density of links between communities. Our method is based on a significance testing procedure on the number of links that are observed between and within the communities; the results from these tests are then combined into a community structure validation (CSV) index that allows an overall evaluation of whether a certain partition of nodes induces a community structure in the network.
Our work borrows the concept of network enrichment from the literature on cross-talk enrichment between gene sets and pathways in biological networks, implementing in particular a one-tailed adaptation of the Network Enrichment Analysis Test (NEAT) proposed by Signorelli et al. (2016a). Although the comparison of genetic networks is an important driver of our work, we emphasize that the proposed methodology is more general and can be applied to other types of networks as well. Our approach also provides a practical way of comparing networks, by assessing similarities and differences in their community structures.
The remainder of the paper is organised as follows: in Sections 2 and 3 we discuss the construction of community structure validation indices and describe their application to the validation of a network partition and to the comparison of two networks. The proposed methodology is evaluated through simulations in Section 4. Section 5 provides an example of how CSV indices can be employed to quantify the extent of similarity between networks. A summary of the results presented in this paper is provided in Section 6.
In this section we introduce the main methodological contribution of our work. In Subsection 2.1 we specify a statistical testing procedure to assess whether a specific partition is a valid community structure for a given graph. In Subsections 2.2 and 2.3 we define a set of indices that provide a measure of evidence that a partition induces a valid community structure on a given network.
2.1 Inferential procedure
We consider a graph $G = (V, E)$, which consists of a set $V$ of vertices (or nodes) connected by a set $E$ of edges or arrows. In this paper we focus mainly on undirected graphs, but the proposed approach can be applied to directed and mixed graphs as well. We denote by $\mathcal{P} = \{C_1, \dots, C_q\}$ a partition of the nodes into $q$ disjoint sets, such that $C_i \cap C_j = \emptyset$ if $i \neq j$ and $\bigcup_{i=1}^{q} C_i = V$.
In order to assess whether the partition $\mathcal{P}$ induces a community structure in the graph $G$, we compare the observed number of links within and between each set with the number of links that we would expect to observe by chance if the groups were irrelevant. We do this by resorting to a one-tailed implementation of NEAT, the Network Enrichment Analysis Test proposed by Signorelli et al. (2016a). For undirected networks, the test compares the observed number of edges $n_{AB}$ between the sets of nodes $A$ and $B$ with a hypergeometric null model which assumes that
$$N_{AB} \sim \text{hypergeom}(n = d_A,\; K = d_B,\; N = d_V), \tag{1}$$
where $d_A$, $d_B$ and $d_V$ denote the total degrees of sets $A$, $B$ and of the whole vertex set $V$. For directed networks, NEAT compares the observed number of arrows from the set of nodes $A$ to the set of nodes $B$ with
$$N_{AB} \sim \text{hypergeom}(n = o_A,\; K = i_B,\; N = i_V), \tag{2}$$
where $o_A$ denotes the outdegree of $A$ and $i_B$ and $i_V$ are the indegrees of $B$ and $V$.
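The quantities that enter the undirected null model (the number of links between two sets and the total degree of each set) can be tallied directly from an edge list. The following is a minimal sketch, not the implementation of the neat package: the function name `set_statistics`, the toy graph, and the convention that an edge with both endpoints in the same set is counted once are all assumptions made for illustration.

```python
from collections import defaultdict

def set_statistics(edges, membership):
    """Tally link counts and total degrees per set for an undirected graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    membership: dict mapping every node to the label of its set.
    Returns (link_counts, set_degrees, total_degree).
    """
    link_counts = defaultdict(int)  # key: ordered pair of set labels
    set_degrees = defaultdict(int)  # total degree of each set
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        key = (cu, cv) if cu <= cv else (cv, cu)
        link_counts[key] += 1       # assumption: a within-set edge counts once
        set_degrees[cu] += 1        # each edge adds one to the degree of
        set_degrees[cv] += 1        # the set of each of its endpoints
    total_degree = sum(set_degrees.values())  # twice the number of edges
    return dict(link_counts), dict(set_degrees), total_degree

# Toy graph: two triangles joined by a single edge.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)]
membership = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
counts, degrees, d_V = set_statistics(edges, membership)
print(counts, degrees, d_V)
```

In the toy graph each triangle has total degree 7 (six endpoints of its internal edges plus one endpoint of the bridging edge), and only one edge joins the two sets.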
In its original implementation (Signorelli et al., 2016b), NEAT tests the null hypothesis $H_0: \mu_{AB} = \mu_0$ that the expected number of edges (arrows) between two sets of nodes $A$ and $B$, $\mu_{AB}$, is equal to the expected number of links $\mu_0$ obtained from models (1) and (2), against the two-tailed alternative $H_1: \mu_{AB} \neq \mu_0$. Here, we consider instead two distinct one-tailed tests, one for overenrichment, $H_0: \mu_{AB} = \mu_0$ vs $H_1: \mu_{AB} > \mu_0$, and one for underenrichment, $H_0: \mu_{AB} = \mu_0$ vs $H_1: \mu_{AB} < \mu_0$.
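To make the two one-tailed tests concrete, the sketch below computes upper-tail and lower-tail p-values under a hypergeometric null using only the Python standard library. It is an illustration under stated assumptions rather than the code of the neat package: the function names are hypothetical, and the p-values are plain tail probabilities that include the observed value (any refinement used in the original software is not reproduced here).

```python
from math import comb

def hypergeom_pmf(k, draws, successes, population):
    """P(X = k) for X ~ hypergeom(n=draws, K=successes, N=population)."""
    lowest = max(0, draws + successes - population)
    highest = min(draws, successes)
    if k < lowest or k > highest:
        return 0.0
    return (comb(successes, k) * comb(population - successes, draws - k)
            / comb(population, draws))

def one_tailed_pvalues(n_obs, draws, successes, population):
    """Return (p_over, p_under): P(X >= n_obs) and P(X <= n_obs)."""
    lowest = max(0, draws + successes - population)
    highest = min(draws, successes)
    pmf = {k: hypergeom_pmf(k, draws, successes, population)
           for k in range(lowest, highest + 1)}
    p_over = sum(p for k, p in pmf.items() if k >= n_obs)
    p_under = sum(p for k, p in pmf.items() if k <= n_obs)
    return min(1.0, p_over), min(1.0, p_under)

# Hypothetical example: one observed link between two sets that each have
# total degree 7, in a graph whose total degree is 14.
p_over, p_under = one_tailed_pvalues(n_obs=1, draws=7, successes=7, population=14)
print(p_over, p_under)
```

A small `p_under` indicates fewer links than expected by chance (underenrichment), which is the pattern one hopes to see between distinct communities; a small `p_over` indicates overenrichment.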
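Since a community structure features high internal connectivity within each community and few connections between different communities, we assess the extent to which the partition $\mathcal{P} = \{C_1, \dots, C_q\}$ generates a community structure by testing for overenrichment within each community,
$$H_0: \mu_{ii} = \mu^0_{ii} \quad \text{vs} \quad H_1: \mu_{ii} > \mu^0_{ii}, \qquad i = 1, \dots, q, \tag{3}$$
and for underenrichment between each pair of distinct communities,
$$H_0: \mu_{ij} = \mu^0_{ij} \quad \text{vs} \quad H_1: \mu_{ij} < \mu^0_{ij}, \qquad i \neq j, \tag{4}$$
where $\mu_{ij}$ denotes the expected number of links between $C_i$ and $C_j$, and $\mu^0_{ij}$ its value under the null model.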
Note that this procedure requires the computation of $q(q+1)/2$ tests for undirected graphs, or $q^2$ tests for directed graphs, where $q$ is the number of sets in the partition. Therefore, after all the abovementioned tests are computed, we apply the multiple testing correction of Benjamini and Hochberg (1995) and derive the adjusted p-values $p_{ii}$ (within set $i$) and $p_{ij}$ (between sets $i$ and $j$).
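For reference, the Benjamini-Hochberg adjustment can be written in a few lines. The sketch below is a generic implementation of the step-up adjustment (the function name `benjamini_hochberg` and the example p-values are illustrative, not taken from the paper); it returns adjusted p-values in the same order as the input.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by increasing p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
print(adjusted)
```

Each raw p-value is multiplied by the number of tests divided by its rank, and the running minimum taken from the largest p-value downwards guarantees that the adjusted values preserve the ordering of the raw ones.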
2.2 Community structure validation indices
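Ideally, we find evidence that a partition induces a clear community structure if every null hypothesis is rejected for a given type I error $\alpha$, i.e., $p_{ii} < \alpha$ for every community $i$ and $p_{ij} < \alpha$ for every pair $i \neq j$, where $p_{ii}$ and $p_{ij}$ are the adjusted p-values of the tests within and between the $q$ communities. More generally, a large proportion of rejections can be seen as evidence of a valid community structure. For undirected graphs we consider the following Unweighted Community Structure Validation index ($UCSV$):
$$UCSV = \frac{2}{q(q+1)} \left[ \sum_{i=1}^{q} I(p_{ii} < \alpha) + \sum_{i < j} I(p_{ij} < \alpha) \right],$$
where $I(\cdot)$ is the indicator function. The index represents the proportion of enrichment tests that led to the rejection of $H_0$. Clearly, $UCSV \in [0, 1]$; higher values of $UCSV$ provide stronger evidence that a partition of nodes induces a community structure in a graph. The corresponding index for directed graphs is
$$UCSV = \frac{1}{q^2} \sum_{i=1}^{q} \sum_{j=1}^{q} I(p_{ij} < \alpha).$$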
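Since the unweighted index is described as the proportion of enrichment tests that lead to a rejection, it can be sketched as a simple count over the adjusted p-values. The function name `ucsv`, the default threshold of 0.05 and the example p-values below are assumptions made for illustration.

```python
def ucsv(adjusted_pvalues, alpha=0.05):
    """Proportion of enrichment tests whose adjusted p-value is below alpha.

    adjusted_pvalues: iterable with one adjusted p-value per test
    (within-community and between-community tests alike).
    """
    pvals = list(adjusted_pvalues)
    if not pvals:
        raise ValueError("at least one test is required")
    return sum(p < alpha for p in pvals) / len(pvals)

# Hypothetical adjusted p-values for q = 3 communities in an undirected
# graph: 3 within-community tests and 3 between-community tests.
adjusted = {
    (0, 0): 0.001, (1, 1): 0.020, (2, 2): 0.200,  # overenrichment tests
    (0, 1): 0.010, (0, 2): 0.030, (1, 2): 0.600,  # underenrichment tests
}
index = ucsv(adjusted.values())
print(index)
```

Here four of the six tests are rejected at the 0.05 level, so the index equals 4/6.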
A weighted version of UCSV, which we denominate WCSV, can be obtained by weighting each test by the distance between the adjusted p-value and the significance threshold . For example, for undirected graphs this yields
2.3 Single community validation
Although the overall validation of a network partition described in Section 2.2 can address the general question on the capacity of that partition to induce a community structure, it does not provide a separate validation of each set of nodes in the partition. In particular, it does not point out whether every set is well separated from the others, or whether some sets are better isolated than others.
This can be done by considering, for each set , the results of the corresponding tests for overenrichment in (3) and for underenrichment in (4). Then, the following unweighted index for single community validation can be considered:
The weighted community validation index of set is analogously defined as
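One way to read the unweighted single-community index is as the proportion of rejected tests among those that involve a given set (one within-set test plus the tests against every other set). The sketch below implements that reading; it is an assumption about the form of the index, and the function name, threshold and p-values are hypothetical.

```python
def single_community_index(adjusted, community, alpha=0.05):
    """Proportion of rejected tests among those involving `community`.

    adjusted: dict mapping a pair of community labels (r, s) to the adjusted
    p-value of the corresponding test; (r, r) denotes the within-set test.
    """
    involved = [p for (r, s), p in adjusted.items() if community in (r, s)]
    if not involved:
        raise ValueError("no test involves the requested community")
    return sum(p < alpha for p in involved) / len(involved)

# Hypothetical adjusted p-values for three communities (undirected graph).
adjusted = {
    (0, 0): 0.001, (1, 1): 0.020, (2, 2): 0.200,
    (0, 1): 0.010, (0, 2): 0.030, (1, 2): 0.600,
}
per_community = {c: single_community_index(adjusted, c) for c in (0, 1, 2)}
print(per_community)
```

In this example community 0 is validated by all three of its tests, whereas community 2 is supported by only one test out of three, which would flag it as the least well separated set.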
3 Comparing networks by assessing similarity and differences in their community structures
One challenge in the study of large networks is the difficulty to inspect relations between thousands of nodes. Thus, network clustering algorithms are often applied to a network with the aim of summarizing the communities of nodes that compose the network.
This approach can also be employed to compare networks. The idea, in this context, is that similar networks are expected to share similar communities, so that the comparison of communities in networks can point out structural similarities and differences between networks.
This can be done, for example, by applying a clustering algorithm to the networks of interest and checking the overlap (proportion of shared genes) between communities in different networks. A high overlap between the partition of graph and the partition of graph gives an indication that the networks share similar community structures. However, such a comparison directly compares only the nodes in and , ignoring the distribution of edges in the two graphs.
Here, we propose to employ community structure validation to carry out an assessment of the overall similarity between the community structures of two graphs. We propose a procedure that is based on the assessment of the validity of as community structure for , and of as community structure for . The idea at the basis of this approach is that if and have the same (or similar) community structure, then the communities extracted from one graph should also induce a community structure in the other graph.
More specifically, we propose the following procedure:
choose a community detection method and apply it to so as to derive its partition in communities . Similarly, obtain from ;
compute the community structure validation indices of in and , and of in and ;
compute the relative indices
which compare the values of the UCSV index of partition in graph with the value of UCSV for in .
The rationale behind is that since is the partition in communities of , we expect a high value of . The value of will be typically smaller: we expect it to be close to 0 if provides a bad partition for ; however, if partitions well, the value of can be expected to be closer to .
As a result, we expect higher values of and when and share similar communities; if, on the other hand, the communities in the two graphs are different, we expect and to be close to 0.
3.1 A degree-corrected stochastic blockmodel for binary graphs
In order to assess the performance of the CSV indices, a realistic generative model of graphs that can exhibit community structures is required. This is typically achieved by resorting to stochastic blockmodels (Holland et al., 1983), in which the probability of observing an edge between two nodes depends on the communities they belong to. A problem with pure stochastic blockmodels, however, is that they are often too simple to reproduce real networks, because they assume nodes within a community to behave similarly. This, for example, implies that in graphs generated from such models, nodes within each community have roughly the same degree: a fact that is in sharp contrast with most real networks, which feature a strong heterogeneity in the degree distribution.
To overcome this limitation, different extensions of stochastic blockmodels have been proposed (Wang and Wong, 1987; Karrer and Newman, 2011; Signorelli and Wit, 2017). Among them, Karrer and Newman (2011) proposed a degree-corrected blockmodel for edge-weighted graphs where the value of an edge between nodes and depends both on their communities and , and on nodal weights and : .
Here, we introduce a degree-corrected stochastic blockmodel for binary undirected graphs that is closely related to that of Karrer and Newman (2011). We assume that the probability of an edge between nodes and depends both on their communities and by means of a block-interaction parameter , and on nodal weights and :
where , and
Note that the weights are defined in such a way that the average nodal weight in each community is 1; to wit, will indicate that the expected degree of node is above the average expected degree of nodes in community .
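A generator of this kind can be sketched as follows. The exact functional form of the edge probability is not reproduced here: the sketch assumes that the probability of an edge is the product of the two nodal weights and the block-interaction parameter, capped at 1, and that the weights are rescaled so that their average within each community is 1. All names (`normalise_weights`, `sample_dc_sbm`) and parameter values are illustrative.

```python
import random

def normalise_weights(raw_weights, communities):
    """Rescale weights so that the average weight in each community is 1."""
    totals, sizes = {}, {}
    for w, c in zip(raw_weights, communities):
        totals[c] = totals.get(c, 0.0) + w
        sizes[c] = sizes.get(c, 0) + 1
    return [w * sizes[c] / totals[c] for w, c in zip(raw_weights, communities)]

def sample_dc_sbm(communities, theta, weights, rng):
    """Sample a binary undirected graph; returns a list of edges (i, j), i < j.

    Assumed edge probability: min(1, w_i * w_j * theta[c_i][c_j]).
    """
    n = len(communities)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            block = theta[communities[i]][communities[j]]
            p = min(1.0, weights[i] * weights[j] * block)
            if rng.random() < p:
                edges.append((i, j))
    return edges

rng = random.Random(1)                      # fixed seed for reproducibility
communities = [0] * 20 + [1] * 20           # two communities of 20 nodes
raw = [rng.uniform(0.5, 1.5) for _ in communities]
weights = normalise_weights(raw, communities)
theta = [[0.5, 0.02], [0.02, 0.5]]          # dense within, sparse between
edges = sample_dc_sbm(communities, theta, weights, rng)
print(len(edges))
```

Nodes with a weight above 1 receive more edges than the average node of their community, which is what produces heterogeneous degrees within blocks.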
4 Simulations
We study the behaviour of the proposed community structure validation in three different sets of simulations. First, we check the capacity of and to detect a clustering of nodes as valid community structure with respect to the size and modularity of graphs (simulation 1, Section 4.1). Then, we study the behaviour of the indices with respect to increasing levels of community degradation, considering at the same time different values of modularity (simulation 2, Section 4.2). Finally, we employ community structure validation to compare the performance of four different algorithms for network clustering (simulation 3, Section 4.3).
4.1 Simulation 1: performance of CSV with respect to modularity and number of vertices
The aim of this simulation is to evaluate how the proposed community structure validation is affected by the modularity and number of vertices of the graph. CSV relies on a significance testing procedure between each pair of groups, whose power is expected to be affected both by the size of the groups between which enrichment is tested and by the extent to which the communities are well-separated in the graph (to wit, the modularity). Therefore, we expect that CSV performs better with larger and denser networks and, for a given network size, with higher modularity and smaller number of communities.
In order to assess the performance of CSV with respect to network size and modularity, we consider four sequences of graphs with number of vertices . For each , we generate a sequence of 100 binary graphs with communities from the degree-corrected stochastic blockmodel described in Section 3.1, where the probabilities to have an edge between nodes belonging to the same community are fixed in such a way that and , and we progressively increase the probability to have an edge between nodes belonging to different communities, i.e., . Since the s are fixed, increasing reduces the modularity of the graphs.
For each of the graphs thus generated, we compute the and indices associated with the partition of nodes induced by the true communities, applying the Benjamini-Hochberg (BH) correction for multiple testing. Since we are testing enrichments between the true communities, ideally the CSV indices should always attain value 1. As can be observed from Figure 1, however, this does not happen when , or (for larger values of ) when the modularity is low. On the one hand, this indicates that in very small networks () the network enrichment test is not powerful enough to detect enrichments, but also that the performance of CSV improves considerably for larger networks (). On the other hand, the results in Figure 1 point out that the CSV indices do not reach 1 if the modularity of the partition generated by the true communities is low (). This result is desirable, because the low modularity indicates that the partition does not induce a community structure in the network.
Moreover, it can be observed that the difference between WCSV and UCSV tends to vanish in large networks.
4.2 Simulation 2: behaviour of CSV with respect to community degradation
In Section 4.1 we have assessed how network size and modularity affect the capacity of community structure validation to declare that the real communities result in a community structure.
The purpose of this second simulation, instead, is to understand the sensitivity of community structure validation to different levels of community degradation. This is important because, in reality, the true communities will typically be unknown and one will need to retrieve them with a clustering algorithm that is likely to misclassify some of the nodes. From a practical point of view, thus, it is important to know whether CSV is capable of detecting community structures even when a small proportion of nodes is misclassified.
Our expectation is that a partition of nodes where most of the nodes are correctly classified, and only a small proportion of nodes is assigned to a wrong cluster, still induces a community structure in the network. Higher proportions of wrongly classified nodes, instead, should progressively destroy the community structure, and determine a sharp decrease of the CSV indices.
In order to check this, we generate six graphs with nodes and blocks from a degree-corrected stochastic blockmodel where we keep constant the probabilities of interaction within blocks, , and we progressively increase the probabilities of interaction between blocks from 0.01 in Simulation 2A to 0.3 in Simulation 2F. Note that the modularity of the graphs decreases as increases.
In each simulation, we take the graph thus generated and its communities as reference. Then, we generate a sequence of graphs from a degree-corrected stochastic blockmodel where we keep the same block-interaction probabilities and , but we change the community of a proportion of nodes. We consider 100 graphs for each level of community degradation and compute the UCSV associated with the reference communities, after applying the Benjamini-Hochberg (BH) correction; in this way we obtain a distribution of UCSV for each value of that is displayed in Figures 2 and 3.
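The degradation step can be illustrated with a short sketch that moves a given proportion of nodes to a different community. The details are assumptions: nodes are drawn uniformly at random without replacement, and each selected node is reassigned uniformly among the other communities; the function name `degrade_communities` is hypothetical.

```python
import random

def degrade_communities(communities, proportion, rng):
    """Return a copy of `communities` with a proportion of nodes reassigned.

    Every selected node is moved to a community different from its own,
    chosen uniformly at random (an assumption of this sketch).
    """
    labels = sorted(set(communities))
    if len(labels) < 2:
        raise ValueError("at least two communities are required")
    n_changed = round(proportion * len(communities))
    degraded = list(communities)
    for node in rng.sample(range(len(communities)), n_changed):
        alternatives = [c for c in labels if c != communities[node]]
        degraded[node] = rng.choice(alternatives)
    return degraded

rng = random.Random(42)
reference = [0] * 10 + [1] * 10 + [2] * 10      # three communities, 30 nodes
degraded = degrade_communities(reference, 0.2, rng)
n_moved = sum(a != b for a, b in zip(reference, degraded))
print(n_moved)
```

With 30 nodes and a degradation level of 20%, exactly six nodes end up in a community other than their reference one.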
Figure 2 shows that for high values of modularity, partitions of nodes with levels of community degradation up to 20-25% still result in a clear community structure. The tolerance to community degradation is instead lower when the modularity decreases, as shown in Figure 3. In both cases, the UCSV index is stable around 1 for moderate values of community degradation and then rapidly decreases towards 0, indicating that higher levels of perturbation of the real communities break the community structure.
4.3 Simulation 3: a comparison of network clustering algorithms based on CSV
The purpose of this simulation is to exploit community structure validation to compare some popular algorithms for network clustering. We consider the same six scenarios of Simulation 2 (where , , and progressively increases from 0.01 to 0.3), generating 100 random graphs for each scenario. We apply to each of the graphs thus generated the following clustering algorithms:
- fast greedy, proposed by Clauset et al. (2004);
- leading eigenvalue, proposed by Newman (2006);
- Louvain, proposed by Blondel et al. (2008);
- walktrap, proposed by Pons and Latapy (2005);
so that for each graph we obtain four partitions of nodes (one for each method). We compare these partitions by computing the UCSV index with Benjamini-Hochberg (BH) correction: the idea is that a good clustering method should detect partitions that induce strong community structures, with an associated high value of UCSV.
Figure 4 shows the distribution of UCSV for the four clustering methods in each scenario. Note that for high values of modularity (Simulations 3A, 3B and 3C) the methods fast greedy, Louvain and walktrap perform very well, whereas the leading eigenvalue method already yields partitions that induce weaker community structures. As the modularity decreases, however, we also observe a marked drop in the performance of the fast greedy algorithm.
Overall, walktrap and Louvain appear to be the most effective clustering algorithms. Note how for small values of modularity the distribution of CSV remains concentrated on 1 with walktrap, whereas it slightly drops with Louvain. This seems to indicate that walktrap is more likely to validate “weak communities”, i.e. communities that are associated with a low modularity, whereas Louvain may give a hint that the community structure that they induce is weak.
5 Real data application
In this section we show the application of our overall strategy to the set of 30 tissue specific gene-networks inferred in Gambardella et al. (2013).
5.1 Dataset description
In this case study, 30 tissue-specific human gene co-regulation networks were reverse engineered. The authors classified 2930 microarrays (Affymetrix HG-U133A and HG-U133plus2) extracted from ArrayExpress into 30 different tissues. The microarrays were normalized independently for each tissue using Robust Multichip Average as implemented in Bioconductor for R (Irizarry et al., 2003). A Spearman correlation matrix of dimension for each tissue was computed. Each pair of probes was then associated with an estimated Spearman Correlation Coefficient (SCC) and its significance: a t statistic and a p-value were computed for each SCC value. We refer to the original paper (Gambardella et al., 2013) for details.
5.2 Analysis description
For each pair of graphs and ( ), we consider only the subnetworks induced by the common nodes. We apply Louvain as community extraction method, hence obtaining a partition of graph and a partition of . To guarantee statistical power in the testing procedure, only communities of size greater than 5 are retained in our analysis.
We then employ a one-tailed implementation of NEAT to compute the relative indices as defined in Section 3 . This way we obtain a matrix such that . In order to have a general picture of tissue similarity, we build a similarity matrix , derive the corresponding distance matrix and apply complete-linkage clustering to . The dendrogram resulting from the clustering is represented in Figure 5 in circular layout. We discriminate among 13 clusters, highlighted in different colours. Three of these clusters consist of isolated tissues (testis, skin and cartilage). Among the other clusters, note that one cluster exclusively comprises all cerebral tissues considered in the analysis (cerebrum, cerebellum, mid brain and brain stem) and another the only two striated muscles (heart and skeletal muscle) involved in this study. Moreover, the organs of the female reproductive system (mammary gland, uterus and ovary) are linked together in the same cluster, and the two tissues from the lower digestive system (colon and intestine) form a cluster of their own. Other associations can be the starting point of future discussion within the life science community to give an insight into tissue similarity.
Community structure is a commonly observed property of real networks. The term refers to the presence, in a network, of groups of nodes (also referred to as modules or communities) that are strongly tied to each other, and sporadically connected to other nodes in the network.
This feature is often exploited to simplify the interpretation of large networks and to identify their relevant modules. Whereas the problem of community detection in networks has received wide attention, the assessment of the validity of a partition of nodes as community structure for a given graph remains substantially unexplored.
In this article, we have proposed a strategy to perform community structure validation of a partition of nodes that consists of two steps. First, the presence of enrichment between any two sets in is assessed with NEAT, the test for network enrichment analysis proposed by Signorelli et al. (2016a). Then, the results from these tests are summarized into a synthetic index for community structure validation, which can either be unweighted (UCSV) or weighted (WCSV).
The rationale behind CSV indices, which range between 0 and 1, is that they approach 1 when there is a strong separation between the sets in - to wit, induces a clear community structure - and they are close to 0 otherwise. Thus, they constitute a measurement of the validity of as community structure. As illustrated in Section 3, CSV indices can also be employed to evaluate the similarity of two graphs.
Our simulations indicate that the performance of the proposed indices is poor for very small networks (e.g., ), where the hypothesis testing procedure is not powerful enough to reject the null hypothesis of no enrichment between sets of nodes, but it improves considerably for larger networks (), where CSV behaves as expected and the difference between and rapidly vanishes.
Thus, CSV is capable of identifying whether a partition of nodes induces a community structure as long as the network is large enough. It is also robust to a moderate extent of community degradation, thus making allowance for the possibility that a clustering algorithm may assign some nodes to wrong clusters.
In Section 4.3, we have employed the proposed procedure to compare four popular clustering algorithms for networks. The results indicate that the Louvain and walktrap clustering algorithms outperform the leading eigenvalue and fast greedy methods.
Section 5 provides an example of how community structure validation can be employed to quantify the extent of similarity between networks. We have considered 30 tissue-specific genetic networks inferred in Gambardella et al. (2013). After comparing each pair of networks, we have derived a distance matrix between tissues which is represented in Figure 5. The dendrogram therein displays the extent to which two tissue-specific networks are similar, pointing out, for example, a strong similarity between the brain stem and mid brain tissue-specific networks, or a strong difference between the skin and the testis networks.
Acknowledgements
We acknowledge funding from the COST Action CA15109 “European Cooperation for Statistics of Network Data Science”, supported by COST (European Cooperation in Science and Technology).
The work of Luisa Cutillo has been supported by the European Union under Horizon 2020, Marie Sklodowska-Curie Individual Fellowship.
The authors would like to thank Diego di Bernardo for his valuable comments on the application.
References
- Benjamini and Hochberg (1995) Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1):289–300.
- Blondel et al. (2008) Blondel, V. D., Guillaume, J.-L., Lambiotte, R., and Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008.
- Clauset et al. (2004) Clauset, A., Newman, M. E., and Moore, C. (2004). Finding community structure in very large networks. Physical Review E, 70(6):066111.
- Gambardella et al. (2013) Gambardella, G., Moretti, M. N., de Cegli, R., Cardone, L., Peron, A., and di Bernardo, D. (2013). Differential network analysis for the identification of condition-specific pathway activity and regulation. Bioinformatics, 29(14):1776–1785.
- Girvan and Newman (2002) Girvan, M. and Newman, M. E. J. (2002). Community structure in social and biological networks. PNAS, 99(12):7821–7826.
- Holland et al. (1983) Holland, P. W., Laskey, K. B., and Leinhardt, S. (1983). Stochastic blockmodels: First steps. Social Networks, 5(2):109–137.
- Irizarry et al. (2003) Irizarry, R. A., Bolstad, B. M., Collin, F., Cope, L. M., Hobbs, B., and Speed, T. P. (2003). Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research, 31(4):e15.
- Karrer and Newman (2011) Karrer, B. and Newman, M. E. (2011). Stochastic blockmodels and community structure in networks. Physical Review E, 83(1):016107.
- Newman (2006) Newman, M. E. (2006). Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104.
- Pons and Latapy (2005) Pons, P. and Latapy, M. (2005). Computing communities in large networks using random walks. In International Symposium on Computer and Information Sciences, pages 284–293. Springer.
- Signorelli et al. (2016a) Signorelli, M., Vinciotti, V., and Wit, E. (2016a). NEAT: an efficient network enrichment analysis test. BMC Bioinformatics, 17(352):1–17.
- Signorelli et al. (2016b) Signorelli, M., Vinciotti, V., and Wit, E. (2016b). neat: efficient Network Enrichment Analysis Test (R package), https://cran.r-project.org/package=neat.
- Signorelli and Wit (2017) Signorelli, M. and Wit, E. C. (2017). A penalized inference approach to stochastic block modelling of community structure in the Italian Parliament. Journal of the Royal Statistical Society: Series C (Applied Statistics).
- Wang and Wong (1987) Wang, Y. J. and Wong, G. Y. (1987). Stochastic blockmodels for directed graphs. Journal of the American Statistical Association, 82(397):8–19.
- Yang et al. (2016) Yang, Z., Algesheimer, R., and Tessone, C. J. (2016). A comparative analysis of community detection algorithms on artificial networks. Scientific Reports, 6:30750.