Leveraging disjoint communities for detecting overlapping community structure
Abstract
Network communities represent mesoscopic structure for understanding the organization of real-world networks, where nodes often belong to multiple communities and form overlapping community structure in the network. Because the exact boundary of such overlapping communities is nontrivial to find, the problem is challenging, and a huge effort has been devoted to detecting overlapping communities in networks.
In this paper, we present PVOC (Permanence based Vertex-replication algorithm for Overlapping Community detection), a two-stage framework to detect overlapping community structure. We build on a novel observation that the non-overlapping community structure detected by a standard disjoint community detection algorithm closely resembles the actual overlapping community structure of the network, except in the overlapping part. Based on this observation, we posit that there is perhaps no need to build yet another overlapping community finding algorithm; instead, one can efficiently manipulate the output of any existing disjoint community finding algorithm to obtain the required overlapping structure. We propose a new post-processing technique that, combined with any existing disjoint community detection algorithm, processes each vertex using a new vertex-based metric called permanence, and thereby identifies overlapping candidates together with their community memberships. Experimental results on both synthetic and large real-world networks show that PVOC significantly outperforms six state-of-the-art overlapping community detection algorithms in terms of the similarity of the output to the ground-truth structure. Thus our framework not only finds meaningful overlapping communities in the network, but also allows us to put an end to the constant effort of building yet another overlapping community detection algorithm.
Accepted at Journal of Statistical Mechanics: Theory and Experiment (JSTAT)

April 2015
1 Introduction
One of the most used aspects of social network analysis is to discover and display clusters and communities in networks, i.e., the dense subnetworks that have more links internally than externally. It is easy for anyone to spot dense clusters of connections in a small network visualization. However, detecting such groups in large-scale networks is an extremely difficult problem. Since the pioneering effort of Girvan and Newman [1], researchers from both computer science and physics have worked for over a decade to explore such community structure in networks. Today there are dozens of community detection algorithms that can detect the disjoint (non-overlapping) community structure of a network using different heuristics and frameworks (see [2, 3] for surveys). However, in real-world scenarios, it has been observed that a node can be a part of multiple communities, which has eventually led to the idea of overlapping (soft) communities [4, 5, 6, 7]. This problem is even harder because of the exponential number of possible solutions. Therefore, a new direction of research has emerged to detect the overlapping community structure of networks (see [8] for a survey).
The dichotomy between “disjoint” and “overlapping” community detection algorithms is unfortunate because it limits the application of each algorithm. If a network has overlapping communities, a “disjoint” algorithm cannot find them; conversely, if communities are known to be disjoint, a “disjoint” algorithm will generally perform better than an “overlapping” algorithm. Therefore, to obtain the actual community structure, it is important to choose the right kind of algorithm. Note that the question of how to choose the right kind of algorithm is outside the scope of the present paper.
However, we hypothesize that there is perhaps no need to develop yet another overlapping community finding algorithm, given that we have diverse and efficient disjoint community detection algorithms in hand. In this paper, we present a method that allows any “disjoint” community detection algorithm to be used to detect overlapping community structure, instead of designing another overlapping community detection algorithm. This means that a user wishing to find overlapping communities is no longer forced to use one of the existing overlapping algorithms, but can also choose from the many disjoint community finding algorithms. The proposed framework is called PVOC (Permanence based Vertex-replication algorithm for Overlapping Community detection) and has two phases. In the first step, an efficient disjoint community detection algorithm is used to detect the non-overlapping community structure of the network. In the second step, each node in the disjoint communities is processed using a new vertex-based metric, called permanence [9], in order to measure the extent to which a vertex belongs to its own community and to its attached neighboring communities. If the membership of the vertex in its assigned community is similar to that in a neighboring community, we also assign the vertex to the neighboring community, keeping its original community membership intact. Thus the post-processing step is the fundamental component of PVOC for finding overlapping vertices in the non-overlapping structure.
We compare our framework with six state-of-the-art overlapping community detection algorithms on both synthetic and large real-world networks (whose ground-truth community structure is available). We observe that PVOC significantly outperforms the baseline algorithms in terms of the resemblance of the output to the ground-truth structure. Moreover, we show that the scalability of PVOC does not come at the cost of the correctness of the output.
Our paper makes several unique contributions to the state-of-the-art in community detection. These include (i) analyzing real-world community structure and observing that the disjoint communities are sufficient as a starting point for discovering overlapping community structure, (ii) proposing a new framework that combines an existing disjoint community detection algorithm with a post-processing step, and (iii) demonstrating the accuracy of PVOC in recovering the ground-truth structure.
The organization of the paper is as follows. In the next section, we provide a brief overview of state-of-the-art approaches in overlapping community detection. Section 3 provides a brief description of the synthetic and real-world datasets. Following this, in Section 4, we present detailed results of our empirical observation followed by the description of our proposed framework. Section 5 describes the results of the experiments to detect overlapping communities and a comparative analysis with the baseline algorithms. The experiments in this paper use a combination of PVOC with two existing disjoint community detection algorithms, Louvain [10] and Infomap [11]. Finally, we conclude the paper in Section 6 with some immediate future directions.
2 Related work
There is a class of algorithms for network clustering that allow nodes to belong to more than one community. Palla et al. proposed “CFinder” [12], the seminal and most popular method, based on the clique-percolation technique. However, due to the clique requirement and the sparseness of real networks, the communities discovered by CFinder are usually of low quality [13]. The idea of partitioning links instead of nodes to discover community structure has also been explored [14, 15, 16, 17].
On the other hand, a set of algorithms utilize local expansion and optimization to detect overlapping communities. For instance, Baumes et al. [18] proposed a two-step algorithm “RankRemoval” using a local density function. LFM [19] expands communities from a random seed node to form a natural community until a fitness function is locally maximal. MONC [20] uses a modified version of the fitness function of LFM, which allows a single node to be considered a community by itself. OSLOM [5] tests the statistical significance of a cluster with respect to a global null model (i.e., the random graph generated by the configuration model) during community expansion. Chen et al. [21] proposed selecting a node with maximal node strength based on two quantities: belonging degree and the modified modularity. EAGLE [6] and GCE [22] use the agglomerative framework to produce overlapping communities. COCD [23] first identifies cores, and the remaining nodes are then attached to the cores with which they have the maximum number of connections.
A few fuzzy community detection algorithms have been proposed that quantify the strength of association between all pairs of nodes and communities [24]. Nepusz et al. [25] modeled overlapping community detection as a nonlinear constrained optimization problem which can be solved by simulated annealing methods. Zhang et al. [26] proposed an algorithm based on the spectral clustering framework. Due to their probabilistic nature, mixture models provide an appropriate framework for overlapping community detection [27, 28, 29, 30]. MOSES [31] uses a local optimization scheme in which the fitness function is defined based on the observed condition distribution. Zhang et al. used Nonnegative Matrix Factorization (NMF) to detect overlapping communities when the number of communities and the feature vectors are provided [32, 33]. Ding et al. [34] employed the affinity propagation clustering algorithm for overlapping community detection. More recently, the BIGCLAM algorithm [35] was also built on the NMF framework.
The label propagation algorithm has been extended to overlapping community detection by allowing a node to have multiple labels. In COPRA [36], each node updates its belonging coefficients by averaging the coefficients from all its neighbors at each time step in a synchronous fashion. SLPA [37, 38] spreads labels between nodes according to pairwise interaction rules. A game-theoretic framework is proposed in Chen et al. [39], in which a community is associated with a Nash local equilibrium.
Besides these, CONGA [7] extends the GN algorithm [40] by allowing a node to split into multiple copies. Zhang et al. [41] proposed an iterative process that reinforces the network topology and propinquity, which is interpreted as the probability of a pair of nodes belonging to the same community. István et al. [42] proposed an approach focusing on centrality-based influence functions. Recently, Gopalan and Blei [43] proposed an algorithm that naturally interleaves subsampling from the network and updating an estimate of its communities. More details can be found in the survey by Xie et al. [8].
3 Test suite of networks
3.1 Synthetic networks
It is necessary to have good benchmarks both to study the behavior of a proposed community detection algorithm and to compare the performance across various algorithms. In light of this requirement, Lancichinetti et al. [44] introduced the LFR benchmark networks.
Networks  N  E  C  S  

DBLP  317,080  1,049,866  13,477  429.79  2.57 
Amazon  334,863  925,872  151,037  99.86  14.83 
Youtube  1,134,890  2,987,624  8,385  9.75  10.26 
Orkut  3,072,441  117,185,083  6,288,363  34.86  95.93 
3.2 Real-world networks with ground-truth communities
We use four real-world networks whose ground-truth communities are available:
DBLP: It is a co-authorship network where nodes represent authors and edges connect nodes whose corresponding authors have co-authored at least one paper. Since research communities form around conferences or journals, the publication venues are used as ground-truth communities in DBLP.
Amazon: It is an Amazon product co-purchasing network where nodes represent products and edges connect commonly co-purchased products. Each product (i.e., node) belongs to one or more product categories. Each product category is used to define a ground-truth community.
Youtube: In the Youtube social network, users form friendships with each other and can create groups that other users can join. Here, such user-defined groups are considered as ground-truth communities.
Orkut: Orkut is a free online social network where users form friendships with each other. Orkut also allows users to form groups that other members can then join. Here too, such user-defined groups are considered as ground-truth communities.
4 Vertex-replication algorithm
Our proposed algorithm is motivated by an empirical study of the ground-truth community structure of both synthetic and real-world networks. In this section, we first describe the empirical observation and then present a new algorithm that can detect overlapping communities in a network with the help of any standard disjoint community detection algorithm.
4.1 Empirical observation
We empirically study the structure of the ground-truth communities. We speculated that if we remove from the ground-truth structure the vertices that are part of multiple communities, the remaining portion, i.e., the community structure composed of only non-overlapping vertices, can be efficiently captured by a standard disjoint community detection algorithm. To verify this intuition, we take all the networks with their ground-truth communities and two standard disjoint community detection algorithms, namely Louvain [10] and Infomap [11, 46]. Then for each network, we run the following steps:

Step I: We run each of these algorithms to obtain the disjoint community structure of the network.

Step II: Since we know the ground-truth community structure of the network, we remove from the ground truth those vertices which belong to multiple communities (the set of overlapping vertices).

Step III: Similarly, we remove the same set of overlapping vertices from the disjoint community structure obtained in Step I. This step ensures that the filtered ground-truth community structure and the filtered disjoint community structure obtained from the algorithm contain the same set of vertices.

Step IV: We then compare the two community structures obtained from Step II and Step III.
Networks  Ground-truth  Louvain  Infomap  
LFR  582  468  501 
DBLP  8,493  7,987  8,145 
Amazon  151,037  142,098  149,876 
Youtube  8,385  7,967  7,132 
Orkut  288,363  284,980  286,791 
A schematic example of the above procedure is shown in Figure 2. Figure 1 shows that in this process, we discard on average nearly 40% of the vertices of each network, namely those which belong to multiple communities. In Table 2, we also report the number of disjoint communities obtained from the Louvain and Infomap algorithms for both synthetic and real-world networks, along with the number present in the ground-truth structure. We use a standard validation metric, namely Normalized Mutual Information (NMI) [47], to compare the two community structures. Figure 3 shows that the similarity is quite high for all the networks; this observation corroborates our earlier speculation. Therefore, we hypothesize that a standard disjoint community detection algorithm might be able to find the overlapping communities with a suitable post-processing step. This means that a user wishing to find overlapping communities is no longer forced to use an overlapping community finding algorithm; rather, a disjoint community structure followed by a post-processing step might produce the expected overlapping community structure. In the rest of this section, we use this observation to design a suitable post-processing technique.
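The filtering-and-comparison procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for Figure 3: the function names `drop_overlapping` and `nmi` are hypothetical, the ground truth is assumed to be a mapping from each node to a set of community labels, and NMI is normalized by the arithmetic mean of the two entropies (the exact normalization used in [47] is assumed here).

```python
import math
from collections import Counter


def drop_overlapping(ground_truth, detected):
    """Keep only nodes that belong to exactly one ground-truth community.

    ground_truth: dict node -> set of community labels (may overlap)
    detected:     dict node -> single community label (disjoint algorithm output)
    Returns two dicts (node -> label) defined over the same filtered node set.
    """
    keep = [v for v, comms in ground_truth.items()
            if len(comms) == 1 and v in detected]
    truth = {v: next(iter(ground_truth[v])) for v in keep}
    found = {v: detected[v] for v in keep}
    return truth, found


def nmi(a, b):
    """NMI of two labelings of the same nodes: 2 * I(A;B) / (H(A) + H(B))."""
    nodes = list(a)
    n = len(nodes)
    count_a = Counter(a[v] for v in nodes)
    count_b = Counter(b[v] for v in nodes)
    joint = Counter((a[v], b[v]) for v in nodes)

    def entropy(counts):
        return -sum(c / n * math.log(c / n) for c in counts.values())

    mutual = sum(c / n * math.log(c * n / (count_a[x] * count_b[y]))
                 for (x, y), c in joint.items())
    h_a, h_b = entropy(count_a), entropy(count_b)
    if h_a + h_b == 0:
        return 1.0  # both labelings put every node in a single community
    return 2 * mutual / (h_a + h_b)


# Toy data (hypothetical): node 3 overlaps in the ground truth and is discarded.
ground_truth = {1: {"A"}, 2: {"A"}, 3: {"A", "B"}, 4: {"B"}, 5: {"B"}}
detected = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1}
truth, found = drop_overlapping(ground_truth, detected)
score = nmi(truth, found)
```

On this toy input the filtered partitions coincide, so the score is 1; a detected partition that is independent of the ground truth would score 0.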
4.2 Permanence based vertex-replication algorithm
Through the careful inspection described above, we have found that standard disjoint community detection algorithms are quite efficient at detecting the non-overlapping part of the community structure. However, there exist a few vertices in the network which are part of multiple communities. We intend to design an efficient algorithm that is able to identify such overlapping vertices together with their community memberships. For that, we use a vertex-based metric, called permanence, which by virtue of its underlying formulation measures how intensely a vertex belongs to its community [9]. Below, we present a brief overview of the formulation of permanence.
Formulation of permanence
In an earlier paper [9], we showed that the extent of membership of a vertex to a community depends on the following two factors. (i) The first factor is the distribution of external connections of the vertex to individual communities. A vertex that has equal number of connections to all its external communities (e.g., a vertex with total 6 external connections with 2 to each of 3 neighboring communities) has equal “pull” from each community whereas a vertex with more external connections to one particular community (e.g., a vertex with total 6 external connections with 1 connection each to two neighboring communities and 4 connections to the third neighboring community), will experience more “pull” from that community due to large number of external connections to it. (ii) The second factor is the density of its internal connections. The internal connections of a community are generally considered together as a whole. However, how strongly a vertex is connected to its internal neighbors can differ. To measure this internal connectedness of a vertex, one can compute the clustering coefficient of the vertex with respect to its internal neighbors. The higher this internal clustering coefficient, the more tightly the vertex is connected to its community.
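Combining these two factors together, we formulated the permanence of a vertex $v$ as follows:
$$\mathrm{Perm}(v) = \left[\frac{I(v)}{E_{max}(v)} \times \frac{1}{D(v)}\right] - \left[1 - c_{in}(v)\right] \tag{1}$$
where $I(v)$ is the number of internal connections of $v$, $D(v)$ is the degree of $v$, $E_{max}(v)$ is the maximum number of connections of $v$ to a single external community, and $c_{in}(v)$ is the clustering coefficient among the internal neighbors of $v$. An illustrative example is shown in Figure 4.
For vertices that do not have any external connections, $\mathrm{Perm}(v)$ is considered to be equal to the internal clustering coefficient (i.e., $\mathrm{Perm}(v) = c_{in}(v)$). The maximum value of $\mathrm{Perm}(v)$ is 1 and is obtained when vertex $v$ is an internal node and part of a clique. The lower bound of $\mathrm{Perm}(v)$ is close to $-1$. This is obtained when $I(v) \ll D(v)$, such that $\frac{I(v)}{D(v)\,E_{max}(v)} \approx 0$ and $c_{in}(v) = 0$. Therefore for every vertex $v$, $-1 < \mathrm{Perm}(v) \leq 1$.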
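As a quick arithmetic check of this formulation, consider a hypothetical vertex $v$ (the numbers are illustrative assumptions and are not taken from Figure 4). Write $I(v)$ for its internal connections, $D(v)$ for its degree, $E_{max}(v)$ for its maximum number of connections to a single external community, and $c_{in}(v)$ for the clustering coefficient among its internal neighbors. Suppose $D(v) = 7$ with $I(v) = 4$, the three external connections split as 2 and 1 between two neighboring communities so that $E_{max}(v) = 2$, and 3 of the 6 possible pairs of internal neighbors connected so that $c_{in}(v) = 0.5$. Then

$$\mathrm{Perm}(v) = \frac{4}{2 \times 7} - (1 - 0.5) = \frac{2}{7} - \frac{1}{2} = -\frac{3}{14} \approx -0.21$$

If the internal neighbors instead formed a clique ($c_{in}(v) = 1$), the second term would vanish and the permanence would rise to $2/7 \approx 0.29$, which shows how sparse internal connectivity alone can pull the score below zero.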
The PVOC algorithm
Since permanence assigns a score to each vertex, we can use it in our post-processing step to identify overlapping vertices in the detected disjoint community structure. We therefore develop a new algorithm, called PVOC (Permanence based Vertex-replication algorithm for Overlapping Community detection), that can combine any existing disjoint community detection algorithm with permanence based vertex-replication to detect the overlapping community structure of a network. Algorithm 1 presents the pseudocode of PVOC.
Given an undirected network and a threshold, the algorithm works as follows:

Step I: A standard disjoint community detection algorithm is used to detect the non-overlapping community structure of the network.

Step II: A set of candidate vertices is identified such that each vertex in the set has at least one connection to an external community.

Step III: For each candidate vertex $v$, we do the following:

(a) We calculate the sum of the permanence of $v$ and its neighbors in their assigned communities.

(b) We remove $v$ from its own community and place it into each of its external communities separately. This assignment affects the permanence value of $v$ and its immediate neighbors.

(c) For each external community, we measure the current sum of the permanence of $v$ (in its new community) and its neighbors.

(d) If the absolute value of the difference between the permanence sums obtained from Step III(a) and Step III(c) is less than the threshold, a replica of $v$ is placed into the new community, keeping $v$ in its original community as well; otherwise $v$ is assigned back to its original community. This step identifies overlapping nodes along with their memberships in different communities.

Step IV: The algorithm finally returns all the vertices with their new community memberships.
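The post-processing steps above can be sketched as follows. This is a minimal Python illustration under several assumptions, not a reproduction of Algorithm 1: the graph is an adjacency dictionary of sets, the disjoint partition is assumed to be given (the disjoint detection step is not shown), every trial move is evaluated against the original disjoint assignment, the internal clustering coefficient is taken as 0 when a vertex has fewer than two internal neighbors, and the names `permanence`, `local_sum` and `pvoc_postprocess` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def permanence(adj, comm, v):
    """Permanence of v under a disjoint assignment comm (node -> community id)."""
    nbrs = adj[v]
    if not nbrs:
        return 0.0  # assumption: an isolated vertex gets permanence 0
    internal = [u for u in nbrs if comm[u] == comm[v]]
    pairs = len(internal) * (len(internal) - 1) // 2
    links = sum(1 for a, b in combinations(internal, 2) if b in adj[a])
    c_in = links / pairs if pairs else 0.0  # assumption for < 2 internal neighbors
    external = Counter(comm[u] for u in nbrs if comm[u] != comm[v])
    if not external:
        return c_in  # no external connections: permanence equals c_in
    e_max = max(external.values())
    return len(internal) / (e_max * len(nbrs)) - (1.0 - c_in)


def local_sum(adj, comm, v):
    """Sum of the permanence of v and of its immediate neighbors."""
    return sum(permanence(adj, comm, u) for u in (v, *adj[v]))


def pvoc_postprocess(adj, comm, threshold=0.05):
    """Return node -> set of community ids after vertex replication."""
    work = dict(comm)  # working copy; restored after every trial move
    memberships = {v: {c} for v, c in comm.items()}
    for v in adj:
        home = work[v]
        external = {work[u] for u in adj[v]} - {home}
        if not external:
            continue  # only vertices with an external connection are candidates
        base = local_sum(adj, work, v)               # Step III(a)
        for c in external:
            work[v] = c                              # Step III(b)
            moved = local_sum(adj, work, v)          # Step III(c)
            if abs(base - moved) < threshold:        # Step III(d)
                memberships[v].add(c)
        work[v] = home
    return memberships


# Toy graph (hypothetical): two 4-cliques {0,1,2,3} and {5,6,7,8} bridged by vertex 4.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (2, 4), (3, 4),
         (4, 5), (4, 6), (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8)]
adj = {v: set() for v in range(9)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)
comm = {v: "A" if v <= 4 else "B" for v in range(9)}
result = pvoc_postprocess(adj, comm)  # vertex 4 ends up in both "A" and "B"
```

In the toy graph, vertex 4 is attached symmetrically to both cliques, so moving it leaves the local permanence sum unchanged and it is replicated. Vertices 5 and 6 also have an external connection, but moving either of them changes the local sum by far more than the threshold, so they keep a single membership.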

The threshold controls the extent to which one can relax the condition for replicating a vertex into multiple communities. We vary the threshold from 0 to 0.2 and observe that it produces maximum accuracy at 0.05 (see Figure 6). Therefore, for the rest of the experiments, we keep the value of the threshold at 0.05. Note that in the permanence-based post-processing step, we only consider those vertices having at least one external connection. The rationale behind this choice is that vertices in the core of each community are usually placed correctly by the disjoint community detection algorithm, whereas vertices which lie in the peripheral region of a community and are loosely connected to its core have a high chance of being part of multiple communities. Figure 5 shows an empirical observation where we plot the number of external connections of a vertex in the detected disjoint community structure against the number of overlapping memberships of the vertex in the ground-truth community structure. We observe an increasing relationship, which strengthens our hypothesis.
The time complexity of measuring the permanence of a vertex depends on the average degree of the vertices in the network. In real-world networks, the average degree is much lower than the number of nodes in the network. Therefore, the running time of the PVOC algorithm mostly depends on the underlying disjoint community detection algorithm.
5 Experiments
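We combine PVOC with two popular disjoint community detection algorithms, namely Louvain [10] and Infomap [11]; the two combinations are referred to as PVOC+LVN and PVOC+INFO, respectively.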
5.1 Baseline algorithms
We compare the performance of PVOC with the following state-of-the-art overlapping community detection algorithms, whose codes are also available:

Order statistics local optimization method (OSLOM): It is based on the local optimization of a fitness function expressing the statistical significance of clusters with respect to random fluctuations, which is estimated with tools of Extreme and Order Statistics [5]. The code is available at http://www.oslom.org.

Community overlap propagation algorithm (COPRA): This algorithm is based on the label propagation technique of Raghavan et al [4], but is able to detect communities that overlap. Like the original algorithm, vertices have labels that propagate between neighboring vertices so that members of a community reach a consensus on their community membership [36]. The code is available at http://www.cs.bris.ac.uk/~steve/networks/software/copra.html.

Speaker listener propagation algorithm (SLPA): The algorithm is an extension of the Label Propagation Algorithm (LPA) [4]. In SLPA, each node can be a listener or a speaker. The roles are switched depending on whether a node serves as an information provider or information consumer. Typically, a node can hold as many labels as it likes, depending on what it has experienced in the stochastic processes driven by the underlying network structure. A node accumulates knowledge of repeatedly observed labels instead of erasing all but one of them. Moreover, the more a node observes a label, the more likely it will spread this label to other nodes [37]. The code is available at https://sites.google.com/site/communitydetectionslpa.

Agglomerative hierarchical clustering based on maximal clique (EAGLE): It uses the agglomerative framework to produce a dendrogram. First, all maximal cliques are found and made to be the initial communities. Then, the pair of communities with maximum similarity is merged. The optimal cut on the dendrogram is determined by the extended modularity with a weight based on the number of overlapping memberships [6]. The code is available at http://code.google.com/p/eaglepp/.

Cluster-overlap Newman Girvan algorithm (CONGA): The idea of this algorithm is similar to our idea of finding overlapping communities from disjoint community structure. CONGA is based on Girvan and Newman’s “GN” algorithm [1] but is extended to detect overlapping communities. CONGA adds to the GN algorithm the ability to split vertices between communities, based on the new concept of split betweenness. At first, the edge betweenness of edges and the split betweenness of vertices are calculated. Then an edge with maximum edge betweenness is removed or a vertex with maximum split betweenness is split. After this step, edge betweenness and split betweenness are recalculated. The above steps are repeated until no edges remain [7]. However, the calculation of edge betweenness and split betweenness is expensive on large networks. The code is available at http://www.cs.bris.ac.uk/~steve/networks/congapaper/.

Cluster affiliation model for big networks (BIGCLAM): In this algorithm, communities arise due to shared community affiliations of nodes. Here the affiliation strength is explicitly modeled for each node to each community. Then each node-community pair is assigned a nonnegative latent factor which represents the degree of membership of a node to the community. The probability of an edge between a pair of nodes is then modeled in the network as a function of the shared community affiliations [35]. The code is available at http://snap.stanford.edu.
Note that each algorithm is simply used with its default parameters.
5.2 Validation metrics
A strong test of the correctness of a community detection algorithm is to compare the obtained communities with a given ground-truth structure. For evaluation, we use three metrics that quantify the level of correspondence between the detected and the ground-truth communities [35]: Overlapping Normalized Mutual Information (ONMI), Omega Index and F-Score.
Note that all the metrics are bounded between 0 (no matching) and 1 (perfect matching).
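To make one of these scores concrete, the sketch below computes an F-Score between two overlapping covers. It assumes the best-match average F1 commonly used with [35]: each ground-truth community is matched to its most similar detected community and vice versa, and the two directional averages are averaged. The function names `f1` and `average_f1` are hypothetical, and the exact variant used in the experiments may differ.

```python
def f1(a, b):
    """F1 score between two communities given as sets of nodes."""
    common = len(a & b)
    if common == 0:
        return 0.0
    precision = common / len(b)
    recall = common / len(a)
    return 2 * precision * recall / (precision + recall)


def average_f1(truth, detected):
    """Symmetric best-match F1 between two covers (lists of node sets)."""
    def one_way(source, target):
        # For each community in source, take its best-matching community in target.
        return sum(max(f1(c, d) for d in target) for c in source) / len(source)

    return 0.5 * (one_way(truth, detected) + one_way(detected, truth))


# Toy covers (hypothetical): node 3 overlaps in the ground truth only.
truth = [{1, 2, 3}, {3, 4, 5}]
detected = [{1, 2, 3}, {4, 5}]
score = average_f1(truth, detected)
```

The score is 1 only when every community has an identical counterpart in the other cover, and 0 when no detected community shares a node with any ground-truth community, which matches the stated bounds.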
5.3 Experimental results
In this experiment, we use PVOC combined with Louvain and Infomap separately, and compare the results with six baseline algorithms. First, we check the dependency of PVOC on the value of the threshold. Figure 6 shows that at a threshold of 0.05, PVOC achieves maximum accuracy for LFR and one representative real-world network; the results are almost the same for the other networks. Therefore, we use a threshold of 0.05 in the rest of the experiments. One can tune the threshold appropriately to control the extent of overlapping membership of vertices in the network.
In Figure 7, we compare the outputs obtained from different competing algorithms with the ground-truth communities for LFR networks with different parameter settings. Figure 7 (top panel) shows the results for different values of the mixing parameter $\mu$, ranging from 0.1 to 0.5. As $\mu$ increases, the community structure becomes less evident and it becomes difficult for all the algorithms to discover the actual community structure. OSLOM performs worst compared to the other algorithms. However, in all cases, PVOC+LVN is least affected and outperforms the other algorithms. It is followed by PVOC+INFO, CONGA and SLPA.
We then vary the average number of community memberships per vertex from 2 to 8, keeping the other parameters the same, and plot the performance of different algorithms in Figure 7 (middle panel). The effect on the accuracy of the competing algorithms is comparatively small. Here we observe that the pattern is almost the same for PVOC+LVN and PVOC+INFO, and that both are much superior to the others.
Finally, in Figure 7 (lower panel) we plot the accuracy of the algorithms against the percentage of overlapping vertices. Surprisingly, OSLOM shows an unexpected behavior, with its accuracy increasing after a certain percentage of overlapping vertices. However, on average the change in accuracy is almost consistent for all the algorithms across all percentages of overlapping vertices.
To understand the utility of including the PVOC step with the disjoint community finding algorithms in more detail, we further measure the performance of Louvain and Infomap in isolation, without the PVOC step. We observe that excluding the PVOC step significantly deteriorates the performance of the Louvain algorithm: for the LFR network (number of nodes = 10,000, community memberships per overlapping vertex = 4, overlapping vertices = 5%, mixing parameter = 0.2) ONMI (0.569), Omega Index (0.512), F-Score (0.523); for the DBLP network ONMI (0.495), Omega Index (0.521), F-Score (0.487); for the Amazon network ONMI (0.458), Omega Index (0.498), F-Score (0.447); for the Youtube network ONMI (0.512), Omega Index (0.522), F-Score (0.564); and for the Orkut network ONMI (0.526), Omega Index (0.556), F-Score (0.544). A similar trend is observed for the Infomap algorithm. This observation therefore strengthens the case for PVOC as a post-processing step with the disjoint community detection algorithms.
Now, we run the competing algorithms on the real-world networks. As noted in [35], most of the baseline community detection algorithms do not scale to networks of large size. Therefore, we use the following technique proposed by Yang and Leskovec [35] to obtain several small subnetworks with overlapping community structure from the large real networks. We pick a random seed node in the given graph that belongs to at least two communities. We then take the subnetwork to be the induced subgraph consisting of all the nodes that share at least one ground-truth community membership with the seed node. In our experiments, we created 500 different subnetworks for each of the six real-world datasets and the results are averaged over these 500 samples. For each validation metric (ONMI, Omega Index, F-Score), we separately scale the scores of the methods so that the best performing community detection method has a score of 1. Finally, we compute the composite performance by summing up the three normalized scores. If a method outperforms all the other methods in all the scores, then its composite performance is 3.
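The sampling and scoring procedure described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the evaluation code itself: the names `sample_subnetwork` and `composite_performance` are hypothetical, the graph is an adjacency dictionary of sets, ground-truth memberships map each node to a set of community labels, and the metric values in the usage example are made up for illustration.

```python
import random


def sample_subnetwork(adj, memberships, rng=random):
    """Pick a random seed in >= 2 communities and return (seed, induced subgraph).

    The subgraph keeps every node sharing at least one ground-truth
    community with the seed, and only the edges among those nodes.
    """
    seeds = [v for v, comms in memberships.items() if len(comms) >= 2]
    seed = rng.choice(seeds)
    nodes = {v for v, comms in memberships.items() if comms & memberships[seed]}
    return seed, {v: adj[v] & nodes for v in nodes}


def composite_performance(scores):
    """scores: method -> {metric: value}.

    Each metric is scaled so that the best method scores 1,
    and the scaled values are summed per method.
    """
    metrics = list(next(iter(scores.values())))
    best = {m: max(s[m] for s in scores.values()) for m in metrics}
    return {name: sum(s[m] / best[m] for m in metrics if best[m] > 0)
            for name, s in scores.items()}


# Toy inputs (hypothetical values, not results from the paper).
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
memberships = {1: {"a"}, 2: {"a", "b"}, 3: {"b"}, 4: {"c"}}
seed, sub = sample_subnetwork(adj, memberships)

scores = {
    "method_x": {"onmi": 0.8, "omega": 0.6, "fscore": 0.9},
    "method_y": {"onmi": 0.4, "omega": 0.6, "fscore": 0.45},
}
composite = composite_performance(scores)
```

With three metrics, a method that is best on every metric reaches the maximum composite score of 3, as `method_x` does in this toy example.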
Figure 8 displays the composite performance of the methods for different networks. On average, the composite performance of PVOC+INFO (2.88) and PVOC+LVN (2.74) is significantly higher than that of the other competing algorithms: the score of PVOC+INFO is 6.27% higher than that of BIGCLAM (2.71), 18.03% higher than that of SLPA (2.44), 101.3% higher than that of OSLOM (1.43), 36.4% higher than that of COPRA (2.11), 48.4% higher than that of CONGA (1.94), and 77.8% higher than that of EAGLE (1.62). The absolute average ONMI of PVOC+INFO (PVOC+LVN) for one LFR and six real networks taken together is 0.85 (0.83), which is 4.93% (2.46%) and 26.8% (20.8%) higher than the two closest competitors, i.e., BIGCLAM (0.81) and SLPA (0.67) respectively. In terms of absolute values of scores, PVOC+INFO (PVOC+LVN) achieves an average F-Score of 0.84 (0.79) and an average Omega Index of 0.83 (0.82). Overall, PVOC combined with Louvain and Infomap gives the best results, followed by BIGCLAM, SLPA, COPRA, CONGA, EAGLE and OSLOM.
As most of the baseline algorithms except BIGCLAM do not scale to large real networks [35], we separately compare PVOC with BIGCLAM (which is scalable and also the closest competitor) on the actual large real datasets. Table 3 shows the performance of PVOC and BIGCLAM for different real networks. On average, PVOC+INFO (PVOC+LVN) achieves 4.28% (5.63%) higher ONMI, 1.48% (2.85%) higher Omega Index, and 6.94% (5.63%) higher F-Score. Overall, PVOC outperforms BIGCLAM on average in every measure, and on most of the individual networks. The absolute values of the scores of PVOC+INFO and PVOC+LVN averaged over all the networks are 0.70 and 0.71 (ONMI), 0.69 and 0.70 (Omega Index), and 0.72 and 0.71 (F-Score) respectively.
Networks  BIGCLAM ONMI  BIGCLAM Omega  BIGCLAM F-Score  PVOC+LVN ONMI  PVOC+LVN Omega  PVOC+LVN F-Score  PVOC+INFO ONMI  PVOC+INFO Omega  PVOC+INFO F-Score  
DBLP  0.61  0.59  0.54  0.65  0.61  0.60  0.65  0.62  0.59 
Amazon  0.73  0.69  0.74  0.72  0.71  0.75  0.73  0.74  0.76 
Orkut  0.65  0.68  0.64  0.72  0.70  0.76  0.73  0.72  0.77 
Youtube  0.68  0.76  0.78  0.77  0.78  0.72  0.71  0.68  0.78 
Many optimization algorithms have the tendency to underestimate smaller communities [50] and sometimes tend to produce very large communities. In our test suite, we observe a similar tendency in BIGCLAM, whereas the communities obtained by the PVOC based algorithms are comparable in size to those in the ground truth. Earlier, in Table 2, we reported the number of communities detected by the PVOC based algorithms (the number of communities does not change when the PVOC step is added to Louvain and Infomap). In Table 4, we show for both LFR and real-world networks that the sizes of the largest and smallest communities detected by BIGCLAM are much larger than those present in the ground-truth structure. We also measure the similarity (using the Jaccard coefficient, shown in parentheses) between the largest and smallest communities detected by BIGCLAM and the PVOC based algorithms and the communities in the ground-truth structure, and notice that the PVOC based algorithms detect largest and smallest communities that are the most similar to the ground-truth structure. Therefore, we hypothesize that our algorithm has the potential to produce meaningful communities which closely resemble the ground-truth structure.
Table 4: Sizes of the largest and smallest communities in the ground-truth and in the output of each algorithm; the Jaccard similarity with the ground-truth is given in parentheses.
Networks   Ground-truth           BIGCLAM                         PVOC+LVN                        PVOC+INFO
           Max size   Min size    Max size        Min size        Max size        Min size        Max size        Min size
DBLP       3,458      124         9,876 (0.56)    877 (0.48)      4,098 (0.71)    243 (0.82)      4,143 (0.76)    204 (0.81)
Amazon     5,987      245         10,109 (0.45)   765 (0.57)      6,876 (0.69)    398 (0.75)      6,367 (0.72)    323 (0.83)
Orkut      10,687     1,876       13,768 (0.72)   2,985 (0.69)    11,976 (0.74)   1,908 (0.79)    11,345 (0.75)   1,976 (0.79)
Youtube    8,987      765         9,976 (0.65)    1,098 (0.62)    8,876 (0.76)    987 (0.71)      9,018 (0.74)    865 (0.82)
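The Jaccard comparison behind the parenthesized values can be illustrated with a short sketch. This is an assumption-laden illustration rather than the procedure actually used: the function names are hypothetical, the toy communities are invented, and the largest (smallest) detected community is simply paired with the largest (smallest) ground-truth community, a pairing rule the text does not spell out.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two vertex sets."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty sets are treated as identical.
    return len(a & b) / len(union) if union else 1.0


def extreme_community_similarity(detected, ground_truth):
    """Compare the largest and smallest detected communities with the
    largest and smallest ground-truth communities.

    Assumption: communities are paired by size rank (largest with
    largest, smallest with smallest).
    """
    return {
        "max": jaccard(max(detected, key=len), max(ground_truth, key=len)),
        "min": jaccard(min(detected, key=len), min(ground_truth, key=len)),
    }


# Hypothetical toy communities (not data from the paper).
ground_truth = [{1, 2, 3, 4, 5}, {5, 6, 7}, {8, 9}]
detected = [{1, 2, 3, 4}, {5, 6, 7}, {8}]

similarity = extreme_community_similarity(detected, ground_truth)
print(similarity)
```

An alternative pairing would match each detected community to the ground-truth community with which its Jaccard coefficient is highest; which rule is appropriate depends on how the comparison is meant to be read.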
6 Conclusions
In this paper, we presented a study showing that there is perhaps less need to develop yet another algorithm for finding overlapping communities in a network. We demonstrated how the output of an efficient disjoint community detection algorithm can be leveraged to discover the overlapping community structure. To this end, we proposed a novel two-phase framework, called PVOC, that can be combined with any efficient disjoint community detection algorithm. In its post-processing step, PVOC applies a new metric, called permanence, to each vertex and detects the overlapping vertices from the non-overlapping structure. We combined PVOC with two efficient and scalable algorithms, Louvain and Infomap. Experimental results showed that our approach efficiently produces meaningful overlapping communities, even from large real-world networks, with high resemblance to the ground-truth community structure. PVOC is controlled by only one parameter, which can be tuned to increase the extent of overlapping memberships per vertex in a network.
A major drawback of PVOC, however, is that it produces exactly the same number of overlapping communities as the disjoint community detection algorithm produces. Because of the overlapping nature of communities, new communities might emerge from the disjoint community structure. As an immediate next step, we would like to add a module to the post-processing step that accounts for the emergence of new communities. Moreover, we plan to evaluate PVOC in conjunction with more disjoint community detection algorithms. To conclude, we would like to emphasize that, given the massive literature on community detection, it is perhaps a good time to put an end to the persistent effort of proposing yet another algorithm, and to revisit existing algorithms that are efficient enough to serve the purpose of discovering both disjoint and overlapping communities in a network.
7 References
 Newman M E J and Girvan M 2004 Physical Review E 69 026113+
 Fortunato S 2010 Physics Reports 486 75–174
 Papadopoulos S, Kompatsiaris Y, Vakali A and Spyridonos P 2012 Data Min. Knowl. Discov. 24 515–554
 Raghavan U N, Albert R and Kumara S 2007 Physical Review E 76 036106+
 Lancichinetti A, Radicchi F, Ramasco J J and Fortunato S 2011 PLoS ONE 6 e18961
 Shen H, Cheng X, Cai K and Hu M B 2009 Physica A 388 1706–1712
 Gregory S 2007 An algorithm to find overlapping community structure in networks Proceedings of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases PKDD 2007 (Berlin, Heidelberg: Springer-Verlag) pp 91–102 ISBN 9783540749752 URL http://dx.doi.org/10.1007/978-3-540-74976-9_12
 Xie J, Kelley S and Szymanski B K 2013 ACM Comput. Surv. 45 43:1–43:35
 Chakraborty T, Srinivasan S, Ganguly N, Mukherjee A and Bhowmick S 2014 On the permanence of vertices in network communities The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, New York, NY, USA, August 24–27, 2014 pp 1396–1405
 Blondel V D, Guillaume J L, Lambiotte R and Lefebvre E 2008 J. Stat. Mech. 2008 P10008
 Rosvall M and Bergstrom C 2007 PNAS 104 7327
 Palla G, Derényi I, Farkas I and Vicsek T 2005 Nature 435 814–818
 Fortunato S and Lancichinetti A 2009 Community detection algorithms: A comparative analysis: Invited presentation, extended abstract Proceedings of the Fourth International ICST Conference on Performance Evaluation Methodologies and Tools (ICST, Brussels, Belgium: ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering)) pp 27:1–27:2
 Ahn Y Y, Bagrow J P and Lehmann S 2010 Nature 466 761–764
 Evans T S and Lambiotte R 2010 The European Physical Journal B 77 265–272
 Evans T S and Lambiotte R 2009 Physical Review E 80 016105
 Chen Y, Wang X L, Yuan B and Tang B Z 2014 Journal of Statistical Mechanics: Theory and Experiment 2014 P03021 URL http://stacks.iop.org/1742-5468/2014/i=3/a=P03021
 Baumes J, Goldberg M K, Krishnamoorthy M S, Magdon-Ismail M and Preston N 2005 Finding communities by clustering a graph into overlapping subgraphs. IADIS AC (IADIS) pp 97–104
 Lancichinetti A, Fortunato S and Kertész J 2009 New Journal of Physics 11 033015 URL http://stacks.iop.org/1367-2630/11/i=3/a=033015
 Havemann F, Heinz M, Struck A and Gläser J 2010 CoRR abs/1012.1269
 Chen D, Shang M, Lv Z and Fu Y 2010 Physica A: Statistical Mechanics and its Applications 389 4177–4187
 Lee C, Reid F, McDaid A and Hurley N 2010 Detecting highly-overlapping community structure by greedy clique expansion Workshop - ACM KDD-SNA pp 33–42
 Du N, Wang B and Wu B 2008 Overlapping community structure detection in networks 17th ACM Conference on Information and Knowledge Management (CIKM’08) pp 1371–1372
 Gregory S 2011 J. of Stat. Mech.
 Nepusz T, Petroczi A, Negyessy L and Bazso F 2008 Phys. Rev. E 1
 Zhang S, Wang R S and Zhang X S 2007 Physica A: Statistical Mechanics and its Applications 374 483–490
 Newman M E J and Leicht E A 2007 Proceedings of the National Academy of Sciences 104 9564–9569
 Ren W, Yan G, Liao X and Xiao L 2009 Physical Review E (Statistical, Nonlinear, and Soft Matter Physics) 79 036111 URL http://dx.doi.org/10.1103/physreve.79.036111
 Nowicki K and Snijders T A B 2001 Journal of the American Statistical Association 96 1077–1087
 Zarei M, Izadi D and Samani K A 2009 Journal of Statistical Mechanics: Theory and Experiment 2009 P11013 URL http://stacks.iop.org/1742-5468/2009/i=11/a=P11013
 McDaid A and Hurley N 2010 Detecting highly overlapping communities with model-based overlapping seed expansion Proceedings of the 2010 International Conference on Advances in Social Networks Analysis and Mining ASONAM ’10 (Washington, DC, USA) pp 112–119
 Zhang S, Wang R S and Zhang X S 2007 Phys. Rev. E 76(4) 046103 URL http://link.aps.org/doi/10.1103/PhysRevE.76.046103
 Zhao K, Zhang S W and Pan Q 2010 Fuzzy analysis for overlapping community structure of complex network Control and Decision Conference (CCDC), 2010 Chinese pp 3976–3981
 Ding F, Luo Z, Shi J and Fang X 2010 Overlapping community detection by kernel-based fuzzy affinity propagation International Workshop on Indoor Spatial Awareness (ISA’10) pp 1–4
 Yang J and Leskovec J 2013 Overlapping community detection at scale: A nonnegative matrix factorization approach WSDM (New York, NY, USA: ACM) pp 587–596
 Gregory S 2010 New Journal of Physics 12 103018
 Xie J and Szymanski B K 2012 Towards linear time overlapping community detection in social networks PAKDD pp 25–36
 Xie J and Szymanski B K 2011 CoRR abs/1105.3264
 Chen W, Liu Z, Sun X and Wang Y 2010 Data Min. Knowl. Discov. 21 224–240
 Girvan M and Newman M E J 2002 Proceedings of the National Academy of Sciences 99 7821–7826
 Zhang Y, Wang J, Wang Y and Zhou L 2009 Parallel community detection on large networks with propinquity dynamics. KDD ed Elder IV J F, Fogelman-Soulié F, Flach P A and Zaki M (ACM) pp 997–1006 ISBN 9781605584959 URL http://dblp.uni-trier.de/db/conf/kdd/kdd2009.html#ZhangWWZ09
 Kovács I A, Palotai R, Szalay M S and Csermely P 2010 PLoS ONE 5 e12528 URL http://dx.doi.org/10.1371%2Fjournal.pone.0012528
 Gopalan P K and Blei D M 2013 Proceedings of the National Academy of Sciences 110 14534–14539
 Lancichinetti A, Fortunato S and Radicchi F 2008 Phys. Rev. E 78 046110
 Yang J and Leskovec J 2012 Defining and evaluating network communities based on ground-truth Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics (New York, NY, USA: ACM) pp 3:1–3:8
 Rosvall M and Bergstrom C T 2008 PNAS 105 1118–1123
 Danon L, Díaz-Guilera A, Duch J and Arenas A 2005 J. Stat. Mech. 9 P09008
 McDaid A F, Greene D and Hurley N J 2011 CoRR abs/1110.2515
 Manning C D, Raghavan P and Schütze H 2008 Introduction to Information Retrieval (New York, NY, USA: Cambridge University Press)
 Fortunato S and Barthelemy M 2007 PNAS