Finding Overlapping Communities in Social Networks: Toward a Rigorous Approach
A community in a social network is usually understood to be a group of nodes more densely connected with each other than with the rest of the network. This is an important concept in most domains where networks arise: social, technological, biological, etc. For many years algorithms for finding communities implicitly assumed communities are nonoverlapping (leading to use of clustering-based approaches) but there is increasing interest in finding overlapping communities. A barrier to finding communities is that the solution concept is often defined in terms of an NP-complete problem such as Clique or Hierarchical Clustering.
This paper seeks to initiate a rigorous approach to the problem of finding overlapping communities, where “rigorous” means that we clearly state the following: (a) the object sought by our algorithm, (b) the assumptions about the underlying network, and (c) the (worst-case) running time.
Our assumptions about the network lie between worst-case and average-case. An average-case analysis would require a precise probabilistic model of the network, on which there is currently no consensus. However, some plausible assumptions about network parameters can be gleaned from a long body of work in the sociology community spanning five decades focusing on the study of individual communities and ego-centric networks (in graph theoretic terms, this is the subgraph induced on a node’s neighborhood). Thus our assumptions are somewhat “local” in nature. Nevertheless they suffice to permit a rigorous analysis of running time of algorithms that recover global structure.
Our algorithms use random sampling similar to that in property testing and algorithms for dense graphs. We note however that our networks are not necessarily dense graphs, not even in local neighborhoods.
Our algorithms explore a local-global relationship between ego-centric and socio-centric networks that we hope will provide a fruitful framework for future work both in computer science and sociology.
Community structure is an important characteristic of social networks and has long been studied in sociology. The classic paper of Luce and Perry in 1949, which introduced the term “clique” to graph theory, described a community as a subset of individuals every pair of whom are acquainted. The text of Scott equates communities with objects such as cliques or other dense subgraphs. Another seminal paper, by Breiger in 1974, develops a theory of communities in terms of affiliation networks, which in graph theoretic terms are bipartite graphs with people on one side and communities on the other. In sociology today, the answer to many natural and important questions depends on a better understanding of community structure. Can you travel from one node to another random node using only a few “strong ties”? Do networks contain “wide” bridges? How much do communities overlap?
The problem of identifying communities arose independently in other fields such as internet search, study of the web graph, and the problem of clustering network nodes (in networks of biological interactions, citations, etc.). In his recent comprehensive survey of algorithmic approaches, Fortunato divides them into two camps based upon whether or not the algorithm assumes, implicitly or explicitly, that communities are disjoint. Assuming disjointness implicitly leads to a view of a community as a nonexpanding node set: it contains many edges but has relatively few edges leaving it. (Footnote 1: The Girvan-Newman algorithm does not explicitly define communities as nonexpanding sets. Instead it defines the betweenness of an edge as the number of shortest paths between pairs of nodes that pass through it, and it iteratively removes edges of high betweenness to isolate communities.) This viewpoint suggests many approaches that have been tried: graph partitioning, hierarchical clustering, spectral clustering, simulated annealing, modularity, betweenness, etc. Gibson et al. discovered interesting communities via hubs and authorities; Hopcroft et al. used agglomerative clustering on the Citeseer database and exhibited interesting communities that persist over time.
However, recently Leskovec et al. presented an extensive study of many of the above methods on larger datasets, and questioned whether they uncover meaningful structure at larger scales. Leskovec et al. often detect a large “core” in the network that is difficult to break into communities. One possible interpretation is that if there are communities in the core, they must overlap.
Thus there is growing interest in finding communities that are allowed to overlap, as they do in most real-life social networks. When communities overlap, each community will not in general be a nonexpanding set. (Consequently, clustering-based approaches may not work.) For instance, imagine that the network contains many communities that are equal-sized cliques with bounded pairwise intersections, and every person belongs to four communities. Then each community/clique will have in general as much as three times as many edges going out of it as are contained in it.
Approaches for finding overlapping communities involve either heuristic clique-finding, or local-search procedures that maintain overlapping clusters and improve them via a series of heuristics. Sometimes a probabilistic generative model is assumed and a max-likelihood fit is attempted via EM and other ideas. (The very recent survey of Xie et al. evaluates dozens of competing heuristics introduced in the last two years alone.) However, Fortunato states at the end of his 100-page survey:
..research on graph clustering has not yet given a satisfactory solution of the problem and leaves us with a number of important open issues. The field has grown in a rather chaotic way, without a precise direction or guidelines…What the field lacks the most is a theoretical framework…everybody has his or her own idea of what a community is.
Of course, it is entirely possible that Fortunato’s questions have no clean answer, or at least none that spans all types of networks of interest. Quite possibly, a community in a network of gene-gene interactions is an inherently different object than one in the graph of Facebook friendships. Furthermore, a clear definition of the problem does not in itself guarantee a simple algorithm, for example if the definition involves cliques.
A related issue is the development of models for community formation/growth whose mathematical analysis yields predictions testable on real-life networks. An inspiration here is the large body of Barabasi-Albert style models which make predictions about degree distributions, graph distance etc. One concrete attempt to model community formation is Lattanzi and Sivakumar’s affiliation networks model, which is inspired by sociology work. In this dynamic model the communities are cliques (or dense subgraphs), and the mode of network growth is the addition of either new individuals (i.e., nodes) or new communities (i.e., cliques) to an affiliation network. New individuals partially copy the community memberships of existing individuals, and new communities are offshoots (subsets) of existing communities. Additional generative models with community structure appear in [28, 1, 6, 24].
What Fortunato’s questions point to, though, is a seeming chicken-and-egg situation. Developing models requires reliable data about community structure in real networks. Conversely, finding reliable data about community structure requires some implicit model, since without such a model the algorithm is consigned to solving worst-case instances of NP-hard problems like Clique, Dense Subgraph, and Small-set Expansion. (This issue clearly does not arise for simpler graph properties like node degrees.)
We seek a more rigorous approach to the problem of finding communities, where “rigorous” means that we clearly state the following: (a) the object sought by our algorithm, (b) the assumptions about the underlying network, and (c) the (worst-case) running time. We try to break out of the chicken-and-egg situation as follows. Instead of proposing a generative model per se, we list fairly minimal assumptions about the network that are based on theoretical and empirical work in sociology. Moreover, these assumptions are “local” in nature and they largely depend upon objects well-studied in sociology, namely ego-centric networks and individual communities. We think these assumptions will be satisfied by many plausible generative models (including Lattanzi-Sivakumar, according to our simulations). Thus it is interesting that these suffice to recover the communities.
Since our formalization of (a) and (b) draws on sociology we expect our approach to apply more to, say, the Facebook graph than biological networks. Furthermore, since our algorithms involve random sampling in node neighborhoods, they may mesh well with a dominant approach for network study in sociology, namely, ego-centric analysis. An ego-centric network consists of a person (the ego) and his ties (called “alters”). Ego-centric networks and their structure have been extensively studied in sociology (often via questionnaires and field-study) as a way to gain insight into the entire network [33, 8, 9, 13]. They give a view of how people develop and manage their social network resources [27, 33]. Data about such networks is easy to collect, even in the field, far from computers. They are even included on the biyearly General Social Survey, a central resource in sociology.
Sociological Foundations for our Assumptions.
Sociologists have observed that ego-centric networks can be clustered into a few communities, from which they infer that individuals participate in only a small number of communities [35, 20, 8, 9]. Furthermore, they have observed that a large portion of a person’s ties fit into communities [35, 20]. The celebrated theory of Feld  gives a theoretical understanding for this fact based upon “foci.” He defines a focus as “a social, psychological, legal, or physical entity around which joint activities are organized (e.g., workplaces, voluntary organizations, hangouts, families, etc.)” and his theory says that they are responsible for creating many ties in the network.
Mathematically, one could say that each individual participates in at most a bounded number of communities, and that these communities explain a fixed fraction of his/her ties. This information may already greatly help the algorithm, whose running time may depend upon both quantities.
What is a community? As mentioned, in sociology communities are thought of as either cliques or as dense subgraphs which are “relatively tightly connected” together compared with the rest of the network (see chapter 6 of  for an introduction and survey of how sociology models communities). Sometimes variants are considered, e.g., Alba  considers cliques of the th power of the graph (i.e., edges correspond to having distance in the original graph). Jackson’s text  allows communities to be dense graphs and describes various models for how edges are generated within a community: e.g., or the expected degree model.
Furthermore, not all dense graphs would pass muster as communities [32, 15]. For example, a union of two disjoint cliques of size is a fairly dense graph of size but the latter is not a community. We will assume that a “community” is a dense subgraph with edges inside it generated according to the expected degree model. (While this assumption simplifies the exposition, in Section 4.1 we will observe that this assumption can be relaxed to deal with other families of dense graphs.) However, we leave it as future work to extend our notions to more hierarchical notions of communities such as in .
Another principle often used in sociology is maximality: we should not be able to add nodes to the community and get the same structure; otherwise these nodes should be considered part of the community. (This was the basis for the maximal clique problem introduced in Luce and Perry’s 1949 paper; see also the text of Scott.) Thus nodes within the community are (in some way) better attached to the community than nodes outside of the community.
1.1 Formal Assumptions and Statement of Results
Our assumptions are grounded in the above observations. The network is a graph of size . Each edge has a probability with which it is picked. This can even be , so we allow adversarial edges. Each community is an arbitrary subset of nodes (unknown to the algorithm), and each node is in at most communities, so that the communities are allowed to overlap. We think of as constant or small (though several of our algorithms run in polynomial time even for ). Any edge where and share a community is a community edge. The remaining edges we call ambient edges.
Assumption 1) Community edges are chosen according to the expected degree model.
Each node in has an affinity that lies in . (Node has a different affinity for each community it belongs to.) Two nodes in are connected by an edge with probability (this is called the expected degree model in the standard text Jackson ). Notice that the model is sufficiently flexible to include other well-studied cases: the subcase for all corresponds to “ is a clique,” and the subcase for all corresponds to “ is a dense subgraph generated according to the random graph model .” We will usually assume is lowerbounded by some constant, so that the community is always a fairly dense graph. However in Section 5 we show conditions under which our algorithms can handle even sparser communities.
For nodes , that belong to more than one community, the probability that they are connected is at least the maximum of for all the communities they are in.
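The edge rule of Assumption 1 can be illustrated with a short simulation. The following is only a sketch under stated assumptions: it takes the connection probability of two members of a community to be the product of their affinities for that community (the usual form of the expected degree model), and for pairs sharing several communities it uses exactly the maximum over those communities, although the assumption only requires at least that much. The function name `sample_community_edges` and the dictionary-based inputs are hypothetical.

```python
import itertools
import random


def sample_community_edges(communities, affinity, rng):
    """Sample community edges under an assumed product-of-affinities rule.

    communities: dict mapping a community name to a list of its nodes.
    affinity:    dict mapping (community name, node) to a value in (0, 1].
    rng:         a random.Random instance.
    Returns a set of edges (u, v) with u < v.
    """
    # For each pair, keep the largest probability over the shared communities.
    best = {}
    for name, members in communities.items():
        for u, v in itertools.combinations(sorted(members), 2):
            p = affinity[(name, u)] * affinity[(name, v)]
            best[(u, v)] = max(best.get((u, v), 0.0), p)
    # Flip one independent coin per pair.
    return {pair for pair, p in sorted(best.items()) if rng.random() < p}


# Hypothetical example: with every affinity equal to 1, communities are cliques.
demo_communities = {"A": [0, 1, 2, 3], "B": [2, 3, 4]}
demo_affinity = {(c, u): 1.0 for c, ms in demo_communities.items() for u in ms}
demo_edges = sample_community_edges(demo_communities, demo_affinity,
                                    random.Random(0))
```

Setting every affinity to the same value below 1 instead gives each community the edge distribution of a uniform random graph on its members, matching the second subcase mentioned above.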
While Assumption 1 seems to be tending towards a “generative model,” we use this formulation primarily for ease of presentation. We remark in Section 4.1 that our algorithms work so long as the edges are “well-distributed.”
Assumption 2) Maximality assumption with gap (also called “Gap Assumption”).
Nodes outside the community are less strongly connected to it than community nodes are. For example, if one posits that for —i.e., each has edges to about community nodes—then our assumption would say that each has edges to less than a fraction of nodes in . This seems reasonable since otherwise one should consider whether should belong to as well. Of course, such assumptions about maximality are standard, though the gap in this context is new. We are able to relax the Gap Assumption in certain instances (see Section 4.2), though the results are no longer as clean and crisp.
Assumption 3) Community membership accounts for a significant portion of each node’s edges, say a constant fraction .
Surprisingly, Assumptions 1-3 suffice to let us efficiently recover the communities even though we made no other assumptions about the ambient edges, and allow arbitrary affinities.
Informal Theorem 1 (See Theorem 5) If all node affinities are lowerbounded by then the communities can be recovered in time, where and is size of the largest community.
Unfortunately, this running time is only “quasipolynomial” instead of polynomial, since could be as high as . We can get polynomial and even near-linear time algorithms for more restricted versions of the problem. The paper contains many such theorems and the following is representative.
Informal Theorem 2 (See Section 2) For all constants , if all affinities are (note: this includes cliques as a special case), and all community sizes are within a factor of each other, then the communities can be recovered in time where . Moreover, if affinities are only guaranteed to be at least , the communities can be recovered in time .
Our approach makes heavy use of the Gap Assumption and uses random sampling coupled with some exhaustive enumeration of small cliques in the subgraph induced on the sample. Similar ideas are well-known in property testing  and the related field of approximation algorithms for NP-hard problems in dense graphs [3, 14]. Note however that our graphs are not dense. If then the induced graph in the neighborhood of each node is dense, though we do allow to be large as in many settings. Even when we do not know how to use, for example, the weak regularity lemma  (a standard tool in those other fields) to recover the communities.
The use of local sampling in the neighborhood of a single node gives our algorithms the feel of ego-centric analysis in sociology. However, when we examine communities as dense subgraphs, our algorithms (partially) explore a two-hop neighborhood from a starting node, which is a more generalized ego-centric analysis, and is necessary because no one node has ties to the entire community. We also adapt our ideas to the case where communities are not dense graphs (as is plausible in really large networks). Though our results are preliminary, that algorithm explores a two-hop neighborhood from a starting node.
Section 2 presents algorithms for the case when all community sizes are within an factor of each other. These algorithms are quite efficient and are a good introduction to our techniques. Section 3 allows communities to have vastly different sizes, derives the most general result (Theorem 5, our Informal Theorem 1), and then studies how to derive more efficient algorithms for specialized cases or under additional assumptions.
1.2 Related Work
The above setting seems superficially similar to other planted graph problems that were successfully treated using SVD (singular value decomposition) (see McSherry  and others). This similarity is however illusory because, first, the non-community edges in our model are not necessarily randomly distributed, and more seriously, because the SVD techniques are known for finding vertex partitions whereas here due to overlap between communities we need to find edge partitions.
Eppstein, Löffler, and Strash show how to provably find all maximal cliques in time that is exponential in the “degeneracy” of the network, which is bounded by the maximum degree in the worst case. Our network model, however, allows graphs with arbitrary degeneracy and also works for concepts of community more general than a clique.
Mishra et al.  also study overlapping communities in social networks. They show a simple and elegant algorithm for detecting overlapping communities in a certain parameter space. Their algorithm works best for communities where the overlap is not too large. In our parameter space, to detect a community with density and gap , they require that some has fewer than neighbors outside of . This is a strong restriction of the amount of overlapping, and is impossible for small when is bounded away from 1.
Related independent work:
Balcan et al. have independently studied the problem of inferring overlapping communities. They have a very different starting point in terms of an explanatory model of how network ties are formed via a preference/ranking function among the individuals. Surprisingly, they ultimately arrive at a very similar set of “minimal” assumptions and algorithms for inferring the communities. Perhaps this convergence is some kind of validation of both approaches.
2 When Communities have Similar Sizes
This section will give very efficient algorithms to find all communities in the graph when the following assumption holds:
Assumption: Each community has size between and where is some constant and is arbitrary but known to the algorithm.
We continue to make the three assumptions made in Section 1.1, and the parameters are as defined there. To emphasize, the communities can be arbitrary sets in the graph so long as the gap assumption is not violated and each node is in at most communities. Furthermore, the placement of “ambient” (i.e., non-community) edges in the graph can be adversarial as long as it does not violate the assumptions.
The running time is exponential in , so one would not use these algorithms if communities have radically different sizes; that case is handled in Section 3.
2.1 Warmup: Communities are Cliques
In this section, “community” is understood to be synonymous with “clique” which corresponds to all affinity values .
The algorithm focuses on the neighborhood of a node , and takes a random sample of nodes from it. Then it uses brute force (or any other suitable heuristics) to find cliques of size about in the graph induced on , and tries to extend them to communities. (To use sociology terms, here egocentric or node-based analysis leads to provably correct socio-centric or societal analysis.) The running time is linear in , albeit with a big “constant” factor term dependent upon .
Given a graph satisfying the assumptions in this section, the Clique-Community-Find-Algorithm outputs each community with probability at least in time . (Footnote 2: In this paper hides polynomial terms of the parameters , , and also , if they are relevant.)
The intuition behind why sampling works is that for any node the neighborhood has size at most as communities have size at most . Each community containing lies within and has size at least , which is at least a fraction of . Thus a random sample of will have many representatives from . The only subtlety is to watch out for “false positives”: sets that are not cliques but may present themselves as such during sampling.
Proof: (Theorem 1) For any community the probability that a randomly chosen starting node belongs to it is at least . Thus the expected number of times we pick a starting node from is at least and so by a Markov bound such a node is selected with probability . We show that in each such trial the probability that community is output is at least .
So let . Simple calculation based upon Chernoff bounds shows that each of the following sequence of three statements holds with high probability. (i) the subsampling gives a sample of size at most thrice the expectation. (ii) . Note that is a clique of . Now consider what happens when the for-loop of step 3 tries . Since every node in has an edge to every node in , the set will contain . (iii) . This follows since by the Gap Assumption that each has edges to at most a fraction of nodes in , and thus the probability that it has edges to each node in the random subset is less than . Also, the size of is at most , so in expectation the number of nodes in is only .
However, if , then we can identify nodes in just by their degree in ! Each has degree at least ; whereas each has at most . Thus the algorithm returns exactly .
A practical note: In step 3 we enumerate over all cliques of a certain size. With a slight parameter modification we can show it suffices to enumerate over all maximal cliques of this size, for which in practice one may be able to use existing heuristic algorithms and reduce running time.
Namely, in the proof we pick , and still take to be . Then Chernoff bound and union bound show that the probability that there is a node in that connects to every node in is small. So in this case is a maximal clique in .
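The steps used in the proof above (sample the neighborhood of a starting node, enumerate small cliques in the sample, extend each clique to the nodes adjacent to all of it, then prune by degree) can be sketched as follows. This is a minimal illustration, not the Clique-Community-Find-Algorithm as specified in the paper: the parameters `sample_prob`, `clique_size` and `degree_frac` are hypothetical stand-ins for the quantities set in the analysis, and the sketch omits the repetition over random starting nodes.

```python
import itertools
import random


def clique_community_find(adj, start, sample_prob, clique_size, degree_frac, rng):
    """adj maps each node to the set of its neighbors (undirected graph)."""
    # Subsample the neighborhood of the starting node.
    sample = [u for u in sorted(adj[start]) if rng.random() < sample_prob]
    found = set()
    # Brute-force enumeration of cliques of the requested size in the sample.
    for cand in itertools.combinations(sample, clique_size):
        if any(b not in adj[a] for a, b in itertools.combinations(cand, 2)):
            continue
        # Nodes adjacent to every node of the sampled clique, plus the clique.
        extended = set.intersection(*(adj[u] for u in cand)) | set(cand)
        # Prune false positives by their degree inside the extended set.
        cutoff = degree_frac * (len(extended) - 1)
        found.add(frozenset(u for u in extended
                            if len(adj[u] & extended) >= cutoff))
    return found


def build_adj(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


# Hypothetical example: two 5-cliques sharing node 0, plus an outside node 9
# that has ambient edges to nodes 0 and 2.
clique_a, clique_b = [0, 1, 2, 3, 4], [0, 5, 6, 7, 8]
demo_edges = (list(itertools.combinations(clique_a, 2))
              + list(itertools.combinations(clique_b, 2))
              + [(0, 9), (2, 9)])
demo_adj = build_adj(demo_edges)
demo_found = clique_community_find(demo_adj, start=1, sample_prob=1.0,
                                   clique_size=2, degree_frac=0.75,
                                   rng=random.Random(0))
```

In this example node 9 survives the extension step for the sampled clique {0, 2} but is removed by the degree test, which mirrors the role the Gap Assumption plays in the proof.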
2.2 When Communities are Dense Subgraphs
In the previous section, we equated “community” with “clique”, as has been done in many previous works. This assumes that everybody knows everybody else in a “community”—clearly a strong assumption as networks get larger (or even in smaller networks where data about adjacencies is incomplete).
In this section, we model a community as a dense subgraph which corresponds to all affinity values . All sets have the same affinity ; though this will be relaxed later. If nodes are in more than one community, their affinity is still the same and .
The description of the algorithm will use the following notion, which every community necessarily satisfies.
For an -set is a subset of nodes such that 1) every node in the set has edges to at least an fraction of nodes in the set; and 2) every outside node has edges to at most an fraction of nodes in the set.
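The two conditions of this definition are easy to check directly. The sketch below is only an illustration: the parameter names `inside_frac` and `outside_frac` are hypothetical labels for the two fractions in the definition, and fractions are measured against the full size of the set, with a node never counted as its own neighbor.

```python
def is_gap_set(adj, candidate, inside_frac, outside_frac):
    """Check the two conditions of the definition for a candidate node set.

    adj maps each node to the set of its neighbors (undirected graph).
    """
    members = set(candidate)
    size = len(members)
    if size == 0:
        return False
    for node, neighbors in adj.items():
        hits = len(neighbors & members)
        if node in members:
            # Condition 1: members see at least inside_frac of the set.
            if hits < inside_frac * size:
                return False
        elif hits > outside_frac * size:
            # Condition 2: outsiders see at most outside_frac of the set.
            return False
    return True


# Hypothetical example: a 4-clique {0, 1, 2, 3} and an outsider 4 tied to 0.
demo_adj = {0: {1, 2, 3, 4}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}, 4: {0}}
whole_ok = is_gap_set(demo_adj, {0, 1, 2, 3}, 0.7, 0.3)
part_ok = is_gap_set(demo_adj, {0, 1, 2}, 0.6, 0.3)
```

The second call fails because node 3 is adjacent to all of {0, 1, 2}, so a non-maximal subset of a community does not pass the test.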
Given a graph satisfying the conditions of this section with , the Dense-Community-Find-Algorithm outputs each community in with probability at least over the randomness of and over the randomness of the algorithm in time .
Proof: We consider the following algorithm:
To analyze the algorithm, we first assume the graph is “well formed”. The graph is well formed if 1) the number of edges from any node to any community is within of expectation. 2) For any node , and node in community , the expected size of is within of expectation. In particular, we know if , ; if , .
Since all the requirements of well-formedness are far from their expectation, by Chernoff bound it is easy to show that the graph is well-formed with probability at least .
Conditioned on being selected, by Chernoff bounds we show the following statements hold with high probability: (i) the subsampling gives a sample of less than 3 times the expectation. (ii) If we choose to be the intersection of and subsampled nodes, the number of edges from most nodes in to will be close to expectation. (iii) The symmetric difference of and is at most , because for any node in to be in the symmetric difference its number of edges to will have to be away from expectation. Chernoff bound shows the probability of this is at most , and .
Since the symmetric difference is so small, and is well formed, there will be a gap in degree for nodes in and outside . For any the number of edges into is at least ; for any , the number of edges into is at most . Hence setting threshold at suffices to distinguish the two cases. The set is indeed a subset of of size at least a fraction.
Finally, since the graph is well formed, any node must have at least an fraction of edges to , and any node must have at most an fraction of edges to , again a threshold of is enough to distinguish the two cases and .
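The repeated thresholding used in this argument can be sketched in a few lines. This is a simplified illustration rather than the Dense-Community-Find-Algorithm itself: it starts from a seed set assumed to lie inside one community, uses a single hypothetical threshold `frac` for both rounds, and skips the sampling step that produces the seed.

```python
def nodes_above_threshold(adj, target, frac):
    """Nodes having edges to at least a frac fraction of the target set."""
    target = set(target)
    cutoff = frac * len(target)
    return {u for u in adj if len(adj[u] & target) >= cutoff}


def extend_seed(adj, seed, frac):
    # Round 1: a rough candidate that may contain a few false positives.
    rough = nodes_above_threshold(adj, seed, frac)
    # Round 2: degrees into the rough set separate members from outsiders.
    return nodes_above_threshold(adj, rough, frac)


def build_adj(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


# Hypothetical example: a dense community {0, ..., 5} missing the edges
# (0, 5) and (1, 4), plus outsiders 6 (tied to 0 and 1) and 7 (tied to 5).
demo_adj = build_adj([
    (0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (1, 3), (1, 5), (2, 3),
    (2, 4), (2, 5), (3, 4), (3, 5), (4, 5), (0, 6), (1, 6), (5, 7),
])
demo_rough = nodes_above_threshold(demo_adj, {0, 1, 2}, 0.5)
demo_community = extend_seed(demo_adj, {0, 1, 2}, 0.5)
```

Here the outsider 6 passes the first round because both of its edges land in the small seed, and it is removed in the second round once degrees are measured against the larger rough set.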
The running time depends on the size of the subsampled nodes, which is of order . Thus the running time is .
2.2.1 Allowing Different Affinities
In the previous subsection we required edge probabilities to be exactly if belong to the same community, and this probability does not rise even when they belong to more than one community. In real life these requirements may be too stringent. Here we define the Dense-Similar-Size Assumptions which relax these two requirements. In this model, the Dense-Community-Find-Algorithm may fail, and we give a new algorithm that, unfortunately, is less efficient.
Communities satisfy Assumptions 1-3 from Section 1.1 as well as the following:
Each community has size between and
and is generated according to Assumption 1 with affinities .
If are in more than one community then edge has probability at least as large as the maximum requirement () of all communities that they lie in.
Given a graph and a set of communities consistent with Dense-Similar-Size Model with parameters where and , the Robust-Dense-Community-Find algorithm below outputs each community in with probability at least over the randomness of and probability at least over the randomness of the algorithm in time .
Proof: The previous algorithm may fail because in this model is no longer a uniform subset of and can be biased. Thus for a vertex the fraction of edges into the set may be quite different from the fraction of edges into . The idea of the algorithm is that for any community , there is always a set such that contains a large () fraction of . A uniform sample on will be similar enough to a uniform sample on , and the number of edges into sample will be close to the expectation. This allows us to get a set that is very close to and then extend it similarly as before.
We call the graph well formed if the degree of each node and the number of edges from any node to any community are within multiplicative factor of their expectations, also for any the size of their intersection in is within of the expectation. By concentration bounds and union bound, the probability that is well formed is at least . We shall assume is well formed in the discussions below.
For any community , when some is the starting node, let be a random subset of nodes in . Since the size is concentrated for any the probability that none of these nodes are adjacent to is at most . Thus the expected size of is at least .
We fix a set such that contains at least a fraction of , and show that is found with good probability. With high probability the subsampling step returns a sample of size less than 3 times the expectation. After sampling, fix to be the intersection of subsampled nodes and the community . Then this is a uniform sample of the set . For any node , the expected number of edges from to is at least ; for any node , the expected number of edges from to is at most . By Chernoff bound these values are away from expectation (and thus the node is in the symmetric difference of and ) with probability less than . The size of is at most . With high probability the symmetric difference (of and ) has size smaller than .
Now since is really close to , it is easy to check that for all vertices , the degree in is larger than ; for all the degree in is smaller. Thus setting a threshold at suffices to distinguish these two cases, and it follows that . Now is a large subset of , so all vertices will have more than edges to while all vertices have fewer edges. Setting the threshold at is again sufficient to distinguish the two cases and .
Finally, the running time of the algorithm depends on the size of the subsampled nodes, which is at most . Thus the algorithm runs in time
Notice that although the algorithm works for only a fixed value of , if the communities have different densities we can also apply the algorithm with different parameters to find all communities.
3 When Communities may have Very Different Sizes
When communities have very different sizes, the parameter for our models in Section 2 can be too small and the algorithms are not efficient. In this section we show we can relax the similar size requirement using a quasi-polynomial time algorithm. We can also find cliques of different sizes in polynomial time with some additional assumptions.
3.1 Quasi-polynomial Time Algorithm for Communities of Different Sizes
When we have quasi-polynomial time, we can find all communities that have at least constant density just using assumptions 1, 2 and 3. We only make sure that the minimum density we want to find is a constant , that is, each community satisfies Assumption 1 with smallest .
Given a graph satisfying the assumptions above with parameters , if all communities are sets (which happens with high probability when the size of the communities are not too small) the Any-Size-Dense-Community-Find algorithm will output all communities in in time .
Proof: When trying to apply previous ideas to this model, the difficulty is that communities have very different sizes and sampling will not find small communities. To solve the problem we enumerate over all sets of size , and think of the points of each set as chosen uniformly at random from a certain community . Such a set serves as the sample, and since it is large we can apply a union bound to show that we make no error when extending it to a community.
For each community with density , there must be a value of in the loop (line 2) where and (because the stepsize is ). Assume this is the case, and let be a uniformly random set of size in . For any node , if , then the expected number of edges to the set is more than fraction; if the expected number of edges to the set is less than fraction. The probability that the number of edges is fraction away from expectation is at most . We only need to apply a union bound over the nodes of and their neighbors, so the size is much smaller than . By the union bound the probability that the algorithm successfully finds is nonzero. Since we are trying all possible sets the algorithm will always find all the communities.
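The enumeration idea in this proof can be sketched on a toy scale. The code below is a rough illustration under several assumptions, not the Any-Size-Dense-Community-Find algorithm: it tries every seed set of a fixed size `seed_size` (the proof uses sets of logarithmic size), extends each seed with a threshold placed midway between the hypothetical parameters `inside_frac` and `outside_frac`, keeps a candidate only if it passes the two-sided degree test, and omits the outer loop over density values.

```python
import itertools


def enumerate_communities(adj, seed_size, inside_frac, outside_frac):
    """adj maps each node to the set of its neighbors (undirected graph)."""
    nodes = sorted(adj)
    midpoint = (inside_frac + outside_frac) / 2
    found = set()
    for seed in itertools.combinations(nodes, seed_size):
        seed_set = set(seed)
        # Treat the seed as if it were a sample from one community and
        # collect every node with enough edges into it.
        cand = {u for u in nodes
                if len(adj[u] & seed_set) >= midpoint * seed_size}
        if not cand:
            continue
        # Keep the candidate only if members are well connected to it
        # and all outsiders are poorly connected to it.
        size = len(cand)
        valid = all(
            len(adj[u] & cand) >= inside_frac * size if u in cand
            else len(adj[u] & cand) <= outside_frac * size
            for u in nodes
        )
        if valid:
            found.add(frozenset(cand))
    return found


def build_adj(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


# Hypothetical example: two 5-cliques that overlap in node 4.
demo_adj = build_adj(list(itertools.combinations([0, 1, 2, 3, 4], 2))
                     + list(itertools.combinations([4, 5, 6, 7, 8], 2)))
demo_found = enumerate_communities(demo_adj, seed_size=2,
                                   inside_frac=0.7, outside_frac=0.3)
```

Seeds that straddle the two cliques produce candidates that fail the final test, so only seeds drawn from a single community contribute an output.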
Although the algorithm is for dense subgraphs, if we run it with , it will find all clique communities of any size.
3.2 Polynomial Time Algorithm for Cliques of Different Sizes
Now we try to improve the quasipolynomial time in Theorem 5 in the subcase when communities are cliques of different sizes. The idea will be to reduce the amount of sampling and exhaustive enumeration. To prove this works we need to make assumptions beyond . The difficulty is that the solution can be highly nonunique and degenerate if communities are allowed to be too “similar.” For example, suppose node is not in a community but is contained in other communities with large subsets of . Should we now consider to be part of , since it does have edges to all (or most) of ? Our network model assumes such cases do not arise.
Assumption 4) Communities are fairly distinct. For each node in community , at least a constant factor, say , of does not lie in any other community containing . This is in accord with the intuitive view of how communities arise: the interconnection structure provides utility to its members above and beyond what existed before .
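As a concrete reading of Assumption 4, the following Python sketch computes, for a node and one of its communities, the fraction of that community lying outside every other community containing the node. The function names and the parameter `delta` (standing in for the constant factor in the assumption) are hypothetical, and communities are assumed to be given explicitly as a list of sets.

```python
def distinct_fraction(communities, index, u):
    """Fraction of communities[index] not covered by other communities of u."""
    community = communities[index]
    covered = set()
    for j, other in enumerate(communities):
        if j != index and u in other:
            covered |= other
    return len(community - covered) / len(community)


def satisfies_assumption_4(communities, delta):
    """Check the distinctness condition for every (node, community) pair."""
    return all(
        distinct_fraction(communities, i, u) >= delta
        for i, community in enumerate(communities)
        for u in community
    )


# Two communities of size 4 that share only node 3.
example = [{0, 1, 2, 3}, {3, 4, 5, 6}]
print(distinct_fraction(example, 0, 3))
print(satisfies_assumption_4(example, 0.5))
```

For the shared node 3, three quarters of each community lies outside the other one, so the check passes for any `delta` up to 0.75.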
The next assumption is technical and perhaps was being assumed by the reader all along. Surprisingly, we did not need it until now.
Assumption 5) Completeness Assumption. Any set that satisfies all the assumptions of a community in the model is a community. (Also called “Duck Assumption”: “If it looks like a duck, quacks like a duck, and walks like a duck, it’s a duck.”) This ensures the adversary can’t satisfy Assumption 4 by just pretending that a certain set is not a community even though it looks like one.
Finally we want to strengthen Assumption 3 so that smaller communities are distinguishable in principle from the noise introduced by the ambient edges:
Assumption 3’) Every community a node belongs to has size at least times the number of ambient edges incident on the node .
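Assumption 3’ can also be checked directly when the communities are known. The sketch below is only illustrative: it assumes that an ambient edge is an edge between two nodes sharing no community, and `factor` is a hypothetical stand-in for the multiplicative constant in the assumption.

```python
def ambient_degree(adj, communities, u):
    """Count neighbors of u that share no community with u (assumed ambient)."""
    mine = [c for c in communities if u in c]
    return sum(1 for v in adj[u] if not any(v in c for c in mine))


def satisfies_assumption_3_prime(adj, communities, factor):
    """Every community of u must have size >= factor * ambient_degree(u)."""
    for u in adj:
        sizes = [len(c) for c in communities if u in c]
        if sizes and min(sizes) < factor * ambient_degree(adj, communities, u):
            return False
    return True


# A triangle community {0, 1, 2} plus one ambient edge from node 0 to node 3.
example_adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
example_communities = [{0, 1, 2}]
print(satisfies_assumption_3_prime(example_adj, example_communities, 3))
```

Nodes that belong to no community are skipped, since the assumption only constrains communities that a node belongs to.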
Theorem 6. Given a graph that satisfies all assumptions in this section with parameters
where that , the Any-Size-Clique-Community-Find-Algorithm will output all communities in with probability at least in time
Proof: The main algorithmic difficulty will be that for any node may contain cliques of many different sizes. A subsample of would be likely to hit large cliques quite often, but not the smaller cliques. To solve this problem we try to find large cliques first. After cliques of size greater than are found, we can henceforth ignore their edges, and proceed to find smaller cliques.
Another problem is that after removing all edges in the large cliques, the remaining neighborhood of (called in the algorithm) may not contain all nodes of a remaining clique. To solve the problem the algorithm uses a set of size . We should think of as a random set in the community , then by Assumption 4 and concentration bounds we know a large fraction of nodes in are in .
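The effect described in the last two paragraphs can be seen on a toy graph. The following Python sketch (with the hypothetical helper `remove_clique_edges`) deletes the edges inside cliques that have already been found and shows that the remaining neighborhood of a node can miss a member of a smaller clique when that member also lies in the larger clique. It illustrates only the edge-removal step, not the full algorithm.

```python
def remove_clique_edges(adj, cliques):
    """Return a copy of `adj` with all edges inside the given cliques removed."""
    remaining = {v: set(neighbors) for v, neighbors in adj.items()}
    for clique in cliques:
        members = set(clique)
        for v in members:
            remaining[v] -= members
    return remaining


# Large clique {0, 1, 2, 3} and small clique {0, 1, 4}; they share the edge (0, 1).
example = {
    0: {1, 2, 3, 4},
    1: {0, 2, 3, 4},
    2: {0, 1, 3},
    3: {0, 1, 2},
    4: {0, 1},
}
after = remove_clique_edges(example, [{0, 1, 2, 3}])
print(after[0])  # node 1 is gone although it belongs to the small clique
```

After the large clique is removed, the remaining neighborhood of node 0 is only `{4}`, even though node 1 is also in the small clique. This is the gap that the additional enumerated set is meant to bridge.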
We show that if the algorithm correctly finds all cliques of size larger than , then an iteration of the WHILE loop at line 2 will correctly find all communities with size between and with probability . The theorem then follows from a union bound.
Fix a community of size between and , and assume a node has already been chosen at step 3. Let be a random subset of of size . By Assumption 4 we know even after ignoring all edges of larger size, the number of remaining edges from any to is at least a fraction, thus . A random set of size intersects any set of size with probability , therefore in expectation contains a fraction of . Since we are enumerating over all sets we can now assume is such that contains at least fraction of .
Now similar to Theorem 1 it is easy to check that the following statements hold with high probability (i) the subsampling gives a sample of size at most thrice the expectation. (ii) For any node there is a set of size at least in that is not connected to . Suppose we take to be the intersection of community and subsampled nodes, then we have (iii) . This is because by Assumption 3’ the size of is bounded by and each of the nodes outside has only probability of being in .
The last event implies is really close to . Now in graph each has degree ; each has degree at least . This gap enables the algorithm to use a threshold of to distinguish whether is in the community or not. The set will be equal to .
Finally by the Gap Assumption we know during the greedy extension of step 11, we can only include nodes in and in fact will include all nodes in . Therefore and the community is found with high probability.
The running time of the algorithm is dominated by the round when . At that round on the size of the subsampled set is at most , we want to find a set of size . Thus the running time is
We leave it as an open problem to identify a reasonable set of assumptions that allow polynomial time when communities are dense subgraphs. The problem is that the “duck assumption” is not well defined: we know what a clique looks like, but it is hard to tell whether a subgraph “looks like” a community generated according to Assumption 1 when there are overlapping communities and ambient edges. We could try to make a stronger “duck assumption” by assuming every large - set is a community, and then a similar algorithm will be able to find all dense communities in polynomial time. But this is not as reasonable as our other assumptions: consider two -sets and of size whose intersection has size ; then it is quite likely that their union is a set, but we do not consider this set a community.
4 Relaxing the Assumptions
4.1 Relaxing Assumption 1
Assumption 1 states that each community’s edges are generated according to an expected degree model. In this section we note that the algorithms and proofs of Theorems 4 and 5 actually apply to a more general setting.
We first note that we can substantially relax the Dense-Similar-Size Model by replacing Assumption 1 with the following two requirements:
Concentration: the number of edges from any node to any community is concentrated around the expectation, that is, , and the degree of each node is concentrated similarly.
-Regularity: for all ,
These properties do not require full independence, but only limited independence, which could be satisfied, for example, by the configuration model, which generates a multigraph with (nearly) any particular preassigned degree distribution. This definition could also accommodate additional structure that introduces dependencies among the edges as long as there is still sufficient independence to satisfy Concentration and Regularity. Consider, for instance, the disjoint union of two equal-sized cliques with a random bipartite graph of density between them. This is -regular for any . Thus communities can be much more clumpy than in the expected degree model.
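The clumpy example above can be generated with a few lines of Python. This is only a sketch of the construction: the name `two_cliques_with_bridge`, the parameter `p` (standing in for the density of the bipartite part) and the fixed random seed are all assumptions, and the code does not verify regularity.

```python
import random


def two_cliques_with_bridge(half, p, seed=0):
    """Two cliques on `half` nodes each, joined by a random bipartite graph.

    Nodes 0..half-1 form one clique and half..2*half-1 the other. Each
    cross pair is connected independently with probability `p`.
    """
    rng = random.Random(seed)
    n = 2 * half
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            same_side = (u < half) == (v < half)
            if same_side or rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj


graph = two_cliques_with_bridge(50, 0.3)
degrees = [len(neighbors) for neighbors in graph.values()]
print(min(degrees), max(degrees))
```

Each node has 49 guaranteed neighbors inside its own clique, and its cross degree is a sum of independent coin flips, so the degrees stay close to one another even though the edges are far from identically distributed.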
Theorem 4 still holds with the same proof after replacing Assumption 1 with Concentration and -regularity.
Theorem 5 still holds with the same proof after replacing Assumption 1 with Concentration.
4.2 Relaxing Assumption 2–the Gap Assumption
Though plausible, the Gap Assumption may not exactly hold in a real-life graph since there may be nodes that fall in the “gap.” Our algorithms should still return sensible answers. We now argue that the algorithms of Theorem 1 and Theorem 3 produce sensible answers even when this happens. We use the Clique-Community-Find-Algorithm (Theorem 1) to illustrate.
Of course in this setting we cannot hope to return the exact communities. Instead the algorithm will return some that contains more than a fraction of and has density more than .
Theorem 7. If is a graph that satisfies Assumptions 1 and 3, and each community is a clique that has size between and where is some constant and is arbitrary but known to the algorithm, then the Clique-Community-Find-Algorithm can be adapted so that for any community , the algorithm finds a set such that and for each , the number of edges to is at least .
Proof: The idea is to run the Clique-Community-Find-Algorithm as before. Once we get corresponding to setting (recall that under the assumptions of Theorem 1 this contains and has only nodes outside ), we know with high probability consists of 3 parts: the community itself, some set of nodes that have more than edges to , and some set of nodes that have no more than edges to . Of these 3 parts, the first is what we want, the third is very small by the proof of Theorem 1, and so only the second part worries us. Among these three parts, the nodes in should tend to have the largest degrees, so we will use the degree of these nodes to identify them. For any node in , we call the density of . The idea will be to run Clique-Community-Find-Algorithm with some parameter that is somewhat smaller than to get . Then repeatedly remove nodes from that have density less than until the density of all remaining nodes is more than . We will show by a simple calculation that not many nodes in will be removed. The details are as follows:
Run Clique-Community-Find-Algorithm with .
Once the algorithm has produced the set , repeatedly remove all nodes of density less than until for every , the density is at least .
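The pruning step of this procedure can be sketched in Python as follows. The sketch makes two assumptions that are not spelled out here: the density of a node is taken to be the fraction of the other nodes in the current set that it is adjacent to, and each round removes all low-density nodes at once before densities are recomputed. The names `prune_low_density` and `threshold` are hypothetical.

```python
def prune_low_density(adj, candidate, threshold):
    """Repeatedly drop nodes whose density in the current set is below `threshold`.

    `adj` maps nodes to neighbor sets; `candidate` is the set produced by an
    earlier step. Returns the set that remains once every node passes.
    """
    current = set(candidate)
    while current:
        others = len(current) - 1
        low = {v for v in current if len(adj[v] & current) < threshold * others}
        if not low:
            break
        current -= low
    return current


# Clique {0, 1, 2, 3} plus two loosely attached nodes 4 and 5.
example = {
    0: {1, 2, 3, 4, 5},
    1: {0, 2, 3, 5},
    2: {0, 1, 3},
    3: {0, 1, 2},
    4: {0},
    5: {0, 1},
}
print(prune_low_density(example, {0, 1, 2, 3, 4, 5}, 0.5))
```

In this example the first round removes nodes 4 and 5, and the second round finds nothing left to remove, so the clique survives intact.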
Let , where is the community, are the nodes that are connected to more than a fraction of and are the nodes that are connected to at most a fraction of . By the proof of Theorem 1 , thus it is very small and can be ignored in the computation below.
The size of is at most , because each node in it is adjacent to the node in the algorithm. We shall show that (i) each iteration removes at most an fraction of nodes in , and (ii) if any iteration removes less than half of the nodes, each node in the remaining graph will be connected to at least a fraction of nodes. Claim (ii) implies that we will have at most iterations and then Claim (i) implies that at most an fraction of nodes in are removed.
For (i), notice that as long as less than half of is removed, the fraction of edges from any node in to the remaining part of is at least . Thus the average density of nodes in is at least . By Markov’s inequality the fraction of nodes that have density less than is at most .
For (ii), notice that all remaining nodes had density , if less than half of the nodes are removed, their density should still be larger than .
Notice that since the used in Clique-Community-Find-Algorithm is now , the running time of the algorithm will be .
Similar ideas can be used to relax the gap assumption in Theorem 3. The main difficulty of applying the same argument is that the edges not from community membership can be adversarially chosen. However in real life graphs this is not likely to happen: if two people do not share any community the probability that they know each other should be lower. If for all edges we have the probability , then for any community the following algorithm can always find some set with density at least that contains at least a fraction of :
Run Robust-Dense-Community-Find algorithm with .
Once the algorithm has produced the set , repeatedly remove all nodes of density less than until the density of every node is at least .
The proof is very similar to Theorem 7. First we focus only on the expected degree of nodes. Since we can normalize these probabilities by multiplying by . Now for nodes in a community, and for all pairs , . Thus the same argument as in Theorem 7 shows that the algorithm works if we are given the true values of . Then we argue that the algorithm should also work even if we are just given a random graph , because the algorithm only uses the degree of nodes in various sets and they all concentrate around their expectation. There are some technicalities here when many nodes have expected degree very close to the threshold we are setting, which can be resolved if we choose a random threshold between and .
5 Sparser Communities
In previous sections we have been talking about communities as dense graphs. This is natural when the community considered is small and people inside are closely related, such as people in the same year and department in a university. However this may not be true for larger communities: if we consider all students in a large university, or even all computer scientists, then it is unlikely that every person knows a constant fraction of other people in the community. In this section we show how our ideas can be applied to communities that are not so dense.
Consider a simple set of assumptions (“Sparse”) where the affinities for a community . That is, two people in the same community of size know each other only with probability . We assume the network satisfies Assumption 1 from Section 1.1 as well as the following:
(1) Each community has size , for any two nodes , in the same community the probability where is a constant larger than 10. (2) There are no ambient edges. (3) The intersection of any two communities has size at most .
Notice that we do not need to require the gap assumption nor the duck assumption here because they are both implied by property 3 and the fact that every edge is in some community.
Instead of giving an algorithm to find communities under the “Sparse” assumptions, we show that it can be transformed to a graph such that , fulfill the Dense-Similar-Size Assumptions. Then we can directly apply the Robust-Dense-Community-Find algorithm of Theorem 4 to find the communities.
Let a graph and a set of communities be consistent with the Sparse Model. Construct a graph on the same set of nodes, where has an edge in if and only if they have at least length-2 paths in . Then the pair are consistent with Dense-Similar-Size Assumptions with parameters .
Proof: We rely on the relaxation of the Dense-Similar-Size model in Section 4.1 using the concentration and (, )-regularity.
For concentration, focus on one community , notice that once we fix all the edges adjacent to , the probability that has more than length-2 paths in are independent for different ’s in the same community. This is because the number of length-2 paths is completely determined by the number of edges from to , and these are disjoint sets for different ’s. Moreover, by symmetry the probability only depends on the number of edges adjacent to in community . Thus once we fix the degree of inside in graph , all the edges where are independent and they satisfy a Chernoff bound. The degree of itself is also concentrated.
For (, )-regularity, we show that for any , the size of their intersection within is also concentrated. Just consider randomly choosing and , with high probability both their sizes and the size of their intersection are close to the expectation. In this case whether some node has many length-2 paths to or is also independent (because the relevant edges are disjoint for different ). Chernoff bounds imply the concentration.
For the probability of edges , the expected number of length-2 paths between and in the same community is at least , by Chernoff bound the probability that the number drops down to half of expectation is smaller than 0.1 when .
For Assumption 3 in Section 1.1, for each node , if it is in communities, by the calculation above the expected number of community edges in is at least (here the first 0.9 is the probability of an edge within the community, the second 0.9 is because the communities may overlap, however by property 3 the overlap must be small). The expected number of length-2 paths starting from in is at most , thus the expected number of edges of in must be smaller than a fraction of this, which is . Since the number of ambient edges is small.
Finally, we would like to show the gap assumption: if , the expected number of edges from to in is small. To do that we only need to show the expected number of length-2 paths from to in is small. We divide the length-2 paths from to into two cases: those that enter at the first step and those that enter at the second step. For the first type, the expected size of is only , thus the number of length-2 paths from to where the first step is inside is at most . For the second type, in the second step each node only has at most expected edges to , thus the total expected number of length-2 paths is at most . Combining the two cases, we know the expected number of length-2 paths from to is at most , thus the expected number of edges in is at most .
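The transformation used in this section, which connects two nodes when they have enough length-2 paths between them, is simple to state in code. The Python sketch below assumes a simple undirected graph, so the number of length-2 paths between two nodes equals the number of their common neighbors. `two_path_graph` and `min_paths` are hypothetical names, and `min_paths` stands in for the path-count threshold in the construction.

```python
from itertools import combinations


def two_path_graph(adj, min_paths):
    """Build the graph with an edge (u, v) iff u and v have at least
    `min_paths` common neighbors in `adj` (length-2 paths between them)."""
    new_adj = {v: set() for v in adj}
    for u, v in combinations(sorted(adj), 2):
        if len(adj[u] & adj[v]) >= min_paths:
            new_adj[u].add(v)
            new_adj[v].add(u)
    return new_adj


# A 4-cycle 0-1-2-3-0: opposite corners share two common neighbors.
example = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(two_path_graph(example, 2))
```

The sketch checks all pairs of nodes and is therefore quadratic in the number of nodes. For a sparse input it would be cheaper to enumerate the two-hop neighborhood of each node instead.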
We introduced a framework for rigorously thinking about community structure that allows (a) overlapping communities (b) includes well-studied notions such as cliques and dense subgraphs as subcases and (c) yet allows efficient algorithms for recovering the communities. Our assumptions lie between worst-case and average-case, are based on a long line of research, and we suspect they hold in many generative models. Our sampling-based techniques infer global structure (socio-centric analysis) from the neighborhood of vertices (ego-centric analysis). This local versus global framework, familiar in computer science, may be useful in other settings in sociology, especially because ego-centric networks are empirically observed to be dense and thus amenable to our techniques. We think our techniques should meld well with existing heuristics and plan to do a performance study on real-world data, and also to test the validity of our assumptions.
Weakening our assumptions is another promising direction, and we made a start in Section 4. The Gap Assumption (Assumption 2) makes intuitive sense but probably cannot be guaranteed for all network nodes (e.g., there will be an occasional node that knows more community members than some particular member, yet is not a community member). Our use of the expected degree model for the intracommunity edges (Assumption 1) can be weakened somewhat, but it is still a static model.
Arguably community evolution is a dynamic process that results in a more intricate, and possibly hierarchical structure. Researchers have started considering two-step models. For example the first step could generate an initial graph according to our assumptions and in the second step each node connects to each neighbor of a neighbor with some small probability. Making these models amenable to efficient community-detection is a good open problem.
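The second step of such a two-step model can be sketched as follows. This is one possible reading of the description above, not a model taken from the literature: the function name `close_triangles`, the probability `q`, and the choice to sample each (node, neighbor-of-neighbor) pair independently from the initial graph are all assumptions.

```python
import random


def close_triangles(adj, q, rng=None):
    """Connect each node to each neighbor of a neighbor with probability `q`.

    Two-hop neighborhoods are computed on the initial graph `adj`, so edges
    added during this step do not trigger further closures.
    """
    rng = rng or random.Random(0)
    new_adj = {v: set(neighbors) for v, neighbors in adj.items()}
    for u in sorted(adj):
        two_hop = set()
        for w in adj[u]:
            two_hop |= adj[w]
        two_hop -= adj[u] | {u}
        for v in sorted(two_hop):
            if rng.random() < q:
                new_adj[u].add(v)
                new_adj[v].add(u)
    return new_adj


# A path 0-1-2: with q = 1.0 the missing edge (0, 2) is always added.
path = {0: {1}, 1: {0, 2}, 2: {1}}
print(close_triangles(path, 1.0))
```

Because every pair at distance two is considered once from each endpoint, the effective probability of adding an edge is slightly higher than `q`. A more careful model would have to fix this convention.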
We would like to thank Nikhil Srivastava for contributions in the early part of this work that helped this project find its final direction. Thanks to Balcan et al. for giving us a manuscript of their independent work . We also would like to thank Bernie Hogan for useful consultations about the sociology literature.
-  E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9:1981–2014, 2008.
-  R. D. Alba. A graph-theoretic definition of a sociometric clique. Journal of Mathematical Sociology, 3:3–113, 1973.
-  S. Arora, D. Karger, and M. Karpinski. Polynomial time approximation schemes for dense instances of NP-hard problems. Journal of Computer and System Sciences, 58:193–210, 1995.
-  M.-F. Balcan, C. Borgs, M. Braverman, J. Chayes, and S.-H. Teng. I like her more than you: Self-determined communities. Manuscript, Fall 2011.
-  A. Barabasi and R. Albert. Emergence of scaling in random networks. Science, (286):509–512, 1999.
-  C. Borgs, J. Chayes, J. Ding, and B. Lucier. The hitchhiker’s guide to affiliation networks: A game-theoretic approach. In Proceedings of the 2nd Symposium on Innovations in Computer Science (ICS 2011), 2011.
-  R. L. Breiger. The duality of persons and groups. Social Forces, 53(2):181–190, 1974.
-  R. S. Burt. Structural Holes, volume 137. Harvard University Press, 1992.
-  R. S. Burt. Structural holes and good ideas. American Journal of Sociology, 110(2):349–399, 2004.
-  D. Centola and M. Macy. Complex contagions and the weakness of long ties. American Journal of Sociology, 113(3):702–734, 2007.
-  D. Eppstein, M. Löffler, and D. Strash. Listing all maximal cliques in sparse graphs in near-optimal time. Algorithms and Computation, 6506:403–414, 2010.
-  S. L. Feld. The focused organization of social ties. The American Journal of Sociology, (5):1015–1035, March 1981.
-  C. Fischer. To Dwell Among Friends. University of California Press, 1982.
-  A. Frieze and R. Kannan. Quick approximation to matrices and applications. Combinatorica, 19(2):175–220, 1999.
-  A. Friggeri, G. Chelius, and E. Fleury. Triangles to capture social cohesion. Social Science, 2011.
-  D. Gibson, J. M. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamical systems. VLDB J., 8(3-4):222–236, 2000.
-  M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821–7826, 2002.
-  O. Goldreich, S. Goldwasser, and D. Ron. Property testing and its connection to learning and approximation. Journal of the ACM, 45(4):653–750, 1998.
-  M. S. Granovetter. The strength of weak ties: A network theory revisited. Sociological Theory, 1(1983):201–233, 1983.
-  B. Hogan. Pinwheel layout to highlight community structure. Visualization Symposium, March 2010.
-  B. Hogan, J. A. Carrasco, and B. Wellman. Visualizing personal networks: Working with participant-aided sociograms. Field Methods, 19(2):116–144, 2007.
-  J. Hopcroft, O. Khan, B. Kulis, and B. Selman. Natural communities in large linked networks. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’03, pages 541–546, New York, NY, USA, 2003. ACM.
-  M. O. Jackson. Social and Economic Networks. Princeton University Press, 2008.
-  R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. Stochastic models for the web graph. Proceedings 41st Annual Symposium on Foundations of Computer Science, 41:57–65, 2000.
-  S. Lattanzi and D. Sivakumar. Affiliation networks. In Proceedings of the 41st annual ACM symposium on Theory of computing, STOC ’09, pages 427–434, 2009.
-  J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 6(1):29–123, 2008.
-  A. Marin and B. Wellman. Social network analysis: An introduction. Book forth-coming, chapter available on-line.
-  F. McSherry. Spectral partitioning of random graphs. In FOCS, pages 529–537, 2001.
-  N. Mishra, R. Schreiber, I. Stanton, and R. E. Tarjan. Clustering social networks. Social Networks, 4863:56–67, 2007.
-  M. Rabinovich. Undergraduate Independent Work, Fall 2011.
-  S. Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75–174, 2010.
-  J. Scott. Social Network Analysis: A Handbook. Sage Publications Ltd, 2nd edition, 2000.
-  B. Wellman. The community question: The intimate networks of East Yorkers. American Journal of Sociology, 84(5):1201–1231, 1979.
-  B. Wellman. The network is personal: Introduction to a special issue of Social Networks. Social Networks, 29(3):349–356, 2007. Special Section: Personal Networks.
-  B. Wellman, B. Hogan, K. Berg, J. Boase, J.-A. Carrasco, R. Côté, J. Kayahara, T. L. M. Kennedy, and P. Tran. Connected lives: The project 1. interactions, pages 1–50, 2005.
-  H. C. White, S. A. Boorman, and R. L. Breiger. Social structure from multiple networks. I. Blockmodels of roles and positions. American Journal of Sociology, 81(4):730–780, 1976.
-  J. Xie, S. Kelley, and B. K. Szymanski. Overlapping Community Detection in Networks: the State of the Art and Comparative Study. ArXiv e-prints, Oct. 2011.