Community detection and tracking on networks from a data fusion perspective

Community detection and tracking on networks from a data fusion perspective

James P. Ferry
Metron, Inc.
Reston, VA, U.S.A.
   J. Oren Bumgarner
Metron, Inc.
Reston, VA, U.S.A.

Abstract - Community structure in networks has been investigated from many viewpoints, usually with the same end result: a community detection algorithm of some kind. Recent research offers methods for combining the results of such algorithms into timelines of community evolution. This paper investigates community detection and tracking from the data fusion perspective. We avoid the kind of hard calls made by traditional community detection algorithms in favor of retaining as much uncertainty information as possible. This results in a method for directly estimating the probabilities that pairs of nodes are in the same community. We demonstrate that this method is accurate using the LFR testbed, that it is fast on a number of standard network datasets, and that it is has a variety of uses that complement those of standard, hard-call methods. Retaining uncertainty information allows us to develop a Bayesian filter for tracking communities. We derive equations for the full filter, and marginalize it to produce a potentially practical version. Finally, we discuss closures for the marginalized filter and the work that remains to develop this into a principled, efficient method for tracking time-evolving communities on time-evolving networks.

Keywords: Community detection, community tracking, Bayesian filter, co-membership probability, dynamic stochastic blockmodel.

1 Introduction

The science of networks has a large and multidisciplinary literature. Freeman traces the sociological literature on networks from its pre-cursors in the 1800s and earlier, through the sociometry of the 1930s and Milgram’s “Small Worlds” experiments in the 1960s, to its current form [29]. Sociologists and statisticians introduced the idea of defining network metrics: simple computations that one can perform on a network, accompanied by arguments that explain their significance: e.g., the clustering coefficient and various measures of network centrality [77]. What Lewis calls the “modern period” of network science [50] began in 1998 with the influx of physicists into the field (e.g., Barabási and Newman). The physicists brought novel interests and techniques (power laws, Hamiltonians, mean field approximation, etc.), particularly from statistical physics, along with an overarching drive toward universality—properties of network structure independent of the particular nature of the nodes and links involved [55]. Mathematicians have their own traditions of graph theory [9], and, in particular, random graph theory [37, 10] which emphasizes rigorously formulated models and what properties the graphs they produce have (with high probability) in different asymptotic regions of the models’ parameter spaces. Finally, computer scientists have developed a wide variety of efficient network algorithms [15], and continue to contribute broadly because ultimately the processing of data into usable results is always accomplished via an algorithm of some kind, and because solid computer science is needed for processing megascale, real-world networks.

Each of the above communities brings important, complementary talents to network science. The data fusion community has important perspectives to offer too, due both to the broad range of practical issues that it addresses, and to characteristics of the mathematical techniques it employs [2].

The defining problem of data fusion is to process data into useful knowledge. These data may be of radically different types. One might consider a single data point to be, e.g., the position estimate of a target, a database record containing entries for various fields, an RDF triple in some specified ontology, or a document such as an image, sound, or text file. Data mining deals with similar issues, but focuses on the patterns in and transformations of large data sets, whereas data fusion focuses on distilling effective situational awareness about the real world. A central paradigm in data fusion is the hierarchy of fusion levels needed to transform raw data into high-level knowledge, the most widespread paradigm being the JDL (Joint Directors of Laboratories) hierarchy [69]. In this paradigm, level 0 comprises all the pre-processing that must occur before one can even refer to “objects.” In some cases it is clear that a data point corresponds to some single object, but it is unclear which object: this is an entity resolution problem [6]. In other cases, a data point contains information about multiple objects, and determining which information corresponds to which object is a data association problem [53, 25]. Processing speech or images requires solving a segmentation problem to map data to objects [30], and natural language processing involves a further array of specific techniques (Named Entity Recognition, etc.). One benefit of a data-fusion approach to network science is its careful consideration, at level 0, of how to abstract representations from raw data. In the network context, this applies not just to nodes (i.e., objects), but to the links between them. Sometimes one imposes an arbitrary threshold for link formation; sometimes multi-node relationships (i.e., hypergraph edges) are replaced with edges between all nodes involved. Edges can have natural direction, weights, or types (and nodes may have attributes) that are retained or ignored in a graph representation. When data are inappropriately shoehorned into a network format, or important node or link attributes are ignored, then the results derived from that graph representation may be much less powerful than they could be, or even completely misleading.

Level-0 data fusion encompasses these pre-processing techniques, drawn from computer science, data mining, and the domains specific to the data being considered. Higher-level fusion (say, JDL levels 3–5) addresses another set of issues important to a complete theory of network science. These issues relate to human knowledge and intent. Just as level-0 fusion has similarities with the computer-science approach to networks, higher-level fusion has some overlap with the sociological approach. Levels 1 and 2, on the other hand, correspond loosely to the more theoretical approaches of mathematics and physics. Level 1 addresses the detection, state estimation, and tracking of individual objects [17]; whereas level 2 broadens the scope to tracking groups of objects [22] and to the general assessment of multiple-object situations [7]. In data fusion, however, the overriding problem is how to achieve cohesion between the various levels [32]. Achieving such cohesion would be a valuable contribution to network science.

This paper addresses a specific network problem from the data fusion perspective. Over the past decade, there has been a great deal of work on the community detection problem [28]: discerning how a graph’s nodes are organized into “communities.” There is no universally accepted definition of community structure: it can correspond to some unobserved, ground-truth organizational structure; it can refer to some attribute that nodes share that drives them to “flock” together [52]; or communities can be defined as sets of nodes more densely connected to each other than to the rest of the graph. Whatever the definition of community structure, it nearly always results in communities being densely connected subsets of nodes (the Newman–Leicht algorithm being a notable exception [57]). In practice, studies of community structure in graphs (e.g., [45]) define a community to be, in effect, the output of a community detection algorithm. Weighted and/or directed edges are allowed in some methods, but accounting for more general features on nodes and/or edges is problematic for network research because this information tends to be domain-specific.

Community detection is nearly always formulated in terms an algorithm which ingests a network and outputs some indication of its community structure. With a few exceptions, community detection algorithms produce a single, hard-call result. Most often this result is a partition of nodes into non-overlapping communities, but a few algorithms produce overlapping communities (e.g., CFinder [60]), and some produce a dendrogram—i.e., a hierarchy of partitions [66]. The dominant framework for finding the best partition of nodes is to specify some quality function of a partition relative to a graph and seek to maximize it. Methods that maximize modularity [56] (explicitly or implicitly) are among the most numerous and successful today.

From a data fusion perspective, however, it is important to assess the uncertainty associated with community detection. Quality functions such as modularity are only motivated by intuition or physical analogy, whereas probability is the language of logical reasoning about uncertainty [38]. The reason principled fusion of disparate data types is possible is that one can posit an underlying model for reality, along with measurement models that specify the statistics of how this reality is distorted in the data. One can then update one’s prior distribution on reality to a posterior via Bayesian inference [76].

There are some methods that formulate community detection as an inference problem: a prior distribution over all possible community structures is specified, along with a likelihood function for observing a graph given a community structure. Hastings, for example, formulated the community detection problem in terms of a Potts model that defines the Hamiltonian for a given graph–partition pair, and then converted this to a probability proportional to  [33]. Minimizing therefore yields the MAP (Maximum A posteriori Probability) partition for given structural parameters of the Potts model. Hofman and Wiggins extended this approach by integrating the structural parameters against a prior [34]. In both cases, if all one does with the posterior probability distribution is locate its maximum, then it becomes, in effect, just another quality function (albeit a principled one). On the other hand, the entire posterior distribution is vast, so one cannot simply return the whole thing. The question, then, is what such a probability distribution is good for.

Clauset et al. made greater use of the posterior distribution by devising a Monte Carlo method for generating dendrograms, and using it to estimate the probabilities of missing links [12]. Reichardt and Bornholdt employed a similar Monte Carlo method to estimate the pairwise co-membership probabilities between nodes [61, 62], where is defined to be the probability that nodes and are in the same community. The set of all is much smaller than the full posterior distribution, and thus provides a useful, if incomplete, summary of the uncertainty information. It is expensive to compute exactly, however. Therefore we will derive an accurate approximation with which to summarize uncertainty information for community structure more efficiently than Monte Carlo methods.

A key benefit of retaining uncertainty information is that it enables principled tracking [70]. We may track time-varying communities in time-varying graph data by deriving an efficient Bayesian filter for tracking time-varying communities from time-varying graph data. The term “filter” is somewhat strange in this context: the original, signal-processing context of filters (e.g., the Wiener filter [78]) was that of algorithms which filter out noise in order to highlight a desired signal. The Kalman filter changed this framework to one of distinct state and measurement spaces [40]. This was soon generalized to the concept of a Bayesian filter [39]. To develop a Bayesian filter, one constructs (a) an evolution model over a state space that specifies the probability distribution of the state at some future time given its value at the current time, and (b) a measurement model that specifies the probability distribution of the current measurement given the current state. Thus, despite the connotations of the word “filter,” a Bayesian filter can have quite different state and measurements spaces. To track communities, a model for the co-evolution of graphs and community structure will be constructed, and the measurement model will be that only the graph component of the state is observable.

In Section 2 we derive exact inference equations for the posterior probabilities of all possible community structures for a given graph. This result is essentially the same as can be found elsewhere (e.g., [33, 34, 41]), but is included here in order to introduce notation and clarify subsequent material. In Section 3 we derive an approximation of the co-membership probabilities based on using only the most important information from the graph. The matrix provides the uncertainty information that the usual hard-call algorithms lack. In Section 4 we demonstrate that the approximation is accurate and also surprisingly efficient: despite the fact that it provides so much information, it is significantly faster than the current, state-of-the-art community detection algorithm (Infomap [64, 43]). We also demonstrate the uses for this alternative or supplemental form of community detection, which are embodied in the software IGNITE (Inter-Group Network Inference and Tracking Engine). One benefit of maintaining uncertainty information is that it allows principled tracking. In Section 5 we present a continuous-time Markov process model for the time-evolution of both the community structure and the graph. We then derive an exact Bayesian filter for this model. The state space for this model is far too large to use in practice, so in Section 6 we discuss efficient approximations for the exact filter. The community tracking material is less developed than the corresponding detection material: there are several issues that must be resolved to develop accurate, efficient tracking algorithms. However, we believe that the principled uncertainty management of the data fusion approach provides a framework for the development of more reliable, robust community tracking methods.

2 Community Detection: Exact Equations

Suppose that out of the space of all possible networks on nodes we are given some particular network . If we have some notion of a space of all possible “community structures” on these nodes, then presumably the network provides some information about which structures are plausible. One way to formalize this notion is to stipulate a quality function that assigns a number to every network–structure pair . It would be natural, for example, to define quality as a sum over all node pairs of some metric for how well the network and community structure agree at . That is, in the network , if is a link (or a “strong” link, or a particular kind of link, depending on what we mean by “network”), then it should be rewarded if places and in the same community (or “nearby” in community space, or in communities consistent with the observed link type, depending on what we mean by “community structure”). Modularity is a popular, successful example of a quality function [56]. Quality functions are easy to work with and can be readily adapted to novel scenarios. However, the price of this flexibility is that unless one is guided by some additional structure or principle, the choice of quality function is essentially ad hoc. In addition, the output of a quality function is a number that provides nothing beyond an ordering of the community structures in . The “quality” itself has little meaning.

One way to give quality functions additional meaning is to let them represent an energy. In this case, the quality function may be interpreted as a Hamiltonian. The qualities assigned to various community structures are no longer arbitrary scores in this case: meaningful probabilities can be assigned to community structures can be computed from their energies. The language of statistical physics reflects the dominance of that field in network science [28], but from a fusion standpoint it is more natural to dispense with Hamiltonians and work directly with the probabilities. A probabilistic framework requires models: these necessarily oversimplify real-world phenomena, and one could argue that specifying a model is just as arbitrary as specifying a quality function directly. However, the space of probabilistic models is much more constrained than the space of quality functions, and, more importantly, formulating the problem in terms of a formal probability structure allows for the meaningful management of uncertainty. For this reason, modularity and other quality functions tend to be re-cast in terms of a probability model when possible. For example, the modularity function of Newman and Girven [56] was generalized and re-cast as the Hamiltonian of a Potts model by Reichardt and Bornholdt [62], while Hastings demonstrated that this is essentially equivalent to inference (i.e., the direct manipulation of probability) [33].

A probabilistic framework for this community structure problem involves random variables for the graph and for the community structure. We require models for the prior probabilities for all and for the conditional probability for all and . (We will typically use less formal notation such as and when convenient.) Bayes’ theorem then provides the probability of the community structure given the graph data . The models and typically have unknown input parameters , so that the probability given by Bayes’ theorem could be written . This must be multiplied by some prior probability over the parameter space and integrated out to truly give  [34]. A simpler, but non-rigorous, alternative to integrating the input parameters against a prior is to estimate them from the data. This can be accurate when they are strongly determined by the graph data: i.e., when is tightly peaked. The issue of integrating out input parameters will be addressed in Section 3, but for now we will not include them in the notation.

Section 2.1 will derive using a stochastic blockmodel [20] with multiple link types for . In Section 2.2, this will be simplified to the special case of a planted partition [14] model in which links are only “on” or “off.”

2.1 Stochastic blockmodel case

Let denote the number of communities, and be the number of edge types. The notation will denote the set of integers , and denote the zero-indexed set . We will let denote the set of nodes; , the set of communities; and , the set of edge types. Let denote the set of (unordered) pairs of a set so that denotes the set of node pairs. It is convenient to consider to be the set of edges: because there are an arbitrary number of edge types , one of them (type ) can be considered “white” or “off.” Thus, all graphs have edges, but in sparse graphs most of these are the trivial type .

The community structure will be specified by a community assignment , i.e., a function that maps every node to a community . The graph will be specified as a function , which maps every edge to its type . (This unusual notation will be replaced with the more usual when dealing with the case: i.e., when there is only edge type aside from “off.”)

The stochastic blockmodel is parametrized by the the number of nodes , the stochastic -vector , and a collection of stochastic -vectors  [41]. Here “stochastic -vector” simply means a vector of length whose components are non-negative and sum to one. The vector comprises the prior probabilities of a node belonging to the community —the communities for each node are drawn independently. For , the vector comprises the probabilities of an edge between nodes in communities and being of type —the types of each edge are drawn independently once the communities of the nodes are given. (For , let : i.e., the edges are undirected.) The model defines the random variables and whose instances are denoted and , respectively. The derivation of proceeds in six steps.

Step 1. The probability that a node belongs to the community is, by definition,


Step 2. The probability that an instance of is the community assignment equals


because the communities of each node are selected independently.

Step 3. For a fixed value of , the probability that the edge has type is, by definition,


Step 4. For a fixed value of of , the probability that an instance of is the graph equals


because the types of each edge are selected independently given .

Step 5. The probability of a specific assignment and graph equals


because .

Step 6. Finally, the posterior probability of for a given graph is


where the constant of proportionality is .

2.2 Planted partition case

In many applications one does not have any a priori knowledge about specific communities. In such cases, the community labels are arbitrary: the problem would be unchanged if the communities were labeled according to another permutation of . Thus, if one has a prior distribution over and (as in [34]), then that distribution must be invariant under permutations of . In the case of fixed input parameters and , this translates to and themselves being invariant under permutations. Making this simplification, and considering only edge types (“off” () and “on” ()) yields the special case called the planted partition model [14]. In this case, symmetry implies that for all , and that for and for . Here denotes the edge probability between nodes in the same community, and , the edge probability between nodes in different communities. Thus, the input parameters of reduce to only four to give the planted partition model .

Having only two edge types suggests using the standard notation to denote a graph, with denoting the set of (“on”) edges. The symmetry of the community labels implies that is invariant under permutations of , so that is more efficient to formulate the problem in terms of a partition of the nodes into communities rather than (because partitions are orbits of community assignments under permutations of ). We may then replace (2.5) by


Here denotes the number of (non-empty) communities in the partition , and denotes the falling factorial , which counts the number of assignments represented by the equivalence class . The number of edges between nodes in the same community is denoted (abbreviated to in (2.7)), and the number of non-edges (or “off” edges) between nodes in the same community is denoted . The analogous quantities for nodes in different communities are and . The posterior probability is proportional to .

3 Community Detection: Approximate Methods

Community detection methods that employ quality functions return hard calls: an optimization routine is applied to determine the community structure that maximizes the quality over all for a given graph . There is little else one can do with a quality function: one can return an ordered list of the -best results, but a probability framework is required to interpret the relative likelihoods of these.

In contrast, the formulas (2.5) and (2.7) provide the information necessary to answer any statistical question about the community structure implied by . Unfortunately, an algorithm that simply returns the full distribution is grossly impractical. The number of partitions of nodes is the Bell number , which grows exponentially with : e.g., . What, then, are these probabilities good for? One answer is that the formula for posterior probability can be used as a (more principled) quality function [33]. Another is that Monte Carlo methods can be used to produce a random sample of solutions [61, 12]. These random samples can be used to approximate statistics of . In this section we will consider how such statistics might be computed directly.

3.1 Stochastic blockmodel

The most natural statistical question to ask is this: what is the probability that a node is in community ? We may express this probability as , where the dependence on the graph is suppressed from the notation. For the model , we may compute from (2.5):


Unfortunately, this exact expression does not appear to simplify in any significant way. (Ironically, its dynamic counterpart does simplify: cf. Section 6.)

A strategy for approximating is to use only the most relevant information in the graph. For example, we could divide the edges into two classes: those that contain and those that do not. Edges in the former class have more direct relevance to the question of which community belongs to. If we let denote the restriction of the graph to edges containing , and be the corresponding random variable, then we may approximate by . By Bayesian inversion this equals


This equation exploits the statistical distribution of edge types that tend to emanate from a given community: if is such that this information is distinctive, then (3.2) will perform well. However, because it assesses each node in isolation, it does not exploit network structure and will not perform well when fails to produce distinctive edge-type distributions.

If there were multiple, conditionally independent graph snapshots for a given ground-truth , then one could replace with in (3.2), and with , to get an updated value . One could initialize these values to the prior and apply the update equation for each graph snapshot : this would introduce communication between the results for individual nodes and thus exploit network structure. The approach in Section 6 is a more sophisticated version of this, which allows the temporal sequence of graphs to be correlated and nodes to move between communities.

To derive useful probabilistic information that exploits network structure rather than just the statistical characteristics of edge-type distributions we turn to the second-order statistics . To approximate this, we may divide the edges into three classes: the edge , the edges containing either or (but not both), and the edges containing neither. One gets a rather trivial approximation using only the single edge , but using edges from the first two classes yields the approximation . This quantity has a formula similar to (3.2):


(The version that uses only the single edge as evidence is given by omitting the final product in (3.3).) This formula provides important statistical information even when is completely symmetric. Indeed, to exploit it is simpler to work with the symmetric case.

3.2 Planted partition model

When and are symmetric under permutations of , then (3.1) reduces to (and (3.2) to ). This is because in the symmetric case community labels have no meaning, so first-order statistics become trivial. The simplest, non-trivial quantities to compute are the second-order statistics . In the symmetric case, they reduce to the single probability that and are in the same community: i.e., . To compute exactly requires a summation over all partitions. Reichardt and Bornholdt estimated the matrix by a Monte Carlo sampling of the partition space, but this is slow [62]. Instead of this, we may approximate directly by simplifying (3.3). This leads to fairly simple expressions. The meaning of these expressions is opaque, however, when derived through straightforward mathematical manipulations, which creates problems when trying to adapt the results to engineering contexts. Therefore we proceed along more general lines to demonstrate which aspects of the partition–graph model lead to which aspects of the resulting expressions.

Suppose instances of some random process are partition–graph pairs on nodes. This process is not necessarily : we will later take to be a somewhat more complex process in which the parameters , , and are first drawn from some distribution, and then an instance of is generated. Let be the indicator random variable for the event that and are in the same community (i.e., when and are in the same community, and 0 otherwise), and be the indicator random variable for the existence of an edge between and . Now let indicate the presence or absence of the edge in some given graph (i.e., if is an edge of , and 0 otherwise). Thus, the are data, rather than instances of . We define to be the indicator random variable for agreeing with this datum (i.e., if , and 0 otherwise. We may express as


Now let be the indicator random variable for agreeing exactly with on all edges containing and/or . We may express this as


The approximation to based on using only local graph information may then be written . This can be expressed as


where the likelihood ratio is given by


To evaluate we would like to use (3.5), requiring that have suitably favorable properties. If , then the random variables , and each of the for are conditionally independent given . E.g., if (i.e., and are in the same community), then with probability , independent of the values of any other . However, if is a process in which a parameter vector is first drawn from some distribution, and then a draw is made from some process , then assumption of conditional independence is far too restrictive. In such a case the existence of many edges elsewhere in the graph would suggest a large value of a parameter like , and hence a larger value of , so this random variable would not be conditionally independent of the other given .

This problem is easily overcome, however. We simply decompose the expected value into the conditional expectation for a specified value of , followed by an expectation over . E.g., we write as


We then stipulate that and each of the for are conditionally independent given and . Then


We may express the factors in the product in terms of a covariance:


We make the further assumption that is conditionally independent of given (which, again, holds for ). Then, using (3.4) we have


We introduce the following notation


In this symmetric scenario all quantities are invariant under node permutations. Thus is the probability that two randomly chosen nodes are in the same community (for fixed parameters ), and is the probability that two random chosen nodes have an edge between them. We write in terms of these quantities:




We may use this to express (3.9) as




Here denotes the number of nodes (aside from and ) adjacent to exactly of , and .

Now to compute we substitute (3.16) into (3.8). To evaluate the expectation of (3.16) requires a specific random graph model . We will use the following : we will select the number of communities in a manner to be discussed below, and select and uniformly from . Then we shall make a draw from to generate a partition–graph pair . For this model we have


as well as


Finally, the leading factor in (3.16) is


We may split the expectation into an integral over and followed by an expectation with respect to . Then (3.8) becomes


We may change coordinates from to for and to for . This introduces complications due to Jacobians and complicated regions of integration and , but it is helpful to be in the natural coordinate system of :


In the case, the range is transformed into the following region : for and for . Similarly, in the case it is transformed into the following region : for and for . To compute (3.24) and (3.25) numerically one would transform the expressions (3.21) and (3.22) into space, although it seems to be more numerically stable to use the expressions (3.19) and (3.20) in (3.23). For small , this numerical integration is feasible. The following example employs numerical integration for a dataset with nodes.

Figure 1: plot for Zachary’s karate club: blue: ; red:

Figure 1 shows which pairs of nodes are particularly likely or unlikely to be in the same community for Zachary’s karate club data [79]. This is a social network of 34 members of a karate club at a university which split into two communities. Nodes 4 and 8 are most likely, with , and nodes 1 and 34 least likely to be in the same community with . Being adjacent does not guarantee a high value of : the node pairs and each have . Nor is it necessary for nodes to be adjacent to have a high value of : among nodes 15, 16, 19, 21, and 23 for every pair, and . Note that the node pairs and are each non-adjacent, and each has four common neighbors (i.e., ), but their values of differ by a factor of 150 because of the degrees of the nodes involved. Finally, although nodes 9 and 31 are in different ground-truth communities, .

When numerical integration is not feasible, it is difficult to obtain good asymptotic estimates as , so we will resort to heuristics. The function has a global maximum which is increasingly sharp as . This occurs at


If , then this peak lies within when Assuming the expectation has significant weight in this range, one can replace with a delta function at to estimate (i.e., parameter estimation is an appropriate approximation to the full, Bayesian integration). To estimate in this case, one could make the same argument, but with the maximum of constrained to . Conveniently, this constrained maximum occurs at . The analogous argument works when . Therefore, if the integrals in (3.24) and (3.25) contained nothing but we could approximate them by or as appropriate. Ignoring the expectation over as well, we could substitute these expressions into (3.7) to obtain the following approximation


We may decompose as , where encompasses all corrections to the crude approximation when there is an edge between and , and when there is no edge. For this, we need to specify the prior on . Here, we let vary uniformly from to . It is convenient to treat (or, equivalently, ) as a continuum variable here to avoid the accidents of discreteness. With this prior, we may compute the value of required in (3.6) as


When differs greatly from 0, the factor is very large or small and dominates the correction term . Therefore we seek to approximate the correction factor only in the critical case . Typically, real-world graphs are sparse, in which case the lack of an edge between and decreases their co-membership probability only slightly, but the presence of an edge greatly enhances it. Numerical experimentation confirms this intuition: the correction factor due to the absence of an edge is roughly constant, but the factor due to the presence of the edge increases rapidly as decreases until it hits a constant plateau (which varies with ):


The four-digit coefficients in these formulas are obtained from an asymptotic analysis of exact results obtained in the case. These exact results involve combinations of generalized hypergeometric functions (i.e., ), and are not particularly enlightening, although they can be used to obtain accurate coefficients, such as 0.56051044368284805729 rather than 0.5605 in (3.30).

Putting the above together into (3.6), we obtain the following approximation to :


This formula could certainly be improved. It often yields results such as : this figure might be accurate given the model assumptions, but such certainty could never be attained in the real world. To make it more accurate a more sophisticated model could be used, or the priors on , , and could be matched more closely to reality. Only limited improvement is possible, however, because in reality multiple, overlapping, fuzzily-defined community structures typically exist at various scales, and it is unclear what means in such a context. Certainly the integral approximations could be performed more rigorously and accurately. The broad outlines of the behavior of are captured in , , and , however. Finally, using only local evidence constitutes a rather radical pruning of the information in . However, it is because of this pruning that the approximation (3.31) can be implemented so efficiently.

4 Community Detection: Results

Direct visualizations like Figure 1 are impractical for larger networks. It can be useful to use the blue edges in Figure 1 (those with above a certain threshold) in place of a graph’s edges in network algorithms (such as graph layout): this is discussed in Section 4.3. However, a more direct use of the co-membership matrix for network visualization is simply to plot the matrix itself with the values of as intensities [61, 62]. An example of this is shown in Figure 2, using the approximation (3.31), for the Enron email communication network, which has 36,692 nodes and 367,662 edges [48]. The insets depict the hierarchical organization of community structure in networks [12, 44]: communities with various structures exist at all scales. Although the model does not account for hierarchical structure, a benefit of integrating over the number of communities (rather than estimating it) is that this accounts for co-membership at different scales.

Figure 2: Visualization of matrix for the Enron email network

4.1 Accuracy

One may rightly question whether the approximation is accurate, given the modeling assumptions and approximations that it is based on. To address this, we observe that the values of may be used to define a certain family expected utility functions (parameterized by a threshold probability : cf. (A.14) in Appendix A), and optimizing this expected utility yields a traditional community detection algorithm. Because a great many community detection algorithms have been developed, one can assess the quality of the approximation by comparing the performance of the resulting community detection algorithm to those in the literature.

The most comprehensive comparison to date is based on the LFR benchmark graphs which have power-law distributions both on degree and on community size [43]. The conclusion is that all algorithms prior to 2008 were eclipsed by a set of three more recent algorithms: the Louvain algorithm [8], the Ronhovde–Nussinov (RN) algorithm [63], and Infomap [64]. Infomap performed somewhat better than RN, and both somewhat better than Louvain, but all three were much better than the previous generation. Figure 3 compares our algorithm to the Infomap, RN, and Louvain algorithms, and to the other algorithms tested in [43]. (This figure is a correction of Figure 3 from [27]. Also, the Simulated Annealing method which works so well in panel (b) is highlighted (purple with shorter dashes) in all panels for comparison.) Our method is labeled because numerical optimization over all has been used to set to the value that maximizes NMI. Both the 1000- and 5000-node cases are shown, for small communities (10 to 50 nodes) and large ones (20 to 100). The -axis is the mixing parameter —the fraction of a node’s neighbors outside its community (not the expected edge probability of (3.12))—and the -axis is a particular version (cf. the appendix of [44]) of the Normalized Mutual Information (NMI) between the computed and the true partition. In all cases, our method exhibits the performance characteristic of the three state-of-the-art methods cited by [43]. The method has an unfair advantage in optimizing over all : in a deployable algorithm one would need a method for setting . On the other hand, the purpose of Figure 3 is simply to show that the computation retains enough information about community structure to reconstruct high-quality hard-call solutions. From this perspective, it is surprising that it does so well, because is based on (a) the simple utility function of Appendix A, (b) an approximation to based only on limited evidence, and (c) an approximation to based on a heuristic evalution of the required integral.

(a) 1000 nodes, small communities (b) 1000 nodes, large communities (c) 5000 nodes, small communities (d) 5000 nodes, large communities
Figure 3: Comparison of community detection algorithms: ; : Infomap; : RN; : Louvain; - - - : Other methods.

4.2 Efficiency

The algorithm for computing the begins with pre-computing the value of for all pairs of nodes for which , then creating a cache of values for triples . For any node pair , the value of can be computed by first looking up its value of , computing and from and the degrees of and , then looking up the value for its triple . Occasionally the value for this triple must be computed from (3.31) and cached, but the number of such distinct triples is relatively small in practice. An optional additional step one can perform is to loop over all node pairs with non-zero in order to both fill in the value of for each triple , and count the number of times each triple occurs. (Because only the pairs with are looped over, some additional bookkeeping is needed to fill in and provide a count for the triples without actually iterating over all node pairs.) These values and counts are useful for the statistical analysis of the distribution.

We tested the algorithm on five different Facebook networks (gathered from various universities) [74], and networks generated from Slashdot [49], Amazon [47], LiveJournal [49], and connections between .edu domains (Wb-edu) [19]. Table 1 shows various relevant network statistics. The sum of the values for each node pair is the number of calculations needed to compute the data structure, whereas the number of values of reflects its size. The next column is the number of distinct triples—this is the number of distinct values that must be computed, and the final one is the number of communities that a randomly chosen instance of the algorithm Infomap [64] found for the dataset. For the last two rows the data structure was too large to hold in memory, and the second step of counting the triples was not performed, nor could Infomap be run successfully on our desktop.

Table 1: Network datasets

Table 2 contains timing results based on a Dell desktop with 8GB of RAM, and eight 2.5GHz processors. The first column is the number of seconds it took the version of Infomap described in [64] to run. This code is in C++, runs single-threaded, includes a small amount of overhead for reading the network, and uses the Infomap default setting of picking the best result from ten individual Infomap trial partitions. The next four columns compare methods of using our Java code to compute . The first two are single-threaded, and the other two use all eight cores. The columns labeled are the timing for the computation only: results are nulled out immediately after computing them. These columns are included for two reasons. First, they show that the computation itself displays good parallelization: the speedup is generally higher than 6.5 for eight processors. Second, the computation itself for the two larger datasets is quite fast (just under three minutes on eight cores), but the algorithm is currently designed only to maintain all results in RAM, and these datasets are too large for this. The last two columns are the timing results for explicitly filling in the information ahead of time and providing the counts required for statistical analysis.

As Infomap is one of the fastest community detection algorithms available, these results are quite impressive. Comparing the first two columns of timing data, we find the speed-up ranges up to 107 and 862 times as fast as Infomap for the two largest networks on which we ran Infomap, respectively. The relative performance falls off rapidly for denser networks, but even in these cases or method performed roughly 10 times as fast as Infomap (i.e., as fast as an individual Infomap run). Computing the counts for statistics increases the run time, but only by a constant factor (of 2 to 3). It must be emphasized that the timing results are for producing a very different kind of output than Infomap does. However, the usual method for estimating  [62] is many times slower than producing partitions, rather than many times faster.