Community Detection in Networks with Node Features
Abstract
Many methods have been proposed for community detection in networks, but most of them do not take into account additional information on the nodes that is often available in practice. In this paper, we propose a new joint community detection criterion that uses both the network edge information and the node features to detect community structures. One advantage our method has over existing joint detection approaches is the flexibility of learning the impact of different features which may differ across communities. Another advantage is the flexibility of choosing the amount of influence the feature information has on communities. The method is asymptotically consistent under the block model with additional assumptions on the feature distributions, and performs well on simulated and real networks.
Community detection is a fundamental problem in network analysis, extensively studied in a number of domains – see korean1 () and schlitt2007current () for some examples of applications. A number of approaches to community detection are based on probabilistic models for networks with communities, such as the stochastic block model Holland83 (), the degreecorrected stochastic block model Karrer10 (), and the latent factor model Hoff2007 (). Other approaches work by optimizing a criterion measuring the strength of community structure in some sense, often through spectral approximations. Examples include normalized cuts shi00 (), modularity Newman&Girvan2004 (); NewmanPNAS (), and many variants of spectral clustering, e.g., qin2013regularized ().
Many of the existing methods detect communities based only on the network adjacency matrix. However, we often have additional information on the nodes (node features), and sometimes edges as well, for example, steglich2006applying (), snijders2006new () and hummon1990analyzing (). In many networks the distribution of node features is correlated with community structure Jure_ego (), and thus a natural question is whether we can improve community detection by using the node features. Several generative models for jointly modeling the edges and the features have been proposed, including the network random effects model hoff2003random (), the embedding feature model Zanghi:2010:CBR:1774858.1774960 (), the latent variable model RSSA:RSSA471 (), the discriminative approach Yang:2009:CLC:1557019.1557120 (), the latent multigroup membership graph model Jure_multiple_attr (), the social circles model for ego networks Jure_ego (), the communities from edge structure and node attributes (CESNA) model yang2013community (), the Bayesian Graph Clustering (BAGC) model xu2012model () and the topical communities and personal interest (TCPI) model hoang2014joint (). Most of these models are designed for specific feature types, and their effectiveness depends heavily on the correctness of model specification. Modelfree approaches include weighted combinations of the network and feature similarities viennet2012community (); binkiewicz2014covariate (), attributestructure mining Silva:2012:MAC:2140436.2140443 (), simulated annealing clustering Cheng:2011:CLA:1921632.1921638 (), and compressive information flow smith2014partitioning (). Most methods in this category use all the features in the same way without determining which ones influence the community structure and which do not, and lack flexibility in how to balance the network information with the information coming from its node features, which do not always agree. Including irrelevant node features can only hurt community detection by adding in noise, while selecting features that by themselves cluster strongly may not correspond to features that correlate with the community structure present in the adjacency matrix.
In this paper, we propose a new joint community detection criterion that uses both the network adjacency matrix and the node features. The idea is that by properly weighing edges according to feature similarities on their end nodes, we strengthen the community structure in the network thus making it easier to detect. Rather than using all available features in the same way, we learn which features are most helpful in identifying the community structure from data. Intuitively, our method looks for an agreement between clusters suggested by two data sources, the adjacency matrix and the node features. Numerical experiments on simulated and real networks show that our method performs well compared to methods that use either the network alone or the features alone for clustering, as well as to a number of benchmark joint detection methods.
1 The joint community detection criterion
Our method is designed to look for assortative community structure, that is, the type of communities where nodes are more likely to connect to each other if they belong to the same community, and thus there are more edges within communities than between. This is a very common intuitive definition of communities which is incorporated in many community detection criteria, for example, modularity NewmanPNAS (). Our goal is to use such a community detection criterion based on the adjacency matrix alone, and add featurebased edge weights to improve detection. Several criteria using the adjacency matrix alone are available, but having a simple criterion linear in the adjacency matrix makes optimization much more feasible in our particular situation, and we propose a new criterion which turns out to work particularly well for our purposes. Let denote the adjacency matrix with if there is no edge between nodes and , and otherwise which can be either 1 for unweighted networks or the edge weight for weighted networks. The community detection criterion we start from is a very simple analogue of modularity, to be maximized over all possible label assignments :
(1.1) 
Here is the vector of node labels, with if node belongs to community , for , , and is the number of nodes in community . We assume each node belongs to exactly one community, and the number of communities is fixed and known. Rescaling by is designed to rule out trivial solutions that put all nodes in the same community, and is a tuning parameter. When , the criterion is approximately the sum of edge densities within communities, and when , the criterion is the sum of average “within community” degrees, which both intuitively represent community structure. This criterion can be shown to be consistent under the stochastic block model by checking the conditions of the general theorem in Bickel&Chen2009 ().
The ideal use of features with this criterion would be to use them to upweigh edges within communities and downweigh edges between them, thus enhancing the community structure in the observed network and making it easier to detect. However, node features may not be perfectly correlated with community structure, different communities may be driven by different features, as pointed out by Jure_ego (), and features themselves may be noisy. Thus we need to learn the impact of different features on communities as well as balance the roles of the network itself and its features. Let denote the dimensional feature vector of node . We propose a joint community detection criterion (JCDC),
(1.2) 
where is a tuning parameter as in (1.1), is the coefficient vector that defines the impact of different features on the th community, and . The criterion is then maximized over both and . Having a different for each allows us to learn the roles different features may play in different communities. The balance between the information from and is controlled by , another tuning parameter which in general may depend on .
For the sake of simplicity, we model the edge weight as a function of the node features and via a dimensional vector of their similarity measures . The choice of similarity measures in depends on the type of (for example, on whether the features are numerical or categorical) and is determined on a case by case basis; the only important property is that assigns higher values to features that are more similar. Note that this trivially allows the inclusion of edge features as well as node features, as long as they are converted to some sort of similarity. To eliminate potential differences in units and scales, we standardize all along each feature dimension. Finally, the function should be increasing in , which can be viewed as the “overall similarity” between nodes, and for optimization purposes it is convenient to take to be concave. Here we use the exponential function,
(1.3) 
One can use other functions of similar shapes, for example, the logit exponential function, which we found empirically to perform similarly.
2 Estimation
The joint community detection criterion needs to be optimized over both the community assignments and the feature parameters . Using block coordinate descent, we optimize JCDC by alternately optimizing over the labels with fixed parameters and over the parameters with fixed labels, and iterating until convergence.
2.1 Optimizing over label assignments with fixed weights
When parameters are fixed, all edge weights ’s can be treated as known constants. It is infeasible to search over all possible label assignments, and, like many other community detection methods, we rely on a greedy label switching algorithm to optimize over , specifically, the tabu search Glover:tabu (), which updates the label of one node at a time. Since our criterion involves the number of nodes in each community , no easy spectral approximations are available. Fortunately, our method allows for a simple local approximate update which does not require recalculating the entire criterion. For a given node considered for label switching, the algorithm will assign it to community rather than if
(2.1) 
where is twice the total edge weights in community , and is the sum of edge weights between node and all the nodes in . When and are large, we can ignore in the denominators, and (2.1) becomes
(2.2) 
which allows for a “local” update for the label of node without calculating the entire criterion. This also highlights the impact of the tuning parameter : when , the two sides of (2.2) can be viewed as averaged weights of all edges connecting node to communities and , respectively. Then our method assigns node to the community with which it has the strongest connection. When , the left hand side of (2.2) is multiplied by a factor . Suppose is larger than ; then choosing indicates a preference for assigning a node to the larger community, while favors smaller communities. A detailed numerical investigation of the role of is provided in the Supplemental Material.
The edge weights involved in (2.2) depend on the tuning parameter . When , all weights are equal to . On the other hand, for all values of . Therefore, is the maximum amount by which our method can reweigh an edge. When is large, , and thus the information from the network structure dominates. When is close to , the ratio is large and the featuredriven edge weights have a large impact. See the Supplemental Material for more details on the choice of .
While the tuning parameter controls the amount of influence features can have on community detection, it does not affect the estimated parameters for a fixed community assignment. This is easy to see from rearranging terms in (1.2):
(2.3) 
where the function does not depend on . Note that the term containing does not depend on .
2.2 Optimizing over weights with fixed label assignments
Since we chose a concave edge weight function (1.3), for a given community assignment the joint criterion is a concave function of , and it is straightforward to optimize over by gradient ascent. The role of is to control the impact of different features on each community. One can show by a Taylorseries type expansion around the maximum (details omitted) and also observe empirically that for our method, the estimated ’s are correlated with the feature similarities between nodes in community . In other words, our method tends to produce a large estimated for a feature with high similarity values ’s for . However, in the extreme case, the optimal can be if all ’s are positive in community or if all ’s are negative (recall that similarities are standardized, so this cannot happen in all communities). To avoid these extreme solutions, we subtract a penalty term from the criterion (1.2) while optimizing over . We use a very small value of ( everywhere in the paper) which safeguards against numerically unstable solutions but has very little effect on other estimated coefficients.
3 Consistency
The proposed JCDC criterion (1.2) is not modelbased, but under certain models it is asymptotically consistent. We consider the setting where the network and the features are generated independently from a stochastic block model and a uniformly bounded distribution, respectively. Let where is a factor controling the overall edge density and is the vector of true labels. Assume the following regularity conditions hold:

There exist global constants and , such that and for all , and the tuning parameter satisfies .

Let . There exists a global constant such that for all .

For all , .
Condition 1 states that node feature similarities are uniformly bounded. This is a mild condition in many applications as the node features are often themselves uniformly bounded. In practice, for numerical stability the user may want to standardize node features and discard individual features with very low variance, before calculating the corresponding similarities . Condition 2 guarantees communities do not vanish asymptotically. Condition 3 enforces assortativity. Since the estimated labels are only defined up to an arbitrary permutation of communities, we measure the agreement betwee and by , where is the set of all permutations of .
Theorem 1 (Consistency of JCDC).
The proof is given in the Supplemental Material.
4 Simulation studies
We compare JCDC to three representative benchmark methods which use both the adjacency matrix and the node features: CASC (Covariate Assisted Spectral Clustering, binkiewicz2014covariate ()), CESNA (Communities from Edge Structure and Node Attributes, yang2013community ()), and BAGC (BAyesian Graph Clustering, xu2012model ()). In addition, we also include two standard methods that use either the network adjacency alone (SC, spectral clustering on the Laplacian regularized with a small constant , as in Amini.et.al2013 ()), or the node features alone (KM, means performed on the dimensional node feature vectors, with 10 random initial starting values). We generate networks with nodes and communities of sizes and from the degreecorrected stochastic block model as follows. The edges are generated independently with probability if nodes and are in the same community, and if nodes and are in different communities. We set and vary from to . We set of the nodes in each community to be “hub” nodes with the degree correction parameter , and for the remaining nodes set . All resulting products are thresholded at 0.99 to ensure there are no probability values over 1. These settings result in the average expected node degree ranging approximately from to .
For each node , we generate features, with one “signal” feature related to the community structure and one “noise” feature whose distribution is the same for all nodes. The “signal” feature follows the distribution for nodes in community 1 and for nodes in community 2, with varying from to (larger corresponds to stronger signal). For use with CESNA, which only allows categorical node features, we discretize the continuous node features by partitioning the real line into 20 bins using the th quantiles. For the JCDC, based on the study of the tuning parameters in the Supplemental Material, we use and compare two values of , and . Finally, agreement between the estimated communities and the true community labels is measured by normalized mutual information, a measure commonly used in the network literature which ranges between 0 (random guessing) and 1 (perfect agreement). For each configuration, we repeat the experiments 30 times, and record the average NMI over 30 replications.
Figure 1 shows the heatmaps of average NMI for all methods under these settings, as a function of and . As one would expect, the performance of spectral clustering (c), which uses only the network information, is only affected by (the larger is, the harder the problem), and the performance of means (d), which uses only the features, is only affected by (the larger is, the easier the problem). JCDC is able to take advantage of both network and feature information by estimating the coefficients from data, and its performance only deteriorates when neither is informative. The informative features are more helpful with a larger value of (a), and conversely uninformative features affect perfomance slightly more with a lower value of (b), but this effect is not strong. CASC (e) appears to inherit the sharp phase transition from spectral clustering, which forms the basis of CASC; the sharp transition is perhaps due to different community sizes and hub nodes, which are both challenging to spectral clustering; CESNA (f) and BAGC (g) do not perform as well overall, with BAGC often clustering all the hub nodes into one community.
5 Data applications
5.1 The world trade network
The world trade network de2011Pajek () connects 80 countries based on the amount of trade of metal manufactures between them in 1994, or when not available for that year, in 1993 or 1995. Nodes are countries and edges represent positive amount of import and/or export between the countries. Each country also has three categorical features: the continent (Africa, Asia, Europe, N. America, S. America, and Oceania), the country’s structural position in the world system in 1980 (core, strong semiperiphery, weak semiperiphery, periphery) and in 1994 (core, semiperiphery, periphery). Figures 2 (a) to (c) show the adjacency matrix rearranged by sorting the nodes by each of the features. The partition by continent (Figure 2(a)) clearly shows community structure, whereas the other two features show hubs (core status countries trade with everyone), and no assortative community structure. We will thus compare partitions found by all the competing methods to the continents, and omit the three Oceania countries from further analysis because no method is likely to detect such a small community. The two world position variables (’80 and ’94) will be used as features, treated as ordinal variables.
The results for all methods are shown in Figure 2, along with NMI values comparing the detected partition to the continents. All methods were run with the true value .
Community  Best match  Position ’80  Position ’94 

blue  Europe  0.000  0.143 
red  Asia  0.314  0.127 
green  Europe  0.017  0.204 
cyan  N. America  0.107  0.000 
purple  S. America  0.121  0.000 
The result of spectral clustering agrees much better with the continents than that of means, indicating that the community structure in the adjacency matrix is closer to the continents that the structure contained in the node features. JCDC obtains the highest NMI value, CASC performs similarly to spectral clustering, whereas CESNA and BAGC both fail to recover the continent partition. Note that no method was able to estimate Africa well, likely due to the disassortative nature of its trade seen in Figure 2 (a). Figure 2 (e) indicates that JCDC estimated N. America, S. America and Asia with high accuracy, but split Europe into two communities, since it was run with and could not pick up Africa due to its disassortative structure. Table 1 contains the estimated feature coefficients, suggesting that in 1980 the “world position” had the most influence on the connections formed by Asian countries, whereas in 1994 world position mattered most in Europe.
5.2 The lawyer friendship network
The second dataset we consider is a friendship network of 71 lawyers in a New England corporate law firm lazega2001collegial (). Seven node features are available: status (partner or associate), gender, office location (Boston, Hartford, or Providence, a very small office with only two nonisolated nodes), years with the firm, age, practice (litigation or corporate) and law school attended (Harvard, Yale, University of Connecticut, or other). Categorical features with levels are represented by dummy indicator variables. Figures 3 (a)(g) show heatmap plots of the adjacency matrix with nodes sorted by each feature, after eliminating six isolated nodes. Partition by status (Figure 3(a)) shows a strong assortative structure, and so does partition by office (Figure 3(c)) restricted to Boston and Hartford, but the small Providence office does not have any kind of structure. Thus we chose the status partition as a reference point for comparisons, though other partitions are certainly also meaningful.
Communities estimated by different methods are shown in Figure 3 (i)(o), all run with . Spectral clustering and means have equal and reasonably high NMI values, indicating that both the adjacency matrix and node features contain community information. JCDC obtains the highest NMI value, with performing slightly better than . CASC improves upon spectral clustering by using the feature information, with NMI just slightly lower than that of JCDC with . CESNA and BAGC have much lower NMI values, possibly because of hub nodes, or because they detect communities corresponding to something other than status.
The estimated feature coefficients are shown in Table 2. Office location, years with the firm, and age appear to be the features most correlated with the community structure of status, for both partners and associates, which is natural. Practice, school, and gender are less important, though it may be hard to estimate the influence of gender accurately since there are relatively few women in the sample.
Comm.  gender  office  years  age  practice  school 

partner  0.290  0.532  0.212  0.390  0.095  0.000 
associate  0.012  0.378  0.725  0.320  0.118  0.097 
6 Discussion
Our method incorporates featurebased weights into a community detection criterion, improving detection compared to using just the adjacency matrix or the node features alone, if the cluster structure in the features is related to the community structure in the adjacency matrix. It has the ability to estimate coefficients for each feature within each community and thus learn which features are correlated with the community structure. This ability guards against including noise features which can mislead community detection. The community detection criterion we use is designed for assortative community structure, with more connections within communities than between, and benefits the most from using features that have a similar clustering structure.
This work can be extended in several directions. Variation in node degrees, often modeled via the degreecorrected stochastic block model Karrer10 () which regards degrees as independent of community structure, may in some cases be correlated with node features, and accounting for degree variation jointly with features can potentially further improve detection. Another useful extension is to overlapping communities. One possible way to do that is to optimize each summand in JCDC (1.2) separately and in parallel, which can create overlaps, but would require careful initialization. Statistical models that specify exactly how features are related to community assignments and edge probabilities can also be useful, though empirically we found no such standard models that could compete with the nonmodelbased JCDC on real data. This suggests that more involved and perhaps dataspecific modeling will be necessary to accurately describe real networks, and some of the techniques we proposed, such as communityspecific feature coefficients, could be useful in that context.
Acknowledgments
E.L. is partially supported by NSF grants DMS1106772 and DMS1159005. J.Z. is partially supported by a NSF grant DMS1407698 and a NIH grant R01GM096194.
Appendix
A.1 Choice of tuning parameters
The JCDC method involves two userspecified tuning parameters, and . In this section, we investigate the impact of these tuning parameters on community detection results via numerical experiments.
First we study the impact of , which determines the algorithm’s preference for larger or smaller communities. We study its effect on the estimated community size as well as on the accuracy of estimated community labels. We generate data from a stochastic block model with nodes and communities of sizes and . We set the withincommunity edge probabilities to and betweencommunity edge probabilities to , and vary from 60 to 110. Since is not related to feature weights, we set features to a constant, resulting in unweighted networks. The results are averaged over 50 replications and shown in Figure 4.
We report the size of the larger estimated community in Figure 4(a), and the accuracy of community detection as measured by normalized mutual information (NMI) in Figure 4(b). For comparison, we also record the results from spectral clustering (horizontal lines in Figure 4), which do not depend on . When communities are balanced (), JCDC performs well for all values of , producing balanced communities and uniformly outperforming spectral clustering in terms of NMI. In general, larger values of in JCDC result in more balanced communities, while smaller ’s tend to produce a large and a small community. In terms of community detection accuracy, Figure 4(b) shows that the JCDC method outperforms spectral clustering over a range of values of , and this range depends on how unbalanced the communities are. For simplicity and ease of interpretation, we set for all the simulations and data analysis reported in the main manuscript; however, it can be changed by the user if information about community sizes is available.
Next, we investigate the impact of , which controls the influence of features. To study the tradeoff between the two sources of information (network and features), we generate two different community partitions. Specifically, we consider two communities of sizes and , with . We generate two label vectors and , with for and for , while the other label vector has for and for . Then the edges are generated from the stochastic block model based on , and the node features are generated based on . We generate two node features: one feature is sampled from the distribution if and if ; the other feature is sampled from if and if . We fix and set , as discussed above. We set the within and betweencommunity edge probabilities to 0.3 and 0.15, respectively, same as in the previous simulation, and vary the value of from to . Finally, we look at the the agreement between the estimated communities and and , as measured by normalized mutual information. The results are shown in Figure 5.
As we expect, smaller values of give more influence to features and thus the estimated community structure agrees better with than with . As increases, the estimated becomes closer to . In the manuscript, we compare two values of , 1.5 and 5.
A.2 Proofs
We start with summarizing notation. Let be the estimated communities corresponding to the label vector , and the true communities corresponding to the label vector . Recall we estimate by maximizing the criterion over and , where
and define
where and the corresponding are defined up to a permutation of community labels. Recall that we assumed and are conditionally independent given and defined , the “population version” of , as
The expectation in is taken with respect to the distribution of node features, which determine the similarities .
Proof of Lemma 2.
We first bound the difference between and for fixed and . By Hoeffding’s inequality and the fact that , where is the integer part of , we have
Taking and applying the union bound, we have
Next, we take the uniform bound over . Consider the set
It is straightforward to verify that is an net on , the space of ’s. For each , let be the best approximation to in . Then
Therefore, choosing , we have
where the first term becomes 0 because of the choice of and . Finally, taking a union bound over all possible community assignments, we have
where . Taking completes the proof of Lemma 2. ∎
We now proceed to investigate the “population version” of our criterion, . Define by , and let be a diagonal matrix with on the diagonal, where is the fraction of nodes in community . Roughly speaking, is the confusion matrix between and , and for a permutation matrix means the estimation is perfect. Define
Each estimated community assignment induces a unique . It is not difficult to verify that
Lemma 3.
Under conditions 1 and 2, there exists a constant such that
Proof of Lemma 3.
Lemma 4.
Under condition 3, if , then for all satisfying for , is uniquely maximized at for , where denotes the set of permutation matrices.
Proof of Lemma 4.
We have
(6.1) 
For , since for all and , we have . By midvalue theorem, there exists , such that
(6.2) 
Finally, we will need the following inequality: for and satisfying ,
(6.3) 
For , equality holds. To verify (6.3) when , dividing by we have
The first inequality above implies that a necessary condition for equality to hold in (6.3) is .
We now lower bound the first term on the right hand side of (6.1).
(6.4) 
where the last equality is obtained by applying (6.3) with , and . Plugging (6.4) into (6.1), we have
It remains to show that equality holds only if for some . Note that the last inequality in (6.4) is obtained from (6.3), where equality holds only when . The corresponding condition for equality to hold in (6.4) is thus for all , and . Therefore, for each , there is only one such that , i.e., for some . ∎
Proof of Theorem 1.
By Lemma 2 and Lemma 3, we have
(6.5) 
It is straightforward to verify that, for any , , where . Take a sequence of decreasing positive numbers and define
(6.6) 
We now show, by contradiction, that implies . First, note that is nonincreasing. Now if , by compactness of the set and continuity of the function , the supremum of over , which equals , is attained in . This contradicts Lemma 4.
Now let