Modularity based community detection in heterogeneous networks
Jingfei Zhang and Yuguo Chen
University of Miami and University of Illinois at Urbana-Champaign
Abstract: Heterogeneous networks are networks consisting of different types of nodes and multiple types of edges linking such nodes. While community detection has been extensively developed as a useful technique for analyzing networks that contain only one type of nodes, very few community detection techniques have been developed for heterogeneous networks. In this paper, we propose a modularity based community detection framework for heterogeneous networks. Unlike existing methods, the proposed approach has the flexibility to treat the number of communities as an unknown quantity. We describe a Louvain type maximization method for finding the community structure that maximizes the modularity function. Our simulation results show the advantages of the proposed method over existing methods. Moreover, the proposed modularity function is shown to be consistent under a heterogeneous stochastic blockmodel framework. Analyses of the DBLP four-area dataset and a MovieLens dataset demonstrate the usefulness of the proposed method.
Key words and phrases: Heterogeneous network, modularity function, community detection, null model, consistency.
1. Introduction
Network community detection has attracted much recent attention from various scientific communities, including statistics, physics, information technology, biology, social science and many others. A real-world network often displays a high level of inhomogeneity in its edge distribution, with high edge density within certain groups of nodes and low edge density between these groups. This feature is often referred to as the community structure (Fortunato, 2010). Community structures have been observed in networks in social science, biology, political science and so on. For example, in a gene regulation network, communities are groups of genes that function together in biological processes to carry out specific functions (Zhang and Cao, 2017). Detecting communities in real-world networks can help us better understand the architecture of the network. Further, it allows us to investigate the property in individual communities, which may be different from the aggregated property from the network as a whole.
Many community detection techniques have been proposed in recent years; see Fortunato (2010) for a comprehensive review. One class of methods involves maximizing some partition quality function over all possible partitions of the network (Shi and Malik, 2000; Newman and Girvan, 2004; Newman, 2006; Rohe et al., 2011). Another class includes model based approaches that estimate community structures by fitting probabilistic models to the observed networks (Airoldi et al., 2008; Bickel and Chen, 2009; Jin, 2015). Approaches in the second class require the number of communities to be known a priori.
The existing community detection approaches primarily focus on homogeneous networks, i.e., networks with only one type of node. However, networks representing real-world complex systems often contain different types of nodes and different types of edges linking such nodes; we refer to such networks as heterogeneous networks. For example, in a healthcare network, nodes can be patients, diseases, doctors and hospitals, and edges can be of type patient-disease (patient treated for disease), patient-doctor (patient treated by doctor) or doctor-hospital (doctor works at hospital). Figure 1.1 provides a simple illustration of a heterogeneous network. In this illustrative heterogeneous Facebook network, there are two types of nodes, users and events, and two types of interactions: a user is linked to another user through friendship, and a user is linked to an event through attendance.
To find communities in a heterogeneous network using the existing methods developed for homogeneous networks, there are two possible approaches. One approach is to treat the heterogeneous network as a homogeneous network, without differentiating the types of nodes and edges. The other approach is to consider each type of node separately, i.e., to discard the information from the edges linking different types of nodes. Both approaches lose important information. The first approach ignores the fact that different types of nodes may behave differently. For example, in Figure 1.1, users and events behave in different ways; a user can become friends with other users but an event cannot link to other events. A community detection algorithm that does not distinguish the two types of nodes loses this information, which may lead to poor community detection results. The second approach ignores the valuable information from edges that link different types of nodes. For example, in Figure 1.1, the user-event links show how different users are attracted to different events. Including such information can help us better identify the communities of users. Moreover, it provides important insights into the types of events that are clustered with each community of users.
To find community structures in heterogeneous networks, a preferable approach should take into account all the information contained in the heterogeneous network, including the different types of nodes, the homogeneous edges (edges that connect two nodes of the same type) and the heterogeneous edges (edges that connect two nodes of different types). The objective of the approach is to cluster the nodes in the heterogeneous network into several non-overlapping groups such that there are more homogeneous and heterogeneous edges within these groups and fewer homogeneous and heterogeneous edges between these groups; see Figure 1.1 for a simple illustration of a heterogeneous Facebook network with two communities.
Several methods have been proposed recently for detecting communities in heterogeneous networks (Sun and Han, 2012; Liu et al., 2014; Sengupta and Chen, 2015). One limitation of the existing methods is that they may have requirements on the number of node types or edge types in the network (see for example Sun and Han (2012) and Liu et al. (2014)). Another limitation of the existing methods is that they may require the number of communities in the network to be pre-specified (see for example Sengupta and Chen (2015)). This requirement could be difficult to meet in practice, since we generally do not know the number of communities in real-world networks. Lastly, very large networks can be computationally challenging for some existing methods, such as the spectral clustering approach proposed in Sengupta and Chen (2015).
In this paper, we propose a modularity based heterogeneous network community detection framework. Our contribution is threefold. First, we formally define a null model for a heterogeneous network. Under the proposed null model, we calculate the probabilities of having a homogeneous edge between two nodes of the same type and a heterogeneous edge between two nodes of different types. Second, we propose a Louvain type maximization method that can efficiently maximize the proposed modularity function. The application of the maximization method on a real-world network with about 20,000 nodes takes less than 20 seconds on a standard PC. Our proposed approach can be applied to heterogeneous networks of any type. Furthermore, the number of communities does not need to be specified and can be treated as an unknown quantity. Third, we show that the proposed modularity function for heterogeneous networks is consistent under a heterogeneous stochastic blockmodel framework. The consistency properties of modularity functions formulated for bipartite or multipartite networks follow as special cases. This theoretical result fills an existing gap in the literature.
The rest of this article is organized as follows. Section 2 introduces the null model for a heterogeneous network and the definition of a modularity function. Section 3 discusses the Louvain type modularity maximization technique. Section 4 shows the consistency of the modularity function under a heterogeneous stochastic blockmodel framework. Section 5 demonstrates the advantages of our proposed method through simulation studies. Section 6 discusses the application of the proposed method to the DBLP four-area dataset and a MovieLens dataset. Section 7 provides some concluding remarks and discussions.
2. Modularity Function for Heterogeneous Networks
Let $G = (\cup_{l=1}^{L} V^{[l]}, E \cup E^{+})$ denote a simple heterogeneous network (no self loops or multiple edges) with $L$ types of nodes. Node set $V^{[l]}$ contains all the nodes of the $l$th type, $l = 1, \ldots, L$. Edge set $E$ denotes the set of edges between nodes of the same type and $E^{+}$ denotes the set of edges between nodes of different types. A homogeneous network $G^{[l]} = (V^{[l]}, E^{[l]})$ can be formed within each node set $V^{[l]}$, where $E^{[l]}$ is the set of edges between nodes in $V^{[l]}$. By definition, we have $E = \cup_{l=1}^{L} E^{[l]}$. When $E = \emptyset$, the heterogeneous network forms a multi-partite network, i.e., edges are only established between different types of nodes. In this paper, we use the terms network and graph interchangeably.
Newman and Girvan (2004) defined a quality function, usually referred to as the modularity function, for measuring the strength of a division of a homogeneous network into communities. Given a homogeneous network with $n$ nodes, $m$ edges and a community assignment $e = (e_1, \ldots, e_n)$, where $e_i$ is the community that node $i$ belongs to, the modularity function is defined as
$$Q(e) = \frac{1}{2m} \sum_{i,j} \{A_{ij} - E(A_{ij})\}\, \delta(e_i, e_j),$$
where $\delta(e_i, e_j) = 1$ if $e_i = e_j$ and 0 otherwise. Here $A_{ij}$ is the $(i,j)$th entry of the adjacency matrix of the network and the expectation $E(A_{ij})$ is calculated under some null model that describes networks with no community structure. It is easy to see that $-1 \le Q(e) \le 1$.
The modularity function for homogeneous networks measures the difference between the observed number of intra-community edges and the expected number of intra-community edges under the null model. If the observed number of intra-community edges in the network is close to the expected value, the modularity is close to 0. When $Q$ approaches 1, the observed number of intra-community edges is much higher than the expected value, and this indicates a strong community structure. Since the modularity function measures the “strength” of community structure with respect to a network partition, the community membership of a network is identified by maximizing the modularity function with respect to the community assignment $e$. The number of communities does not need to be pre-specified in this approach and can be treated as an unknown quantity.
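As a minimal sketch of the modularity function for a homogeneous network, the following Python snippet evaluates it for a given community assignment. It assumes the degree-product expectation $d_i d_j / (2m)$ as the null model, a common choice that the definition above does not fix; the function name `modularity` and the toy graph are illustrative only.

```python
import numpy as np

def modularity(A, labels):
    """Modularity of a partition, assuming the null expectation d_i * d_j / (2m)."""
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    d = A.sum(axis=1)                          # node degrees
    two_m = d.sum()                            # twice the number of edges
    expected = np.outer(d, d) / two_m          # assumed null expectation
    same = labels[:, None] == labels[None, :]  # indicator that two nodes share a community
    return float(((A - expected) * same).sum() / two_m)

# Toy graph: two triangles joined by a single edge.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1

q_split = modularity(A, [0, 0, 0, 1, 1, 1])   # one community per triangle
q_single = modularity(A, [0, 0, 0, 0, 0, 0])  # everything in one community
```

Placing each triangle in its own community gives a positive value, whereas putting all nodes into a single community gives zero under this null model, because the observed and expected edge totals coincide.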
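To introduce the modularity based community detection framework for heterogeneous networks, we focus on the case with only two types of nodes ($L = 2$). The framework can be easily generalized to networks that contain more than two types of nodes. For a heterogeneous network $G$ with two types of nodes, let $G^{[1]}$ and $G^{[2]}$ denote the two homogeneous networks within node sets $V^{[1]}$ and $V^{[2]}$, respectively. Furthermore, let $G^{[12]}$ denote the bi-partite network formed between node sets $V^{[1]}$ and $V^{[2]}$. We subsequently refer to nodes in $V^{[1]}$ ($V^{[2]}$) as type-[1] (type-[2]) nodes, edges in $G^{[1]}$ ($G^{[2]}$) as type-[1] (type-[2]) edges, and edges in $G^{[12]}$ as type-[12] edges. Writing $n_1$ and $n_2$ for the numbers of type-[1] and type-[2] nodes, we consider the following three matrices:
$A^{[1]}$, the $n_1 \times n_1$ 0-1 adjacency matrix of $G^{[1]}$, where $A^{[1]}_{ij} = 1$ if and only if there is an edge between $v^{[1]}_i$ and $v^{[1]}_j$.
$A^{[2]}$, the $n_2 \times n_2$ 0-1 adjacency matrix of $G^{[2]}$, where $A^{[2]}_{ij} = 1$ if and only if there is an edge between $v^{[2]}_i$ and $v^{[2]}_j$.
$A^{[12]}$, the $n_1 \times n_2$ 0-1 matrix of $G^{[12]}$, where $A^{[12]}_{ij} = 1$ if and only if there is an edge between $v^{[1]}_i$ and $v^{[2]}_j$.
Note that $A^{[12]}$ is not the adjacency matrix of $G^{[12]}$, but only a submatrix of it. The adjacency matrix of $G^{[12]}$ is
$$\begin{pmatrix} 0 & A^{[12]} \\ A^{[21]} & 0 \end{pmatrix},$$
where $A^{[21]} = (A^{[12]})^{T}$. We use $M^{T}$ to denote the transpose of matrix $M$. The matrix $A^{[12]}$ is usually referred to as the bi-adjacency matrix of $G^{[12]}$. Since we only focus on networks with undirected edges, adjacency matrices $A^{[1]}$ and $A^{[2]}$ are both symmetric. The heterogeneous network $G$ can be uniquely represented by its adjacency matrix $A$,
$$A = \begin{pmatrix} A^{[1]} & A^{[12]} \\ A^{[21]} & A^{[2]} \end{pmatrix}.$$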
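To make the block representation concrete, the sketch below assembles the full adjacency matrix of a two-type heterogeneous network from its two within-type adjacency matrices and its bi-adjacency matrix. The function name `heterogeneous_adjacency` and the toy user-event network are illustrative assumptions, not part of the original text.

```python
import numpy as np

def heterogeneous_adjacency(A1, A2, A12):
    """Stack the within-type and between-type blocks into one symmetric matrix."""
    A1, A2, A12 = (np.asarray(M) for M in (A1, A2, A12))
    if not (np.array_equal(A1, A1.T) and np.array_equal(A2, A2.T)):
        raise ValueError("within-type adjacency matrices must be symmetric")
    if A12.shape != (A1.shape[0], A2.shape[0]):
        raise ValueError("bi-adjacency matrix must have one row per type-1 node "
                         "and one column per type-2 node")
    # The lower-left block is the transpose of the bi-adjacency matrix.
    return np.block([[A1, A12], [A12.T, A2]])

# Toy example: 2 users who are friends, 3 events with no event-event links.
A1 = np.array([[0, 1], [1, 0]])
A2 = np.zeros((3, 3), dtype=int)
A12 = np.array([[1, 1, 0], [0, 1, 1]])
A = heterogeneous_adjacency(A1, A2, A12)
```

The zero block for the events reflects a network in which only one node type has within-type edges, as in the user-event illustration.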
2.1. Null Model for Heterogeneous Networks
The modularity function measures the difference between the observed network and the null model that characterizes networks with no community structure. To define the modularity function for a heterogeneous network, we need to formulate a null model for heterogeneous networks.
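We introduce the following notations on degree sequences:
$d^{[1]} = (d^{[1]}_1, \ldots, d^{[1]}_{n_1})$, where $d^{[1]}_i$, $i = 1, \ldots, n_1$, is the number of links incident to $v^{[1]}_i$ from $V^{[1]}$.
$d^{[12]} = (d^{[12]}_1, \ldots, d^{[12]}_{n_1})$, where $d^{[12]}_i$, $i = 1, \ldots, n_1$, is the number of links incident to $v^{[1]}_i$ from $V^{[2]}$.
$d^{[21]} = (d^{[21]}_1, \ldots, d^{[21]}_{n_2})$, where $d^{[21]}_j$, $j = 1, \ldots, n_2$, is the number of links incident to $v^{[2]}_j$ from $V^{[1]}$.
$d^{[2]} = (d^{[2]}_1, \ldots, d^{[2]}_{n_2})$, where $d^{[2]}_j$, $j = 1, \ldots, n_2$, is the number of links incident to $v^{[2]}_j$ from $V^{[2]}$.
From the definitions, we see that $d^{[1]}$ is the vector of column (row) sums of $A^{[1]}$, $d^{[12]}$ is the vector of row sums of $A^{[12]}$, $d^{[21]}$ is the vector of column sums of $A^{[12]}$, and $d^{[2]}$ is the vector of column (row) sums of $A^{[2]}$. Write the number of edges in $G^{[1]}$ as $m^{[1]} = \sum_{i} d^{[1]}_i / 2$, the number of edges in $G^{[2]}$ as $m^{[2]} = \sum_{j} d^{[2]}_j / 2$, and the number of edges in $G^{[12]}$ as $m^{[12]} = \sum_{i} d^{[12]}_i = \sum_{j} d^{[21]}_j$. Define $d = (d^{[1]}, d^{[12]}, d^{[21]}, d^{[2]})$.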
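The four degree sequences and the three edge counts are row and column sums of the adjacency and bi-adjacency matrices, which the following sketch computes directly. The function name `degree_sequences` and the dictionary keys are hypothetical labels chosen for this illustration.

```python
import numpy as np

def degree_sequences(A1, A2, A12):
    """Degree sequences and edge counts of a two-type heterogeneous network."""
    A1, A2, A12 = (np.asarray(M) for M in (A1, A2, A12))
    d1 = A1.sum(axis=1)    # type-1 nodes: links to other type-1 nodes
    d12 = A12.sum(axis=1)  # type-1 nodes: links to type-2 nodes (row sums)
    d21 = A12.sum(axis=0)  # type-2 nodes: links to type-1 nodes (column sums)
    d2 = A2.sum(axis=1)    # type-2 nodes: links to other type-2 nodes
    return {
        "d1": d1, "d12": d12, "d21": d21, "d2": d2,
        "m1": int(d1.sum()) // 2,  # each within-type edge is counted twice
        "m2": int(d2.sum()) // 2,
        "m12": int(d12.sum()),     # each between-type edge is counted once
    }

# Toy example: 2 users who are friends, 3 events, 4 attendance links.
A1 = np.array([[0, 1], [1, 0]])
A2 = np.zeros((3, 3), dtype=int)
A12 = np.array([[1, 1, 0], [0, 1, 1]])
deg = degree_sequences(A1, A2, A12)
```

The row sums and the column sums of the bi-adjacency matrix always add up to the same total, which is the number of between-type edges.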
An appropriate null model should satisfy the following two conditions. First, it should describe a random heterogeneous network with no community structure. Second, the networks from the null model should share basic structural properties with the observed network (Newman, 2006; Zhang and Chen, 2016). For the null model of a heterogeneous network, we propose to preserve the observed degree sequence $d$. That is, the degrees $d^{[1]}_i$ and $d^{[12]}_i$ for each node $v^{[1]}_i$, $i = 1, \ldots, n_1$, are fixed. Similarly, the degrees $d^{[2]}_j$ and $d^{[21]}_j$ for each node $v^{[2]}_j$, $j = 1, \ldots, n_2$, are fixed.
Preserving the observed degree sequence has been considered in various homogeneous network models in the literature (Chung and Lu, 2002; Newman and Girvan, 2004; Perry and Wolfe, 2012). The edge distribution in real-world networks often displays high global inhomogeneity and local inhomogeneity. Global inhomogeneity refers to the feature that most nodes have low degrees while a few nodes have high degrees. Local inhomogeneity refers to a high concentration of edges within certain groups of nodes and a low concentration of edges between these groups; it is also referred to as the community structure. To study the local inhomogeneity, it is important to control for the global inhomogeneity. That is, to study the community structure, it is important to control for the degree sequence.
The sample space in our null model is defined as
$$\Sigma_{d} = \{\text{all simple heterogeneous networks with degree sequence } d\}.$$
For a heterogeneous network $G$ from the sample space, the null distribution is defined as
$$P(G) = \frac{1}{|\Sigma_{d}|}. \tag{2.2}$$
Under the null model, every heterogeneous network from $\Sigma_{d}$ is equally likely to occur and there is no preference for any network configuration. With the defined null model, we need to calculate the expectations $E(A^{[1]}_{ij})$, $E(A^{[2]}_{ij})$ and $E(A^{[12]}_{ij})$ for the modularity function defined in Section 2.2. Here the expectation is taken with respect to $P$ in (2.2).
To calculate under the null model, we notice that
where is the set of all simple heterogeneous networks in with =1, . Denote as the set of all simple homogeneous graphs with degree sequence , as the set of all simple homogeneous graphs with degree sequence , and as the set of all bipartite graphs with degree sequence for type- nodes and degree sequence for type- nodes. We have . It is easy to see that
where is the total number of simple homogeneous networks with degree sequence and a link between nodes and . Similarly, we can show that
where is the total number of bi-partite graphs with degree sequences for type- nodes, for type- nodes and a link between the th node of type- and the th node of type-.
Calculating the numerators and the denominators in (S1.2), (S1.3) and (2.5) is a difficult problem. Bender and Canfield (1978) and Bollobás and McKay (1986) derived asymptotic formulas for the number of simple graphs with a fixed degree sequence and pre-specified structure zeroes (a structure zero at means no edge can be placed between node and node ). Based on these asymptotic formulas, we have the following approximations for the expectations.
Define , , , , and assume that . Suppose for some , , , , , , and . Then is
We refer to the online supplementary material for the proof. The conditions in Theorem 1 describe the density of the network as the network size tends to infinity. Specifically, the conditions , , and characterize the rates that the maximum node degrees increase at; these conditions are to make sure the network does not become extremely dense as the network size grows. The conditions , and provide lower bounds for edge sums , and (=); these conditions are to make sure the network does not become extremely sparse as the network size grows.
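The results in Theorem 1 indicate that $E(A^{[l]}_{ij})$ can be well approximated by $d^{[l]}_i d^{[l]}_j / (2 m^{[l]})$, $l = 1, 2$, and $E(A^{[12]}_{ij})$ can be well approximated by $d^{[12]}_i d^{[21]}_j / m^{[12]}$. As such, we use these approximations in the modularity function defined in the next section.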
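The sketch below computes approximate expected edge indicators from the degree sequences alone. It assumes the usual degree-product form, that is, the product of the two within-type degrees divided by twice the number of within-type edges, and the product of the two between-type degrees divided by the number of between-type edges. The function name `expected_edges` and the example degree sequences are illustrative assumptions.

```python
import numpy as np

def expected_edges(d1, d2, d12, d21):
    """Degree-product approximations to the expected adjacency entries (assumed form)."""
    d1, d2, d12, d21 = (np.asarray(x, dtype=float) for x in (d1, d2, d12, d21))

    def within(d):
        total = d.sum()  # equals twice the number of within-type edges
        if total == 0:
            return np.zeros((d.size, d.size))
        return np.outer(d, d) / total

    m12 = d12.sum()      # number of between-type edges
    if m12 == 0:
        E12 = np.zeros((d12.size, d21.size))
    else:
        E12 = np.outer(d12, d21) / m12
    return within(d1), within(d2), E12

# Hypothetical degree sequences for 4 type-1 nodes and 2 type-2 nodes.
E1, E2, E12 = expected_edges([2, 2, 3, 1], [1, 1], [1, 0, 2, 1], [3, 1])
```

Under these product forms the row sums of each expected matrix reproduce the corresponding degrees, which is the sense in which the approximation keeps the degree sequence fixed on average.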
2.2. Modularity Function
We first consider heterogeneous networks with only two types of nodes (). Later in this section, we generalize the results to heterogeneous networks with any . We define the modularity matrix for the heterogeneous network as
where , , and . If there are no edges within the type- (or type-) nodes, we set (or ). Similarly, if there are no edges between type- and type- nodes, we set and .
Define a 0-1 assignment matrix of dimension as
where is an matrix with if node is in the -th community and 0 otherwise, and is an matrix with if node is in the -th community and 0 otherwise. The modularity function of a heterogeneous network is defined as
where denotes the trace of a square matrix. With some calculations, we can derive that
Here denotes the th row of matrix and is an indicator function. For example, only when nodes and are both of type- and they are in the same community. The first component and the third component calculate the differences between the observed number of intra-community edges and the expected number of intra-community edges in networks and , respectively. The second component calculates the difference between the observed number of intra-community edges and the expected number of intra-community edges in the bi-partite network .
From the definition, we can see the modularity function . When approaches 1, the observed numbers of type-, type- and type- intra-community edges are much higher than the expected values, which indicates a strong community structure. On the other hand, when approaches 0, the observed numbers of type-, type- and type- intra-community edges are close to the expected values, which indicates a weak community structure.
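The three components described above can be sketched as follows. The snippet assumes degree-product expectations and assumes that each component is normalized by its own edge count; both choices are made for illustration, and the way the components are combined into a single modularity value is not reproduced here. A component with no edges is set to 0, mirroring the convention stated above. All function and variable names are hypothetical.

```python
import numpy as np

def component_modularities(A1, A2, A12, labels1, labels2):
    """Observed minus expected intra-community edges for each of the three edge types."""
    A1, A2, A12 = (np.asarray(M, dtype=float) for M in (A1, A2, A12))
    labels1, labels2 = np.asarray(labels1), np.asarray(labels2)

    def within(A, labels):
        d = A.sum(axis=1)
        two_m = d.sum()
        if two_m == 0:            # no edges of this type: component set to 0
            return 0.0
        same = labels[:, None] == labels[None, :]
        return float(((A - np.outer(d, d) / two_m) * same).sum() / two_m)

    def between(B, row_labels, col_labels):
        m = B.sum()
        if m == 0:
            return 0.0
        same = row_labels[:, None] == col_labels[None, :]
        expected = np.outer(B.sum(axis=1), B.sum(axis=0)) / m
        return float(((B - expected) * same).sum() / m)

    return within(A1, labels1), between(A12, labels1, labels2), within(A2, labels2)

# Toy network: 4 users in two friend pairs, 2 events, each pair attends one event.
A1 = np.array([[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]])
A2 = np.zeros((2, 2))
A12 = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
q1, q12, q2 = component_modularities(A1, A2, A12, [0, 0, 1, 1], [0, 1])
```

In this toy network the user-user and user-event components are both positive for the two-community assignment, while the event-event component is zero because there are no edges between events.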
To generalize the modularity function to a heterogeneous network with types of nodes, we denote the adjacency matrix of as and the bi-adjacency matrix of as , . Write the number of nodes in each type as , . Further, write the number of edges in as and the number of edges in as , . The modularity function is defined as
Here , , . Matrix is a assignment matrix defined similarly as that in (2.6). The expectations in the modularity function are approximated using the following corollary.
Define , , , , and assume that . Suppose for some , , , , , and . Then is
The corollary is directly available from Theorem 1. Since a larger modularity value indicates a stronger community structure, the community assignment of nodes in the heterogeneous network is identified by maximizing the modularity function with respect to . In the next section, we introduce a Louvain type method for efficient modularity function maximization.
3. Modularity Maximization
Our goal is to find the community assignment matrix that maximizes the modularity function in (2.8), i.e.,
Maximizing this objective function is a very difficult problem, especially since the number of communities is generally unknown. Brandes et al. (2008) showed that finding the partition that maximizes the modularity function for a homogeneous network is NP-hard. Existing heuristic approaches for maximizing the modularity function come from various fields, including computer science, physics and sociology (Clauset et al., 2004; Massen and Doye, 2005; Newman, 2006; Reichardt and Bornholdt, 2006; Agrawal and Kempe, 2008). In this paper, we adopt a Louvain type maximization method.
The Louvain maximization method is a hierarchical clustering method proposed in Blondel et al. (2008). The technique was developed to maximize the modularity function of a homogeneous network. The optimization procedure is carried out in two phases that are repeated iteratively. The first phase starts by assigning each node in the network to its own community (each community contains one and only one node). Then each node is moved to the neighboring community that results in the largest increase in modularity (if no increase is possible, then node remains in its original community). A neighboring community of node is defined as a community that node is linked to. In the second phase, the algorithm aggregates nodes in the same community and “constructs” a new network whose nodes are the communities from the first phase. The edges between the new nodes are calculated using the edges connecting the two corresponding communities (see Blondel et al. (2008) for details). These steps are repeated iteratively until the modularity reaches its local maximum.
The Louvain method has been successfully applied to various homogeneous networks of sizes up to 100 million nodes and billions of links. Using the Louvain method for community detection in a typical network with 2 million nodes takes only several minutes on a standard PC (Blondel, 2011). Fortunato (2012) noted that the modularity maximum found by the Louvain method often compares favorably with those found by using the methods in Clauset et al. (2004) and Wakita and Tsurumi (2007).
Similar to the Louvain method, finding the maximizer of the proposed heterogeneous network modularity function can also be carried out in two phases that are repeated iteratively. To ease the presentation, we focus on the case where $L = 2$, i.e., there are two types of nodes. First we define the term “unit”. A unit may contain one node of any type or two nodes of different types. A community consists of several units. To initialize, we assign each node in the network to its own unit. Therefore, if there are $n_1$ type-[1] nodes and $n_2$ type-[2] nodes, the algorithm starts with $n_1 + n_2$ units. In the first phase, we start by assigning each unit to its own community. Then we calculate the change in modularity when unit $i$ is assigned to each one of its neighboring communities. A neighboring community of unit $i$ is defined as a community that unit $i$ is linked to. Once this value is calculated for every community that unit $i$ is linked to, we assign unit $i$ to the community that leads to the largest increase in modularity. If no move increases the modularity, unit $i$ remains in its original community. This step is applied repeatedly to the units in the network until no increase in modularity can be achieved. In the second phase, we examine each community from the first phase and merge nodes of the same type in each community. This community then becomes a new unit in the next step. If two communities are linked, then there is an edge between the two new units; if two communities are not linked, then there is no edge between the two new units. We repeat these two phases iteratively until there is no move possible and the modularity reaches a local maximum.
As an example, Figure 3.2 shows the application of the proposed algorithm to a heterogeneous network with 2 types of nodes. Each iteration contains two phases. In the first iteration, the number of communities changes from 11 to 4. After the first iteration, nodes 1 and 2 are merged and treated as one node, say $a$, in the second iteration; similarly, nodes 7 and 8 are merged and treated as one node, say $b$; node 3 does not merge with any node and is treated as one node, say $c$. In the second iteration, nodes $a$ and $b$ form a unit and node $c$ is a unit. During the first phase in the second iteration, we compute the change in modularity when we place unit $\{a, b\}$ and unit $\{c\}$ in one community. If the modularity increases, we place $\{a, b\}$ and $\{c\}$ in one community; if the modularity decreases, the two units remain in their original communities. In the second iteration, the number of communities changes from 4 to 2. The algorithm outputs two communities with the first community including nodes 1, 2, 3, 7, 8 and the second community including nodes 4, 5, 6, 9, 10, 11.
The algorithm can be summarized as follows.
Take the modularity matrix as input:
1. Assign each node to its own unit.
2. Assign each unit to its own community.
3. For each unit $i$, place it into the neighboring community that leads to the largest modularity increase. If no such move is possible, unit $i$ remains in its current community.
4. Apply Step 3 repeatedly to the units in the network until no units can be moved.
5. If the modularity is higher than the modularity from the previous iteration, then merge the nodes of the same type in each community; each community is treated as a unit and return to Step 2. If not, output the community assignment and the modularity value from the previous iteration.
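The first phase (Steps 2 to 4) can be sketched in Python for the simplest case in which every unit is a single node. The sketch assumes a symmetric matrix `B` such that the modularity of a partition equals the sum of `B[i, j]` over all ordered pairs of nodes in the same community; the aggregation in Step 5 is omitted, and ties are broken by the first candidate rather than at random. The function name `local_moving` is hypothetical, and this is an illustration under these assumptions rather than the authors' implementation.

```python
import numpy as np

def local_moving(B, A, order=None):
    """Greedy first phase: move each node to the neighboring community with the largest gain."""
    n = B.shape[0]
    labels = np.arange(n)                 # every node starts in its own community
    order = list(range(n)) if order is None else list(order)
    improved = True
    while improved:
        improved = False
        for i in order:
            def link(c):
                members = labels == c
                members[i] = False        # exclude node i itself
                return B[i, members].sum()
            current = labels[i]
            base = link(current)
            best_gain, best_c = 0.0, current
            for c in np.unique(labels[A[i] > 0]):   # communities node i is linked to
                if c == current:
                    continue
                gain = 2.0 * (link(c) - base)       # change in modularity (B symmetric)
                if gain > best_gain + 1e-12:
                    best_gain, best_c = gain, c
            if best_c != current:
                labels[i] = best_c
                improved = True
    return labels

# Toy graph: two triangles joined by one edge, with a degree-product null model.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
d = A.sum(axis=1)
B = (A - np.outer(d, d) / d.sum()) / d.sum()
labels = local_moving(B, A)
```

Every accepted move strictly increases the modularity, so the loop terminates; on this toy graph the procedure ends with one community per triangle.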
The result of the algorithm depends on the initial ordering of the nodes. In addition, in Step 3, each node is assigned to the community that leads to the largest modularity increase. If there are several communities that all lead to the largest increase, one community is randomly selected. Hence, the Louvain method may not arrive at the same result in successive runs. In our analysis, we apply the Louvain method times with random node orderings to find the maximum of the modularity function. In general, should increase with the size and the complexity of the network. In our simulation and real data analysis, we set . We do not observe notable improvements in the maximized modularity function for . However, other networks of comparable or larger sizes may benefit if larger values of are selected.
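Repeating the procedure over random orderings and keeping the best result can be sketched as below. The wrapper takes any function that runs the maximization once for a given ordering and returns a pair of labels and modularity value; `best_of_restarts`, `run_once` and the toy stand-in `toy_run` are hypothetical names, and the stand-in does not perform community detection.

```python
import numpy as np

def best_of_restarts(run_once, n_units, n_restarts, seed=0):
    """Run a randomized maximizer several times and keep the highest modularity."""
    rng = np.random.default_rng(seed)
    best_labels, best_q = None, -np.inf
    for _ in range(n_restarts):
        order = rng.permutation(n_units)   # random ordering of the units
        labels, q = run_once(order)
        if q > best_q:
            best_labels, best_q = labels, q
    return best_labels, best_q

# Hypothetical stand-in for one run: its "modularity" depends only on the ordering.
seen = []
def toy_run(order):
    q = float(order[0]) / len(order)
    seen.append(q)
    return order % 2, q

labels, q = best_of_restarts(toy_run, n_units=6, n_restarts=10)
```

Fixing the seed makes the sequence of orderings reproducible, which is convenient when comparing different numbers of restarts.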
In the implementation of the Louvain method, the decision of whether and where to move a node can be computed in time. Consequently, the complexity per iteration is , where is the total number of edges in the network. An upper bound on the total running time of the algorithm can be calculated as , where is the total number of iterations. A trivial upper bound on , which gives the worst case, is . While no nontrivial upper bound has been established on the number of iterations, the method needs only tens of iterations in practice to converge.
We note that the Louvain maximization method does not require the number of communities to be pre-specified. In cases where it is desirable to fix the number of communities at in the procedure, the Louvain method can still be applied. Specifically, if is reached during the iterations in the algorithm, we would stop the procedure and output the community assignment. If is not reached after the algorithm finishes, i.e., the algorithm finds , then we would continue with the algorithm and stop once is reached; when continuing the algorithm, a small modification is that in Step 3, we would move unit into the neighboring community that leads to the smallest modularity decrease.
4. Consistency
The consistency of community detection approaches for homogeneous networks has been studied extensively (Bickel and Chen, 2009; Rohe et al., 2011; Zhao et al., 2012; Jin, 2015). However, theoretical properties of community detection methods for heterogeneous networks are largely unaddressed. In this section, we investigate the consistency property of our proposed method under a heterogeneous stochastic blockmodel framework. The consistency property of our method when applied to bipartite or multipartite networks follows as a special case.
Consider a heterogeneous network with latent community labels , , where is the community that the th node of type- belongs to. Write and . We assume that the sizes of , , are balanced, i.e., is bounded away from zero. We define a community detection criterion to be consistent if
This definition of consistency is a generalization of the one proposed in Zhao et al. (2012) for homogeneous networks. The definition requires that the error rate tends to zero in probability as the number of nodes goes to infinity.
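One common way to measure the error rate of an estimated labeling is the fraction of misclassified nodes, minimized over relabelings of the communities, since community labels are only identified up to permutation. The sketch below computes this quantity; it assumes labels coded as 0, ..., K-1 and a small number of communities K, and it is an illustration rather than the exact criterion used in the definition above. The function name `error_rate` is hypothetical.

```python
from itertools import permutations

import numpy as np

def error_rate(true_labels, est_labels, K):
    """Fraction of misclassified nodes, minimized over permutations of the K labels."""
    true_labels = np.asarray(true_labels)
    est_labels = np.asarray(est_labels)
    best = 1.0
    for perm in permutations(range(K)):
        relabeled = np.array(perm)[est_labels]   # apply the label permutation
        best = min(best, float(np.mean(relabeled != true_labels)))
    return best

r_swapped = error_rate([0, 0, 1, 1], [1, 1, 0, 0], K=2)    # same partition, swapped names
r_one_wrong = error_rate([0, 0, 1, 1], [0, 1, 1, 1], K=2)  # one node misplaced
```

Enumerating all permutations is only practical for small K; for larger K an assignment solver would be the usual replacement.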
Next we introduce the heterogeneous stochastic blockmodel, which serves as the framework of our theoretical development. Consider a heterogeneous network with latent community label . Write the adjacency matrix of as , , and the bi-adjacency matrix of as , . In a heterogeneous stochastic blockmodel, each is an independent Bernoulli random variable with
and each is an independent Bernoulli random variable with
where is a symmetric probability matrix specifying the connecting probabilities between different communities of type- nodes, and is a probability matrix specifying the connecting probabilities between type- nodes and type- nodes in different communities. Note that by definition, we have . Define where , .
To ensure sparsity, the entries in the probability matrices need to tend to zero as the network grows in size. Otherwise, the network is going to become unrealistically dense. Following Bickel and Chen (2009), we define the expected degree , where . We can reparameterize as , where is fixed as . This reparameterization allows us to separate from the structure of the network. See Bickel and Chen (2009) for a more detailed discussion of the reparameterization.
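A two-type heterogeneous stochastic blockmodel can be simulated with independent Bernoulli draws, as in the following sketch. It assumes community labels coded as 0, ..., K-1 and probability matrices indexed by community; the function name `sample_hsbm` and the argument names are hypothetical. Sparsity could be imposed by multiplying the probability matrices by a factor that shrinks with the network size, which is not done here.

```python
import numpy as np

def sample_hsbm(c1, c2, P1, P2, P12, rng):
    """Draw within-type adjacency matrices and a bi-adjacency matrix from a blockmodel."""
    c1, c2 = np.asarray(c1), np.asarray(c2)
    P1, P2, P12 = (np.asarray(P, dtype=float) for P in (P1, P2, P12))

    def within(c, P):
        n = c.size
        draws = rng.random((n, n)) < P[np.ix_(c, c)]  # Bernoulli draw for every pair
        upper = np.triu(draws, k=1)                   # keep pairs i < j: no self loops
        return (upper | upper.T).astype(int)          # symmetrize

    A1 = within(c1, P1)
    A2 = within(c2, P2)
    A12 = (rng.random((c1.size, c2.size)) < P12[np.ix_(c1, c2)]).astype(int)
    return A1, A2, A12

rng = np.random.default_rng(1)
c1 = np.array([0, 0, 0, 1, 1, 1])   # community labels of type-1 nodes
c2 = np.array([0, 0, 1, 1])         # community labels of type-2 nodes
P = np.array([[1.0, 0.0], [0.0, 1.0]])  # extreme case: edges only within communities
A1, A2, A12 = sample_hsbm(c1, c2, P, P, P, rng)
```

The extreme probabilities in the example make the output deterministic, which is useful for checking the indexing; realistic settings would use probabilities strictly between 0 and 1.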
Consider the modularity function in (2.8). The assignment matrix and the assignment vector with , , have one to one correspondence. To simplify the notations, we write the modularity function as in this section. The consistency property of the proposed heterogeneous network community detection criterion is introduced in the following theorem.
Consider from a heterogeneous stochastic blockmodel with parameters and , , . Define
Write and . If the parameters satisfy
then the proposed modularity function is consistent as .
We refer to the online supplementary material for the proof. This result on consistency suggests that if networks are from a heterogeneous stochastic blockmodel with communities, the community labels obtained from maximizing the modularity function will approach the true community labels as the number of nodes goes to infinity.
Conditions (4.1) in Theorem 2 essentially require that, on average, edges are more likely to be established within communities than between communities, even though community structures may not exist in all different types of edges. One example is the parameters in Simulation Setting 3 (see Section 5). In that case, edges within type- or type- nodes have no community structure, but edges linking type- nodes and type- nodes have community structure .
In a homogeneous network () with , the conditions in (4.1) can be simplified to
which is satisfied if