Fast Hierarchy Construction for Dense Subgraphs
Discovering dense subgraphs and understanding the relations among them is a fundamental problem in graph mining. We want to not only identify dense subgraphs, but also build a hierarchy among them (e.g., larger but sparser subgraphs formed by two smaller dense subgraphs). Peeling algorithms ($k$-core, $k$-truss, and nucleus decomposition) have been effective in locating many dense subgraphs. However, constructing a hierarchical representation of the density structure, and even correctly computing the connected $k$-cores and $k$-trusses, has been mostly overlooked. Keeping track of connected components during peeling requires an additional traversal operation, which is as expensive as the peeling process. In this paper, we start with a thorough survey and point to nuances in problem formulations that lead to significant differences in runtimes. We then propose efficient and generic algorithms to construct the hierarchy of dense subgraphs for $k$-core, $k$-truss, or any nucleus decomposition. Our algorithms leverage the disjoint-set forest data structure to efficiently construct the hierarchy during traversal. Furthermore, we introduce a new idea to avoid traversal: we construct the subgraphs while visiting neighborhoods in the peeling process, and build the relations to previously constructed subgraphs. We also consider an existing idea for finding the $k$-core hierarchy and adapt it efficiently for our objectives. Experiments on different types of large-scale real-world networks show significant speedups over naive algorithms and existing alternatives. Our algorithms also outperform the hypothetical limits of any possible traversal-based solution.
|Ahmet Erdem Sarıyüce|
|Sandia National Laboratories|
|Livermore, CA, USA|
|Sandia National Laboratories|
|Livermore, CA, USA|
Graphs are used to model relationships in many applications such as sociology, the WWW, cybersecurity, bioinformatics, and infrastructure. Although real-world graphs are sparse ($|E| \ll |V|^2$), vertex neighborhoods are dense [?]. The clustering coefficients [?] and transitivity [?] of real-world networks are also high, which suggests micro-scale dense structures. The literature abounds with the benefits of dense subgraph discovery for various applications [?, ?]. Examples include finding communities in the web [?, ?] and social networks [?], detecting spam groups in the web [?], discovering migration patterns in the stock market [?], improving software understanding by analyzing the static structure of large-scale software systems [?], analyzing gene co-expression networks [?], finding DNA motifs [?], quantifying the significance of proteins [?] and discovering molecular complexes [?] in protein interaction networks, identifying real-time stories in microblogging websites [?], and improving the throughput of social-networking sites [?].
$k$-core [?, ?], $k$-truss [?, ?, ?, ?, ?, ?, ?], and their generic variant for larger cliques, nucleus decomposition [?], are deterministic algorithms that are effective and efficient for finding dense subgraphs and creating hierarchical relations among them. They are also known as peeling algorithms due to their iterative nature of reaching the densest parts of the graph. Hierarchy has been shown to be a central organizing principle of complex networks, which is useful to relate communities of a graph and can offer insight into many network phenomena [?]. Peeling algorithms do not aim to find a single optimum dense subgraph, but rather give many dense subgraphs with varying sizes and densities, and the hierarchy among them if supported by a post-processing traversal step [?, ?].
We focus on undirected, unattributed graphs. The hierarchy of dense subgraphs is represented as a tree in which each node is a subgraph, each edge shows a containment relation, and the root node is the entire graph. The aim is to efficiently find the hierarchy by using peeling algorithms.
Misconception in the literature: Recent studies on peeling algorithms have, interestingly, overlooked the connectivity condition of $k$-cores and $k$-trusses. In the original definition of $k$-core, Seidman states that a $k$-core is the maximal and connected subgraph where any vertex has degree at least $k$ [?]. However, almost all the recent papers on $k$-core algorithms [?, ?, ?, ?, ?, ?, ?, ?, ?, ?] did not mention that a $k$-core is a connected subgraph, although they cite Seidman’s seminal work [?]. On the $k$-truss side, the idea was introduced independently by Saito et al. [?] (as $k$-dense), Cohen [?] (as $k$-truss), Zhang and Parthasarathy [?] (as triangle $k$-core), and Verma and Butenko [?] (as $k$-community). They all define $k$-truss as a subgraph where any edge is involved in at least $k$ triangles. Regarding the connectivity, Cohen [?] and Verma and Butenko [?] defined the $k$-truss as a single-component subgraph, while others [?, ?] ignored the connectivity. In practice, overlooking the connectedness limits the contributions of most previous work regarding the performance and semantic aspects. More details are given in Section [?].
Finding $k$-cores requires a traversal on the graph after the peeling process, in which the maximum $k$-core values of vertices are found. The same holds for $k$-truss and nucleus decompositions, where the traversal is done on higher-order structures. Constructing the hierarchy is only possible after that. However, it is not easy to track the nested structure of subgraphs during a single traversal over the entire graph. Traversing $k$-cores is cheap: a simple breadth-first search (BFS) takes $O(|E|)$ time. For $k$-truss and higher-order peeling algorithms, however, traversal becomes much costlier due to the larger clique connectivity constraints.
Motivated by the challenging cost of traversals and hierarchy construction, we focus on efficient algorithms to find the $k$-cores, $k$-trusses, or any nuclei in general. Our contributions are as follows:
Thorough literature review: We provide a detailed review of the literature on peeling algorithms to point out the misconception about $k$-core and $k$-truss definitions. We highlight the implications of these misunderstandings on the proposed solutions. We also stress the lack of understanding of hierarchy construction and show that it is as expensive as the peeling process.
Hierarchy construction by disjoint-set forest: We propose to use disjoint-set forest data structure (DF) to track the disconnected substructures that appear in the same node of the hierarchy tree. Disjoint-set forest is incorporated into the hierarchy tree by selectively processing the subgraphs in a particular order. We show that our algorithm is generic, i.e., works for any peeling algorithm.
Avoiding traversal: We introduce a new idea to build the hierarchy without traversal. In the peeling process, we construct the subgraphs while visiting neighborhoods and bookkeep the relations to previously constructed subgraphs. Applying a lightweight post-processing operation to those tracked relations gives us all the hierarchy, and it works for any peeling algorithm.
Experimental evaluation: All the algorithms we propose are implemented for $k$-core, $k$-truss, and $(3,4)$ nucleus decompositions; in the last of these, peeling is done on triangles and their involvement in four-cliques. Furthermore, we bring out an idea from Matula and Beck’s work [?], and adapt and implement it for our needs to solve the $k$-core hierarchy problem more efficiently. Table [?] gives a summary of the speedups we get for each decomposition. Our $k$-core hierarchy algorithm adaptation outperforms the naive baseline on the uk-2005 graph. The best $k$-truss and $(3,4)$ nucleus algorithms are significantly faster than the alternatives. They also beat the hypothetically best possible algorithm (Hypo) that does traversal to find the hierarchy, a striking result that shows the benefit of our traversal-avoiding idea.
This section presents building blocks for our work.
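Let $G$ be an undirected and simple graph. We start by quoting Definitions 1 and 2 from [?]. We use $K_r$ to denote an $r$-clique.
Let $r < s$ be positive integers and $\mathcal{S}$ be a set of $K_s$s in $G$.
$K_r(\mathcal{S})$ is the set of $K_r$s contained in some $S \in \mathcal{S}$.
The number of $S \in \mathcal{S}$ containing $u \in K_r(\mathcal{S})$ is the $\mathcal{S}$-degree of $u$.
Two $K_r$s $u, u'$ are $\mathcal{S}$-connected if there exists a sequence $u = u_1, u_2, \ldots, u_k = u'$ in $K_r(\mathcal{S})$ such that for each $i$, some $S \in \mathcal{S}$ contains $u_i \cup u_{i+1}$.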
These definitions are generalizations of the standard vertex degree and connectedness. Indeed, setting $r = 1$ and $s = 2$ (so $\mathcal{S}$ is a set of edges) yields exactly that. The main definition is as follows.
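Let $k$, $r$, and $s$ be positive integers such that $r < s$. A $k$-$(r,s)$ nucleus is a maximal union $\mathcal{S}$ of $K_s$s such that:
The $\mathcal{S}$-degree of any $u \in K_r(\mathcal{S})$ is at least $k$.
Any $u, u' \in K_r(\mathcal{S})$ are $\mathcal{S}$-connected.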
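To make the degree notion above concrete, the following is a minimal Python sketch for the case $r = 2$, $s = 3$, assuming $\mathcal{S}$ is the set of all triangles of the graph, so that the degree of an edge is the number of triangles containing it. The function name `triangle_degrees` and the edge-list input format are illustrative assumptions, not part of the original work.

```python
from itertools import combinations

def triangle_degrees(edges):
    """Return {edge: number of triangles containing it} for a simple undirected graph.

    Assumption: edges is an iterable of (u, v) pairs without duplicates or self-loops.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    deg = {}
    for u, v in edges:
        # Every common neighbor of u and v closes one triangle with the edge (u, v).
        deg[frozenset((u, v))] = len(adj[u] & adj[v])
    return deg

# A 4-clique: each of its six edges lies in exactly two triangles.
k4_edges = list(combinations(range(4), 2))
k4_degrees = triangle_degrees(k4_edges)
```

With $r = 1$ and $s = 2$ the same idea reduces to counting the edges incident to each vertex, i.e., the ordinary vertex degree.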
Figure [?] gives an example for 2-(2,3) and 2-(2,4) nucleus. For $r = 1$, $s = 2$, a $k$-(1,2) nucleus is a maximal (induced) connected subgraph with minimum vertex degree $k$. This is exactly the $k$-core [?]. Setting $r = 2$, $s = 3$ gives maximal subgraphs where every edge participates in at least $k$ triangles, and edges are triangle-connected. This is almost the same as the definitions of $k$-dense [?], $k$-truss [?], triangle-cores [?], and $k$-community [?]. The difference is in the connectivity condition: [?, ?] define $k$-truss and $k$-community as a connected component, whereas [?, ?] do not mention connectedness, implicitly allowing disconnected subgraphs. The $k$-truss community defined by Huang et al. [?] is the same as the $k$-(2,3) nucleus: both require any pair of edges to be triangle-connected. More details can be found in Section [?]. In the rest of the paper, we will be generic and present all our findings for the $k$-$(r,s)$ nucleus decomposition, which subsumes the $k$-core and $k$-truss definitions.
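For an $r$-clique $u$, $\omega_s(u)$ denotes the $K_s$-degree of $u$. For a subgraph $H \subseteq G$, $\omega_{r,s}(H)$ is defined as the minimum $K_s$-degree of a $K_r$ in $H$, i.e., $\omega_{r,s}(H) = \min\{\omega_s(u) : u \subseteq H\}$.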
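|$K_r$|$r$-clique; complete graph of $r$ vertices|
|$\omega_s(u)$|$K_s$-degree of $u$; number of $s$-cliques containing $u$|
|$\omega_{r,s}(G)$|$\min\{\omega_s(u) : u \subseteq G\}$; min $K_s$-degree of a $K_r$ in $G$|
|$H_u$|max $k$-$(r,s)$ nucleus associated with the $K_r$ $u$|
|$\lambda_s(u)$|$\omega_{r,s}(H_u)$; max $k$-$(r,s)$ number of the $K_r$ $u$|
|$\lambda_{r,s}(G)$|$\min\{\lambda_s(u) : u \subseteq G\}$; min $\lambda_s$ of a $K_r$ in graph $G$|
|$T_{r,s}$|sub-$(r,s)$ nucleus; maximal union of $K_r$s of same $\lambda_s$|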
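The maximum $k$-$(r,s)$ nucleus associated with a $K_r$ $u$, denoted by $H_u$, is the $k$-$(r,s)$ nucleus that contains $u$ and has the largest $k = \omega_{r,s}(H_u)$ (i.e., there is no $l$-$(r,s)$ nucleus containing $u$ with $l > k$).
The maximum $k$-$(r,s)$ number of the $r$-clique $u$, denoted by $\lambda_s(u)$, is defined as $\lambda_s(u) = \omega_{r,s}(H_u)$.
Throughout the paper, $\lambda_s(u)$ implies that $u$ is a $K_r$. We also abuse the notation as $\lambda(u)$ when $r$ and $s$ are obvious. The maximum $k$-(1,2) nucleus is the same as the maximum $k$-core, defined in [?]. For a vertex $v$, $\lambda_2(v)$ is also equal to the maximum $k$-core number of $v$ [?], or the core number of $v$ [?, ?]. Likewise, for an edge $e$, $\lambda_3(e)$ was previously defined as the trussness of an edge in [?].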
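Building the $(r,s)$ nucleus decomposition of a graph $G$ means finding the $\lambda_s$ of all $K_r$s in $G$ and building the $k$-$(r,s)$ nuclei for all $k$. The following corollary shows that given the $\lambda_s$ of all $K_r$s, all $(r,s)$ nuclei of $G$ can be found.
Given $\lambda_s(v)$ for all $K_r$s $v \subseteq G$ and assuming $\lambda_s(u) = k$ for a $K_r$ $u$, the maximum $k$-$(r,s)$ nucleus of $u$, denoted by $H_u$, consists of $u$ as well as any $K_r$ $v$ that has $\lambda_s(v) \ge k$ and is reachable from $u$ via a path $P$ of $K_s$s such that $\lambda_{r,s}(C) \ge k$ for all $C \in P$, where $\lambda_{r,s}(C) = \min\{\lambda_s(w) : w \subset C,\ w \text{ is a } K_r\}$.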
Corollary 1 is easy to see for the $k$-core case, when $r = 1$, $s = 2$. All the traversed vertices are in $H_u$ due to the maximality property of $k$-cores, and all the vertices in $H_u$ are traversed due to the connectivity condition, both mentioned in Definition 2. For the maximum $k$-(2,3) nucleus, we can also see the equality by Definition 2. For all edges $e \in H_u$, $\lambda_3(e) \ge k$ satisfies the first condition, and the path of triangles, which does not contain any edge whose $\lambda_3$ is less than $k$, implies the second condition of Definition 2.
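$H_u$ can be found by traversing $G$ starting at the $K_r$ $u$, with $k = \lambda_s(u)$, and including each $K_r$ $v$ to $H_u$ if
$\lambda_s(v) \ge k$, and
there exists a $K_s$ $C$ such that $v \subset C$ and $\lambda_{r,s}(C) \ge k$.
Repeating this traversal for all $K_r$s $u \subseteq G$ gives all the $k$-$(r,s)$ nuclei of $G$.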
Traversal is trivial for the $k$-(1,2) nucleus ($k$-core): include every reachable vertex with greater or equal $\lambda$. For $r = 2$, $s = 3$, the maximum $k$-(2,3) nucleus is found by doing the traversal on edges. Assuming the $\lambda$ value of the initial edge is $k$, the next edge in the traversal should have a $\lambda$ value of at least $k$, should be in the same triangle, and all the edges of this triangle should have $\lambda$s greater than or equal to $k$. The case $r = 3$, $s = 4$ is similar: the traversal is done on triangles, and the neighborhood condition is on the containment of triangles in four-cliques.
In summary, the $(r,s)$ nucleus decomposition problem has two phases: (1) the peeling process, which finds the $\lambda$ values of $K_r$s, and (2) a traversal on the graph to find all the $k$-$(r,s)$ nuclei. For the $r = 1$, $s = 2$ case, the algorithm for finding the $\lambda$ of vertices is based on the following property, as stated in [?]: to find the vertices with a $\lambda$ of $k$, all vertices of degree less than $k$ and their adjacent edges are recursively deleted. For the first phase, we provide the generic peeling algorithm in Alg. 1, which was introduced in our earlier work [?], and for the second phase, we give the generic traversal algorithm in Alg. 2, which is basically the implementation of Corollary 2. The final algorithm, outlined in Alg. 3, combines the two.
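The two phases can be sketched for the $k$-core case ($r = 1$, $s = 2$) as follows. This is a deliberately simple illustration, not the paper's Alg. 1 to 3: the peeling step picks the minimum-degree vertex by a linear scan rather than with buckets, and the names `build_adj`, `core_numbers`, and `max_core_of` are assumptions made for this example.

```python
from collections import deque

def build_adj(edges):
    """Adjacency sets of a simple undirected graph given as (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def core_numbers(adj):
    """Phase 1 (peeling): repeatedly remove a minimum-degree vertex; return {v: lambda(v)}."""
    deg = {v: len(ns) for v, ns in adj.items()}
    lam, removed, k = {}, set(), 0
    while len(removed) < len(adj):
        v = min((x for x in adj if x not in removed), key=deg.get)
        k = max(k, deg[v])          # lambda never decreases along the peeling order
        lam[v] = k
        removed.add(v)
        for w in adj[v]:
            if w not in removed:
                deg[w] -= 1
    return lam

def max_core_of(adj, lam, u):
    """Phase 2 (traversal): BFS from u over vertices whose lambda is at least lambda(u)."""
    k = lam[u]
    seen, queue = {u}, deque([u])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in seen and lam[w] >= k:
                seen.add(w)
                queue.append(w)
    return seen

# Two 4-cliques joined through vertex 4, which has degree 2.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8)]
adj = build_adj(edges)
lam = core_numbers(adj)
```

In this example the vertices of both cliques receive the same $\lambda$ value of 3, so the peeling output alone cannot tell the two 3-cores apart; only the traversal from a vertex of each clique separates them.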
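Lastly, we define the sub-$(r,s)$ nucleus and strong $K_s$-connectedness to find the $K_r$s with the same $\lambda$ values. We will use them to efficiently locate all the $k$-$(r,s)$ nuclei of a given graph.
Two $K_r$s $u, u'$ with $\lambda(u) = \lambda(u')$ are strongly $K_s$-connected if there exists a sequence $u = u_1, u_2, \ldots, u_k = u'$ such that:
$\lambda(u_i) = \lambda(u)$ for each $i$,
for each $i$, some $K_s$ $C$ contains $u_i \cup u_{i+1}$ and has $\lambda_{r,s}(C) = \lambda(u)$, where $\lambda_{r,s}(C)$ is the minimum $\lambda$ of a $K_r$ in $C$.
A sub-$(r,s)$ nucleus, denoted by $T_{r,s}$, is a maximal union of $K_r$s s.t. for all $u, u' \in T_{r,s}$, $\lambda(u) = \lambda(u')$,
and $u$ and $u'$ are strongly $K_s$-connected.
The sub-(1,2) nucleus is defined as the subcore in [?, ?]. All the notations are given in Table [?].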
The disjoint-set data structure, also known as union-find, keeps disjoint dynamic sets and maintains them under the operations that modify the sets [?]. Each set has a representative. There are two main operations: Union($x$, $y$) merges the dynamic sets with ids $x$ and $y$, either creating a new set or merging one of the sets into the other. Find($x$) returns the representative of the set which contains $x$.
The disjoint-set forest was introduced with two powerful heuristics [?]. In the disjoint-set forest, each set is a tree, each node in the tree is an element of the set, and the root of each tree is the identifier of that set. To keep the trees flat, two heuristics are used that complement each other. The first is union-by-rank, which merges the shorter tree under the longer one. The second is path-compression, which makes each node on the find path point directly to the root. The time complexity of $m$ operations on $n$ elements with the union-by-rank and path-compression heuristics is $O(m\,\alpha(n))$, where $\alpha$ is the inverse Ackermann function, which grows so slowly that the total running time is almost linear [?]. Pseudocode for the Find and Union operations is given in Alg. 4.
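The two heuristics can be illustrated with a compact Python sketch. It is a generic textbook-style disjoint-set forest over the integers $0, \ldots, n-1$, not a transcription of Alg. 4; the class name `DisjointSetForest` is an assumption of this example.

```python
class DisjointSetForest:
    """Union-find over elements 0..n-1 with union-by-rank and path-compression."""

    def __init__(self, n):
        self.parent = list(range(n))   # every element starts as its own root
        self.rank = [0] * n            # upper bound on the height of each tree

    def find(self, x):
        # First pass: locate the root of the tree containing x.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass (path-compression): point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return rx
        # Union-by-rank: attach the tree of smaller rank under the other root.
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
        return rx
```

A typical use is `dsf = DisjointSetForest(5)`, followed by `dsf.union(0, 1)` and a comparison of `dsf.find(0)` with `dsf.find(1)` to test whether two elements are in the same set.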
In this section, we present a detailed review of related work on peeling algorithms. We point out some misconceptions about the definitions and their consequences. Our focus is on peeling algorithms and their output, so we limit our scope to $k$-core and $k$-truss decompositions and their generalizations. A detailed literature review of dense subgraph discovery can be found in [?, ?].
The very first definition of a $k$-core related concept was given by Erdős and Hajnal [?] in 1966. They defined the degeneracy as the largest maximum core number of a vertex in the graph. Matula introduced the min-max theorem [?] for the same thing, highlighting the relationship between degree orderings of the graph and the minimum degree of any subgraph, and its applications to the graph coloring problem. The degeneracy number has been rediscovered numerous times in the context of graph orientations and is alternately called the coloring number [?] and linkage [?].
The first definition of the $k$-core subgraph was given by Seidman [?] for social network analysis, and also by Matula and Beck [?], as $k$-linkage, for clustering and graph coloring applications, both in 1983. Seidman [?] introduced the core collapse sequence, also known as the degeneracy ordering of vertices, as an important graph feature. He states that $k$-cores are good seedbeds that can be used to find further dense substructures. However, there is no algorithm in [?] for finding the $k$-cores. Matula and Beck [?], on the other hand, give algorithms for finding the $\lambda$ values of vertices, and also for finding all the $k$-cores of a graph (and their hierarchy) by using these $\lambda$ values, because there can be multiple $k$-cores for the same $k$ value. Both papers defined the $k$-core subgraph as follows:
“A connected and maximal subgraph $H$ is $k$-core ($k$-linkage) if every vertex in $H$ has at least degree $k$.” [?, ?]
The connectedness is an important detail in this definition because it requires a post-processing traversal operation on vertices to locate all the $k$-cores of the graph. Figure [?] shows this. There are two 3-cores in the graph, and there is no way to distinguish them at the end of the peeling process by just looking at the $\lambda$ values of vertices.
Batagelj and Zaversnik introduced an efficient implementation that uses a bucket data structure to find the $\lambda$ values of vertices [?]. They defined the $k$-core as a not necessarily connected subgraph, in contrast to the previous work they cited [?, ?]. With this assumption, they claimed that their implementation finds all the $k$-cores of the graph.
Finding the relationships between the $k$-cores of a graph has gained a lot of interest. The nested structure of $k$-cores reveals a hierarchy, and it has been shown to be useful for visualization [?] and for understanding the underlying structure of complex networks arising in many domains. Carmi et al. [?] and Alvarez-Hamelin et al. [?] investigated the $k$-core hierarchy of the internet topology at the autonomous systems (AS) level. Healy et al. [?] compared the $k$-core hierarchies of real-world graphs in different domains and some generative models.
Given the practical benefit and efficiency of $k$-core decomposition, there has been a lot of recent work to adapt $k$-core algorithms for different data types or setups. Out-of-memory computation is an important topic for many graph analytic problems that deal with massive graphs not fitting in memory. Cheng et al. [?] introduced the first external-memory algorithm. Wen et al. [?] and Khaouid et al. [?] provided further improvements in this direction. Regarding different types of graphs, Giatsidis et al. adapted the $k$-core decomposition for weighted [?] and directed [?] graphs. To handle the dynamic nature of real-world data, Sariyuce et al. [?] introduced the first streaming algorithms to maintain the $k$-core decomposition of graphs upon edge insertions and removals. They recently improved these algorithms further by leveraging the information beyond 2-hop [?]. Li et al. [?] also proposed incremental algorithms for the same problem. More recently, Wu et al. [?] approached dynamic data from a different angle, and adapted $k$-cores for temporal graphs where possibly multiple interactions between entities occur at different times. Motivated by the incomplete and uncertain nature of real network data, O’Brien and Sullivan [?] proposed new methods to locally estimate core numbers ($\lambda$ values) of vertices when the entire graph is not known, and Bonchi et al. [?] showed how to efficiently do the $k$-core decomposition on uncertain graphs, which have existence probabilities on the edges.
One common oversight in all of this recent work (except [?, ?]) is that it ignores the connectivity of $k$-cores. This oversight does not change the results, but it limits the contributions: these works adapt or improve the peeling part of $k$-core decomposition, which finds the $\lambda$s of vertices, not the entire $k$-core decomposition, which also needs a traversal to locate all the (connected) $k$-cores. Existing external-memory $k$-core decomposition algorithms [?, ?, ?] only focus on how to compute the $\lambda$ values of vertices. The additional traversal operation in external memory, which is at least as expensive as finding the $\lambda$ values, is not taken into consideration. Finding the (connected) $k$-cores and constructing the hierarchy among them efficiently in the external-memory computation model is not a trivial problem, and it will limit the performance of the proposed algorithms for finding $k$-core subgraphs and constructing the hierarchy. A similar argument applies to weighted [?], probabilistic [?], and temporal [?] $k$-core decompositions, all of which have some kind of threshold-based adaptation on weights, probabilities, and timestamps, respectively. On the other hand, the definition of connectedness is semantically unclear for some existing works, such as the directed graph core decomposition [?]. It is only specified that the in- and out-degrees of vertices can be considered to find two $\lambda$ values, but the traversal semantics are not defined for finding subgraphs or constructing the hierarchy. One can think of building the hierarchy by considering the edges from lower-level $k$-cores to higher-level ones, or the opposite. To remedy those misconceptions, we focus on the efficient computation of the traversal part for $k$-core decomposition and its higher-order variants.
$k$-truss decomposition is inspired by the $k$-core and can be thought of as the same peeling problem at a higher level that deals with triangles. It was independently introduced, with subtle differences, by several researchers. Chronologically, the idea was first proposed by Saito et al. [?], to the best of our knowledge, in 2006:
“$k$-dense is a subgraph $S$ if each adjacent vertex pair in $S$ has more than or equal to ($k$-2) common adjacent vertices in $S$.”
In other words, each edge in $S$ should be involved in at least $k$-2 triangles. Nothing is mentioned about the connectedness of the vertices and edges, which implies that a $k$-dense subgraph might have multiple components. Saito et al. argue that $k$-dense is a good compromise between easy-to-compute $k$-cores and high-quality $k$-cliques, and that it is useful for detecting communities in social networks. In 2008, Jonathan Cohen introduced the $k$-truss as a better model for cohesive subgraphs in social networks [?], which became the most popular name in the literature:
“$k$-truss is a one-component subgraph such that each edge is reinforced by at least $k$-2 pairs of edges making a triangle with that edge.”
In 2012, Zhang and Parthasarathy [?] proposed a new definition for visualization purposes:
“triangle $k$-core is a subgraph that each edge is contained within at least $k$ triangles in the subgraph.”
Again there was no reference to connectedness, implying that multiple components can be observed in a triangle $k$-core. In the same year, Verma and Butenko [?] introduced the following:
“$k$-community is a connected subgraph if every edge is involved in at least $k$ triangles.”
The subtle difference between those papers is the connectedness issue. The $k$-dense [?] and triangle $k$-core [?] definitions allow the subgraph to be disconnected, whereas the $k$-truss [?] and $k$-community [?] are defined to be connected. All of these works only provided algorithms to find the $\lambda$ values of edges. $k$-dense and triangle $k$-cores can be found this way since they can be disconnected. However, finding the $k$-truss and $k$-community subgraphs requires a post-processing traversal operation, which increases the cost. As a stronger alternative to the $k$-truss, Huang et al. [?] introduced the $k$-truss community. The only difference is that each edge pair in a $k$-truss community is directly or transitively triangle-connected, where two edges should reside in the same triangle to be triangle-connected. The generic $k$-$(r,s)$ nucleus, proposed by Sariyuce et al. [?], gives the exact same definition for $r = 2$, $s = 3$. This imposes a stronger condition on the connectivity structure, and it has been shown to result in denser subgraphs than the classical $k$-truss definition [?]. However, it has the extra overhead of a post-processing traversal operation that must visit triangles, which is more expensive than the traditional traversal. The authors devised the TCP index, a tree structure at each vertex, to remedy this issue [?]. Figure [?] highlights the difference between those definitions on a simple example.
$k$-truss decomposition serves as a better alternative to the $k$-core. For most applications that the $k$-core is useful for, $k$-truss decomposition performs better. Gregori et al. [?] investigated the structure of internet AS-level topologies by looking at the $k$-dense subgraphs, similar to Carmi et al. [?] and Alvarez-Hamelin et al. [?], who used the $k$-core for the same purpose. Orsini et al. [?] also investigated the evolution of $k$-dense subgraphs in AS-level topologies. It has also been used to understand the global organization of clusters in complex networks [?]. Colomer-de-Simon et al. used the hierarchy of $k$-dense subgraphs to visualize real-world networks, as Healy et al. [?] used the $k$-cores for the same objective.
The proven strength of $k$-truss decomposition drew further interest in adapting it to different data types and setups, similar to the $k$-core literature. Wang and Cheng introduced external-memory algorithms [?], and more improvements were provided by Zhao and Tung [?] for visualization purposes. More recently, Huang et al. [?] introduced probabilistic truss decomposition for uncertain graphs.
Similar to the $k$-core case, overlooking the connectivity constraints limits the contributions in the $k$-truss literature as well. For example, external-memory $k$-truss decomposition [?] would be more expensive and require more intricate algorithms if it were done to find connected subgraphs by doing the traversal in the external-memory model. We believe that our algorithms for efficiently finding the $k$-trusses and constructing the hierarchy will be helpful in dealing with this issue.
Given the similarity between $k$-core and $k$-truss decompositions, people have been interested in unified schemes to generalize the peeling process for a broader set of graph substructures.
Saito et al. pointed to a possible direction of generalization in their pioneering work [?], where they defined $k$-dense subgraphs. Their proposal is as follows:
“Subgraph $S$ is a $h$-level $k$-dense community if the vertices in every $h$-clique of $S$ is adjacent to at least $k$-$h$ common vertices.” [?]
In other words, an $h$-level $k$-dense community is the set of $h$-cliques where each $h$-clique is contained in at least $k-h$ $(h+1)$-cliques. Note that there is no connectivity constraint in the definition. The $h$-level $k$-dense community subsumes the disconnected $k$-core, which contains multiple $k$-cores, for $h = 1$. For $h = 2$, it is their $k$-dense definition [?]. They claimed that $h$-level $k$-dense communities for $h > 2$ are more or less the same as for $h = 2$ and incur higher computation cost. So they did not dive into more algorithmic details and stuck with $h = 2$.
Sariyuce et al. [?] introduced a broader definition to unify the existing proposals, which can be found by a generic peeling algorithm. As explained in Section [?], their definition subsumes the $k$-core and $k$-truss community [?] concepts. It is also more generic than the $h$-level $k$-dense community of [?], since (1) it allows looking for the involvement of cliques whose sizes can differ by more than one, and (2) it enforces a stronger connectivity constraint to get denser subgraphs. The $h$-level $k$-dense community can be expressed as the $k$-$(h, h+1)$ nucleus without any connectivity constraint ($k$ is actually $k-h$, and it does not matter). The well-defined theoretical notion of the $k$-$(r,s)$ nucleus makes it possible to provide a unified algorithm to find all the nuclei in a graph, as explained in Section [?].
Sariyuce et al. [?] also analyzed the time and space complexity of $(r,s)$ nucleus decomposition. For the first phase, they report that finding the $\lambda$ values of nuclei (Alg. 1) requires $O(RT_r(G) + \sum_v \omega_r(v)\, d(v)^{s-r})$ time with $O(|K_r(G)|)$ space, where $RT_r(G)$ is the $K_r$ enumeration time, and the second part is searching each $K_s$ that a $K_r$ is involved in ($\omega_r(v)$ is the number of $K_r$s containing vertex $v$, $d(v)$ is the degree of $v$, and $|K_r(G)|$ is the number of $K_r$s in $G$). For the second phase, the traversal on the entire graph needs to access each $K_r$ and examine all the $K_s$s it is involved in. Its time complexity is the same as the second part of the first phase, $O(\sum_v \omega_r(v)\, d(v)^{s-r})$, which also gives the total time complexity.
In this part, we first highlight the challenging aspects of the traversal phase, then introduce two algorithms for faster computation of the $(r,s)$ nucleus decomposition to meet those challenges.
As mentioned in the previous section, the time complexity of the traversal algorithm for $(r,s)$ nuclei is $O(\sum_v \omega_r(v)\, d(v)^{s-r})$ (with $\omega_r(v)$ the number of $K_r$s containing $v$ and $d(v)$ the degree of $v$). However, designing an algorithm that constructs the hierarchy with this complexity is challenging. In [?], it is stated that finding the nuclei in the reverse order of $\lambda$ is better, since it makes it possible to reuse previously found components and thus avoid repetition. No further details are given, though. This actually corresponds to finding all $T_{r,s}$ (sub-$(r,s)$ nuclei of Definition 5), i.e., connected $K_r$s with the same $\lambda$ value, and incorporating the relations among them. But keeping track of all the $T_{r,s}$ in a structured way is hard. Figure [?] shows a case for $k$-core ($r = 1$, $s = 2$). The traversal algorithm needs to understand the relation between $T_{1,2}$s of equal $\lambda$ that are not directly connected. For instance, A and E are in the same 2-core, but the traditional BFS will find 3 other $T_{1,2}$s (F, D, G) between those two. During the traversal operation, there is a need to detect each $k$-$(r,s)$ nucleus, determine containment relations, and construct the hierarchy. One conceivable solution is to construct the (expectedly smaller) supergraph which takes all the $T_{r,s}$ as vertices and their connections as edges. Then, repetitive traversals can be performed on this supergraph to find each $k$-$(r,s)$ nucleus and the hierarchy. However, the supergraph is not guaranteed to be small enough to make repetitive traversal worthwhile. The structure of the real-world networks investigated in Section [?] also verifies this concern. It is clear that there is a need for a lightweight algorithm or data structure that can be used on-the-fly, so that all the $k$-$(r,s)$ nuclei can be discovered along with the hierarchy during the traversal algorithm.
The other challenge with the traversal algorithm is the high computational cost for the $r \ge 2$ cases. Consider the $r = 2$, $s = 3$ case. We need to traverse edges, and determine the adjacencies of edges by looking at their common triangles. Each step requires finding the common neighbors of the two endpoints of the edge, checking whether each adjacent edge was previously visited, and pushing it to the queue if not. As explained at the end of Section [?], the complexity becomes $O(\sum_v d(v)^2)$, where $d(v)$ is the degree of $v$. The cost gets much higher if we look for $(3,4)$ nuclei, which have been shown to give denser subgraphs with a more detailed hierarchy. Ideally, we would like to completely avoid the costly traversal operation.
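The edge traversal described above can be sketched in Python as follows. This is an illustrative version of the naive traversal for $r = 2$, $s = 3$ only; it assumes the $\lambda$ values of edges have already been computed by peeling and are passed in as a dictionary keyed by `frozenset` edges. The function name `max_23_nucleus_of` and the input format are assumptions of this example.

```python
from collections import deque

def max_23_nucleus_of(adj, lam, e):
    """Edges reachable from edge e through triangles whose edges all have lambda >= lambda(e).

    adj: {vertex: set of neighbors}; lam: {frozenset((u, v)): lambda value}.
    """
    k = lam[e]
    seen, queue = {e}, deque([e])
    while queue:
        u, v = tuple(queue.popleft())
        # Each common neighbor w of the endpoints closes a triangle (u, v, w).
        for w in adj[u] & adj[v]:
            uw, vw = frozenset((u, w)), frozenset((v, w))
            # The triangle is usable only if all of its edges have lambda >= k.
            if lam[uw] >= k and lam[vw] >= k:
                for f in (uw, vw):
                    if f not in seen:
                        seen.add(f)
                        queue.append(f)
    return seen

# Two triangles sharing only vertex 2: connected as a graph, but not triangle-connected.
bowtie = [(0, 1), (0, 2), (1, 2), (2, 3), (2, 4), (3, 4)]
adj = {}
for a, b in bowtie:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)
lam = {frozenset(edge): 1 for edge in bowtie}   # every edge lies in exactly one triangle
left = max_23_nucleus_of(adj, lam, frozenset((0, 1)))
```

The set intersection performed for every dequeued edge is the common-neighbor search mentioned above, which is where the extra cost over a vertex-based BFS comes from.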
We propose to use the disjoint-set forest data structure (DF) to track the disjoint $T_{r,s}$s (of equal $\lambda$), and to construct the hierarchy in which $T_{r,s}$s with smaller $\lambda$ are on the upper side and those with greater $\lambda$ are on the lower side. DF has been used to track the connected components of a graph and fits our problem perfectly, since we need to find the connected components at multiple levels.
The DF-Traversal algorithm, outlined in Alg. 5, replaces the naive Traversal (Alg. 2) in NucleusDecomposition (Alg. 3). Basically, it finds all the $T_{r,s}$ in decreasing order of $\lambda$. We construct the hierarchy-skeleton tree by using $T_{r,s}$s. Each node in the hierarchy-skeleton is a $T_{r,s}$. We define the subnucleus struct to represent a $T_{r,s}$. It consists of $\lambda$, rank, parent, and root fields. The $\lambda$ field is the $\lambda(u)$ for $u \in T_{r,s}$, rank is the height of the node in the hierarchy-skeleton, parent is a pointer to the parent node, and root is a pointer to the root node of the hierarchy-skeleton. The default values for parent and root are null, and rank is 0. Figure [?] shows an example hierarchy-skeleton obtained by Alg. 5. Thin edges show the disjoint-set forests consisting of $T_{r,s}$s of equal $\lambda$ value. The hierarchy of all nuclei, the output we are interested in, can be obtained easily from the hierarchy-skeleton: we just take the child-parent links for which the $\lambda$ values are different.
In the DF-Traversal algorithm (Alg. 5), we keep all the subnuclei in the hrc list, which also represents the hierarchy-skeleton. The $K_r$s in a $T_{r,s}$ are stored by inverse indices: comp keeps the subnucleus index of each $K_r$ in hrc. We also use visited to keep track of traversed $K_r$s. The main idea is to find each $T_{r,s}$ in decreasing order of $\lambda$. This way we construct the hierarchy-skeleton in a bottom-up manner, which lets us use DF to find the representative $T_{r,s}$, i.e., the greatest ancestor, at any time. At each iteration we find an unvisited $K_r$ with the next $\lambda$ value in that order and find its $T_{r,s}$ by the SubNucleus algorithm, which also updates the hierarchy-skeleton.
SubNucleus (Alg. 6) starts by creating a subnucleus, with of the of interest. We will store the discovered s in this subnucleus (by inverse indices). We put this subnucleus into hrc (line 6) and assign its comp id (line 6). We use marked (line 6) to mark the adjacent subnuclei encountered during traversal so that unnecessary computation is avoided in lines 6-6. We perform the traversal (lines 6-6) by using a queue. At each step of the traversal, we process the next in the queue. First, we assign its comp id as the new subnucleus (line 6) and then visit the adjacent s residing in the same s, in which the min of is equal to (line 6). This is exactly the condition for , given in Definition 5. For each adjacent , its is either equal (line 6) or greater (line 6) by definition of . If it is equal and not visited before, we visit it, put it into the queue (line 6), and also store it in the current subnucleus (line 6). Otherwise, we have found an adjacent subnucleus with greater (line 6) that is already in the hierarchy-skeleton, and we can update the hierarchy-skeleton (lines 6-6) unless we have encountered before (line 6).
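To make the traversal concrete, the following is a minimal sketch restricted to the k-core case, where the traversed units are vertices. It is an illustration under stated assumptions, not Alg. 6 itself: the function name `subcore`, the dictionary-based graph, and the precomputed core numbers in `lam` are all hypothetical, and the hierarchy-skeleton update is reduced to collecting the neighbors with greater values.

```python
from collections import deque


def subcore(adj, lam, start, visited):
    """Breadth-first search over vertices whose core number equals lam[start].

    adj     : dict mapping a vertex to the set of its neighbours
    lam     : dict mapping a vertex to its (already computed) core number
    visited : set shared across calls, updated in place
    Returns the members of the subcore and the adjacent vertices with
    greater core numbers, which belong to subcores found earlier.
    """
    members, higher = [], set()
    queue = deque([start])
    visited.add(start)
    while queue:
        v = queue.popleft()
        members.append(v)
        for u in sorted(adj[v]):
            if lam[u] == lam[v]:
                if u not in visited:      # same value: extend the traversal
                    visited.add(u)
                    queue.append(u)
            elif lam[u] > lam[v]:         # greater value: adjacent subcore
                higher.add(u)
    return members, higher


# Triangle 0-1-2 (core number 2) with a pendant vertex 3 (core number 1).
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
lam = {0: 2, 1: 2, 2: 2, 3: 1}
visited = set()
top_members, top_higher = subcore(adj, lam, 0, visited)      # greater value first
low_members, low_higher = subcore(adj, lam, 3, visited)
```

Starting points are taken in decreasing order of value, so every vertex reported in `higher` is guaranteed to have been placed in a subcore by an earlier call.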
Location of the subnucleus in the hierarchy-skeleton is important. If it is parentless, we can just make it a child of the current subnucleus we are building. If not, subnucleus is part of a larger structure, and we should relate our current subnucleus to the representative of this larger structure, which is the greatest ancestor of that is guaranteed to have greater or equal (by line 6). So, the hierarchy-skeleton update starts by finding the greatest ancestor of the in line 6. The Find-r procedure is defined in Alg. 7. It differs from Find of Alg. 4 in that it uses the root field, not parent. root of a node leads to its greatest ancestor in the hierarchy-skeleton, i.e., either it is the greatest ancestor or a few steps of root iterations would find the greatest ancestor. parent of a node, on the other hand, represents the links in the hierarchy-skeleton and is not modified in Find-r. After finding the root and making sure that it has not been processed before (line 6), we can merge the current subnucleus into the hierarchy-skeleton. If the root has greater , we make it a child of our current subnucleus (line 6) by assigning both root and parent fields. Otherwise, we defer merging to the end (line 6), where we merge all subnuclei with equal by Union-r operations (lines 6-6), defined in Alg. 7. Union-r differs slightly from Union of Alg. 4 in that it uses Find-r instead of Find and sets the root field of the child node to the parent (in Link-r).
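The distinction between the root and parent fields can be sketched as follows. This is a plausible reading of Find-r, Link-r and Union-r based only on the description above, not a transcription of Alg. 7; the names `find_r`, `link_r`, `union_r`, the field name `lam`, and the use of list indices instead of pointers are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Subnucleus:
    lam: int                      # value of the subnucleus (hypothetical name)
    rank: int = 0
    parent: Optional[int] = None  # skeleton link, never rewritten by find_r
    root: Optional[int] = None    # shortcut that find_r is allowed to compress


def find_r(hrc: List[Subnucleus], x: int) -> int:
    """Follow root pointers to the greatest ancestor, compressing root only."""
    path = []
    while hrc[x].root is not None:
        path.append(x)
        x = hrc[x].root
    for u in path:
        hrc[u].root = x           # parent links are left untouched
    return x


def link_r(hrc: List[Subnucleus], x: int, y: int) -> None:
    """Union by rank; the child gets both its parent and root fields set."""
    if x == y:
        return
    if hrc[x].rank > hrc[y].rank:
        x, y = y, x               # x is now the node of lower or equal rank
    hrc[x].parent = y
    hrc[x].root = y
    if hrc[x].rank == hrc[y].rank:
        hrc[y].rank += 1


def union_r(hrc: List[Subnucleus], x: int, y: int) -> None:
    link_r(hrc, find_r(hrc, x), find_r(hrc, y))


hrc = [Subnucleus(2) for _ in range(4)]
union_r(hrc, 0, 1)
union_r(hrc, 2, 3)
union_r(hrc, 0, 2)
top = find_r(hrc, 0)
```

After the final call, node 0 still has node 1 as its parent in the skeleton, while its root shortcut points directly at the greatest ancestor.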
Figure Fast Hierarchy Construction for Dense Subgraphs displays the resulting hierarchy-skeleton for the regions shown on the left. We process in alphabetical order, which also conforms with decreasing order of . Consider the O, which is found and processed last. O finds the adjacent s I, J and K, during lines 6-6 of Alg. 6. All have greater values, so we focus on lines 6-6. Greatest ancestor of I is G, and we make G child of O (line 6) since its is greater. Greatest ancestors of J and K are L and N, respectively, and they have equal values. So, we merge L and N with O in lines 6-6. Say we merge O and N first and O becomes parent of N since its rank is higher. Then, we merge O and L. Their ranks are equal and we arbitrarily choose L as the parent.
After traversing all the , we create a root subnucleus to represent the entire graph and make it the parent of all the parentless nodes (lines 5-5 in Alg. 5). Time complexity of DF-Traversal does not change, i.e., . Additional space is required by the auxiliary data structures in DF-Traversal. hrc needs (for four fields), and comp and visited require space each. In addition, SubNucleus might require at most for marked and merge, and at most for Q, but reaching those upper bounds in practice is quite unlikely. Overall, the additional space requirement of DF-Traversal at any instant is between and . An upper bound for can be given as , when each is assumed to be a subnucleus, but this case is also quite unlikely, as we show in Section Fast Hierarchy Construction for Dense Subgraphs.
All the can be detected without doing a traversal. A is said to be processed, if its is assigned. During the peeling process, neighborhood of each is examined, see lines 1-1 of Alg. 1, but processed neighbors are ignored (line 1). We leverage those ignored neighbors to construct the . We introduce FastNucleusDecomposition algorithm (Alg. 8) to detect early in the peeling process so that costly traversal operation is not needed anymore.
At each iteration of the peeling process, a with the minimum is selected and of its unprocessed neighbors are decremented. No information about the surrounding processed s is used. If we check the processed neighbors, we can infer some connectivity information and use it towards constructing all the as well as the hierarchy-skeleton. For example, assume we are doing k-core decomposition and a vertex with degree is selected. We assign and check the unprocessed neighbors of to decrement their degree, if greater than . We can also examine the processed neighbors. of any processed neighbor is guaranteed to be less than or equal to , by definition. Say is a neighbor with . Then, we can say that and are in the same . Say is another neighbor with . Then, we can infer that the maximum - nucleus of contains , and of is an ancestor of of in the hierarchy-skeleton. Leveraging these pairwise relations enables us to find all the and construct the hierarchy-skeleton.
Note that it is not always possible to detect the of a by only looking at the processed neighbors. Consider k-core decomposition on a star graph, for which all vertices have . The center vertex is processed in the last two steps of peeling, so it is not possible to infer two connected vertices with equal until that time. We therefore find non-maximal s (denoted as ) and combine them by using the disjoint-set forest algorithm. The difference from the DF-Traversal algorithm is that our hierarchy-skeleton will have more nodes because of the non-maximal .
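The pairwise relations available from processed neighbors can be illustrated for the k-core case with a short sketch. It is a simplified illustration, not Alg. 8: the function name `peel_with_processed_neighbors`, the heap-based selection of the minimum-degree vertex, and the two output lists `same` and `lower` are assumptions introduced here.

```python
import heapq


def peel_with_processed_neighbors(adj):
    """k-core peeling that also records relations to processed neighbours.

    adj: dict mapping a vertex to the set of its neighbours (undirected graph).
    Returns the core numbers, the pairs with equal core number, and the pairs
    where the processed neighbour has a smaller core number.
    """
    deg = {v: len(adj[v]) for v in adj}
    lam = {}                                  # core number, set when processed
    same, lower = [], []
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)
    while heap:
        d, v = heapq.heappop(heap)
        if v in lam or d != deg[v]:
            continue                          # stale heap entry
        lam[v] = d
        for u in sorted(adj[v]):
            if u in lam:                      # processed: lam[u] <= lam[v]
                (same if lam[u] == d else lower).append((v, u))
            elif deg[u] > d:                  # unprocessed: usual decrement
                deg[u] -= 1
                heapq.heappush(heap, (deg[u], u))
    return lam, same, lower


# Star graph: centre 0 with leaves 1..4.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
star_lam, star_same, star_lower = peel_with_processed_neighbors(star)

# Triangle 0-1-2 with a pendant vertex 3 attached to 0.
tri = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
tri_lam, tri_same, tri_lower = peel_with_processed_neighbors(tri)
```

On the star graph, every recorded pair involves the center, so the first three leaves start as separate non-maximal groups that are only joined once the center is processed. On the second graph, the pair in `lower` is the kind of relation that would be deferred to a list for building the hierarchy after peeling.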
Colored lines 8-8 in Alg. 8 implement our ideas. For each we encounter (line 8), processed neighbors are explored starting from line 8. Note that there is no need to check every adjacent and processed in the same , since the relations among them are already checked in previous steps. It is enough to find and process the with minimum , as in line 8. If has an equal value (line 8), we need to either put our of interest into the subnucleus of (line 8) or merge it with the subnucleus of by a Union-r operation (line 8). In the FastNucleusDecomposition algorithm, we only build disjoint-set forests during the peeling process (until line 8). If happens to have a smaller value (line 8), we put the pair of subnuclei into a list (ADJ), which will be used to build the hierarchy-skeleton after the peeling. We do not process the relations between subnuclei of different right away for two reasons: (1) the subnucleus of the of interest might not be assigned yet (comp() is -1), and (2) the order of processing subnuclei relations is crucial to building the hierarchy-skeleton correctly and efficiently. Regarding (1), we take care of the s not belonging to a subnucleus in lines 8 and 8. For (2), we have the BuildHierarchy function (line 8), defined in Alg. 9.
In BuildHierarchy, we create bins to distribute the subnuclei pairs based on the smaller of the subnucleus pair. The reason is the same as for our reverse-order discovery of subnuclei in DF-Traversal (Alg. 5): we construct the hierarchy-skeleton in a bottom-up manner, which enables us to use the disjoint-set forest algorithm to locate the - nuclei that we need to process. Distribution is done in lines 9-9. Then, we just process the binned list (binned_ADJ) in reverse order of values (line 9). We do the same operations to build the hierarchy-skeleton: lines 9-9 of BuildHierarchy and lines 6-6 of the SubNucleus algorithm (Alg. 6) are almost the same. Once we finish each list in binned_ADJ, we union the accumulated subnuclei of equal values (lines 9-9), as we did in lines 6-6 of the SubNucleus algorithm. Finally, in FastNucleusDecomposition we create a subnucleus to represent the entire graph, make it the parent of all parentless subnuclei, and report the hierarchy.
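The binning and reverse-order processing can be sketched as follows. This is a reading of the description above under stated assumptions, not Alg. 9: the function names, the parallel lists `parent`, `root` and `rank`, and the representation of ADJ as index pairs are hypothetical, and the input list of values is assumed to be non-empty.

```python
def find_r(root, x):
    """Follow root shortcuts to the greatest ancestor, with path compression."""
    path = []
    while root[x] is not None:
        path.append(x)
        x = root[x]
    for u in path:
        root[u] = x
    return x


def build_hierarchy(adj_pairs, lam):
    """adj_pairs: pairs of subnucleus indices; lam[i]: value of subnucleus i."""
    n = len(lam)
    parent, root, rank = [None] * n, [None] * n, [0] * n
    # Bin every pair by the smaller of its two values.
    bins = [[] for _ in range(max(lam) + 1)]
    for a, b in adj_pairs:
        bins[min(lam[a], lam[b])].append((a, b))
    # Bottom-up construction: greatest values are processed first.
    for value in range(len(bins) - 1, -1, -1):
        merge = []
        for a, b in bins[value]:
            s, t = find_r(root, a), find_r(root, b)
            if s == t:
                continue
            if lam[s] < lam[t]:
                s, t = t, s
            if lam[s] > lam[t]:
                parent[s] = root[s] = t       # greater value becomes the child
            else:
                merge.append((s, t))          # equal values: defer the union
        for a, b in merge:
            s, t = find_r(root, a), find_r(root, b)
            if s == t:
                continue
            if rank[s] > rank[t]:
                s, t = t, s
            parent[s] = root[s] = t           # union by rank
            if rank[s] == rank[t]:
                rank[t] += 1
    return parent


chain = build_hierarchy([(0, 2), (1, 2), (2, 3)], [3, 3, 2, 1])
merged = build_hierarchy([(0, 1), (0, 2)], [3, 2, 2])
```

In the second call, subnucleus 0 is adjacent to two subnuclei of equal value, so they end up in the same disjoint-set tree and represent one nucleus.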
Avoiding traversal does not change the time complexity of the overall algorithm, since the peeling part was already taking more time. Auxiliary data structures in FastNucleusDecomposition require additional space, though. hrc needs , in which subnuclei are not necessarily maximal, and comp needs . The ADJ structure corresponds to the connections from s with higher values to the ones with lower values, which we denote as . The upper bound for is , when each is assumed to be a and their values are adversarial (see the end of Section Fast Hierarchy Construction for Dense Subgraphs for details). However, this is quite unlikely, as we show in Section Fast Hierarchy Construction for Dense Subgraphs. binned_ADJ in BuildHierarchy is just an ordered version of ADJ and needs the same amount of space; . Lastly, merge in BuildHierarchy might require another at most, but this is quite unlikely. Overall, the additional space requirement of FastNucleusDecomposition at any instant is , and an additional might be needed. Section Fast Hierarchy Construction for Dense Subgraphs gives more details on the structure of real-world networks and their impact on the memory cost.
We evaluated our algorithms on different types of real-world networks, obtained from SNAP [?], Network Repository [?] and UF Sparse Matrix Collection [?]. Our dataset includes an internet topology network (skitter), Facebook friendship networks of some universities (Berkeley13, MIT, Stanford3, Texas84) [?], a follower network of Twitter users who tweeted about the discovery of a Higgs boson-like particle (twitter-hb), web networks (Google, uk-2005) and a network of Wikipedia pages (wiki-0611). We ignore the directions for directed graphs. Important statistics of the networks are given in Table Fast Hierarchy Construction for Dense Subgraphs. / ratio gives an estimate for the - nucleus decomposition runtime, as explained at the end of Section Fast Hierarchy Construction for Dense Subgraphs, and we put them in columns 6-8 to show the challenging and diverse characteristics of the networks in our dataset. Note that most of the networks have relatively high edge density for real-world networks, which makes the computation more expensive. We also included graphs with various / ratios to diversify our dataset. The last eight columns are shown to explain the runtime and memory costs of our algorithms. All the algorithms are implemented in C++ and compiled using gcc 5.2.0 at the -O2 optimization level. All experiments are done on a Linux operating system running on a machine with an Intel Xeon Haswell E5-2698 2.30 GHz processor and 128 GB of RAM.
Nucleus decomposition has been shown to give denser subgraphs and more detailed hierarchies for s-r=1 cases, for fixed r [?]. We implemented and tested our algorithms for s-r=1 cases where : giving us (1,2) and (2,3) nucleus decompositions. (1,2)-nucleus decomposition is the same as the k-core decomposition [?], and (2,3) corresponds to k-truss community finding [?] (a stronger definition of k-truss decomposition [?]). We consider the - nucleus as a set of s. In our algorithms, we find the - nuclei for all values and determine the hierarchy tree among those nuclei. We report the total time of peeling and traversal (or post-processing), which takes the graph as input and gives all the nuclei with a hierarchy.
|speedups with respect to||time (s)|