A Tool Suite for Large-scale Complex Network Analysis
We introduce NetworKit, an open-source software package for analyzing the structure of large complex networks. Appropriate algorithmic solutions are required to handle increasingly common large graph data sets containing up to billions of connections. We describe the methodology applied to develop scalable solutions to network analysis problems, including techniques like parallelization, heuristics for computationally expensive problems, efficient data structures, and modular software architecture. Our goal for the software is to package results of our algorithm engineering efforts and put them into the hands of domain experts. NetworKit is implemented as a hybrid combining the kernels written in C++ with a Python front end, enabling integration into the Python ecosystem of tested tools for data analysis and scientific computing. The package provides a wide range of functionality (including common and novel analytics algorithms and graph generators) and does so via a convenient interface. In an experimental comparison with related software, NetworKit shows the best performance on a range of typical analysis tasks.
Keywords: complex networks, network analysis, network science, parallel graph algorithms, data analysis software
A great variety of phenomena and systems have been successfully modeled as complex networks [Costa et al., 2011, Boccaletti et al., 2006]. Accordingly, network analysis methods are quickly becoming pervasive in science, technology and society. On a closer look, the rallying cry of the emerging field of network science (”networks are everywhere”) is hardly surprising: What is being developed is a set of general methods for the statistics of relational data. Since promising large network data sets are increasingly common in the age of big data, it is an active current research project to develop scalable methods for the analysis of large networks. In order to process massive graphs, we need algorithms whose running time is essentially linear in the number of edges. Many analysis methods have been pioneered on small networks (e. g. for the study of social networks prior to the arrival of massive online social networking services), so that underlying algorithms with higher complexity were viable. As we shall see in the following, developing a scalable analysis tool suite often entails replacing them with suitable linear- or nearly-linear-time variants. Furthermore, solutions should employ parallel processing: While sequential performance is stalling, multicore machines become pervasive, and algorithms and software need to follow this development. Within the NetworKit project, scalable network analysis methods are developed, tested and packaged as ready-to-use software. In this process we frequently apply the following algorithm and software engineering patterns: parallelization; heuristics or approximation algorithms for computationally intensive problems; efficient data structures; and modular software architecture. With NetworKit, we intend to push the boundaries of what can be done interactively on a shared-memory parallel computer, also by users without in-depth programming skills. 
The tools we provide make it easy to characterize large networks and are geared towards network science research.
In this work we give an introduction to the tool suite and describe the methodology applied during development in terms of algorithm and software engineering aspects. We discuss methods to arrive at highly scalable solutions to common network analysis problems (Sections 2 and 3), describe the set of functionality (Sections 4 and 5), present example use cases (Section 6), compare with related software (Section 7), and evaluate the performance of analysis kernels experimentally (Section 8). Our experiments show that NetworKit is capable of quickly processing large-scale networks for a variety of analytics kernels, and does so faster and with a lower memory footprint than closely related software. We recommend NetworKit for the comprehensive structural analysis of massive complex networks (their size is primarily limited by the available memory). To this end, a new frontend supports exploratory data analysis with fast graphical reports on structural features of the network (Section 6.2).
2.1 Design Goals
There is a variety of software packages which provide graph algorithms in general and network analysis capabilities in particular (see Section 7 for a comparison to related packages). However, NetworKit aims to balance a specific combination of strengths. Our software is designed to stand out with respect to the following areas:
Performance. Algorithms and data structures are selected and implemented with high performance and parallelism in mind. Some implementations are among the fastest in published research. For example, community detection in a billion edge web graph can be performed on a 16-core server with hyperthreading in less than three minutes [Staudt and Meyerhenke, 2015].
Usability and Integration. Networks are as diverse as the series of questions we might ask of them – e. g., what is the largest connected component, what are the most central nodes in it and how do they connect to each other? A practical tool for network analysis should therefore provide modular functions which do not restrict the user to predefined workflows. An interactive shell, which the Python language provides, is one prerequisite for that. While NetworKit works with the standard Python 3 interpreter, calling the module from the IPython shell and Jupyter Notebook HTML interface [Perez et al., 2013] allows us to integrate it into a fully fledged computing environment for scientific workflows, from data preparation to creating figures. It is also easy to set up and control a remote compute server. As a Python module, NetworKit enables seamless integration with Python libraries for scientific computing and data analysis, e. g. pandas for data frame processing and analytics, matplotlib for plotting or numpy and scipy for numerical and scientific computing. For certain tasks, we provide interfaces to specialized external tools, e. g. Gephi [Bastian et al., 2009] for graph visualization.
In order to achieve the design goals described above, we implement NetworKit as a two-layer hybrid of performance-aware code written in C++ with an interface and additional functionality written in Python. NetworKit is distributed as a Python package, ready to be used interactively from a Python shell, which is the main usage scenario we envision for domain scientists. The code can be used as a library for application programming as well, either at the Python or C++ level. Throughout the project we use object-oriented and functional concepts. Shared-memory parallelism is realized with OpenMP, providing loop parallelization and synchronization constructs while abstracting away the details of thread creation and handling. The roughly 45 000 lines of C++ code include core implementations and unit tests. As illustrated in Figure 1, connecting these native implementations to the Python world is enabled by the Cython toolchain [Behnel et al., 2011]. Currently we use Cython to integrate native code by compiling it into a Python extension module. The Python layer comprises about 4 000 lines of code. The resulting Python module networkit is organized into several submodules for different areas of functionality, such as community detection or node centrality. A submodule may bundle and expose C++ classes or exist entirely on the Python layer.
2.3 Framework Foundations
As the central data structure, the Graph class implements a directed or undirected, optionally weighted graph using an adjacency array data structure with $O(n + m)$ memory requirement for a graph with $n$ nodes and $m$ edges. Nodes are represented by 64-bit integer indices from a consecutive range, and an edge is identified by a pair of nodes. Optionally, edges can be indexed as well. This approach enables a lean graph data structure, while also allowing arbitrary node and edge attributes to be stored in any container addressable by indices. While some algorithms may benefit from different data layouts, this lean, general-purpose representation has proven suitable for writing performant implementations. In particular, it supports dynamic modifications to the graph in a flexible manner, unlike the compressed sparse row format common in high-performance scientific computing. Our graph API aims for an intuitive and concise formulation of graph algorithms on both the C++ and Python layer (see Fig. 3 for an example). In general, a combinatorial view on graphs – representing edges as tuples of nodes – is used. However, part of NetworKit is an algebraic interface that enables the implementation of graph algorithms in terms of various matrices describing the graph, while transparently using the same graph data structure.
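The storage scheme described above can be illustrated with a small Python sketch. It is not NetworKit's Graph class: the name SimpleGraph and all of its methods are hypothetical, chosen only to show how consecutive integer node ids, per-node neighbor arrays and externally stored attributes fit together.

```python
class SimpleGraph:
    """Illustrative undirected graph on node ids 0..n-1 (hypothetical API)."""

    def __init__(self, n, weighted=False):
        self.n = n                      # number of nodes
        self.m = 0                      # number of edges
        self.adj = [[] for _ in range(n)]          # adj[u]: neighbors of u
        # Optional edge weights, stored parallel to the neighbor arrays.
        self.weights = [[] for _ in range(n)] if weighted else None

    def add_edge(self, u, v, w=1.0):
        """Insert the undirected edge {u, v}; the structure stays modifiable."""
        self.adj[u].append(v)
        if self.weights is not None:
            self.weights[u].append(w)
        if u != v:
            self.adj[v].append(u)
            if self.weights is not None:
                self.weights[v].append(w)
        self.m += 1

    def degree(self, u):
        return len(self.adj[u])

    def for_edges(self, handle):
        """Call handle(u, v) exactly once per undirected edge."""
        for u in range(self.n):
            for v in self.adj[u]:
                if v <= u:
                    handle(u, v)


# Node attributes live outside the graph, in any container indexed by node id.
g = SimpleGraph(4)
for u, v in [(0, 1), (1, 2), (2, 3), (0, 2)]:
    g.add_edge(u, v)
degree_attribute = [g.degree(u) for u in range(g.n)]
```

Total memory is one list per node plus two entries per edge, which matches the linear space bound of an adjacency array.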
3 Algorithm and Implementation Patterns
As explained in Section 1, our main focus is on scalable algorithms that support network analysis on massive networks. We identify several algorithm and implementation patterns that help to achieve this goal and present them below by means of case studies. For experimental results we express processing speed in ”edges per second”, an intuitive way to aggregate real running time over a set of graphs and normalize by graph size.
Our first case study concerns the core decomposition of a graph, which allows a fine-grained subdivision of the node set according to connectedness. More formally, the $k$-core is the maximal induced subgraph whose nodes have at least degree $k$. The decomposition also categorizes nodes according to the highest-order core in which they are contained, assigning a core number to each node (the largest $k$ for which the node belongs to the $k$-core). The sequential kernel implemented in NetworKit runs in $O(m)$ time, matching other implementations [Batagelj and Zaveršnik, 2011]. The main algorithmic idea we reuse for computing the core numbers is to start with $k = 0$ and increase $k$ iteratively. Within each iteration phase, all nodes with degree at most $k$ are successively removed (thus, nodes whose degree was larger at the beginning of the phase can also be affected by the removal of a neighbor). Our implementation uses a bucket priority queue. From this data structure we can extract the nodes with a certain minimum residual degree in amortized constant time. The same bound holds for updates of the neighbor degrees, resulting in $O(m)$ in total.
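The bucket-based peeling idea can be sketched in a few lines of Python. This is an illustration in the spirit of the sequential algorithm, not NetworKit's C++ kernel; the function name core_numbers is hypothetical, the input is assumed to be a simple undirected graph given as adjacency lists, and stale bucket entries are skipped lazily instead of being moved in place.

```python
def core_numbers(adj):
    """Core number of every node of a simple undirected graph (adjacency lists)."""
    n = len(adj)
    deg = [len(neighbors) for neighbors in adj]    # residual degrees
    max_deg = max(deg, default=0)
    buckets = [[] for _ in range(max_deg + 1)]     # buckets[d]: nodes of residual degree d
    for u, d in enumerate(deg):
        buckets[d].append(u)

    core = [0] * n
    removed = [False] * n
    d = 0
    while d <= max_deg:
        if not buckets[d]:
            d += 1                                 # phase finished, raise the level
            continue
        u = buckets[d].pop()
        if removed[u] or deg[u] != d:
            continue                               # stale entry left by a degree update
        core[u] = d                                # u is peeled at level d
        removed[u] = True
        for v in adj[u]:
            # Residual degrees never drop below the current level.
            if not removed[v] and deg[v] > d:
                deg[v] -= 1
                buckets[deg[v]].append(v)
    return core
```

Every node enters a bucket once initially and at most once per incident edge afterwards, so the total work stays linear in the size of the graph.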
While the above implementation already scales to large inputs, it can still make a significant difference if a user needs to wait minutes or seconds for an answer. Thus, we also provide a parallel implementation. The sequential algorithm cannot be made parallel easily due to its sequential access to the bucket priority queue. For achieving a higher degree of parallelism, we follow [Dasari et al., 2014]. Their ParK algorithm replaces the extract-min operation in the above algorithm by identifying the set of nodes with minimum residual degree while iterating in parallel over all (active) nodes. This set is then processed in the same way as the node retrieved by extract-min in the above algorithm, only in parallel again. ParK thus performs more sequential work, but with thread-local buffers it relies on a minimal amount of synchronization. Moreover, its data access pattern is more cache-friendly, which additionally contributes to better performance.
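The scan-and-process structure described above can be shown with a sequential Python simulation. It only mirrors the control flow of a level-synchronous decomposition; the function name is hypothetical, and the two loops marked in the comments are the places where a parallel implementation would distribute work over threads with thread-local buffers.

```python
def core_numbers_levelwise(adj):
    """Level-synchronous core decomposition (sequential simulation)."""
    n = len(adj)
    deg = [len(neighbors) for neighbors in adj]    # residual degrees
    core = [0] * n
    remaining = n
    level = 0
    while remaining > 0:
        # Scan phase (parallel over all nodes in a real implementation):
        # collect the nodes whose residual degree equals the current level.
        current = [u for u in range(n) if deg[u] == level]
        # Process phase (parallel over `current`, with per-thread buffers
        # that are merged into `following` at the end of each round).
        while current:
            following = []
            for u in current:
                core[u] = level
                remaining -= 1
                for v in adj[u]:
                    if deg[v] > level:
                        deg[v] -= 1
                        if deg[v] == level:
                            following.append(v)    # v joins the current level
            current = following
        level += 1
    return core
```

Processed nodes keep a residual degree no larger than their level, so later scans never pick them up again. A real parallel version would additionally need atomic decrements of the residual degrees.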
Fig. 2 is the result of running time measurements on a test set of networks (see Sec. 8 for the setup). We see that on average, processing speed is increased by almost an order of magnitude through parallelization. Some overhead of the parallel algorithm implies that speedup is only noticeable on large graphs, hence the large variance. For example, processing time for the 260 million edge uk-2002 web graph is reduced from 22 to 2 seconds.
3.2 Heuristics and Approximation Algorithms
In this example we illustrate how inexact methods deliver appropriate solutions for an otherwise computationally impractical problem. Betweenness centrality is a well-known node centrality measure that has an intuitive interpretation in transport networks: Assuming that the transport processes taking place in the network are efficient, they follow shortest paths through the network, and therefore preferably pass through nodes with high betweenness. Consequently, removing such nodes would interfere strongly with the function of the network. It is clear that network analysts would like to be able to identify such nodes in networks of any size. NetworKit comes with an implementation of the currently fastest known algorithm for betweenness [Brandes, 2001], which has running time $O(nm)$ in unweighted graphs.
With a closer look at the algorithm, opportunities for parallelization are apparent: Several single-source shortest path searches can be run in parallel to compute the intermediate dependency values whose sum yields a node’s betweenness. Figure 3 shows C++ code for the parallel version, which is simplified to focus on the core algorithm, but the actual implementation is similarly concise. To avoid race conditions, each thread works on its own dependency array, which need to be aggregated into one betweenness array in the end (lines 35-39).
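For readers without access to the figure, the following is a sequential Python sketch of Brandes' algorithm for unweighted, undirected graphs. It is not the C++ code of Figure 3; the function names are hypothetical. The outer loop over sources is the part that can be distributed across threads, each accumulating into its own array.

```python
from collections import deque


def accumulate_dependencies(adj, s, bc):
    """Run one BFS from s and add the dependencies of s on all other nodes to bc."""
    n = len(adj)
    dist = [-1] * n
    sigma = [0] * n                 # number of shortest paths from s
    dist[s], sigma[s] = 0, 1
    order = []                      # nodes in order of non-decreasing distance
    queue = deque([s])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            if dist[v] < 0:
                dist[v] = dist[u] + 1
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    delta = [0.0] * n
    for w in reversed(order):       # farthest nodes first
        for v in adj[w]:
            if dist[v] == dist[w] - 1:          # v precedes w on a shortest path
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
        if w != s:
            bc[w] += delta[w]


def betweenness(adj):
    """Exact betweenness; each unordered node pair is counted once."""
    bc = [0.0] * len(adj)
    for s in range(len(adj)):       # independent searches: candidates for parallelism
        accumulate_dependencies(adj, s, bc)
    return [b / 2.0 for b in bc]
```

On a path with three nodes, the middle node lies on the single shortest path between the two end nodes and therefore receives a score of 1.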
We now evaluate the performance of the implementations experimentally (see Section 8 for settings). Figure 4 shows aggregated running speed over a set of smaller networks (from Table 3). In practice, this means that the sequential version of Brandes’ algorithm (BetweennessSeq) takes almost 8 hours to process the 600k edge graph caidaRouterLevel (representing internet router-level topology [CAIDA, 2003]). Parallelism with 32 (hyper)threads (Betweenness) reduces the running time to ca. 90 minutes. Still, parallelization does not change the algorithm’s inherent complexity. This means that running times rise so steeply with the size of the input graph that computing an exact solution to betweenness is not viable on the large networks we want to target. In typical use cases, obtaining exact values for betweenness is not necessary, though. An approximate result is likely good enough to appreciate the structure of the network for exploratory analysis, and to identify a set of top betweenness nodes. Therefore, we use a heuristic approach based on computing a relatively small number of randomly chosen shortest-path trees [Geisberger et al., 2008]. In contrast to the exact algorithm, running the approximative algorithm with 42 samples takes 6 seconds sequentially. Applying this algorithm cuts running time by orders of magnitude, but still yields a ranking of nodes that is highly similar to a ranking by exact betweenness values. We observe that the distribution of relative rank errors (exact rank divided by approximated rank) has little variance around 1.0. Nodes on average maintain the rank they would have according to exact betweenness even with such a small number of samples. Concretely we see, for instance, that the top ten nodes in the exact betweenness ranking are and in the approximate ranking. 
Experiments of this type (see [Geisberger et al., 2008]) confirm that in typical cases betweenness can be closely approximated with a relatively small number of shortest-path searches. Therefore we can replace an $O(nm)$ algorithm with one of $O(m)$ time complexity (for a constant number of samples) in many use cases. The inexact algorithm offers the same opportunities for parallelization, yielding additional speedups: In the example above, parallel running time is down to 1.5 seconds on 32 (hyper)threads.
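The sampling idea can be sketched as follows. Note the substitution: this sketch uses plain uniform pivot sampling with extrapolation by the factor n divided by the number of samples, which is simpler than the scaling schemes studied in [Geisberger et al., 2008]. All function names are hypothetical and the graph is assumed to be unweighted and undirected.

```python
import random
from collections import deque


def single_source_dependencies(adj, s):
    """Dependencies of source s on every node (one BFS plus back-propagation)."""
    n = len(adj)
    dist, sigma = [-1] * n, [0] * n
    dist[s], sigma[s] = 0, 1
    order, queue = [], deque([s])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            if dist[v] < 0:
                dist[v] = dist[u] + 1
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    delta = [0.0] * n
    for w in reversed(order):
        for v in adj[w]:
            if dist[v] == dist[w] - 1:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
    return delta


def approx_betweenness(adj, samples, seed=None):
    """Estimate betweenness from `samples` distinct, uniformly chosen sources."""
    n = len(adj)
    rng = random.Random(seed)
    pivots = rng.sample(range(n), max(0, min(samples, n)))
    bc = [0.0] * n
    if not pivots:
        return bc
    for s in pivots:                 # one shortest-path search per sample
        delta = single_source_dependencies(adj, s)
        for v in range(n):
            if v != s:
                bc[v] += delta[v]
    scale = n / len(pivots) / 2.0    # extrapolate to all sources, halve for undirected
    return [b * scale for b in bc]
```

With the number of samples equal to the number of nodes, the estimate coincides with the exact values. With fewer samples, only the ranking of nodes is expected to be roughly preserved.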
If a true approximation with a guaranteed error bound is desired, NetworKit users can apply another inexact algorithm [Riondato and Kornaropoulos, 2015] which accepts an error bound parameter $\epsilon$. It sacrifices some computational efficiency but allows a proof that the resulting betweenness scores differ from the exact scores by at most $\epsilon$ (with a user-supplied probability).
3.3 Efficient Data Structures
The case study on data structures deals with a generative network model. Such models are important as they simplify complex network research in several respects (see Section 5). Random hyperbolic graphs (RHGs) [Krioukov et al., 2010] are very promising in this context, since theoretical analyses have shown that RHGs have many features also found in real complex networks [Bode et al., 2014, Gugelmann et al., 2012, Kiwi and Mitsche, 2015]. The model is based on hyperbolic geometry, into which complex networks can be embedded naturally. During the generation process vertices are distributed randomly on a hyperbolic disk of radius $R$ and edges are inserted for every vertex pair whose distance is below $R$. The straightforward RHG generation process would probe the distance of all pairs, yielding a quadratic time complexity. This impedes the creation of massive networks. NetworKit provides the first generation algorithm for RHGs with subquadratic running time ($O((n^{3/2} + m) \log n)$ with high probability) [von Looz et al., 2015]. The acceleration stems primarily from the reduction of distance computations through a polar quadtree adapted to hyperbolic space. Instead of probing each pair of nodes, the generator performs for each node one efficient range query supported by the quadtree. In practice this leads to an acceleration of at least two orders of magnitude. With the quadtree-based approach networks with billions of edges can be generated in parallel in a few minutes [von Looz et al., 2015]. By exploiting previous results on efficient Erdős-Rényi graph generation [Batagelj and Brandes, 2005], the quadtree can be extended to use more general neighborhoods [von Looz and Meyerhenke, 2015].
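To make the quadratic baseline concrete, the following Python sketch implements the straightforward generation process that the quadtree-based generator is designed to avoid. It is an illustration under stated assumptions: points are given in native polar coordinates, the radial density is taken to be proportional to sinh(alpha * r) as is common for this model, and the function names and the parameter alpha are not taken from the text above.

```python
import math
import random


def hyperbolic_distance(r1, phi1, r2, phi2):
    """Distance of two points given in polar coordinates on the hyperbolic plane."""
    x = (math.cosh(r1) * math.cosh(r2)
         - math.sinh(r1) * math.sinh(r2) * math.cos(phi1 - phi2))
    return math.acosh(max(1.0, x))        # guard against rounding below 1


def naive_rhg(n, R, alpha=1.0, seed=None):
    """Quadratic-time baseline: probe the distance of every pair of points."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        phi = rng.uniform(0.0, 2.0 * math.pi)
        # Inverse transform sampling for a radial density ~ sinh(alpha * r) on [0, R].
        r = math.acosh(1.0 + (math.cosh(alpha * R) - 1.0) * rng.random()) / alpha
        points.append((r, phi))
    edges = []
    for u in range(n):
        for v in range(u + 1, n):         # all pairs: this is the bottleneck
            if hyperbolic_distance(*points[u], *points[v]) < R:
                edges.append((u, v))
    return points, edges
```

The inner double loop performs n(n-1)/2 distance computations. Replacing it with one range query per node is what brings the running time below quadratic.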
3.4 Modular Design
In terms of software design, we aim at a modular architecture with proper encapsulation of algorithms into software components (classes and modules). This requires extensive software engineering work but has clear benefits. Among them are extensibility and code reuse: For example, new centrality measures can be easily added by implementing a subclass with the code specific to the centrality computation, while code applicable to all centrality measures and a common interface remains in the base class. Through these and other modularizations, developers can add a new centrality measure and get derived measures almost ”for free”. These include for instance the centralization index [Freeman, 1979] and the assortativity coefficient [Freeman, 1979], which can be defined with respect to any node centrality measure and may in each case be a key feature of the network.
Modular design also allows for optimizations on one algorithm to benefit other client algorithms. For instance, betweenness and other centrality measures (such as closeness) require the computation of shortest paths, which is done via breadth-first search in unweighted graphs and Dijkstra’s algorithm in weighted graphs, decoupled to avoid code redundancy (see lines 10-14 in Fig. 3).
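The division of labor between a base class and its subclasses might look as follows in Python. The class and method names are hypothetical and do not reproduce NetworKit's class hierarchy; the derived measure shown is only the unnormalized numerator of a centralization index (the sum of differences to the maximum score), and closeness is computed under the assumption of a connected graph.

```python
from collections import deque


def bfs_distances(adj, s):
    """Shared shortest-path routine for unweighted graphs, reused by subclasses."""
    dist = [-1] * len(adj)
    dist[s] = 0
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if dist[v] < 0:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


class Centrality:
    """Common interface; subclasses only implement compute()."""

    def __init__(self, adj):
        self.adj = adj
        self._scores = None

    def compute(self):
        raise NotImplementedError

    def run(self):
        self._scores = self.compute()
        return self

    def scores(self):
        return list(self._scores)

    def ranking(self):
        """Node ids sorted by decreasing score (ties broken by id)."""
        return sorted(range(len(self._scores)), key=lambda u: (-self._scores[u], u))

    def centralization_numerator(self):
        """Derived measure available to every subclass without extra code."""
        top = max(self._scores)
        return sum(top - s for s in self._scores)


class DegreeCentrality(Centrality):
    def compute(self):
        return [float(len(neighbors)) for neighbors in self.adj]


class ClosenessCentrality(Centrality):
    def compute(self):
        n = len(self.adj)
        return [(n - 1) / sum(bfs_distances(self.adj, s)) for s in range(n)]
```

Adding a further measure then amounts to writing one more subclass with a compute() method, while ranking and derived indices are inherited.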
The following describes the core set of network analysis algorithms implemented in NetworKit. In addition, NetworKit also includes a collection of basic graph algorithms, such as breadth-first and depth-first search or Dijkstra’s algorithm for shortest paths. Table 1 summarizes the core set of algorithms for typical problems.
|approx. betweenness||[Geisberger et al., 2008], [Riondato and Kornaropoulos, 2015]|
|closeness||shortest-path search from each node|
|approx. closeness||[Eppstein and Wang, 2004]|
|PageRank||power iteration||typical (Sec. 4.2)|
|eigenvector centrality||power iteration||typical|
|Katz centrality||[Katz, 1953]||typical|
|$k$-path centrality||[Alahakoon et al., 2011]||see [Alahakoon et al., 2011]|
|local clustering coefficient||parallel iterator|
|$k$-core decomposition||[Dasari et al., 2014]|
|community detection||PLM, PLP [Staudt and Meyerhenke, 2015]||,|
|global||diameter||iFUB [Crescenzi et al., 2013]||typical|
4.1 Global Network Properties
Global properties include simple statistics such as the number of nodes and edges and the graph’s density, as well as properties related to distances: The diameter of a graph is the maximum length of a shortest path between any two nodes. We use the iFUB algorithm [Crescenzi et al., 2013] both for the exact computation and for the estimation of a lower and upper bound on the diameter. iFUB has a worst-case complexity of $O(nm)$ but has shown excellent typical-case performance on complex networks, where it often converges on the exact value in linear time.
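The notion of lower and upper bounds on the diameter can be illustrated with a much simpler technique than iFUB, namely a double sweep of breadth-first searches. The sketch below is not the iFUB algorithm; it assumes a connected, unweighted, undirected graph, and the function names are hypothetical.

```python
from collections import deque


def bfs_eccentricity(adj, s):
    """Return (eccentricity of s, a node at maximum distance from s)."""
    dist = [-1] * len(adj)
    dist[s] = 0
    queue = deque([s])
    farthest = s
    while queue:
        u = queue.popleft()
        farthest = u          # BFS dequeues nodes in non-decreasing distance
        for v in adj[u]:
            if dist[v] < 0:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist[farthest], farthest


def diameter_bounds(adj, start=0):
    """Double sweep: two BFS runs yield a lower and an upper bound."""
    ecc_start, a = bfs_eccentricity(adj, start)
    ecc_a, _ = bfs_eccentricity(adj, a)
    lower = max(ecc_start, ecc_a)        # some shortest path has this length
    upper = 2 * min(ecc_start, ecc_a)    # any two nodes connect via one center
    return lower, upper
```

When both bounds coincide, as they do on a path graph, the diameter is known exactly after two linear-time searches. Algorithms such as iFUB refine the bounds with further searches until they meet.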
4.2 Node Centrality
Node centrality measures quantify the structural importance of a node within a network. More precisely, we consider a node centrality measure as any function which assigns to each node an attribute value of (at least) ordinal scale of measurement. The assigned value depends on the position of a node within the network as defined by a set of edges.
The simplest measure that falls under this definition is the degree, i. e. the number of connections of a node. The distribution of degrees plays an important role in characterizing a network. Eigenvector centrality and its variant PageRank [Page et al., 1999] assign relative importance to nodes according to their connections, incorporating the idea that edges to high-scoring nodes contribute more. Both variants are implemented in NetworKit based on parallel power iteration, whose convergence time depends on a numerical error tolerance parameter and spectral properties of the network, but is among the fast linear-time algorithms for typical inputs. For betweenness centrality we provide the solutions discussed in Sec. 3.2. Similar techniques are applied for computing closeness centrality exactly and approximately [Eppstein and Wang, 2004]. Our current research extends the former approach to dynamic graph processing [Bergamini and Meyerhenke, 2015, Bergamini et al., 2015]. The local clustering coefficient expresses how many of the possible connections between neighbors of a node exist, which can be treated as a node centrality measure according to the definition above [Newman, 2010]. In addition to a parallel algorithm for clustering coefficients, NetworKit also implements a sampling approximation algorithm [Schank and Wagner, 2005], whose constant time complexity is independent of graph size. Given NetworKit’s modular architecture, further centrality measures can be easily added.
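As an illustration of power iteration with an error tolerance, here is a sequential Python sketch of PageRank on adjacency lists. The function name and the default values for damping, tolerance and iteration limit are assumptions made for this example, not NetworKit's settings.

```python
def pagerank(adj, damping=0.85, tol=1e-10, max_iter=1000):
    """PageRank by power iteration; stops when the L1 change drops below tol."""
    n = len(adj)
    if n == 0:
        return []
    pr = [1.0 / n] * n
    for _ in range(max_iter):
        # Score of nodes without outgoing edges is spread uniformly.
        dangling = sum(pr[u] for u in range(n) if not adj[u])
        new = [(1.0 - damping) / n + damping * dangling / n] * n
        for u in range(n):               # each iteration touches every edge once
            if adj[u]:
                share = damping * pr[u] / len(adj[u])
                for v in adj[u]:
                    new[v] += share
        change = sum(abs(a - b) for a, b in zip(new, pr))
        pr = new
        if change < tol:
            break
    return pr
```

Each iteration costs time linear in the number of edges, so the total running time is governed by how many iterations the tolerance requires.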
4.3 Edge Centrality, Sparsification and Link Prediction
The concept of centrality can be extended to edges: Not all edges are equally important for the structure of the network, and scores can be assigned to edges depending on the graph structure such that they can be ranked (e. g. edge betweenness, which depends on the number of shortest paths passing through an edge).
While such a ranking is illuminating in itself, it can also be used to filter edges and thereby reduce the size of data. NetworKit includes a wide set of edge ranking methods, with a focus on sparsification techniques meant to preserve certain properties of the network. For instance, we show that a method that ranks edges leading to high-degree nodes (hubs) closely preserves many properties of social networks, including diameter, degree distribution and centrality measures. Other methods, including a family of Simmelian backbones, assign higher importance to edges within dense regions of the graph and hence preserve or emphasize communities. Details are reported in our recent experimental study [Lindner et al., 2015]. While currently experimental and focused on one application, namely structure-preserving sparsification, the design is extensible so that general edge centrality indices can be easily implemented.
A somewhat related problem, conceptually and through common methods, is the task of link prediction. Link prediction algorithms examine the edge structure of a graph to derive similarity scores for unconnected pairs of nodes. Depending on the score, the existence of a future or missing edge is inferred. NetworKit includes implementations for a wide variety of methods from the literature [Esders, 2015].
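A simple representative of such similarity scores is the Jaccard index of the two neighborhoods. The following Python sketch scores every unconnected pair of nodes; it is a generic textbook method chosen for illustration, the function name is hypothetical, and the all-pairs loop is only practical for small graphs.

```python
def jaccard_link_scores(adj):
    """Jaccard similarity of neighborhoods for every unconnected node pair."""
    n = len(adj)
    neighbors = [set(nb) for nb in adj]
    scores = {}
    for u in range(n):
        for v in range(u + 1, n):
            if v in neighbors[u]:
                continue                       # the edge already exists
            union = neighbors[u] | neighbors[v]
            common = neighbors[u] & neighbors[v]
            scores[(u, v)] = len(common) / len(union) if union else 0.0
    return scores


def predict_links(adj, k):
    """Return the k unconnected pairs with the highest scores."""
    scores = jaccard_link_scores(adj)
    return sorted(scores, key=lambda pair: (-scores[pair], pair))[:k]
```

Pairs with high scores share a large fraction of their neighbors and are therefore considered likely candidates for a missing or future edge.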
4.4 Partitioning the Network
Another class of analysis methods partitions the set of nodes into subsets depending on the graph structure. For instance, all nodes in a connected component are reachable from each other. A network’s connected components can be computed in linear time using breadth-first search. Community detection is the task of identifying groups of nodes in the network which are significantly more densely connected among each other than to the rest of nodes. It is a data mining problem where various definitions of the structure to be discovered – the community – exist. This fuzzy task can be turned into a well-defined though NP-hard optimization problem by using community quality measures, first and foremost modularity [Girvan and Newman, 2002]. We approach community detection from the perspective of modularity maximization and engineer parallel heuristics which deliver a good tradeoff between solution quality and running time [Staudt and Meyerhenke, 2015]. The PLP algorithm implements community detection by label propagation [Raghavan et al., 2007], which extracts communities from a labelling of the node set. The Louvain method for community detection [Blondel et al., 2008] can be classified as a locally greedy, bottom-up multilevel algorithm; PLM is our parallel variant of it. We recommend the PLM algorithm with optional refinement step as the default choice for modularity-driven community detection in large networks. For very large networks in the range of billions of edges, PLP delivers a better time to solution, albeit with a qualitatively different solution and worse modularity.
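The quality measure that these heuristics optimize can be computed directly. The sketch below evaluates the modularity of a given partition of an unweighted, undirected graph without self-loops, using the standard formulation as the sum over communities of the fraction of intra-community edges minus the squared fraction of the community's total degree. The function name and the input encoding (one community label per node) are assumptions for this example.

```python
def modularity(adj, community):
    """Modularity of a partition; community[u] is the label of node u."""
    m = sum(len(neighbors) for neighbors in adj) / 2.0    # number of edges
    if m == 0:
        return 0.0
    intra = {}     # label -> number of edges inside the community
    volume = {}    # label -> sum of degrees of the community's nodes
    for u, neighbors in enumerate(adj):
        c = community[u]
        volume[c] = volume.get(c, 0) + len(neighbors)
        for v in neighbors:
            if community[v] == c and u < v:               # count each edge once
                intra[c] = intra.get(c, 0) + 1
    return sum(intra.get(c, 0) / m - (volume[c] / (2.0 * m)) ** 2
               for c in volume)
```

For two triangles joined by a single edge, splitting the graph into the two triangles yields a modularity of 5/14, whereas putting all nodes into one community yields 0.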
5 Network Generators
|model [and algorithm]||description|
|Erdős-Rényi [P. Erdős, 1960] [[Batagelj and Brandes, 2005]]||random edges with uniform probability|
|planted partition / stochastic blockmodel||dense areas with sparse connections|
|Barabasi-Albert [Albert and Barabási, 2002]||preferential attachment process resulting in power-law degree distribution|
|Recursive Matrix (R-MAT) [Chakrabarti et al., 2004]||power-law degree distribution, small-world property, self-similarity|
|Chung-Lu [Aiello et al., 2000]||replicate a given degree distribution|
|Havel-Hakimi [Hakimi, 1962]||replicate a given degree distribution|
|hyperbolic unit-disk model [Krioukov et al., 2010] [[von Looz et al., 2015]]||large networks, power-law degree distribution and high clustering|
|LFR [Lancichinetti and Fortunato, 2009]||complex networks containing communities|
Generative network models aim to explain how networks form and evolve specific structural features. Such models and their implementations as generators have at least two important uses: On the one hand, algorithm or software engineers want generators for synthetic datasets which can be arbitrarily scaled and parametrized and produce graphs which resemble the real application data. On the other hand, network scientists employ models to increase their understanding of network phenomena. NetworKit provides a versatile collection of graph generators for this purpose, summarized in Table 2.
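As an example of an efficient generator, the following Python sketch produces Erdős-Rényi graphs with the geometric skipping technique of [Batagelj and Brandes, 2005]: instead of tossing a coin for every node pair, it draws the number of pairs to skip until the next edge. The function name and signature are hypothetical; the output is a list of edges (v, w) with w < v.

```python
import math
import random


def erdos_renyi(n, p, seed=None):
    """G(n, p) with geometric skipping; expected time O(n + m)."""
    rng = random.Random(seed)
    if n < 2 or p <= 0.0:
        return []
    if p >= 1.0:
        return [(v, w) for v in range(n) for w in range(v)]
    edges = []
    log_q = math.log(1.0 - p)
    v, w = 1, -1
    while v < n:
        r = rng.random()
        # Number of non-edges skipped before the next edge is geometric.
        w += 1 + int(math.log(1.0 - r) / log_q)
        while w >= v and v < n:       # carry over into the next row of pairs
            w -= v
            v += 1
        if v < n:
            edges.append((v, w))
    return edges
```

The running time depends on the number of generated edges rather than on the number of node pairs, which makes sparse graphs with many nodes cheap to generate.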
6 Example Use Cases
In the following, we present possible workflows and use cases, highlighting the capabilities of NetworKit as a data analysis tool and a library.
6.1 As a Library in an Analysis Pipeline
A master’s thesis [Flick, 2014] provides an early example of NetworKit as a component in an application-specific data mining pipeline (Fig. 6). This pipeline performs analysis of protein-protein interaction (PPI) networks and implements a preprocessing stage in Python, in which networks are compiled from heterogeneous data sets containing interaction data as well as expression data about the occurrence of proteins in different cell types. During the network analysis stage, preprocessed networks are retrieved from a database, and NetworKit is called via the Python frontend. The C++ core has been extended to enable more efficient analysis of tissue-specific PPI networks, by implementing in-place filtering of the network to the subgraphs of proteins that occur in given cell types. Finally, statistical analysis and visualization are applied to the network analysis data. The system is close to how we envision NetworKit as a high-performance algorithmic component in a real-world data analysis scenario, and we therefore place emphasis on the toolkit being easily scriptable and extensible.
6.2 Exploratory Network Analysis with Network Profiles
Making the most of NetworKit as a library requires writing some amount of custom code and some expertise in selecting algorithms and their parameters. This is one reason why we also provide an interface that makes exploratory analysis of large networks easy and fast even for non-expert users, and provides an extensive overview. The underlying module assembles many algorithms into one program, automates analysis tasks and produces a graphical report to be displayed in the Jupyter Notebook or exported to an HTML or LaTeX report document. Such a network profile gives a statistical overview of the properties of the network. It consists of the following parts: First, global properties such as size and density are reported. The report then focuses on a variety of node centrality measures, showing an overview of their distributions in the network (see Fig. 7). Detailed views for centrality measures (see Fig. 8) follow: Their distributions are plotted in histograms and characterized with standard statistics, and network-specific measures such as centralization and assortativity are shown. We propose that correlations between centralities are per se interesting empirical features of a network. For instance, betweenness may or may not be positively correlated with increasing node degree. The prevalence of low-degree, high-betweenness nodes may influence the resilience of a transport network, as only few links then need to be severed in order to significantly disrupt transport processes following shortest paths. For the purpose of studying such aspects, the report displays a matrix of Spearman’s correlation coefficients, showing how node ranks derived from the centrality measures correlate with each other (see Fig. 8(b)). Furthermore, scatter plots for each pair of centrality measures are shown, suggesting the type of correlation (see Fig. 9(a)).
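The rank correlation shown in such a matrix can be computed as the Pearson correlation of the rank vectors, with tied values receiving their average rank. The following pure-Python sketch illustrates the computation for two score vectors; the function names are hypothetical and the code is not taken from the profiling module.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                              # extend the group of ties
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation of two equally long score vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0.0 or var_y == 0.0:
        return float("nan")                     # undefined for constant scores
    return cov / math.sqrt(var_x * var_y)
```

Applying spearman to every pair of centrality score vectors yields the entries of the correlation matrix. Because only ranks enter the computation, any monotone relationship between two measures produces a coefficient of 1.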
The report continues with different ways of partitioning the network, showing histograms and pie charts for the size distributions of connected components, modularity-based communities (see Fig. 9(b)) and $k$-shells, respectively. Absent on purpose is a node-edge diagram of the graph, since graph drawing (apart from being computationally expensive) is usually not the preferred method to explore large complex networks. Rather, we consider networks first of all to be statistical data sets whose properties should be determined via graph algorithms and the results summarized via statistical graphics. The default configuration of the module is such that even networks with hundreds of millions of edges can be characterized in minutes on a parallel workstation. Furthermore, it can be configured by the user depending on the desired choice of analytics and level of detail, so that custom reports can be generated.
To pick an example from a scientific domain, the human connectome network con-fiber_big maps brain regions and their anatomical connections at a relatively high resolution, yielding a graph with ca. 46 million edges. As the resolution of brain imaging technology improves, connectome analysis is likely to yield ever more massive network data sets, considering that the brain at the neuronal scale is a complex network on the order of nodes and edges. On a first look, the network has properties similar to a social network, with a skewed degree distribution and high clustering. The pattern of correlations (Fig. 8(b)) differs from that of a typical friendship network (Fig. 8(a)), with weaker positive correlations across the spectrum of centrality measures. As one observation to focus on, we may pick the strong negative correlation between the local clustering coefficient on the one hand and the PageRank and betweenness centrality on the other. High betweenness nodes are located on many shortest paths, and high PageRank results from being connected to neighbors which are themselves highly connected. Thus, the correlations point to the presence of structural hub nodes that connect different brain regions which are not directly linked. Also, a look at a scatter plot generated (Fig. 9(a)) reveals more details on the correlations: We see that the local clustering coefficient steadily falls with node degree, a majority of nodes having high clustering and low degree, a few nodes having low clustering and high degree. Both observations are consistent with the finding of connector hub regions situated along the midline of the brain, which are highly connected and link otherwise separated brain modules organized around smaller provincial hubs [Sporns and Betzel, 2015].
Another aspect we can focus on is community structure. There has been extensive research on the modular structure of brain networks, indicating that communities in the connectivity network can be interpreted as functional modules of the brain [Sporns and Betzel, 2015]. The communities found by the PLM modularity-maximizing heuristic in the con-fiber_big graph can be interpreted accordingly. Their size distribution (Fig. 9(b), in which a green pie slice represents the size of a community) shows that a large part of the network consists of about 30 communities of roughly equal size, in addition to a large number of very small communities (grey). Of course, such interpretations of the network profile contain speculation, and a thorough analysis – linking network structure to brain function – would require the knowledge of a neuroscientist. Nonetheless, these examples illustrate how NetworKit’s capability to quickly generate an overview of structural properties can be used to generate hypotheses about the network data.
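For readers unfamiliar with the objective that modularity-maximizing heuristics such as PLM optimize, the following sketch evaluates the modularity of a given partition and the resulting community size distribution. It is a plain-Python illustration under stated assumptions (unweighted, undirected graph without self-loops; the function name `modularity` and the toy graph are hypothetical), not the implementation used in NetworKit.

```python
from collections import Counter


def modularity(edges, community):
    """Modularity Q = sum over communities c of m_c/m - (vol_c / (2m))^2.

    edges:     list of undirected edges (u, v), each listed once, no self-loops
    community: dict mapping node -> community label
    """
    m = len(edges)
    internal = Counter()  # m_c: edges with both endpoints in community c
    volume = Counter()    # vol_c: sum of degrees of the nodes in community c
    for u, v in edges:
        cu, cv = community[u], community[v]
        volume[cu] += 1
        volume[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (volume[c] / (2 * m)) ** 2 for c in volume)


# Hypothetical toy graph: two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_modules = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_module = {v: "A" for v in range(6)}

q_two = modularity(edges, two_modules)
q_one = modularity(edges, one_module)
sizes = Counter(two_modules.values())  # community size distribution

print("Q (two modules):", round(q_two, 4))
print("Q (one module): ", round(q_one, 4))
print("community sizes:", dict(sizes))
```

Splitting the toy graph into its two triangles yields Q = 5/14, whereas placing all nodes in one community yields Q = 0, so the modular partition is preferred by the objective.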
7 Comparison to Related Software
Recent years have seen a proliferation of graph processing and network analysis software, which varies widely in terms of target platform, user interface, scalability and feature set. We therefore locate NetworKit relative to these efforts. Although the boundaries are not sharp, we would like to separate network analysis toolkits from general purpose graph frameworks (e. g. Boost Graph Library and JUNG [O’Madadhain et al., 2003]), which are less focused on data analysis workflows.
As closest in terms of architecture, functionality and target use cases, we see igraph [Csardi and Nepusz, 2006] and graph-tool [Peixoto, 2015]. They are packaged as Python modules, provide a broad feature set for network analysis workflows, and have active user communities. NetworkX [Hagberg et al., 2008] is also a mature toolkit and the de-facto standard for the analysis of small to medium networks in a Python environment, but not suitable for massive networks due to its pure Python implementations. (Due to the similar interface, users of NetworkX are likely to move easily to NetworKit for larger networks.) Like NetworKit, igraph and graph-tool address the scalability issue by implementing core data structures and algorithms in C or C++. graph-tool builds on the Boost Graph Library and parallelizes some kernels using OpenMP. These similarities make those packages ideal candidates for an experimental comparison with NetworKit (see Section 8.2).
Other projects are geared towards network science but differ in important aspects from NetworKit. Gephi [Bastian et al., 2009], a GUI application for the Java platform, has a strong focus on visual network exploration. Pajek [Batagelj and Mrvar, 2004], a proprietary GUI application for the Windows operating system, also offers analysis capabilities similar to NetworKit, as well as visualization features. The variant PajekXXL uses less memory and thus focuses on large datasets.
The SNAP [Leskovec and Sosič, 2014] network analysis package has also recently adopted the hybrid approach of C++ core and Python interface. Related efforts from the algorithm engineering community are KDT [Lugowski et al., 2012] (built on an algebraic, distributed parallel backend), GraphCT [Ediger et al., 2013] (focused on massive multithreading architectures such as the Cray XMT), STINGER (a dynamic graph data structure with some analysis capabilities) [Ediger et al., 2012] and Ligra [Shun and Blelloch, 2013] (a recent shared-memory parallel library). They offer high performance through native, parallel implementations of certain kernels. However, to characterize a complex network in practice, we need a substantial set of analytics which those frameworks currently do not provide.
Among solutions for large-scale graph analytics, distributed computing frameworks (for instance GraphLab [Low et al., 2012]) are often prominently named. However, graphs arising in many data analysis scenarios are not bigger than the billions of edges that fit into a conventional main memory and can therefore be processed far more efficiently in a shared-memory parallel model [Shun and Blelloch, 2013], which we confirm experimentally in a recent study [Koch et al., 2015]. Distributed computing solutions become necessary for massive graph applications (as they appear, for example, in social media services), but we argue that shared-memory multicore machines go a long way for network science applications.
8 Performance Evaluation
|network|type|nodes|edges|source|
|---|---|---|---|---|
|fb-Caltech36|social (friendship)|769|16656|[Traud et al., 2012]|
|PGPgiantcompo|social (trust)|10680|24316|[Boguña et al., 2014]|
|coAuthorsDBLP|coauthorship (science)|299067|977676|[Bader et al., 2014]|
|fb-Texas84|social (friendship)|36371|1590655|[Traud et al., 2012]|
|Foursquare|social (friendship)|639014|3214986|[Zafarani and Liu, 2009]|
|Lastfm|social (friendship)|1193699|4519020|[Zafarani and Liu, 2009]|
|wiki-Talk|social|2394385|4659565|[Leskovec and Krevl, 2014]|
|Flickr|social (friendship)|639014|55899882|[Zafarani and Liu, 2009]|
|in-2004|web|1382908|13591473|[Boldi et al., 2004]|
|actor-collaboration|collaboration (film)|382219|15038083|[Kunegis, 2013]|
|eu-2005|web|862664|16138468|[Boldi et al., 2004]|
|flickr-growth-u|social (friendship)|2302925|33140017|[Kunegis, 2013]|
| |social (followership)|15395404|85331845|[Zafarani and Liu, 2009]|
|uk-2002|web|18520486|261787258|[Boldi et al., 2004]|
|uk-2007-05|web|105896555|3301876564|[Boldi et al., 2004]|
This section presents an experimental evaluation of the performance of NetworKit’s algorithms. Our platform is a shared-memory server with 256 GB RAM and 2x8 Intel(R) Xeon(R) E5-2680 cores (32 threads due to hyperthreading) at 2.7 GHz.
Fig. 11 shows results of a benchmark of the most important analytics kernels in NetworKit. The algorithms were applied to a diverse set of 15 real-world networks in the size range from 16k to 260M edges, including web graphs, social networks, connectome data and internet topology networks (see Table 3 for a description). Kernels with quadratic running time (like Betweenness) were restricted to the subset of the 4 smallest networks. The box plots illustrate the range of processing rates achieved (dots are outliers). The benchmark illustrates that a set of efficient linear-time kernels, including ConnectedComponents, the community detectors, PageRank, CoreDecomposition and ClusteringCoefficient, scales well to networks in the order of edges. The iFub [Crescenzi et al., 2013] algorithm demonstrates its surprisingly good performance on complex networks, moving diameter calculation effectively into the class of linear-time kernels. Fig. 12 breaks its processing rate down to the particular instances, in decreasing order of size, illustrating that performance is often strongly dependent on the specific structure of complex networks. Algorithms like BFS and ConnectedComponents actually scan every edge at a rate of to edges per second. Betweenness calculation remains very time-consuming in spite of parallelization, but approximate results can be obtained two orders of magnitude faster.
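To see why exact betweenness falls into the class of quadratic-time kernels, consider the following sequential sketch of Brandes' algorithm for unweighted, undirected graphs. It is a textbook-style illustration in plain Python, not NetworKit's parallel C++ kernel; the function name `betweenness` and the adjacency-dictionary input format are assumptions. The algorithm runs one BFS per source node, each of which scans every edge, giving O(nm) time overall.

```python
from collections import deque


def betweenness(adj):
    """Exact betweenness via Brandes' algorithm (unweighted, undirected).

    adj: dict mapping node -> iterable of neighbors (symmetric).
    Returns unnormalized scores; each unordered pair is counted once.
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:  # one BFS per source: n traversals of m edges each
        order = []                      # nodes in order of discovery
        preds = {v: [] for v in adj}    # predecessors on shortest paths
        sigma = {v: 0 for v in adj}     # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:              # w discovered for the first time
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:   # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was visited from both endpoints.
    return {v: score / 2.0 for v, score in bc.items()}


# Hypothetical toy input: a star with center 0 and four leaves.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
scores = betweenness(star)
print(scores)
```

In the star, all six leaf pairs are connected through the center, so the center scores 6 and the leaves score 0. Sampling-based approximations reduce the cost by running the per-source step for only a subset of nodes or paths.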
8.2 Comparative Benchmark.
NetworKit, igraph and graph-tool rely on the same hybrid architecture of C/C++ implementations with a Python interface. igraph uses non-parallel C code, while graph-tool also features parallelism. We benchmarked typical analysis kernels of the three packages on the aforementioned parallel platform and present the measured performance in Fig. 13. Where applicable, algorithm parameters were selected to ensure a fair comparison. In this respect it should be mentioned that graph-tool’s implementation of Brandes’ betweenness algorithm does more work, as it also calculates edge betweenness scores during the run. (In any case, implementation-level performance differences quickly become irrelevant for a non-linear-time algorithm as the input size grows.) graph-tool also takes a different approach to community detection, hence that comparison is between igraph and NetworKit only. We summarize the benchmark results as follows:
- In our benchmark, NetworKit was the only framework that could consistently run the set of kernels (excluding the quadratic-time betweenness) on the full set of networks within the timeframe of an overnight run. For some of igraph’s and graph-tool’s implementations, the test set had to be restricted to a subset of smaller networks to make it possible to run the complete benchmark overnight.
- NetworKit has the fastest average processing rate on all of these typical analytics kernels. Our implementations have a slight edge over the others for breadth-first search, connected components, clustering coefficients and betweenness. Considering that the algorithms are very similar, this is likely due to subtle differences and optimizations in the implementation.
- For PageRank, core decomposition and the two community detection algorithms, our parallel methods lead to a larger speed advantage.
- The massive difference for the diameter calculation is due to our choice of the iFub algorithm [Crescenzi et al., 2013], which has better running time in the typical case (i. e. complex networks with hub structure) and enables the processing of inputs that are orders of magnitude larger.
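The bounding idea that makes iFub fast on graphs with hub structure can be sketched in a few lines. The code below is a simplified, sequential reading of the scheme described by Crescenzi et al. for connected, unweighted, undirected graphs; it is not NetworKit's implementation, the function names are hypothetical, and leaving the choice of start node to the caller is an assumption of this sketch. Starting from a central node u, it computes eccentricities only for nodes on the outermost BFS levels and stops as soon as the lower bound exceeds the upper bound 2(i - 1) that holds for all remaining levels.

```python
from collections import deque


def bfs_distances(adj, source):
    """Return {node: hop distance from source} for all reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def ifub_diameter(adj, start):
    """Exact diameter of a connected graph; returns (diameter, number of BFS runs)."""
    dist = bfs_distances(adj, start)
    if len(dist) != len(adj):
        raise ValueError("graph must be connected")
    levels = {}  # BFS level i -> nodes at distance i from start
    for v, d in dist.items():
        levels.setdefault(d, []).append(v)

    i = max(dist.values())  # eccentricity of the start node
    lower, upper = i, 2 * i
    bfs_runs = 1
    while upper > lower:
        # Largest eccentricity among the nodes at distance i from start.
        for v in levels[i]:
            lower = max(lower, max(bfs_distances(adj, v).values()))
            bfs_runs += 1
        if lower > 2 * (i - 1):
            return lower, bfs_runs  # no closer node can do better
        upper = 2 * (i - 1)
        i -= 1
    return lower, bfs_runs


# Hypothetical toy input: hub 0 with 50 leaves and a tail 0-51-52-53.
adj = {v: [0] for v in range(1, 51)}
adj[0] = list(range(1, 52))
adj[51], adj[52], adj[53] = [0, 52], [51, 53], [52]

diameter, bfs_runs = ifub_diameter(adj, start=0)
print("diameter:", diameter, "BFS runs:", bfs_runs)
```

On this 54-node example the diameter of 4 is certified after 2 breadth-first searches, whereas the textbook approach of one BFS per node would need 54. The benefit depends on the start node being central, which is why the method works well in the presence of hubs.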
Another scalability factor is the memory footprint of the graph data structure. NetworKit provides a lean implementation in which the 260M edges of the uk-2002 web graph occupy only 9 GB, compared with igraph (93 GB) and graph-tool (14 GB). After indexing the edges, e. g. in order to compute edge centrality scores, NetworKit requires 11 GB for the graph.
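These footprints can be put on a common scale by converting them to bytes per edge. The short calculation below uses only the figures quoted above and the uk-2002 edge count from the benchmark table; interpreting 1 GB as 10^9 bytes and attributing the whole footprint to edges (ignoring per-node overhead) are simplifying assumptions.

```python
EDGES_UK_2002 = 261_787_258  # edge count of uk-2002 from the benchmark table
BYTES_PER_GB = 10**9         # assumption: decimal gigabytes

# Memory footprints for uk-2002 as reported in the text (in GB).
footprint_gb = {"NetworKit": 9, "graph-tool": 14, "igraph": 93}

bytes_per_edge = {
    name: gb * BYTES_PER_GB / EDGES_UK_2002
    for name, gb in footprint_gb.items()
}

for name, value in bytes_per_edge.items():
    print(f"{name:>10}: {value:6.1f} bytes per edge")
```

Under these assumptions the footprints correspond to roughly 34 bytes per edge for NetworKit, 53 for graph-tool and 355 for igraph.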
A third factor that should not be ignored for real workflows is I/O. Getting a large graph from hard disk to memory often takes far longer than the actual analysis. For our benchmark, we chose the GML graph file format for the input files, because it is supported by all three frameworks. We observed that the NetworKit parser is significantly faster for these non-attributed graphs.
9 Open-Source Development and Distribution
Through open-source development we would like to encourage usage and contributions by a diverse community, including data mining users and algorithm engineers. While the core developer team is located at KIT, NetworKit is becoming a community project with a growing number of external users and contributors. The code is free software licensed under the permissive MIT License. The package source, documentation, and additional resources can be obtained from http://networkit.iti.kit.edu. The package networkit is also installable via the Python package manager pip. For custom-built applications, the Python layer may be omitted by building a subset of functionality as a native library.
The NetworKit project exists at the intersection of graph algorithm research and network science. Its contributors develop and collect state-of-the-art algorithms for network analysis tasks and incorporate them into ready-to-use software. The open-source package is under continuous development. The result is a tool suite of network analytics kernels, network generators and utility software to explore and characterize large network data sets on typical multicore computers. We detailed techniques that allow NetworKit to scale to large networks, including appropriate algorithm patterns (parallelism, heuristics, data structures) and implementation patterns (e. g. modular design). The interface provided by our Python module allows domain experts to focus on data analysis workflows instead of the intricacies of programming. This is especially enabled by a new frontend that generates comprehensive statistical reports on structural features of the network. Specialized programming skills are not required, though users familiar with the Python ecosystem of data analysis tools will appreciate the possibility to seamlessly integrate our toolkit.
Among similar software packages, NetworKit yields the best performance for common analysis workflows. Our experimental study showed that NetworKit is capable of quickly processing large-scale networks for a variety of analytics kernels in a reliable manner. This translates into faster workflow and extended analysis capabilities in practice. We recommend NetworKit for the comprehensive structural analysis of large complex networks, as well as processing large batches of smaller networks. With fast parallel algorithms, scalability is in practice primarily limited by the size of the shared memory: A standard multicore workstation with 256 GB RAM can therefore process up to edge graphs.
This work was partially supported by the project Parallel Analysis of Dynamic Networks – Algorithm Engineering of Efficient Combinatorial and Numerical Methods, which is funded by the Ministry of Science, Research and the Arts Baden-Württemberg and by DFG grant ME-3619/3-1 (FINCA) within the SPP 1736 Algorithms for Big Data. Aleksejs Sazonovs acknowledges support by the RISE program of the German Academic Exchange Service (DAAD). We thank Maximilian Vogel and Michael Hamann for continuous algorithm and software engineering work on the package. We also thank Lukas Barth, Miriam Beddig, Elisabetta Bergamini, Stefan Bertsch, Pratistha Bhattarai, Andreas Bilke, Simon Bischof, Guido Brückner, Mark Erb, Kolja Esders, Patrick Flick, Lukas Hartmann, Daniel Hoske, Gerd Lindner, Moritz v. Looz, Yassine Marrakchi, Mustafa Özdayi, Marcel Radermacher, Klara Reichard, Matteo Riondato, Marvin Ritter, Arie Slobbe, Florian Weber, Michael Wegner and Jörg Weisbarth for contributing to the project.
- [Aiello et al., 2000] Aiello, W., Chung, F., and Lu, L. (2000). A random graph model for massive graphs. In Proceedings of the thirty-second annual ACM symposium on Theory of computing, pages 171–180. ACM.
- [Alahakoon et al., 2011] Alahakoon, T., Tripathi, R., Kourtellis, N., Simha, R., and Iamnitchi, A. (2011). K-path centrality: A new centrality measure in social networks. In Proceedings of the 4th Workshop on Social Network Systems, page 1. ACM.
- [Albert and Barabási, 2002] Albert, R. and Barabási, A. (2002). Statistical mechanics of complex networks. Reviews of modern physics, 74(1):47.
- [Bader et al., 2014] Bader, D. A., Meyerhenke, H., Sanders, P., Schulz, C., Kappes, A., and Wagner, D. (2014). Benchmarking for graph clustering and partitioning. In Encyclopedia of Social Network Analysis and Mining, pages 73–82.
- [Bastian et al., 2009] Bastian, M., Heymann, S., and Jacomy, M. (2009). Gephi: an open source software for exploring and manipulating networks. In International Conference on Weblogs and Social Media, pages 361–362.
- [Batagelj and Brandes, 2005] Batagelj, V. and Brandes, U. (2005). Efficient generation of large random networks. Physical Review E, 71(3):036113.
- [Batagelj and Mrvar, 2004] Batagelj, V. and Mrvar, A. (2004). Pajek - analysis and visualization of large networks. Volume 2265 of Lecture Notes in Computer Science, pages 477–478. Springer.
- [Batagelj and Zaveršnik, 2011] Batagelj, V. and Zaveršnik, M. (2011). Fast algorithms for determining (generalized) core groups in social networks. Advances in Data Analysis and Classification, 5(2):129–145.
- [Behnel et al., 2011] Behnel, S., Bradshaw, R., Citro, C., Dalcin, L., Seljebotn, D. S., and Smith, K. (2011). Cython: The best of both worlds. Computing in Science & Engineering, 13(2):31–39.
- [Bergamini and Meyerhenke, 2015] Bergamini, E. and Meyerhenke, H. (2015). Fully-dynamic approximation of betweenness centrality. In Algorithms - ESA 2015 - 23rd Annual European Symposium, Patras, Greece, September 14-16, 2015, Proceedings, pages 155–166.
- [Bergamini et al., 2015] Bergamini, E., Meyerhenke, H., and Staudt, C. (2015). Approximating betweenness centrality in large evolving networks. In Proceedings of the Seventeenth Workshop on Algorithm Engineering and Experiments, ALENEX 2015, San Diego, CA, USA, January 5, 2015, pages 133–146.
- [Blondel et al., 2008] Blondel, V. D., Guillaume, J.-L., Lambiotte, R., and Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008.
- [Boccaletti et al., 2006] Boccaletti, S., Latora, V., Moreno, Y., Chavez, M., and Hwang, D.-U. (2006). Complex networks: Structure and dynamics. Physics reports, 424(4):175–308.
- [Bode et al., 2014] Bode, M., Fountoulakis, N., and Müller, T. (2014). The probability that the hyperbolic random graph is connected. Preprint available at http://www.staff.science.uu.nl/~muell001/Papers/BFM.pdf.
- [Boldi et al., 2004] Boldi, P., Codenotti, B., Santini, M., and Vigna, S. (2004). Ubicrawler: A scalable fully distributed web crawler. Software: Practice & Experience, 34(8):711–726.
- [Brandes, 2001] Brandes, U. (2001). A faster algorithm for betweenness centrality. J. Mathematical Sociology, 25(2):163–177.
- [CAIDA, 2003] CAIDA (2003). CAIDA Skitter router-level topology measurements. http://www.caida.org/data/router-adjacencies/.
- [Chakrabarti et al., 2004] Chakrabarti, D., Zhan, Y., and Faloutsos, C. (2004). R-MAT: A recursive model for graph mining. Computer Science Department, page 541.
- [Costa et al., 2011] Costa, L. d. F., Oliveira Jr, O. N., Travieso, G., Rodrigues, F. A., Villas Boas, P. R., Antiqueira, L., Viana, M. P., and Correa Rocha, L. E. (2011). Analyzing and modeling real-world phenomena with complex networks: a survey of applications. Advances in Physics, 60(3):329–412.
- [Crescenzi et al., 2013] Crescenzi, P., Grossi, R., Habib, M., Lanzi, L., and Marino, A. (2013). On computing the diameter of real-world undirected graphs. Theoretical Computer Science, 514:84–95.
- [Csardi and Nepusz, 2006] Csardi, G. and Nepusz, T. (2006). The igraph software package for complex network research. InterJournal, Complex Systems, 1695(5).
- [Dasari et al., 2014] Dasari, N. S., Ranjan, D., and Zubair, M. (2014). ParK: An efficient algorithm for k-core decomposition on multicore processors. In Lin, J., Pei, J., Hu, X., Chang, W., Nambiar, R., Aggarwal, C., Cercone, N., Honavar, V., Huan, J., Mobasher, B., and Pyne, S., editors, 2014 IEEE International Conference on Big Data, Big Data 2014, Washington, DC, USA, October 27-30, 2014, pages 9–16. IEEE.
- [Ediger et al., 2013] Ediger, D., Jiang, K., Riedy, E. J., and Bader, D. A. (2013). GraphCT: Multithreaded algorithms for massive graph analysis. Parallel and Distributed Systems, IEEE Transactions on, 24(11):2220–2229.
- [Ediger et al., 2012] Ediger, D., McColl, R., Riedy, J., and Bader, D. (2012). STINGER: High performance data structure for streaming graphs. In High Performance Extreme Computing (HPEC), 2012 IEEE Conference on, pages 1–5.
- [Eppstein and Wang, 2004] Eppstein, D. and Wang, J. (2004). Fast approximation of centrality. J. Graph Algorithms Appl., 8:39–45.
- [Esders, 2015] Esders, K. (2015). Link prediction in large-scale complex networks. Master’s thesis, Karlsruhe Institute of Technology, http://parco.iti.kit.edu/attachments/Kolja%20Esders%20-%20Thesis.pdf.
- [Flick, 2014] Flick, P. (2014). Analysis of human tissue-specific protein-protein interaction networks. Master’s thesis, Karlsruhe Institute of Technology.
- [Freeman, 1979] Freeman, L. C. (1979). Centrality in social networks conceptual clarification. Social networks, 1(3):215–239.
- [Geisberger et al., 2008] Geisberger, R., Sanders, P., and Schultes, D. (2008). Better approximation of betweenness centrality. In ALENEX, pages 90–100. SIAM.
- [Girvan and Newman, 2002] Girvan, M. and Newman, M. (2002). Community structure in social and biological networks. Proc. of the National Academy of Sciences, 99(12):7821.
- [Gugelmann et al., 2012] Gugelmann, L., Panagiotou, K., and Peter, U. (2012). Random hyperbolic graphs: Degree sequence and clustering - (extended abstract). In Automata, Languages, and Programming - 39th International Colloquium, ICALP 2012, Proceedings, Part II, pages 573–585.
- [Hagberg et al., 2008] Hagberg, A., Swart, P., and Schult, D. (2008). Exploring network structure, dynamics, and function using NetworkX. Technical report, Los Alamos National Laboratory (LANL).
- [Hakimi, 1962] Hakimi, S. L. (1962). On realizability of a set of integers as degrees of the vertices of a linear graph. i. Journal of the Society for Industrial & Applied Mathematics, 10(3):496–506.
- [Katz, 1953] Katz, L. (1953). A new status index derived from sociometric analysis. Psychometrika, 18(1):39–43.
- [Kiwi and Mitsche, 2015] Kiwi, M. and Mitsche, D. (2015). A bound for the diameter of random hyperbolic graphs. Preprint available at http://arxiv.org/abs/1408.2947.
- [Koch et al., 2015] Koch, J., Staudt, C. L., Vogel, M., and Meyerhenke, H. (2015). Complex network analysis on distributed systems: An empirical comparison. In International Symposium on Foundations and Applications of Big Data Analytics.
- [Krioukov et al., 2010] Krioukov, D., Papadopoulos, F., Kitsak, M., Vahdat, A., and Boguñá, M. (2010). Hyperbolic geometry of complex networks. Physical Review E, 82:036106.
- [Kunegis, 2013] Kunegis, J. (2013). KONECT: the Koblenz network collection. In Proceedings of the 22nd international conference on World Wide Web companion, pages 1343–1350. International World Wide Web Conferences Steering Committee.
- [Lancichinetti and Fortunato, 2009] Lancichinetti, A. and Fortunato, S. (2009). Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities. Physical Review E, 80(1):016118.
- [Leskovec and Krevl, 2014] Leskovec, J. and Krevl, A. (2014). SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data.
- [Leskovec and Sosič, 2014] Leskovec, J. and Sosič, R. (2014). SNAP: A general purpose network analysis and graph mining library in C++. http://snap.stanford.edu/snap.
- [Lindner et al., 2015] Lindner, G., Staudt, C. L., Hamann, M., Meyerhenke, H., and Wagner, D. (2015). Structure-preserving sparsification of social networks. ASONAM.
- [Low et al., 2012] Low, Y., Bickson, D., Gonzalez, J., Guestrin, C., Kyrola, A., and Hellerstein, J. M. (2012). Distributed graphlab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment, 5(8):716–727.
- [Lugowski et al., 2012] Lugowski, A., Alber, D., Buluç, A., Gilbert, J. R., Reinhardt, S., Teng, Y., and Waranis, A. (2012). A flexible open-source toolbox for scalable complex graph analysis. In Proceedings of the Twelfth SIAM International Conference on Data Mining (SDM12), pages 930–941.
- [Newman, 2010] Newman, M. (2010). Networks: an introduction. Oxford University Press.
- [O’Madadhain et al., 2003] O’Madadhain, J., Fisher, D., White, S., and Boey, Y. (2003). The JUNG (java universal network/graph) framework. University of California, Irvine, California.
- [P. Erdős, 1960] Erdős, P. and Rényi, A. (1960). On the evolution of random graphs. Publications of the Mathematical Institute of the Hungarian Academy of Sciences.
- [Page et al., 1999] Page, L., Brin, S., Motwani, R., and Winograd, T. (1999). The pagerank citation ranking: Bringing order to the web.
- [Peixoto, 2015] Peixoto, T. P. (2015). graph-tool. http://graph-tool.skewed.de.
- [Perez et al., 2013] Perez, F., Granger, B. E., and Obispo, C. (2013). An open source framework for interactive, collaborative and reproducible scientific computing and education.
- [Raghavan et al., 2007] Raghavan, U. N., Albert, R., and Kumara, S. (2007). Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106.
- [Riondato and Kornaropoulos, 2015] Riondato, M. and Kornaropoulos, E. (2015). Fast approximation of betweenness centrality through sampling. Data Mining and Knowledge Discovery, pages 1–38.
- [Schank and Wagner, 2005] Schank, T. and Wagner, D. (2005). Approximating clustering coefficient and transitivity. Journal of Graph Algorithms and Applications, 9(2):265–275.
- [Shun and Blelloch, 2013] Shun, J. and Blelloch, G. E. (2013). Ligra: a lightweight graph processing framework for shared memory. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’13, Shenzhen, China, February 23-27, 2013, pages 135–146.
- [Sporns and Betzel, 2015] Sporns, O. and Betzel, R. F. (2015). Modular brain networks. Annual review of psychology, 67(1).
- [Staudt and Meyerhenke, 2015] Staudt, C. and Meyerhenke, H. (2015). Engineering parallel algorithms for community detection in massive networks. Parallel and Distributed Systems, IEEE Transactions on, PP(99):1–1.
- [Traud et al., 2012] Traud, A. L., Mucha, P. J., and Porter, M. A. (2012). Social structure of facebook networks. Physica A: Statistical Mechanics and its Applications, 391(16):4165–4180.
- [von Looz and Meyerhenke, 2015] von Looz, M. and Meyerhenke, H. (2015). Querying probabilistic neighborhoods in spatial data sets efficiently. arXiv preprint arXiv:1509.01990.
- [von Looz et al., 2015] von Looz, M., Meyerhenke, H., and Prutkin, R. (2015). Generating random hyperbolic graphs in subquadratic time. In Proc. 26th Int’l Symp. on Algorithms and Computation (ISAAC 2015), LNCS. Springer. To appear.
- [Zafarani and Liu, 2009] Zafarani, R. and Liu, H. (2009). Social computing data repository at ASU. http://socialcomputing.asu.edu.