Abstract
The field of network science is a highly interdisciplinary area; for the empirical analysis of network data,
it draws algorithmic methodologies from several research fields. Hence, research procedures and descriptions
of the technical results often differ, sometimes widely.
In this paper we focus on methodologies for the experimental part of algorithm engineering for network
analysis – an important ingredient for a research area with empirical focus.
More precisely, we unify and adapt existing recommendations from different
fields and propose universal guidelines – including statistical analyses – for the systematic evaluation
of network analysis algorithms. This way, the behavior of newly proposed algorithms can be properly assessed and comparisons to existing solutions become meaningful.
Moreover, as the main technical contribution,
we provide SimexPal, a highly automated tool to perform and analyze experiments following our guidelines.
To illustrate the merits of SimexPal and our guidelines, we apply them in a case study:
we design, perform, visualize and evaluate experiments of a recent algorithm for
approximating betweenness centrality, an important problem in network analysis.
In summary, both our guidelines and SimexPal shall modernize and complement previous efforts in experimental
algorithmics; they are not only useful for network analysis, but also in related contexts.
Guidelines for Experimental Algorithmics in Network Analysis
Eugenio Angriman, Alexander van der Grinten, Moritz von Looz, Henning Meyerhenke, Martin Nöllenburg, Maria Predari and Charilaos Tzovas
Correspondence: meyerhenke@huberlin.de
2019
1 Introduction
The traditional algorithm development process in theoretical computer science typically involves (i) algorithm design based on abstract and simplified models and (ii) analyzing the algorithm’s behavior within these models using analytical techniques. This usually leads to asymptotic results, mostly regarding the worst-case performance of the algorithm. (While average-case Bogdanov and Trevisan (2006) and smoothed analysis Spielman and Teng (2001) exist and gained some popularity, worst-case bounds still make up the vast majority of running time results.) Such worst-case results are, however, not necessarily representative for algorithmic behavior in real-world situations, both for NP-complete problems Heule et al. (2018); Applegate et al. (2007) and poly-time ones Mehlhorn and Sanders (2008); Puglisi et al. (2007). In case of such a discrepancy, deciding upon the best-fitted algorithm solely based on worst-case bounds is ill-advised.
Algorithm engineering has been established to overcome such pitfalls Johnson (1999); Müller-Hannemann and Schirra (2010); Moret (2002). In essence, algorithm engineering is a cyclic process that consists of five iterative phases: (i) modeling the problem (which usually stems from real-world applications), (ii) designing an algorithm, (iii) analyzing it theoretically, (iv) implementing it, and (v) evaluating it via systematic experiments (also known as experimental algorithmics). Note that not all phases necessarily have to be reiterated in every cycle Sanders (2010). This cyclic approach aims at a symbiosis: the experimental results shall yield insights that lead to further theoretical improvements and vice versa. Ideally, algorithm engineering results in algorithms that are asymptotically optimal and have excellent behavior in practice at the same time. Numerous examples where surprisingly large improvements could be made through algorithm engineering exist, e. g., routing in road networks Bast et al. (2016) and mathematical programming Applegate et al. (2006).
In this paper, we investigate and provide guidance on the experimental algorithmics part of algorithm engineering – from a network analysis viewpoint. It seems instructive to view network analysis, a subfield of network science, from two perspectives: on the one hand, it is a collection of methods that study the structural and algorithmic aspects of networks (and how these aspects affect the underlying application). The research focus here is on efficient algorithmic methods. On the other hand, network analysis can be the process of interpreting network data using the above methods. We briefly touch upon the latter; yet, this paper’s focus is on experimental evaluation methodology, in particular regarding the underlying (graph) algorithms developed as part of the network analysis toolbox. [Footnote 1: We use the terms network and graph interchangeably.]
In this view, network analysis constitutes a subarea of empirical graph algorithmics and statistical analysis (with the curtailment that networks constitute a particular data type) Brandes et al. (2013). This implies that, like general statistics, network science is not tied to any particular application area. Indeed, since networks are abstract models of various real-world phenomena, network science draws applications from very diverse scientific fields such as social science, physics, biology and computer science Newman (2018). It is interdisciplinary and all fields at the interface have their own expertise and methodology. The description of algorithms and their theoretical and experimental analysis often differ, sometimes widely – depending on the target community. We believe that uniform guidelines would help with respect to comparability and systematic presentation. That is why we consider our work (although it has limited algorithmic novelty) important for the field of network analysis (as well as network science and empirical algorithmics in general). After all, emerging scientific fields should develop their own best practices.
To stay focused, we concentrate on providing guidelines for the experimental algorithmics part of the algorithm engineering cycle – with emphasis on graph algorithms for network analysis. To this end, we combine existing recommendations from fields such as statistical analysis and data mining / machine learning and adjust them to fit the idiosyncrasies of networks. Furthermore, and as the main technical contribution, we provide SimexPal, a highly automated tool to perform and analyze experiments following our guidelines. For illustration purposes, we use this tool in a case study – the experimental evaluation of a recent algorithm for approximating betweenness centrality, a well-known network analysis task. The target audience we envision consists of network analysts who develop algorithms and evaluate them empirically. Experienced undergraduates and young PhD students will probably benefit most, but even experienced researchers in the community may benefit substantially from SimexPal.
2 Common Pitfalls (and How to Avoid Them)
2.1 Common Pitfalls
Let us first of all consider a few pitfalls to avoid. [Footnote 2: Such pitfalls are probably more often on a reviewer’s desk than one might think. Note that we do not claim that we never stumbled into such pitfalls ourselves, nor that we always followed our guidelines in the past.] We leave problems in modeling the underlying real-world problem aside and focus on the evaluation of the algorithmic method for the problem at hand. Specifically, we discuss a few examples from two main categories of pitfalls: (i) inadequate justification of claims on the paper’s results and (ii) repeatability/replicability/reproducibility issues. [Footnote 3: The terms are explained below; for more detailed context, see ACM’s webpage on artifact review and badging: https://www.acm.org/publications/policies/artifactreviewbadging.]
Clearly, a paper’s claims regarding the algorithm’s empirical behavior need an adequate justification by the experimental results and/or their presentation. Issues in this regard may happen for a variety of reasons; not uncommon is a lack of instances in terms of their number, variety and/or size. An inadequate literature search may lead to not comparing against the current state of the art or, if doing so, to choosing unsuitable parameters. Also noteworthy is an inappropriate usage of statistics. For example, arithmetic averages over a large set of instances might be skewed towards the more difficult instances. Reporting only averages would not be sufficient then. Similarly, if the difference between two algorithms is small, it is hard to decide whether one of the algorithms truly performs better than the other one or if the perceived difference is only due to noise. Even if no outright mistakes are made, potential significance can be wasted: Coffin and Saltzman (2000) discuss papers whose claims could have been strengthened by an appropriate statistical analysis, even without gathering new experimental data.
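The skew of arithmetic averages can be illustrated with a small sketch. The running times below are hypothetical numbers chosen for illustration, not measurements from this paper; the geometric mean of per-instance ratios is shown as one common alternative summary, not as the only valid one.

```python
import math
import statistics

# Hypothetical running times in seconds of two algorithms on five instances;
# the last instance is far harder than the others.
times_a = [0.10, 0.20, 0.40, 0.80, 900.0]
times_b = [0.20, 0.40, 0.80, 1.60, 600.0]


def geometric_mean(values):
    """Geometric mean, computed in log space for numerical robustness."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Per-instance ratio time_b / time_a; values above 1 mean that A is faster.
ratios = [b / a for a, b in zip(times_a, times_b)]

arith_a = statistics.mean(times_a)
arith_b = statistics.mean(times_b)
geo_ratio = geometric_mean(ratios)
wins_a = sum(r > 1 for r in ratios)

print(f"arithmetic mean time: A = {arith_a:.2f} s, B = {arith_b:.2f} s")
print(f"geometric mean of ratios B/A: {geo_ratio:.2f}")
print(f"instances on which A is faster: {wins_a} of {len(ratios)}")
```

In this made-up data the arithmetic mean favors B solely because of the single hard instance, although A is faster on four of the five instances. Reporting per-instance ratios alongside any average makes such effects visible.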
Reproducibility (the algorithm is implemented and evaluated independently by a different team) is a major cornerstone of science and should receive sufficient attention. Weaker notions include replicability (different team, same source code and experimental setup) and repeatability (same team, same source code and experimental setup). An important weakness of a paper would thus be if the description does not allow the (independent) reproduction of the results. First of all, this means that the proposed algorithm needs to be explained adequately. If the source code is not public, it is all the more important to explain all important implementation aspects – the reader may judge whether this is the current community standard. Pseudocode allows one to reason about correctness and asymptotic time and space complexity and is thus very important. But the empirical running time of a reimplementation may deviate from the expectation when the paper omits crucial implementation details.
Even the repetition of one’s own results (which is mandated by major funding organizations for time spans such as 10 years) can become cumbersome if not considered from the beginning. To make this task easier, not only the source code needs to be documented properly, but also the input and output data of experiments as well as the scripts containing the parameters. Probably every reader has heard about this one project where this documentation part has been neglected to some extent…
2.2 Outline of the Paper
How do we avoid such pitfalls? We make this clear by means of a use case featuring betweenness approximation; it is detailed in Section 3. A good start when designing experiments is to formulate a hypothesis (or several ones) on the algorithm’s behavior (see Section 4.1). This approach is not only part of the scientific method; it also helps in structuring the experimental design and evaluation, thereby decreasing the likelihood of some pitfalls. Section 4.2 deals with how to select and describe input instances to support a certain generality of the algorithmic result. With a similar motivation, Section 4.3 provides guidance on how many instances should be selected. It also discusses how to deal with variance in case of nondeterminism, e. g., by stating how often experiments should be repeated.
If the algorithm contains a reasonable number of tunable parameters, a division of the instances into a tuning set and an evaluation set may be advisable, which is discussed in Section 4.4.
In order to compare against the best competitors, one must have defined which algorithms and/or software tools constitute the state of the art. Important aspects of this procedure are discussed in Section 4.5. One aspect can be the approximation quality: while an algorithm may have the better quality guarantee in theory, its empirical solution quality may be worse – which justifies considering not only the theoretical reference in experiments. The claim of superiority refers to certain measures – typically related to resources such as running time or memory and/or related to solution quality. Several common measures of this kind and how to deal with them are explained in Section 4.6. Some of them are hardware-independent, many of them are not.
A good experimental design can take you far – but it is not sufficient if the experimental pipeline is not efficient or lacks clarity. In the first case, obtaining the results may be tedious and time-consuming, or the experiments simply consume more computing resources than necessary and take too long to generate the results. In the second case, in turn, the way experimental data are generated or stored may not allow easy reproducibility. Guidelines on how to set up your experimental pipeline and how to avoid these pitfalls are thus presented in Section 5. The respective subsections deal with implementation issues (5.1), repeatability/replicability/reproducibility (5.2), job submission (5.3), output file structure for long-term storage (5.4), and retrieval and aggregation of results (5.5).
As mentioned, betweenness approximation and more precisely the KADABRA Borassi and Natale (2016) algorithm will serve as our prime example. We have implemented this algorithm as part of the NetworKit toolbox Staudt et al. (2016). As part of our experimental evaluation, a meaningful visualization (Section 6) of the results highlights many major outcomes to the human reader. Since visualization is typically not enough to show statistical significance in a rigorous manner, an appropriate statistical analysis (Section 7) is recommended. Both visualization and statistical analysis shall lead to a justified interpretation of the results.
3 Use Case
Typically, algorithmic network analysis papers (i) contribute a new (graph) algorithm for an already known problem that improves on the state of the art in some respect or (ii) they define a new algorithmic problem and present a method for its solution. To consider a concrete example contribution, we turn towards betweenness centrality, a widely used and very popular measure for ranking nodes (or edges) in terms of their structural position in the graph Boldi and Vigna (2014).
3.1 Betweenness Approximation as Concrete Example
Betweenness centrality Freeman (1977) measures the participation of nodes in shortest paths. More precisely, let $G = (V, E)$ be a graph; the (normalized) betweenness centrality of a node $v \in V$ is defined as
$$b(v) := \frac{1}{n(n-1)} \sum_{\substack{s, t \in V \setminus \{v\} \\ s \neq t}} \frac{\sigma_{st}(v)}{\sigma_{st}}, \tag{1}$$
where $\sigma_{st}$ is the number of shortest paths from $s$ to $t$, and $\sigma_{st}(v)$ is the number of shortest paths from $s$ to $t$ that cross $v$ (as intermediate node). Computing the exact betweenness scores for all nodes of an unweighted graph can be done in $O(nm)$ time with Brandes’s algorithm Brandes (2001), where $n$ is the number of nodes and $m$ the number of edges in $G$. Since this complexity is usually too high for graphs with millions of nodes and edges, several approximation algorithms have been devised Bader et al. (2007); Geisberger et al. (2008); Riondato and Kornaropoulos (2016); Riondato and Upfal (2018); Borassi and Natale (2016). These algorithms trade solution quality for speed and can be much faster.
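To make the definition concrete, the following is a minimal sketch of exact betweenness computation in the style of Brandes's algorithm for unweighted graphs. It is a plain-Python illustration and not the NetworKit implementation; the adjacency-list representation and the function name `betweenness` are assumptions made for this example. Scores are divided by the number of ordered node pairs, n(n-1), as in the normalized definition.

```python
from collections import deque


def betweenness(adj):
    """Normalized betweenness of every node of an unweighted graph.

    adj maps each node to a list of its neighbors; for an undirected
    graph every edge must appear in both adjacency lists.
    """
    n = len(adj)
    score = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        # Accumulate dependencies in order of non-increasing distance from s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every ordered pair (s, t) was counted once; divide by n(n-1).
    return {v: score[v] / (n * (n - 1)) for v in adj}


# Path graph 0 - 1 - 2 - 3 as a tiny example.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
exact = betweenness(path)
print(exact)
```

On the path graph, each inner node lies on the shortest paths of 4 of the 12 ordered node pairs and thus receives a score of 1/3, while the end nodes receive 0.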
For illustration purposes, we put ourselves now in the shoes of the authors of the most recent of the cited algorithms, which is called KADABRA Borassi and Natale (2016): we describe (some of) the necessary steps in the process of writing an algorithm engineering paper on KADABRA [Footnote 4: We select KADABRA not because the authors of the original paper did a bad job in their presentation – au contraire. Our main reasons are (i) the high likelihood that the betweenness problem is already known to the reader due to its popularity in the field and (ii) the fact that approximation algorithms display important aspects of experimental algorithmics quite clearly.] – with a focus on the design and evaluation of the experiments.
3.2 Overview of the Kadabra Algorithm
The KADABRA algorithm approximates the betweenness centrality of (un)directed graphs within a given absolute error of at most $\lambda$ with probability $(1 - \delta)$ Borassi and Natale (2016). The main rationale of the algorithm is to iteratively select two nodes $s$, $t$ uniformly at random and then sample a shortest path $\pi$ from $s$ to $t$ (again uniformly at random). This leads to a sequence of randomly selected shortest paths $\pi_1, \pi_2, \ldots, \pi_\tau$. The betweenness centrality of each node $v$ is then estimated as:
$$\tilde{b}(v) := \frac{1}{\tau} \sum_{i=1}^{\tau} x_i(v), \tag{2}$$
where $x_i(v)$ is 1 iff $v$ occurs in $\pi_i$ and 0 otherwise.
Compared to earlier approximation algorithms that employ similar sampling techniques (e. g., Riondato and Kornaropoulos (2016)), the novelty of KADABRA relies on the clever stopping condition, used to determine the number of rounds $\tau$. Clearly, there is a number $\omega$, depending on the input size but not the graph itself, such that if $\tau \geq \omega$, the algorithm achieves the desired solution quality with probability $(1 - \delta)$. [Footnote 5: For example, if $\omega$ is chosen so that almost all vertex pairs are sampled. In reality, $\omega$ can be chosen to be much smaller than that.] KADABRA, however, avoids running a fixed number of rounds by using adaptive sampling. At each round of the algorithm, it is guaranteed that
$$\Pr\bigl(b(v) \leq \tilde{b}(v) - f\bigr) \leq \delta_L(v) \quad \text{and} \quad \Pr\bigl(b(v) \geq \tilde{b}(v) + g\bigr) \leq \delta_U(v), \tag{3}$$
where $f$ and $g$ are (rather lengthy) expressions depending on $\tilde{b}(v)$, per-vertex probabilities $\delta_L(v)$ and $\delta_U(v)$, the current number of rounds $\tau$ and the static round count $\omega$. $\delta_L$ and $\delta_U$ are chosen such that $\sum_{v \in V} \bigl(\delta_L(v) + \delta_U(v)\bigr) \leq \delta$. Once $f, g < \lambda$ during some round, the algorithm terminates. [Footnote 6: Note also that for each round, the algorithm draws a number of samples and performs occurrence updates without checking the stopping condition. This number is determined by a parameter that is fixed to a constant in the original implementation of KADABRA (but not reported in the paper, and without further explanation).] Algorithm 1 displays the corresponding pseudocode with some expository comments. Besides adaptive sampling, KADABRA relies on a balanced bidirectional BFS to sample shortest paths. For details we refer the interested reader to the original paper.
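The sampling idea behind the estimator can be sketched in a few lines of Python. This is a deliberately simplified illustration and not KADABRA itself: it runs a fixed number of rounds instead of checking the adaptive stopping condition, and it uses a plain BFS instead of a balanced bidirectional BFS. The function names, the `rounds` and `seed` parameters and the adjacency-list input are assumptions made for this example. Only interior nodes of a sampled path are counted, in line with the definition of betweenness via intermediate nodes.

```python
import random
from collections import deque


def sample_shortest_path(adj, s, t, rng):
    """Return the interior nodes of a uniformly random shortest s-t path."""
    dist = {s: 0}
    sigma = {s: 1}
    preds = {s: []}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                preds[w] = []
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
                preds[w].append(u)
    if t not in dist:
        return []  # t is unreachable from s: this sample hits no node
    interior = []
    node = t
    while True:
        # Picking predecessor u with probability sigma[u] / sigma[node]
        # makes every shortest path equally likely.
        weights = [sigma[u] for u in preds[node]]
        node = rng.choices(preds[node], weights=weights)[0]
        if node == s:
            return interior
        interior.append(node)


def estimate_betweenness(adj, rounds, seed=0):
    """Fraction of sampled shortest paths that pass through each node."""
    rng = random.Random(seed)
    nodes = list(adj)
    hits = {v: 0 for v in nodes}
    for _ in range(rounds):
        s, t = rng.sample(nodes, 2)  # two distinct nodes, uniformly at random
        for v in sample_shortest_path(adj, s, t, rng):
            hits[v] += 1
    return {v: hits[v] / rounds for v in nodes}


# Path graph 0 - 1 - 2 - 3; the exact score of both inner nodes is 1/3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
estimate = estimate_betweenness(path, rounds=20000)
print(estimate)
```

With a fixed round count, the error of each estimate shrinks only with the square root of the number of rounds; the adaptive stopping condition described above exists precisely to stop as early as the desired guarantee permits.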
4 Guidelines for the Experimental Design
Now, we want to set up experiments that give decisive and convincing empirical evidence whether KADABRA is better than the state of the art. This section discusses the most common individual steps of this process.
4.1 Determining Your Hypotheses
Experiments are mainly used for two reasons: as an exploratory tool to reveal unknown properties and/or as a means to answer specific questions regarding the proposed algorithm. The latter suggests the development of certain hypotheses on the behavior of our algorithm that can be confirmed or contradicted by experiments. In the following, we group common hypotheses of algorithm engineering papers in network analysis into two categories:

1. Hypotheses on how our algorithm performs compared to the state of the art in one of the following metrics: running time, solution quality or (less often) memory usage. Ideally, an algorithm is deemed successful when it outperforms existing algorithms in terms of all three metrics. However, in practice, we are content with algorithms that exhibit a good trade-off between two metrics, often running time performance and solution quality. As an example, in real-time applications, we may be willing to sacrifice the solution quality (until a certain threshold), in order to meet running time bounds crucial for the application.

2. Hypotheses on how the input instances or a problem/algorithm-specific parameter affects our algorithm in terms of the aforementioned metrics. For example, a new algorithm might only outperform the existing algorithms on a certain type of graphs or an approximation algorithm might only be fast(er) if the desired approximation quality is within a certain range. If a hypothesis involves such restrictions, it should still be general enough for the algorithmic result to be relevant – overtuning of algorithms to uninteresting classes of inputs should be avoided. Other aspects of investigation may be decoupled from the state of the art to some extent and explore the connection between instance and algorithm properties: for instance, a hypothesis on how the empirical approximation error behaves over a range of input sizes.
Kadabra Example
In the context of our betweenness approximation study, we formulate three basic hypotheses:

1. KADABRA has a better running time and scalability (with respect to the graph size) than other algorithms, specifically the main competitor RK Riondato and Kornaropoulos (2016).

2. There is a significant difference between the solution quality of KADABRA and that of RK. (In Section 4.6, we explain how to evaluate the solution quality for approximation algorithms in more detail.)

3. The diameter of input graphs has an effect on the running time of KADABRA: The KADABRA algorithm computes the betweenness values faster for graphs with low diameter than for graphs with large diameter.
The first two hypotheses belong to the first category, since we compare our implementation of the KADABRA algorithm to a number of related algorithms. The third hypothesis belongs to the second category and is related to the performance of the KADABRA algorithm itself. Namely, we test how a data-specific parameter may influence the running time in practice. To evaluate these hypotheses experimentally, we need to select instances with both low and high diameter values, of course.
4.2 Gathering Instances
Selecting appropriate data sets for an experimental evaluation is a crucial design step. For the sake of comparability, a general rule is to gather data sets from public sources; for network analysis purposes, well-known repositories are KONECT Kunegis (2013), SNAP Leskovec and Krevl (2014), DIMACS10 Bader et al. (2012), SuiteSparse Davis and Hu (2011), LWA Boldi and Vigna (2004) and Network Repository Rossi and Ahmed (2016).
An appropriate data collection contains the type(s) of networks which the algorithm is designed for, and, for every type, a large enough number of instances to support our conclusions Johnson (1999). For instance, the data collection of an influence maximization algorithm should include social networks rather than street networks; a data collection to evaluate distributed algorithms should include instances that exceed the memory of a typical shared-memory machine.
The selection of an appropriate subset of instances for our experimental analysis is simpler if we first categorize the available networks in different classes. There exist several ways to do this: first, we can simply distinguish real-world and synthetic (= (randomly) generated) networks. Even though network analysis algorithms generally target real-world networks, one should also use synthetic networks, e. g., to study the asymptotic scalability of an algorithm, since we can easily generate similar synthetic networks of different scales.
Classification of real-world networks generally follows the phenomena they are modeling. Common examples are social networks, hyperlink networks or citation networks, which also fall under the umbrella of complex networks. [Footnote 7: As the name suggests, complex networks have highly nontrivial (complex) topological features. Most notably, such features include a small diameter (small-world effect) and a skewed degree distribution (many vertices with low degree and a few vertices with high degree).] Examples of real-world non-complex networks are certain infrastructure networks such as road networks. If our algorithm targets real-world data, we want to carefully build a diverse collection of instances that is representative of the graph classes we are targeting.
Another interesting classification is one based on specific topological features of the input data such as the diameter, the clustering coefficient, the triangle count, etc. Classifying based on a certain property and reporting it should definitely be done when the property in question is relevant to the algorithm and may have an impact on its behavior. For instance, reporting the clustering coefficient could help with the interpretation of results concerning algorithms that perform triangle counting.
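As an illustration of how such properties can be obtained for small instances, the sketch below computes the diameter and the average local clustering coefficient of an undirected, unweighted graph given as adjacency lists. It is a naive plain-Python version intended for small graphs only; the function names are chosen for this example, nodes of degree below two are counted with coefficient 0 (one of several conventions), and for disconnected graphs the diameter is taken over reachable pairs only. For large networks, a dedicated library is the better choice.

```python
from collections import deque


def eccentricity(adj, source):
    """Largest BFS distance from source to any node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return max(dist.values())


def diameter(adj):
    """Exact diameter via one BFS per node (O(nm) time)."""
    return max(eccentricity(adj, v) for v in adj)


def average_local_clustering(adj):
    """Mean over all nodes of the fraction of connected neighbor pairs."""
    total = 0.0
    for v, neighbors in adj.items():
        k = len(neighbors)
        if k < 2:
            continue  # convention used here: coefficient 0
        nbrs = set(neighbors)
        # Each edge among the neighbors is seen from both endpoints.
        links = sum(1 for u in neighbors for w in adj[u] if w in nbrs) / 2
        total += links / (k * (k - 1) / 2)
    return total / len(adj)


# Triangle 0-1-2 with a pendant node 3 attached to node 2.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
graph_diameter = diameter(graph)
graph_clustering = average_local_clustering(graph)
print(graph_diameter, round(graph_clustering, 4))
```

Reporting such values next to each instance, as done with the diameter in the instance table of this section, lets readers relate observed behavior to instance structure.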
A source of instances especially suited for scalability experiments are synthetic graphs generated with graph generators Goldenberg et al. (2010). An advantage is easier handling especially for large graphs, since no large files need to be transferred and stored. Also, desired properties can often be specified in advance and in some models, a ground truth beneficial to test analysis algorithms is available. Important drawbacks include a lack of realism, especially when choosing an unsuitable generative model.
Table 1: Data collection for the KADABRA experiments.
Network name           # of nodes  # of edges  Diameter  Class
---------------------  ----------  ----------  --------  ---------------
moreno_blogs                1 224      16 715         8  Hyperlink
petsterhamster              2 426      16 631        10  Social
egofacebook                 2 888       2 981         9  Social
openflights                 3 425      19 256        13  Infrastructure
opsahlpowergrid             4 941       6 594        46  Infrastructure
p2pGnutella08               6 301      20 777         9  Peer-to-peer
advogato                    6 539      39 285         9  Social
wikiVote                    7 115     100 762         7  Social
p2pGnutella05               8 846      31 839         9  Peer-to-peer
p2pGnutella04              10 876      39 994        10  Peer-to-peer
foldoc                     13 356      91 471         8  Hyperlink
twin                       14 274      20 573        25  Intl. Relations
cfindergoogle              15 763     148 585         7  Hyperlink
caAstroPh                  18 771     198 050        14  Coauthorship
cacitHepTh                 22 908   2 444 798         9  Coauthorship
subelj_cora                23 166      89 157        20  Citation
egotwitter                 23 370      32 831        15  Social
egogplus                   23 628      39 194         8  Social
p2pGnutella24              26 518      65 369        11  Peer-to-peer
cacitHepPh                 28 093   3 148 447         9  Citation
citHepPh                   34 546     420 877        14  Citation
facebookwosnwall           46 952     183 412        18  Social
editfrwikibooks            47 905     139 141         8  Authorship
dblpcite                   49 789      49 759         2  Citation
locbrightkite_edges        58 228     214 078        18  Social
editfrwikinews             59 546     157 970         7  Authorship
dimacs9BAY                321 270     397 415       837  Road
dimacs9COL                435 666     521 200     1 255  Road
roadNetPA               1 088 092   1 541 898       794  Road
roadNetTX               1 379 917   1 921 660     1 064  Road
Kadabra Example
In a real-world scenario, we want to show the improvement of the KADABRA algorithm with respect to the competition. Clearly, the graph class of highest relevance should be most prominent in our data collection. With this objective in mind, we show a collection for the KADABRA experiments in Table 1. Note that in Table 1 we report the diameter of each graph, along with its size and application class. We include the diameter because it is part of our hypotheses, i. e., the performance of KADABRA depends on the diameter of the input graph (see Section 4.1). In a different scenario, such as triangle counting or community detection, a more pertinent choice would be the average local clustering coefficient. Finally, we focus on complex networks, but include a minority of non-complex infrastructure networks. All of these instances were gathered from one of the aforementioned public repositories.
4.3 Scale of Experiments
Clearly, we cannot test all possible instances, even if they are small McGeoch et al. (2000). On the other hand, we want to draw significant conclusions about our new algorithm (and here not about a particular data set). This means that our experimental results should justify the later conclusions and allow us to distinguish clearly between hypotheses. While Section 4.2 discusses which instances to select, we proceed with the question “how many?”. Also, if the results are affected by randomness, how many repeated runs should we make for each?
Too few instances may be insufficient to support a conclusion. Plots then look sparse, hypothesis tests (Section 7.3.1) are inconclusive and inferred parameters are highly uncertain, which is reflected in unusably wide confidence and credible intervals (Sections 7.3.2 and 7.4). Choosing too many instances, though, costs unnecessary time and expense. As a very rough rule of thumb reflecting community custom, we recommend at least 10 to 15 instances for an experimental paper. If you want to support specific conclusions about algorithmic behavior within a subclass of instances, you should have that many instances in that subclass. Yet, the noisier the measurements and the stronger the differences of output between instances, the more experiments are needed for a meaningful (statistical) analysis. More formally, the uncertainty of many test statistics and inferred parameters [Footnote 8: Most importantly the t-statistic used in confidence intervals and t-tests. In Bayesian statistics, the marginal posterior distribution of a parameter with a Gaussian likelihood also follows a t-distribution, at least when using a conjugate prior Bernardo and Smith (2009).] scales with $\sqrt{s^2/n}$, where $s^2$ is the sample variance and $n$ is the number of measurements. Generally speaking, if we want a result that is twice as precise, we need to quadruple the number of measurements.
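The quadrupling rule follows directly from the square-root scaling. As a short worked derivation, write $s^2$ for the sample variance and $n$ for the number of measurements, and assume that $s^2$ stays roughly the same when more measurements are taken:

```latex
\[
  u(n) = \sqrt{\frac{s^2}{n}}
  \quad\Longrightarrow\quad
  \frac{u(4n)}{u(n)} = \sqrt{\frac{s^2/(4n)}{s^2/n}} = \sqrt{\frac{1}{4}} = \frac{1}{2}.
\]
```

More generally, under the same assumption, reducing the uncertainty by a factor of $k$ requires about $k^2$ times as many measurements.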
The same applies to the number of repetitions for each instance. When plotting results affected by randomness, it is often recommended to average over the results of multiple runs. The expected deviation of such an average from the true mean [Footnote 9: The mean that would result given infinite samples.] is called the standard error and also scales with $\sqrt{s^2/n}$ (with $s^2$ the sample variance and $n$ the number of runs) Altman and Bland (2005). For plotting purposes, we suggest that this standard error for the value of one plotted point (i. e., the size of the error bars) should be one order of magnitude smaller than the variance between plotted points. In the companion notebook, we give an example of code which calculates the number of repetitions required for a desired smoothness. Frameworks exist that automate the repeated measurements, for example Google Benchmark. [Footnote 10: https://github.com/google/benchmark.]
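A minimal sketch of such a calculation is given below. It is not the code from the companion notebook; the function name, the pilot measurements and the target values are hypothetical. The idea is to estimate the sample variance from a few pilot runs and to solve the standard-error formula for the number of runs.

```python
import math
import statistics


def repetitions_needed(pilot_times, target_standard_error):
    """Smallest number of runs whose mean has the desired standard error.

    Solves sqrt(s^2 / n) <= target for n, with s^2 taken from pilot runs.
    """
    sample_variance = statistics.variance(pilot_times)  # divides by n - 1
    return max(2, math.ceil(sample_variance / target_standard_error ** 2))


# Hypothetical running times (seconds) of five pilot runs on one instance.
pilot = [1.9, 2.1, 2.0, 2.2, 1.8]

runs_coarse = repetitions_needed(pilot, target_standard_error=0.04)
runs_fine = repetitions_needed(pilot, target_standard_error=0.02)
print(runs_coarse, runs_fine)
```

Halving the target standard error roughly quadruples the required number of runs. The target itself could be derived from the plot, for example as one tenth of the typical difference between neighboring plotted points.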
In the context of binary hypothesis tests, the statistical power of a test is the probability that it correctly distinguishes between hypotheses, i. e., rejects a false null hypothesis in the classical approach. A larger sample size, a stronger effect or less variance in measurements all lead to a more powerful test. For further reading, see the extensive literature about statistical power Ellis (2010); Cohen (1988).
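How sample size and effect strength influence power can be sketched with a textbook normal approximation for a two-sided test that compares the means of two equally sized groups. The function below is an illustrative assumption, not part of SimexPal or of the analysis in this paper: it treats the variance as known (a z-test), so it is slightly optimistic for small samples where a t-test would be used. The effect size is the difference of means divided by the standard deviation, which is how measurement variance enters.

```python
import math
from statistics import NormalDist


def approximate_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test.

    effect_size: difference of the two means in units of the (common)
    standard deviation; n_per_group: number of measurements per group.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = abs(effect_size) * math.sqrt(n_per_group / 2)
    # Probability that the test statistic falls in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


power_small = approximate_power(effect_size=0.5, n_per_group=16)
power_large = approximate_power(effect_size=0.5, n_per_group=64)
power_null = approximate_power(effect_size=0.0, n_per_group=64)
print(round(power_small, 2), round(power_large, 2), round(power_null, 2))
```

With no effect at all, the rejection probability equals the significance level; with a medium effect of half a standard deviation, roughly 64 measurements per group are needed to reach a power of about 0.8 under this approximation.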
All these calculations for the number of necessary experiments require an estimate of the variance of results – which might be unavailable before conducting the experiments. It might thus be necessary to start with a small initial set of instances to plan and prepare for the main experiments. [Footnote 11: Also called pilot study, as opposed to the main experiments called workhorse study, see McGeoch (2012) and Rardin and Uzsoy (2001).]
4.4 Parameter Tuning
Experiments are not only a necessary ingredient to validate our hypothesis, but also a helpful discovery tool in a previous stage Moret (2002). The behavior of some algorithms depends on tunable parameters. For example, the well-known hybrid implementation of Quicksort switches between Insertionsort and Quicksort depending on whether the current array size is above or below a tunable threshold Sedgewick (1978). An example from network analysis includes a plethora of approximation methods that calculate some centrality measure using sampling Geisberger et al. (2008); Eppstein and Wang (2001); Riondato and Kornaropoulos (2016). In such cases the number of selected samples may highly impact the performance of the algorithm. Experiments are thus used to determine the adequate number of samples that achieves a good approximation.
The KADABRA algorithm consists of two phases, in which bounds for a sampling process are first computed (line 3 in Algorithm 1 of Section 3.2) and then used (when checking the stopping condition in line 9). Tighter bounds are more expensive to compute but save time in the later phase; their quality is thus an important trade-off. When evaluating a newly developed algorithm, there is the risk (or even temptation) to tune its parameters for optimal performance in the experiments while using general default parameters for all competitors. This may often happen with no ill intent – researchers usually know their own algorithm best. To ensure generalizability of results, we recommend splitting the set of instances into a tuning set and an evaluation set. The tuning set is used to find good tuning parameters for all algorithms, while the main experiments are performed on the evaluation set. Only results from the evaluation set should be used to draw and support conclusions. Note that the tuning set can also be used to derive an initial variance estimate, as discussed in Section 4.3.
How to Create the Tuning Set and the Evaluation Set
The tuning set should be structurally similar to the whole data set, so that parameter tuning yields good general performance and is representative of the evaluation set. [Footnote 12: Due to symmetry, this also requires that the tuning set is structurally similar to the evaluation set.] Tuning and evaluation sets should be disjoint. For large data sets, simple random sampling (i. e., simply picking instances from the data set uniformly at random and without replacement) yields such a representative split in expectation. Stratified sampling (i. e., partitioning the data set, applying simple random sampling to each partition and merging the result) James et al. (2013) guarantees it.
In our example, note that our data set is already partitioned into different classes of networks (hyperlink, social, infrastructure, etc.), and that those classes are fairly balanced (see Table 2). We can thus select a certain fraction of instances in each network class as tuning set.
In case of computationally expensive tuning, it is advantageous to keep the tuning set small; for large data sets it can be much smaller than the evaluation set. In our example, we select a single instance per network class, as seen in Table 3. Note that these considerations are similar to the creation of training, test and validation sets in machine learning. While some other sampling methods are equally applicable, more sophisticated methods like statistical learning theory optimize for different objectives and are thus not considered here Hinton (2012); Larochelle et al. (2007); LeCun et al. (1998).
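The stratified split described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used to produce the paper's tables: the function name `stratified_split`, the parameter `tuning_per_class`, the seed value and the toy instance names are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(instances, tuning_per_class=1, seed=42):
    """Split (name, network_class) pairs into disjoint tuning and evaluation sets.

    From every class, `tuning_per_class` instances are drawn uniformly at
    random without replacement; all remaining instances form the evaluation set.
    """
    rng = random.Random(seed)  # fixed seed so that the split can be replicated
    by_class = defaultdict(list)
    for name, network_class in instances:
        by_class[network_class].append(name)

    tuning, evaluation = [], []
    for network_class in sorted(by_class):
        names = sorted(by_class[network_class])
        chosen = set(rng.sample(names, min(tuning_per_class, len(names))))
        tuning.extend(n for n in names if n in chosen)
        evaluation.extend(n for n in names if n not in chosen)
    return tuning, evaluation


# Hypothetical toy data set: (instance name, network class).
toy_instances = [
    ("social-1", "social"), ("social-2", "social"), ("social-3", "social"),
    ("road-1", "road"), ("road-2", "road"),
    ("citation-1", "citation"), ("citation-2", "citation"),
]
tuning_set, evaluation_set = stratified_split(toy_instances)
print("tuning:", tuning_set)
print("evaluation:", evaluation_set)
```

Note that in this sketch a class with a single instance contributes that instance to the tuning set and nothing to the evaluation set; whether this is acceptable depends on the data set at hand.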
Table 2. Network classes and their frequencies in the data set.

Class                 Frequency
Social                  26.67 %
Citation                13.33 %
Peertopeer              13.33 %
Road                    13.33 %
Hyperlink               10.00 %
Infrastructure           6.67 %
Coauthorship             6.67 %
Authorship               6.67 %
Intl. Relationship       3.33 %
Table 3. Tuning set and evaluation set, with the number of nodes and edges of each network.

Tuning set
Network name              Nodes      Edges
opsahlpowergrid           4 941      6 594
advogato                  6 539     43 277
foldoc                   13 356     91 471
p2pGnutella24            26 518     65 369
dblpcite                 12 591     49 635
editfrwikinews           25 042     68 679
dimacs9BAY              321 270    397 415

Evaluation set
Network name              Nodes      Edges
moreno_blogs              1 224     16 718
petsterhamster            2 426     16 631
egofacebook               2 888     19 257
openflights               3 425      2 981
p2pGnutella08             6 301     20 777
wikiVote                  7 115    100 762
p2pGnutella05             8 846     31 839
p2pGnutella04            10 876     39 994
twin                     14 274     20 573
cfindergoogle            15 763    149 456
caAstroPh                18 771    198 050
cacitHepTh               22 908  2 444 798
subelj_cora              23 166     89 157
egotwitter               23 370     32 831
egogplus                 23 628     39 194
editfrwikibooks          27 754     67 584
cacitHepPh               28 093  3 148 447
citHepPh                 34 546    420 921
locbrightkite_edges      58 228    214 078
dimacs9COL              435 666    521 200
roadNetPA             1 088 092  1 541 898
roadNetTX             1 379 917  1 921 660
4.5 Determining Your Competition
Competitor algorithms solve the same or a similar problem and are not dominated in the relevant metrics by other approaches in the literature. Since we focus on the design of the experimental pipeline, we consider only competitors that are deemed implementable. The best ones among them are considered the state of the art (SotA), here with respect to the criteria most relevant for our claims about KADABRA.
Other considerations are discussed next.
Unavailable Source Code
The source code of a competing algorithm may be unavailable and sometimes even the executable is not shared by its authors. If one can implement the competing algorithm with reasonable effort (to be weighed against the importance of the competitor), one should do so. If this is not possible, a viable option is to compare experimental results with published data. For a fair comparison, the experimental settings should then be replicated as closely as possible. In order to avoid this scenario, we recommend open-source code; it offers better transparency and replicability (see also Section 5.2).
Solving a Different Problem
In some cases, the problem our algorithm solves is different from established problems and there is no previous algorithm solving it. Here, it is often still meaningful to compare against the SotA for the established problem. This applies, for example, if we consider the dynamic case of a static problem for the first time Bergamini et al. (2015); Green et al. (2012); Kourtellis et al. (2014), an approximation version of an exact problem Brandes and Pich (2007); Bader et al. (2007); Geisberger et al. (2008), or provide the first parallel version of an algorithm Bader and Madduri (2006). Another example can be an optimization problem with a completely new constraint. While this may change optimal solutions dramatically compared to a previous formulation, a comparison to the SotA for the previous formulation may still be the best assessment. If the problem is so novel that no comparison is meaningful, however, there is no reason for an experimental comparison. [Footnote 13: Nonetheless, experiments can still reveal empirical insights into the behavior of the new algorithm.]
Comparisons on Different Systems
Competitors may be designed to run on different systems. Recently, many algorithms take advantage of accelerators, mainly GPUs, and this trend has also affected the algorithmic network analysis community McLaughlin and Bader (2014); Sariyüce et al. (2013); Shi and Zhang (2011). A direct comparison between CPU and GPU implementations is not necessarily meaningful due to different architectural characteristics and memory sizes. Yet, such a comparison may be the only choice if no other GPU implementation exists. In this case, one should compare against a multithreaded CPU implementation that is tuned for high performance as well. Indirect comparisons can, and should, be done using system-independent measures (see Section 4.6).
KADABRA Example
The most relevant candidates to compare against KADABRA are the SotA algorithms RK Riondato and Kornaropoulos (2016) and ABRA Riondato and Upfal (2018). The same algorithms are chosen as competitors in the original KADABRA paper. However, in our paper we focus only on a single comparison, the one between KADABRA and RK. This restriction is deliberate: the main purpose of our work is to demonstrate the benefits of a thoughtful experimental design for meaningful comparisons of network-related algorithms. Reporting the actual results of such a comparison is of secondary importance. Furthermore, RK and ABRA exhibit similar behavior in the original KADABRA paper, with RK being overall slightly faster. Again, for the purpose of our paper, this is an adequate reason to choose RK over ABRA. [Footnote 14: Of course, if we were to perform experiments for a new betweenness approximation algorithm, we would include all relevant solutions in the experiments.]
4.6 Metrics
The most common metric for an algorithm’s performance is its wall-clock running time. For solution quality, the metrics are usually problem-specific; often, the gap between an algorithm’s solution and the optimum (if known) is reported as a percentage. In the following, we highlight situations that require more specific metrics. First, we discuss metrics for evaluating an algorithm’s running time.
4.6.1 Running Time
CPU Time vs. Wall-clock Time
For sequential algorithms, evaluations should prefer CPU time over wall-clock time. Indeed, this is the running time metric that we use in our KADABRA experiments. Compared to wall-clock time, CPU time is less influenced by external factors such as OS scheduling or other processes running on the same machine. Exceptions to this rule are algorithms that depend on those external factors, such as external-memory algorithms (where disregarding the time spent in system-level I/O routines would be unfair). By the same line of reasoning, evaluations of parallel algorithms should be based on wall-clock time, as they depend on specifics of the OS scheduler.
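The difference between the two metrics can be made concrete with Python's standard library, where `time.process_time()` reports the CPU time of the current process and `time.perf_counter()` reports elapsed wall-clock time. The following is only an illustrative sketch; the helper `measure` and the workload `busy_then_sleep` are hypothetical names, and the sleep merely stands in for any waiting that does not consume CPU time.

```python
import time


def measure(func, *args):
    """Return (result, cpu_seconds, wall_seconds) for one call of func."""
    wall_start = time.perf_counter()   # wall-clock time, includes waiting
    cpu_start = time.process_time()    # CPU time of this process only
    result = func(*args)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, cpu_seconds, wall_seconds


def busy_then_sleep(n, pause):
    total = sum(i * i for i in range(n))  # consumes CPU time
    time.sleep(pause)                     # consumes wall-clock time only
    return total


result, cpu, wall = measure(busy_then_sleep, 200_000, 0.2)
print(f"CPU time: {cpu:.3f} s, wall-clock time: {wall:.3f} s")
```

In this sketch the sleep contributes to the wall-clock measurement but not to the CPU time, which mirrors how waiting for I/O or for the OS scheduler is hidden when only CPU time is reported.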
Architecture-specific Metrics
CPU and wall-clock times heavily depend on the microarchitecture of the CPU the algorithm is executed on. If comparability with data generated on similar CPUs from other vendors or microarchitectures is desired, other metrics should be taken into account. Such metrics can be accessed by utilizing the performance monitoring features of modern CPUs: these allow determining the number of executed instructions, the number of memory accesses (both are mostly independent of the CPU microarchitecture) or the number of cache misses (which depends on the cache size and cache strategy, but not necessarily the CPU model). If one is interested in how efficient algorithms are in exploiting the CPU architecture, the utilization of CPU components can be measured (e. g., the time spent on executing ALU instructions compared to the time that the ALU is stalled while waiting for results of previous instructions or memory fetches). Common open-source tools to access CPU performance counters include the perf profiler, OProfile (on Linux only) and CodeXL.
System-independent Metrics
It is desirable to compare different algorithms on the same system. However, in reality, this is not always possible, e. g., because systems required to run competitors are unavailable. Even if we do not run into such issues ourselves, we should therefore also consider system-independent metrics. For example, this can be the speedup of a newly proposed algorithm against some base algorithm that can be tested on all systems. As an example, for betweenness centrality, some papers re-implement Brandes’s algorithm Brandes (2001) and compare their performance against it Riondato and Upfal (2018); Bergamini et al. (2015); Crescenzi et al. (2015). As this metric is independent of the hardware, it can even be used to compare implementations on different systems, e. g., CPU versus GPU implementations. System-independent metrics also include algorithm-specific metrics, like the number of iterations of an algorithm or the number of edges visited. Those are particularly useful when similar algorithms are compared, e. g., if the algorithms are iterative and only differ in their choice of the stopping condition.
Aggregation and Algorithmic Speedup
Running time measurements are generally affected by fluctuations, so that the same experiment is repeated multiple times. To arrive at one value per instance, one usually computes the arithmetic mean over the experiments (unless the data are highly skewed). A comparison between the running times (or other metrics) of two algorithms $A$ and $B$ on different data sets may result in drastically different values, e. g., because the instances differ in size or complexity. For a concise evaluation, one is here also interested in aggregate values. In such a case it is recommended to aggregate over ratios of these metrics; regarding running time, this would mean to compute the algorithmic speedup of $A$ with respect to $B$. [Footnote 15: The parallel speedup of an algorithm $A$ is instead the speedup of the parallel execution of $A$ against its sequential execution, more precisely the ratio of the running time of the fastest sequential algorithm and the running time of the parallel algorithm. It can be used to analyze how efficiently an algorithm has been parallelized.] [Footnote 16: To achieve a fair comparison of the algorithmic aspects of $A$ and $B$, the algorithmic speedup is often computed over their sequential executions Hennessy and Patterson (2011). In view of today’s ubiquitous parallelism, this perspective may need to be reconsidered, though.] To summarize multiple ratios, one can use the geometric mean Bixby (2002):
$$\mathrm{GM}(\text{speedup}) = \left( \prod_{i=1}^{k} \frac{t_B(i)}{t_A(i)} \right)^{1/k},$$
where $k$ is the number of instances and $t_A(i)$ and $t_B(i)$ denote the running times of $A$ and $B$ on instance $i$, as it has the fundamental property that $\mathrm{GM}(\text{speedup}) = \mathrm{GM}(t_B) / \mathrm{GM}(t_A)$. Which mean is most appropriate for which aggregation is a matter of some discussion Mitchell (2004); Smith (1988); Fleming and Wallace (1986).
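A small numerical check can illustrate why the geometric mean is attractive for aggregating speedups: the geometric mean of the per-instance ratios equals the ratio of the geometric means of the running times. The sketch below uses hypothetical running times for two hypothetical algorithms, named A and B here purely for illustration.

```python
import math


def geometric_mean(values):
    """Geometric mean, computed via logarithms to avoid overflow."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical running times in seconds, one entry per instance.
times_a = [0.5, 12.0, 3.0, 240.0]
times_b = [1.0, 36.0, 3.0, 960.0]

# Per-instance algorithmic speedup of A with respect to B.
speedups = [b / a for a, b in zip(times_a, times_b)]

gm_speedup = geometric_mean(speedups)
ratio_of_gms = geometric_mean(times_b) / geometric_mean(times_a)
arithmetic_mean = sum(speedups) / len(speedups)

print(f"geometric mean of speedups: {gm_speedup:.4f}")
print(f"ratio of geometric means:   {ratio_of_gms:.4f}")
print(f"arithmetic mean of speedups: {arithmetic_mean:.4f}")
```

The arithmetic mean of the ratios does not have this property in general, which is one reason why the choice of mean deserves an explicit statement in a paper.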
4.6.2 Solution Quality
Next, we discuss metrics for solution quality. Here, the appropriate measurements are naturally problem-specific; often, there is no single quality indicator, and multiple orthogonal indicators are used instead.
Empirical vs. Worst-case Solution Quality
As mentioned in the introduction, worst-case guarantees proven in theoretical models are rarely approached in real-world settings. For example, the accuracy of the ABRA algorithm for betweenness approximation has been observed to be always below the fixed absolute error, even in experiments where this was only guaranteed for 90% of all instances Riondato and Upfal (2018). Thus, experimental comparisons should also include metrics for which theoretical guarantees are known.
Comparing Against (Unknown) Ground Truth
For many problems and instances beyond a certain size, ground truth (in the sense of the exact value of a centrality score or the true community structure of a network) is neither known nor feasible to compute. For betweenness centrality, however, AlGhamdi et al. (2017) have computed exact scores for large graphs by investing considerable supercomputing time. The absence of ground truth or exact values, in turn, clearly requires the comparison to other algorithms in order to evaluate an algorithm’s solution quality.
5 Guidelines for the Experimental Pipeline
Organizing and running all the required experiments can be a time-consuming activity, especially if not planned and carried out carefully. That is why we propose techniques and ideas to orchestrate this procedure efficiently. The experimental pipeline can be divided into four phases. In the first one, we finalize the algorithm’s implementation as well as the scripts/code for the experiments themselves. [Footnote 17: It is important to use scripts or some external tool in order to automate the experimental pipeline. This also helps to reduce human errors and simplifies repeatability and replicability.] Next, we submit the experiments for execution (even if the experiments are to be executed locally, we advise using some scheduling/batch system). In the third phase, the experiments run and create the output files. Finally, we parse the output files to gather the information about the relevant metrics.
5.1 Implementation Aspects
Techniques for implementing algorithms are beyond the scope of this paper; however, we give an overview of tooling that should be used for developing algorithmic code.
Source code should always be stored in version control systems (VCS); nowadays, the most commonly used VCS is Git git (2005). For scientific experiments, a VCS should also be used to version scripts that drive experiments, valuable raw experimental data and evaluation scripts. Storing instance names and algorithm parameters of experiments in VCS is beneficial, e. g., when multiple iterations of experiments are done due to the AE cycle.
The usual principles for software testing (e. g., unit tests and assertions) should be applied to ensure that code behaves as expected. This is particularly important in growing projects where seemingly local changes can affect other project parts with which the developer is not very familiar. It is often advantageous to open-source code. [Footnote 18: We acknowledge that open-sourcing code is not always possible, e. g., due to intellectual property or political reasons.] The Open Source Initiative keeps a list of approved open source licenses (https://opensource.org/licenses/alphabetical). An important difference between these licenses is whether they require users to publish derived products under the same license. If code is open-sourced, we suggest well-known platforms like GitHub GitHub (2007), GitLab GitLab (2011), or Bitbucket Bitbucket (2008) to host it. An alternative is to use a VCS server within one’s own organization, which reduces the dependence on commercial interests. In an academic context, better accessibility can have the benefit of a higher scientific impact of the algorithms. For long-term archival storage, in turn, institutional repositories may be necessary.
Naturally, code should be well-structured and documented to encourage further scientific participation. Code documentation highly benefits from documentation generator tools such as Doxygen (http://www.doxygen.nl). Profiling is usually used to find bottlenecks and optimize implementations, e. g., using tools such as the perf profiler on Linux, Valgrind Nethercote and Seward (2007) or a commercial profiler such as VTune Reinders (2005).
5.2 Repeatability, Replicability and Reproducibility
Terminology differs between venues; the Association for Computing Machinery defines repeatability as obtaining the same results when the same team repeats the experiments, replicability for a different team but the same programs and reproducibility for the case of a reimplementation by a different team. Our recommendations are mostly concerned with replicability.
In a perfect world scenario, the behavior of experiments is completely determined by their code version, command line arguments and configuration files. From that point of view, the ideal case for replicability, which is increasingly demanded by conferences and journals in experimental algorithms [Footnote 21: For example, see the Replicated Computational Results Initiative of the ACM Journal of Experimental Algorithmics, http://jea.acm.org.], looks like this: a single executable program automatically downloads or generates the input files, compiles the programs, runs the experiments and recreates the plots and figures shown in the paper from the results.
Unfortunately, in reality some programs are nondeterministic and give different outputs for the same settings. If randomization is used, this problem is usually avoided by fixing an initial seed of a pseudorandom number generator. This seed is just another argument to the program and can be handled like all others. However, parallel programs might still cause genuine nondeterminism in the output, e. g., if the computation depends on the order in which a large search space is explored Hamadi and Wintersteiger (2012); Kimmig et al. (2017) or on the order in which messages from other processors arrive. [Footnote 22: As an example, some associative calculations are not associative when implemented with floating point numbers Goldberg (1991). In such a case, the order of several, say, additions, matters.] If these effects are of a magnitude that they affect the final result, these experiments need to be repeated sufficiently often to cover the distribution of outputs. A replication would then aim at showing that its achieved results are, while not identical, equivalent in practice. For a formal way to show such practical equivalence, see Section 7.4.1.
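Both effects mentioned above can be demonstrated with a short Python sketch: a fixed seed makes a randomized computation repeatable, whereas floating-point addition is not associative, so the order of operations (as it may vary in a parallel reduction) can change the result. The function `sample_vertices` and all numbers are hypothetical and only serve as an illustration.

```python
import random


def sample_vertices(num_vertices, num_samples, seed):
    """Draw vertex IDs with a private, explicitly seeded generator."""
    rng = random.Random(seed)  # the global random state is left untouched
    return [rng.randrange(num_vertices) for _ in range(num_samples)]


# Same seed, same arguments: the two runs produce identical samples.
run_1 = sample_vertices(1000, 5, seed=12345)
run_2 = sample_vertices(1000, 5, seed=12345)
print("identical samples:", run_1 == run_2)

# Floating-point addition is not associative: the summation order matters.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print("same sum:", left == right, "| difference:", abs(left - right))
```

Passing the seed as an explicit argument, as in this sketch, keeps it on the same footing as all other parameters, so that it can be logged and replayed.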
Implementations often depend on libraries or certain compiler versions. This might lead to complications in later replications when dependencies are no longer available or incompatible with modern systems. Providing a virtual machine image or container addresses this problem.
5.3 Running Experiments
Running experiments means taking care of many details: instances need to be generated or downloaded, jobs need to be submitted to batch systems or executed locally, running jobs need to be monitored, crashes need to be detected, crashing jobs need to be restarted without restarting all experiments, etc. [Footnote 23: Note that the exact submission mechanism is beyond the scope of this paper, as it heavily depends on the batch system in question. Nevertheless, our guidelines and tooling suggestions can easily be adapted to all common batch systems, such as Slurm (https://slurm.schedmd.com/) or PBS (https://www.pbspro.org/).] To avoid human errors, improve reproducibility and accelerate those tasks, scripts and tooling should be employed.
To help with these recurring tasks, we provide as a supplement to this paper SimexPal, a command-line tool to automate the aforementioned tasks (among others). [Footnote 24: SimexPal can be found at https://github.com/hu-macsy/simexpal.] This tool allows the user to manage instances and experiments, launch jobs and monitor the status of those jobs. While our tool is not the only possible way to automate these tasks, we do hope that it improves over the state of writing custom scripts for each individual experiment. SimexPal is configured using a simple YAML YAML (2002) file and only requires a minimal amount of user-supplied code. To illustrate this concept, we give an example configuration in Figure 1. Here, run is the only piece of user-supplied code that needs to be written. run executes the algorithm and prints the output (e. g., running times) to stdout. Given such a configuration file, the graph instances can be downloaded using simex instances download. After that is done, jobs can be started using the command simex experiments launch. SimexPal takes care of not launching experiments twice and never overwrites existing output files. simex experiments list monitors the progress of all jobs. If a job crashes, simex experiments purge can be used to remove the corresponding output files. The next launch command will rerun that particular job.
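The idea of never launching a job twice and never overwriting existing output can be sketched independently of any particular tool. The following Python sketch is not SimexPal's implementation; the function `launch`, the job names and the `.out` file naming are hypothetical and chosen only to illustrate the principle.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def launch(jobs, output_dir):
    """Run each (name, command) job unless its output file already exists."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    launched = []
    for name, command in jobs:
        out_file = output_dir / f"{name}.out"
        if out_file.exists():
            continue  # existing output is never overwritten
        proc = subprocess.run(command, capture_output=True, text=True)
        if proc.returncode == 0:
            out_file.write_text(proc.stdout)
            launched.append(name)
        # A crashed job leaves no output file and is retried on the next launch.
    return launched


# Two hypothetical jobs that simply print a made-up running time.
jobs = [
    ("algo-a_graph-1", [sys.executable, "-c", "print('time: 1.5')"]),
    ("algo-b_graph-1", [sys.executable, "-c", "print('time: 2.5')"]),
]
with tempfile.TemporaryDirectory() as tmp:
    first = launch(jobs, tmp)   # runs both jobs
    second = launch(jobs, tmp)  # nothing left to do
    print(first, second)
```

Deleting the output file of a job in this sketch has the effect that the next call of `launch` reruns exactly that job, without touching the others.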
5.4 Structuring Output Files
Output files typically store three kinds of data: (i) experimental results, e. g., running times and measures of solution quality, (ii) metadata that completely specify the parameters and the environment of the run, so that the run can be replicated (see Section 5.2), and (iii) supplementary data, e. g., the solution to the input problem that can be used to understand the algorithm’s behavior and to verify its correctness. Care must be taken to ensure that output files are suitable for long-term archival storage (which is mandated by good scientific practices (2010) (DFG)). Furthermore, carefully designing output files helps to accelerate experiments by avoiding repeated runs that become necessary when earlier runs did not produce all relevant information (e. g., if the focus of experiments changes after exploration).
Choosing which experimental results to output is problem-specific but usually straightforward. For metadata, we recommend including enough information to completely specify the executed algorithm version, its parameters and input instance, as well as the computing environment. This usually involves the following data: the VCS commit hash of the implementation and compiler(s) as well as all libraries the implementation depends on, the name (or path) of the input instance, the values of parameters of the algorithm (including random seeds), the host name of the machine and the current date and time. [Footnote 25: Experiments should never run uncommitted code. If there are any uncommitted changes, we suggest printing a comment to the output file to ensure that the experimental data in question does not enter a paper.] [Footnote 26: If the implementation’s behavior can be controlled by a large number of parameters, it makes sense to print command line arguments as well as relevant environment variables and (excerpts from) configuration files to the output file.] [Footnote 27: Date and time help to identify the context of the experiments based on archived output data.] Implementations that depend on hardware details (e. g., parallel algorithms or external-memory algorithms) should also log CPU, GPU and/or memory configurations, as well as versions of relevant libraries and drivers.
The relevance of different kinds of supplementary data is highly problem-dependent. Examples include (partial) solutions to the input problem, numbers of iterations that an algorithm performs, snapshots of the algorithm’s state at key points during the execution or measurements of the time spent in different parts of the algorithm. Such supplementary data is useful to better understand an algorithm’s behavior, to verify the correctness of the implementation or to increase confidence in the experimental results (i. e., that the running times or solution qualities reported in a paper are actually correct). If solutions are stored, automated correctness checks can be used to find and debug problems or to demonstrate that no such problems exist.
The output format itself should be chosen to be both human readable and machine parsable. Human readability is particularly important for long-term archival storage, as parsers for proprietary binary formats (and the knowledge of how to use them) can be lost over time. Machine parsability enables automated extraction of experimental results; this is preferable over manual extraction, which is inefficient and error-prone. Thus, we recommend structured data formats like YAML (or JSON JSON (2001)). Those formats can be easily understood by humans; furthermore, there is a large variety of libraries to process them in any commonly used programming language. If plain text files are used, we suggest ensuring that experimental results can be extracted by simple regular expressions or similar tools.
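As an illustration of these recommendations, the following Python sketch writes results together with metadata as JSON, using only the standard library. It is a minimal example under stated assumptions: the field names, the helper functions `git_commit_hash` and `make_record`, the instance name, the parameter `threshold` and the reported running time are all hypothetical.

```python
import datetime
import json
import socket
import subprocess


def git_commit_hash():
    """Return the current commit hash, or None if Git is unavailable."""
    try:
        out = subprocess.run(["git", "rev-parse", "HEAD"],
                             capture_output=True, text=True, check=True)
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None


def make_record(instance, parameters, seed, results):
    """Bundle experimental results with the metadata needed to replicate them."""
    return {
        "instance": instance,
        "parameters": parameters,
        "seed": seed,
        "commit": git_commit_hash(),
        "hostname": socket.gethostname(),
        "date": datetime.datetime.now().isoformat(timespec="seconds"),
        "results": results,
    }


# Hypothetical run: instance name, parameter and running time are made up.
record = make_record("hypothetical-graph", {"threshold": 16}, 42,
                     {"cpu_time_s": 1.73})
text = json.dumps(record, indent=2)
print(text)
```

Because the record is plain JSON, it remains readable in a text editor and can be loaded again with `json.loads` by evaluation scripts.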
KADABRA Example
Let us apply these guidelines to our example of the KADABRA algorithm using SimexPal with the YAML file format. For each instance, we report KADABRA’s running time, the values of the algorithm’s parameters, the random seed that was used for the run, the number of samples that the run required and the top-25 nodes of the resulting betweenness centrality ranking and their betweenness scores. The number 25 here is chosen arbitrarily, as a good balance between the amount of information that we store to verify the plausibility of the results and the amount of space consumed by the output. To fully identify the benchmark setting, we also report the hostname, our Git commit hash, our random generator seed and the current date and time. Figure 2 shows what the resulting output file looks like.
5.5 Gathering Results
When the experiments are done, one has to verify that all runs were successful. Then, the output data has to be parsed to extract the desired experimental data for later evaluation. Again, we recommend the use of tools for parsing. In particular, SimexPal offers a Python package to collect output data. Figure 3 depicts a complete script that computes average running times for different algorithms on the same set of instances, using only seven lines of Python code. Note that SimexPal takes care of reading the output files and checking that all runs indeed succeeded (using the function collect_successful_results()).
In our example, we assume that the output files are formatted as YAML (thus we use the function yaml.load() to parse them). In case a custom format is used (e. g., when reading output from a competitor where the output format cannot be controlled), the user has to supply a small function to parse the output. Fortunately, this can usually be done using regular expressions (e. g., with Python’s re module).
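Such a user-supplied parsing function could look like the following sketch, which uses Python's standard `re` module. The output format of the competitor (lines of the form "key: number unit") and the names `LINE_PATTERN` and `parse_output` are assumptions made purely for illustration.

```python
import re

# Matches lines such as "Running time: 12.75 s" or "Samples: 48210".
LINE_PATTERN = re.compile(
    r"^(?P<key>[A-Za-z_ ]+):\s*"
    r"(?P<value>[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*"
    r"(?P<unit>\w*)$"
)


def parse_output(text):
    """Extract all 'key: number [unit]' lines into a dict of floats."""
    values = {}
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:
            values[match.group("key").strip()] = float(match.group("value"))
    return values


# Hypothetical output of a competitor whose format we cannot control.
sample = """\
Loading graph...
Running time: 12.75 s
Samples: 48210
Done.
"""
parsed = parse_output(sample)
print(parsed)
```

Lines that do not match the pattern (here, the progress messages) are silently skipped; in practice one may prefer to fail loudly if an expected key is missing.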
Now would also be a good time to aggregate data appropriately (unless this has been taken care of before; see also Section 4.6).
6 Visualizing Results
After the experiments have finished, the recorded data and results need to be explored and analyzed before they can finally be reported. Data visualization in the form of various different plots is a helpful tool both in the exploration phase and also in the final communication of the experimental results and the formal statistical analysis. The amount of data collected during the experiments is typically too large to report in its entirety and hence meaningful and faithful data aggregations are needed. [Footnote 28: For future reference and repeatability, it may make sense to include relevant raw data in tables in the appendix. But raw data tables are rarely good for illustrating trends in the data.] While descriptive summary statistics such as means, medians, quartiles, variances, standard deviations, or correlation coefficients, as well as results from statistical testing like p-values or confidence intervals provide a well-founded summary of the data, they do so, by design, only at a very coarse level. The same statistics can even originate from very different underlying data: a famous example is Anscombe’s quartet Anscombe (1973), a collection of four sets of eleven points in $\mathbb{R}^2$ with very different point distributions, yet (almost) the same summary statistics such as mean, variance, or correlation. [Footnote 29: For example plots of the four point sets see https://commons.wikimedia.org/w/index.php?curid=9838454.] It is a striking example of how important it can be not to rely on summary statistics alone, but to visualize experimental data graphically. A more recent example is the datasaurus (https://www.autodeskresearch.com/publications/samestats).
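The point about Anscombe's quartet can be checked numerically. The sketch below computes the summary statistics of the four point sets with Python's standard library; the data values are the commonly reproduced ones from Anscombe (1973), and the helper `pearson` is written out by hand so that the example does not depend on a particular Python version.

```python
import math
import statistics

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(sxx * syy)


summary = {
    name: {
        "mean_x": statistics.mean(xs),
        "mean_y": statistics.mean(ys),
        "var_y": statistics.variance(ys),
        "corr": pearson(xs, ys),
    }
    for name, (xs, ys) in quartet.items()
}
for name, stats in summary.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

The four rows of the printed summary are nearly indistinguishable, although scatter plots of the four sets look entirely different.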
Here we discuss a selection of plot types for data visualization with focus on algorithm engineering experiments, together with guidelines when to use which type of plot depending on the properties of the data. For a more comprehensive introduction to data visualization, we refer the reader to some of the in-depth literature on the topic Healy (2018); Sanders (2002); Tufte (2001). Furthermore, there are many powerful libraries and tools for generating data plots, e. g., R (https://www.r-project.org) and the ggplot2 package (https://ggplot2.tidyverse.org), gnuplot (http://www.gnuplot.info) and matplotlib (https://matplotlib.org). Also mathematical software such as MATLAB (https://www.mathworks.com/products/matlab.html), or even spreadsheet tools, can generate various types of plots from your data. For more details about creating plots in one of these tools, we refer to the respective user manuals and various available tutorials.
When presenting data in two-dimensional plots, marks are the basic graphical elements or geometric primitives to show data. They can be points (zero-dimensional), lines (1D), or areas (2D). Each mark can be further refined and enriched by visual variables or channels such as their position, size, shape, color, orientation, texture, etc. to encode different aspects of the data. The most important difference between those channels is that some are more suitable to represent categorical data (e. g., graph properties, algorithms, data sources) by assigning different shapes or colors to them, whereas others are well suited for showing quantitative and ordered data (e. g., input sizes, timings, quantitative quality measures) by mapping them to a spatial position, size, or lightness. For instance, in Figure 6 we use blue circles as marks for one of the algorithms, while for the other we use orange crosses. Using different shapes makes sure that the plots are still readable if printed in greyscale. Not all channels are equally expressive and effective in encoding information for the human visual system, so that one has to carefully select which aspects of the data to map to which channels, possibly also using redundancy. For more details see the textbooks Munzner (2014) and Ware (2012).
As discussed in Section 4.6, the types of metrics from algorithmic experiments comprise two main aspects: running time data and solution quality data. Both types of metrics can consist of absolute or relative values. Typically, further attributes and parameters of the experimental data, of the algorithms, and of the input instances are relevant to include in a visualization to communicate the experimental findings. These can range from hardware-specific parameters, over the set of algorithms and possible algorithm-specific parameters, to instance-dependent parameters such as certain graph properties. Depending on the experiment’s focus, one needs to decide on the parameters to show in a plot and on how to map them to marks and channels. Typically, the most important metric to answer the guiding research question (e. g., running time or solution quality) is plotted along the y-axis. The x-axis, in turn, is used for the most relevant parameter describing the instances (e. g., instance size for a scalability evaluation or a graph parameter for evaluating its influence on the solution quality). Additional parameters of interest can then be represented by using distinctly colored or shaped marks for different experimental conditions such as the respective algorithm, an algorithmic parameter or properties of the used hardware. For example, in Figure 6 the instance roadNetTX is plotted differently because RK did not finish within the allocated time frame of 7 hours.
Before starting to plot the data, one needs to decide whether raw absolute data should be visualized (e. g., running times or objective function values) or whether the data should be shown relative to some baseline value, or be normalized prior to plotting. This decision typically depends on the specific algorithmic problem, the experimental setup and the underlying research questions to be answered. For instance, when the experiment is about a new algorithm for a particular problem, the algorithmic speedup or possible improvement in solution quality with respect to an earlier algorithm may be of interest; hence, running time ratios or quality ratios can be computed as a variable to plot, as shown in Figure 5 with respect to KADABRA and RK. Another possibility of data preprocessing is to normalize certain aspects of the experimental data before creating a plot. For example, to understand effects caused by properties of the hardware, such as cache sizes and other memory effects, one may normalize running time measurements by the algorithm’s time complexity in terms of $n$ and $m$, the number of vertices and edges of the input graph, and examine whether the resulting normalized computation times are constant. A wide range of meaningful data processing and analysis can be done before showing the resulting data in a visualization. While we just gave a few examples, the actual decision of what to show in a plot needs to be made by the designers of the experiment after carefully exploring all relevant aspects of the data from various perspectives. For the remainder of this section, we assume that all data values to be visualized have been selected.
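As a small illustration of the running time ratios mentioned above, the following sketch computes per-instance speedups of a new algorithm over a baseline. The instance names and timings are hypothetical, and summarizing the ratios by their geometric mean is our assumption of a reasonable aggregate for ratios, not a prescription.

```python
import math

def speedups(baseline_times, new_times):
    """Per-instance ratios baseline/new; values > 1 mean the new algorithm is faster."""
    return {inst: baseline_times[inst] / new_times[inst]
            for inst in baseline_times if inst in new_times}

def geometric_mean(values):
    """Geometric mean, a common way to summarize ratios."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical running times in seconds (not measured data).
rk_times = {"g1": 120.0, "g2": 45.0, "g3": 300.0}
kadabra_times = {"g1": 12.0, "g2": 9.0, "g3": 15.0}

ratios = speedups(rk_times, kadabra_times)
print(ratios)
print(geometric_mean(ratios.values()))
```

The resulting ratios can then be plotted on the y-axis instead of the raw running times.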
A very fundamental plot is the scatter plot, which maps two variables of the data (e. g., size and running time) onto the x- and y-axis, see Figure 6. Every instance of the experiment produces its own point mark in the plot, by using its values of the two chosen variables as coordinates. Further variables of the data can be mapped to the remaining channels such as color, symbol shape, symbol size, or texture. If the number of instances is not too large, a scatter plot can give an accurate visualization of the characteristics and trends of the data. However, for large numbers of instances, overplotting can quickly lead to scatter plots that become hard to read.
In such cases, the data needs to be aggregated before plotting, by grouping similar instances and showing summaries instead. The simplest way to show aggregated data, such as repeated runs of the same instances, is to plot a single point mark for each group, e. g., using the mean or median, and then optionally putting a vertical error bar on top of it showing one standard deviation or a particular confidence interval of the variable within each group. Such a plot can be well suited for showing how running times scale with the instance size. If the sample sizes in the experiment have been chosen to cover the domain well, one may amplify the salience of the trend in the data by linking the point marks by a line plot. However, this visually implies a linear interpolation between neighboring measurements and therefore should only be done if sufficiently many sample points are included and the plot is not misleading. Obviously, if categorical data are represented on the x-axis, one should never connect the point marks by a line plot.
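The aggregation step described above can be sketched as follows. The function name and the measurements are hypothetical; we assume that repeated runs are given as (instance size, running time) pairs and summarize each group by its median and sample standard deviation, which could then be drawn as a point mark with an error bar.

```python
from collections import defaultdict
from statistics import median, stdev

def aggregate_runs(measurements):
    """Group (instance_size, running_time) pairs by size and return
    sorted (size, median, sample standard deviation) triples."""
    groups = defaultdict(list)
    for size, running_time in measurements:
        groups[size].append(running_time)
    summary = []
    for size in sorted(groups):
        times = groups[size]
        spread = stdev(times) if len(times) > 1 else 0.0
        summary.append((size, median(times), spread))
    return summary

# Hypothetical repeated runs (seconds) on two instance sizes.
runs = [(1000, 1.0), (1000, 1.2), (1000, 1.1),
        (2000, 2.0), (2000, 2.4), (2000, 2.2)]
summary = aggregate_runs(runs)
for size, med, spread in summary:
    print(size, med, spread)
```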
At the same time, the scale of the two coordinate axes is also of interest. While a linear scale is the most natural and least confusing for human interpretation, some data sets contain values that may grow exponentially. On a linear scale, the large values then dominate the entire plot while the small values disappear in a very narrow band. In such cases, axes with a logarithmic scale can be used; however, this should always be made explicit in the axis labeling and the caption of the plot.
A more advanced summary plot is the box plot (or box-and-whiskers plot), where all repeated runs of the same instance or sets of instances with equal or similar size form a group, see Figure 6(a). This group is represented as a single glyph showing simultaneously the median, the quartiles, the minimum and maximum values or another percentile, as well as possibly outliers as individual marks. If the x-axis shows an ordered variable such as instance size, one can still clearly see the scalability trend in the data as well as the variability within each homogeneous group of instances.
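To make the ingredients of a box plot glyph concrete, the sketch below computes them for one group of running times. The data are hypothetical, and the outlier rule (points farther than 1.5 interquartile ranges from the box) is one common convention that we assume here; plotting libraries may use other defaults.

```python
from statistics import quantiles

def box_plot_summary(values, whisker=1.5):
    """Quartiles, whisker ends and outliers of one group of measurements."""
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - whisker * iqr, q3 + whisker * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {"q1": q1, "median": med, "q3": q3,
            "whisker_low": inside[0], "whisker_high": inside[-1],
            "outliers": outliers}

# Hypothetical running times (seconds) of repeated runs; 30.0 is an outlier.
times = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0]
box = box_plot_summary(times)
print(box)
```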
Violin plots take the idea of box plots even further and draw the contour of the density distribution of the variable mapped to the y-axis within each group, see Figure 6(b). They are thus visually more informative than a simple box plot, but also more complex and possibly more difficult to read. When deciding on a particular type of plot, one has to explore the properties of the data and choose a plot type that neither oversimplifies them nor is more complex than needed.
Data from algorithmic experiments may also often be aggregated by some attribute into groups of different sizes. In order to show the distribution of the instances into these groups, bar charts (for categorical data) or histograms (for continuous data) can be used to visualize the cardinality of each group. Such diagrams are often available in public repositories like KONECT (footnote 36: for example, one can find this information for moreno_blogs in http://konect.unikoblenz.de/networks/moreno_blogs). For a complex network, for example, one may want to plot the degree distribution of its nodes with the degree (or bins with a specific degree range) on the x-axis and the number of vertices with a particular degree (or in a particular degree range) on the y-axis. Such a histogram can then quickly reveal how skewed the degree distribution is. Similarly, histograms can be useful to show solution quality ratios obtained by one or more algorithms by defining bins based on selected ranges of quality ratios. Such a plot quickly indicates to the reader what percentage of instances could be solved within a required quality range, e. g., at least a given percentage of the optimum.
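The binning behind such a histogram of quality ratios can be sketched as follows. The ratios and bin edges are hypothetical; the helper assumes half-open bins, with the last bin also including its right edge so that optimal solutions (ratio 1.0) are counted.

```python
def histogram(values, edges):
    """Count values per bin [edges[i], edges[i+1]); the last bin also
    includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    return counts

# Hypothetical quality ratios (solution value divided by the optimum).
ratios = [1.0, 0.99, 0.97, 0.93, 0.91, 0.88, 0.96, 1.0]
edges = [0.85, 0.90, 0.95, 1.00]

counts = histogram(ratios, edges)
share_within_95_percent = counts[-1] / len(ratios)
print(counts, share_within_95_percent)
```

The last value is the fraction of instances solved to at least 95% of the optimum in this made-up data set.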
A single plot can contain multiple experimental conditions simultaneously. For instance, when showing running time behavior and scalability of a new algorithm compared to previous approaches, a single plot with multiple trend lines in different colors or textures or with visually distinct mark shapes can be very useful to make comparisons among the competing algorithms. Clearly, a legend needs to specify the mapping between the data and the marks and channels in the plot. Here it is strongly advisable to use the same mapping if multiple plots are used that all belong together. But, as a final remark, bear in mind the size and resolution of the created figures. Avoid clutter and ensure that your conclusion remains clearly visible!
7 Evaluating Results with Statistical Analysis
Even if a result looks obvious (footnote 37: it fulfills the “interocular trauma test”, as the saying goes, the evidence hitting you between the eyes), it can benefit from a statistical analysis to quantify it, especially if random components or heuristics are involved. The most common questions for a statistical analysis are:

Do the experimental results support a given hypothesis, or is the measured difference possibly just random noise? This question is addressed by hypothesis testing.

How large is a measured effect, i. e., which underlying real differences are plausible given the experimental measurements? This calls for parameter estimation.

If we want to answer the first two questions satisfactorily, how many data points do we need? This is commonly called power analysis. As this issue affects the planning of experiments, we discussed it already in Section 4.3.
The two types of hypotheses we discussed in Section 4.1 roughly relate to hypothesis tests and parameter estimation. Papers proposing new algorithms mostly have hypotheses of the first type: The newly proposed algorithm is faster, yields solutions with better quality or is otherwise advantageous.
In many empirical sciences, the predominant notion has been null hypothesis significance testing (NHST), in which a statistic of the measured data is compared with the distribution of this statistic under the null hypothesis, an assumed scenario where all measured differences are due to chance. Previous works on statistical analysis of experimental algorithms, including the excellent overviews of McGeoch (2012) and Coffin and Saltzman (2000), use this paradigm.
Due to some limitations of the NHST model, a shift towards parameter estimation Anderson et al. (2000); Trafimow and Marks (2015); Wasserstein and Lazar (2016); Lash (2017) and also Bayesian methods Cumming (2014); Kruschke and Liddell (2018); Murtaugh (2014) is taking place in the statistical community.
We aim at applying the current state of the art in statistical analysis to algorithm engineering, but since no firm consensus has been reached Wasserstein and Lazar (2016), we discuss both frequentist and Bayesian approaches. As an example for null hypothesis testing, we investigate whether the KADABRA and the RK algorithms give equivalent results for the same input graphs. While this equivalence test could also be performed using Bayesian inference, we use a null hypothesis test for illustration purposes.
It is easy to see from running time plots (e. g., Figure 6) that KADABRA is faster than RK. To quantify this speedup, we use Bayesian methods to infer plausible values for the algorithmic speedup and different scaling behavior. We further evaluate the influence of the graph diameter on the running time.
7.1 Statistical Model
A statistical model defines a family of probability distributions over experimental measurements. Many experimental running times have some degree of randomness: caching effects, network traffic and the influence of other processes contribute to this, and sometimes the tested algorithm is itself randomized. Even a deterministic implementation on deterministic hardware can only be tested on a finite set of input data, representing a random draw from the infinite set of possible inputs. A model encodes our assumptions about how these sources of randomness combine to yield the distribution of outputs we see. If, for example, the measurement error consists of many additive and independent parts, the central limit theorem justifies a normal distribution.
To enable useful inferences, the model should have at least one free parameter corresponding to a quantity of interest. Any model is necessarily an abstraction, as expressed in the aphorism “all models are wrong, but some are useful” Box (1976).
7.1.1 Example
Suppose we want to investigate whether KADABRA scales better than RK on inputs of increasing size. Figure 6 shows running times of KADABRA and RK with respect to the instance size. These are spread out, implying either that the running time is highly variable, or that it depends on aspects other than the instance size.
A companion Jupyter notebook including this example and the following inferences is included in the supplementary materials.
In general, running times are modeled as functions of the input size, sometimes with additional properties of the input. The running time of KADABRA, for example, possibly depends on the diameter of the input graph. Algorithms in network analysis often have polynomial running times where a reasonable upper bound for the leading exponent can be found. Thus, the running time can be modeled as such a polynomial, written as $T(n) = a_k n^k + \dots + a_1 n + a_0$, with the unknown coefficients of the polynomial being the free parameters of the model.
However, a large number of free model parameters makes inference difficult. This includes the danger of overfitting, i. e., inferring parameter values that precisely fit the measured results but are unlikely to generalize.
To evaluate the scaling behavior, it is thus often more useful to focus on the largest exponent instead. Let $T_A(n) = \alpha_A n^{\beta_A} \epsilon$ be the running time of the implementation of RK and $T_B(n) = \alpha_B n^{\beta_B} \epsilon$ the running time of the implementation of KADABRA on inputs of size $n$, with unknown parameters $\alpha_A$, $\beta_A$, $\alpha_B$ and $\beta_B$. (Footnote 38: For estimating asymptotic upper bounds, see the work of McGeoch et al. (2000) on curve bounding.)
The term $\epsilon$ explicitly models the error; it can be due to variability in inputs (some instances might be harder to process than others of the same size) and measurement noise (some runs suffer from interference or caching effects). Since harder inputs are not constrained to additive difficulty and longer runs have more opportunity to experience adverse hardware effects, we choose a multiplicative error term. (Footnote 39: Summands with smaller exponents are also subsumed within the error term. If they have a large effect, an additive error might reflect this more accurately.) Taking the logarithms of both sides makes the equations linear:
(4)  $\log T_A(n) = \log \alpha_A + \beta_A \log n + \log \epsilon$
(5)  $\log T_B(n) = \log \alpha_B + \beta_B \log n + \log \epsilon$
A commonly chosen distribution for additive errors is Gaussian, justified by the central limit theorem Pólya (1920). Since longer runs have more exposure to possibly adverse hardware or network effects, we consider multiplicative error terms to be more likely and use a log-normal distribution. (Footnote 40: In some cases, it might even make sense to use a hierarchical model with two different error terms: one for the input instances, the other one for the differences on the same input.) Since the logarithm of a log-normally distributed variable follows a normal (Gaussian) distribution, Equations 4 and 5 can be rewritten as normally distributed random variables:
(6)  $\log T_A(n) \sim \mathcal{N}(\log \alpha_A + \beta_A \log n, \sigma_A^2)$
(7)  $\log T_B(n) \sim \mathcal{N}(\log \alpha_B + \beta_B \log n, \sigma_B^2)$
This form shows more intuitively the idea that a statistical model is a set of probability measures on experimental outcomes. In this example, the set of probability measures modeling the performance of an algorithm is parametrized by the tuple $(\alpha_A, \beta_A, \sigma_A)$ or $(\alpha_B, \beta_B, \sigma_B)$, respectively.
A problem remains if the input instances are very inhomogeneous. In that case, both $T_A$ and $T_B$ have a large estimated variance ($\sigma_A^2$ and $\sigma_B^2$). Even if the variability of performance on the same instance is small, any genuine performance difference might be wrongly attributed to the large inter-instance variance. This issue can be addressed with a combined model as recommended by Coffin and Saltzman (2000), in which the running time of $B$ is a function of the running time of $A$ on the same instance $x$:
(8)  $\log T_B(x) \sim \mathcal{N}(\alpha + \beta \log T_A(x), \sigma^2)$
Here, $\alpha$ is the intercept on the logarithmic scale, $\beta$ the relative scaling exponent and $\sigma^2$ the variance of the combined model.
For a more extensive overview of modeling experimental algorithms, see McGeoch (2012).
7.1.2 Model Fit
Several methods exist to infer parameters when fitting a statistical model to experimental data. The most well-known is arguably the maximum-likelihood fit, choosing parameters for a distribution that give the highest probability for the observed measurements. In the case of a linear model with a normally distributed error, this is equivalent to a least-squares fit Charnes et al. (1976). Such a fit yields a single estimate for plausible parameter values.
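For a power-law running time model that is linear on the logarithmic scale, such a least-squares fit can be sketched in a few lines. The function name and the synthetic, noise-free data are our assumptions for illustration; with real measurements, the fitted values would of course only be estimates.

```python
import math

def fit_power_law(sizes, times):
    """Least-squares fit of log(t) = log(a) + b * log(n); returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                        # slope = scaling exponent
    a = math.exp(mean_y - b * mean_x)    # intercept = constant factor
    return a, b

# Synthetic data generated from t = 2e-6 * n^1.5 (no noise).
sizes = [1000, 2000, 4000, 8000]
times = [2e-6 * n ** 1.5 for n in sizes]

a, b = fit_power_law(sizes, times)
print(a, b)
```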
7.2 Formalizing Hypotheses
Previously (Section 4.1), we discussed two types of hypotheses. The first type is that a new algorithm is better in some aspect than the state of the art. The second type claims insight into how the behavior of an algorithm depends on settings and properties of the input. We now express these same hypotheses more formally, as statements about the parameters in a statistical model. The two types are then related to the statistical approaches of hypothesis testing and parameter estimation.
Considering the scaling model presented in Equation (8) and the question whether implementation A scales better than implementation B, the parameter in question is the exponent . The hypothesis that A scales better than B is equivalent to ; both scaling the same implies . Note that the first hypothesis does not imply a fully specified probability distribution on , merely restricting it to the negative halfplane. The hypothesis of does completely specify such a distribution (i. e., a point mass of probability 1 at 0), which is useful for later statistical inference.
7.3 Frequentist Methods
Frequentist statistics defines the probability of an experimental outcome as the limit of its relative frequency when the number of experiments tends towards infinity. This is usually referred to as the classical approach.
7.3.1 Null Hypothesis Significance Testing
As discussed above, null hypothesis significance testing evaluates a proposed hypothesis by contrasting it with a null hypothesis, which states that no true difference exists and the observed difference is due to chance. As an example application, we compare the approximation quality of KADABRA and RK. From theory, we would expect higher approximation errors from KADABRA, since it samples fewer paths. We investigate whether the measured empirical difference supports this theory, or could also be explained with random measurement noise, i. e., with the null hypothesis (denoted with $H_0$) of both algorithms having the same distribution of errors and just measuring higher errors from KADABRA by coincidence. Here it is an advantage that the proposed alternate hypothesis (i. e., the distributions are meaningfully different) does not need an explicit modeling of the output distribution, as the distribution of differences in our case does not follow an easily parameterizable distribution.
When deciding whether to reject a null hypothesis (and by implication, support the alternate hypothesis), it is possible to make one of two errors: (i) rejecting a null hypothesis, even though it is true (false positive), (ii) failing to reject a false null hypothesis (false negative). In such probabilistic decisions, the error rate deemed acceptable often depends on the associated costs for each type of error. For scientific research, Fisher suggested that a false positive rate of 5% is acceptable Fisher (1992), and most fields follow that suggestion. This threshold is commonly denoted as $\alpha$.
Controlling for the first kind of error, the p-value is defined as the probability that a summary statistic at least as extreme as the observed one would have occurred given the null hypothesis Young et al. (2005). Please note that this is not the probability $P(H_0 \mid \text{observations})$, i. e., the probability that the null hypothesis is true given the observations.
For practical purposes, a wide range of statistical hypothesis tests have been developed, which aggregate measurements to a summary statistic and often require certain conditions. For an overview of which of them are applicable in which situation, see the excellent textbook of Young and Smith Young et al. (2005). In our example the paired results stem from very different instances and are clearly not normally distributed. We thus avoid the common t-test and use a Wilcoxon test of pairs Wilcoxon (1945) from the SciPy Jones et al. (2001–) stats module; for the resulting p-value, see cell 14 of the statistics notebook in the supplementary materials. Since this p-value is smaller than our threshold of 0.05, one would thus say that this result allows us to reject the null hypothesis at the level of $\alpha = 0.05$. Such a difference is commonly called statistically significant. To decide whether it is actually significant in practice, we look at the magnitude of the difference: The error of KADABRA is about one order of magnitude higher for most instances, which we would call significant.
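To illustrate what a paired Wilcoxon test computes, the following pure-Python sketch implements the signed-rank statistic with an exact two-sided p-value obtained by enumerating all sign assignments. It is meant for small samples only and is not the SciPy implementation used in the notebook; the approximation errors below are hypothetical.

```python
from itertools import product

def signed_ranks(x, y):
    """Differences x_i - y_i (zeros dropped) and the ranks of their
    absolute values, using average ranks for ties."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while (j + 1 < len(order)
               and abs(diffs[order[j + 1]]) == abs(diffs[order[i]])):
            j += 1
        average_rank = (i + j) / 2 + 1   # mean of 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return diffs, ranks

def wilcoxon_exact(x, y):
    """Return (W+, exact two-sided p-value) by enumerating all 2^n signs."""
    diffs, ranks = signed_ranks(x, y)
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    center = sum(ranks) / 2
    observed = abs(w_plus - center)
    extreme = total = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for s, r in zip(signs, ranks) if s)
        total += 1
        if abs(w - center) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / total

# Hypothetical approximation errors on eight instances.
kadabra_err = [0.010, 0.020, 0.015, 0.030, 0.012, 0.025, 0.018, 0.040]
rk_err = [0.001, 0.002, 0.002, 0.004, 0.001, 0.003, 0.002, 0.005]

w_plus, p_value = wilcoxon_exact(kadabra_err, rk_err)
print(w_plus, p_value)
```

Since the enumeration grows as 2^n, a library routine is the better choice for anything beyond a handful of pairs.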
Multiple Comparisons
The NHST approach guarantees that of all false hypotheses, the expected fraction that seem significant when tested is at most $\alpha$. Often though, a publication tests more than one hypothesis. For methods to address this problem and adjust the probability that any null hypothesis in an experiment is falsely rejected (also called the familywise error rate), see the Bonferroni correction Dunn (1961) and the Holm procedure Holm (1979).
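Both corrections are simple enough to sketch directly. The p-values below are hypothetical; the example is chosen so that the Holm procedure rejects more null hypotheses than the Bonferroni correction at the same familywise error rate.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0_i if p_i <= alpha / m, where m is the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm(p_values, alpha=0.05):
    """Holm's step-down procedure: compare the k-th smallest p-value
    (k = 0, 1, ...) with alpha / (m - k) and stop at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject

# Hypothetical p-values of four tests in one publication.
p_values = [0.01, 0.04, 0.02, 0.005]
print(bonferroni(p_values))
print(holm(p_values))
```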
7.3.2 Confidence Intervals
One of the main criticisms of NHST is that it ignores effect sizes; the magnitude of the p-value says nothing about the magnitude of the effect. More formally, for every true effect with size $\epsilon > 0$ and every significance level $\alpha$, there exists an $n_0$ so that all experiments containing $n \geq n_0$ measurements are likely to reject the null hypothesis at level $\alpha$. Following this, Coffin and Saltzman (2000) caution against overly large data sets, a recommendation which comes with its own set of problems.
The statistical response to the problem of small but meaningless p-values with large data sets is a shift away from hypothesis testing to parameter estimation. Instead of asking whether the difference between populations is over a threshold, the difference is quantified as a parameter in a statistical model, see also Section 7.2. As Kruschke and Liddell (2018) put it, the null hypothesis test asks whether the null value of a parameter would be rejected at a given significance level. A confidence interval merely asks which other parameter values would not be rejected Kruschke and Liddell (2018). We refer to Smithson (2003); Neyman (1937) for a formal definition and usage guidelines.
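As a rough illustration of interval estimation, the sketch below computes a percentile bootstrap interval for a mean. Note that the bootstrap is a technique we substitute here for simplicity; it only approximates a confidence interval and is not the classical construction of Neyman (1937). The log-speedup values, the number of resamples and the seed are hypothetical.

```python
import random
from statistics import mean

def bootstrap_ci(values, level=0.95, resamples=2000, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)        # fixed seed for reproducibility
    n = len(values)
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(resamples))
    tail = (1.0 - level) / 2.0
    low_index = int(tail * resamples)
    high_index = int((1.0 - tail) * resamples) - 1
    return means[low_index], means[high_index]

# Hypothetical per-instance log-speedups of one algorithm over another.
log_speedups = [2.1, 2.5, 1.8, 2.9, 2.2, 2.4, 2.0, 2.7]

low, high = bootstrap_ci(log_speedups)
print(mean(log_speedups), (low, high))
```

If the interval excludes 0, a speedup of exactly zero on the logarithmic scale is not among the plausible values.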
7.4 Bayesian Inference
Bayesian statistics defines the probability of an experimental outcome as the uncertainty of knowledge about it. Bayes’s theorem gives a formal way to update probabilities on new observations. In its simplest form for discrete events and hypotheses, it can be given as:
(9)  $P(H \mid O) = \frac{P(O \mid H) \, P(H)}{P(O)}$
When observing outcome $O$, the probability of hypothesis $H$ is proportional to the probability of $O$ conditioned on $H$ multiplied by the prior probability $P(H)$. The conditional probability $P(O \mid H)$ of an outcome given a hypothesis is also called the likelihood of $H$. The prior probability reflects the estimation before making observations, based on background knowledge.
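A discrete Bayesian update can be sketched as follows. The two hypotheses, their priors and the likelihoods of the observed outcome are hypothetical numbers chosen only to make the arithmetic easy to follow.

```python
def posterior(priors, likelihoods):
    """priors[h] = P(h), likelihoods[h] = P(observation | h);
    returns P(h | observation) for every hypothesis h."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(unnormalized.values())   # P(observation)
    return {h: value / evidence for h, value in unnormalized.items()}

# Hypothetical example: equal priors, the observation is four times
# as likely under H1 as under H0.
priors = {"H0": 0.5, "H1": 0.5}
likelihoods = {"H0": 0.1, "H1": 0.4}

post = posterior(priors, likelihoods)
print(post)
```

With these numbers, the posterior probability of H1 rises from 0.5 to 0.8.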
Extended to continuous distributions, Bayes’s rule allows combining a statistical model with a set of measurements and a prior probability distribution over parameters to yield a posterior probability distribution over parameters. This posterior distribution reflects both the uncertainty in measurements and possible prior knowledge about parameters. A thorough treatment is given by Gelman et al. (2013).
For our example model introduced in Section 7.1, we model the running times of implementation B as a function of the time of implementation A, as done in Equation (8):
$\log T_B(x) \sim \mathcal{N}(\alpha + \beta \log T_A(x), \sigma^2)$
This defines the likelihood function as a Gaussian noise with variance $\sigma^2$. Since this variance is unknown, we keep it as a model parameter. As we have no specific prior information about plausible values of $\alpha$ and $\beta$, we define the following vague prior distributions (with the values used in the listing of Figure 8):
$\alpha \sim \mathcal{N}(0, 10^2)$, $\beta \sim \mathcal{N}(0, 10^2)$, $\sigma \sim \text{Inv-Gamma}(1, 1)$
The first two distributions represent our initial, conservative belief that the two implementations are equivalent in terms of scaling behavior and constants. We model the variance of the observation noise as an inverse gamma distribution instead of a normal distribution, since a variance cannot be negative.
Figure 8 shows a listing to compute and show the posterior distribution of these three parameters using SciPy Jones et al. (2001–) and PyMC3 Salvatier et al. (2016).

    import pymc3 as pm
    from scipy import optimize

    # logTimeRK and logTimeKadabra hold the logarithmic running times of RK
    # and KADABRA per instance; they are loaded in the companion notebook.
    basic_model = pm.Model()

    with basic_model:
        # vague priors for intercept, scaling exponent and noise
        alpha = pm.Normal('alpha', mu=0, sd=10)
        beta = pm.Normal('beta', mu=0, sd=10)
        sigma = pm.InverseGamma('sigma', alpha=1, beta=1)

        # expected log running time of KADABRA given that of RK
        mu = alpha + beta*logTimeRK

        # likelihood of the observed log running times
        Y_obs = pm.Normal('Y_obs', mu=mu, sd=sigma, observed=logTimeKadabra)

    with basic_model:
        # approximate posterior distribution with 10000 samples
        trace = pm.sample(10000)

    pm.summary(trace)
Parameter | HPD 2.5 | Mean | HPD 97.5 |
$\alpha$ | -6.87 | -5.22 | -3.58 |
$\beta$ | 0.70 | 1.01 | 1.29 |
$\sigma$ | 0.80 | 1.13 | 1.54 |

Results are listed in Table 4. The interval of Highest Probability Density (HPD) is constructed to contain 95% of the probability mass of the respective posterior distribution. It is the Bayesian equivalent of the confidence interval (Section 7.3.2) and also called credible interval. The most probable values for $\alpha$ and $\beta$ are -5.22 and 1.01, respectively. Taking measurement uncertainty into account, the true values are within the intervals $[-6.87, -3.58]$ and $[0.70, 1.29]$, respectively, with 95% probability. This shows that KADABRA is faster on average, but results about the relative scaling behavior are inconclusive. While the average for $\beta$ is 1.01 and suggests similar scaling, the interval excludes neither the hypothesis that KADABRA scales better nor the hypothesis that it scales worse.
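To make the construction of such an interval concrete, the following sketch computes the narrowest interval containing a given share of posterior samples. The function is our own minimal version, not the routine used by PyMC3, and the samples are synthetic draws from a normal distribution with hypothetical mean 1.0 and standard deviation 0.15.

```python
import math
import random

def hpd_interval(samples, mass=0.95):
    """Narrowest interval that contains a fraction `mass` of the samples."""
    s = sorted(samples)
    n = len(s)
    k = math.ceil(mass * n)          # number of samples inside the interval
    # slide a window of k consecutive samples and keep the narrowest one
    best = min(range(n - k + 1), key=lambda i: s[i + k - 1] - s[i])
    return s[best], s[best + k - 1]

# Synthetic stand-in for posterior samples of a parameter.
rng = random.Random(1)
draws = [rng.gauss(1.0, 0.15) for _ in range(20000)]

low, high = hpd_interval(draws)
print(low, high)
```

For a symmetric, unimodal posterior such as this one, the result is close to the central interval of mean plus or minus 1.96 standard deviations.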
7.4.1 Equivalence Testing
Computing the highest density interval can also be used for hypothesis testing. In Section 7.3.1 we discussed how to show that two distributions are different. Sometimes, though, we are interested in showing that they are sufficiently similar. An example would be wanting to show that two sampling algorithms give the same distribution of results. This is not easily possible within the NHST paradigm, as the two answers of a classical hypothesis test are “probably different” and “not sure”.
This problem can be solved by calculating the posterior distribution of the parameter of interest and defining a region of practical equivalence (ROPE), which covers all parameter values that are effectively indistinguishable from the null value (0 for a difference, 1 for the exponent in our example). If a chosen share of the posterior probability mass (e. g., 95%) is in the region of practical equivalence, the inferred parameter is practically indistinguishable from the null value with at least that probability. If the same share of the probability mass is outside the ROPE, the parameter is meaningfully different with at least that probability. If the intervals overlap, the observed data is insufficient to come to either conclusion. In our example, the scaling behavior of two algorithms is equivalent if the inferred exponent modeling their relative running times is more or less 1. We could define practical equivalence as a small tolerance around 1, resulting in a region of practical equivalence centered at 1. The interval containing 95% of the probability mass for $\beta$ (see Table 4) is neither completely inside nor completely outside such a narrow region, implying that more experiments are needed to come to a conclusion.
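The decision rule can be sketched as a comparison of two intervals. The HPD interval for the exponent is taken from Table 4, whereas the ROPE bounds of 0.95 and 1.05 are a hypothetical choice made only for this illustration.

```python
def rope_decision(hpd, rope):
    """Compare a credible interval with a region of practical equivalence."""
    low, high = hpd
    rope_low, rope_high = rope
    if rope_low <= low and high <= rope_high:
        return "equivalent"      # interval lies completely inside the ROPE
    if high < rope_low or low > rope_high:
        return "different"       # interval lies completely outside the ROPE
    return "undecided"           # partial overlap: more data needed

beta_hpd = (0.70, 1.29)   # 95% HPD interval of the exponent from Table 4
rope = (0.95, 1.05)       # hypothetical region of practical equivalence

print(rope_decision(beta_hpd, rope))
```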
7.4.2 Bayes Factor
Bayes factors are a way to compare the relative fit of several models and hypotheses to a given set of observations. While NHST (Section 7.3.1) evaluates the fit of observations to the null hypothesis, and confidence intervals and credible intervals infer the range of plausible values of parameters in a model, the Bayes factor between two hypotheses relates the ratio of their posterior probabilities to the ratio of their prior probabilities. For observations $O$ and hypotheses $H_0$ and $H_1$, the probability ratio of the hypotheses is then:
$\frac{P(H_1 \mid O)}{P(H_0 \mid O)} = \frac{P(O \mid H_1)}{P(O \mid H_0)} \cdot \frac{P(H_1)}{P(H_0)}$
Crucially, the ratio of prior probabilities, which is subjective, is a separate factor from the ratio of likelihoods, which is objective. This objective part, the ratio $P(O \mid H_1) / P(O \mid H_0)$, is called the Bayes factor.
The first obvious difference to NHST is that calculating a Bayes Factor consists of comparing the fit of both hypotheses to the data, not only the null hypothesis. It thus requires that an alternate hypothesis is stated explicitly, including a probability distribution over observable outcomes. If the alternative hypothesis is meant to be vague, for example just that two distributions are different, an uninformative prior with a high variance should be used. However, specific hypotheses like “the new algorithm is at least 20% faster” can also be modeled explicitly.
This explicit modeling allows inference in both directions; using NHST, on the other hand, one can only ever infer that a null hypothesis is unlikely or that the data is insufficient to infer this. Using Bayes factors, it is possible to infer that $H_1$ is more probable than $H_0$, or that the observations are insufficient to support this statement, or that $H_0$ is more probable than $H_1$.
In the previous running time analysis, we hypothesized a better scaling behavior, which was not confirmed by the experimental measurements. However, the graphs with high diameter are larger than average and a cursory complexity analysis of the KADABRA algorithm suggests that the diameter has an influence on the running time. Might the relative scaling of KADABRA and RK depend on the diameter?
To answer this question, we compare the fit of two models: The first model is the same as discussed earlier (Equation 8); it models the expected logarithmic running time of KADABRA on instance $x$ as $\alpha + \beta \log T_A(x)$, where $T_A(x)$ is the running time of RK on the same instance. The second model has the additional free parameter $\gamma$, controlling the interaction between the diameter $d(x)$ of the instance and the running times:
(10)  $\log T_B(x) \sim \mathcal{N}(\alpha + \beta \log T_A(x) + \gamma \log d(x), \sigma^2)$
Comparing, for example, the errors of a least-squares fit of the two models would not give much insight, since including an additional free parameter in a model almost always results in a better fit. This does not have to mean that the new parameter captures something interesting.
Instead, we calculate the Bayes factor, for which we integrate over the prior distribution in each model. Since this integral over the prior distribution also includes values of the new parameter which are not a good fit, models with too many additional parameters are automatically penalized. Our two models are similar, thus we can phrase them as a hierarchical model with the additional parameter controlled by a boolean random variable, see Listing 9.
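
    import pymc3 as pm
    from scipy import optimize

    # logTimesRK, logDiameters and logTimeKadabra hold the logarithmic
    # measurements per instance; they are loaded in the companion notebook.
    basic_model = pm.Model()

    with basic_model:
        # equal prior probability for both models
        pi = (0.5, 0.5)

        # indicator variable: 1 selects the model including the diameter
        selected_model = pm.Bernoulli('selected_model', p=pi[1])

        alpha = pm.Normal('alpha', mu=0, sd=10)
        beta = pm.Normal('beta', mu=0, sd=10)
        gamma = pm.Normal('gamma', mu=0, sd=10)
        sigma = pm.InverseGamma('sigma', alpha=1, beta=1)

        # the diameter term only contributes if selected_model is 1
        mu = alpha + beta*logTimesRK + gamma*logDiameters*selected_model

        Y_obs = pm.Normal('Y_obs', mu=mu, sd=sigma, observed=logTimeKadabra)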
The posterior for the indicator variable selected_model is , yielding a Bayes factor of in support of including the diameter in the model. We can thus conclude that it is very probable that the diameter has an influence on the relative scaling between KADABRA and RK. The inferred mean of the variable is 0.79, meaning that higher diameter values lead to higher running times.
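The step from the posterior of the indicator variable to a Bayes factor can be sketched as follows. We assume that the posterior probability of the extended model has been estimated as the mean of the sampled indicator values; the value 0.9 below is a hypothetical placeholder, not the result of our experiments.

```python
def bayes_factor_from_indicator(posterior_prob_m1, prior_prob_m1=0.5):
    """Bayes factor in favor of model 1, i.e. posterior odds divided by
    prior odds of the model indicator."""
    posterior_odds = posterior_prob_m1 / (1.0 - posterior_prob_m1)
    prior_odds = prior_prob_m1 / (1.0 - prior_prob_m1)
    return posterior_odds / prior_odds

# Hypothetical posterior probability of the model including the diameter.
bf = bayes_factor_from_indicator(0.9)
print(bf)
```

With equal prior probabilities for both models, as in the listing, the Bayes factor equals the posterior odds.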
7.5 Recommendations
Which statistical method is best depends on what needs to be shown. For almost all objectives, both Bayesian and frequentist methods exist, see Table 5. In experimental algorithmics, most hypotheses can be expressed as statements constraining parameters in a statistical model, e. g., “the average speedup of A over B is at least 20%”. Thus, in contrast to earlier statistical practice, we recommend approaching the evaluation of hypotheses by parameter estimation and using the classical hypothesis tests only when parameter estimation is not possible. The additional information gained by parameter estimates has a couple of advantages. For example, when using only hypothesis tests, small differences in large data sets can yield impressively small p-values and thus claims of statistical significance even when the difference is irrelevant in practice and the significance is statistical only Coffin and Saltzman (2000). Using confidence intervals (Section 7.3.2) or the posterior distribution together with a region of practical equivalence avoids this problem Kruschke and Liddell (2018).
Objective | Frequentist | Bayesian |
Estimation | Confidence Interval | Posterior |
Hypothesis (Equivalence) | TOST | Posterior + ROPE or BF |
Hypothesis (Difference) | NHST | Posterior + ROPE or BF |
Model Selection | AIC | Bayes Factor (BF) |
Below is a rough guideline (also shown in Figure 10) outlining our method selection process. It favors Bayesian methods, since the Python library PyMC3 Salvatier et al. (2016) offering them fits well into our workflow.

1. Define a model that captures the parts of the measured results that interest you, see Section 7.1.

2. Using confidence intervals (Section 7.3.2) or credible intervals (Section 7.4), estimate plausible values for the model parameters, including their uncertainty. If this proves intractable and you are only interested in whether a measured difference is due to chance, use a significance test instead (Section 7.3.1).

3. If you want to show that two distributions (of outcomes of algorithms) are similar, use an equivalence test, in which you define a region of practical equivalence.

4. If you want to show that two distributions (of outcomes of algorithms) are different, you may also use an equivalence test or, alternatively, a significance test.

5. If you want to compare how well different hypotheses explain the data, for example whether the diameter has an influence on relative scaling, compare the relative fit using a Bayes factor (Section 7.4.2).
Needless to say, these are only recommendations.
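The equivalence test mentioned in the guideline can be sketched as two one-sided tests (TOST, the frequentist entry in Table 5). The following is a minimal illustration under stated assumptions: the data are synthetic, the function name `tost` and the equivalence margin `delta` are hypothetical, and the test statistics are treated as approximately normal, which is only reasonable for large samples. A statistics package would typically use t-distributions instead.

```python
import math
import random


def tost(a, b, delta, alpha=0.05):
    """Two one-sided tests for equivalence of means within +/- delta.

    Returns (p_value, equivalent). Assumption: large samples, so the test
    statistics are approximately standard normal.
    """
    mean_a = math.fsum(a) / len(a)
    mean_b = math.fsum(b) / len(b)
    var_a = math.fsum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = math.fsum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    diff = mean_b - mean_a
    se = math.sqrt(var_a / len(a) + var_b / len(b))

    # H0 (lower): diff <= -delta; rejected if diff is well above -delta.
    p_lower = 0.5 * math.erfc((diff + delta) / (se * math.sqrt(2.0)))
    # H0 (upper): diff >= +delta; rejected if diff is well below +delta.
    p_upper = 0.5 * math.erfc((delta - diff) / (se * math.sqrt(2.0)))

    p_value = max(p_lower, p_upper)  # both one-sided tests must reject
    return p_value, p_value < alpha


# Synthetic solution-quality scores of three hypothetical algorithms.
rng = random.Random(7)
n = 5_000
quality_a = [rng.gauss(1.000, 0.05) for _ in range(n)]
quality_b = [rng.gauss(1.002, 0.05) for _ in range(n)]  # nearly identical to A
quality_c = [rng.gauss(1.020, 0.05) for _ in range(n)]  # clearly shifted

delta = 0.01  # hypothetical region of practical equivalence: +/- 0.01
p_ab, equivalent_ab = tost(quality_a, quality_b, delta)
p_ac, equivalent_ac = tost(quality_a, quality_c, delta)

print(f"A vs. B: p = {p_ab:.2e}, equivalent: {equivalent_ab}")
print(f"A vs. C: p = {p_ac:.2e}, equivalent: {equivalent_ac}")
```

Equivalence is claimed only if both one-sided null hypotheses are rejected, which is why the larger of the two p-values is reported. The margin `delta` has to be justified by the application before looking at the data; it plays the same role as the region of practical equivalence in the Bayesian setting.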
8 Conclusions
Besides setting guidelines for experimental algorithmics, this paper provides a tool for simplifying the typical experimental workflow. We hope that both are useful for the envisioned target group – and beyond, of course. We may have omitted material some consider important. This happened on purpose to keep the exposition reasonably concise. To account for future demands, we could imagine an evolving “community version” of the paper, updated with the help of new coauthors at regular intervals. That is why we invite the community to contribute comments, corrections and/or text.^{41}
^{41} The source files of this paper can be found at https://github.com/humacsy/aetutorialpaper. We encourage readers to post suggestions via GitHub issues and welcome pull requests.
Let us conclude by reminding the reader: most of the guidelines in the paper are neither scientific laws nor set in stone. Always apply common sense to adapt a guideline to your concrete situation! Also, we cannot claim that we always followed the guidelines in the past. But this was all the more motivation for us to write this paper, to develop SimexPal and to have a standard to follow – we hope that the community shares this motivation.
Acknowledgements.
We thank Michael Hamann and Sebastian Schlag for their timely and thorough feedback on a working draft of the paper. We also thank Alexander Meier and Dimitri Ghouse for valuable advice on statistics. A subset of the authors was partially supported by grant ME 3619/32 within the German Research Foundation (DFG) Priority Programme 1736 Algorithms for Big Data.
References
 Bogdanov and Trevisan (2006) Bogdanov, A.; Trevisan, L. Average-Case Complexity. Foundations and Trends® in Theoretical Computer Science 2006, 2, 1–106. doi:10.1561/0400000004.
 Spielman and Teng (2001) Spielman, D.; Teng, S.H. Smoothed Analysis of Algorithms: Why the Simplex Algorithm Usually Takes Polynomial Time. Proceedings of the Thirty-third Annual ACM Symposium on Theory of Computing; ACM: New York, NY, USA, 2001; STOC ’01, pp. 296–305. doi:10.1145/380752.380813.
 Heule et al. (2018) Heule, M.J.H.; Järvisalo, M.J.; Suda, M., Eds. Proceedings of SAT Competition 2018; Solver and Benchmark Descriptions, 2018.
 Applegate et al. (2007) Applegate, D.L.; Bixby, R.E.; Chvatal, V.; Cook, W.J. The Traveling Salesman Problem: A Computational Study (Princeton Series in Applied Mathematics); Princeton University Press: Princeton, NJ, USA, 2007.
 Mehlhorn and Sanders (2008) Mehlhorn, K.; Sanders, P. Algorithms and Data Structures: The Basic Toolbox; SpringerLink: Springer eBooks, Springer, 2008.
 Puglisi et al. (2007) Puglisi, S.J.; Smyth, W.F.; Turpin, A.H. A Taxonomy of Suffix Array Construction Algorithms. ACM Comput. Surv. 2007, 39. doi:10.1145/1242471.1242472.
 Johnson (1999) Johnson, D.S. A theoretician’s guide to the experimental analysis of algorithms. Data Structures, Near Neighbor Searches, and Methodology: Fifth and Sixth DIMACS Implementation Challenges, Proceedings of a DIMACS Workshop, USA, 1999, 1999, pp. 215–250.
 MullerHannemann and Schirra (2010) MullerHannemann, M.; Schirra, S., Eds. Algorithm Engineering: Bridging the Gap Between Algorithm Theory and Practice; SpringerVerlag: Berlin, Heidelberg, 2010.
 Moret (2002) Moret, B., Towards a discipline of experimental algorithmics; American Mathematical Society, 2002; pp. 197–213. doi:10.1090/dimacs/059/10.
 Sanders (2010) Sanders, P. Algorithm Engineering  An Attempt at a Definition Using Sorting as an Example. ALENEX. SIAM, 2010, pp. 55–61.
 Bast et al. (2016) Bast, H.; Delling, D.; Goldberg, A.; MüllerHannemann, M.; Pajor, T.; Sanders, P.; Wagner, D.; Werneck, R.F. Route planning in transportation networks. In Algorithm engineering; Springer, 2016; pp. 19–80.
 Applegate et al. (2006) Applegate, D.L.; Bixby, R.E.; Chvatal, V.; Cook, W.J. The traveling salesman problem: a computational study; Princeton university press, 2006.
 Brandes et al. (2013) Brandes, U.; Robins, G.; McCranie, A.; Wasserman, S. What is network science? Network Science 2013, 1, 1–15. doi:10.1017/nws.2013.2.
 Newman (2018) Newman, M. Networks; Oxford university press, 2018.
 Coffin and Saltzman (2000) Coffin, M.; Saltzman, M.J. Statistical Analysis of Computational Tests of Algorithms and Heuristics. INFORMS Journal on Computing 2000, 12, 24–44.
 Borassi and Natale (2016) Borassi, M.; Natale, E. KADABRA is an ADaptive Algorithm for Betweenness via Random Approximation. LIPIcsLeibniz International Proceedings in Informatics. Schloss DagstuhlLeibnizZentrum fuer Informatik, 2016, Vol. 57.
 Staudt et al. (2016) Staudt, C.L.; Sazonovs, A.; Meyerhenke, H. NetworKit: A tool suite for largescale complex network analysis. Network Science 2016, 4, 508–530.
 Boldi and Vigna (2014) Boldi, P.; Vigna, S. Axioms for centrality. Internet Mathematics 2014, 10, 222–262.
 Freeman (1977) Freeman, L.C. A set of measures of centrality based on betweenness. Sociometry 1977, pp. 35–41.
 Brandes (2001) Brandes, U. A faster algorithm for betweenness centrality. Journal of mathematical sociology 2001, 25, 163–177.
 Bader et al. (2007) Bader, D.A.; Kintali, S.; Madduri, K.; Mihail, M. Approximating betweenness centrality. International Workshop on Algorithms and Models for the WebGraph. Springer, 2007, pp. 124–137.
 Geisberger et al. (2008) Geisberger, R.; Sanders, P.; Schultes, D. Better Approximation of Betweenness Centrality. Proceedings of the Meeting on Algorithm Engineering & Experiments; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2008; pp. 90–100.
 Riondato and Kornaropoulos (2016) Riondato, M.; Kornaropoulos, E.M. Fast approximation of betweenness centrality through sampling. Data Mining and Knowledge Discovery 2016, 30, 438–475.
 Riondato and Upfal (2018) Riondato, M.; Upfal, E. ABRA: Approximating betweenness centrality in static and dynamic graphs with rademacher averages. ACM Transactions on Knowledge Discovery from Data (TKDD) 2018, 12, 61.
 Kunegis (2013) Kunegis, J. Konect: the koblenz network collection. Proceedings of the 22nd International Conference on World Wide Web. ACM, 2013, pp. 1343–1350.
 Leskovec and Krevl (2014) Leskovec, J.; Krevl, A. SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data, 2014.
 Bader et al. (2012) Bader, D.; Meyerhenke, H.; Sanders, P.; Wagner, D., Eds. Proc. of the 10th DIMACS Implementation Challenge, Contemporary Mathematics. American Mathematical Society, 2012.
 Davis and Hu (2011) Davis, T.A.; Hu, Y. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS) 2011, 38, 1.
 Boldi and Vigna (2004) Boldi, P.; Vigna, S. The WebGraph Framework I: Compression Techniques. Proc. of the Thirteenth International World Wide Web Conference (WWW 2004); ACM Press: Manhattan, USA, 2004; pp. 595–601.
 Rossi and Ahmed (2016) Rossi, R.A.; Ahmed, N.K. An Interactive Data Repository with Visual Analytics. SIGKDD Explor. 2016, 17, 37–41.
 Goldenberg et al. (2010) Goldenberg, A.; Zheng, A.X.; Fienberg, S.E.; Airoldi, E.M.; others. A survey of statistical network models. Foundations and Trends® in Machine Learning 2010, 2, 129–233.
 McGeoch et al. (2000) McGeoch, C.C.; Sanders, P.; Fleischer, R.; Cohen, P.R.; Precup, D. Using Finite Experiments to Study Asymptotic Performance. Experimental Algorithmics, From Algorithm Design to Robust and Efficient Software [Dagstuhl seminar, September 2000], 2000, pp. 93–126.
 Bernardo and Smith (2009) Bernardo, J.M.; Smith, A.F. Bayesian theory; Vol. 405, John Wiley & Sons, 2009.
 Altman and Bland (2005) Altman, D.G.; Bland, J.M. Standard deviations and standard errors. BMJ 2005, 331, 903, [https://www.bmj.com/content/331/7521/903.full.pdf].
 Ellis (2010) Ellis, P.D. The essential guide to effect sizes: Statistical power, metaanalysis, and the interpretation of research results; Cambridge University Press, 2010.
 Cohen (1988) Cohen, J. Statistical power analysis for the behavioral sciences; Routledge, 1988.
 McGeoch (2012) McGeoch, C.C. A Guide to Experimental Algorithmics, 1st ed.; Cambridge University Press: New York, NY, USA, 2012.
 Rardin and Uzsoy (2001) Rardin, R.L.; Uzsoy, R. Experimental Evaluation of Heuristic Optimization Algorithms: A Tutorial. Journal of Heuristics 2001, 7, 261–304. doi:10.1023/A:1011319115230.
 Sedgewick (1978) Sedgewick, R. Implementing quicksort programs. Communications of the ACM 1978, 21, 847–857.
 Eppstein and Wang (2001) Eppstein, D.; Wang, J. Fast Approximation of Centrality. Proceedings of the Twelfth Annual ACMSIAM Symposium on Discrete Algorithms; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2001; SODA ’01, pp. 228–229.
 James et al. (2013) James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning – with Applications in R, 2013.
 Hinton (2012) Hinton, G.E. A practical guide to training restricted Boltzmann machines. In Neural networks: Tricks of the trade; Springer, 2012; pp. 599–619.
 Larochelle et al. (2007) Larochelle, H.; Erhan, D.; Courville, A.; Bergstra, J.; Bengio, Y. An empirical evaluation of deep architectures on problems with many factors of variation. Proceedings of the 24th international conference on Machine learning. ACM, 2007, pp. 473–480.
 LeCun et al. (1998) LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradientbased learning applied to document recognition. Proceedings of the IEEE 1998, 86, 2278–2324.
 Bergamini et al. (2015) Bergamini, E.; Meyerhenke, H.; Staudt, C.L., Approximating Betweenness Centrality in Large Evolving Networks. In 2015 Proceedings of the Seventeenth Workshop on Algorithm Engineering and Experiments (ALENEX); SIAM, 2015; pp. 133–146, [https://epubs.siam.org/doi/pdf/10.1137/1.9781611973754.12]. doi:10.1137/1.9781611973754.12.
 Green et al. (2012) Green, O.; McColl, R.; Bader, D.A. A Fast Algorithm for Streaming Betweenness Centrality. Proceedings of the 2012 ASE/IEEE International Conference on Social Computing and 2012 ASE/IEEE International Conference on Privacy, Security, Risk and Trust; IEEE Computer Society: Washington, DC, USA, 2012; SOCIALCOMPASSAT ’12, pp. 11–20. doi:10.1109/SocialComPASSAT.2012.37.
 Kourtellis et al. (2014) Kourtellis, N.; Morales, G.D.F.; Bonchi, F. Scalable Online Betweenness Centrality in Evolving Graphs. CoRR 2014, abs/1401.6981.
 Brandes and Pich (2007) Brandes, U.; Pich, C. Centrality estimation in large networks. INTL. JOURNAL OF BIFURCATION AND CHAOS, SPECIAL ISSUE ON COMPLEX NETWORKS’ STRUCTURE AND DYNAMICS, 2007.
 Bader and Madduri (2006) Bader, D.; Madduri, K. Parallel Algorithms for Evaluating Centrality Indices in Realworld Networks. Proceedings of the 2006 International Conference on Parallel Processing (ICPP ’06), 2006, pp. 539 – 550.
 McLaughlin and Bader (2014) McLaughlin, A.; Bader, D.A. Scalable and High Performance Betweenness Centrality on the GPU. SC ’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2014, pp. 572–583. doi:10.1109/SC.2014.52.
 Sariyüce et al. (2013) Sariyüce, A.E.; Kaya, K.; Saule, E.; Çatalyürek, U.V. Betweenness Centrality on GPUs and Heterogeneous Architectures. Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units; ACM: New York, NY, USA, 2013; GPGPU6, pp. 76–85. doi:10.1145/2458523.2458531.
 Shi and Zhang (2011) Shi, Z.; Zhang, B. Fast network centrality analysis using GPUs. BMC Bioinformatics 2011, 12, 149. doi:10.1186/1471210512149.
 Crescenzi et al. (2015) Crescenzi, P.; D’Angelo, G.; Severini, L.; Velaj, Y. Greedily Improving Our Own Centrality in A Network. Experimental Algorithms; Bampis, E., Ed.; Springer International Publishing: Cham, 2015; pp. 43–55.
 Hennessy and Patterson (2011) Hennessy, J.L.; Patterson, D.A. Computer architecture: a quantitative approach; Elsevier, 2011.
 Bixby (2002) Bixby, R.E. Solving realworld linear programs: A decade and more of progress. Operations research 2002, 50, 3–15.
 Mitchell (2004) Mitchell, D.W. 88.27 more on spreads and nonarithmetic means. The Mathematical Gazette 2004, 88, 142–144.
 Smith (1988) Smith, J.E. Characterizing Computer Performance with a Single Number. Commun. ACM 1988, 31, 1202–1206. doi:10.1145/63039.63043.
 Fleming and Wallace (1986) Fleming, P.J.; Wallace, J.J. How Not To Lie With Statistics: The Correct Way To Summarize Benchmark Results. Commun. ACM 1986, 29, 218–221. doi:10.1145/5666.5673.
 AlGhamdi et al. (2017) AlGhamdi, Z.; Jamour, F.; Skiadopoulos, S.; Kalnis, P. A Benchmark for Betweenness Centrality Approximation Algorithms on Large Graphs. Proceedings of the 29th International Conference on Scientific and Statistical Database Management, Chicago, IL, USA, June 2729, 2017, 2017, pp. 6:1–6:12. doi:10.1145/3085504.3085510.
 git (2005) git. git. https://git-scm.com/, 2005.
 GitHub (2007) GitHub. GitHub. https://github.com/, 2007.
 GitLab (2011) GitLab. GitLab. https://gitlab.com/, 2011.
 Bitbucket (2008) Bitbucket. Bitbucket. https://bitbucket.org/, 2008.
 Nethercote and Seward (2007) Nethercote, N.; Seward, J. Valgrind: a framework for heavyweight dynamic binary instrumentation. ACM Sigplan notices. ACM, 2007, Vol. 42, pp. 89–100.
 Reinders (2005) Reinders, J. VTune performance analyzer essentials. Intel Press 2005.
 Hamadi and Wintersteiger (2012) Hamadi, Y.; Wintersteiger, C.M. Seven Challenges in Parallel SAT Solving. Proceedings of the TwentySixth AAAI Conference on Artificial Intelligence, 2012.
 Kimmig et al. (2017) Kimmig, R.; Meyerhenke, H.; Strash, D. Shared Memory Parallel Subgraph Enumeration. 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPS Workshops 2017, Orlando / Buena Vista, FL, USA, May 29  June 2, 2017. IEEE Computer Society, 2017, pp. 519–529. doi:10.1109/IPDPSW.2017.133.
 Goldberg (1991) Goldberg, D. What Every Computer Scientist Should Know About Floating-point Arithmetic. ACM Comput. Surv. 1991, 23, 5–48. doi:10.1145/103162.103163.
 YAML (2002) YAML. The YAML Project. https://yaml.org/, 2002.
 (70) (DFG), G.R.F., 2010.
 JSON (2001) JSON. JSON. https://www.json.org/index.html, 2001.
 Anscombe (1973) Anscombe, F.J. Graphs in Statistical Analysis. The American Statistician 1973, 27, 17–21. doi:10.1080/00031305.1973.10478966.
 Healy (2018) Healy, K. Data Visualization: A Practical Introduction; Princeton University Press, 2018.
 Sanders (2002) Sanders, P. Presenting Data from Experiments in Algorithmics. In Experimental Algorithmics; Fleischer, R.; Moret, B.; Schmidt, E.M., Eds.; SpringerVerlag Berlin Heidelberg, 2002; Vol. 2547, LNCS, chapter 9, pp. 181–196.
 Tufte (2001) Tufte, E.R. The Visual Display of Quantitative Information; Graphics Press, 2001.
 Munzner (2014) Munzner, T. Visualization Analysis and Design; CRC Press, 2014.
 Ware (2012) Ware, C. Information Visualization: Perception for Design, 3rd ed.; Morgan Kaufmann, 2012.
 Anderson et al. (2000) Anderson, D.R.; Burnham, K.P.; Thompson, W.L. Null hypothesis testing: problems, prevalence, and an alternative. The journal of wildlife management 2000, pp. 912–923.
 Trafimow and Marks (2015) Trafimow, D.; Marks, M. Editorial. Basic and Applied Social Psychology 2015, 37, 1–2.
 Wasserstein and Lazar (2016) Wasserstein, R.L.; Lazar, N.A. The ASA’s statement on pvalues: context, process, and purpose, 2016.
 Lash (2017) Lash, T.L. The harm done to reproducibility by the culture of null hypothesis significance testing. American journal of epidemiology 2017, 186, 627–635.
 Cumming (2014) Cumming, G. The New Statistics: Why and How. Psychological Science 2014, 25, 7–29, [https://doi.org/10.1177/0956797613504966]. PMID: 24220629, doi:10.1177/0956797613504966.
 Kruschke and Liddell (2018) Kruschke, J.K.; Liddell, T.M. The Bayesian New Statistics: Hypothesis testing, estimation, metaanalysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review 2018, 25, 178–206.
 Murtaugh (2014) Murtaugh, P.A. In defense of P values. Ecology 2014, 95, 611–617.
 Box (1976) Box, G.E.P. Science and Statistics. Journal of the American Statistical Association 1976, 71, 791–799.
 Pólya (1920) Pólya, G. Über den zentralen Grenzwertsatz der Wahrscheinlichkeitsrechnung und das Momentenproblem. Mathematische Zeitschrift 1920, 8, 171–181.
 Charnes et al. (1976) Charnes, A.; Frome, E.L.; Yu, P.L. The Equivalence of Generalized Least Squares and Maximum Likelihood Estimates in the Exponential Family. Journal of the American Statistical Association 1976, 71, 169–171, [https://www.tandfonline.com/doi/pdf/10.1080/01621459.1976.10481508]. doi:10.1080/01621459.1976.10481508.
 Fisher (1992) Fisher, R.A. Statistical methods for research workers. In Breakthroughs in Statistics; Springer, 1992; pp. 66–70.
 Young et al. (2005) Young, G.A.; Smith, R.L.; others. Essentials of statistical inference; Vol. 16, Cambridge University Press, 2005.
 Wilcoxon (1945) Wilcoxon, F. Individual comparisons by ranking methods. Biometrics bulletin 1945, 1, 80–83.
 Jones et al. (2001–) Jones, E.; Oliphant, T.; Peterson, P.; others. SciPy: Open source scientific tools for Python, 2001–. [Online; accessed <today>].
 Dunn (1961) Dunn, O.J. Multiple comparisons among means. Journal of the American Statistical Association 1961, 56, 52–64.
 Holm (1979) Holm, S. A Simple Sequentially Rejective Multiple Test Procedure. Scandinavian Journal of Statistics 1979, 6, 65–70.
 Smithson (2003) Smithson, M. Confidence Intervals; SAGE Publications, Inc., 2003.
 Neyman (1937) Neyman, J. Outline of a Theory of Statistical Estimation Based on the Classical Theory of Probability. Phil. Trans. R. Soc. Lond. A 1937, 236, 333–380.
 Gelman et al. (2013) Gelman, A.; Stern, H.S.; Carlin, J.B.; Dunson, D.B.; Vehtari, A.; Rubin, D.B. Bayesian data analysis; Chapman and Hall/CRC, 2013.
 Salvatier et al. (2016) Salvatier, J.; Wiecki, T.V.; Fonnesbeck, C. Probabilistic programming in Python using PyMC3. PeerJ Computer Science 2016, 2, e55.
 Akaike (1974) Akaike, H. A new look at the statistical model identification. IEEE Transactions on Automatic Control 1974, 19, 716–723. doi:10.1109/TAC.1974.1100705.
 Barr (1995) Barr, R.S. Designing and Reporting on Computational Experiments with Heuristic Methods; Technical report, Department of Computer Science and Engineering, Southern Methodist University, 1995.
 Moret and Shapiro (2001) Moret, B.M.E.; Shapiro, H.D.D. Algorithms and Experiments: The New (and Old) Methodology. jjucs 2001, 7, 434–446.
 Gent et al. (1997) Gent, I.P.; Grant, S.A.; MacIntyre, E.; Prosser, P.; Shaw, P.; Smith, B.M.; Walsh, T. How Not To Do It. Research Report 97.27 (School of Computer Studies, University of Leeds) 1997.
 McGeoch and Moret (1999) McGeoch, C.C.; Moret, B.M.E. How to present a paper on experimental work with algorithms. SIGACT News 1999, 30, 85–90.
 McGeoch (1996) McGeoch, C.C. Toward an Experimental Method for Algorithm Simulation. INFORMS Journal on Computing 1996, 8, 1–15.
 McGeoch (1992) McGeoch, C.C. Analyzing Algorithms by Simulation: Variance Reduction Techniques and Simulation Speedups. ACM Comput. Surv. 1992, 24, 195–212. doi:10.1145/130844.130853.
 Council (2005) Council, N.R. Network Science; The National Academies Press: Washington, DC, 2005. doi:10.17226/11516.
 Hanks et al. (1993) Hanks, S.; Pollack, M.E.; Cohen, P.R. Benchmarks, Test Beds, Controlled Experimentation, and the Design of Agent Architectures. AI Magazine 1993, 14, 17–42.
 Cohen (1995) Cohen, P.R. Empirical Methods for Artificial Intelligence; MIT Press: Cambridge, MA, USA, 1995.
 Eckles et al. (2017) Eckles, D.; Brian, K.; Johan, U. Design and Analysis of Experiments in Networks: Reducing Bias from Interference. Journal of Causal Inference 2017, 5.
 Gui et al. (2015) Gui, H.; Xu, Y.; Bhasin, A.; Han, J. Network A/B Testing: From Sampling to Estimation. Proceedings of the 24th International Conference on World Wide Web; International World Wide Web Conferences Steering Committee: Republic and Canton of Geneva, Switzerland, 2015; WWW ’15, pp. 399–409. doi:10.1145/2736277.2741081.
 Jensen and Cohen (2000) Jensen, D.D.; Cohen, P.R. Multiple comparisons in induction algorithms. Machine Learning 2000, 38, 309–338.
 Kuhn and Johnson (2013) Kuhn, M.; Johnson, K. Applied predictive modeling; Vol. 26, Springer, 2013.
 Funke and Sanders (2017) Funke, D.; Sanders, P. Parallel dD Delaunay Triangulations in Shared and Distributed Memory. ALENEX. SIAM, 2017, pp. 207–217.
 Booth et al. (2012) Booth, A.; Papaioannou, D.; Sutton, A. Systematic Approaches to a Successful Literature Review; SAGE Publications, 2012.
 Keshav (2007) Keshav, S. How to Read a Paper. SIGCOMM Comput. Commun. Rev. 2007, 37, 83–84. doi:10.1145/1273445.1273458.
 Kaijanaho (2017) Kaijanaho, A.J. Teaching Master’s Degree Students to Read Research Literature: Experience in a Programming Languages Course 2002–2017. Proceedings of the 17th Koli Calling International Conference on Computing Education Research; ACM: New York, NY, USA, 2017; Koli Calling ’17, pp. 143–147. doi:10.1145/3141880.3141893.
 Garousi and Felderer (2017) Garousi, V.; Felderer, M. Experiencebased Guidelines for Effective and Efficient Data Extraction in Systematic Reviews in Software Engineering. Proceedings of the 21st International Conference on Evaluation and Assessment in Software Engineering; ACM: New York, NY, USA, 2017; EASE’17, pp. 170–179. doi:10.1145/3084226.3084238.
 Brooks (1990) Brooks, A. Searching and Reviewing the Computer Science Literature: A Guide for Research Students, 1990.
 Karger and Stein (1996) Karger, D.R.; Stein, C. A new approach to the minimum cut problem. Journal of the ACM (JACM) 1996, 43, 601–640.
 Brandes et al. (2007) Brandes, U.; Delling, D.; Gaertler, M.; Görke, R.; Hoefer, M.; Nikoloski, Z.; Wagner, D. On Finding Graph Clusterings with Maximum Modularity. Proceedings of the 33rd International Conference on Graphtheoretic Concepts in Computer Science; SpringerVerlag: Berlin, Heidelberg, 2007; WG’07, pp. 121–132.
 Hijazi et al. (2010) Hijazi, H.L.; Bonami, P.; Cornuéjols, G.; Ouorou, A. Mixed Integer NonLinear Programs featuring "On/Off" constraints: convex analysis and applications. Electronic Notes in Discrete Mathematics 2010, 36, 1153–1160.
 Morey et al. (2016) Morey, R.D.; Hoekstra, R.; Rouder, J.N.; Lee, M.D.; Wagenmakers, E.J. The fallacy of placing confidence in confidence intervals. Psychonomic bulletin & review 2016, 23, 103–123.
 Greenland et al. (2016) Greenland, S.; Senn, S.J.; Rothman, K.J.; Carlin, J.B.; Poole, C.; Goodman, S.N.; Altman, D.G. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. European journal of epidemiology 2016, 31, 337–350.
 Ioannidis (2005) Ioannidis, J.P. Why most published research findings are false. PLoS medicine 2005, 2, e124.
 Tukey (1980) Tukey, J.W. We Need Both Exploratory and Confirmatory. The American Statistician 1980, 34, 23–25, [https://amstat.tandfonline.com/doi/pdf/10.1080/00031305.1980.10482706]. doi:10.1080/00031305.1980.10482706.
 Amdahl (1967) Amdahl, G.M. Validity of the single processor approach to achieving large scale computing capabilities. Proceedings of the April 1820, 1967, spring joint computer conference. ACM, 1967, pp. 483–485.
 Gustafson (1988) Gustafson, J.L. Reevaluating Amdahl’s law. Communications of the ACM 1988, 31, 532–533.
 Shi (1996) Shi, Y. Reevaluating Amdahl’s law and Gustafson’s law. Computer Sciences Department, Temple University (MS: 3824) 1996.
 Costa et al. (2007) Costa, L.d.F.; Rodrigues, F.A.; Travieso, G.; Villas Boas, P.R. Characterization of complex networks: A survey of measurements. Advances in physics 2007, 56, 167–242.
 Kim and Wilhelm (2008) Kim, J.; Wilhelm, T. What is a complex graph? Physica A: Statistical Mechanics and its Applications 2008, 387, 2637–2652.
 Boccaletti et al. (2006) Boccaletti, S.; Latora, V.; Moreno, Y.; Chavez, M.; Hwang, D.U. Complex networks: Structure and dynamics. Physics reports 2006, 424, 175–308.
 Watts and Strogatz (1998) Watts, D.J.; Strogatz, S.H. Collective dynamics of ‘smallworld’networks. nature 1998, 393, 440.
 Barabási and Albert (1999) Barabási, A.L.; Albert, R. Emergence of scaling in random networks. science 1999, 286, 509–512.
 Newman (2002) Newman, M.E. Assortative mixing in networks. Physical review letters 2002, 89, 208701.
 Milo et al. (2002) Milo, R.; ShenOrr, S.; Itzkovitz, S.; Kashtan, N.; Chklovskii, D.; Alon, U. Network motifs: simple building blocks of complex networks. Science 2002, 298, 824–827.
 Newman (2006) Newman, M.E. Modularity and community structure in networks. Proceedings of the national academy of sciences 2006, 103, 8577–8582.
 Barabási and Bonabeau (2003) Barabási, A.L.; Bonabeau, E. Scalefree networks. Scientific american 2003, 288, 60–69.
 Amaral et al. (2000) Amaral, L.A.N.; Scala, A.; Barthelemy, M.; Stanley, H.E. Classes of smallworld networks. Proceedings of the national academy of sciences 2000, 97, 11149–11152.
 Sedgewick (1977) Sedgewick, R. The analysis of quicksort programs. Acta Informatica 1977, 7, 327–355.
 Easley and Kleinberg (2010) Easley, D.; Kleinberg, J. Networks, crowds, and markets: Reasoning about a highly connected world; Cambridge University Press, 2010.
 Bavelas (1948) Bavelas, A. A mathematical model for group structures. Applied anthropology 1948, 7, 16–30.
 Shimbel (1953) Shimbel, A. Structural parameters of communication networks. The bulletin of mathematical biophysics 1953, 15, 501–507.
 Shaw (1954) Shaw, M.E. Group structure and the behavior of individuals in small groups. The Journal of psychology 1954, 38, 139–149.
 Cohn and Marriott (1958) Cohn, B.S.; Marriott, M. Networks and centres of integration in Indian civilization. Journal of social Research 1958, 1, 1–9.
 Borgatti and Everett (2006) Borgatti, S.P.; Everett, M.G. A graphtheoretic perspective on centrality. Social networks 2006, 28, 466–484.
 Chung et al. (2006) Chung, F.; Chung, F.R.; Graham, F.C.; Lu, L.; Chung, K.F.; others. Complex graphs and networks; American Mathematical Soc., 2006.
 Lü et al. (2013) Lü, J.; Chen, G.; Ogorzalek, M.J.; Trajković, L. Theory and applications of complex networks: Advances and challenges. Circuits and Systems (ISCAS), 2013 IEEE International Symposium on. IEEE, 2013, pp. 2291–2294.
 Clauset et al. (2009) Clauset, A.; Shalizi, C.R.; Newman, M.E. Powerlaw distributions in empirical data. SIAM review 2009, 51, 661–703.
 Fan et al. (2017) Fan, R.; Xu, K.; Zhao, J. A GPUBased Solution to Fast Calculation of Betweenness Centrality on Large Weighted Networks. PeerJ Computer Science 2017, 3.
 D’Angelo et al. (2016) D’Angelo, G.; Severini, L.; Velaj, Y. On the Maximum Betweenness Improvement Problem. Electronic Notes in Theoretical Computer Science 2016, 322, 153–168. Proceedings of ICTCS 2015, the 16th Italian Conference on Theoretical Computer Science, doi:10.1016/j.entcs.2016.03.011.
 Jacob et al. (2005) Jacob, R.; Koschützki, D.; Lehmann, K.A.; Peeters, L.; TenfeldePodehl, D., Algorithms for Centrality Indices. In Network Analysis: Methodological Foundations; Springer Berlin Heidelberg: Berlin, Heidelberg, 2005; pp. 62–82. doi:10.1007/9783540319559_4.
 Hong et al. (2012) Hong, C.T.; Chen, D.H.; Chen, Y.B.; Chen, W.G.; Zheng, W.M.; Lin, H.B. Providing Source Code Level Portability Between CPU and GPU with MapCG. Journal of Computer Science and Technology 2012, 27, 42–56. doi:10.1007/s1139001212054.
 Hayashi et al. (2015) Hayashi, T.; Akiba, T.; Yoshida, Y. Fully Dynamic Betweenness Centrality Maintenance on Massive Networks. Proc. VLDB Endow. 2015, 9, 48–59. doi:10.14778/2850578.2850580.
 Nemhauser and Wolsey (1978) Nemhauser, G.L.; Wolsey, L.A. Best algorithms for approximating the maximum of a submodular set function. Mathematics of operations research 1978, 3, 177–188.
 Chacon and Straub (2014) Chacon, S.; Straub, B. Pro Git, 2nd ed.; Apress: Berkeley, CA, USA, 2014.
 R Core Team (1993) R Core Team. Rproject. https://www.r-project.org/, 1993.
 R Core Team (2012) R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2012. ISBN 3900051070.
 Williams and Kelley (1986) Williams, T.; Kelley, C. gnuplot. https://www.rproject.org/, 1986.
 Hunter (2007) Hunter, J.D. Matplotlib: A 2D graphics environment. Computing In Science & Engineering 2007, 9, 90–95. doi:10.1109/MCSE.2007.55.
 The MathWorks, Inc. (1984) The MathWorks, Inc.. Matlab 8.0. https://www.mathworks.com/products/matlab.html, 1984.
 Cox (2006) Cox, D.R. Principles of statistical inference; Cambridge university press, 2006.
 Cox (2011) Cox, N.J. Stata tip 96: Cube roots. Stata Journal 2011, 11, 149–154(6).
 Demsar (2006) Demsar, J. Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research 2006, 7, 1–30.
 García et al. (2010) García, S.; Fernández, A.; Luengo, J.; Herrera, F. Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Inf. Sci. 2010, 180, 2044–2064. doi:10.1016/j.ins.2009.12.010.
 Garcia and Herrera (2008) Garcia, S.; Herrera, F. An extension on“statistical comparisons of classifiers over multiple data sets”for all pairwise comparisons. Journal of Machine Learning Research 2008, 9, 2677–2694.
 Knuth (1997) Knuth, D.E. The art of computer programming; Vol. 3, Pearson Education, 1997.