Comparing the Performance of Graphical Structure Learning Algorithms with \proglangTETRAD

# Comparing the Performance of Graphical Structure Learning Algorithms with \proglangTetrad

Joseph D. Ramsey
Carnegie Mellon University \AndDaniel Malinsky
Carnegie Mellon University
\Plainauthor

Joseph D. Ramsey, Daniel Malinsky \PlaintitleComparing the Performance of Graphical Structure Learning Algorithms with TETRAD \ShorttitleComparing the Performance of Structure Learning Algorithms \Abstract In this report we describe a tool for comparing the performance of causal structure learning algorithms implemented in the \proglangTETRAD freeware suite of causal analysis methods. Currently the tool is available as package in the \proglangTETRAD source code (written in \proglangJava), which can be loaded up in an Integrated Development Environment (IDE) such as IntelliJ IDEA. Simulations can be done varying the number of runs, sample sizes, and data modalities. Performance on this simulated data can then be compared for a number of algorithms, with parameters varied and with performance statistics as selected, producing a publishable report. The order of the algorithms in the output can be adjusted to the user’s preference using a utility function over the statistics. Data sets from simulation can be saved along with their graphs to a file and loaded back in for further analysis, or used for analysis by other tools. The package presented here may also be used to compare structure learning methods across platforms and programming languages, i.e., to compare algorithms implemented in \proglangTETRAD with those implemented in \proglangMATLAB or \proglangR. \KeywordsCausal inference, graphical models, structure learning, performance \PlainkeywordsCausal inference, graphical models, structure learning, performance \Address Joseph D. Ramsey and Daniel Malinsky
Department of Philosophy
Carnegie Mellon University
5000 Forbes Avenue Pittsburgh, PA 15224
E-mail:

## 1 Introduction

Often researchers are faced with the problem of choosing an algorithm from among possibly dozens of relevant algorithms for a particular task. This can be time-consuming and error-prone; one must try each algorithm in turn, vary the parameters for that algorithm, run it in simulation on common data sets that hopefully reflect the properties of the real data of interest, and somehow try to discern which algorithm has the best performance over the range of cases under study. Reading research papers for descriptions and evaluations of algorithms is often unhelpful, since papers tend to compare only one or two algorithms at a time, on performance statistics that may not be of interest to the user, using simulations that are not appropriate for the domain. Ideally the user could directly compare a range of algorithms, on data of their choosing, and on performance statistics of interest to them, so that they could make an informed decision as to which algorithm(s) may be best suited to the user’s particular purpose.

It is a task we feel is best automated and used early and often. We focus on the structure learning algorithms in the \proglangTETRAD freeware (http://www.phil.cmu.edu/tetrad). Within \proglangTETRAD, we have created a tool for comparing algorithms, both “basic" algorithms with various parameterizations and algorithms put together in various combinations. The relevant code is contained in the package \pkgalgcomparison within \proglangTETRAD.111The full package path in the code library is |edu.cmu.tetrad.algcomparison|. It is possible to construct studies in which combinations of structure learning algorithms are compared head-to-head on common data, with known true models; winners conveniently float to the top of the list of algorithms when sorted by a utility function that reflects the user’s interests. Algorithms that perform poorly for the intended type of data analysis quickly become apparent. This makes it easy to identify the general class of algorithms the user may want to choose from for their purposes. As the user learns more about their particular problem, they can easily refine their assessment by running a modified set of simulations. The procedure described here is more time-efficient than setting up specific simulation tests for pairs or triples of algorithms, since one can compare simultaneously as many algorithms and algorithm variants as desired.

A general problem which arises in developing novel algorithms is knowing how their results compare to those of existing algorithms, where the algorithms’ weaknesses are from the point of view of performance, where they need to be improved, and whether a development effort should be abandoned in favor of other methods with better performance. Having a tool to automatically compare a new algorithm to all existing algorithms, easily, on-the-fly, is of non-trivial advantage for future algorithm development.

Our tool is coded up in \proglangTETRAD, but of course there are excellent algorithms implemented in a number of platforms and languages and it may be desirable to assess those algorithms as well. Furthermore, one may wish to evaluate how \proglangTETRAD algorithms perform on data generated using other software packages. Thus, it would be preferable to be able to save data sets and graphs to be read by other packages, and also to read data and graphs produced in other platforms in order to analyze them inside of our algorithm comparison tool. A facility for this has been provided, for data and graphs formatted as is usual for \proglangTETRAD. If specific formats are needed, the tools that load and save data and graphs can be copied and modified by the user to load data and graphs in the external format. Moreover, we describe below how the tool presented here can be used to compare algorithms across platforms, e.g., compare implementations of the same or similar algorithms in \proglangR, \proglangMATLAB, \proglangJava, and potentially other languages.

Since \proglangJava is object-oriented, we can take advantage of clean object-orientation to design this tool, which is a considerable advantage. A performance evaluation tool should have available a wide range of algorithms for easy comparison, and we would also like to be able to add new algorithms easily. Combinations of algorithms are often treated in practice as novel algorithms; we allow them to be treated as such. The tool ought to have some standard styles of simulation readily available while being able to add new simulation styles easily. One should be able to handle continuous variables or discrete variables, or mixtures of the two (for algorithms that can analyze mixed variable sets). Parameters ought to have defaults, but if the user wishes to change the default settings of the parameters or add parameters for novel algorithms that are not among the current set, the user should to be able to do this. A variety of performance statistics ought to be available to the user. It should be relatively straightforward with some obvious programming to add new performance statistics. There is no point in antagonizing users if they do not wish to use the statistics chosen by default; it should be easy to change the performance statistics included in the output. It should be possible to create different styles of true graphs for the simulation models, and it must be easy to add new styles. Finally, in deference to the user’s needs, it should be possible for the user to decide which combination of statistics to employ to pick the best algorithm or algorithms. This is because different users with different scientific backgrounds may very well have different views as to what is important in an estimated causal model.

We take the view that these differences should be handled using a modular architecture. We write a central comparison algorithm that allows pieces of different sorts to be easily plugged into a script as the user wishes. We allow algorithms to be incorporated modularly, and similarly for parameters, simulations, performance statistics, independence tests, and scores. The user writes (or modifies) a script in which these modules are all specified and passed into the comparison algorithm. Though \proglangJava is perhaps a less commonly used language for statistical software than (e.g.) \proglangR, the modular features of \proglangTETRAD implemented in \proglangJava make for a more flexible evaluation environment than is (straightforwardly) possible in \proglangR. For example, since many different varities of structure learning algorithms are coded in a mutually-compatible object-oriented way in \proglangTETRAD, it is trivial to execute “hybrid” algorithms which e.g., execute constraint-based search on a starting graph generated by greedy search, and then use entropy-based methods to make final orientations. As far as we are aware, such combinations would require a significant amount of custom scripting in other programming platforms. Other benefits of the \proglangTETRAD platform are discussed below. See Appendix A for instructions for downloading \proglangJava and setting up \proglangTETRAD.

In Sections 3 to 7 we describe the various modular components of the \pkgalgcomparison package and then in Sections 8 and 9 we provide example scripts which combine these pieces to generate reports. In Section 10 we describe the full set of options currently implemented, though more options are continuously being added as \proglangTETRAD is open-source. In Section 11 we discuss and illustrate performance comparison across platforms. In the last section we briefly describe a graphical user interface (GUI) which is also available. First, we provide a brief background on causal structure learning with \proglangTETRAD.

## 2 A brief introduction to \proglangTetrad

The \proglangTETRAD software was introduced in the mid-1980s to aid in constructing, testing, predicting with, and learning causal statistical models based on structural equations or graphical representations like Directed Acyclic Graphs (DAGs) (Glymour et al., 1987; Spirtes et al., 1990; Scheines et al., 1998; Spirtes et al., 2000). The capabilities and flexibility of \proglangTETRAD has increased with years of algorithm development and application in several scientific fields including biology (e.g. Shipley et al., 2006), neuroscience (e.g. Smith et al., 2011; Mills-Finnerty et al., 2014), economics (e.g. Bessler and Lee, 2002; Demiralp and Hoover, 2003), climate science (e.g. Ebert-Uphoff and Deng, 2012), education research (e.g. Rau et al., 2013), and other areas. Though \proglangTETRAD is capable of performing a wide range of tasks relevent to causal inference, we will focus only on graphical structure learning here.

\proglang

TETRAD implements numerous algorithms which search for causal graphical models. The resultant models are intended to have a causal interpretations, the precise details of which depend on the underlying assumptions and the type of output graph produced by the method. At the time of writing, there are dozens of structure learning algorithms and variations; we do not review them all in detail here. Recent overviews of causal graphical modeling, structure learning algorithms, and the assumptions underlying causal inference from observational data can be found in Spirtes and Zhang (2016), Drton and Maathuis (2017), and Heinze-Deml et al. (2018). Some structure learning methods assume the data is generated by a causal DAG with no latent variables, and return a graph which represents a Markov equivalence class of DAGs (called a Pattern or CPDAG). See Figure 1. We call these Pattern search algorithms. Other methods allow for the possibility of latent variables and return a graph which represents a Markov equivalence class of ancestral graphs (called a PAG, a Partial Ancestral Graph). See Figure 2. We call these PAG search algorithms. See Richardson and Spirtes (2002) and Zhang (2008a) for background on ancestral graphs.

Constraint-based structure learning algorithms include PC (Spirtes et al., 2000), Conservative PC (Ramsey et al., 2006),222Conservative PC is abbreviated CPC, and other constraint-based algorithms have been implemented to import the “conservative” decision-making of CPC. So, for example, CFCI is a conservative version of FCI, and likewise for CPC-stable, etc. PC-stable (Colombo and Maathuis, 2014), CCD (Richardson, 1996), FCI (Spirtes et al., 1995; Zhang, 2008b), RFCI (Colombo et al., 2012), and variations on these. Constraint-based methods use statistical tests of conditional independence to narrow the range of possible causal models consistent with observed data. Different independence tests are appropriate for different kinds of data. For example, for continuous variables with multivariate Gaussian distributions one may use a test based on Fisher’s Z transformation, whereas discrete multinomial data or non-linear non-Gaussian data require different tests. \proglangTETRAD implements a variety of independence tests; more on this below.

Score-based algorithms include GES (Chickering, 2002; Chickering and Meek, 2002) and GFCI (Ogarrio et al., 2016). In recent versions of \proglangTETRAD, GES has been superceded by a faster, parallelized version of the greedy search algorithm called fGES (Ramsey et al., 2017). These methods use a model score – typically a maximum likelihood penalized for complexity such as the BIC score – to search for models with the best fit to the data. Some of the constraint-based algorithms above have also been modified to make conditional independence judgements based on score differences instead of a statistical hypothesis tests.

In addition to the above, \proglangTETRAD implements algorithms like the ICA-based LiNGaM (Shimizu et al., 2006) and MGM (Lee and Hastie, 2013; Sedgewick et al., 2016). These methods are neither Pattern nor PAG search algorithms; they return a DAG and a Markov random field, respectively. There are also some causal clustering algorithms like FOFC, FTFC, and BuildPureClusters, which search for latent variable measurement models (Kummerfeld and Ramsey, 2016; Kummerfeld et al., 2014; Silva et al., 2006). In Section 10 we provide a list of the algorithms currently implemented and compatible with the performance comparison tool presented here. New algorithms are continuously being developed and added to the codebase.

\proglang

TETRAD is a longstanding open-source project with numerous contributors. The copyright is owned by Peter Spirtes, Richard Scheines, Joseph Ramsey, and Clark Glymour. Ramsey has implemented most of the program, which consists of a large number of packages containing code for algorithms, statistical tests, data-generation and manipulaton methods, etc. The subject of this note is just one package in \proglangTETRAD, namely the \pkgalgcomparison package. The centerpiece of this package is the \proglangJava class Comparison (an early version of which was written by both Malinsky and Ramsey, then substantially generalized and improved by Ramsey), which is responsible for combining the various modular components discussed below, executing methods elsewhere in \proglangTETRAD, and producing the output. The package \pkgalgcomparison also includes wrappers and interfaces for the aforementioned various components of \proglangTETRAD, which enable the Comparison class to handle them correctly and uniformly.333Some parts of the \proglangTETRAD code outside of the \pkgalgcomparison package have been modified to conform with the wrappers and interfaces presented here. Thus, the \pkgalgcomparison package is not a standalone tool, but is integrated with and dependent on the rest of the \proglangTETRAD code. Finally, the \pkgalgcomparison package includes a small number or example scripts which the user can modify and execute to generate results; we will describe these below.

## 3 What is an algorithm?

Algorithms in \proglangTETRAD search for causal graphical models from data sets. That is, every algorithm takes a data set as input along with certain parameter values, and produces a graphical model as output. The type of graph produced depends on the algorithm’s presumed seach space. This output graph is then compared to the graph which that algorithm would produce in the ideal limit of infinite data. For our purposes, we define an algorithm as an interface in \proglangJava, as follows: {CodeChunk} {CodeInput} public interface Algorithm Graph search(DataSet dataSet, Parameters parameters); Graph getComparisonGraph(Graph dag); String getDescription(); DataType getDataType(); List<String> getParameters(); This has the following methods:

• search: Takes a set of parameters and a data set and returns a graph. For algorithms which search for a Pattern, this will be an estimate of a Pattern; for algorithms which search for a PAG, this will be an estimate of a PAG. For other types of algorithms, an appropriate graph type will be returned.

• getComparisonGraph: Returns a comparison graph, given the true DAG – that is, the graph against which the result of the search method is to be compared. For Pattern algorithms, this returns the Pattern (Markov equivalence class) implied by the true DAG; for PAG algorithms, this returns the PAG implied by the true DAG.

• getDataType: Returns a data type – continuous, discrete, or mixed. Algorithms that work only for continuous variables will not run on discrete or mixed data, for instance, though algorithms that are for mixed data will run on any data file. This compatibility needs to be checked before running a search.

• getDescription: Returns a description of the algorithm. This will be printed in the output and will distinguish one algorithm from another.

• getParameters: Returns a list of the names of parameters used by the algorithm.

Quite a number of such algorithms are defined currently in the code. For instance, the PC algorithm using the Fisher Z test is defined as an algorithm for continuous variables, and the PC algorithm with the Chi-square test is defined as an algorithm for discrete data. There are algorithms for mixed data as well. These are organized into distinct subpackages in \proglangTETRAD. To add a new algorithm, one simply needs to instantiate this interface.444Some search procedures operate on multiple data sets instead of only one. For example, the IMaGES algorithm was originally developed for search across the fMRI neural activity measurements of multiple subjects; each subject corresponds to their own data set (Ramsey et al., 2011). For such algorithms, the correct interface is not Algorithm but rather MultiDataSetAlgorithm. This is only relevant if the user has an interest in implementing another search procedure which operates on multiple data sets. Examples are given below.

## 4 What is a simulation?

If one wants to know how well an algorithm does on data in practice (not in theory), there are really only two ways to find out. One way is to use real data for which some gold-standard information is known. If one knows that causes , then when one runs an algorithm on variables including and , the output ought to show that causes or at least be compatible with that conclusion. The other way is to simulate data from a known model and see how well the algorithm can recover the true structure. We call this a simulation study. Since the user knows everything about the true model in a simulation study, they can test all predictions that the algorithm makes, not just one or a few.

In order to do a simulation study, the user needs a procedure to simulate data from a true model. We call such a procedure a simulation method. This is embodied in the following interface: {CodeChunk} {CodeInput} public interface Simulation void simulate(Parameters parameters); int getNumDataSets(); Graph getTrueGraph(); DataSet getDataSet(int index); DataType getDataType(); String getDescription(); List<String> getParameters(); The interface has seven methods:

• simulate: When this is called, a new graph is created and new simulated data set is created for each run. The procedure by which the random graph is constructed depends on the simulation type; see below and Section 10.

• getNumDataSets: Returns the number of data sets to simulate (or load from a file).

• getTrueGraph: Returns the graph used to generate the data for the model at the given index, from 0 up to one less than getNumDataSets.

• getDataSet: Returns data for the model at the given index, from 0 up to one less than getNumDataSets.

• getDataType: Returns the type of data simulated – continuous, discrete, or mixed.

• getDescription: Returns a description of the simulation.

• getParameters: Returns a list of names of parameters used in the simulation.

Note that adding new simulations is not difficult from a programming perspective (though it may be non-trivial to design the random graph or data generation procedures); the user simply copies one of the existing simulations, gives a different name, and adjusts its methods.

The data type is included to make sure that the data type of the simulation is matched to the data type of the algorithm. If you simulate continuous data, you cannot use a discrete variable algorithm to analyze it; that will throw exceptions. Algorithms for which there is a mismatch of this type are simply skipped in the analysis.

Simulations follow the convention that all data sets and graphs are created first and saved and then returned as requested using the getTrueGraph and getDataSet methods. This is done for consistency, in case someone wants to retrieve the first graph twice, for instance, so that the same graph is always returned, but it is also done because graphs loaded from a file are best treated this way – that is, all graphs and data sets are first loaded and then returned on demand by the getTrueGraph and getDataSet methods.

Note that most simulation methods create a new random DAG according to a procedure which begins with an empty graph and iteratively adds directed edges according to some fixed probability. There are alternative random graph generating procedures, as well as procedures which generate cyclic graphs, scale-free graphs, or other varieties. The construction of graphs and the definition of simulation methods is separated out into different functionalities, so the user passes a type of random graph into a simulation method to generate data from that type of graph. (The user may also generate data from a specific graph, perhaps to evaluate performance on a causal structure of particular scientific interest, by loading in a graph file and generating data from that model.)

There are many simulation parameters which the user may adjust. For example, one parameter (“numLatents”) controls the number of latent variables to include in the true graph. The latent variables produced here are unmeasured common causes of two or more measured variables, so they act as confounders. Further, the user may adjust the range of edge coefficients (strengths of direct causal connections) which are by default sampled uniformly from . The lower and upper bound of this range can be adjusted by setting the “coefLow” and “coefHigh” parameters. The sparsity of the random graph is controlled by setting the average degree of the graph (the number of edge endpoints at a vertex), a parameter called “avgDegree." Furthermore, the sparsity can be controlled by setting the parameters “maxDegree,” “maxOutdegree,” and “maxIndegree”; the simulation procedure will be forced to satisfy these specifications in generating random graphs.555Note that if the user sets relatively low bounds on the maximum degree, out-degree, or in-degree of the graph, the generated graph may not achieve the desired average degree, but rather something lower. To specify non-linear functions or non-Gaussian error distributions for non-linear, non-Gaussian simulations, the user modifes yet other parameters. The example scripts later in this paper illustrate how to set parameters, though the relevant parameters vary with the type of simulation. Generating a configuration file, as described in Section 10, will indicate to the user what parameters are able to be modified and what their default values are.

## 5 What is a performance statistic?

Performance statistics can also be added modularly to the comparison tool, though several are already provided to choose from. As with algorithms and simulations, there is an interface that all performance statistics must implement, as follows: {CodeChunk} {CodeInput} public interface Statistic String getAbbreviation(); String getDescription(); double getValue(Graph trueGraph, Graph estGraph); double getNormValue(double value); Statistics are defined for an estimated graph with respect to a true graph. The methods are as follows:

• getAbbreviation. This returns a hopefully short, unique abbreviation for the statistic, such as “AP" for Adjacency Precision. This will be printed at the top of each column for that statistic.

• getDescription. This is a description of the statistic, such as “Adjacency Precision." At the beginning of the report, these descriptions will be printed so that the user knows what they are.

• getValue. This returns the value of the statistic, given the true graph and estimated graph.

• getNormValue. This returns a normalized value of the statistic between 0 and 1, for which higher values are better. A utility function can be calculated based on weighted combinations of performance statistics, and this utility can be used to sort the rows of the resulting performance summary table. The normalized values of the statistic are necessary for comparing different performance statistics on a common scale, since some statistics (e.g., Adjacency Precision) are naturally bounded between 0 and 1, whereas others (such as Structural Hamming Distance) are not. More details in Section 7.

## 6 What are independence tests and scores?

In general, an independence test is a function from pairs of variables and sets of conditioning variables to true or false. For instance, you may want to know whether and are independent conditional on . If they are independent the function should return true; otherwise, if they are dependent, it should return false. For the purposes of the algorithm comparison tool, an interface is defined to take a data set and some parameters and return such an independence test: {CodeChunk} {CodeInput} public interface IndependenceWrapper IndependenceTest getTest(DataSet dataSet, Parameters parameters); String getDescription(); DataType getDataType(); List<String> getParameters(); The methods have familiar names and functionality, except for the first method; getTest takes a data set and some parameters as input and returns an independence test object. Implementations of the Algorithm interface can then use this independence test object to execute a search. For instance, the PC algorithm takes an independence test of this form and to estimate a Pattern; in the examples below, we specifically use an instantiation of the IndependenceWrapper interface called “FisherZ" as input to the PC algorithm. This lets the PC algorithm use the Fisher Z independence test to decide which independence contraints hold in the data. To run this, the user passes the FisherZ implementation of IndependenceWrapper as an argument to the PC algorithm constructor. To use a different independence test, the user simply passes a different implementation of the IndependenceWrapper interface (e.g., one which returns a test for non-Gaussian data) as argument to the PC constructor.666The class IndependenceTest is an existing interface in \proglangTETRAD; this is implemented in a number of different ways. However, IndependenceTest instantiations take DataSet objects as arguments to their constructors, which make them difficult to deal with in scripts, where data sets are not antecedently known. IndependenceWrapper solves this problem by moving the DataSet to an argument of a method instead of an argument to a constructor.

Scores are handled in the same way, using a parallel interface for scores: {CodeChunk} {CodeInput} public interface ScoreWrapper Score getScore(DataSet dataSet, Parameters parameters); String getDescription(); DataType getDataType(); List<String> getParameters();

Algorithms that take scores may use scores wrapped with an instantiation of this interface as arguments to their constructors, just as with independence tests. For example, the GES algorithm typically uses a BIC score to perform a greedy search for continuous variables or the BDeu score for discrete variables.

Several wrappers for independence tests and scores are included in the “independence" and “score" subpackages of the algorithm tool package. More can easily be added if their implementations are available.

## 7 Deferring to the user’s preferences

Different users may for completely valid reasons have totally different ideas, or subtly different ideas, as to what makes for a good algorithm. One user may be aiming for methods that get the undirected structure of a graph right, whereas another user may be mainly interested in methods that get the orientations right. One user may be interested making sure that the adjacencies that are estimated are found in the true model (adjacency precision), whereas another user may only care that as many true adjacencies are found as possible (adjacency recall). Yet another user may be looking for the right balance of precision and recall for adjacencies, which is best measured using the Matthew’s correlation or the F1 score. Even yet another user may be instead interested in a search procedure that can run quickly, and among those procedures look for the one with the highest Matthew’s correlation. It is not that these choices are not objective; rather, different researchers have different goals. It is not even that the goals are only personal preferences – they may be imposed by the structure of the scientific problem that is being solved. There ought to be a flexible way to adapt algorithm comparison to these preferences.

One idea777Suggested by Gregory Cooper, personal communication. is to place a weight over the statistics that are calculated for the algorithm and simply sort the output list of algorithms high to low by a weighted utility function. This we have done; the user can associate with each statistic a weight, a number between 0 and 1, and the utility of each algorithm is calculated as a weighted sum of the normalized performance statistics. While this does not capture all possible decision rules one may have for ranking algorithms, it is quite flexible and does not sacrifice any information (since the entire battery of results is in any case printed). The utility function is a convenient way of sorting the performance results. It is of course possible that the user just examines the summary table visually to determine which results they like best; this can be done no matter what order the rows are printed in. The utility function lets the user more precisely weigh different performance statistics in constructing an ordering over the algorithms. We show how to do this below.

Randomly generated graphs and their simulated data sets can easily be saved to a directory and loaded back in for further analysis. This also allows the algorithm tool to combine nicely with other software packages; so long as the formatting and directory/filename conventions are met, data and graphs from other packages can be loaded into the tool and analyzed as if they were simulated data. Also, data simulated in the \proglangTETRAD environment can be exported to other tools.888Thanks to Peter Spirtes for suggesting that files be saved to a directory for future reuse.

The following is an example script which saves graphs and simulated data to a directory.

{CodeChunk}{CodeInput}

public class ExampleSave public static void main(String… args) Parameters parameters = new Parameters();

parameters.put("numRuns", 10); parameters.put("numMeasures", 100); parameters.put("avgDegree", 4); parameters.put("sampleSize", 100, 500, 1000);

Simulation simulation = new SemSimulation(new RandomForward()); Comparison comparison = new Comparison(); comparison.setShowAlgorithmIndices(true); comparison.saveToFiles("comparison", simulation, parameters); This creates a series of new simulations as indicated, with different sample sizes, saving all of the data and graphs to the specified directories. Graphs for each are saved in the subdirectory “graph"; data files for each are saved into the subdirectory “data", numbered 1 up to the number of runs, and a file with parameter settings is saved into the file “parameters.txt" in the “comparison/save1" directory. The data files are all tab-delimited, with variable names in the first row. The graph files contain a list of nodes followed by a numbered list of edges. To examine the exact format, simply generate files using the method above and examine one of the graph files.

Notice that the parameter “sampleSize” has been given three values. The procedure for saving simulations will create a new simulation for each sample size; these will be saved in parallel subdirectories of the directory “comparison/save1". They can be loaded back in as a group by referring to the same directory, which we will do below. Any simulation parameter may be given multiple values in this way; separate simulations will be saved for all combinations of these parameter values.

In the next example, we load in previously saved simulated data and perform structure search with several algorithms, comparing the results.

{CodeChunk}{CodeInput}

public class ExampleCompareFromFiles public static void main(String… args) Parameters parameters = new Parameters(); parameters.set("alpha", 1e-4);

Statistics statistics = new Statistics();

statistics.setWeight("AP", 1.0); statistics.setWeight("AR", 0.5); statistics.setWeight("AHP", 1.0); statistics.setWeight("AHR", 0.5);

Algorithms algorithms = new Algorithms();

Comparison comparison = new Comparison(); comparison.setShowAlgorithmIndices(false); comparison.setShowSimulationIndices(false); comparison.setSortByUtility(true); comparison.setShowUtilities(true);

comparison.compareFromFiles("comparison", algorithms, statistics, parameters); The various algorithms, statistics and simulations that at are desired are passed in as indicated. We have number of basic algorithms available and a variety of independence tests; these can be used in various combinations to produce quite a large number of possible results to compare. In the script, we have included PC, Conservative PC, PC-stable, and Conservative PC-stable using the Fisher Z independence test. The user must specify a simulation method. Above we simulated data from linear Gaussian structural equation models. The user can change this simply by choosing a different compatible simulation. The output of this script when run is as follows:

Sat Jul 23 14:37:52 EDT 2016

Statistics:

McArrow = Matthew’s correlation coefficient for adjacencies
F1Arrow = F1 statistic for arrows
SHD  = Structural Hamming Distance
E = Elapsed Time in Seconds

Parameters:

numRuns = 10

Simulation:

Simulation 1:
Load data sets and graphs from a directory.

Linear, Gaussian SEM simulation
numMeasures = 100
numLatents = 0
avgDegree = 4
maxDegree = 100
maxIndegree = 100
maxOutdegree = 100
connected = 0
numRuns = 10
varLow = 1
varHigh = 3
sampleSize = 100

Simulation 2:
Load data sets and graphs from a directory.

Linear, Gaussian SEM simulation
numMeasures = 100
numLatents = 0
avgDegree = 4
maxDegree = 100
maxIndegree = 100
maxOutdegree = 100
connected = 0
numRuns = 10
varLow = 1
varHigh = 3
sampleSize = 500

Simulation 3:
Load data sets and graphs from a directory.

Linear, Gaussian SEM simulation
numMeasures = 100
numLatents = 0
avgDegree = 4
maxDegree = 100
maxIndegree = 100
maxOutdegree = 100
connected = 0
numRuns = 10
varLow = 1
varHigh = 3
sampleSize = 1000

Algorithms:

1. PC ("Peter and Clark") using Fisher Z test
2. CPC (Conservative "Peter and Clark") using Fisher Z test
3. PC-stable ("Peter and Clark" stable) using Fisher Z test
4. CPC-stable (Conservative "Peter and Clark" stable) using Fisher Z test

Weighting of statistics:

U =
1.0 * f(AP)
0.5 * f(AR)
1.0 * f(AHP)
0.5 * f(AHR)

Note that f for each statistic is a function that maps the statistic to the
interval [0, 1], with higher being better.

AVERAGE STATISTICS

All edges

Alg  Sim     AP     AR    AHP    AHR  McAdj  McArrow  F1Adj  F1Arrow     SHD       E      U
4    3  0.933  0.592  0.971  0.858  0.735    0.909  0.723    0.911  169.300  0.985  0.657
4    2  0.948  0.561  0.987  0.804  0.722    0.886  0.705    0.886  177.900  0.543  0.654
2    3  0.893  0.602  0.940  0.897  0.725    0.915  0.719    0.918  176.400  0.706  0.646
2    2  0.914  0.572  0.966  0.827  0.714    0.889  0.703    0.891  181.000  0.401  0.645
4    1  0.945  0.311  0.999  0.818  0.533    0.900  0.468    0.900  265.400  0.199  0.627
2    1  0.920  0.323  0.998  0.814  0.536    0.897  0.477    0.897  261.400  0.118  0.622
3    3  0.933  0.592  0.761  0.951  0.735    0.843  0.723    0.845  175.000  1.424  0.616
3    2  0.945  0.560  0.778  0.919  0.720    0.837  0.703    0.842  188.500  0.535  0.616
1    2  0.914  0.572  0.799  0.922  0.714    0.851  0.703    0.856  181.000  0.229  0.615
1    3  0.893  0.602  0.765  0.957  0.725    0.848  0.719    0.850  173.100  0.511  0.610
1    1  0.920  0.323  0.889  0.886  0.536    0.883  0.477    0.887  270.500  0.256  0.604
3    1  0.945  0.311  0.844  0.878  0.533    0.854  0.468    0.860  277.700  0.200  0.596

STANDARD DEVIATIONS

All edges

Alg  Sim     AP     AR    AHP    AHR  McAdj  McArrow  F1Adj  F1Arrow    SHD       E      U
4    3  0.025  0.043  0.011  0.024  0.035    0.016  0.037    0.016  18.792  0.234  0.657
4    2  0.014  0.022  0.006  0.018  0.017    0.011  0.019    0.011   7.564  0.091  0.654
2    3  0.039  0.037  0.022  0.026  0.039    0.022  0.038    0.021  21.619  0.127  0.646
2    2  0.024  0.026  0.017  0.020  0.023    0.012  0.024    0.012  12.028  0.090  0.645
4    1  0.031  0.017  0.002  0.017  0.021    0.010  0.021    0.011   6.818  0.054  0.627
2    1  0.039  0.019  0.003  0.021  0.025    0.012  0.024    0.013   8.316  0.046  0.622
3    3  0.025  0.043  0.016  0.017  0.035    0.011  0.037    0.011  20.160  0.308  0.616
3    2  0.019  0.022  0.019  0.017  0.019    0.007  0.020    0.007   8.810  0.166  0.616
1    2  0.024  0.026  0.026  0.014  0.023    0.015  0.024    0.015  10.985  0.050  0.615
1    3  0.039  0.037  0.029  0.012  0.039    0.019  0.038    0.019  20.755  0.079  0.610
1    1  0.039  0.019  0.028  0.014  0.025    0.010  0.024    0.009   9.336  0.148  0.604
3    1  0.031  0.017  0.023  0.014  0.021    0.012  0.021    0.011   7.243  0.054  0.596

WORST CASE

All edges

Alg  Sim     AP     AR    AHP    AHR  McAdj  McArrow  F1Adj  F1Arrow     SHD       E      U
4    3  0.889  0.530  0.949  0.819  0.690    0.882  0.673    0.884  145.000  0.573  0.657
4    2  0.923  0.520  0.979  0.773  0.693    0.867  0.671    0.866  168.000  0.452  0.654
2    3  0.835  0.555  0.900  0.848  0.670    0.884  0.667    0.889  147.000  0.470  0.646
2    2  0.872  0.525  0.941  0.789  0.680    0.865  0.669    0.866  163.000  0.205  0.645
4    1  0.899  0.280  0.994  0.785  0.499    0.882  0.431    0.880  252.000  0.125  0.627
2    1  0.863  0.290  0.994  0.772  0.495    0.874  0.436    0.871  244.000  0.067  0.622
3    3  0.889  0.530  0.744  0.932  0.690    0.829  0.673    0.833  145.000  0.990  0.616
3    2  0.909  0.520  0.754  0.887  0.693    0.826  0.671    0.831  177.000  0.255  0.616
1    2  0.872  0.525  0.761  0.897  0.680    0.826  0.669    0.832  167.000  0.150  0.615
1    3  0.835  0.555  0.724  0.944  0.670    0.819  0.667    0.820  146.000  0.412  0.610
1    1  0.863  0.290  0.837  0.865  0.495    0.861  0.436    0.867  254.000  0.089  0.604
3    1  0.899  0.280  0.808  0.849  0.499    0.837  0.431    0.843  265.000  0.167  0.596


The definitions of the output statistics, the parameter settings, the simulation method used, the list of algorithms to be compared, and the utility function specified are all printed, followed by three tables. Ten runs were requested. The first table reports average statistic value across the ten runs for each performance statistic, algorithm, and simulation. The second table shows the standard deviations of these averages. The third table, instead of reporting the average statistic value across the ten runs, reports instead the worst statistic in each case. Note that if only one run were requested, the standard deviation tables would be filled with “NaN" (Not a Number) symbols, since that would entail a division by zero.

Notice that the various combinations of simulations and algorithms are all represented as different rows in this output. This code loads in the simulated data and graphs just saved and passes them to the compareFromFiles method, along with the algorithms to test, the statistics to compare them on, and the parameter settings needed to run all of the algorithms. If a parameter value is not set, and a default value is known for it, the default values will be used. In any case, all parameter values used will be printed in the output. Any parameters values that were used when the data were simulated will override any parameter values set in this script. Output is written to the file “comparison/Comparison.txt".

Performance statistics may be added or removed, though adding too many statistics will make the summary table very wide. Note that a “U" statistic is added to the end of this list to indicate the value of the utility function by which the algorithms are sorted high to low. These two methods, {CodeChunk} {CodeInput} comparison.setSortByUtility(true); comparison.setShowUtilities(true); can be used to control whether the output is sorted by the utility value for each simulation/algorithm combination, and whether a “U” column of utilities is included in the output table. This allows the user to spot ties, for instance, or to see whether certain algorithms are very close in their utilities. No user-specified statistic should be named “U" or else there will be two “U" columns. Note also that the user may elect to print parameter settings to the output table by inserting statistics.add(new ParameterColumn("alpha")); where “alpha” may be any parameter. Finally, though by default the output of a Pattern search algorithm (e.g., PC, GES) is compared to the Pattern (a.k.a. CPDAG) implied by the true graph, and the output of a PAG search algorithm (e.g., FCI, GFCI) is compared to the PAG implied by the true graph (marginalizing over latent variables), the user may override these defaults and choose to compare the estimated graphs to (for example) the true DAG for each run. The syntax for such an override requires calling: {CodeChunk} {CodeInput} comparison.setComparisonGraph(ComparisonGraph.true_DAG); before executing one of the main comparison methods. true_DAG can be replaced by true_Pattern or true_PAG.

## 9 Running simulations without saving the data

For simulations where the data need not be saved, the simulation output can be used directly. The user may elect not to save simulation data if there are too many files, or if they do not plan to run further comparisons with the same data, or if the files are too large for the data loaders to comfortably handle or store.

Here is an example: {CodeChunk} {CodeInput} public class ExampleCompareSimulation public static void main(String… args) Parameters parameters = new Parameters();

parameters.set("numRuns", 10); parameters.set("numMeasures", 100); parameters.set("avgDegree", 4); parameters.set("sampleSize", 500); parameters.set("alpha", 1e-4, 1e-3, 1e-2);

Statistics statistics = new Statistics();

statistics.setWeight("AP", 1.0); statistics.setWeight("AR", 0.5);

Algorithms algorithms = new Algorithms();

Simulations simulations = new Simulations();

Comparison comparison = new Comparison();

comparison.setShowAlgorithmIndices(true); comparison.setShowSimulationIndices(true); comparison.setSortByUtility(true); comparison.setShowUtilities(true);

comparison.compareFromSimulations("comparison", simulations, algorithms, statistics, parameters); The specific parameters that an algorithm uses are returned by the method getParameters on the algorithm object. In the example, we wish to maximize adjacency precision, with some deference to adjacency recall, so we give “AP" a weight of 1.0 and “AR" a weight of 0.5.

Note also that one of our specified algorithms is a “hybrid” which runs fGES first, and then CPC using the output of fGES as a starting graph. Many algorithms can take an initial graph as input, and the syntax demonstrates how to instantiate such “hybrid” algorithms (i.e., pass an algorithm constructor for the starting graph, fGES in our case, as a parameter to another algorithm constructor, CPC in our case).

With all that in place, the user simply executes the main method of one of these classes.999In IntelliJ IDEA, right-click on the name of class with the main method and select “Run." We show the output from the second example, because it involves two simulations and four search algorithms with three different values for tuning parameters:

Sat Jul 23 14:38:30 EDT 2016

Statistics:

McArrow = Matthew’s correlation coefficient for adjacencies
F1Arrow = F1 statistic for arrows
SHD  = Structural Hamming Distance
E = Elapsed Time in Seconds

Parameters:

numMeasures = 100
numLatents = 0
avgDegree = 4
maxDegree = 100
maxIndegree = 100
maxOutdegree = 100
connected = 0
numRuns = 10
varLow = 1
varHigh = 3
sampleSize = 500

Simulation:

Linear, Gaussian SEM simulation
Algorithms:

1. PC ("Peter and Clark") using Fisher Z test, alpha = 1.0E-4
2. PC ("Peter and Clark") using Fisher Z test, alpha = 0.001
3. PC ("Peter and Clark") using Fisher Z test, alpha = 0.01
4. CPC (Conservative "Peter and Clark") using Fisher Z test
with initial graph from FGES (Fast Greedy Equivalence Search) using Sem BIC Score, alpha = 1.0E-4
5. CPC (Conservative "Peter and Clark") using Fisher Z test
with initial graph from FGES (Fast Greedy Equivalence Search) using Sem BIC Score, alpha = 0.001
6. CPC (Conservative "Peter and Clark") using Fisher Z test
with initial graph from FGES (Fast Greedy Equivalence Search) using Sem BIC Score, alpha = 0.01
7. PC-stable ("Peter and Clark" stable) using Fisher Z test, alpha = 1.0E-4
8. PC-stable ("Peter and Clark" stable) using Fisher Z test, alpha = 0.001
9. PC-stable ("Peter and Clark" stable) using Fisher Z test, alpha = 0.01
10. CPC-stable (Conservative "Peter and Clark" stable) using Fisher Z test, alpha = 1.0E-4
11. CPC-stable (Conservative "Peter and Clark" stable) using Fisher Z test, alpha = 0.001
12. CPC-stable (Conservative "Peter and Clark" stable) using Fisher Z test, alpha = 0.01

Weighting of statistics:

U =
1.0 * f(AP)
0.5 * f(AR)

Note that f for each statistic is a function that maps the statistic to the
interval [0, 1], with higher being better.

AVERAGE STATISTICS

All edges

Alg     AP     AR    AHP    AHR  McAdj  McArrow  F1Adj  F1Arrow     SHD       E      U
6  0.980  0.677  0.964  0.919  0.809    0.939  0.801    0.941  133.100  1.369  0.659
5  0.988  0.633  0.974  0.890  0.784    0.928  0.771    0.930  148.600  1.738  0.652
9  0.956  0.671  0.735  0.964  0.795    0.833  0.788    0.834  137.100  0.507  0.646
12  0.956  0.671  0.974  0.862  0.795    0.913  0.788    0.915  134.700  0.569  0.646
4  0.988  0.598  0.977  0.865  0.761    0.916  0.744    0.917  162.200  2.644  0.643
8  0.965  0.623  0.747  0.954  0.769    0.836  0.757    0.838  156.100  0.426  0.638
11  0.965  0.623  0.981  0.844  0.769    0.906  0.757    0.907  152.000  0.441  0.638
3  0.933  0.685  0.760  0.972  0.793    0.852  0.790    0.852  132.700  0.322  0.638
2  0.947  0.631  0.794  0.961  0.766    0.867  0.758    0.869  154.400  0.277  0.632
7  0.969  0.577  0.766  0.937  0.740    0.839  0.723    0.842  175.200  0.427  0.629
10  0.969  0.577  0.982  0.824  0.740    0.895  0.723    0.896  169.800  0.407  0.629
1  0.944  0.587  0.809  0.948  0.737    0.870  0.723    0.873  171.800  0.544  0.619

STANDARD DEVIATIONS

All edges

Alg     AP     AR    AHP    AHR  McAdj  McArrow  F1Adj  F1Arrow    SHD       E      U
6  0.008  0.032  0.013  0.015  0.020    0.010  0.023    0.010  12.662  0.202  0.659
5  0.004  0.030  0.011  0.026  0.019    0.014  0.023    0.014  12.563  0.226  0.652
9  0.015  0.034  0.024  0.011  0.023    0.012  0.025    0.013  16.017  0.133  0.646
12  0.015  0.034  0.011  0.013  0.023    0.007  0.025    0.007  14.863  0.116  0.646
4  0.007  0.028  0.014  0.029  0.019    0.016  0.023    0.016  11.905  0.590  0.643
8  0.012  0.031  0.019  0.011  0.022    0.012  0.024    0.012  15.029  0.114  0.638
11  0.012  0.031  0.010  0.016  0.022    0.010  0.024    0.010  12.815  0.068  0.638
3  0.013  0.029  0.027  0.009  0.021    0.016  0.022    0.017  12.667  0.096  0.638
2  0.012  0.029  0.021  0.006  0.022    0.013  0.024    0.013  13.302  0.087  0.632
7  0.016  0.026  0.020  0.012  0.021    0.013  0.023    0.013  12.985  0.115  0.629
10  0.016  0.026  0.009  0.019  0.021    0.009  0.023    0.010  11.292  0.062  0.629
1  0.015  0.028  0.034  0.012  0.022    0.021  0.024    0.020  14.172  0.313  0.619

WORST CASE

All edges

Alg     AP     AR    AHP    AHR  McAdj  McArrow  F1Adj  F1Arrow     SHD       E      U
6  0.964  0.620  0.938  0.896  0.768    0.928  0.756    0.931  120.000  1.046  0.659
5  0.983  0.565  0.958  0.846  0.738    0.906  0.717    0.907  134.000  1.350  0.652
9  0.931  0.610  0.698  0.945  0.752    0.813  0.742    0.812  119.000  0.332  0.646
12  0.931  0.610  0.958  0.849  0.752    0.901  0.742    0.903  116.000  0.351  0.646
4  0.976  0.535  0.958  0.818  0.717    0.891  0.693    0.891  148.000  1.970  0.643
8  0.947  0.550  0.713  0.936  0.721    0.816  0.701    0.817  142.000  0.319  0.638
11  0.947  0.550  0.968  0.823  0.721    0.893  0.701    0.894  138.000  0.323  0.638
3  0.915  0.630  0.729  0.960  0.754    0.829  0.748    0.830  119.000  0.185  0.638
2  0.932  0.565  0.762  0.953  0.718    0.845  0.704    0.847  140.000  0.154  0.632
7  0.942  0.510  0.746  0.909  0.693    0.822  0.667    0.826  166.000  0.165  0.629
10  0.942  0.510  0.973  0.793  0.693    0.879  0.667    0.878  160.000  0.311  0.629
1  0.928  0.515  0.766  0.932  0.683    0.840  0.662    0.843  157.000  0.204  0.619


In the summary table, the algorithm with the best utility value is printed at the top. Utility ties or near-ties are noted in the “U" column. Algorithms are printed in the same order in each table.

For continuous or discrete data, tables of the sort above are produced, but for mixed variable algorithms, additional tables are written depending on the types of variables an edge connects. One table is produced with statistics for all edges taken together, then a second table is produced for just the discrete-discrete edges, a third table for the discrete-continuous edges, and finally a fourth table for the continuous-continuous edges, for each table type. These are not shown in the above example. Notice that the user can vary the input parameters (called “tuning parameters") to search algorithms, e.g., the “alpha" value theshold for a conditional independence judgement. If the user specifies multiple values for such parameters, each parameter setting is effectively treated as a different algorithm. One potential benefit of this functionality is to aid in choosing the value of tuning parameters based on algorithm performance.

## 10 Initial configuration

The available simulations, algorithms, independence tests, scores, and performance statistics can be augmented over time. The options available in the current configuration are as follows.

Simulations:

• For continuous data, form a linear Gaussian structural equation model from the randomly generated DAG and simulate i.i.d. data.

• For discrete data, form a multinomial model from the randomly generatred DAG and simulate i.i.d. data. The user needs to specify the number of categories per node.

• For mixed data, use the model given in Lee and Hastie (2013), as developed by Sedgewick et al. (2016). The user needs to specify the number of categories per node as well as the percentage of nodes that should be made discrete.

• Also for mixed data, run a SEM simulation as in the first option and discretize a portion of the nodes using breakpoints. The user needs to specify the percentage of nodes to render discrete as well as the number of categories to use to discretize them.

• For continuous data, form a SEM simulation using non-linear functions and non-Gaussian error terms. The user needs to specify the non-linear functions and error distributions, which can be chosen from a number of options.

• A cyclic (non-recursive) linear Gaussian structural equation model along the lines of the first option.

• A version of the first option which uses a scale-free graph.

• A (stationary) time series SEM simulation. This is a version of the first option which follows the form of a first-order structural vector autoregression.

Pattern search algorithms:

• PC (takes an independence test as input)

• CPC (takes an independence test as input)

• PC-stable (takes an independence test as input)

• CPC-stable (takes an independence test as input)

• fGES (takes a score as input)

• IMaGES (takes a score and multiple data sets as input)

PAG search algorithms:

• FCI (takes an independence test as input)

• RFCI (takes an independence test as input)

• CFCI (takes an independence test as input)

• GFCI (takes a score and independence test as input)

• tsFCI (a time series version of FCI, takes an independence test as input)

• tsGFCI (a time series version of GFCI, takes a score and independence test as input)

• CCD (takes an independence test as input, for linear cyclic models)

Other continuous-data algorithms:

• LiNGAM

• GLASSO (graphical lasso)

Other mixed-data algorithms:

• MGM

• MixedFGES (heuristic using the BDeu or BIC score, treating all variables as discrete or treating all variables as continuous, respectively)

Latent variable clustering algorithms:

• FOFC (Find One-Factor Clusters)

• FTFC (Find Two-Factor Clusters)

• BPC (BuildPureClusters)

Independence Tests:101010See Spirtes et al. (2000) for a description of several of these tests: Fisher Z, Chi-square, and G-square. The Mixed Multinomial Logistic Regression Wald Test, Conditional Gaussian Likelihood Ratio Test, and Conditional Gaussian BIC score are described in Ramsey (In preparation).

• Fisher Z

• Chi-square

• G-square

• Mixed Multinomial Logistic Regression Wald Test

• Conditional Correlation (for non-linear, non-Gaussian data; see Ramsey (2014))

• Conditional Gaussian Likelihood Ratio Test

• SEM BIC score-based test

• d-separation test directly on the graph

Scores:

• SEM BIC score (continuous data)

• Discrete BIC score (discrete data)

• BDeu score (discrete data)

• Conditional Gaussian BIC score (continuous data)

• d-separation score directly on the graph

Statistics:

• Arrow Precision

• Arrow Recall

• Matthew’s Correlation for Arrows

• F1 Statistic for Arrows

• Structural Hamming Distance (SHD)

• Elapsed Time

These algorithms, tests, and scores can be used in any combination compatible across data types. So continuous tests (e.g., Fisher Z) cannot be used with discrete data. Most algorithms are also able to take an “initial graph” as input, so they can be combined in sequence. As in our example in Section 9, the user may search for a model using fGES, and use this as input to CPC to create a “hybrid” algorithm. Additions to this list are already underway, to accommodate more scores, tests and algorithms. The user may examine the current list of implemented modules by constructing a “configuration report,” which includes a list of relevant parameters for each module: {CodeChunk} {CodeInput} new Comparison().configuration("comparison/Config.txt"); This creates a file in the directory specified which lists all available search algorithms, simulations, independence tests, scores, and performance statistics. The file also indicates for each class what parameters are able to modified, and what the default settings are.

## 11 Comparing algorithms across platforms

The comparison tool described has additional features may be used to compare causal search algorithms from different platforms and languages. As already discussed, popular methods are written in or accessed by \proglangMATLAB and by \proglangR, with underlying code written sometimes in \proglangC or \proglangC++. It is possible that in the near future, algorithms in \proglangPascal or \proglangPython will need to be included as well. One can in some cases script these algorithms from \proglangJava, but even when feasible, doing so can distort the running times. Out of fairness, \proglangMATLAB implementations should be run in \proglangMATLAB, \proglangR implementations in \proglangR, \proglangJava implementations in \proglangJava, and so on; these are the preferred ways of accessing those algorithms and the ways that are documented. It is possible that some runs for particular algorithms on particular platforms may take a very long time, and so it is preferable that these runs (and all runs, really) need not be repeated for every comparison or evaluation. It is also preferable not to repeat runs in order to obtain a fixed table of results rather than one that changes from compilation to compilation, since some algorithms may rely on random perturbations. Since the output graphs and elapsed runtimes from different platforms may not be consistently formatted, one may either write code in particular projects to output results in the \proglangTETRAD format, or one may write parsers in the \pkgalgcomparison package to read in results in the native format of the alternative platforms.

We opt for the latter approach and introduce the notion of an “external” algorithm in the comparison tool, an abstract class that extends the Search interface and allows for graphs and elapsed times from individual projects to be parsed into the internal format the comparison tool uses. We have prepared five extensions of this abstract class:

2. ExternalAlgorithmBnlearnMmhc.java

3. ExternalAlgorithmBNTPc.java

4. ExternalAlgorithmPcalgPc.java

5. ExternalAlgorithmPcalgGes.java

The first of these will load in graphs and elapsed times as written to files in \proglangTETRAD. The second will load in graphs in the format saved by \pkgbnlearn, an \proglangR package (Scutari, 2010). The third will load in outputs and elapsed time in the format used by \pkgBayes Net Toolbox in \proglangMATLAB (Murphy, 2001). The fourth and fifth will load output graphs and elapsed times in the format used by \pkgpcalg in \proglangR, specifically for the PC algorithm and GES algorithm output formats (Kalisch et al., 2012).111111In some cases platforms may produce output in multiple formats, and we chose convenient representations to import to our comparison tool. In each platform, the user may write a script to execute the individual runs and save the estimed graphs and elapsed runtimes in an appropriate directory structure. The user then may write a script (or modify the one we include as an example) for the comparison tool; this script loads the estimated graphs and – using the ExternalAlgorithm API – compares the estimated graphs against the specified “true graphs” (also the elapsed runtimes), where the comparison tool compiles evaluatory statistics in the usual way. While not technically straightforward, and certainly not entirely automatic, we believe this approach instantiates a fair way to compare causal search algorithms across platforms. Of course, the user may rather save estmated graphs from their preferred platform in one of the above formats in lieu of writing additional graph and elapsed runtime parsers. That is, the user may save estimated graphs and elapsed runtimes for a new project in a manner that can be read by the parser in ExternalAlgorithmPcalgPc.java or one of the other available parsers.

To create new ExternalAlgorithm classes with new parsers, the user may model their new class on existing ExternalAlgorithm classes and include overrides for the following methods: search, getComparisonGraph, getDescription, getDataType, and getElapsedTime. The search method must parse a graph from a file and return a Graph object. This takes a data set as an argument. The getElapsedTime method must parse the elapsed time from a file and return the number; again, this takes a data set as an argument. Here is how we implement the method for algorithms like PC implemented in \pkgpcalg:

{CodeChunk}{CodeInput}

public class ExternalAlgorithmPcalgPc extends ExternalAlgorithm static final long serialVersionUID = 23L; private final String extDir; private String shortDescription = null;

public ExternalAlgorithmPcalgPc(String extDir) this.extDir = extDir; this.shortDescription = new File(extDir).getName().replace("_", " ");

public ExternalAlgorithmPcalgPc(String extDir, String shortDecription) this.extDir = extDir; this.shortDescription = shortDecription;

public Graph search(DataModel dataSet, Parameters parameters) int index = getIndex(dataSet);

File file = new File(path, "/results/" + extDir + "/" + (simIndex + 1) + "/graph." + index + ".txt");

System.out.println(file.getAbsolutePath());

GraphUtils.circleLayout(graph, 225, 200, 150);

return graph; catch (IOException e) throw new RuntimeException("Couldn’t parse graph.");

/** * Returns the pattern of the supplied DAG. **/ public Graph getComparisonGraph(Graph graph) return new EdgeListGraph(graph);

public String getDescription() if (shortDescription == null) return "Load data from " + path + "/" + extDir; else return shortDescription;

@Override public DataType getDataType() return DataType.Continuous;

@Override public long getElapsedTime(DataModel dataSet, Parameters parameters) int index = getIndex(dataSet);

File file = new File(path, "/elapsed/" + extDir + "/" + (simIndex + 1) + "/graph." + index + ".txt");

try BufferedReader r = new BufferedReader(new FileReader(file)); String l = r.readLine(); // Skip the first line. return Long.parseLong(l); catch (IOException e) return -99;

public static Graph loadGraphPcAlgMatrix(DataSet dataSet) List<Node> vars = dataSet.getVariables();

Graph graph = new EdgeListGraph(vars);

for (int i = 0; i < dataSet.getNumRows(); i++) for (int j = 0; j < dataSet.getNumColumns(); j++) if (i == j) continue; int g = dataSet.getInt(i, j); int h = dataSet.getInt(j, i);

if (g == 1 && h == 1 && !graph.isAdjacentTo(vars.get(i), vars.get(j))) graph.addUndirectedEdge(vars.get(i), vars.get(j)); else if (g == 1 && h == 0) graph.addDirectedEdge(vars.get(j), vars.get(i));

return graph;

Implementations for other platforms or packages would simply need to adjust the parsing of graphs and elapsed runtimes. The following is an example script which compares several ExternalAlgorithms One would then construct the Parameters and Statistics objects in the usual way and proceed as follows, for example, in a separate class with a main method, as in the Example classes above:

{CodeChunk}{CodeInput}

public class CompareExternalAlgorithms public static void main(String… args) Parameters parameters = new Parameters();

parameters.set("numRuns", 10);

Statistics statistics = new Statistics();

statistics.setWeight("AP", 1.0); statistics.setWeight("AR", 0.5); statistics.setWeight("AHP", 1.0); statistics.setWeight("AHR", 0.5);

Algorithms algorithms = new Algorithms();

Comparison comparison = new Comparison(); comparison.setShowAlgorithmIndices(true); comparison.setShowSimulationIndices(true); comparison.setSortByUtility(false); comparison.setShowUtilities(false);

comparison.generateReportFromExternalAlgorithms( "/Users/user/comparison-data/output", "/Users/user/causal-comparisons/output", algorithms, statistics, parameters);

A comparison report can be generated by executing this script. Here, we use descriptive names for directory names; these names appear in the report with underscores replaced by spaces. We use three implementations of the PC algorithm just for illustration here – of course the user may rather wish to compare different algorithms. The user may fix the data for all comparisons in advance. Thus, if the user may run new algorithms (or evaluate with new statistics) on the same data in multiple executions of the (modified) script as desired. This makes the comparison very flexible and encourages reproducibility.

## 12 Graphical user interface

A user may elect to use the graphical user interface (GUI) in lieu of writing/modifying \proglangJava scripts like the ones in this document. The GUI effectively executes these scripts in the background, but allows users to point-and-click and use drop-down menus to configure options instead of writing text. The user creates and manipulates modules (for creating data, estimation, search, etc.) in boxes connected by arrows. Options like sample sizes, parameter settings, and so on are set by the user with drop-down menus or by typing in empty fields. Most of the functions described in this document can be executed by the GUI, though not all; currently the GUI remains under development so we hope that complete equivalence between the GUI and script-based operation of this algorithm comparison tool is forthcoming. We do not give detailed instructions on how to use the GUI here for two reasons. First, the GUI has its own internal user-friendly “help" menu and guide. This has step-by-step instructions on which modules to connect and what their functionalities are. Second, the GUI is being updated and improved in response to user feedback. So, the visual appearance and available options of the GUI may change in response to user needs.

## 13 Conclusion

The algorithm tool described here is ready for use and can be found in the development branch of the GitHub repository for \proglangTETRAD:
https://github.com/cmu-phil/tetrad.
It is located in the package:
edu.cmu.tetrad.algcomparison.
The example files above are included in the package:
edu.cmu.tetrad.algcomparison.examples.

Though new components will be developed and added, the modular architecture of the tool should make incorporating new or customized components easy and intuitive. Several utilities for analyzing time series data are currently under development, and additional structure learning algorithms (including pairwise causal orientation methods) will be added shortly. We hope that the flexibility of the tool and the \proglangTETRAD environment will encourage researchers to investigate the performance of their preferred algorithms, as well as alternatives, on simulated data appropriate to their domain, thus aiding in the deliberate and confident endorsement of methods well-suited to particular scientific problems.

## Acknowledgements

We thank many people for help in putting this report together and suggesting various modifications that have made it better, including (but not limited to) Peter Spirtes, Clark Glymour, Greg Cooper, Takis Benos, and A.J. Sedgewick. This research is supported by grant U54HG008540 awarded by the National Institutes of Health.

## Appendix A

https://www.java.com/en/download/help/download_options.xml
The detailed instructions depend on one’s operating system. To edit, compile, and execute the \proglangTETRAD source code, one needs an Integrated Development Environment (IDE). We recommend IntelliJ IDEA CE which can be downloaded (at no cost) here:
https://www.jetbrains.com/idea/download/
Alternative IDEs include Eclipse and NetBeans. Once the user has \proglangJava and an IDE installed, they may follow these detailed instructions to set up the \proglangTETRAD project in IntelliJ IDEA: https://github.com/cmu-phil/tetrad/wiki/Setting-up-Tetrad-in-IntelliJ-IDEA
The basic steps are: 1) Set up a new “project from existing sources” 2) select the “pom.xml” file in the directory “Tetrad” 3) choose all default settings in IntelliJ IDEA, and 4) compile the project with the Maven build tool within IntellIJ IDEA. From this point, the user is ready to edit and execute scripts in the “algcomparison/examples” folder, which contains the scripts referred to in this paper.

## Appendix B

The user may wish to save their output in tab-delimited format, so that they can readily copy and paste the summary tables into a program like Excel. To do this, replace the last line of any of script which generates summary tables with: {CodeChunk} {CodeInput} Comparison comparison = new Comparison(); comparison.setTabDelimitedTables(true); comparison.compareAlgorithms("comparison/Comparison.txt", simulations, algorithms, statistics, parameters);

## References

• Bessler and Lee (2002) Bessler DA, Lee S (2002). “Money and prices: US data 1869–1914 (a study with directed graphs).” Empirical Economics, 27(3), 427–446.
• Chickering (2002) Chickering DM (2002). “Optimal structure identification with greedy search.” Journal of Machine Learning Research, 3, 507–554.
• Chickering and Meek (2002) Chickering DM, Meek C (2002). “Finding optimal Bayesian networks.” In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, pp. 94–102. Morgan Kaufmann Publishers Inc.
• Colombo and Maathuis (2014) Colombo D, Maathuis MH (2014). “Order-independent constraint-based causal structure learning.” Journal of Machine Learning Research, 15(1), 3741–3782.
• Colombo et al. (2012) Colombo D, Maathuis MH, Kalisch M, Richardson TS (2012). “Learning high-dimensional directed acyclic graphs with latent and selection variables.” The Annals of Statistics, pp. 294–321.
• Demiralp and Hoover (2003) Demiralp S, Hoover KD (2003). “Searching for the causal structure of a vector autoregression.” Oxford Bulletin of Economics and Statistics, 65(s1), 745–767.
• Drton and Maathuis (2017) Drton M, Maathuis MH (2017). “Structure learning in graphical modeling.” Annual Review of Statistics and Its Application, 4, 365–393.
• Ebert-Uphoff and Deng (2012) Ebert-Uphoff I, Deng Y (2012). “Causal discovery for climate research using graphical models.” Journal of Climate, 25(17), 5648–5665.
• Glymour et al. (1987) Glymour C, Scheines R, Spirtes P, Kelly K (1987). Discovering Causal Structure: Artificial Intelligence, Philosophy of Science, and Statistical Modeling. Academic Press.
• Heinze-Deml et al. (2018) Heinze-Deml C, Maathuis MH, Meinshausen N (2018). “Causal structure learning.” Annual Review of Statistics and Its Applications, arXiv preprint arXiv:1706.09141.
• Kalisch et al. (2012) Kalisch M, Mächler M, Colombo D, Maathuis MH, Bühlmann P (2012). “Causal inference using graphical models with the \proglangR package \pkgpcalg.” Journal of Statistical Software, 47(11), 1–26.
• Kummerfeld and Ramsey (2016) Kummerfeld E, Ramsey JD (2016). “Causal clustering for 1-factor measurement models.” In KDD ’16: The 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Proceedings.
• Kummerfeld et al. (2014) Kummerfeld E, Ramsey JD, Yang R, Spirtes P, Scheines R (2014). “Causal clustering for 2-factor measurement models.” In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 34–49. Springer.
• Lee and Hastie (2013) Lee JD, Hastie T (2013). “Structure learning of mixed graphical models.” Journal of Machine Learning Research: Workshop and Conference Proceedings (AISTATS 13), pp. 388–396.
• Mills-Finnerty et al. (2014) Mills-Finnerty C, Hanson C, Hanson SJ (2014). “Brain network response underlying decisions about abstract reinforcers.” NeuroImage, 103, 48–54.
• Murphy (2001) Murphy K (2001). “The Bayes Net Toolbox for Matlab.” Computing Science and Statistics, 33(2), 1024–1034.
• Ogarrio et al. (2016) Ogarrio JM, Spirtes P, Ramsey JD (2016). “A hybrid causal search algorithm for latent variable models.” In Proceedings of the Eighth International Conference on Probabilistic Graphical Models, pp. 368–379.
• Ramsey (In preparation) Ramsey JD (In preparation). ‘‘Mixed variable search using the conditional Gaussian likelihood.”
• Ramsey et al. (2017) Ramsey JD, Glymour M, Sanchez-Romero R, Glymour C (2017). “A million variables and more: the Fast Greedy Equivalence Search algorithm for learning high-dimensional graphical causal models, with an application to functional magnetic resonance images.” International Journal of Data Science and Analytics, 3(2), 121–129.
• Ramsey et al. (2011) Ramsey JD, Hanson SJ, Glymour C (2011). “Multi-subject search correctly identifies causal connections and most causal directions in the DCM models of the Smith et al. simulation study.” NeuroImage, 58(3), 838–848.
• Ramsey et al. (2006) Ramsey JD, Spirtes P, Zhang J (2006). “Adjacency-Faithfulness and Conservative Causal Inference.” In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, pp. 401–408. AUAI Press.
• Rau et al. (2013) Rau M, Scheines R, Aleven V, Rummel N (2013). “Does Representational Understanding Enhance Fluency - Or Vice Versa? Searching for Mediation Models.” In Proceedings of the Sixth International Conference on Educational Data Mining (EDM 13).
• Richardson (1996) Richardson T (1996). “A discovery algorithm for directed cyclic graphs.” In Proceedings of the Twelfth International Conference on Uncertainty in Artificial Intelligence, pp. 454–461. Morgan Kaufmann Publishers Inc.
• Richardson and Spirtes (2002) Richardson T, Spirtes P (2002). “Ancestral graph Markov models.” The Annals of Statistics, 30(4), 962–1030.
• Scheines et al. (1998) Scheines R, Spirtes P, Glymour C, Meek C, Richardson T (1998). “The TETRAD project: Constraint based aids to causal model specification.” Multivariate Behavioral Research, 33(1), 65–117.
• Scutari (2010) Scutari M (2010). “Learning Bayesian Networks with the \pkgbnlearn \proglangR Package.” Journal of Statistical Software, 35(3), 1–22.
• Sedgewick et al. (2016) Sedgewick AJ, Shi I, Donovan RM, Benos PV (2016). “Learning mixed graphical models with separate sparsity parameters and stability-based model selection.” BMC Bioinformatics, 17(5), 307.
• Shimizu et al. (2006) Shimizu S, Hoyer PO, Hyvärinen A, Kerminen A (2006). “A linear non-Gaussian acyclic model for causal discovery.” Journal of Machine Learning Research, 7, 2003–2030.
• Shipley et al. (2006) Shipley B, Lechowicz MJ, Wright I, Reich PB (2006). ‘‘Fundamental trade-offs generating the worldwide leaf economics spectrum.” Ecology, 87(3), 535–541.
• Silva et al. (2006) Silva R, Scheines R, Glymour C, Spirtes P (2006). “Learning the structure of linear latent variable models.” Journal of Machine Learning Research, 7, 191–246.
• Smith et al. (2011) Smith SM, Miller KL, Salimi-Khorshidi G, Webster M, Beckmann CF, Nichols TE, Ramsey JD, Woolrich MW (2011). “Network modelling methods for fMRI.” NeuroImage, 54(2), 875–891.
• Spirtes et al. (2000) Spirtes P, Glymour C, Scheines R (2000). Causation, prediction, and search. MIT press.
• Spirtes et al. (1995) Spirtes P, Meek C, Richardson T (1995). “Causal inference in the presence of latent variables and selection bias.” In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 499–506. Morgan Kaufmann Publishers Inc.
• Spirtes et al. (1990) Spirtes P, Scheines R, Glymour C (1990). “Simulation studies of the reliability of computer-aided model specification using the TETRAD II, EQS, and LISREL programs.” Sociological Methods & Research, 19(1), 3–66.
• Spirtes and Zhang (2016) Spirtes P, Zhang K (2016). “Causal discovery and inference: concepts and recent methodological advances.” Applied Informatics, 3(3).
• Zhang (2008a) Zhang J (2008a). “Causal reasoning with ancestral graphs.” Journal of Machine Learning Research, 9, 1437–1474.
• Zhang (2008b) Zhang J (2008b). “On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias.” Artificial Intelligence, 172(16), 1873–1896.
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters