Who Learns Better Bayesian Network Structures: Constraint-Based, Score-Based or Hybrid Algorithms?

Abstract

The literature groups algorithms to learn the structure of Bayesian networks from data into three separate classes: constraint-based algorithms, which use conditional independence tests to learn the dependence structure of the data; score-based algorithms, which use goodness-of-fit scores as objective functions to maximise; and hybrid algorithms that combine both approaches. Famously, Cowell (2001) showed that algorithms in the first two classes learn the same structures when the topological ordering of the network is known and we use entropy to assess conditional independence and goodness of fit.

In this paper we address the complementary question: how do these classes of algorithms perform outside of the assumptions above? We approach this question by recognising that structure learning is defined by the combination of a statistical criterion and an algorithm that determines how the criterion is applied to the data. Removing the confounding effect of different choices for the statistical criterion, we find using both simulated and real-world data that constraint-based algorithms do not appear to be more efficient or more sensitive to errors than score-based algorithms, and that hybrid algorithms are not faster or more accurate than constraint-based algorithms. This suggests that commonly held beliefs on structure learning in the literature are strongly influenced by the choice of particular statistical criteria rather than just by properties of the algorithms themselves.
Department of Statistics, University of Oxford, United Kingdom

Catharina Elisabeth Graafland catharina.graafland@unican.es
José Manuel Gutiérrez manuel.gutierrez@unican.es
Institute of Physics of Cantabria (CSIC-UC), Santander, Spain
Editors: Milan Studený, Jiří Vomlel, Václav Kratochvíl
Keywords: Bayesian networks; structure learning; conditional independence tests; network scores; climate networks.
1 Introduction
Bayesian networks (BNs; Koller and Friedman, 2009) are a class of graphical models defined over a set of random variables X = {X_1, …, X_N}, each describing some quantity of interest, that are associated with the nodes of a directed acyclic graph (DAG) G. (They are often referred to interchangeably.) The structure of the DAG, that is, the pattern of arcs in G, encodes the independence relationships between those variables, with graphical separation in G implying conditional independence in probability. As a result, G induces the factorisation
P(X | G, Θ) = ∏_{i=1}^{N} P(X_i | Π_{X_i}, Θ_{X_i})    (1)
in which the global distribution of X (with parameters Θ) decomposes into one local distribution for each X_i (with parameters Θ_{X_i}) conditional on its parents Π_{X_i}. This decomposition does not uniquely identify a single BN, but groups BNs into equivalence classes (Chickering, 1995) of models that are probabilistically indistinguishable. All BNs in the same equivalence class have the same underlying undirected graph and v-structures (patterns of arcs like X_i → X_k ← X_j, with no arc between X_i and X_j); and each equivalence class is characterised by the completed partially-directed acyclic graph (CPDAG) that arises from the combination of these two quantities.
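The equivalence-class characterisation above is easy to check mechanically. As an illustration (not taken from the paper), here is a minimal Python sketch, using a hypothetical dict-of-parents representation of a DAG, that tests whether two DAGs share the same skeleton and v-structures:

```python
from itertools import combinations

def skeleton(dag):
    """Undirected adjacencies of a DAG given as {node: set of parents}."""
    return {frozenset((u, v)) for v, ps in dag.items() for u in ps}

def v_structures(dag):
    """Arc patterns u -> w <- v in which u and v are not adjacent."""
    skel = skeleton(dag)
    return {(u, w, v)
            for w, ps in dag.items()
            for u, v in combinations(sorted(ps), 2)
            if frozenset((u, v)) not in skel}

def same_equivalence_class(d1, d2):
    """Two DAGs are equivalent iff they share skeleton and v-structures."""
    return skeleton(d1) == skeleton(d2) and v_structures(d1) == v_structures(d2)

d1 = {"A": set(), "B": {"A"}, "C": {"B"}}       # chain A -> B -> C
d2 = {"A": {"B"}, "B": set(), "C": {"B"}}       # fork  A <- B -> C
d3 = {"A": set(), "B": {"A", "C"}, "C": set()}  # collider A -> B <- C
```

Here d1 and d2 are in the same equivalence class (same skeleton, no v-structures), while d3 shares the skeleton of d1 but contains a v-structure, so it lies in a different class.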
While in principle there are many possible choices for the distribution of X, the literature has focused mostly on two sets of assumptions. Discrete BNs (Heckerman et al., 1995) assume that each X_i | Π_{X_i} follows a multinomial distribution; their parameters are the conditional probabilities of X_i given each configuration of the values of its parents. As a result, X is also multinomial. Gaussian BNs (GBNs; Geiger and Heckerman, 1994) assume that the X_i are univariate normals linked by linear dependencies to their parents, in what is essentially a linear regression model of X_i against its parents with regression coefficients β_{X_i}. Equivalently, the Θ_{X_i} can be parameterised with the partial correlations between X_i and each parent given the rest. In either parameterisation X is multivariate normal. Other distributional assumptions have seen less widespread adoption due to the lack of exact conditional inference and simple closed-form estimators (e.g. copulas, Elidan, 2010) or because of limitations in the DAGs they can encode (e.g. conditional linear Gaussian BNs, Lauritzen and Wermuth, 1989).
1.1 Learning a Bayesian Network from Data
The task of learning a BN with DAG G and parameters Θ from a data set D containing n observations is performed in two steps in an inherently Bayesian fashion:

P(G, Θ | D) = P(G | D) · P(Θ | G, D).

Structure learning consists in finding the DAG G that encodes the dependence structure of the data; parameter learning consists in estimating the parameters Θ given the G obtained from structure learning. If we assume that parameters in different local distributions are independent, they can be learned in parallel for each node because (1) then implies

P(Θ | G, D) = ∏_{i=1}^{N} P(Θ_{X_i} | Π_{X_i}, D).
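The per-node decomposition of parameter learning can be sketched concretely. The following illustrative Python fragment (variable names invented for the example) fits the conditional probability table of a single discrete node by maximum likelihood, independently of every other node:

```python
from collections import Counter

def fit_cpt(data, node, parents):
    """Maximum-likelihood conditional probability table P(node | parents),
    estimated from a list of {variable: value} observations."""
    joint = Counter((tuple(r[p] for p in parents), r[node]) for r in data)
    marginal = Counter(tuple(r[p] for p in parents) for r in data)
    return {(pa, v): c / marginal[pa] for (pa, v), c in joint.items()}

# toy data set: B tends to be 1 only when A is 1
data = [{"A": a, "B": int(a == 1 and i % 2 == 0)}
        for i, a in enumerate([0, 0, 1, 1, 1, 1])]
cpt_B = fit_cpt(data, "B", ["A"])  # learned without touching any other node
```

Each call of this kind touches only one local distribution, which is exactly why the estimates can be computed in parallel across nodes.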
On the other hand, structure learning is well known to be computationally challenging, and several algorithms have been proposed to solve it, following one of three possible approaches: constraint-based, score-based and hybrid.
Constraint-based algorithms are based on the seminal work of Pearl on causal graphical models (Verma and Pearl, 1991). The most commonly used among them is the PC algorithm in its PC-Stable implementation (Colombo and Maathuis, 2014). PC-Stable first identifies which pairs of nodes are connected by an arc, regardless of its direction; such pairs cannot be separated by any subset of the other nodes, a condition that is tested heuristically by performing conditional independence tests with increasingly large candidate separating sets. Then the algorithm identifies the v-structures among all pairs of non-adjacent nodes with a common neighbour, using the separating sets found earlier; and it sets the remaining arc directions using the rules from Chickering (1995) to obtain the CPDAG describing the identified equivalence class. More recent algorithms such as Grow-Shrink (GS; Margaritis, 2003) and Inter-IAMB (Yaramakala and Margaritis, 2005) proceed along similar lines, but use faster heuristics to implement the first two steps.
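The first step (skeleton identification with growing separating sets) can be sketched as follows, assuming an abstract conditional-independence oracle ci_test; the symmetric search over the neighbours of the second node is omitted for brevity, so this is a simplified sketch rather than the full PC-Stable procedure:

```python
from itertools import combinations

def neighbours(adj, x):
    return {w for e in adj if x in e for w in e if w != x}

def pc_skeleton(nodes, ci_test, max_sep=2):
    """Skeleton phase of PC-Stable (sketch): start from the complete graph and
    drop the edge x - y as soon as some conditioning set z makes ci_test true,
    trying separating sets of increasing size. Neighbourhoods are frozen at
    the start of each size, which makes the output order-independent."""
    adj = {frozenset(e) for e in combinations(nodes, 2)}
    sepset = {}
    for size in range(max_sep + 1):
        frozen = {x: neighbours(adj, x) for x in nodes}  # PC-Stable snapshot
        for edge in sorted(adj, key=sorted):
            x, y = sorted(edge)
            for z in combinations(sorted(frozen[x] - {y}), size):
                if ci_test(x, y, z):
                    adj.discard(edge)
                    sepset[edge] = set(z)
                    break
    return adj, sepset

# toy oracle encoding the chain A -> B -> C, where A is independent of C given B
oracle = lambda x, y, z: {x, y} == {"A", "C"} and "B" in z
skel, sep = pc_skeleton(["A", "B", "C"], oracle)
```

The separating sets recorded in sepset are exactly what the second step uses to orient v-structures.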
Score-based algorithms represent the application of general optimisation techniques to BN structure learning. Each candidate DAG is assigned a network score reflecting its goodness of fit, which the algorithm then attempts to maximise. Some examples are greedy search, simulated annealing (Bouckaert, 1995) and genetic algorithms (Larrañaga et al., 1997); a comprehensive review of these and other approaches is provided in Russell and Norvig (2009). These heuristics can also be applied to CPDAGs, as in the case of Greedy Equivalence Search (GES; Chickering, 2002).
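As a concrete, simplified illustration of greedy score-based search (hill climbing over single-arc additions, deletions and reversals; the toy score below is made up for the example), one might write:

```python
from itertools import permutations

def is_acyclic(parents):
    """Kahn's algorithm on a graph given as {node: set of parents}."""
    indeg = {x: len(ps) for x, ps in parents.items()}
    queue = [x for x, d in indeg.items() if d == 0]
    seen = 0
    while queue:
        x = queue.pop()
        seen += 1
        for c, ps in parents.items():
            if x in ps:
                indeg[c] -= 1
                if indeg[c] == 0:
                    queue.append(c)
    return seen == len(parents)

def hill_climb(nodes, local_score):
    """Greedily apply the single arc addition, deletion or reversal that most
    increases the (decomposable) score until no move improves it."""
    parents = {x: set() for x in nodes}
    while True:
        best, best_delta = None, 1e-9
        for x, y in permutations(nodes, 2):
            for op in ("add", "del", "rev"):
                cand = {k: set(v) for k, v in parents.items()}
                if op == "add" and x not in cand[y]:
                    cand[y].add(x)
                elif op == "del" and x in cand[y]:
                    cand[y].remove(x)
                elif op == "rev" and x in cand[y]:
                    cand[y].remove(x)
                    cand[x].add(y)
                else:
                    continue
                if not is_acyclic(cand):
                    continue
                # only the local scores of x and y can change
                delta = sum(local_score(n, cand[n]) - local_score(n, parents[n])
                            for n in (x, y))
                if delta > best_delta:
                    best, best_delta = cand, delta
        if best is None:
            return parents
        parents = best

# toy decomposable score rewarding arcs of the "true" DAG A -> B -> C,
# with a unit penalty per parent (mimicking a parameter penalty)
true = {"A": set(), "B": {"A"}, "C": {"B"}}
score = lambda node, pa: 2 * len(pa & true[node]) - len(pa)
```

Because the score decomposes by node, each candidate move only requires re-scoring the two nodes it touches, which is what makes greedy search practical.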
Finally, hybrid algorithms are based on two phases: a restrict phase implementing a constraint-based strategy to reduce the space of candidate DAGs; and a maximise phase implementing a score-based strategy to find the optimal DAG in the restricted space. The best-known member of this family is the Max-Min Hill Climbing algorithm (MMHC) by Tsamardinos et al. (2006); another example was presented in our previous work (RSMAX2; Scutari et al., 2014).
1.2 Conditional Independence Tests and Network Scores
The choice of which conditional independence test or network score to use in structure learning depends mainly on the choice of the distribution of X, and is orthogonal to the choice of algorithm. Here we provide a brief overview of those we will use in this paper, referring the reader to Koller and Friedman (2009) for a more comprehensive treatment.
For discrete BNs, conditional independence tests are functions of the observed frequencies for any pair of variables (X_i, X_j) given the configurations of some conditioning variables Z. The most common is the log-likelihood ratio test
G²(X_i, X_j | Z) = 2 Σ_{a,b,c} n_{abc} log( (n_{abc} n_c) / (n_{ac} n_{bc}) ),    (2)

where n_{abc} is the number of observations with X_i = a, X_j = b and Z = c, and n_{ac}, n_{bc} and n_c are the corresponding marginal counts,
which is equivalent to mutual information and has an asymptotic χ² distribution. For GBNs, conditional independence tests are functions of the partial correlation coefficients ρ_{X_i X_j | Z}. The log-likelihood ratio (and Gaussian mutual information) test takes the form
G²(X_i, X_j | Z) = −n log(1 − ρ²_{X_i X_j | Z});    (3)
other common options are Fisher's Z test and the exact Student's t test for the partial correlation.
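Both statistics can be computed directly from the definitions in (2) and (3). The following sketch, with toy data invented for the example, does so in Python:

```python
from collections import Counter
from math import log

def g2_discrete(data, x, y, z=()):
    """G² from eq. (2): 2 * sum of n_abc * log(n_abc n_c / (n_ac n_bc)),
    over the observed configurations a of x, b of y and c of the set z."""
    n_xyz = Counter((r[x], r[y], tuple(r[v] for v in z)) for r in data)
    n_xz = Counter((r[x], tuple(r[v] for v in z)) for r in data)
    n_yz = Counter((r[y], tuple(r[v] for v in z)) for r in data)
    n_z = Counter(tuple(r[v] for v in z) for r in data)
    return 2 * sum(n * log(n * n_z[c] / (n_xz[(a, c)] * n_yz[(b, c)]))
                   for (a, b, c), n in n_xyz.items())

def g2_gaussian(n, rho):
    """G² from eq. (3) for a partial correlation rho estimated on n samples."""
    return -n * log(1 - rho ** 2)

dep = [{"X": v, "Y": v} for v in (0, 1)] * 5                  # X and Y identical
ind = [{"X": a, "Y": b} for a in (0, 1) for b in (0, 1)] * 3  # balanced table
```

On the perfectly dependent sample the discrete statistic is large (20 log 2 ≈ 13.9), while on the balanced table it is exactly zero, as is the Gaussian statistic at ρ = 0.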
As for network scores, the Bayesian information criterion

BIC(G, Θ | D) = Σ_{i=1}^{N} [ log P(X_i | Π_{X_i}, Θ_{X_i}) − (|Θ_{X_i}| / 2) log n ]    (4)

is a common choice for both discrete BNs and GBNs, as it provides a simple approximation to log P(D | G) that does not depend on any hyperparameter. P(D | G) is also available in closed form for both discrete BNs (Heckerman et al., 1995) and GBNs (Geiger and Heckerman, 1994).
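As an illustration of (4), the sketch below computes the BIC term of a single Gaussian node with at most one parent, fitting the local linear model by least squares; it is a simplified stand-in for the general case with several parents, and the data at the end are invented for the example:

```python
from math import log, pi

def bic_node(y, x=None):
    """BIC term of one Gaussian node, as in eq. (4): maximised log-likelihood
    of the local model minus |Theta|/2 * log n. With no parent the model is
    N(mu, var); with one parent it is the simple linear regression of y on x."""
    n = len(y)
    if x is None:
        mu = sum(y) / n
        res = [v - mu for v in y]
        d = 2                      # mean and variance
    else:
        mx, my = sum(x) / n, sum(y) / n
        beta = (sum((a - mx) * (b - my) for a, b in zip(x, y))
                / sum((a - mx) ** 2 for a in x))
        res = [b - my - beta * (a - mx) for a, b in zip(x, y)]
        d = 3                      # intercept, slope and variance
    var = sum(r * r for r in res) / n    # MLE variance (assumed > 0)
    loglik = -n / 2 * (log(2 * pi * var) + 1)
    return loglik - d / 2 * log(n)

x = list(range(20))
y = [2 * a + 0.1 * ((a % 3) - 1) for a in x]   # strong linear dependence
```

Here bic_node(y, x) > bic_node(y): the likelihood gain from the parent outweighs the extra (1/2) log n penalty, so the score favours keeping the arc.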
2 Performance as a Combination of Tests, Scores and Algorithms
As may be apparent from Sections 1.1 and 1.2, we take the view that the algorithms and the statistical criteria they use are separate and complementary in determining the overall behaviour of structure learning. Cowell (2001) followed the same reasoning when showing that constraint-based and score-based algorithms can select identical discrete BNs. He noticed that the G² test in (2) is proportional to the difference between the log-likelihoods of the models with and without the arc being tested, so it has the same expression as a score-based network comparison based on log-likelihoods. He then showed that these two classes of algorithms are equivalent if we assume a fixed, known topological ordering and we use the log-likelihood and G² as matching statistical criteria.
In this paper we will extend that investigation by addressing the following questions:

Q1. Which of constraint-based and score-based algorithms provide the most accurate structural reconstruction, after accounting for the effect of the choice of statistical criteria?

Q2. Are hybrid algorithms more accurate than constraint-based or score-based algorithms?

Q3. Are score-based algorithms slower than constraint-based and hybrid algorithms?
More precisely, we will drop the assumption that the topological ordering is known and we will compare the performance of different classes of algorithms outside of their equivalence conditions for both discrete BNs and GBNs. We choose questions Q1, Q2 and Q3 because they are the most common among practitioners (e.g. Cugnata et al., 2016) and researchers (e.g. Tsamardinos et al., 2006; Koller and Friedman, 2009). Overall, there is a general view in the references above and in the literature that score-based algorithms are less sensitive to individual errors of the statistical criteria, and thus more accurate, because they can reverse earlier decisions; and that hybrid algorithms are faster and more accurate than both score-based and constraint-based algorithms. These differences have been found to be more pronounced at small sample sizes. Furthermore, score-based algorithms have been found to scale less well to high-dimensional data.
The main limitation we find in these studies is the confounding between the choice of the algorithms and that of the statistical criteria, which makes it impossible to assess the merits inherently attributable to the algorithms themselves. Therefore, similarly to Cowell (2001), we construct matching scores and independence tests to remove it. Consider the comparison of two DAGs G⁺ and G⁻ which differ by a single arc. In a score-based approach, we can compare them using the BIC from (4) and select G⁺ over G⁻ if

BIC(G⁺, Θ⁺ | D) − BIC(G⁻, Θ⁻ | D) > 0,

which is equivalent to testing the conditional independence of the two incident nodes given the remaining parents using the test from (2) or (3), just with a different significance threshold than the appropriate χ² quantile: the score difference is positive exactly when G² exceeds (|Θ⁺| − |Θ⁻|) log n. We will call this test G²_BIC and use it as the matching statistical criterion for BIC to compare different learning algorithms. For discrete BNs, we will also construct a test from graph posterior probabilities using Bayes factors, selecting G⁺ over G⁻ if

log BF = log P(G⁺ | D) − log P(G⁻ | D) > 0,

to confirm our conclusions with a second set of matching criteria.
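The equivalence between the score comparison and the test can be verified numerically: the BIC difference for adding one arc equals G²/2 minus half the penalty difference, so both criteria always make the same decision. A small self-contained check with made-up counts:

```python
from collections import Counter
from math import log

def loglik_node(data, node, parents=()):
    """Maximised multinomial log-likelihood of one discrete node."""
    n_pc = Counter((tuple(r[p] for p in parents), r[node]) for r in data)
    n_p = Counter(tuple(r[p] for p in parents) for r in data)
    return sum(c * log(c / n_p[pa]) for (pa, _), c in n_pc.items())

# made-up contingency table over two binary variables
data = ([{"X": 0, "Y": 0}] * 8 + [{"X": 0, "Y": 1}] * 2 +
        [{"X": 1, "Y": 0}] * 3 + [{"X": 1, "Y": 1}] * 7)
n = len(data)
delta_d = (2 - 1) * (2 - 1)  # extra parameters introduced by the arc X -> Y

# BIC difference between the DAGs with and without X -> Y ...
delta_bic = (loglik_node(data, "Y", ("X",)) - loglik_node(data, "Y")
             - delta_d / 2 * log(n))
# ... and the G² statistic for testing the independence of X and Y
g2 = 2 * (loglik_node(data, "Y", ("X",)) - loglik_node(data, "Y"))

# identical decisions: BIC adds the arc exactly when G² > delta_d * log(n)
assert (delta_bic > 0) == (g2 > delta_d * log(n))
```

The only difference from a conventional test is the threshold: delta_d * log(n) instead of a χ² quantile at some fixed significance level.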
Table 1: Reference BNs used in the simulation study.

discrete BN  | discrete BN | GBN
-------------|-------------|-----------
ALARM        | MUNIN1      | ARTH150
ANDES        | PATHFINDER  | ECOLI70
CHILD        | PIGS        | MAGIC-IRRI
HAILFINDER   | WATER       | MAGIC-NIAB
HEPAR2       | WIN95PTS    |
3 Simulation Study
We address Q1, Q2 and Q3 with a simulation study based on reference BNs from the Bayesian network repository (Scutari, 2012); its conclusions will then be confirmed using real-world climate data in Section 4. Both are implemented using the bnlearn (Scutari, 2010) and catnet (Balov and Salzman, 2017) R packages and TETRAD (Landsheer, 2010).
We assess three constraint-based algorithms (PC-Stable, GS, Inter-IAMB), several score-based algorithms (tabu search and simulated annealing for BIC; GES for the posterior score) and two hybrid algorithms (MMHC, RSMAX2) on the networks in Table 1. For each BN:

We generate samples of increasing size, chosen relative to the number of parameters of each BN, to allow for meaningful comparisons between different BNs.

We learn G using BIC and its matching test, and for discrete BNs also the posterior score and its matching Bayes-factor test. For the latter we use the BDeu score (Heckerman et al., 1995) with the prior probability of inclusion of each parent of each node set to the default in TETRAD.

We measure the accuracy of the learned DAGs using the Structural Hamming Distance (SHD; Tsamardinos et al., 2006) from the reference BN, scaled by the number of arcs of that BN; and we measure the speed of the learning algorithms with the number of calls to the statistical criterion.
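The SHD simply counts the edge insertions, deletions and re-orientations needed to turn one CPDAG into the other. A minimal sketch, with a made-up edge-map representation chosen for the example:

```python
def shd(g1, g2):
    """Structural Hamming Distance between two CPDAGs: the number of edges to
    insert, delete or re-orient to turn one graph into the other. Each graph
    maps frozenset({u, v}) to the tuple (u, v) for the arc u -> v, or to "-"
    for an undirected edge; absent keys are absent edges."""
    return sum(g1.get(e) != g2.get(e) for e in set(g1) | set(g2))

truth = {frozenset("AC"): ("A", "C"), frozenset("BC"): ("B", "C")}
learned = {frozenset("AC"): ("A", "C"), frozenset("AB"): "-"}
```

Here shd(truth, learned) is 2 (one missing arc, one extra edge); dividing by the number of arcs in the reference BN gives the scaled SHD used in the study.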
The results for the BIC-based pair of matching criteria and the discrete BNs are illustrated in Figure 1 for small and large samples; results for the posterior-based pair are very similar and are not discussed separately for brevity. We find that 1) tabu search and simulated annealing have the highest SHDs for small samples, while tabu search has the lowest SHD for large samples, for 10/10 BNs; 2) the SHD of hybrid algorithms is comparable to that of constraint-based algorithms for all sample sizes and BNs; and 3) the SHD of constraint-based algorithms is comparable to or better than that of score-based algorithms for small sample sizes in 7/10 BNs, but it decreases more slowly as the sample size increases for all BNs. As for speed, while simulated annealing is consistently slower than the other algorithms, tabu search is in the bottom left panel (“fast, accurate”) for 10/10 BNs in large samples and for 6/10 BNs in small samples.
The corresponding results for GBNs are shown in Figure 2, and confirm that for all BNs 1) tabu search and simulated annealing have a larger SHD than constraint-based or hybrid algorithms for small samples; and 2) the SHD of hybrid and constraint-based algorithms is not markedly different at any sample size. With the exception of simulated annealing, all algorithms have very similar SHDs for all large samples. However, neither tabu search nor simulated annealing achieves a lower SHD than constraint-based or hybrid algorithms regardless of the sample size.
4 RealWorld Climate Data
Climate networks have recently attracted a great deal of interest due to their potential to analyse the complex spatial structure of climate data. This includes spatial dependence among nearby locations, but also long-range spatial dependencies connecting distant regions in the world, known as teleconnections (Tsonis et al., 2008). These teleconnections represent large-scale oscillation patterns, such as the El Niño Southern Oscillation (ENSO), which modulate the synchronous behaviour of distant regions (Yamasaki et al., 2008). The most popular climate network models are complex networks (Tsonis et al., 2006), which are easy to build since they are based on pairwise correlations (arcs are established between pairs of stations with correlations over a given threshold) and provide topological information in the network structure (e.g. highly connected regions). Bayesian networks have been proposed as an alternative methodology for climate networks that can model both marginal and conditional dependence structures and that allows probabilistic inference (Cano et al., 2004). However, learning such networks is computationally demanding and choosing an appropriate structure learning algorithm is crucial. Here we consider an illustrative case study modelling global surface temperature, and we reassess the performance of the different learning methods we used in Section 3.
4.1 Data and Methods
We use monthly surface temperature values on a global regular grid with a resolution of approximately 1000 km for a representative climatic period (1981 to 2010), as provided by the NCEP/NCAR reanalysis (https://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanalysis.html). Figure 3 shows the mean temperature (climatology) for the whole period as well as the anomaly (difference from the mean climatological values) for a particular date (January 1998, from a strong El Niño episode with high tropical Pacific temperatures).
The surface temperature at each gridpoint is assumed to be normally distributed; hence we construct GBNs from the data, in which nodes represent the (anomaly of the) surface temperature at the different gridpoints and arcs represent spatial dependencies. Thus, we define X_i as the monthly anomaly value of the temperature at location i over a period of 30 years (n = 360 monthly observations). The anomaly value is obtained by removing the mean annual cycle (i.e. the 30-year mean monthly values) from the raw data. The location of a gridpoint is defined by its latitude and longitude; hence each node in the corresponding network is characterised by the coordinates of its gridpoint.
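Removing the mean annual cycle can be sketched as follows, assuming a series of monthly values starting in January (the synthetic series is invented for the example):

```python
def anomalies(series):
    """Anomalies of a monthly series (starting in January): subtract from each
    value the mean, over the years, of its calendar month (the annual cycle)."""
    monthly_mean = [sum(series[m::12]) / len(series[m::12]) for m in range(12)]
    return [v - monthly_mean[i % 12] for i, v in enumerate(series)]

# three synthetic years in which month m of year y has value m + y
series = [m + y for y in range(3) for m in range(12)]
```

For this series each monthly mean is m + 1, so the anomalies are -1, 0 and 1 in the three years, and by construction they have zero mean within every calendar month.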
Similarly to Section 3, we assess two constraint-based algorithms (PC, GS), two score-based algorithms (tabu search and hill climbing, HC) and one hybrid algorithm (MMHC). Note, however, that in this case the sample size is fixed at n = 360, which corresponds to what was considered a “small sample” in Section 3 even for a DAG with no arcs.
In order to construct an appropriate pair of matching criteria, we introduce an extended version of BIC with a regularisation parameter γ ≥ 0 that further penalises the number of parameters. We refer to this score as BIC_γ, with BIC_0 = BIC, defined as

BIC_γ(G, Θ | D) = Σ_{i=1}^{N} [ log P(X_i | Π_{X_i}, Θ_{X_i}) − (|Θ_{X_i}| / 2) log n − γ |Θ_{X_i}| log N ].

From BIC_γ we then construct the corresponding independence test G²_γ as in Section 2, rejecting conditional independence when the test statistic from (3) exceeds the threshold implied by the score difference.

The additional regularisation in BIC_γ is required to make constraint-based algorithms reliable; using G² with climate data often results in learning graphs that are not valid CPDAGs. For all algorithms the pair (BIC_γ, G²_γ) allows us to obtain graphs of comparable size for the same γ. We have chosen to scale γ with the factor log N as in the EBIC score from Chen and Chen (2012) due to its effectiveness in feature selection. We refer to the range of γ in which an algorithm can return directed acyclic graphs (DAGs) as the sample-parameter range of the algorithm.
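One plausible concrete reading of this construction, sketched under the assumption of an EBIC-style per-parameter γ log N penalty (the exact scaling is defined by the score itself; n_obs and n_nodes below are placeholders):

```python
from math import log

def bic_gamma_penalty(d, n_obs, n_nodes, gamma):
    """Penalty attached to d parameters under the extended score: the usual
    (d/2) log n plus an extra EBIC-style gamma * d * log N term; gamma = 0
    recovers plain BIC. The exact scaling is an assumption of this sketch."""
    return d / 2 * log(n_obs) + gamma * d * log(n_nodes)

def g2_gamma_threshold(delta_d, n_obs, n_nodes, gamma):
    """Matching-test threshold: the score difference for one arc is positive
    exactly when G² exceeds twice the penalty difference."""
    return 2 * bic_gamma_penalty(delta_d, n_obs, n_nodes, gamma)
```

With gamma = 0 the threshold for one extra parameter reduces to log n, the BIC threshold of Section 2; larger gamma raises it and therefore yields sparser graphs, which is what makes the constraint-based algorithms return valid DAGs on these data.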
Motivated by the above, we proceed as in Section 3 but with the following changes:

We generate 5 permutations of the order of the variables in the data to cancel local preferences in the learning algorithms (see e.g. Colombo and Maathuis, 2014).

From each permutation, we learn G using (BIC, G²) as well as (BIC_γ, G²_γ) for different values of γ.

Since we do not have a “true” model to use as a reference, we measure the accuracy of the learned BNs along the sample-parameter range of each algorithm by their log-likelihood. We also analyse the long-distance arcs (teleconnections) established by the DAGs and assess their suitability for probabilistic inference by testing the conditional probabilities obtained when introducing some El Niño-related evidence.
4.2 Results
Figure 4 shows the performance (speed, log-likelihood and number of arcs) of the various structure learning algorithms as a function of γ, using the same colours as in Figure 1 (with the exception of hill climbing, which is new in this figure and is shown in orange). Figure 5 (a-b) shows the resulting graphs for the representative network from MMHC in Figure 4c and a comparable intermediate network from tabu search. This figure also compares the suitability of the learned BNs for probabilistic inference by propagating El Niño-like evidence (warm temperatures in the corresponding gridbox in the tropical Pacific).
The constraint-based GS and PC produce the BNs with the highest log-likelihood in the region of high parameter penalisation (large γ). However, they do not produce valid DAGs for low parameter penalisation (small γ), and the largest graphs they do return contain fewer arcs than nodes and no long arcs representing teleconnections. MMHC exhibits the poorest log-likelihood values and likewise produces networks including only a few teleconnections (Figure 5b). The absence of a sufficient number of teleconnections makes both unsuitable for propagating evidence (Figure 5d). Therefore, tabu search and HC (with almost identical results) produce the best results, with large networks and high likelihood values. In this case, even intermediate networks include a large number of teleconnections and allow propagating evidence with realistic results (Figures 5a and c).
Finally, we find that score-based algorithms are faster than both hybrid and constraint-based algorithms. The difference in speed relative to MMHC is all the more notable because, in this region, the score-based algorithms return DAGs containing more arcs than MMHC for the same γ.
5 Conclusions
In this paper we revisited the problem of assessing different classes of BN structure learning algorithms; we improved over existing comparisons in the literature by removing the confounding effect of different choices of statistical criteria. Interestingly, we found that constraint-based algorithms are more accurate than score-based algorithms for small sample sizes (Q1), and that they are as accurate as hybrid algorithms (Q2). We also found that tabu search, as a score-based algorithm, is faster than constraint-based algorithms more often than not (Q3). For climate data we found that score-based algorithms produce the largest networks, allowing good propagation of evidence. These results, which we confirmed on both simulated data and real-world climate data, are intended to provide guidance for additional studies. The discrepancy with other analyses in the literature highlights the strong effect of the choice of the statistical criterion on learning accuracy and speed.
Acknowledgments
CEG and JMG were supported by the project MULTI-SDM (CGL2015-66583-R, MINECO/FEDER).
References
 Balov and Salzman (2017) N. Balov and P. Salzman. catnet: Categorical Bayesian Network Inference, 2017. R package version 1.15.3.
 Bouckaert (1995) R. R. Bouckaert. Bayesian Belief Networks: from Construction to Inference. PhD thesis, Utrecht University, The Netherlands, 1995.
 Cano et al. (2004) R. Cano, C. Sordo, and J. M. Gutiérrez. Applications of Bayesian Networks in Meteorology. In J. A. Gámez, S. Moral, and A. Salmerón, editors, Advances in Bayesian Networks. Springer, 2004.
 Chen and Chen (2012) J. Chen and Z. Chen. Extended BIC for Small-n-Large-P Sparse GLM. Statistica Sinica, 22(2):555–574, 2012.
 Chickering (1995) D. M. Chickering. A Transformational Characterization of Equivalent Bayesian Network Structures. In P. Besnard and S. Hanks, editors, Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, pages 87–98. Morgan Kaufmann, 1995.
 Chickering (2002) D. M. Chickering. Optimal Structure Identification With Greedy Search. Journal of Machine Learning Research, 3:507–554, 2002.
 Colombo and Maathuis (2014) D. Colombo and M. H. Maathuis. Order-Independent Constraint-Based Causal Structure Learning. Journal of Machine Learning Research, 15:3921–3962, 2014.
 Cowell (2001) R. Cowell. Conditions Under Which Conditional Independence and Scoring Methods Lead to Identical Selection of Bayesian Network Models. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, pages 91–97, 2001.
 Cugnata et al. (2016) F. Cugnata, R. S. Kenett, and S. Salini. Bayesian Networks in Survey Data: Robustness and Sensitivity Issues. Journal of Quality Technology, 4(3):253–264, 2016.
 Elidan (2010) G. Elidan. Copula Bayesian Networks. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 559–567, 2010.
 Geiger and Heckerman (1994) D. Geiger and D. Heckerman. Learning Gaussian Networks. In Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence, pages 235–243, 1994.
 Heckerman et al. (1995) D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian Networks: The Combination of Knowledge and Statistical Data. Machine Learning, 20(3):197–243, 1995.
 Koller and Friedman (2009) D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
 Landsheer (2010) J. A. Landsheer. The Specification of Causal Models with Tetrad IV: A Review. Structural Equation Modeling, 17(4):703–711, 2010.
 Larrañaga et al. (1997) P. Larrañaga, B. Sierra, M. J. Gallego, M. J. Michelena, and J. M. Picaza. Learning Bayesian Networks by Genetic Algorithms: A Case Study in the Prediction of Survival in Malignant Skin Melanoma. In Proceedings of the 6th Conference on Artificial Intelligence in Medicine in Europe (AIME’97), pages 261–272. Springer, 1997.
 Lauritzen and Wermuth (1989) S. L. Lauritzen and N. Wermuth. Graphical Models for Associations Between Variables, Some of which are Qualitative and Some Quantitative. The Annals of Statistics, 17(1):31–57, 1989.
 Margaritis (2003) D. Margaritis. Learning Bayesian Network Model Structure from Data. PhD thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, May 2003.
 Russell and Norvig (2009) S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 3rd edition, 2009.
 Scutari (2010) M. Scutari. Learning Bayesian Networks with the bnlearn R Package. Journal of Statistical Software, 35(3):1–22, 2010.
 Scutari (2012) M. Scutari. Bayesian Network Repository. http://www.bnlearn.com/bnrepository, 2012.
 Scutari et al. (2014) M. Scutari, P. Howell, D. J. Balding, and I. Mackay. Multiple Quantitative Trait Analysis Using Bayesian Networks. Genetics, 198(1):129–137, 2014.
 Tsamardinos et al. (2006) I. Tsamardinos, L. E. Brown, and C. F. Aliferis. The Max-Min Hill-Climbing Bayesian Network Structure Learning Algorithm. Machine Learning, 65(1):31–78, 2006.
 Tsonis et al. (2006) A. A. Tsonis, K. L. Swanson, and P. J. Roebber. What Do Networks Have to Do with Climate? Bulletin of the American Meteorological Society, 87(5):585–595, 2006.
 Tsonis et al. (2008) A. A. Tsonis, K. L. Swanson, and G. Wang. On the Role of Atmospheric Teleconnections in Climate. Journal of Climate, 21(12):2990–3001, 2008.
 Verma and Pearl (1991) T. S. Verma and J. Pearl. Equivalence and Synthesis of Causal Models. Uncertainty in Artificial Intelligence, 6:255–268, 1991.
 Yamasaki et al. (2008) K. Yamasaki, A. Gozolchiani, and S. Havlin. Climate Networks around the Globe are Significantly Affected by El Niño. Phys. Rev. Lett., 100:228501, 2008.
 Yaramakala and Margaritis (2005) S. Yaramakala and D. Margaritis. Speculative Markov Blanket Discovery for Optimal Feature Selection. In ICDM ’05: Proceedings of the Fifth IEEE International Conference on Data Mining, pages 809–812. IEEE Computer Society, 2005.