Gradient-Based Neural DAG Learning
Abstract
We propose a novel score-based approach to learning a directed acyclic graph (DAG) from observational data. We adapt a recently proposed continuous constrained optimization formulation to allow for nonlinear relationships between variables using neural networks. This extension allows us to model complex interactions while being more global in its search compared to greedy approaches. In addition to comparing our method to existing continuous optimization methods, we provide missing empirical comparisons to nonlinear greedy search methods. On both synthetic and real-world data sets, this new method outperforms current continuous methods on most tasks while being competitive with existing greedy search methods on important metrics for causal inference.
†Contact: sebastien.lachapelle@umontreal.ca
Sébastien Lachapelle, Philippe Brouillard, Tristan Deleu, Simon Lacoste-Julien
Mila, Université de Montréal
Canada CIFAR AI Chair
Preprint. Under review.
1 Introduction
Structure learning and causal inference have many important applications in different areas of science such as genetics Koller and Friedman (2009); Peters et al. (2017), biology Sachs et al. (2005) and economics Pearl (2009). Bayesian networks (BN), which encode conditional independencies using directed acyclic graphs (DAG), are powerful models which are both interpretable and computationally tractable. Causal graphical models (CGM) Peters et al. (2017) are BNs which support interventional queries like: What will happen if someone external to the system intervenes on variable X? Recent work suggests that causality could partially solve challenges faced by current machine learning systems such as robustness to out-of-distribution samples, adaptability and explainability Pearl (2019); Magliacane et al. (2018). However, structure and causal learning are daunting tasks due to both the combinatorial nature of the space of structures and the question of structure identifiability (see Section 2.2). Nevertheless, these graphical models' known qualities and promise of improvement for machine intelligence make the quest for structure/causal learning appealing.
In this work, we propose a novel score-based method Koller and Friedman (2009); Peters et al. (2017) for structure learning named GraN-DAG which makes use of a recent reformulation of the original combinatorial problem of finding an optimal DAG into a continuous constrained optimization problem. In the original method named NOTEARS Zheng et al. (2018), the directed graph is encoded as a weighted adjacency matrix which represents coefficients in a linear structural equation model (SEM) Pearl (2009) (see Section 2.3). To enforce acyclicity, the authors propose a constraint which is both efficiently computable and easily differentiable. This makes the continuous constrained optimization problem approximately solvable by standard numerical techniques.
Our first contribution is to extend the work of Zheng et al. (2018) to deal with nonlinear relationships between variables using neural networks (NN) Goodfellow et al. (2016). GraN-DAG is general enough to deal with a large variety of parametric families of conditional probability distributions. To adapt the acyclicity constraint to our nonlinear model, we use an argument similar to the one used in Zheng et al. (2018) and apply it first at the level of neural network paths and then at the level of graph paths. Our adapted constraint allows us to exploit the full flexibility of NNs. On both synthetic and real-world tasks, we show GraN-DAG outperforms other approaches which leverage the continuous paradigm, including DAG-GNN Yu et al. (2019), a recent, independently developed nonlinear extension of Zheng et al. (2018) which uses an evidence lower bound as score.
Our second contribution is to provide a missing empirical comparison to existing methods that support nonlinear relationships but tackle the optimization problem in its discrete form using greedy search procedures, namely RESIT Peters et al. (2014) and CAM Bühlmann et al. (2014). We show that GraN-DAG is competitive on the wide range of tasks we considered.
2 Background and related work
Before presenting GraN-DAG, we review concepts relevant to structure and causal learning.
2.1 Causal graphical models
We suppose the natural phenomenon of interest can be described by a random vector $X = (X_1, \dots, X_d)$ entailed by an underlying CGM $(P_X, \mathcal{G})$, where $P_X$ is a probability distribution over $X$ and $\mathcal{G}$ is a DAG Peters et al. (2017). Each node $j$ corresponds to exactly one variable $X_j$ in the system. Let $\pi_j^{\mathcal{G}}$ denote the set of parents of node $j$ in $\mathcal{G}$ and let $X_{\pi_j^{\mathcal{G}}}$ denote the random vector containing the variables corresponding to the parents of $j$ in $\mathcal{G}$. Throughout the paper, we assume there are no hidden variables. In a CGM, the distribution $P_X$ is said to be Markov to $\mathcal{G}$, which means we can write the probability density function (pdf) as $p(x) = \prod_{j=1}^d p_j(x_j \mid x_{\pi_j^{\mathcal{G}}})$, where $p_j(x_j \mid x_{\pi_j^{\mathcal{G}}})$ is the conditional pdf of variable $X_j$ conditioned on its parents. A CGM can be thought of as a BN in which directed edges are given a causal meaning, allowing it to answer queries regarding interventional distributions Koller and Friedman (2009).
2.2 Structure identifiability
In general, it is impossible to recover $\mathcal{G}$ given samples from $P_X$. It is, however, customary to rely on a set of assumptions to render the structure fully or partially identifiable.
Definition 1.
Given a set of assumptions $A$ on a CGM $M \triangleq (P_X, \mathcal{G})$, its graph $\mathcal{G}$ is said to be identifiable from $P_X$ if there exists no other CGM $\tilde{M} \triangleq (\tilde{P}_X, \tilde{\mathcal{G}})$ satisfying all assumptions in $A$ such that $\tilde{\mathcal{G}} \neq \mathcal{G}$ and $\tilde{P}_X = P_X$.
There are many examples of graph identifiability results for continuous variables Peters et al. (2014); Peters and Bühlman (2014); Shimizu et al. (2006); Zhang and Hyvärinen (2009) as well as for discrete variables Peters et al. (2011). Those results are obtained by assuming that the conditional pdfs $p_j(x_j \mid x_{\pi_j^{\mathcal{G}}})$ belong to a specific parametric family. For example, if one assumes that
$$X_j = f_j\big(X_{\pi_j^{\mathcal{G}}}\big) + N_j, \quad N_j \sim \mathcal{N}(0, \sigma_j^2) \text{ mutually independent}, \quad j = 1, \dots, d \qquad (1)$$
where $f_j$ is a nonlinear function satisfying some mild regularity conditions, then $\mathcal{G}$ is identifiable from $P_X$ (see Peters et al. (2014) for the complete theorem and its proof). We will make use of this result in Section 4.
One can consider weaker assumptions such as faithfulness. We say $P_X$ is faithful to $\mathcal{G}$ when, whenever a conditional independence holds in $P_X$, its corresponding d-separation also holds in $\mathcal{G}$ Peters et al. (2017). This assumption allows one to identify, not $\mathcal{G}$ itself, but the Markov equivalence class to which it belongs Spirtes et al. (2000). A Markov equivalence class is a set of DAGs which encode exactly the same set of conditional independence statements. Such a class can be characterized by a graphical object named a completed partially directed acyclic graph (CPDAG) Koller and Friedman (2009); Peters et al. (2017). Some algorithms we use as baselines in Section 4 output only a CPDAG.
2.3 NOTEARS: Continuous optimization for structure learning
Structure learning is the problem of learning $\mathcal{G}$ using a data set of samples from $P_X$. In Section 5, we review popular score-based methods which embrace the combinatorial nature of the problem via greedy search procedures. We now present the work of Zheng et al. (2018), which approaches the problem from a continuous optimization perspective.
To cast the combinatorial optimization problem into a continuous constrained one, Zheng et al. (2018) propose to encode the graph on $d$ nodes as a weighted adjacency matrix $W \in \mathbb{R}^{d \times d}$ which represents (possibly negative) coefficients in a linear structural equation model (SEM) Pearl (2009) of the form $X_j \triangleq w_j^\top X + N_j$ for all $j$, where $N_j$ is a noise variable and $w_j$ is the $j$th column of $W$. Let $\mathcal{G}_W$ be the directed graph associated with the SEM and let $A_W$ be the (binary) adjacency matrix associated with $\mathcal{G}_W$. One can see that the following equivalence holds:
$$(A_W)_{ij} = 0 \iff W_{ij} = 0 \qquad (2)$$
To make sure $\mathcal{G}_W$ is acyclic, the authors propose the following constraint on $W$:
$$\mathrm{tr}\, e^{W \odot W} - d = 0 \qquad (3)$$
where $e^M$ denotes the matrix exponential and $\odot$ is the Hadamard (element-wise) product.
To see why this constraint characterizes acyclicity, first note that $\big(A_W^k\big)_{jj}$ is the number of cycles of length $k$ passing through node $j$ in graph $\mathcal{G}_W$. Clearly, for $\mathcal{G}_W$ to be acyclic, we must have $\mathrm{tr}\, A_W^k = 0$ for $k = 1, 2, \dots$. By equivalence (2), this is true when $\mathrm{tr}\,(W \odot W)^k = 0$ for $k = 1, 2, \dots$.^2 From there, one can simply apply the definition of the matrix exponential to see why constraint (3) characterizes acyclicity (see Zheng et al. (2018) for the full development).

^2 $W \odot W$ is used instead of $W$ since some entries of $W$ might be negative, which could yield $\mathrm{tr}\, W^k = 0$ for some $k$ without the graph being acyclic.
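The constraint in Equation (3) is cheap to evaluate numerically. The short sketch below (using SciPy's `expm`; the function name `notears_constraint` is ours, not from the original code) checks that the quantity is zero for a DAG and strictly positive once a cycle appears:

```python
import numpy as np
from scipy.linalg import expm

def notears_constraint(W):
    """h(W) = tr(exp(W * W)) - d, with * the Hadamard product.
    h(W) = 0 exactly when the graph induced by W is acyclic."""
    d = W.shape[0]
    return np.trace(expm(W * W)) - d

# A strictly triangular weight matrix encodes a DAG, so h is (numerically) zero;
# adding a back edge creates a 2-cycle and makes h strictly positive.
W_dag = np.array([[0.0, 1.5],
                  [0.0, 0.0]])
W_cyclic = np.array([[0.0, 1.5],
                     [0.7, 0.0]])
```

For `W_dag`, $W \odot W$ is nilpotent so the trace of its exponential is exactly $d$; for `W_cyclic`, the cycle contributes positive terms at every even power.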
The authors propose to use a regularized negative least-squares score (maximum likelihood for a Gaussian noise model). The resulting continuous constrained problem is
$$\min_{W \in \mathbb{R}^{d \times d}} \; \frac{1}{2n} \|X - XW\|_F^2 + \lambda \|W\|_1 \quad \text{subject to} \quad \mathrm{tr}\, e^{W \odot W} - d = 0 \qquad (4)$$
where $X \in \mathbb{R}^{n \times d}$ is the design matrix containing all $n$ samples. Even though the authors do not explicitly discuss identifiability issues, they point out consistency results for their least-squares score in van de Geer and Bühlmann (2012); Aragam et al. (2015); Loh and Bühlmann (2014). The nature of the problem has been drastically changed: we went from a combinatorial to a continuous problem. The difficulties of combinatorial optimization have been replaced by those of nonconvex optimization, since the feasible set is nonconvex. Nevertheless, a standard numerical solver for constrained optimization such as an augmented Lagrangian method (AL) Bertsekas (1999) can be applied to get an approximate solution. The algorithm stops when the constraint is sufficiently close to zero and returns an estimate $\hat{W}$. Finally, small values of $\hat{W}$ are thresholded to zero, yielding the estimated graph.
This method has the advantage of being more global than other approximate greedy methods in the sense that the whole matrix $W$ is updated at each iteration.^3

^3 On the other hand, the update of $W$ is based on the gradient information of the augmented Lagrangian, which is local in another sense.
3 GraN-DAG: Gradient-based neural DAG learning
We propose a new nonlinear extension to the framework presented in Section 2.3. For each variable $X_j$, we learn a fully connected neural network with $L$ hidden layers parametrized by $\phi_{(j)} \triangleq \{W_{(j)}^{(1)}, \dots, W_{(j)}^{(L+1)}\}$, where $W_{(j)}^{(\ell)}$ is the $\ell$th weight matrix of the $j$th NN (we omit biases for clarity). Each NN takes as input $x_{-j} \in \mathbb{R}^d$, i.e. the vector $x$ with the $j$th component masked to zero, and outputs $\theta_{(j)} \in \mathbb{R}^m$, the $m$-dimensional parameter vector of the desired distribution family for variable $X_j$ (not all parameter vectors need to have the same dimensionality, but to simplify the notation we suppose they do). The fully connected NNs have the following form
$$\theta_{(j)} \triangleq W_{(j)}^{(L+1)}\, g\big( \cdots g\big( W_{(j)}^{(2)}\, g\big( W_{(j)}^{(1)}\, x_{-j} \big) \big) \cdots \big) \qquad (5)$$
where $g$ is a nonlinearity applied element-wise. Note that the evaluation of all NNs can be parallelized on GPU. Distribution families need not be the same for each variable. Let $\phi \triangleq \{\phi_{(1)}, \dots, \phi_{(d)}\}$ represent the parameters of all $d$ NNs. Without any constraint on its parameters $\phi_{(j)}$, neural network $j$ models the conditional pdf $p_j(x_j \mid x_{-j}; \phi_{(j)})$. Note that the product $\prod_{j=1}^d p_j(x_j \mid x_{-j}; \phi_{(j)})$ is not a valid joint pdf since it does not decompose according to a DAG. We now show how one can constrain $\phi$ to make sure the product of all conditionals outputted by the NNs is a valid joint pdf. The idea is to define a new weighted adjacency matrix $A_\phi$, similar to the one encountered in Section 2.3, which can be directly used inside the constraint of Equation (3) to enforce acyclicity.
3.1 Neural network connectivity
Before defining the weighted adjacency matrix $A_\phi$, we need to focus on how one can make some NN outputs unaffected by some inputs. Since we will discuss properties of a single NN, we drop the subscript $(j)$ to improve readability.
We will use the term neural network path to refer to a computation path in a NN. For example, in a NN with two hidden layers, the sequence of weights $\big(W^{(1)}_{h_1 i},\, W^{(2)}_{h_2 h_1},\, W^{(3)}_{k h_2}\big)$ is a NN path from input $i$ to output $k$. We say that a NN path is inactive if at least one weight along the path is zero. We can loosely interpret the product of the weights along a path as the strength of that NN path, where a path product is equal to zero if and only if the path is inactive. Note that if all NN paths from input $i$ to output $k$ are inactive (i.e. the sum of their absolute path products is zero), then output $k$ does not depend on input $i$, since the information in input $i$ will never reach output $k$. The sum of all absolute path products from every input to every output can be easily computed by taking the product of all the weight matrices in absolute value:
$$C \triangleq \big|W^{(L+1)}\big| \cdots \big|W^{(2)}\big|\, \big|W^{(1)}\big| \qquad (6)$$
where $|W|$ is the element-wise absolute value of $W$. Let us name $C$ the neural network connectivity matrix. It can be verified that $C_{ki}$ is the sum of all absolute NN path products from input $i$ to output $k$. This means it is sufficient to have $C_{ki} = 0$ to render output $k$ independent of input $i$.
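Equation (6) is just a chain of matrix products over absolute-valued weights. A minimal illustration (the helper name `connectivity` is ours) verifies that zeroing all first-layer weights leaving a given input zeroes the corresponding column of $C$, so no output can depend on that input:

```python
import numpy as np

def connectivity(weights):
    """Given the weight matrices [W1, ..., W_{L+1}] of a fully connected NN,
    return C = |W_{L+1}| ... |W_1| as in Equation (6). Entry C[k, i] is the
    sum of all absolute path products from input i to output k."""
    C = np.abs(weights[0])
    for W in weights[1:]:
        C = np.abs(W) @ C
    return C

rng = np.random.default_rng(0)
# 3 inputs, two hidden layers of 5 units, 2 outputs
W1 = rng.normal(size=(5, 3))
W2 = rng.normal(size=(5, 5))
W3 = rng.normal(size=(2, 5))
# Deactivate every NN path leaving input 0 by zeroing its first-layer weights:
W1[:, 0] = 0.0
C = connectivity([W1, W2, W3])  # column 0 of C is now exactly zero
```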
Remember that each NN in our model outputs a parameter vector $\theta_{(j)}$ for a conditional distribution and that we want the product of all conditionals to be a valid joint pdf, i.e. we want its corresponding directed graph to be acyclic. With this in mind, we see that it could be useful to make a certain parameter vector not dependent on certain inputs of the NN. To have $\theta_{(j)}$ independent of variable $X_i$, it is sufficient to have $\sum_{k=1}^m \big(C_{(j)}\big)_{ki} = 0$.
3.2 A weighted adjacency matrix
We now define a weighted adjacency matrix $A_\phi$ that can be used in the constraint of Equation (3):
$$(A_\phi)_{ij} \triangleq \begin{cases} \sum_{k=1}^m \big(C_{(j)}\big)_{ki} & \text{if } j \neq i \\ 0 & \text{otherwise} \end{cases} \qquad (7)$$
where $C_{(j)}$ denotes the connectivity matrix of the NN associated with variable $X_j$.
As the notation suggests, $A_\phi$ depends on all weights of all NNs. Moreover, it can effectively be interpreted as a weighted adjacency matrix similar to the one presented in Section 2.3, since we have that
$$(A_\phi)_{ij} = 0 \implies \theta_{(j)} \text{ does not depend on } x_i \qquad (8)$$
We let $\mathcal{G}_\phi$ denote the directed graph entailed by parameter $\phi$, i.e. the graph whose edges correspond to the nonzero entries of $A_\phi$. We can now write our adapted acyclicity constraint:
$$h(\phi) \triangleq \mathrm{tr}\, e^{A_\phi} - d = 0 \qquad (9)$$
Note that the Hadamard square of Section 2.3 is not needed here since the entries of $A_\phi$ are already nonnegative. We can compute the gradient of $h(\phi)$ w.r.t. $\phi$ (except on points of nondifferentiability arising from the absolute value function, similar to standard neural networks with ReLU activations Glorot et al. (2011)).
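Putting Equations (7) and (9) together, the weighted adjacency matrix and the constraint can be sketched as follows (helper names are ours; `conn_mats` stands for the per-NN connectivity matrices $C_{(j)}$ of Equation 6):

```python
import numpy as np
from scipy.linalg import expm

def weighted_adjacency(conn_mats):
    """(A_phi)[i, j] aggregates NN j's connectivity from input i over all of
    its m outputs (Equation 7); the diagonal is forced to zero since each NN's
    own variable is masked out of its input."""
    d = len(conn_mats)
    A = np.zeros((d, d))
    for j, C in enumerate(conn_mats):
        A[:, j] = C.sum(axis=0)   # column j: how strongly each input reaches NN j
        A[j, j] = 0.0
    return A

def h(conn_mats):
    """Acyclicity constraint of Equation (9): tr(exp(A_phi)) - d.
    No Hadamard square is needed since A_phi is already nonnegative."""
    A = weighted_adjacency(conn_mats)
    return np.trace(expm(A)) - A.shape[0]

# Two variables, one output parameter each (m = 1):
conn0 = np.zeros((1, 2))          # NN 0 ignores all inputs (root node)
conn1 = np.array([[1.0, 0.0]])    # NN 1 depends on input 0 only: edge 0 -> 1
```

With this configuration $A_\phi$ is strictly upper triangular, so $h$ evaluates to zero.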
3.3 A differentiable score and its optimization
We propose solving the maximum likelihood optimization problem
$$\max_{\phi} \; \sum_{i=1}^n \sum_{j=1}^d \log p_j\big(x_j^{(i)} \,\big|\, x_{\pi_j^\phi}^{(i)};\, \phi_{(j)}\big) \quad \text{subject to} \quad \mathrm{tr}\, e^{A_\phi} - d = 0 \qquad (10)$$
where $\pi_j^\phi$ denotes the set of parents of variable $X_j$ in graph $\mathcal{G}_\phi$. Note that the objective is a valid log-likelihood function when constraint (9) is satisfied.
As suggested in Zheng et al. (2018), we apply an augmented Lagrangian approach to get an approximate solution to program (10). Augmented Lagrangian methods consist of optimizing a sequence of subproblems whose exact solutions are known to converge to a stationary point of the constrained problem under some regularity conditions Bertsekas (1999). In our case, the $t$th subproblem is
$$\max_{\phi} \; \sum_{i=1}^n \sum_{j=1}^d \log p_j\big(x_j^{(i)} \,\big|\, x_{\pi_j^\phi}^{(i)};\, \phi_{(j)}\big) - \lambda_t h(\phi) - \frac{\mu_t}{2} h(\phi)^2 \qquad (11)$$
where $\lambda_t$ and $\mu_t$ are the Lagrangian and penalty coefficients of the $t$th subproblem, respectively. These coefficients are updated after each subproblem is solved. See Appendix A.2 for details regarding the optimization procedure.
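To make the schedule concrete, here is the augmented Lagrangian recipe on a one-dimensional toy problem rather than on program (10). The multiplier and penalty updates shown (grow the penalty tenfold when the constraint violation does not shrink enough) are a common textbook choice Bertsekas (1999), not necessarily the exact rules of Appendix A.2:

```python
import numpy as np
from scipy.optimize import minimize

# Toy constrained problem: minimize (x - 2)^2 subject to h(x) = x^2 - 1 = 0.
# The feasible minimizer is x = 1, with Lagrange multiplier 1.
f = lambda x: (x - 2.0) ** 2
h = lambda x: x ** 2 - 1.0

lam, mu = 0.0, 1.0
x = np.array([2.0])
h_prev = np.inf
for t in range(15):
    # Approximately solve subproblem t (analogue of Equation 11, with a min):
    aug = lambda z: f(z[0]) + lam * h(z[0]) + 0.5 * mu * h(z[0]) ** 2
    x = minimize(aug, x).x
    lam += mu * h(x[0])                       # Lagrange multiplier update
    if abs(h(x[0])) > 0.25 * abs(h_prev):
        mu = min(mu * 10.0, 1e6)              # strengthen penalty if stalled
    h_prev = h(x[0])
# x has converged to the feasible minimizer x = 1
```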
3.4 Neural network input masking
Without any masking, the solution outputted by the augmented Lagrangian (AL) procedure will satisfy the constraint only up to numerical precision, i.e. some entries of $A_\phi$ will be very close to zero without being exactly zero. Hence some thresholding is needed. To do so, we mask the inputs of each NN $j$ using a binary matrix $M_{(j)}$ initialized to have ones for every input except the $j$th, and zeros everywhere else. A zero entry in $M_{(j)}$ means the corresponding input of NN $j$ has been thresholded. This mask is integrated in the product of Equation (6) without changing the interpretation of $C_{(j)}$. During the AL procedure, when a certain entry of $A_\phi$ is smaller than a fixed threshold, the corresponding mask entry is set to zero. The masks are never updated via gradient descent. We provide more details on how we ensure the estimated graph is a DAG in Appendix A.3. The resulting DAG often contains spurious edges, hence we apply a final pruning step identical to what is done in CAM Bühlmann et al. (2014). We provide more details on pruning in Appendix A.4.
4 Experiments
In this section, we compare GraN-DAG to various baselines (both in the continuous and the combinatorial paradigm), namely DAG-GNN Yu et al. (2019), NOTEARS Zheng et al. (2018), RESIT Peters et al. (2014) and CAM Bühlmann et al. (2014). Those methods are discussed in Section 5. As a sanity check, we report the performance of random graphs sampled using the Erdős–Rényi (ER) scheme described in Appendix A.5 (denoted by RANDOM). For each approach, we evaluate the estimated graph on two metrics: the structural Hamming distance (SHD) and the structural interventional distance (SID) Peters and Bühlmann (2013). The former simply counts the number of missing, falsely detected or reversed edges. The latter is especially well suited for causal inference since it counts the number of ordered pairs of nodes such that the interventional distribution would be miscalculated if we were to use the estimated graph to form the parent adjustment set. See Appendix A.7 for more details on SHD and SID. We consider both synthetic and real-world data sets.
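For concreteness, one common way to compute the SHD between two binary adjacency matrices is sketched below (conventions differ across implementations; here a reversed edge counts as a single error, consistent with the description above):

```python
import numpy as np

def shd(A_true, A_est):
    """Structural Hamming distance between two binary adjacency matrices:
    number of edge additions, deletions, or reversals needed to transform one
    graph into the other. A reversed edge counts once, not twice."""
    diff = np.abs(A_true - A_est)
    mismatched = (diff + diff.T) > 0   # node pairs {i, j} whose edges disagree
    return int(np.triu(mismatched).sum())  # count each mismatched pair once

A_true = np.array([[0, 1, 0],
                   [0, 0, 1],
                   [0, 0, 0]])
A_rev = A_true.T                       # both edges reversed
A_empty = np.zeros((3, 3), dtype=int)  # both edges missing
```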
Since the performance of GES Chickering (2003) and PC Spirtes et al. (2000) is almost never on par with that of the best methods presented in this section, we defer their evaluation to Appendix A.6.
The code for all experiments can be found at https://github.com/kurowasan/GraNDAG
4.1 Synthetic data
We have generated different data set types which vary along three dimensions: number of nodes, level of edge sparsity and graph type. For each data set type, we have sampled 10 data sets. We consider two different types of graphs, Erdős–Rényi (ER) and scale-free (SF) graphs. The two types differ in the way graphs are randomly generated (see Appendix A.5).
Given a data set type, a data set is sampled as follows: first, a ground truth DAG is randomly sampled following either the ER or the SF scheme. Then, the data is generated following $X_j = f_j\big(X_{\pi_j^{\mathcal{G}}}\big) + N_j$ with $N_j \sim \mathcal{N}(0, \sigma_j^2)$, where the functions $f_j$ are independently sampled from a Gaussian process with bandwidth one and the $\sigma_j^2$ are sampled uniformly. This setup is especially interesting to consider since, as mentioned in Section 2.2, we know the DAG to be identifiable from the distribution Peters et al. (2014). This ensures that finding the correct DAG via maximum likelihood is not impossible.
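The sampling procedure can be sketched as follows. For simplicity, the Gaussian-process functions $f_j$ are replaced by a fixed tanh nonlinearity with random linear weights (an assumption for illustration only), and the ER graph is drawn directly in topological order:

```python
import numpy as np

def sample_er_dag(d, edge_prob, rng):
    """Erdos-Renyi-style DAG: keep edges only in the strict upper triangle,
    so the natural ordering 0, ..., d-1 is a valid topological order."""
    return np.triu((rng.random((d, d)) < edge_prob).astype(float), k=1)

def sample_anm_data(A, n, rng, noise_std=0.5):
    """Additive noise model x_j = f_j(x_parents) + N_j. Here tanh of a random
    linear map stands in for the paper's Gaussian-process functions."""
    d = A.shape[0]
    X = np.zeros((n, d))
    for j in range(d):  # ancestral sampling along the topological order
        parents = np.flatnonzero(A[:, j])
        noise = noise_std * rng.normal(size=n)
        if parents.size == 0:
            X[:, j] = noise
        else:
            w = rng.normal(size=parents.size)
            X[:, j] = np.tanh(X[:, parents] @ w) + noise
    return X

rng = np.random.default_rng(0)
A = sample_er_dag(10, 0.2, rng)
X = sample_anm_data(A, 1000, rng)
```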
In those experiments, each NN learned by GraN-DAG outputs a Gaussian mean, i.e. $\theta_{(j)} = \mu_j$. The variance parameters $\sigma_j^2$ are learned as well, but do not depend on the parent variables. Note that the linear method NOTEARS and the nonlinear methods CAM and RESIT all make the correct Gaussian model assumption.
We considered graphs of 10, 20, 50 and 100 nodes. We only present results for 10 and 50 nodes in the main paper since the conclusions do not change with graphs of 20 or 100 nodes (see Appendix A.6 for the additional experiments). We consider graphs with $d$ and $4d$ edges (respectively denoted by ER1 and ER4 in the case of ER graphs). Note that RESIT was not applied on data sets of 50 or more variables for computational reasons (see Peters et al. (2014)). For graphs of 50 nodes or more, GraN-DAG performs a preliminary neighborhood selection (PNS) similar to what is proposed by CAM Bühlmann et al. (2014). This optional preprocessing step helps reduce overfitting for large graphs (details in Appendix A.1).
Table 1: Results on synthetic data sets with 10 nodes (mean ± standard deviation over 10 data sets).

             ER1                        ER4
             SHD          SID           SHD          SID
GraN-DAG     1.7 ± 2.5    1.7 ± 3.1     8.3 ± 2.8    21.8 ± 8.9
DAG-GNN      11.4 ± 3.1   37.6 ± 14.4   35.1 ± 1.5   81.9 ± 4.7
NOTEARS      12.2 ± 2.9   36.6 ± 13.1   32.6 ± 3.2   79.0 ± 4.1
CAM          1.1 ± 1.1    1.1 ± 2.4     12.2 ± 2.7   30.9 ± 13.2
RESIT        21.1 ± 4.4   19.4 ± 16.4   17.3 ± 3.1   45.2 ± 8.6
RANDOM       26.3 ± 9.8   25.8 ± 10.4   31.8 ± 5.0   76.6 ± 7.0

             SF1                        SF4
             SHD          SID           SHD          SID
GraN-DAG     1.2 ± 1.1    4.1 ± 6.1     9.9 ± 4.0    16.4 ± 6.0
DAG-GNN      9.9 ± 1.1    29.7 ± 15.8   20.8 ± 1.9   48.4 ± 15.6
NOTEARS      10.7 ± 2.2   32.0 ± 15.3   20.8 ± 2.7   49.8 ± 15.6
CAM          1.4 ± 1.6    5.4 ± 6.1     9.8 ± 4.3    19.3 ± 7.5
RESIT        21.8 ± 7.5   11.9 ± 13.7   15.5 ± 9.4   20.4 ± 19.3
RANDOM       25.1 ± 10.2  24.5 ± 10.5   28.5 ± 4.0   47.2 ± 12.2
Table 2: Results on synthetic data sets with 50 nodes (mean ± standard deviation over 10 data sets).

             ER1                           ER4
             SHD            SID            SHD            SID
GraN-DAG     5.1 ± 2.8      22.4 ± 17.8    102.6 ± 21.2   1060.1 ± 109.4
DAG-GNN      49.2 ± 7.9     304.4 ± 105.1  191.9 ± 15.2   2146.2 ± 64
NOTEARS      62.8 ± 9.2     327.3 ± 119.9  202.3 ± 14.3   2149.1 ± 76.3
CAM          4.3 ± 1.9      22.0 ± 17.9    98.8 ± 20.7    1197.2 ± 125.9
RANDOM       535.7 ± 401.2  272.3 ± 125.5  708.4 ± 234.4  1921.3 ± 203.5

             SF1                           SF4
             SHD            SID            SHD            SID
GraN-DAG     25.5 ± 6.2     90.0 ± 18.9    111.3 ± 12.3   271.2 ± 65.4
DAG-GNN      49.8 ± 1.3     182.8 ± 42.9   144.9 ± 13.3   540.8 ± 151.1
NOTEARS      57.7 ± 3.5     195.7 ± 54.9   153.7 ± 11.8   558.4 ± 153.5
CAM          24.1 ± 6.2     85.7 ± 31.9    111.2 ± 13.3   320.7 ± 152.6
RANDOM       514.0 ± 360.0  381.3 ± 190.3  660.6 ± 194.9  1198.9 ± 304.6
We now examine the different metrics reported in Tables 1 and 2 (the error bars represent the standard deviation across the data sets of each task). We can see that, across all settings, GraN-DAG and CAM are the best performing methods, both in terms of SHD and SID. It is not surprising that a linear method such as NOTEARS performs poorly on nonlinear data. What is perhaps more surprising is the poor performance of DAG-GNN in terms of SID: it performs similarly to RANDOM in almost all cases, except on scale-free networks of 50 nodes or more. In terms of SHD, it rarely performs better than NOTEARS, both on ER and SF graphs. We hypothesize that DAG-GNN performs poorly in our setup because it does not make the correct modelling assumptions and because its architecture uses a strong form of parameter sharing between the functions $f_j$, which is not justified in a setup like ours. Among the continuous approaches considered, GraN-DAG is the best performing on all our synthetic tasks.
Even though CAM (wrongly) assumes that the functions $f_j$ are additive, i.e. $f_j(X_{\pi_j}) = \sum_{i \in \pi_j} f_{ij}(X_i)$, it manages to outperform RESIT, which does not make this incorrect modelling assumption. This confirms the empirical findings of Bühlmann et al. (2014). On all the synthetic tasks, GraN-DAG is on par with CAM, indicating that pursuing the continuous paradigm for structure learning is worthwhile.
4.2 Real data
We have tested all methods considered so far on a well-known data set which measures the expression level of different proteins and phospholipids in human cells Sachs et al. (2005). This data set contains both observational and interventional data Koller and Friedman (2009); Peters et al. (2017). Since this paper focuses on learning structures from purely observational data, we omit the interventional samples. This data set and the ground truth graph proposed in Sachs et al. (2005) are often used in the probabilistic graphical model literature Koller and Friedman (2009). The graph has 11 nodes and 17 edges.
Note that in this application, it is not clear whether the DAG is identifiable from the distribution. Nevertheless, we apply procedures to estimate it. This departure from identifiable setups is an occasion to explore different modelling assumptions for GraN-DAG. In addition to the model presented in Section 4.1, we consider an alternative, denoted GraN-DAG++, which allows the variance parameters $\sigma_j^2$ to depend on the parent variables through the NN, i.e. $\theta_{(j)} = (\mu_j, \log \sigma_j^2)$. This allows the model to capture heteroscedastic noise.
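The Gaussian log-likelihood underlying both variants is simple to write down. In the Section 4.1 model, the log-variance is a free scalar per variable, while in GraN-DAG++ it becomes a network output so the noise scale can depend on the parents (the function name below is ours):

```python
import numpy as np

def gaussian_nll(x, mu, log_var):
    """Per-sample negative log-likelihood of x under N(mu, exp(log_var)).
    Parametrizing the log-variance keeps the variance positive without any
    constraint; summing this over variables and samples gives (the negative
    of) the score maximized in program (10)."""
    return 0.5 * (np.log(2.0 * np.pi) + log_var + (x - mu) ** 2 / np.exp(log_var))
```

With `log_var` fixed per variable the noise is homoscedastic; letting a NN output it per sample yields the heteroscedastic GraN-DAG++ model.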
In addition to the metrics used in Section 4.1, we also report SHD-CPDAG. To compute the SHD-CPDAG between two DAGs, we first map each of them to its corresponding CPDAG and measure the SHD between the two. This metric is useful to compare algorithms which output only a CPDAG, like GES and PC, to methods which output a DAG. Note also that, for these methods, we report an approximate lower bound and upper bound on the SID. This is necessary since the SID can only be measured on DAGs (see Appendix A.7 for more details on all metrics). Results are reported in Table 3.
Table 3: Results on the protein expression data set of Sachs et al. (2005).

             SHD   SHD-CPDAG   SID
GraN-DAG     13    11          47
GraN-DAG++   13    10          48
DAG-GNN      16    21          44
NOTEARS      21    21          44
CAM          12    9           55
RESIT        27    22          57
GES          26    28          [34, 45]
PC           17    11          [47, 62]
RANDOM       21    20          60
A first observation is that, both in terms of SID and SHD, all methods perform worse than what was reported for graphs of similar size in Section 4.1. This might be due to the lack of identifiability guarantees we face in applications. Nevertheless, all methods except GES and NOTEARS score better than RANDOM on all three metrics. Both GraN-DAG variants outperform CAM in terms of SID (which differs from the general trend of Section 4.1) and arrive almost on par in terms of SHD and SHD-CPDAG. This time, DAG-GNN and NOTEARS outperform both CAM and GraN-DAG on SID, but not on SHD or SHD-CPDAG. In terms of SHD-CPDAG, PC is among the best. The low SHD-CPDAG of GraN-DAG and CAM indicates that such methods can be used to identify Markov equivalence classes, although their main purpose is to estimate DAGs.
5 Related Work
Most methods for structure learning from observational data make use of identifiability results similar to the ones presented in Section 2.2. Roughly speaking, there are two classes of methods: independence-based and score-based methods. GraN-DAG falls into the second class.
Independence-based methods such as PC Spirtes et al. (2000) assume $P_X$ is faithful to $\mathcal{G}$ and rely on statistical conditional independence tests (e.g. based on HSIC Gretton et al. (2007)) to recover the underlying CPDAG. One can also deal with hidden variables with FCI Spirtes et al. (2000), a modified version of PC.
Score-based methods Koller and Friedman (2009); Peters et al. (2017) cast the problem of structure learning as an optimization problem over the space of structures (either the space of DAGs or that of CPDAGs). Many popular algorithms tackle the combinatorial nature of the problem by performing a form of greedy search. GES Chickering (2003) is a popular example: it usually assumes a linear parametric model with Gaussian noise and greedily searches the space of CPDAGs in order to optimize the Bayesian information criterion. Other greedy approaches rely on parametric assumptions which render $\mathcal{G}$ fully identifiable. For example, Peters and Bühlman (2014) rely on a linear Gaussian model with equal variances to render the DAG identifiable. RESIT Peters et al. (2014) assumes nonlinear relationships with additive Gaussian noise and greedily maximizes an independence-based score. CAM Bühlmann et al. (2014) decouples the search for the optimal node ordering from the parent selection for each node. Moreover, CAM assumes an additive noise model (ANM) Peters et al. (2017) in which the nonlinear functions are additive, which provides a computational advantage when searching over the space of DAGs. As mentioned in Section 2.3, NOTEARS, proposed in Zheng et al. (2018), tackles the problem of finding an optimal DAG as a continuous constrained optimization program. This is a drastic departure from previous combinatorial approaches which enables the application of well-studied numerical solvers for continuous optimization. Recent independent work proposes DAG-GNN Yu et al. (2019), a graph neural network architecture which can be used to learn DAGs via the maximization of an evidence lower bound (ELBO). To the best of our knowledge, DAG-GNN is the first approach extending the work of Zheng et al. (2018) to structure learning supporting nonlinear relationships. Although Yu et al. (2019) provide empirical comparisons to linear approaches, namely NOTEARS and FGS (a faster extension of GES) Ramsey et al. (2017), comparisons to known greedy approaches supporting nonlinear relationships such as RESIT and CAM are missing. There also exist score-based approaches which use integer linear programming (ILP) Jaakkola et al. (2010); Cussens (2011) and internally solve continuous linear relaxations. Connections between such methods and the continuous constrained approaches remain to be explored.
Methods for causal discovery using NNs have already been proposed. SAM Kalainathan et al. (2018) learns conditional NN generators using adversarial losses but does not enforce acyclicity. CGNN Goudet et al. (2018), when used for multivariate data, requires an initial skeleton to learn the different functional relationships.
GraN-DAG has strong connections with MADE Germain et al. (2015), a method used to learn distributions using a masked NN which enforces the so-called autoregressive property. The autoregressive property and acyclicity are in fact equivalent. MADE does not learn the weight masking; it is fixed at the beginning of the procedure. GraN-DAG could be used with a unique NN taking as input all variables and outputting parameters for all conditional distributions. In this case, it would be similar to MADE, except the variable ordering would be learned from data instead of fixed a priori.
6 Conclusion
The continuous constrained approach to structure learning has the advantage of being more global than other approximate greedy methods and allows one to replace task-specific greedy algorithms with appropriate off-the-shelf numerical solvers. In this work, we have introduced GraN-DAG, a novel score-based approach for structure learning supporting nonlinear relationships while leveraging a continuous optimization paradigm. The method rests on an acyclicity constraint very similar to the one proposed in Zheng et al. (2018), where the weighted adjacency matrix is adapted to deal with fully connected NNs instead of linear functions. We showed GraN-DAG outperforms other gradient-based approaches, namely NOTEARS and its recent nonlinear extension DAG-GNN, on the synthetic data sets considered in Section 4.1 while being competitive on the protein expression data set of Section 4.2. Compared to greedy approaches, GraN-DAG is competitive across all data sets considered. To the best of our knowledge, GraN-DAG is the first approach leveraging the continuous paradigm introduced in Zheng et al. (2018) which has been shown to be competitive with state-of-the-art combinatorial approaches supporting nonlinear relationships.
Acknowledgments
This research was partially supported by the Canada CIFAR AI Chair Program and by a Google Focused Research award. The authors would like to thank Rémi Le Priol, Tatjana Chavdarova, Charles Guille-Escuret and Yoshua Bengio for insightful discussions as well as Florian Bordes for technical support.
References
 Aragam et al. [2015] B. Aragam, A.A. Amini, and Q. Zhou. Learning directed acyclic graphs with penalized neighbourhood regression. arXiv preprint arXiv:1511.08963, 2015.
 Barabási [2009] A.L. Barabási. Scalefree networks: a decade and beyond. Science, 2009.
 Barabási and Albert [1999] A.L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 1999.
 Bertsekas [1999] D.P. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
 Bühlmann et al. [2014] P. Bühlmann, J. Peters, and J. Ernest. CAM: Causal additive models, highdimensional order search and penalized regression. Annals of Statistics, 2014.
 Chickering [2003] D.M. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 2003.
 Clauset et al. [2009] A. Clauset, C.R. Shalizi, and M.E.J. Newman. Powerlaw distributions in empirical data. SIAM review, 2009.
 Cussens [2011] J. Cussens. Bayesian network learning with cutting planes. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, 2011.
 Germain et al. [2015] M. Germain, K. Gregor, I. Murray, and H. Larochelle. MADE: Masked autoencoder for distribution estimation. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
 Geurts et al. [2006] P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine learning, 2006.
 Glorot and Bengio [2010] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 2010.
 Glorot et al. [2011] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011.
 Goodfellow et al. [2016] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
 Goudet et al. [2018] O. Goudet, D. Kalainathan, P. Caillou, D. LopezPaz, I. Guyon, and M. Sebag. Learning Functional Causal Models with Generative Neural Networks. In Explainable and Interpretable Models in Computer Vision and Machine Learning. Springer International Publishing, 2018.
 Gretton et al. [2007] A. Gretton, K. Fukumizu, C.H. Teo, L. Song, B. Schölkopf, and A.J. Smola. A kernel statistical test of independence. In Advances in neural information processing systems 20, 2007.
 Jaakkola et al. [2010] T. Jaakkola, D. Sontag, A. Globerson, and M. Meila. Learning Bayesian Network Structure using LP Relaxations. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 2010.
 Kalainathan et al. [2018] D. Kalainathan, O. Goudet, I. Guyon, D. LopezPaz, and M. Sebag. Sam: Structural agnostic model, causal discovery and penalized adversarial learning. arXiv preprint arXiv:1803.04929, 2018.
 Koller and Friedman [2009] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques  Adaptive Computation and Machine Learning. MIT Press, 2009.
 Loh and Bühlmann [2014] P.L. Loh and P. Bühlmann. High-dimensional learning of linear causal networks via inverse covariance estimation. Journal of Machine Learning Research, 2014.
 Magliacane et al. [2018] S. Magliacane, T. van Ommen, T. Claassen, S. Bongers, P. Versteeg, and J.M. Mooij. Domain adaptation by using causal inference to predict invariant conditional distributions. In Advances in Neural Information Processing Systems 31, 2018.
 Pearl [2009] J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2nd edition, 2009.
 Pearl [2019] J. Pearl. The seven tools of causal inference, with reflections on machine learning. Commun. ACM, 2019.
 Peters and Bühlmann [2014] J. Peters and P. Bühlmann. Identifiability of Gaussian structural equation models with equal error variances. Biometrika, 2014.
 Peters and Bühlmann [2015] J. Peters and P. Bühlmann. Structural intervention distance (SID) for evaluating causal graphs. Neural Computation, 2015.
 Peters et al. [2011] J. Peters, D. Janzing, and B. Schölkopf. Causal inference on discrete data using additive noise models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
 Peters et al. [2014] J. Peters, J. M. Mooij, D. Janzing, and B. Schölkopf. Causal discovery with continuous additive noise models. Journal of Machine Learning Research, 2014.
 Peters et al. [2017] J. Peters, D. Janzing, and B. Schölkopf. Elements of Causal Inference  Foundations and Learning Algorithms. MIT Press, 2017.
 Ramsey et al. [2017] J. Ramsey, M. Glymour, R. Sanchez-Romero, and C. Glymour. A million variables and more: the fast greedy equivalence search algorithm for learning high-dimensional graphical causal models, with an application to functional magnetic resonance images. International Journal of Data Science and Analytics, 2017.
 Sachs et al. [2005] K. Sachs, O. Perez, D. Pe’er, D.A. Lauffenburger, and G.P. Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 2005.
 Shimizu et al. [2006] S. Shimizu, P.O. Hoyer, A. Hyvärinen, and A. Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 2006.
 Spirtes et al. [2000] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT press, 2000.
 Tieleman and Hinton [2012] T. Tieleman and G. Hinton. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
 van de Geer and Bühlmann [2012] S. van de Geer and P. Bühlmann. $\ell_0$-penalized maximum likelihood for sparse directed acyclic graphs. Annals of Statistics, 2012.
 Yu et al. [2019] Y. Yu, J. Chen, T. Gao, and M. Yu. DAG-GNN: DAG structure learning with graph neural networks. In Proceedings of the 36th International Conference on Machine Learning, 2019.
 Zhang and Hyvärinen [2009] K. Zhang and A. Hyvärinen. On the identifiability of the post-nonlinear causal model. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 2009.
 Zheng et al. [2018] X. Zheng, B. Aragam, P.K. Ravikumar, and E.P. Xing. DAGs with no tears: Continuous optimization for structure learning. In Advances in Neural Information Processing Systems 31, 2018.
Appendix A
A.1 Preliminary neighborhood selection
For graphs of 50 nodes or more, GraN-DAG performs a preliminary neighborhood selection (PNS) similar to what has been proposed in Bühlmann et al. [2014]. This procedure applies a variable selection method to obtain a set of possible parents for each node. This is done by fitting an extremely randomized trees regressor Geurts et al. [2006] (using ExtraTreesRegressor from scikit-learn) for each variable against all the other variables. For each node, a feature importance score based on the gain of purity is calculated for every candidate parent. Only nodes with a feature importance score higher than $\tau$ are kept as potential parents, where $\tau$ is the mean of the feature importance scores over all candidates. Although the use of PNS in CAM was motivated by gains in computation time, GraN-DAG uses it to avoid overfitting, without reducing the computation time.
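As an illustration, the PNS step can be sketched as follows with scikit-learn's ExtraTreesRegressor. The function name, hyperparameters and toy variables below are our own assumptions for the sketch, not the exact GraN-DAG implementation.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def preliminary_neighborhood_selection(X):
    """Return a boolean mask M where M[i, j] = True means x_i is kept
    as a potential parent of x_j (a sketch of the PNS step)."""
    n, d = X.shape
    mask = np.zeros((d, d), dtype=bool)
    for j in range(d):
        other = [i for i in range(d) if i != j]
        reg = ExtraTreesRegressor(n_estimators=100, random_state=0)
        reg.fit(X[:, other], X[:, j])
        imp = reg.feature_importances_      # impurity-based importances
        keep = imp > imp.mean()             # threshold at the mean score
        for i, k in zip(other, keep):
            mask[i, j] = k
    return mask

# Toy data: x1 depends on x0; x2 is independent noise.
rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = np.sin(x0) + 0.1 * rng.normal(size=500)
x2 = rng.normal(size=500)
M = preliminary_neighborhood_selection(np.column_stack([x0, x1, x2]))
```

On this toy data, x0 should be retained as a potential parent of x1 while x2 is discarded.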
A.2 Optimization
Let us recall the augmented Lagrangian:

$$\mathcal{L}(\phi, \lambda_t, \mu_t) \triangleq \mathbb{E}_{X \sim P_X} \sum_{j=1}^{d} \log p_j\!\left(X_j \mid X_{\pi_j^\phi}; \phi\right) - \lambda_t h(\phi) - \frac{\mu_t}{2} h(\phi)^2 \qquad (12)$$

where $\lambda_t$ and $\mu_t$ are the Lagrangian and penalty coefficients of the $t$-th subproblem, respectively. In all our experiments, we initialize those coefficients using $\lambda_0 = 0$ and a small $\mu_0 > 0$. We approximately solve each nonconvex subproblem using RMSprop Tieleman and Hinton [2012], a stochastic gradient descent variant popular for NNs. We use the following gradient estimate:

$$\nabla_\phi \mathcal{L}(\phi, \lambda_t, \mu_t) \approx \frac{1}{B} \sum_{b=1}^{B} \nabla_\phi \sum_{j=1}^{d} \log p_j\!\left(x_j^{(b)} \mid x_{\pi_j^\phi}^{(b)}; \phi\right) - \lambda_t \nabla_\phi h(\phi) - \frac{\mu_t}{2} \nabla_\phi h(\phi)^2 \qquad (13)$$

where $\{x^{(b)}\}_{b=1}^{B}$ is a minibatch sampled from the data set and $B$ is the minibatch size. The gradient estimate can be computed using standard deep learning libraries. We consider a subproblem to have converged when $\mathcal{L}$ evaluated on a held-out data set stops increasing. Let $\phi_t^*$ be the approximate solution to subproblem $t$. Then, $\lambda_t$ and $\mu_t$ are updated according to the following rule:

$$\lambda_{t+1} \leftarrow \lambda_t + \mu_t h(\phi_t^*) \qquad \mu_{t+1} \leftarrow \begin{cases} \eta \, \mu_t & \text{if } h(\phi_t^*) > \gamma \, h(\phi_{t-1}^*) \\ \mu_t & \text{otherwise} \end{cases} \qquad (14)$$

with $\eta > 1$ and $\gamma < 1$. Each subproblem is initialized using the previous subproblem's solution $\phi_{t-1}^*$. The augmented Lagrangian method stops when $h(\phi_t^*) \leq \epsilon$ for a small tolerance $\epsilon > 0$.
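To make the outer loop concrete, here is a minimal sketch of the augmented Lagrangian scheme on a one-dimensional toy problem. The inner stochastic-gradient solver for the NNs is replaced by SciPy's scalar minimizer, and the constants ($\eta = 10$, $\gamma = 0.9$, tolerance $10^{-8}$) are illustrative defaults, not the exact values used in the experiments.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def augmented_lagrangian(f_neg, h, lam=0.0, mu=1e-3,
                         eta=10.0, gamma=0.9, tol=1e-8):
    """Sketch of the scheme of Eqs. (12)-(14): maximize -f_neg(theta)
    subject to h(theta) = 0.  Each subproblem is solved to near-optimality;
    a 1-D solver stands in for the stochastic-gradient inner loop."""
    h_prev, theta = np.inf, 0.0
    for _ in range(1000):
        # Subproblem t: minimize f_neg + lam * h + (mu / 2) * h^2.
        theta = minimize_scalar(
            lambda t: f_neg(t) + lam * h(t) + 0.5 * mu * h(t) ** 2).x
        lam = lam + mu * h(theta)            # Lagrangian coefficient update
        if h(theta) > gamma * h_prev:        # constraint shrinking too slowly
            mu = eta * mu                    # -> strengthen the penalty
        h_prev = h(theta)
        if h_prev <= tol:                    # declare convergence
            break
    return theta

# Toy problem: maximize -(theta - 1)^2 subject to h(theta) = theta^2 = 0.
# The unconstrained optimum is theta = 1; the constraint forces theta = 0.
theta = augmented_lagrangian(f_neg=lambda t: (t - 1) ** 2,
                             h=lambda t: t ** 2)
```

As the penalty coefficient grows, the subproblem solutions are driven toward the feasible set, so `theta` approaches 0.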
A.3 Thresholding
The augmented Lagrangian method outputs $\phi_T^*$, where $T$ is the number of subproblems solved before declaring convergence. Note that the weighted adjacency matrix $A_{\phi_T^*}$ will most likely not represent an acyclic graph, even if we threshold as we learn, as explained in Section 3.4. We need to remove additional edges to obtain a DAG (edges are removed using the mask presented in Section 3.4). One option would be to remove edges one by one until a DAG is obtained, starting from the edge with the lowest $(A_{\phi_T^*})_{ij}$ up to the edge with the highest $(A_{\phi_T^*})_{ij}$. This amounts to gradually increasing the threshold until $A_{\phi_T^*}$ is acyclic. However, this approach has the following flaw: it is possible for $(A_{\phi_T^*})_{ij}$ to be significantly higher than zero while $x_j$ is almost completely independent of variable $x_i$. This can happen for at least two reasons. First, the NN paths from input $i$ to output $j$ might end up cancelling each other, rendering the input inactive. Second, some neurons of the NNs might always be saturated for the observed range of inputs, rendering some NN paths effectively inactive without being inactive in the sense described in Section 3.1. These two observations illustrate the fact that $(A_{\phi_T^*})_{ij} = 0$ is only a sufficient condition for $x_j$ to be independent of variable $x_i$, not a necessary one.
To avoid this issue, we consider the following alternative. Consider the function $f : \mathbb{R}^d \to \mathbb{R}^d$ which maps all variables to their respective conditional likelihoods, i.e. $f_j(x) \triangleq p_j(x_j \mid x_{\pi_j^\phi}; \phi)$ for all $j$. We consider the following expected Jacobian matrix:

$$\mathcal{J} \triangleq \mathbb{E}_{X \sim P_X} \left| \frac{\partial f}{\partial x}(X) \right| \qquad (15)$$

where $\left|\frac{\partial f}{\partial x}(X)\right|$ is the Jacobian matrix of $f$ evaluated at $X$, in absolute value (element-wise). Similarly to $(A_\phi)_{ij}$, the entry $\mathcal{J}_{ij}$ can be loosely interpreted as the strength of edge $i \rightarrow j$. We propose removing edges starting from the lowest $\mathcal{J}_{ij}$ to the highest, stopping as soon as acyclicity is achieved. We believe $\mathcal{J}$ is better than $A_\phi$ at capturing which NN inputs are effectively inactive, since it takes into account NN paths cancelling each other and saturated neurons. Empirically, we found that using $\mathcal{J}$ instead of $A_\phi$ yields better results.
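The edge-removal procedure can be sketched as follows. The score matrix below stands in for $\mathcal{J}$ (or $A_\phi$), and the acyclicity test uses Kahn's topological-sort algorithm; function names and the toy scores are ours.

```python
import numpy as np

def is_acyclic(adj):
    """Kahn's algorithm: True iff the binary adjacency matrix is a DAG."""
    adj = adj.copy()
    in_deg = adj.sum(axis=0)
    stack = [j for j in range(len(adj)) if in_deg[j] == 0]
    seen = 0
    while stack:
        i = stack.pop()
        seen += 1
        for j in np.flatnonzero(adj[i]):
            adj[i, j] = 0
            in_deg[j] -= 1
            if in_deg[j] == 0:
                stack.append(j)
    return seen == len(adj)

def threshold_to_dag(scores):
    """Zero out edges from weakest to strongest until the graph is a DAG.
    `scores` plays the role of the expected Jacobian matrix of Eq. (15)."""
    mask = (scores > 0).astype(int)
    order = np.argsort(scores, axis=None)   # edge indices, weakest first
    for flat in order:
        if is_acyclic(mask):
            break
        i, j = np.unravel_index(flat, scores.shape)
        mask[i, j] = 0
    return mask

# A 3-cycle with one weak edge: removing the weakest edge (2 -> 0) breaks it.
scores = np.array([[0.0, 0.9, 0.0],
                   [0.0, 0.0, 0.8],
                   [0.1, 0.0, 0.0]])
dag = threshold_to_dag(scores)
```

Here the weak edge 2 → 0 is removed first, leaving the acyclic chain 0 → 1 → 2.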
A.4 DAG Pruning
Once the thresholding is performed and a DAG is obtained, GraN-DAG performs a pruning step identical to CAM Bühlmann et al. [2014] in order to remove spurious edges. We use the implementation of Bühlmann et al. [2014], based on the R function gamboost from the mboost package. For each variable $x_j$, a generalized additive model is fitted against the current parents of $x_j$ and a significance test of covariates is applied. Parents with a p-value higher than 0.001 are removed from the parent set. Similarly to what Bühlmann et al. [2014] observed, this pruning phase generally has the effect of greatly reducing the SHD without considerably changing the SID.
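A rough stand-in for this pruning step: regress each variable on its current parents and drop parents whose coefficient is not significant. Note the actual procedure fits a generalized additive model (R's gamboost); plain OLS t-tests, the function name and the toy variables below are simplifying assumptions.

```python
import numpy as np
from scipy import stats

def prune_parents(X, parents, j, alpha=1e-3):
    """Keep only the parents of x_j whose OLS coefficient is significant
    (p-value below alpha) -- an illustrative stand-in for GAM-based pruning."""
    n = X.shape[0]
    P = np.column_stack([np.ones(n)] + [X[:, i] for i in parents])
    beta, _, _, _ = np.linalg.lstsq(P, X[:, j], rcond=None)
    resid = X[:, j] - P @ beta
    dof = n - P.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(P.T @ P)   # covariance of the estimates
    tvals = beta / np.sqrt(np.diag(cov))
    pvals = 2 * stats.t.sf(np.abs(tvals), dof)
    # keep parent i only if its p-value is below alpha (skip the intercept)
    return [i for i, p in zip(parents, pvals[1:]) if p < alpha]

rng = np.random.default_rng(0)
x0 = rng.normal(size=1000)                   # true parent of x1
x2 = rng.normal(size=1000)                   # spurious candidate parent
x1 = 2.0 * x0 + 0.1 * rng.normal(size=1000)
X = np.column_stack([x0, x1, x2])
kept = prune_parents(X, parents=[0, 2], j=1)
```

The spurious parent x2 is pruned while the true parent x0 survives the significance test.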
A.5 Graph types
Erdős–Rényi (ER) graphs are generated by randomly sampling a topological order and by adding directed edges, where allowed by that order, independently with probability $p = \frac{2e}{d^2 - d}$, where $e$ is the expected number of edges in the resulting DAG and $d$ is the number of nodes.
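A minimal sketch of this sampling scheme (the function name and RNG usage are ours); since only edges consistent with the sampled order are allowed, the result is acyclic by construction:

```python
import numpy as np

def sample_er_dag(d, e, rng):
    """Sample an Erdos-Renyi DAG on d nodes with e expected edges:
    draw a random topological order, then keep each of the d(d-1)/2
    allowed edges independently with probability p = 2e / (d^2 - d)."""
    p = 2 * e / (d ** 2 - d)
    order = rng.permutation(d)
    adj = np.zeros((d, d), dtype=int)
    for a in range(d):
        for b in range(a + 1, d):     # only order[a] -> order[b] is allowed
            if rng.random() < p:
                adj[order[a], order[b]] = 1
    return adj

rng = np.random.default_rng(0)
A = sample_er_dag(d=10, e=10, rng=rng)   # "ER1": e = d expected edges
```
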
Scale-free (SF) graphs were generated using the Barabási–Albert model Barabási and Albert [1999], which is based on preferential attachment. Nodes are added one by one. Between the new node and the existing nodes, $m$ edges (where $m$ is equal to 1 or 4) are added. An existing node $i$ is chosen with probability $\frac{k_i}{\sum_j k_j}$, where $k_i$ represents the degree of node $i$. While ER graphs have a degree distribution following a Poisson distribution, SF graphs have a degree distribution following a power law: a few nodes, often called hubs, have a high degree. Some authors Barabási [2009] have stated that these types of graphs have properties similar to real-world networks found in many different fields, although these claims remain controversial Clauset et al. [2009].
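A corresponding sketch for the preferential-attachment scheme. Orienting every edge from old node to new node yields a DAG; initializing all degrees at 1 (so the first attachments are well defined) is an implementation assumption of this sketch.

```python
import numpy as np

def sample_sf_dag(d, m, rng):
    """Sample a scale-free DAG via Barabasi-Albert preferential attachment.
    Each new node receives min(m, #existing) edges whose other endpoint is
    an existing node i chosen with probability k_i / sum_j k_j."""
    adj = np.zeros((d, d), dtype=int)
    deg = np.ones(d)                  # degrees start at 1 (sketch assumption)
    for new in range(1, d):
        k = min(m, new)
        probs = deg[:new] / deg[:new].sum()
        targets = rng.choice(new, size=k, replace=False, p=probs)
        for t in targets:
            adj[t, new] = 1           # edge from existing node to new node
            deg[t] += 1
            deg[new] += 1
    return adj

rng = np.random.default_rng(0)
A = sample_sf_dag(d=20, m=4, rng=rng)    # "SF4"
```

Because every edge points from an earlier node to a later one, the adjacency matrix is strictly upper triangular, hence acyclic.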
A.6 Supplementary experiments
The results for 20 and 100 nodes are presented in Tables 4 and 5, using exactly the same setting as in Section 4. The conclusions drawn remain the same as for 10 and 50 nodes. For GES and PC, the SHD and SID are presented in Tables 6 and 7, respectively. Figure 1 shows the entries of the weighted adjacency matrix $A_\phi$ as training proceeds in a typical run for 10 nodes.
Table 4: Results for graphs of 20 nodes (mean ± standard deviation).

|          | ER1 SHD      | ER1 SID     | ER4 SHD      | ER4 SID      |
|----------|--------------|-------------|--------------|--------------|
| GraN-DAG | 4.0 ± 3.4    | 17.9 ± 19.5 | 45.2 ± 10.7  | 165.1 ± 21.0 |
| DAG-GNN  | 25.6 ± 7.5   | 109.1 ± 53.1| 75.0 ± 7.7   | 344.8 ± 17.0 |
| NOTEARS  | 30.3 ± 7.8   | 107.3 ± 47.6| 79.0 ± 8.0   | 346.6 ± 13.2 |
| CAM      | 2.7 ± 1.8    | 10.6 ± 8.6  | 41.0 ± 11.9  | 157.9 ± 41.2 |
| RESIT    | 134.7 ± 16.9 | 39.7 ± 31.4 | 99.2 ± 23.0  | 185.1 ± 41.7 |
| RANDOM   | 103.0 ± 39.6 | 94.3 ± 53.0 | 117.5 ± 25.9 | 298.5 ± 28.7 |

|          | SF1 SHD      | SF1 SID     | SF4 SHD      | SF4 SID      |
|----------|--------------|-------------|--------------|--------------|
| GraN-DAG | 7.6 ± 2.5    | 28.8 ± 10.4 | 36.8 ± 5.1   | 62.5 ± 18.8  |
| DAG-GNN  | 19.5 ± 1.8   | 60.1 ± 12.8 | 49.5 ± 5.4   | 115.2 ± 33.3 |
| NOTEARS  | 23.9 ± 3.5   | 69.4 ± 19.7 | 52.0 ± 4.5   | 120.5 ± 32.5 |
| CAM      | 5.7 ± 2.6    | 23.3 ± 18.0 | 35.6 ± 4.5   | 59.1 ± 18.8  |
| RESIT    | 137.4 ± 22.0 | 71.8 ± 38.3 | 119.4 ± 7.5  | 98.6 ± 33.0  |
| RANDOM   | 105.2 ± 48.8 | 81.1 ± 54.4 | 121.5 ± 28.5 | 204.8 ± 38.5 |
Table 5: Results for graphs of 100 nodes (mean ± standard deviation).

|          | ER1 SHD         | ER1 SID        | ER4 SHD         | ER4 SID         |
|----------|-----------------|----------------|-----------------|-----------------|
| GraN-DAG | 15.1 ± 6.0      | 83.9 ± 46.0    | 206.6 ± 31.5    | 4207.3 ± 419.7  |
| DAG-GNN  | 110.2 ± 10.5    | 883.0 ± 320.9  | 379.5 ± 24.7    | 8036.1 ± 656.2  |
| NOTEARS  | 125.6 ± 12.1    | 913.1 ± 343.8  | 387.8 ± 25.3    | 8124.7 ± 577.4  |
| CAM      | 17.3 ± 4.5      | 124.9 ± 65.0   | 186.4 ± 28.8    | 4601.9 ± 482.7  |
| RANDOM   | 1561.6 ± 1133.4 | 1175.3 ± 547.9 | 2380.9 ± 1458.0 | 7729.7 ± 1056.0 |

|          | SF1 SHD         | SF1 SID        | SF4 SHD         | SF4 SID         |
|----------|-----------------|----------------|-----------------|-----------------|
| GraN-DAG | 59.2 ± 7.7      | 265.4 ± 64.2   | 262.7 ± 19.6    | 872.0 ± 130.4   |
| DAG-GNN  | 97.6 ± 1.5      | 438.6 ± 112.7  | 316.0 ± 14.3    | 1394.6 ± 165.9  |
| NOTEARS  | 111.7 ± 5.4     | 484.3 ± 138.4  | 327.2 ± 15.8    | 1442.8 ± 210.1  |
| CAM      | 52.7 ± 9.3      | 230.3 ± 36.9   | 255.6 ± 21.7    | 845.8 ± 161.3   |
| RANDOM   | 2222.2 ± 1141.2 | 1164.2 ± 593.3 | 2485.0 ± 1403.9 | 4206.4 ± 1642.1 |
Table 6: SHD for GES and PC (mean ± standard deviation).

|     | 10 ER1    | 10 ER4     | 20 ER1      | 20 ER4     | 50 ER1       | 50 ER4       | 100 ER1      | 100 ER4      |
|-----|-----------|------------|-------------|------------|--------------|--------------|--------------|--------------|
| GES | 13.8 ± 4.8| 32.3 ± 4.3 | 43.3 ± 12.4 | 94.6 ± 9.8 | 106.6 ± 24.7 | 254.4 ± 39.3 | 292.9 ± 33.6 | 542.6 ± 51.2 |
| PC  | 8.4 ± 3   | 34 ± 2.6   | 20.1 ± 6.5  | 73.1 ± 5.8 | 44.0 ± 11.6  | 183.6 ± 20   | 95.2 ± 9.1   | 358.8 ± 20.6 |

|     | 10 SF1    | 10 SF4     | 20 SF1      | 20 SF4     | 50 SF1       | 50 SF4       | 100 SF1      | 100 SF4      |
|-----|-----------|------------|-------------|------------|--------------|--------------|--------------|--------------|
| GES | 8.1 ± 2.4 | 17.4 ± 4.5 | 26.2 ± 7.5  | 50.7 ± 6.2 | 73.9 ± 7.4   | 178.8 ± 16.5 | 190.3 ± 22   | 408.7 ± 24.9 |
| PC  | 4.8 ± 2.4 | 16.4 ± 2.8 | 13.6 ± 2.1  | 44.4 ± 4.6 | 43.1 ± 5.7   | 135.4 ± 10.7 | 97.6 ± 6.6   | 314.2 ± 17.5 |
Table 7: SID for GES and PC on graphs of 10, 20, 50 and 100 nodes (ER1, ER4, SF1 and SF4); since GES and PC output a CPDAG, lower and upper bounds on the SID over the Markov equivalence class are reported.
A.7 Metrics
SHD takes two partially directed acyclic graphs (PDAGs) and counts the number of node pairs for which the edge type differs between the two PDAGs. There are four edge types: $i \rightarrow j$, $i \leftarrow j$, $i - j$ and no edge. Since this distance is defined over the space of PDAGs, we can use it to compare DAGs with DAGs, DAGs with CPDAGs and CPDAGs with CPDAGs. When comparing a DAG with a CPDAG, having $i - j$ instead of $i \rightarrow j$ counts as a mistake.
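A minimal sketch of the SHD computation for the DAG-vs-DAG case (a full implementation would also handle the undirected edges of CPDAGs); the adjacency-matrix encoding is ours:

```python
import numpy as np

def shd(A, B):
    """Structural Hamming distance between two DAG adjacency matrices:
    for every unordered pair of nodes, count 1 if the edge type
    (i -> j, i <- j, or no edge) differs between the two graphs."""
    d = len(A)
    dist = 0
    for i in range(d):
        for j in range(i + 1, d):
            if (A[i, j], A[j, i]) != (B[i, j], B[j, i]):
                dist += 1
    return dist

A = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]])   # 0 -> 1 -> 2
B = np.array([[0, 0, 0],
              [1, 0, 1],
              [0, 0, 0]])   # 0 <- 1 -> 2
```

Here `shd(A, B)` is 1: the reversed edge between nodes 0 and 1 counts as a single mistake, and the shared edge 1 → 2 contributes nothing.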
SHD-CPDAG is very similar to SHD. The only difference is that both DAGs are first mapped to their respective CPDAGs before measuring the SHD.
Introduced in Peters and Bühlmann [2015], SID counts the number of interventional distributions of the form $p(x_i \mid \mathrm{do}(x_j = \bar{x}_j))$ that would be miscalculated using the parent adjustment formula Pearl [2009] if we were to use the predicted DAG instead of the ground-truth DAG to form the parent adjustment set. Some care needs to be taken to evaluate the SID for methods outputting a CPDAG, such as GES and PC. Peters and Bühlmann [2015] propose reporting the SID of the DAGs which have approximately the minimal and the maximal SID within the Markov equivalence class given by the CPDAG. See Peters and Bühlmann [2015] for more details.
A.8 Hyperparameters
All GraN-DAG runs were performed using the following set of hyperparameters. We used RMSprop as the optimizer, with a higher learning rate for the first subproblem than for all subsequent subproblems. Each NN has two hidden layers of 10 units (except for the real-data experiments, which use only one hidden layer). Leaky-ReLU is used as the activation function. The NNs are initialized using the scheme proposed in Glorot and Bengio [2010], also known as Xavier initialization. We used minibatches of 64 samples.
For RESIT, we used the default hyperparameters found in the code available on the authors' webpage. For NOTEARS and DAG-GNN, we also used the default hyperparameters found in the authors' code. For GES and PC, we used the default hyperparameters of the pcalg R package. For CAM, we used the default hyperparameters found in the CAM R package, with default PNS and DAG pruning.