DVAE: A Variational Autoencoder for Directed Acyclic Graphs
Abstract
Graph-structured data are abundant in the real world. Among different graph types, directed acyclic graphs (DAGs) are of particular interest to machine learning researchers, as many machine learning models are realized as computations on DAGs, including neural networks and Bayesian networks. In this paper, we study deep generative models for DAGs, and propose a novel DAG variational autoencoder (DVAE). To encode DAGs into the latent space, we leverage graph neural networks. We propose a DAG-style asynchronous message passing scheme that allows encoding the computations defined by DAGs, rather than using existing simultaneous message passing schemes to encode the graph structures. We demonstrate the effectiveness of our proposed DVAE through two tasks: neural architecture search and Bayesian network structure learning. Experiments show that our model not only generates novel and valid DAGs, but also produces a smooth latent space that facilitates searching for DAGs with better performance through Bayesian optimization.
Muhan Zhang, Shali Jiang, Zhicheng Cui, Roman Garnett, Yixin Chen Department of Computer Science and Engineering Washington University in St. Louis {muhan, jiang.s, z.cui, garnett}@wustl.edu, chen@cse.wustl.edu
Preprint. Work in progress.
1 Introduction
Many real-world problems can be posed as the optimization of a directed acyclic graph (DAG) representing some computational task. In machine learning, deep neural networks are DAGs. Although they have achieved remarkable performance on a wide range of learning tasks, tremendous effort must be devoted to designing their architectures, which is essentially a DAG optimization task. Similarly, optimizing the connection structures of Bayesian networks is a critical problem in learning graphical models [1]. DAG optimization is pervasive in other fields as well. For example, in electronic circuit design, engineers need to optimize directed network structures not only to realize target functions, but also to meet specifications such as power usage and operating temperature.
DAGs, as well as other discrete structures, cannot be optimized directly with traditional gradient-based techniques, as gradients are not available. Bayesian optimization, a state-of-the-art black-box optimization technique, requires a kernel to measure the similarity between discrete structures as well as a method to explore the design space and extrapolate to new points. Principled solutions to these problems are still lacking for discrete structures. Recently, there has been increased interest in training generative models for discrete data types such as molecules [2, 3], arithmetic expressions [4], source code [5], general graphs [6], etc. In particular, Kusner et al. [3] developed a grammar variational autoencoder (GVAE) for molecules, which is able to encode and decode molecules into and from a continuous latent space, allowing one to optimize molecule properties by searching in this well-behaved space instead of a discrete space. Inspired by this work, we propose to train variational autoencoders for DAGs, and to optimize DAG structures in the latent space using Bayesian optimization.
Existing graph generative models can be classified into three categories: token-based, adjacency-matrix-based, and graph-based approaches. Token-based graph generative models [2, 3, 7] represent a graph as a sequence of tokens (e.g., characters, grammar production rules) and model these sequences based on established RNN modules such as gated recurrent units (GRUs) [8]. Adjacency-matrix-based models [9, 10, 11, 12, 13] generate columns/entries of the adjacency matrix of a graph sequentially, or generate the entire adjacency matrix in one shot. Graph-based models [6, 14, 15, 16] maintain state vectors for existing nodes and generate new nodes and new edges based on the existing graph state and node states.
Among the three types, token-based models require task-specific graph grammars such as SMILES for molecules [17], and are thus less general. Adjacency-matrix-based models leverage a proxy matrix representation of graphs and generate graphs through traditional multilayer perceptrons (MLPs) or RNNs, which can be less expressive and less suitable for graphs. On the other hand, graph-based models seem more general and natural, since they operate directly on graph structures instead of proxy representations. In addition, the iterative process is driven by the current graph and node states as represented by graph neural networks (GNNs), which have already shown their powerful graph feature learning ability on various tasks [18, 19, 20, 21, 22, 23, 24].
A GNN extracts node features by passing neighbor nodes’ messages to the center, so that the center node gets the summarized feature of the local substructure around it. Such message passing happens at all nodes simultaneously to extract local substructure features around nodes, and then the local feature vectors are summed to be the graph state [25]. Although such symmetric message passing works for undirected graphs, it fails to capture the computation dependencies defined by DAGs. In other words, nodes within a DAG naturally have some ordering based on the dependency structures, which is ignored by existing GNNs but crucial for performing the computation on the DAG. If we encode a DAG using existing GNNs, the embedding may only capture the DAG structure but fail to encode the DAG computation.
To encode the computation defined by a DAG, we propose an asynchronous DAG message passing scheme: message passing no longer happens at all nodes simultaneously, but respects the computation dependencies between the nodes. For example, suppose node A has two parent nodes, B and C, in a DAG. Our scheme does not perform feature learning for A until the feature learning on both B and C has finished. Then, the aggregated message from B and C is passed to A to trigger A's feature learning.
We incorporate this feature learning scheme in both our encoder and decoder, and propose the DAG variational autoencoder (DVAE). We theoretically prove that 1) the DVAE encoder is permutation-invariant, and 2) the mapping from computations to encodings is injective. This means that DVAE encodes computations defined on DAGs rather than merely substructure patterns, which is crucial for predicting computation architectures' performance and for subsequent optimization.
We validate our proposed DVAE on two types of DAGs: neural networks and Bayesian networks. We show that our model not only generates novel, realistic, and valid DAG structures, but also produces smooth latent spaces that are effective for searching better neural architectures and Bayesian networks through Bayesian optimization.
2 Related Work
2.1 Variational Autoencoder
The variational autoencoder (VAE) [26, 27] provides a framework to learn both a probabilistic generative model $p_\theta(x|z)$ (the decoder) and an approximated posterior distribution $q_\phi(z|x)$ (the encoder). A VAE is trained by maximizing the evidence lower bound (ELBO):

$\mathcal{L}(\phi, \theta; x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, p(z)).$  (1)
The prior distribution $p(z)$ is typically taken to be $\mathcal{N}(\mathbf{0}, I)$. The posterior approximation $q_\phi(z|x)$ is usually a multivariate Gaussian distribution whose mean and covariance are parameterized by the encoder network, with the covariance matrix often constrained to be diagonal. The generative model $p_\theta(x|z)$ can in principle take arbitrary parametric forms whose parameters are output by the decoder network. After learning, we can generate new data by decoding latent vectors $z$ sampled from the prior distribution $p(z)$. For generating discrete data types, $p_\theta(x|z)$ is often decomposed into a series of decision steps.
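As a concrete illustration, with a diagonal-Gaussian posterior and standard-normal prior, the KL term of the ELBO has a closed form. The sketch below (NumPy; the function names are ours, not from the paper) computes a single-sample ELBO estimate from a reconstruction log-likelihood and the posterior parameters:

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * float(np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))

def elbo_estimate(log_px_given_z, mu, log_var):
    """Single-sample ELBO: reconstruction log-likelihood minus the KL term."""
    return log_px_given_z - gaussian_kl(mu, log_var)
```

When the posterior equals the prior (zero mean, unit variance), the KL term vanishes and the ELBO reduces to the reconstruction term.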
2.2 AutoML and Neural Architecture Search
AutoML and automated neural architecture search (NAS) have seen major advances in recent years [28, 29, 30, 31, 32, 33]. See Hutter et al. [34] for an overview. Here we only discuss the most related works.
Luo et al. [35] proposed a novel approach called Neural Architecture Optimization (NAO). The basic idea is to jointly learn an encoder-decoder between networks and a continuous space, together with a performance predictor that maps the continuous representation of a network to its performance on a given dataset; then they perform a few iterations of gradient descent, guided by the predictor, to find better architectures in the continuous space, which are then decoded into real networks to evaluate. This methodology is very similar to that of Gómez-Bombarelli et al. [2] and Jin et al. [14] for molecule optimization, and also to Mueller et al. [36] for slightly revising a sentence.
There are several key differences compared with our approach. First, they use strings (e.g., "node2 conv 3x3 node1 maxpooling 3x3") to represent neural architectures, whereas we directly use graph representations, which is more natural and generalizes to other graphs such as Bayesian network structures. Second, they use supervised learning instead of unsupervised learning. That means they need to first evaluate a considerable number of randomly sampled graphs on a typically large dataset (e.g., train many neural networks), and use these results to supervise the training of the embedding; the encoding model must be retrained for every new dataset. In contrast, we train our variational autoencoder in a fully unsupervised manner, so the embedding is general-purpose.
Fusi et al. [37] proposed a novel AutoML algorithm also using model embedding, but with a matrix factorization approach. They first construct a matrix of the performances of thousands of ML pipelines on hundreds of datasets; then they use probabilistic matrix factorization to get latent representations of the pipelines. Given a new dataset, Bayesian optimization (expected improvement) is used to find the best pipeline. This approach only allows choosing from predefined off-the-shelf ML models, so its flexibility is somewhat limited.
Kandasamy et al. [38] use Bayesian optimization for NAS; they define a kernel that measures the similarities between networks by solving an optimal transport problem, and in each iteration, they use some evolutionary heuristics to generate a set of candidate networks and use expected improvement to choose the next one to evaluate. This work is similar to ours in the application of Bayesian optimization, but the discrete search space is heuristically extrapolated.
2.3 Bayesian Network Structure Learning
Bayesian network (BN) structure learning dates at least back to Chow and Liu [39], where the Chow-Liu tree algorithm was developed for learning tree-structured models. Much effort has been devoted to this area in recent decades (see Chapter 18 of Koller and Friedman [1] for a detailed discussion), and there is still a great deal of research on this topic [40, 41, 42].
One of the main approaches for BN structure learning is score-based search. That is, we define some "goodness-of-fit" score for a given network structure, and search for one with the optimal score. Commonly used scores include BIC, BDeu, etc., mostly based on the marginal likelihood [1]. Which score to use is itself an ongoing research topic [43]. It is well known that finding an optimally scored BN with in-degree at most $d$ is NP-hard for $d \geq 2$ [44], so exact algorithms such as dynamic programming [45] or shortest path approaches [46, 47] can only solve small-scale problems. We have to resort to heuristic methods such as local search, simulated annealing, etc. [48].
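To make the scoring concrete, the following sketch computes the BIC score of a fully observed discrete Bayesian network as the maximized log-likelihood minus (log N)/2 times the number of free parameters. The `parents`/`arities` interface is our own illustration, not the API of any BN package:

```python
import numpy as np
from collections import Counter

def bic_score(data, parents, arities):
    """BIC(G) = max log-likelihood - (log N / 2) * number of free parameters.
    data: (N, d) integer array of samples; parents: dict node -> tuple of
    parent nodes; arities: number of categories per node."""
    n, score = data.shape[0], 0.0
    for v, pa in parents.items():
        cols = list(pa)
        joint = Counter((tuple(row[cols]), row[v]) for row in data)  # N(pa cfg, v=k)
        marg = Counter(tuple(row[cols]) for row in data)             # N(pa cfg)
        for (cfg, _), njk in joint.items():
            score += njk * np.log(njk / marg[cfg])                   # MLE log-likelihood
        n_free = (arities[v] - 1) * int(np.prod([arities[p] for p in pa]))
        score -= 0.5 * np.log(n) * n_free                            # complexity penalty
    return float(score)
```

Score-based search then compares candidate structures by evaluating this quantity for different `parents` assignments.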
All these approaches optimize the network structure in the discrete space. In this paper, we propose to embed the graph space into a continuous Euclidean space, and show that model scores such as BIC can be modeled smoothly in the embedded space, hence relaxing the NP-hard combinatorial search problem into a continuous optimization problem.
Using Gaussian processes (GPs) to approximate the BN score has been studied before. Yackley and Lane [49] analyzed the smoothness of the BDe score, showing that a local change (e.g., adding an edge) can change the score by at most $O(\log n)$, where $n$ is the number of training points. They proposed to use a GP as a proxy for the score to accelerate the search. Anderson and Lane [50] also used GPs to model the BDe score, and showed that probability of improvement guides local search better than hill climbing. However, these methods still operate heuristically and locally in the discrete space, whereas our embedded space makes both local and global methods such as gradient descent and Bayesian optimization applicable in a principled manner.
3 DAG Variational Autoencoder (DVAE)
In this section, we describe our proposed DAG variational autoencoder (DVAE). DVAE uses an asynchronous message passing scheme to encode and decode DAGs respecting the computational dependencies, thus providing a smooth latent space w.r.t. computations that is suitable for optimizing computation architectures’ performance.
3.1 Encoding
We first describe how DVAE encodes DAGs using an asynchronous message passing scheme. We assume there is a single starting node $v_1$ which does not have any predecessors (if there are multiple, we add a virtual starting node connected to all of them). In neural architectures, this starting node is the input layer. We use an update function $\mathcal{U}$ to compute the hidden state $h_v$ of each node $v$. The update function takes 1) the one-hot encoding $x_v$ of the current vertex's type and 2) the aggregated message $h_v^{\text{in}}$ from $v$'s predecessors, given by:
$h_v^{\text{in}} = \mathcal{A}(\{ h_u : u \rightarrow v \}),$  (2)
where $u \rightarrow v$ denotes a directed edge from $u$ to $v$, and $\mathcal{A}$ is an aggregation function on the multiset of $v$'s predecessors' states. For the starting node without predecessors, an all-zero vector is fed as $h_v^{\text{in}}$. Having $h_v^{\text{in}}$, the hidden state of $v$ is updated by
$h_v = \mathcal{U}(x_v, h_v^{\text{in}}).$  (3)
To make sure that when a new node comes, all of its predecessors' states have already been computed, we feed in nodes following a topological ordering of the DAG.
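A topological ordering can be obtained with Kahn's algorithm; in the sketch below (our illustration), a node is emitted only after all its predecessors, which is exactly the order in which the encoder can safely compute hidden states:

```python
from collections import deque

def topological_order(num_nodes, edges):
    """Kahn's algorithm: repeatedly emit a node whose predecessors are all
    done; edges are (u, v) pairs meaning a directed edge u -> v."""
    indeg = [0] * num_nodes
    succ = [[] for _ in range(num_nodes)]
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    queue = deque(i for i in range(num_nodes) if indeg[i] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    assert len(order) == num_nodes, "input graph contains a cycle"
    return order
```

Any of the (generally non-unique) orders it returns can be used, since by Theorem 1 below the encoding does not depend on the choice.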
Finally, after all the node states are computed, we use $h_{v_n}$, the hidden state of the ending node $v_n$ (the node without any successors), as the output of the encoder, and use two multilayer perceptrons (MLPs) to predict the mean and variance parameters of $q_\phi(z|x)$ in (1) with $h_{v_n}$ as the input. If there are multiple nodes without successors, we again add a virtual ending node connected from all these "loose ends". In Figure 1, we use a real neural architecture to illustrate how the encoding works.
Note that although topological orderings are usually not unique for a DAG, we can take either one of them as the input node order while ensuring the encoding result is always the same, formalized by the following theorem.
Theorem 1.
If the aggregation function $\mathcal{A}$ is invariant to the order of its inputs, then the DVAE encoder is permutation-invariant.
Proof.
For the starting node $v_1$, since it has no predecessors, the aggregation function $\mathcal{A}$ takes an empty set as input, so its output $h_{v_1}^{\text{in}}$ is invariant to node ordering. Subsequently, the hidden state $h_{v_1}$ output by $\mathcal{U}$ is permutation-invariant.
Now we prove the theorem by induction. Consider node $v_i$. Suppose the hidden state of every predecessor of $v_i$ is permutation-invariant. Then in (2), the message $h_{v_i}^{\text{in}}$ output by $\mathcal{A}$ is permutation-invariant, since $\mathcal{A}$ is invariant to the order of its inputs $\{ h_u : u \rightarrow v_i \}$. Subsequently, the output $h_{v_i}$ of (3) is permutation-invariant. By induction, every node's hidden state is invariant to node ordering, including the ending node's hidden state, i.e., the output of DVAE's encoder. ∎
Theorem 1 means that we get the same encoding for isomorphic DAGs, no matter which node ordering/indexing is used. This property is highly desirable, since we do not want two computational graphs defining exactly the same function to be encoded into different latent vectors just because they have different indexings or layouts.
The following theorem shows that DVAE is able to injectively encode computations on DAGs.
Theorem 2.
Let $G$ be any DAG representing some computation $C$. Let $v_1, \ldots, v_n$ be its nodes following a topological order, each representing some operation $f_{v_i}$, where $v_n$ is the ending node. Then, the encoder of DVAE maps $C$ to $h_{v_n}$ injectively if the aggregation function $\mathcal{A}$ is injective and the update function $\mathcal{U}$ is injective.
Proof.
Suppose an arbitrary input signal $s$ is fed to the starting node $v_1$. For convenience, we will use $C_{v_i}(s)$ to denote the output signal at vertex $v_i$, where $C_{v_i}$ represents the composition of all the operations along the paths from $v_1$ to $v_i$.

For the starting node $v_1$, remember that we feed $h_{v_1}^{\text{in}} = \mathbf{0}$. Since (3) is injective, we know the mapping from $C_{v_1}$ to $h_{v_1}$ is injective. We prove the theorem by induction. Assume the mapping from $C_{v_j}$ to $h_{v_j}$ is injective for all $j < i$. We will prove that the mapping from $C_{v_i}$ to $h_{v_i}$ is also injective.

Let $v_{j_1}, \ldots, v_{j_m}$ be the predecessors of $v_i$, and let $h_{v_{j_k}} = \psi_k(C_{v_{j_k}})$, where each $\psi_k$ is injective by the induction hypothesis. Consider the output signal $C_{v_i}(s)$, which is given by feeding $C_{v_{j_1}}(s), \ldots, C_{v_{j_m}}(s)$ to $f_{v_i}$. Thus,

$C_{v_i}(s) = f_{v_i}\big( C_{v_{j_1}}(s), \ldots, C_{v_{j_m}}(s) \big).$  (4)

In other words, we can write $C_{v_i}$ as

$C_{v_i} = g\big( f_{v_i}, \{ C_{v_{j_1}}, \ldots, C_{v_{j_m}} \} \big),$  (5)

where $g$ is an injective function used for defining the composed computation from $f_{v_i}$ and $\{ C_{v_{j_1}}, \ldots, C_{v_{j_m}} \}$.

With (2) and (3), we can write the hidden state $h_{v_i}$ as follows:

$h_{v_i} = \mathcal{U}\big( \mathrm{onehot}(f_{v_i}), \mathcal{A}(\{ h_{v_{j_1}}, \ldots, h_{v_{j_m}} \}) \big),$  (6)

where $\mathrm{onehot}$ is the injective one-hot encoding function mapping $f_{v_i}$ to $x_{v_i}$. In the above equation, $\mathcal{U}$, $\mathcal{A}$, $\mathrm{onehot}$, and $\psi_1, \ldots, \psi_m$ are all injective. Since the composition of injective functions is injective, there exists an injective function $\varphi$ so that

$h_{v_i} = \varphi\big( f_{v_i}, \{ C_{v_{j_1}}, \ldots, C_{v_{j_m}} \} \big).$  (7)

Then, combining (5), we have:

$h_{v_i} = \varphi\big( g^{-1}(C_{v_i}) \big).$  (8)

Here $\varphi \circ g^{-1}$ is injective since the composition of injective functions is injective. Thus, we have proved that the mapping from $C_{v_i}$ to $h_{v_i}$ is injective. ∎
Note that Theorem 2 does not state that DVAE injectively maps structures $G$ to $h_{v_n}$, since otherwise it would provide an efficient algorithm for the DAG isomorphism problem, which is known to be GI-complete. Instead, it shows that the mapping from computations defined on DAGs to the encodings is injective, which is enough for us: we do not need to differentiate two different structures $G_1$ and $G_2$ as long as they represent the same computation.
Theorem 2 shows that DVAE always maps DAGs representing the same computation to the same encoding, and never maps two DAGs representing different computations to the same encoding. This computation-encoding property is crucial for modeling DAGs, because a DAG's performance depends on its computation. For example, under the same training conditions, two neural networks with the same computation will have the same performance on a given dataset. Since DVAE encodes computations instead of structures, it encodes DAGs with similar performances to the same regions of the latent space, rather than graphs with merely similar structural patterns. Subsequently, the latent space can be smooth w.r.t. performance instead of structure. Such smoothness greatly facilitates searching for better-performing DAGs in the latent space, since local methods can find similarly good DAGs nearby much more easily.
To model and learn the functions $\mathcal{A}$ and $\mathcal{U}$, we resort to neural networks. Specifically, we let $\mathcal{A}$ be a gated sum:

$h_v^{\text{in}} = \sum_{u \rightarrow v} g(h_u) \odot m(h_u),$  (9)

where $m$ is a mapping network and $g$ is a gating network. This is because a gated sum can model injective multiset functions [51].
To model the injective update function $\mathcal{U}$, we use a gated recurrent unit (GRU):

$h_v = \mathrm{GRU}_e(x_v, h_v^{\text{in}}),$  (10)

due to the universal approximation theorem [52]. Here the subscript $e$ denotes "encoding". Using a GRU also allows reducing our framework to traditional RNNs for sequences, as discussed in Section 3.4.
We encode all general computation graphs (including neural networks) using the above encoding scheme. For Bayesian networks, we make some modifications to their encoding due to the special dseparation properties of Bayesian networks (discussed in Appendix A).
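To summarize Section 3.1, the following NumPy sketch implements the asynchronous encoder with a gated sum as in Eq. (9). For brevity, the GRU of Eq. (10) is replaced by a single tanh cell with random weights, so this illustrates the message-passing order rather than the trained model; nodes are assumed to be indexed in topological order:

```python
import numpy as np

def encode_dag(types, edges, num_types, dim=8, seed=0):
    """Asynchronous message passing of Eqs. (2)-(3): each node's state is
    computed only after all its predecessors'. Returns the ending node's
    hidden state. edges are (u, v) pairs meaning u -> v."""
    rng = np.random.default_rng(seed)
    W_x = rng.normal(size=(num_types, dim))   # update: node-type term
    W_h = rng.normal(size=(dim, dim))         # update: aggregated-message term
    W_m = rng.normal(size=(dim, dim))         # gated sum: mapping network m
    W_g = rng.normal(size=(dim, dim))         # gated sum: gating network g
    preds = {v: [u for u, w in edges if w == v] for v in range(len(types))}
    h = {}
    for v, t in enumerate(types):             # asynchronous: topological order
        if preds[v]:                          # gated sum over predecessor states
            msg = sum(1.0 / (1.0 + np.exp(-(h[u] @ W_g))) * (h[u] @ W_m)
                      for u in preds[v])
        else:
            msg = np.zeros(dim)               # all-zero init for the input node
        x = np.eye(num_types)[t]              # one-hot node type
        h[v] = np.tanh(x @ W_x + msg @ W_h)   # simplified update in place of a GRU
    return h[len(types) - 1]
```

Because the gated sum is order-invariant over predecessors, the result does not depend on which topological order is used, matching Theorem 1.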
3.2 Decoding
We now describe how DVAE generates DAGs from the latent space. The DVAE decoder uses the same asynchronous message passing scheme as the encoder to learn intermediate node and graph states. The decoder is based on another GRU, denoted by $\mathrm{GRU}_d$. Given the latent vector $z$ to decode, we first use an MLP to map $z$ to the initial hidden state fed to $\mathrm{GRU}_d$. Then, the decoder constructs a DAG node by node based on the existing graph's state. For the generated node $v_i$, the following steps are performed:

1. Compute $v_i$'s type distribution using an MLP based on the current graph's state.

2. Sample $v_i$'s type. If the sampled type is the ending type, stop the decoding, connect all loose ends to $v_i$, and output the DAG.

3. Update $v_i$'s state by $h_{v_i} = \mathrm{GRU}_d(x_{v_i}, h_{v_i}^{\text{in}})$, where $h_{v_i}^{\text{in}}$ is the initial hidden state if $i = 1$; otherwise, $h_{v_i}^{\text{in}}$ is given by equation (9).

4. For $j = i-1, i-2, \ldots, 1$: (a) predict the probability of edge $(v_j, v_i)$ using an MLP based on $h_{v_j}$ and $h_{v_i}$; (b) sample the edge; and (c) if a new edge is added, update $h_{v_i}$ using step 3.
The above steps are recursively applied to each new generated node, until step 2 samples the ending type. For each generated node, we first predict its node type based on the current graph state. We then sequentially predict whether existing nodes have directed edges to it based on the existing node states and the current node’s state. Note that we maintain node states for both the current node and existing nodes, and update node states during the generation.
In step 4, when sequentially predicting incoming edges from previous nodes, we choose the reversed order $i-1, i-2, \ldots, 1$ instead of $1, 2, \ldots, i-1$ or any other order. This is based on the prior knowledge that a new node is more likely to first connect from the node immediately before it. For example, in neural architecture design, when adding a new layer, we often first connect it from the last added layer, and then decide whether there should be skip connections from other previous layers. Note, however, that such an order is not fixed and can be changed for specific applications. We empirically find that the reversed order works better in our applications.
Note that there exist various other decoder choices (discussed in the introduction), such as those based on adjacency matrices. So why do we choose the current decoder framework? This is because, firstly, it uses the same asynchronous message passing scheme as the DVAE encoder to learn intermediate node states so that the encoder and decoder are “symmetric” (e.g., in traditional autoencoders, encoders and decoders usually take the same kind of neural networks, such as MLP + MLP and CNN + CNN, but rarely MLP + CNN). Secondly, the current decoder updates intermediate node states throughout the generation and predicts edges based on node states, which makes the decoding process more flexible and structureaware. In contrast, adjacencymatrixbased approaches do not maintain node states, and generate next adjacency matrix entry only based on the current RNN state [10].
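The generation loop of Section 3.2 can be sketched as below; `type_logits_fn` and `edge_prob_fn` are hypothetical stand-ins for the MLPs of steps 1 and 4, and the hidden-state updates of step 3 as well as the loose-end bookkeeping are omitted for brevity:

```python
import numpy as np

END = 2  # hypothetical index of the ending node type in this sketch

def decode_dag(z, type_logits_fn, edge_prob_fn, max_nodes=10, seed=0):
    """Sketch of DVAE's node-by-node generation loop."""
    rng = np.random.default_rng(seed)
    types, edges = [0], []                           # node 0: the starting type
    for i in range(1, max_nodes):
        logits = type_logits_fn(z, types)            # step 1: type distribution
        p = np.exp(logits - logits.max())
        t = int(rng.choice(len(p), p=p / p.sum()))   # step 2: sample the type
        types.append(t)
        for j in reversed(range(i)):                 # step 4: reversed edge order
            if rng.random() < edge_prob_fn(z, i, j):
                edges.append((j, i))                 # directed edge v_j -> v_i
        if t == END:                                 # stop at the ending type
            break
    return types, edges
```

Plugging in degenerate stand-ins (e.g., a type distribution concentrated on the ending type and edge probability one) yields the smallest possible DAG: a start node connected to an end node.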
3.3 Training
To measure the reconstruction loss, we use teacher forcing [14]: following the topological order with which the input DAG’s nodes are consumed, we sum the negative loglikelihood of each decoding step by forcing them to generate the ground truth node type or edge at each step. This ensures that the model makes predictions based on the correct histories. Then, we optimize the VAE loss (the negative of (1)) using gradient descent following [14]. More training details are in Appendix D.
3.4 Discussion
Reduction to RNN. The DVAE encoder and decoder reduce to ordinary RNNs when the input DAGs reduce to linked lists. Although we propose DVAE from a GNN perspective, our model can also be seen as a generalization of the traditional sequence modeling framework [53], where a time step depends only on the time step immediately before it, to the DAG case, where a time step has multiple previous dependencies.
Bidirectional encoding. The DVAE encoder simulates how an input signal goes through a DAG, which is known as forward propagation in neural networks. Inspired by the bidirectional RNN [54], we further propose using another GRU to encode a DAG in the reverse direction, thus also simulating the backward propagation process. After reverse encoding, we get two ending states, which are concatenated and linearly mapped to their original size as the final output state. We find that this bidirectional encoding improves the performance and convergence speed of DVAE on neural architectures, so it is used in our neural architecture search experiments.
Incorporating vertex semantics. Note that DVAE currently uses the one-hot encoding of node types as $x_v$, which does not consider the functional similarities between node types. For example, a 3×3 convolution layer might be functionally very similar to a 5×5 convolution layer, while being functionally distinct from a max pooling layer. We expect that incorporating such semantic meanings of node types can further improve DVAE's performance, which is left for future work.
3.5 Comparison with other possible approaches
As discussed in the introduction, there are other types of graph generative models that can potentially work for DAGs. We explore three possible approaches and compare them with DVAE.
3.5.1 SVAE and GraphRNN
The SVAE baseline treats a DAG as a sequence of node strings, which we call the string-based variational autoencoder (SVAE). In SVAE, each node is represented as the one-hot encoding of its type number concatenated with a 0/1 indicator vector marking which previous nodes have directed edges to it (i.e., a column of the adjacency matrix). For example, suppose there are two node types and five nodes; then node 4's string "0 1 0 1 1 0 0" means this node has type 2 and has directed edges from previous nodes 2 and 3 (we pad zeros for future nodes). We train a standard GRU-based RNN variational autoencoder [53] on the topologically sorted node sequences, with these strings treated as the input vectors.
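The SVAE node-string construction can be sketched as follows (our illustration, using 0-indexed nodes and types); it reproduces the example above, where node 4 of a five-node, two-type DAG with incoming edges from nodes 2 and 3 is encoded as "0 1 0 1 1 0 0":

```python
import numpy as np

def node_strings(types, edges, num_types, num_nodes):
    """SVAE-style representation: row i is the one-hot type of node i,
    followed by a 0/1 vector marking which previous nodes have edges into
    node i, zero-padded to num_nodes entries."""
    rows = []
    for i, t in enumerate(types):
        type_part = np.eye(num_types)[t]
        adj_part = np.zeros(num_nodes)
        for u, v in edges:
            if v == i:
                adj_part[u] = 1.0
        rows.append(np.concatenate([type_part, adj_part]))
    return np.stack(rows)
```

The resulting rows are exactly the input vectors fed to the GRU-based RNN VAE in topological order.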
One similar generative model is GraphRNN [10]. Different from SVAE, it further decomposes an adjacency column into entries and generates the entries one by one using another edge-level GRU. However, GraphRNN does not have an encoder, and thus cannot optimize DAG performance in a latent space. To compare with GraphRNN, we equip it with SVAE's encoder and use it as another baseline. Both GraphRNN and SVAE treat DAGs as bit strings and use RNNs to model them. We discuss the risk of such string representations below.
The brittleness of SVAE and GraphRNN. Here, we use a simple example to show that string representations can be very brittle in terms of modeling DAGs' computational purposes. In Figure 2, the left and right DAGs' string representations differ by only two bits, i.e., the edge (1,3) in the left is changed to the edge (2,3) in the right. Although these two graphs are structurally very similar, their computational purposes are vastly different: the left DAG computes a constant function, while the right DAG computes a complex function. In SVAE and GraphRNN, since the bit representations of the left and right DAGs are very similar, they are likely encoded to similar latent vectors, which makes the latent space less smooth and the optimization more difficult. In contrast, DVAE can better differentiate such subtle differences, as the two bits substantially change how a signal flows through the two DAGs.
3.5.2 GCN
The graph convolutional network (GCN) [20] is a representative graph neural network with a simultaneous message passing scheme. In GCN, all nodes simultaneously take their neighbors' incoming messages to update their own states, without following any order. After message passing, the node states are summed to form the graph state. To demonstrate the advantage of the proposed asynchronous message passing, we include GCN as the third baseline. Since GCN is a graph embedding model, we use GCN as the encoder and DVAE's decoder in this baseline.
The main disadvantage of using GCN to encode DAGs is that the simultaneous message passing ignores the computation order of DAG nodes. It only focuses on learning local substructure patterns but fails to encode the computations.
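For contrast with the asynchronous scheme, one simultaneous message-passing layer in the style of Kipf and Welling updates every node at once from the normalized adjacency matrix, with no notion of a computation order (a minimal sketch, not the exact baseline configuration):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One simultaneous message-passing layer: every node aggregates its
    neighbors at the same time via the self-loop-augmented, symmetrically
    normalized adjacency matrix."""
    A_hat = A + np.eye(A.shape[0])               # add self-loops
    d = A_hat.sum(axis=1)
    A_norm = A_hat / np.sqrt(np.outer(d, d))     # symmetric degree normalization
    return np.maximum(0.0, A_norm @ H @ W)       # ReLU activation
```

Note that `A_norm @ H` touches all rows at once; nothing in this computation respects the topological order of a DAG, which is exactly the limitation discussed above.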
4 Experiments
We validate the proposed DAG variational autoencoder (DVAE) through two DAG modeling tasks: 1) neural architecture search and 2) Bayesian network structure learning.
We compare DVAE with the baselines: SVAE, GraphRNN, and GCN described in Section 3.5. Following [3], we do four experiments for each task:

Reconstruction accuracy, prior validity, uniqueness and novelty. We test the VAE models by measuring 1) how often they can reconstruct input DAGs perfectly (Accurate), 2) how often they can generate valid neural architectures or Bayesian networks from the prior distribution (Valid), 3) the portion of unique DAGs out of the valid generations (Unique), and 4) the portion of valid generations that are never seen in the training set (Novel).

Predictive performance of embeddings. We test how well we can use the latent embeddings of neural architectures and Bayesian networks to predict their performance.

Bayesian optimization. We test how well the learned latent space supports searching for high-performance neural architectures and Bayesian networks through Bayesian optimization.

Visualizing the latent space. We decode latent points nearby in the latent space into DAGs to visualize the smoothness of the learned latent space.
Below we briefly describe the settings of the two tasks.
Neural architecture search. Our neural network dataset contains 19,020 neural architectures from efficient neural architecture search (ENAS) [33]. Each neural architecture has 6 layers (excluding input and output layers) sampled from: 3×3 convolution, 5×5 convolution, 3×3 depthwise-separable convolution, 5×5 depthwise-separable convolution [55], 3×3 max pooling, and 3×3 average pooling. We evaluate each neural architecture's weight-sharing (WS) accuracy [33] on CIFAR-10 [56] as its performance measure. We split the dataset into 90% training and 10% held-out test sets. We use the training set to train the VAE models and the test set for evaluation. For more details please refer to Appendix B.
Bayesian network structure learning. Our Bayesian network dataset contains 200,000 random 8-node Bayesian networks generated with the bnlearn package [57] (http://www.bnlearn.com/) in R. For each network, we compute the Bayesian Information Criterion (BIC) score to measure how well the network structure fits the Asia dataset [58]. We split the Bayesian networks into 90% training and 10% test sets. For more details, please refer to Appendix C.
Table 1: Reconstruction accuracy, prior validity, uniqueness, and novelty (%).

              |        Neural architectures       |          Bayesian networks
Methods       | Accurate   Valid  Unique   Novel  | Accurate   Valid  Unique   Novel
DVAE          |    99.96  100.00   37.26  100.00  |    99.94   98.84   38.98   98.01
SVAE          |    99.98  100.00   37.03   99.99  |    99.99  100.00   35.51   99.70
GraphRNN      |    99.85   99.84   29.77  100.00  |    96.71  100.00   27.30   98.57
GCN           |     5.42   99.37   41.18  100.00  |    99.07   99.89   30.53   98.26
4.1 Reconstruction accuracy, prior validity, uniqueness and novelty
We include the training details in Appendix D. After training the VAE models, we first measure their reconstruction accuracies on the held-out test set. Since both encoding and decoding are stochastic in VAEs, for each test DAG, we encode it 10 times and decode each encoding 10 times, following previous work [3, 14]. Then we report the average portion of the 100 decoded DAGs that are identical to the input.
To calculate prior validity, we sample 1,000 latent vectors from the prior distribution $p(z)$. In practice, we find that different models might have different posterior means and variances due to different levels of KL divergence convergence. To remove such effects on evaluating validity, we apply a model-specific normalization to the sampled vectors. We decode each latent vector 10 times, and report the portion of the 10,000 generated DAGs that are valid. A generated DAG is valid if it can be read by the original software which generated the training data. More details are in Appendix E.
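The exact validity test depends on the original software, but a minimal structural check (our sketch, not the full evaluation) verifies acyclicity and the single-input/single-output convention DVAE assumes:

```python
def is_valid_dag(num_nodes, edges):
    """Structural check only: the graph must be acyclic (verified by repeated
    removal of in-degree-0 nodes), with exactly one node lacking predecessors
    and exactly one lacking successors."""
    indeg = [0] * num_nodes
    outdeg = [0] * num_nodes
    succ = [[] for _ in range(num_nodes)]
    for u, v in edges:
        indeg[v] += 1
        outdeg[u] += 1
        succ[u].append(v)
    deg = indeg[:]
    stack = [i for i in range(num_nodes) if deg[i] == 0]
    seen = 0
    while stack:
        u = stack.pop()
        seen += 1
        for v in succ[u]:
            deg[v] -= 1
            if deg[v] == 0:
                stack.append(v)
    acyclic = seen == num_nodes
    return acyclic and indeg.count(0) == 1 and outdeg.count(0) == 1
```

A cyclic graph or one with multiple sources or sinks fails this check; the real evaluation additionally requires the decoded DAG to load in the software that generated the training data.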
We show the results in Table 1. Among all the models, DVAE and SVAE generally have the highest performance. We find that DVAE, SVAE, and GraphRNN perform similarly well in reconstruction accuracy, prior validity, and novelty. However, DVAE and SVAE have much higher uniqueness, meaning they generate more diverse examples. We find that GCN does not suit neural architectures, reconstructing only 5.42% of unseen inputs, which is explained by its ignoring the computation order. However, GCN does work for Bayesian networks, due to the different computation manner of Bayesian networks (Appendix A).
Table 2: Predictive performance of the latent representations (mean ± standard deviation over 10 runs).

          |     Neural architectures     |       Bayesian networks
Methods   |  RMSE          Pearson's r   |  RMSE          Pearson's r
DVAE      |  0.384±0.002   0.920±0.001   |  0.300±0.004   0.959±0.001
SVAE      |  0.478±0.002   0.873±0.001   |  0.369±0.003   0.933±0.001
GraphRNN  |  0.726±0.002   0.669±0.001   |  0.774±0.007   0.641±0.002
GCN       |  0.832±0.001   0.527±0.001   |  0.421±0.004   0.914±0.001
4.2 Predictive performance of latent representations
An important criterion for evaluating the smoothness of the latent space is the predictive power of a model's latent representations. A latent space that is smooth with respect to DAG performance means nearby latent points have similar performance, which makes predicting a latent point's performance much easier. Following [3], we train a sparse Gaussian process (SGP) regression model [59] with 500 inducing points on the training data's embeddings to predict the performance of the unseen test data from their embeddings.
We use two metrics to evaluate the predictive performance of the latent embeddings (given by the posterior means). One is the RMSE between the SGP predictions and the true performances. The other is the Pearson correlation coefficient (Pearson's r), measuring how well the predictions and the real performances tend to rise and fall together. A small RMSE and a large Pearson's r indicate better predictive performance. Table 2 shows the results. All experiments are repeated 10 times, and the means and standard deviations are reported. More SGP settings are in Appendix F.
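The two metrics can be computed as follows (a plain-Python sketch; standard library routines would normally be used instead):

```python
import math

def rmse(pred, true):
    """Root mean squared error between predictions and true values."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

Both are applied to the SGP's predicted performances versus the ground-truth performances of the test DAGs.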
From Table 2, we find that both the RMSE and Pearson's r of DVAE are significantly better than those of the other models. This verifies DVAE's advantage of encoding the computations defined by DAGs. SVAE follows closely with the second-best performance, while GraphRNN and GCN perform less well in this experiment. The better predictive power of DVAE's latent space means that performing Bayesian optimization in this space is more likely to find high-performance points.
4.3 Bayesian Optimization
We perform Bayesian optimization using the two best models validated by the previous experiments, DVAE and SVAE. Based on the SGP model from the last experiment, we perform 10 iterations of batch Bayesian optimization with a batch size of 50, using the expected improvement (EI) heuristic [60]. Concretely, we start from the training data's embeddings and iteratively propose new points that are most promising to decode into DAGs with better performance. The decoded DAGs' performances are evaluated, and the new data are added back to the training data. Finally, we report the best-performing DAGs found by each model.
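A minimal sketch of the EI criterion used to score candidate points, assuming the SGP supplies a posterior mean and standard deviation per candidate; the scalar interface, the `xi` exploration margin, and the top-k batch rule are illustrative assumptions, not the paper's exact implementation:

```python
import math

def expected_improvement(mu, sigma, best, xi=0.01):
    """EI for maximization, given the GP posterior mean mu and standard
    deviation sigma at a candidate point and the incumbent best value."""
    if sigma <= 0:
        return max(mu - best - xi, 0.0)
    z = (mu - best - xi) / sigma
    phi = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)  # std normal pdf
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2)))         # std normal cdf
    return (mu - best - xi) * Phi + sigma * phi

def select_batch(candidates, mus, sigmas, best, k=50):
    """Pick the k candidates with the highest EI scores as the next batch."""
    scored = sorted(zip(candidates, mus, sigmas),
                    key=lambda c: expected_improvement(c[1], c[2], best),
                    reverse=True)
    return [c[0] for c in scored[:k]]
```

The selected latent points are then decoded into DAGs, evaluated, and appended to the SGP's training data for the next iteration.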
Neural architectures. For searching neural architectures, we select the top 15 found architectures in terms of their weight-sharing accuracies and fully train them on CIFAR-10's training set to get their true test accuracies. More details can be found in Appendix B. We visualize the 5 architectures with the highest true test accuracies in Figure 3. As we can see, DVAE on average found much better neural architectures than SVAE. Among the selected architectures, the best true test accuracy of DVAE, 94.80%, is 2.01% higher than the best true test accuracy of SVAE. Although this does not outperform state-of-the-art NAS techniques, which achieved an error rate of 2.11% on CIFAR-10 [35], our architecture search space is much smaller, and we did not apply any data augmentation such as Cutout [61], nor did we copy multiple folds or add more filters after finding the architecture. Furthermore, our found architecture contains only 3 million parameters, while the state-of-the-art NAONet + Cutout has 128 million. In this paper, we mainly focus on illustrating the idea of training VAE models for DAG optimization, since beating state-of-the-art NAS methods requires substantial engineering and computational resources.
Bayesian networks. We similarly report the top 5 Bayesian networks found by each model, ranked by their BIC scores, in Figure 4. As we can see, DVAE generally found better Bayesian networks than SVAE. The best network discovered achieved a BIC of −11125.75, which is better than the best training example, whose BIC is −11141.89. For reference, the true Bayesian network used to generate the Asia data has a BIC of −11109.74. Although we have not exactly recovered the true network, embedding Bayesian networks into vectors does show a promising direction for Bayesian network structure learning.
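For reference, the BIC of a discrete Bayesian network is the maximum log-likelihood penalized by (log N / 2) times the number of free parameters, so higher (less negative) is better. The function below is a toy implementation under that convention, not bnlearn's actual code:

```python
import math
from collections import Counter

def bic_score(data, parents, arities):
    """BIC of a discrete Bayesian network: max log-likelihood minus
    0.5 * log(N) * (number of free parameters).
    data: list of dicts var -> value; parents: var -> tuple of parent vars;
    arities: var -> number of states."""
    n = len(data)
    loglik, n_params = 0.0, 0
    for v, pa in parents.items():
        # Empirical counts of (parent configuration, child value) pairs.
        joint = Counter((tuple(row[p] for p in pa), row[v]) for row in data)
        marg = Counter(tuple(row[p] for p in pa) for row in data)
        loglik += sum(c * math.log(c / marg[cfg]) for (cfg, _), c in joint.items())
        q = 1
        for p in pa:
            q *= arities[p]
        n_params += q * (arities[v] - 1)  # free parameters of the CPT
    return loglik - 0.5 * math.log(n) * n_params
```

Under this convention the scores reported above are negative, and a higher BIC indicates a better fit to the 5,000 Asia samples.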
4.4 Latent Space Visualization
In this experiment, we visualize the latent space of the VAE models to get a sense of the smoothness of the learned latent space. For every latent point, we decode it 500 times and select the DAG that appears most often for visualization.
For neural architectures, we visualize the decoded architectures from points along a great circle in the latent space. We start from the latent embedding of a flat network comprising only separable_conv_3x3 (orange) layers. Then, imagining this point as a point on the surface of a sphere (think of the Earth), we randomly pick a great circle that starts from this point, travels around the surface, and returns to it. Along this circle, we evenly pick 35 points and visualize their decoded networks in Figure 5. As we can see, both DVAE and SVAE show relatively smooth interpolations, changing only a few node types or edges at a time. However, SVAE's structural changes appear smoother, indicating that it focuses more on modeling the structures. In contrast, DVAE changes structures in a way that may be less visually continuous, because it focuses on smoothness with respect to the computations the DAGs represent.
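The great-circle construction can be sketched as follows; drawing the second (orthogonal) direction at random is our assumption about the unspecified details:

```python
import math
import random

def great_circle_points(z0, n_points=35, rng=random):
    """Points along a random great circle on the sphere of radius |z0|
    that starts (and ends) at z0, sketching the Section 4.4 protocol."""
    r = math.sqrt(sum(c * c for c in z0))
    u = [c / r for c in z0]  # unit vector toward z0
    # Random direction made orthogonal to u via one Gram-Schmidt step.
    w = [rng.gauss(0, 1) for _ in u]
    dot = sum(a * b for a, b in zip(u, w))
    w = [b - dot * a for a, b in zip(u, w)]
    nw = math.sqrt(sum(c * c for c in w))
    w = [c / nw for c in w]
    # Evenly spaced angles around the full circle.
    return [[r * (math.cos(t) * a + math.sin(t) * b) for a, b in zip(u, w)]
            for t in (2 * math.pi * i / n_points for i in range(n_points))]
```

Each returned point has the same norm as z0 and would then be decoded into an architecture for visualization.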
For Bayesian networks, we visualize the latent space by linearly interpolating between the latent vectors of two random test samples. We further mark the BIC score of each Bayesian network above it. Figure 6 shows that both DVAE and SVAE interpolate smoothly in terms of structure. However, the BIC scores of DVAE change more smoothly than those of SVAE, with fewer large jumps.
5 Conclusion
In this paper, we have proposed DVAE, a variational autoencoder for directed acyclic graphs. Inspired by previous work on training VAEs for molecule discovery, we applied DVAE to neural architecture search and Bayesian network structure learning and demonstrated its great potential in these two fields. We believe this approach will be broadly useful for other computation graph optimization tasks, such as circuit design.
References
 Koller and Friedman [2009] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009.
 Gómez-Bombarelli et al. [2018] Rafael Gómez-Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 4(2):268–276, 2018.
 Kusner et al. [2017] Matt J Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In International Conference on Machine Learning, pages 1945–1954, 2017.
 Kusner and Hernández-Lobato [2016] Matt J Kusner and José Miguel Hernández-Lobato. GANs for sequences of discrete elements with the Gumbel-Softmax distribution. arXiv preprint arXiv:1611.04051, 2016.
 Gaunt et al. [2016] Alexander L Gaunt, Marc Brockschmidt, Rishabh Singh, Nate Kushman, Pushmeet Kohli, Jonathan Taylor, and Daniel Tarlow. Terpret: A probabilistic programming language for program induction. arXiv preprint arXiv:1608.04428, 2016.
 Li et al. [2018] Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, and Peter Battaglia. Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324, 2018.
 Dai et al. [2018] Hanjun Dai, Yingtao Tian, Bo Dai, Steven Skiena, and Le Song. Syntax-directed variational autoencoder for structured data. arXiv preprint arXiv:1802.08786, 2018.
 Cho et al. [2014] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
 Simonovsky and Komodakis [2018] Martin Simonovsky and Nikos Komodakis. GraphVAE: Towards generation of small graphs using variational autoencoders. arXiv preprint arXiv:1802.03480, 2018.
 You et al. [2018a] Jiaxuan You, Rex Ying, Xiang Ren, William Hamilton, and Jure Leskovec. GraphRNN: Generating realistic graphs with deep autoregressive models. In International Conference on Machine Learning, pages 5694–5703, 2018a.
 De Cao and Kipf [2018] Nicola De Cao and Thomas Kipf. MolGAN: An implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973, 2018.
 Bojchevski et al. [2018] Aleksandar Bojchevski, Oleksandr Shchur, Daniel Zügner, and Stephan Günnemann. Netgan: Generating graphs via random walks. arXiv preprint arXiv:1803.00816, 2018.
 Ma et al. [2018] Tengfei Ma, Jie Chen, and Cao Xiao. Constrained generation of semantically valid graphs via regularizing variational autoencoders. In Advances in Neural Information Processing Systems, pages 7113–7124, 2018.
 Jin et al. [2018] Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for molecular graph generation. In Proceedings of the 35th International Conference on Machine Learning, pages 2323–2332, 2018.
 Liu et al. [2018a] Qi Liu, Miltiadis Allamanis, Marc Brockschmidt, and Alexander L Gaunt. Constrained graph variational autoencoders for molecule design. arXiv preprint arXiv:1805.09076, 2018a.
 You et al. [2018b] Jiaxuan You, Bowen Liu, Zhitao Ying, Vijay Pande, and Jure Leskovec. Graph convolutional policy network for goaldirected molecular graph generation. In Advances in Neural Information Processing Systems, pages 6412–6422, 2018b.
 Weininger [1988] David Weininger. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1):31–36, 1988.
 Duvenaud et al. [2015] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán AspuruGuzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pages 2224–2232, 2015.
 Li et al. [2015] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.
 Kipf and Welling [2016] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
 Niepert et al. [2016] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In International conference on machine learning, pages 2014–2023, 2016.
 Hamilton et al. [2017] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
 Zhang et al. [2018] Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. An end-to-end deep learning architecture for graph classification. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 Zhang and Chen [2018] Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems, pages 5165–5175, 2018.
 Gilmer et al. [2017] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.
 Kingma and Welling [2013] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Rezende et al. [2014] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
 Zoph and Le [2016] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
 Real et al. [2017] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc Le, and Alex Kurakin. Large-scale evolution of image classifiers. arXiv preprint arXiv:1703.01041, 2017.
 Elsken et al. [2017] Thomas Elsken, Jan-Hendrik Metzen, and Frank Hutter. Simple and efficient architecture search for convolutional neural networks. arXiv preprint arXiv:1711.04528, 2017.
 Zoph et al. [2018] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8697–8710, 2018.
 Liu et al. [2018b] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018b.
 Pham et al. [2018] Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.
 Hutter et al. [2018] Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren, editors. Automatic Machine Learning: Methods, Systems, Challenges. Springer, 2018. In press, available at http://automl.org/book.
 Luo et al. [2018] Renqian Luo, Fei Tian, Tao Qin, En-Hong Chen, and Tie-Yan Liu. Neural architecture optimization. In Advances in Neural Information Processing Systems, 2018.
 Mueller et al. [2017] Jonas Mueller, David Gifford, and Tommi Jaakkola. Sequence to better sequence: continuous revision of combinatorial structures. In International Conference on Machine Learning, pages 2536–2544, 2017.
 Fusi et al. [2018] Nicolo Fusi, Rishit Sheth, and Melih Elibol. Probabilistic matrix factorization for automated machine learning. In Advances in Neural Information Processing Systems, pages 3352–3361, 2018.
 Kandasamy et al. [2018] Kirthevasan Kandasamy, Willie Neiswanger, Jeff Schneider, Barnabas Poczos, and Eric Xing. Neural architecture search with bayesian optimisation and optimal transport. In Advances in Neural Information Processing Systems, 2018.
 Chow and Liu [1968] C Chow and Cong Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968.
 Gao et al. [2017] Tian Gao, Kshitij Fadnis, and Murray Campbell. Local-to-global Bayesian network structure learning. In International Conference on Machine Learning, pages 1193–1202, 2017.
 Gao and Wei [2018] Tian Gao and Dennis Wei. Parallel Bayesian network structure learning. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1685–1694, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/gao18b.html.
 Linzner and Koeppl [2018] Dominik Linzner and Heinz Koeppl. Cluster variational approximations for structure learning of continuous-time Bayesian networks from incomplete data. In Advances in Neural Information Processing Systems, pages 7891–7901, 2018.
 Silander et al. [2018] Tomi Silander, Janne Leppäaho, Elias Jääsaari, and Teemu Roos. Quotient Normalized Maximum Likelihood Criterion for Learning Bayesian Network Structures. In International Conference on Artificial Intelligence and Statistics, pages 948–957, 2018.
 Chickering [1996] David Maxwell Chickering. Learning Bayesian networks is NP-complete. In Learning from Data, pages 121–130. Springer, 1996.
 Singh and Moore [2005] Ajit P. Singh and Andrew W. Moore. Finding optimal bayesian networks by dynamic programming, 2005.
 Yuan et al. [2011] Changhe Yuan, Brandon Malone, and Xiaojian Wu. Learning optimal Bayesian networks using A* search. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Three, IJCAI'11, pages 2186–2191. AAAI Press, 2011.
 Yuan and Malone [2013] Changhe Yuan and Brandon Malone. Learning optimal bayesian networks: A shortest path perspective. Journal of Artificial Intelligence Research, 48(1):23–65, October 2013. ISSN 10769757. URL http://dl.acm.org/citation.cfm?id=2591248.2591250.
 Chickering et al. [1995] Do Chickering, Dan Geiger, and David Heckerman. Learning Bayesian networks: Search methods and experimental results. In Proceedings of Fifth Conference on Artificial Intelligence and Statistics, pages 112–128, 1995.
 Yackley and Lane [2012] Benjamin Yackley and Terran Lane. Smoothness and Structure Learning by Proxy. In International Conference on Machine Learning, 2012.
 Anderson and Lane [2009] Blake Anderson and Terran Lane. Fast Bayesian network structure search using Gaussian processes. 2009. Available at https://www.cs.unm.edu/ treport/tr/0906/paper.pdf.
 Xu et al. [2018] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.
 Hornik et al. [1989] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
 Bowman et al. [2015] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
 Schuster and Paliwal [1997] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.
 Chollet [2017] François Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2017.
 Krizhevsky and Hinton [2009] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 Scutari [2010] Marco Scutari. Learning Bayesian Networks with the bnlearn R Package. Journal of Statistical Software, Articles, 35(3):1–22, 2010. ISSN 15487660. doi: 10.18637/jss.v035.i03. URL https://www.jstatsoft.org/v035/i03.
 Lauritzen and Spiegelhalter [1988] Steffen L Lauritzen and David J Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society. Series B (Methodological), pages 157–224, 1988.
 Snelson and Ghahramani [2006] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257–1264, 2006.
 Jones et al. [1998] Donald R Jones, Matthias Schonlau, and William J Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.
 DeVries and Taylor [2017] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
Appendix A Encoding Bayesian Networks
We make some modifications when encoding Bayesian networks. One modification is that the input state for node v is given by:

h_v^in = Σ_{u→v} g(x_u) ⊙ m(x_u),    (11)

where g is the gating network and m is the mapping function.
Compared to (9), we replace the hidden state h_u with the node type feature x_u. This is due to the differences between computations on a neural architecture and on a Bayesian network. In a neural network, the initial input signal is passed through multiple layers, and a final output signal is returned; to compute the output signal of an intermediate layer, we need to know the output signals of its predecessor layers. However, a Bayesian network's graph represents a set of conditional independences among variables rather than a computational flow. In particular, for structure learning we are often concerned with computing the (log) marginal likelihood score of a dataset given a graph structure, which decomposes into per-variable scores conditioned on each variable's parents (see Definition 18.2 in Koller and Friedman [1]). For example, in Figure 7, the overall score decomposes into a sum of per-node terms, each involving only that node and its parents. To compute the score term for a node, we only need the values of its parents; its grandparents have no influence on it. Based on this intuition, when computing the state of a node, we use its parents' features x_u instead of their hidden states h_u, which "d-separates" the node from its grandparents.
Also based on the decomposability of the score, we make another modification for encoding Bayesian networks: we use the sum of all node states as the final output state instead of only the ending node's state. Similarly, when decoding Bayesian networks, the graph state is the sum of all existing nodes' states.
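The two modifications can be sketched together in a toy scalar form; in the paper the update is a GRU and g, m are learned networks, whereas `gate_w` and `map_w` below are hypothetical scalar stand-ins:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def encode_bayes_net(node_types, edges, update, gate_w, map_w):
    """Sketch of the modified asynchronous encoder for Bayesian networks:
    the incoming state of node v aggregates gated mappings of its parents'
    type features x_u (not their hidden states), and the final graph state
    is the sum of all node states."""
    h = {}
    for v, x_v in enumerate(node_types):  # nodes in topological order
        # Gated sum over parents, applied to type features x_u.
        h_in = sum(sigmoid(gate_w * node_types[u]) * (map_w * node_types[u])
                   for u, w in edges if w == v)
        h[v] = update(x_v, h_in)          # e.g. a GRU cell in the paper
    return sum(h.values())                # graph state: sum of node states
```

With a real model, `update` would be a recurrent cell over vector-valued states; here a scalar stand-in suffices to show the message flow.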
Appendix B More Details about Neural Architecture Search
We use the software of efficient neural architecture search (ENAS) [33] to generate training neural architectures. These ENAS architectures are regarded as the real data that we want our generative models to model. With these seed architectures, we can train a VAE model and then search for new high-performance architectures in the latent space.
ENAS alternately trains a controller, which proposes new architectures, and the shared weights of the proposed architectures. It uses a weight-sharing (WS) scheme to obtain a quick but rough estimate of how good an architecture is: it assumes that an architecture with a high validation accuracy under the shared weights, or weight-sharing accuracy, is more likely to have a high test accuracy after its weights are fully retrained from scratch.
We first run ENAS in the macro space (Section 2.3 of Pham et al. [33]) for 1,000 epochs, with 20 architectures proposed in each epoch. For all the proposed architectures, excluding the first 1,000 burn-in ones, we evaluate their weight-sharing accuracies using the shared weights from the last epoch. We then split the data into a 90% training set and a 10% held-out test set. Our task thus becomes to train a VAE on the training architectures, and then generate new architectures with high weight-sharing accuracies via Bayesian optimization. Note that our target variable here is the weight-sharing accuracy, not the true validation/test accuracy after fully retraining the architecture: the weight-sharing accuracy takes around 0.5 seconds to evaluate, while fully training a network takes over 12 hours. Given our limited computational resources, we choose the weight-sharing accuracy as our target variable in the Bayesian optimization experiments.
One might wonder why we train another generative model when we already have one. There are two reasons. First, ENAS is not general-purpose but task-specific: it leverages validation accuracy signals to train the controller via reinforcement learning. In contrast, DVAE is completely unsupervised; once trained, it can be applied to other neural architecture search tasks without retraining. Second, training a VAE provides a way to embed neural architectures into vectors, which can be used for downstream tasks such as visualization, classification, searching, similarity measurement, etc.
Note that in this paper we only use neural architectures from the ENAS macro space, i.e., each architecture is an end-to-end convolutional neural network rather than a convolutional cell. When we fully train a found architecture, we follow the original ENAS setting to train on CIFAR-10's training data for 310 epochs and report the test accuracy. We leave the generation of convolutional/RNN cells (the ENAS micro space) and of convolutional neural networks with more layers (e.g., 12, 24) to future work.
Due to our constrained computational resources, we choose not to perform Bayesian optimization on the true validation accuracy, which would be a more principled way to search for neural architectures. We describe the procedure here for future exploration: after training the DVAE, we have no architectures with true validation accuracies to initialize a Gaussian process regression. Thus, we need to randomly pick some points in the latent space, decode them into neural architectures, and obtain their true validation accuracies after full training. With these initial points, we then start Bayesian optimization as in Section 4.3, with the target value replaced by the true validation accuracy. Note that the evaluation of each new point found by BO now takes much longer, which could amount to months of GPU time; parallelizing the evaluations is therefore essential.
Appendix C More details about Bayesian network structure learning
We consider a small synthetic problem called Asia [58] as our target Bayesian network structure learning problem. The Asia dataset is composed of 5,000 samples, each generated from a true network with 8 binary variables (see http://www.bnlearn.com/documentation/man/asia.html). The Bayesian information criterion (BIC) score is used to evaluate how well a Bayesian network fits the 5,000 samples. Our task is to train a VAE model on the training Bayesian networks, and to search the latent space for Bayesian networks with high BIC scores using Bayesian optimization. In this task, we consider a simplified case where the topological order of the true network is known, which is a reasonable assumption for many practical applications, e.g., when the variables have a temporal order [1]. The generated training and test Bayesian networks have topological orders consistent with the true network of Asia. The probability of a node having an edge to a previous node (as specified by the order) is set by the package's default option, which results in sparse graphs whose number of edges is on the same order as the number of nodes.
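A sketch of sampling such order-consistent sparse DAGs; the default edge probability 2/(n−1) is our assumption about bnlearn's ordered-graph option, and it gives an expected number of edges equal to n:

```python
import random

def sample_dag(order, p=None, rng=random):
    """Sample a DAG consistent with a fixed topological order: each pair
    (u, v) with u before v gets an edge u -> v independently with
    probability p (assumed default: 2/(n-1), yielding O(n) edges)."""
    n = len(order)
    if p is None:
        p = 2.0 / (n - 1)
    return [(order[i], order[j])
            for i in range(n) for j in range(i + 1, n)
            if rng.random() < p]

# Example: a random DAG over Asia's 8 variables in the fixed order.
edges = sample_dag(list("ASTLBEXD"))
```

Every sampled edge points from an earlier to a later variable in the order, so the result is acyclic by construction.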
For the evaluation metric, although we choose the package's default BIC metric, one could also use the BDe score; the two scores have a 99.96% correlation here. Also note that this Asia problem can be solved by hill climbing or by exact methods; we study it only for demonstration purposes. We leave the more general setting (where the order is unknown) and learning larger networks as future work.
Appendix D Training Details
We train all four models using similar settings to be as fair as possible. Many hyperparameters are inherited from Kusner et al. [3]. Single-layer GRUs are used in all models requiring recurrent units, each with 501 hidden dimensions. The MLPs used to output the mean and variance parameters of the approximate posterior are implemented as single linear layers. We set the dimension of the latent space to 56 for all models.
For the decoder network of DVAE, we let the networks predicting node types and edge connections be two-layer MLPs with ReLU nonlinearities, where the hidden layer sizes are set to two times the input sizes. A softmax activation follows the node-type network, and a sigmoid activation follows the edge network. For the gating network g, we use a single linear layer with sigmoid activation. For the mapping function m, we use a linear mapping without activation.
When optimizing the VAE loss, we use the reconstruction loss plus α times the KLD as the loss function. In the original VAE framework, α is set to 1. However, we found that this led to poor reconstruction accuracies, similar to the findings of previous work [3, 7, 14]. Following the implementation of Jin et al. [14], we set α to a value smaller than 1.
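The weighted objective can be written as a short sketch (standard diagonal-Gaussian KLD against N(0, I); the specific value of α follows Jin et al.'s implementation and is not reproduced here):

```python
import math

def gaussian_kld(mu, logvar):
    """KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior,
    given per-dimension means and log-variances."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, logvar))

def vae_loss(recon_loss, mu, logvar, alpha=1.0):
    """Weighted VAE objective: reconstruction term + alpha * KLD.
    alpha = 1 recovers the standard VAE; alpha < 1 down-weights the KLD."""
    return recon_loss + alpha * gaussian_kld(mu, logvar)
```

Down-weighting the KLD trades some alignment with the prior for better reconstruction accuracy.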
For DVAE on neural architectures, we use the bidirectional encoding discussed in Section 3.4. For Bayesian networks and for the other models, we find bidirectional encoding to be less useful, sometimes even hurting performance, so we use only unidirectional encoding. All DAGs are fed to the models according to their nodes' topological orderings. Since there is always a path connecting all nodes in the neural architectures generated by ENAS, their topological orderings are unique. For Bayesian networks, we feed nodes in the fixed order "ASTLBEXD" (which is always a topological order).
Mini-batch training with the Adam optimizer is used for all models. For neural architectures, we use a batch size of 32 and train for 300 epochs. For Bayesian networks, we use a batch size of 128 and train for 100 epochs. An initial learning rate of 1E-4 is used, and we multiply the learning rate by 0.1 if the training loss does not decrease for 10 epochs. We use PyTorch to implement all the models.
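The decay-on-plateau schedule can be replayed as a small sketch (equivalent in spirit to PyTorch's ReduceLROnPlateau; the exact tie-breaking details are an assumption):

```python
def plateau_decay(losses, lr0=1e-4, factor=0.1, patience=10):
    """Replay the Appendix D schedule: start at lr0 and multiply the
    learning rate by `factor` whenever the training loss has not
    decreased for `patience` consecutive epochs. Returns the learning
    rate in effect after each epoch."""
    lr, best, bad = lr0, float("inf"), 0
    history = []
    for loss in losses:
        if loss < best:
            best, bad = loss, 0   # new best: reset the patience counter
        else:
            bad += 1
            if bad >= patience:
                lr *= factor      # plateau detected: decay the rate
                bad = 0
        history.append(lr)
    return history
```

With patience 10 and factor 0.1, a stagnant training loss triggers a tenfold reduction of the learning rate.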
Appendix E More about Validity Experiments
Since different models can have different levels of convergence w.r.t. the KLD loss in (1), their posterior distributions may align with the prior distribution to different degrees. If we evaluated prior validity by sampling z from the prior for all models, we would favor models with a higher level of KLD convergence. To remove such effects and focus purely on a model's intrinsic ability to generate valid DAGs, for each model we apply z' = μ + σ ⊙ z, where μ and σ are the per-dimension mean and standard deviation of the training data's embeddings, so that the sampled latent vectors are scaled and shifted to the center of the training data's embeddings. Without this adjustment, we find that the prior validity numbers can easily be manipulated by training for more or fewer epochs, or by putting more or less weight on the KLD loss.
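The scale-and-shift adjustment can be sketched with per-dimension empirical moments of the training embeddings (`recenter` is a hypothetical helper, not the authors' code):

```python
def recenter(z_samples, train_embeddings):
    """Scale and shift prior samples z ~ N(0, I) by the training
    embeddings' per-dimension mean and standard deviation:
    z' = mu + sigma * z (elementwise)."""
    n = len(train_embeddings)
    d = len(train_embeddings[0])
    mu = [sum(e[j] for e in train_embeddings) / n for j in range(d)]
    sd = [(sum((e[j] - mu[j]) ** 2 for e in train_embeddings) / n) ** 0.5
          for j in range(d)]
    return [[mu[j] + sd[j] * z[j] for j in range(d)] for z in z_samples]
```

Each adjusted sample then lies in the region actually occupied by the model's training embeddings before being decoded.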
For a generated neural architecture to be read by ENAS, it has to pass the following validity checks: 1) it has one and only one starting type (the input node); 2) it has one and only one ending type (the output node); 3) other than the input node, there is no node without any predecessor (an isolated path); 4) other than the output node, there is no node without any successor (a blocked path); 5) each node has a directed edge from the node immediately before it (an ENAS constraint), i.e., there is always a main path connecting all the nodes; and 6) it is a DAG (no directed cycles).
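Checks 3) to 6) depend only on the graph structure and can be sketched as below; checks 1) and 2) concern node-type labels and are omitted, and numbering the input node 0 and the output node n−1 is our assumption:

```python
def is_valid_enas_dag(n_nodes, edges, start=0, end=None):
    """Structural validity checks 3-6 of Appendix E for a generated
    architecture on nodes 0..n_nodes-1 with input `start`, output `end`."""
    end = n_nodes - 1 if end is None else end
    preds = {v: set() for v in range(n_nodes)}
    succs = {v: set() for v in range(n_nodes)}
    for u, v in edges:
        preds[v].add(u)
        succs[u].add(v)
    # 3) every non-input node has a predecessor.
    if any(not preds[v] for v in range(n_nodes) if v != start):
        return False
    # 4) every non-output node has a successor.
    if any(not succs[v] for v in range(n_nodes) if v != end):
        return False
    # 5) main path: each node is fed by the node immediately before it.
    if any((v - 1) not in preds[v] for v in range(1, n_nodes)):
        return False
    # 6) acyclicity via Kahn's algorithm.
    indeg = {v: len(preds[v]) for v in range(n_nodes)}
    queue = [v for v in indeg if indeg[v] == 0]
    seen = 0
    while queue:
        u = queue.pop()
        seen += 1
        for v in succs[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return seen == n_nodes
```

A decoded graph failing any check is counted as invalid when computing prior validity.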
For a generated Bayesian network to be read by bnlearn and evaluated on the Asia dataset, it has to pass the following validity checks: 1) it has exactly 8 nodes; 2) each type in "ASTLBEXD" appears exactly once; and 3) it is a DAG.
Note that the training graphs generated by the original software all satisfy these validity constraints already.
Appendix F More Details about SGP
We use a sparse Gaussian process (SGP) as the predictive model in BO, with the open-source SGP implementation from [3]. Both the training and test data are standardized according to the mean and standard deviation of the training data before being fed to the SGP, and the predictive performance is also calculated on the standardized data. We train the SGP with the default Adam optimizer for 100 epochs, with a mini-batch size of 1,000 and a constant learning rate of 5E-4.
For neural architectures, we use all the training data to train the SGP. For Bayesian networks, we randomly sample 5,000 training examples each time, for two reasons: 1) using all 180,000 examples to train the SGP is unrealistic for a typical scenario in which the network/dataset is large and evaluating a score is expensive; and 2) we found that using a smaller sample of the training data even results in more stable BO performance, due to the lower probability of duplicate rows, which can produce ill-conditioned matrices. Note also that when training the variational autoencoders, all the training data are used, since that training is purely unsupervised.
Appendix G More Experimental Results
G.1 Bayesian optimization vs. random search
To validate that Bayesian optimization (BO) in the latent space does provide guidance toward better DAGs, we compare with a random search baseline that samples points uniformly from the latent space of DVAE. Figure 9 shows the results: BO consistently finds better DAGs than random search.