Designing Random Graph Models Using Variational Autoencoders With Applications to Chemical Design
Abstract
Deep generative models have been praised for their ability to learn smooth latent representations of images, text, and audio, which can then be used to generate new, plausible data. However, current generative models are unable to work with graphs due to their unique characteristics: their underlying structure is not Euclidean or grid-like, they remain isomorphic under permutation of the node labels, and they come with different numbers of nodes and edges. In this paper, we propose a variational autoencoder for graphs, whose encoder and decoder are specially designed to account for the above properties by means of several technical innovations. Moreover, the decoder is able to guarantee a set of local structural and functional properties in the generated graphs. Experiments reveal that our model is able to learn and mimic the generative process of several well-known random graph models and can be used to create new molecules more effectively than several state-of-the-art methods.
1 Introduction
Graphs emerge as fundamental data structures in a wide range of social, biological and chemical networked systems. In each of these systems, the nodes and edges of the corresponding graphs have different, distinctive meanings—they represent different types of entities and relationships. For example, in social networks, each node represents a person and there is an edge between two nodes if they are friends. In protein interaction networks, nodes and edges represent proteins and physical interactions between proteins. In chemical networks, nodes are atoms and there is an edge between two nodes if there is a chemical bond between the corresponding atoms. In all these cases, the underlying generative processes that determine the absence or presence of nodes and edges are highly complex, domain dependent, and typically stochastic.
In recent years, there has been an increasing interest in deep generative models in machine learning, fueled by the success of generative adversarial networks (GANs) [14, 6] and variational autoencoders (VAEs) [23, 37]. These models are trained using unlabeled data and, once trained, they are able to generate new plausible data. However, such models and their variations have been mostly designed for image, audio and text generation, and deep generative models for graphs have been largely lacking until very recently [25, 1, 2, 31]. (Our work is contemporary to these closely related works, which have only been published on preprint servers or OpenReview, with the exception of the work by Li et al. [31], which had just been accepted at ICLR 2018 at the time of writing.) In this work, our goal is to develop deep generative models that, once trained on a collection of graphs, are capable of creating plausible new graphs, i.e., models that are able to learn and mimic the underlying processes that determine the absence or presence of nodes and edges in the graphs they are trained on. Such generative models can be useful in a wide range of high-impact applications, e.g., automatic chemical design [13], and they are a fundamental departure from traditional random graph models, such as Barabási-Albert [7], stochastic block models [21], Kronecker graph models [28] or exponential random graph models (ERGMs) [18], which make strong (parametric) assumptions about the underlying generative process of the graphs they model and, as a consequence, have limited expressive power.
The design of deep generative models for graphs faces several technical challenges, which we need to address:

Discrete structure: graphs are mathematical objects whose underlying structure is not Euclidean or grid-like. This implies that there is no common system of coordinates, vector space structure, or shift invariance. As a consequence, some of the standard techniques from deep learning models, used by VAEs and GANs, are insufficient.

Permutation invariance: graphs remain isomorphic under permutation of the node labels. Therefore, it seems a natural requirement for a deep generative model for graphs to be invariant to the permutation of the node labels.

Variable size: graphs often come with different numbers of nodes and edges, even if they come from the same domain or are generated from the same underlying generative process. Thus, it would be desirable to design deep generative models able to generate (and be trained using) graphs with different numbers of nodes and edges.
In this paper, we introduce a variational autoencoder (VAE) for graphs, which addresses the above-mentioned challenges. (For ease of exposition, we focus on undirected graphs; however, our model can be easily extended to directed graphs by considering in- and out-neighborhoods.) At the heart of our approach, there are several technical innovations:

Our probabilistic decoder jointly represents all edges as an unnormalized log probability vector (or ‘logit’), which then feeds a single edge distribution, and this allows for efficient inference and decoding.

Our probabilistic decoder is able to guarantee a set of local structural and functional properties in the generated graphs by using a mask in the edge distribution definition, which can prevent the generation of certain undesirable edges during the decoding process.
We evaluate our model using both synthetic and real-world data. In a first set of experiments, we show that our model is able to generate graphs with a predefined local topological property, i.e., graphs without triangles. Then, we show that it can learn and mimic the underlying processes that determine the absence or presence of nodes and edges in two well-known random graph models, Barabási-Albert [7] and Kronecker graphs [28]. Remarkably, the continuous latent representations that our model finds can be used to smoothly interpolate between model parameter values in these well-known random graph models. In a second set of experiments, we train our variational autoencoder using real data from two publicly available molecular datasets, ZINC [20] and QM9 [38, 36], and show that the trained autoencoders are able to find more valid and novel molecules than the state of the art [26, 13, 31]. Moreover, the resulting latent space representations of molecules exhibit powerful semantics: we can smoothly interpolate between molecules and, since each node in a molecule has a latent representation, we can make fine-grained node-level interpolations.
2 Background on Variational Autoencoders
In this section, we briefly revisit the framework of variational autoencoders (VAEs) [23, 37], which is at the heart of our deep generative model for graphs.
Variational autoencoders are characterized by a probabilistic generative model $p_\theta(x \mid z)$ of the observed variables $x$ given the latent variables $z$, a prior distribution $p(z)$ over the latent variables, and an approximate probabilistic inference model $q_\phi(z \mid x)$. In this characterization, $p_\theta$ and $q_\phi$ are arbitrary distributions parametrized by two (deep) neural networks $\theta$ and $\phi$, and one can think of the generative model as a probabilistic decoder, which decodes latent variables into observed variables, and the inference model as a probabilistic encoder, which encodes observed variables into latent variables.
Ideally, if we use the maximum likelihood principle to train a variational autoencoder, we should optimize the marginal log-likelihood of the observed data, i.e., $\mathbb{E}_{p_D(x)}[\log p_\theta(x)]$, where $p_D$ is the data distribution. Unfortunately, computing $\log p_\theta(x)$ requires marginalization with respect to the latent variable $z$, which is typically intractable. Therefore, one resorts to maximizing a variational lower bound or evidence lower bound (ELBO) of the log-likelihood of the observed data, i.e.,
$$\max_{\phi} \max_{\theta} \; \mathbb{E}_{p_D(x)} \left[ \mathbb{E}_{q_\phi(z \mid x)} \log p_\theta(x \mid z) - \mathrm{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right) \right].$$
Finally, note that the quality of this variational lower bound (and thus the quality of the resulting VAE) depends on the expressive ability of the approximate inference model $q_\phi(z \mid x)$, which is typically assumed to be a normal distribution whose mean and variance are parametrized by a (deep) neural network with the observed data as an input.
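To make the bound concrete, the following minimal sketch estimates the ELBO for a single observation under the usual Gaussian assumptions: a diagonal-Gaussian inference model, a standard normal prior, and a Monte Carlo estimate of the reconstruction term. The names `encode`, `log_likelihood` and `elbo_estimate`, as well as the toy models at the end, are illustrative assumptions and not part of the implementation described in this paper.

```python
import math
import random


def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def elbo_estimate(x, encode, log_likelihood, rng, n_samples=8):
    """Monte Carlo estimate of E_q[log p(x|z)] - KL(q(z|x) || p(z))."""
    mu, log_var = encode(x)  # hypothetical encoder: returns mean and log-variance
    recon = sum(
        log_likelihood(x, reparameterize(mu, log_var, rng)) for _ in range(n_samples)
    ) / n_samples
    return recon - gaussian_kl(mu, log_var)


# Toy usage (assumed models): 1-D data, encoder centered at x, unit-variance decoder.
rng = random.Random(0)
encode = lambda x: ([x], [0.0])
log_likelihood = lambda x, z: -0.5 * (x - z[0]) ** 2 - 0.5 * math.log(2 * math.pi)
value = elbo_estimate(1.5, encode, log_likelihood, rng)
```

The KL term is available in closed form only because both the inference model and the prior are Gaussian; the reconstruction term still has to be estimated by sampling.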
3 A Variational Autoencoder for Graphs
In this section, we first give a high-level overview of the design of our variational autoencoder for graphs, starting from the data it is designed for. Then, we elaborate on the key technical aspects of its individual components. Finally, we conclude by further describing the inference procedure and implementation details.
High-level overview. We observe a collection of $N$ graphs $\{G_i = (V_i, E_i)\}_{i \in [N]}$, where $V_i$ and $E_i$ denote the corresponding sets of vertices and edges, respectively, and this collection may contain graphs with different numbers of nodes and edges. Moreover, for each graph $G_i$, we also observe a set of node features $F_i$ and edge weights $Y_i$. (For ease of exposition, we assume the edge weights take discrete values; however, our model could be extended to continuous values.) Our goal is then to design a variational autoencoder for graphs that, once trained on this collection of graphs, is able to create plausible new graphs. In doing so, it will also provide a latent representation of any graph in the collection (or elsewhere) with (hopefully) meaningful semantics.
Following the above background on variational autoencoders, we can characterize our variational autoencoder by means of:
— Prior:  
— Inference model (encoder):  
— Generative model (decoder): 
In the above characterization, note that: (i) we define one latent variable per node, i.e., we have a node-based latent representation; (ii) the numbers of nodes and edges are random variables and, as a consequence, both the latent representation and the graph can vary in size; (iii) the generative model does not characterize the node features since our focus is on generating the edges and edge weights of the graph; however, it could be easily augmented to also model node features.
Next, we formally define the functional form of the inference model, the generative model, and the prior.
Inference model (encoder). Given a graph with node features and edge weights , our inference model defines a probabilistic encoding for each node in the graph by aggregating information from different distances. More formally, for each node , the inference model is defined as follows:
(1) 
where is the latent variable associated to node , , and aggregates information from hops away from , i.e.,
Here, are weight matrices, which propagate information between different search depths, and are (possibly nonlinear) differentiable functions, is a neural network, and denotes pairwise product. Note that the above representation for aggregating node feature generalizes the methods from [27, 16]. Figure 1 summarizes our encoder architecture.
Note that the above encoder has several desirable properties: (i) for each node, the corresponding embedding is given by a function of a symmetric aggregation function (i.e., a sum) and thus it is invariant to permutations of the node labels of its neighbors; (ii) the weight matrices to be learned do not depend on the number of nodes and edges; and, (iii) given a (trained) inference model and the latent distribution for the nodes of a given graph, if the graph changes (e.g., nodes or edges are added or removed), the latent distribution for the nodes can be efficiently updated.
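Property (i) can be checked numerically. The following is a simplified sketch and not our exact encoder: it replaces the weight matrices with per-depth weight vectors (a diagonal stand-in), uses a ReLU nonlinearity, and concatenates the depth-wise summaries instead of feeding them to a neural network. All function names, variable names and toy values are hypothetical.

```python
def k_hop_embeddings(adj, features, hop_weights):
    """Depth-wise node summaries built from sums over neighbors.

    adj: dict node -> {neighbor: edge_weight} (undirected, stored both ways)
    features: dict node -> list of floats
    hop_weights: list of K weight vectors, one per search depth (assumed diagonal)
    """
    relu = lambda vec: [max(0.0, a) for a in vec]
    # Depth 1 depends only on the node's own features.
    hops = [{u: relu([w * f for w, f in zip(hop_weights[0], features[u])]) for u in adj}]
    for w_k in hop_weights[1:]:
        prev, cur = hops[-1], {}
        for u in adj:
            dim = len(features[u])
            # Sum over neighbors: the result does not depend on neighbor order or labels.
            agg = [sum(y * prev[v][i] for v, y in adj[u].items()) for i in range(dim)]
            own = [w * f for w, f in zip(w_k, features[u])]
            cur[u] = relu([o * a for o, a in zip(own, agg)])  # pairwise product
        hops.append(cur)
    # Concatenate the K summaries of each node.
    return {u: [a for h in hops for a in h[u]] for u in adj}


def relabel(adj, features, perm):
    """Apply a node relabeling perm: old label -> new label."""
    new_adj = {perm[u]: {perm[v]: y for v, y in nbrs.items()} for u, nbrs in adj.items()}
    new_features = {perm[u]: f for u, f in features.items()}
    return new_adj, new_features


adj = {0: {1: 1.0}, 1: {0: 1.0, 2: 2.0}, 2: {1: 2.0}}
features = {0: [1.0, 2.0], 1: [0.5, 1.0], 2: [2.0, 0.0]}
hop_weights = [[1.0, 1.0], [0.5, 2.0]]
perm = {0: 2, 1: 0, 2: 1}

emb = k_hop_embeddings(adj, features, hop_weights)
emb_perm = k_hop_embeddings(*relabel(adj, features, perm), hop_weights)
```

After relabeling, each node keeps the same embedding under its new label, and the number of parameters in `hop_weights` is independent of the graph size.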
Table 1: Details of the implementation. Columns: Layer, Architecture, Inputs, Output. Rows: Input, Encoder, Decoder (Softplus).
Generative model (decoder). Given a set of nodes with latent variables , our generative model is defined as follows:
(2) 
where and denote the th edge and edge weight respectively, and denote the previously generated edges and edge weights respectively. Then, the model characterizes the conditional probabilities on the right hand side of the above equation as follows. First, it jointly represents all potential edges as an unnormalized log probability vector (or ‘logit’) and, for each edge, all potential edge weights as another logit. Then, it feeds the former vector into a single edge distribution and the latter vectors each into an edge weight distribution. These distributions depend on a set of binary masks, which get updated every time a new edge and edge weight are sampled and prevent the generation of certain undesirable edges and edges features, allowing for the generated graph to fulfill a set of predefined local structural and functional properties. For example, in molecule design, mask facilitates the generation of molecules with a valid structure, as shown in Section 5.
More formally, the distribution for the th edge and edge weight is given by:
(3)  
(4) 
where and are the sets of previously generated edges and edge weights respectively, is the binary mask for edge and is the binary mask for feature edge value , and and are neural networks. Figure 2 summarizes our decoder architecture.
We would like to further clarify several design choices in the above generative model: (i) the dependency of the binary masks and on the previously generated edges and edge weights is deterministic and domain dependent; and, (ii) by first modeling the number of edges in the network, the model only needs to consider the presence of edges, not their absence, increasing the model scalability.
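The interplay between the single edge distribution and the binary mask can be sketched as follows. This is a minimal illustration under stated assumptions, not our decoder: the logits come from a hypothetical function `logit_fn` rather than from a neural network over the latent variables, edge weights are omitted, and the mask encodes the triangle-free property used in Section 4.

```python
import math
import random
from itertools import combinations


def masked_softmax(logits, mask):
    """Softmax over entries with mask == 1; masked-out entries get probability 0."""
    kept = [l for l, m in zip(logits, mask) if m]
    if not kept:
        return [0.0] * len(logits)
    top = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(l - top) if m else 0.0 for l, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def sample_triangle_free(n_nodes, n_edges, logit_fn, rng):
    """Sequentially sample up to n_edges edges; the mask is recomputed after each draw."""
    candidates = list(combinations(range(n_nodes), 2))
    logits = [logit_fn(u, v) for u, v in candidates]
    neighbors = {u: set() for u in range(n_nodes)}
    edges = []
    for _ in range(n_edges):
        # Deterministic, domain-specific mask: forbid repeated edges and edges
        # whose endpoints already share a neighbor (these would close a triangle).
        mask = [
            0 if (v in neighbors[u] or neighbors[u] & neighbors[v]) else 1
            for u, v in candidates
        ]
        allowed = [i for i, m in enumerate(mask) if m]
        if not allowed:
            break  # no admissible edge is left
        probs = masked_softmax(logits, mask)
        idx = rng.choices(allowed, weights=[probs[i] for i in allowed], k=1)[0]
        u, v = candidates[idx]
        neighbors[u].add(v)
        neighbors[v].add(u)
        edges.append((u, v))
    return edges


rng = random.Random(0)
# Hypothetical logits; in the model they would depend on the latent variables.
graph = sample_triangle_free(6, 7, lambda u, v: -0.1 * abs(u - v), rng)
```

Because the mask is a deterministic function of the previously sampled edges, every graph produced this way is triangle-free by construction, regardless of the logits.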
Prior. Given a set of nodes with latent variables , we consider a Gaussian prior .
Inference and implementation details. Given a collection of graphs , each with nodes, edges, a set of node features and set of edge weights , we train our variational autoencoder for graphs by maximizing the evidence lower bound (ELBO), as described in Section 2, plus the loglikelihood of two Poisson distributions and modeling the number of nodes and edges in each graph, i.e.,
(5) 
More specifically, we implemented our model using TensorFlow [3] and used Adam [22] to learn the parameters. Moreover, since the per-edge partition function in the likelihood of the generative model defined by Eq. 3, i.e.,
is expensive to compute for large networks, we approximate it using negative sampling [32]. Table 1 provides additional details for our implementation, where it is important to notice that the parameters to be learned do not depend on the size of the graphs (i.e., number of nodes and edges).
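The two Poisson terms added to the ELBO are simple to evaluate. The sketch below assumes, for illustration only, that the two rates are fitted by maximum likelihood (the sample mean) and that the graph sizes are the toy values shown; the function names are hypothetical.

```python
import math


def poisson_log_pmf(k, rate):
    """log P(K = k) for K ~ Poisson(rate)."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)


def fit_rate(counts):
    """Maximum likelihood estimate of a Poisson rate: the sample mean."""
    return sum(counts) / len(counts)


def size_log_likelihood(sizes, node_rate, edge_rate):
    """Sum of log-likelihoods of (n_nodes, n_edges) pairs under two independent Poissons."""
    return sum(
        poisson_log_pmf(n, node_rate) + poisson_log_pmf(m, edge_rate) for n, m in sizes
    )


sizes = [(10, 14), (12, 18), (11, 13)]  # toy (nodes, edges) counts, not real data
node_rate = fit_rate([n for n, _ in sizes])
edge_rate = fit_rate([m for _, m in sizes])
total = size_log_likelihood(sizes, node_rate, edge_rate)
```

At generation time, the fitted rates can be used to draw the number of nodes and edges before sampling latent variables and edges.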
4 Experiments on Synthetic Graphs
In this section, we show that our model is able to generate graphs with a predefined local topological property, i.e., graphs without triangles, as well as learn and mimic the generative processes that determine the absence or presence of nodes and edges in two well-known random graph models, Barabási-Albert [7] and Kronecker graphs [28]. In doing so, we also investigate the behavior of our model with respect to the search depths used in the encoder, and show that the continuous latent representations that our model finds can be used to smoothly interpolate between model parameter values in the above random graph models.
Experimental setup. We first generate three sets of synthetic networks, each containing graphs, with an average number of nodes . The first set contains triangle-free graphs, the second set contains Barabási-Albert graphs with generation parameter , and the third set contains a 50%-50% mixture of Kronecker graphs with initiator matrices: , and .
For each dataset, we train our variational autoencoder for graphs by maximizing the corresponding evidence lower bound (ELBO), given by Eq. 5, as shown in Figure 3. Then, we use the trained models to generate three sets of graphs by sampling from the decoders, i.e., , where . Finally, we evaluate the quality of the generated graphs using the following evaluation metrics:
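Table 2: Validity of the generated graphs for the model trained on triangle-free graphs, with and without masking during training and generation.
Train \ Generation | No mask | Mask
No mask | 0.57 | –
Mask | 0.68 | 1.00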

Validity: we use this metric to test to what extent the model we trained using triangle-free graphs generates triangle-free graphs. We define validity as follows:
where and is the set of graphs generated using the trained model. Here, note that a definition of validity for Barabási-Albert and Kronecker graphs is not readily available and thus we resort to other evaluation metrics.

Rank correlation: we use this metric to test to what extent the models we trained using Barabási-Albert and Kronecker graphs generate plausible Barabási-Albert and Kronecker graphs, respectively. Intuitively, if the trained models generate plausible graphs, we expect that a graph with a very high value of likelihood under the true model, , should also have a high value of likelihood, , and ELBO under our trained model. For a set of graphs, we verify this expectation by computing the rank correlation between lists of graphs as follows. First, for each set of generated graphs , we order them in decreasing order of and keep the top in a ranked list, which we denote as (we discard the remaining graphs since their likelihood is very similar). Then, we take the graphs in and create two ranked lists, one in decreasing order of , which we denote as , and another one in decreasing order of ELBO, which we denote as . Finally, we compute two Spearman’s rank correlation coefficients between these lists:
where .

Precision: we use this metric, which we compute as follows, as an alternative to the rank correlation above for Barabási-Albert and Kronecker graphs. For each set of generated graphs , we also order them in decreasing order of and create an ordered list , and select as the top % and as the bottom % of . Then, we rerank this list in decreasing order of and ELBO to create two new ordered lists, and . Here, if the trained models generate plausible graphs, we expect that each of the top and bottom halves of and should have a high overlap with and , respectively. Then, we define top and bottom precision as:
where and () is the top (bottom) half of either or .
Table 3: Rank correlation and top and bottom precision for Barabási-Albert and Kronecker graphs.
Here, note that the higher the values of rank correlation and (top and bottom) precision, the more accurately the trained models mimic the generative processes for Barabási-Albert and Kronecker graphs.
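The three metrics above can be sketched in a few lines. The code below is one possible reading of the definitions, under the following assumptions: graphs are given as a node count plus an edge list, scores under the true and trained models are given as plain lists with no ties, and the precision is computed over the union of the top and bottom fractions (with the fraction at most one half). All function names are hypothetical.

```python
def is_triangle_free(n_nodes, edges):
    neighbors = {u: set() for u in range(n_nodes)}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    # A triangle exists exactly when the endpoints of some edge share a neighbor.
    return not any(neighbors[u] & neighbors[v] for u, v in edges)


def validity(graphs):
    """Fraction of generated graphs, given as (n_nodes, edges), that contain no triangle."""
    return sum(is_triangle_free(n, e) for n, e in graphs) / len(graphs)


def spearman(scores_a, scores_b):
    """Spearman's rho between two score lists over the same items, assuming no ties."""
    def ranks(scores):
        order = sorted(range(len(scores)), key=lambda i: -scores[i])
        r = [0] * len(scores)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(scores_a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def top_bottom_precision(true_scores, model_scores, fraction):
    """Overlap between the extremes under the true model and under the trained model."""
    n = len(true_scores)
    k = max(1, int(fraction * n))
    by_true = sorted(range(n), key=lambda i: -true_scores[i])
    top, bottom = set(by_true[:k]), set(by_true[-k:])
    # Re-rank the selected graphs by the trained model's score and split in halves.
    reranked = sorted(top | bottom, key=lambda i: -model_scores[i])
    top_half, bottom_half = set(reranked[:k]), set(reranked[k:])
    return len(top & top_half) / k, len(bottom & bottom_half) / k
```

For example, a 4-cycle counts as valid and a 3-cycle does not, and two score lists that induce the same ordering give a rank correlation of 1 and a top and bottom precision of 1.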
Quality of the generated graphs. In terms of triangle-free graphs, we experiment both with and without masking during training and during test time. Table 2 summarizes the results, which show that, if we train and test our model with masking, it achieves a validity of 100%, i.e., it always generates triangle-free graphs. Moreover, even in the absence of masking during test time, our model is able to achieve a validity of 68% if masking is used during training.
In terms of Barabási-Albert and Kronecker graphs, Table 3 summarizes the results, which show that our model is able to learn the generative process of Barabási-Albert graphs more accurately than that of Kronecker graphs. This may be due to the higher complexity of the generative process that Kronecker graphs use. That being said, it is remarkable that our model is able to achieve correlation and precision values over in both cases.
Next, we investigate the behavior of our model with respect to the search depths used in the encoder. Figure 7 summarizes the results, which show that, for Barabási-Albert graphs, our model performs consistently well for low values of ; however, for Kronecker graphs, the performance is better for high values of . A plausible explanation for this is that Barabási-Albert networks are generated sequentially using only local topological features (only node degrees), whereas the generation process of Kronecker graphs incorporates global topological features.
Generalization ability. In this section, we show that our encoder, once trained, creates a latent space representation with useful semantics. In particular, we can smoothly interpolate between Kronecker graphs, as if they were generated by true Kronecker random graph models with different parameters. More specifically, we proceed as follows.
First, we select two graphs ( and ) from the training set, one generated using an initiator matrix and the other using . Then, we sample the latent representations and for and , respectively, and sample new graphs from latent values in between these latent representations (using a linear interpolation), i.e., , where and , and the node labels, which define the matching between pairs of nodes in both graphs, are arbitrary. Figure 8 provides an example, which shows that, remarkably, as moves towards (), the sampled graph becomes similar to that of () and the inferred initiator matrices along the way smoothly interpolate between both initiator matrices. Here, we infer the initiator matrices of the graphs generated by our trained decoder using the method by Leskovec et al. [28].
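The linear interpolation between two sets of node latents can be sketched as below. The direction of the interpolation (coefficient 0 gives the first graph, 1 gives the second) and the toy latent values are assumptions made for illustration; the function name is hypothetical.

```python
def interpolate_latents(z0, z1, a):
    """Node-wise convex combination (1 - a) * z0 + a * z1 of two latent sets.

    z0, z1: lists of per-node latent vectors, matched by (arbitrary) node label.
    """
    if len(z0) != len(z1):
        raise ValueError("both graphs need the same number of nodes")
    return [
        [(1.0 - a) * p + a * q for p, q in zip(zu0, zu1)]
        for zu0, zu1 in zip(z0, z1)
    ]


z0 = [[0.0, 1.0], [2.0, 2.0]]   # toy latents for the first graph
z1 = [[1.0, 3.0], [0.0, -2.0]]  # toy latents for the second graph
path = [interpolate_latents(z0, z1, step / 4) for step in range(5)]
```

Each element of `path` would then be passed to the trained decoder to sample one intermediate graph.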
5 Experiments on Real Graphs
In this section, we utilize our variational autoencoder for chemical design. More specifically, we train our variational autoencoder using molecules, i.e., molecular graphs, from two publicly available molecular datasets, ZINC [20] and QM9 [38, 36], and show that the trained autoencoder is able to find a smooth latent representation of molecules, which allows us to generate valid and novel molecules more effectively than several state of the art methods [13, 26, 2].
Experimental setup. We sample drug-like commercially available molecules from the ZINC dataset with atoms and molecules from the QM9 dataset with atoms. For each molecule, we construct a molecular graph, which is a weighted undirected graph, where nodes are the atoms, the node features are the valences of the atoms, edges are the bonds between two atoms, and the weight associated to an edge is the type of bond (single, double or triple); we have not selected any molecule whose bond types are other than these three. Then, for each dataset, we train our variational autoencoder for graphs using batches comprising molecules with the same number of nodes. (We batch graphs with respect to the number of nodes for efficiency reasons since, every time the number of nodes changes, we need to change the size of the computational graph in TensorFlow.) Finally, we sample molecular graphs from each of the (two) trained variational autoencoders using: (i) , where and (ii) , where is a molecular graph from the corresponding (training) dataset. Here, from each sampled molecular graph, we build the final molecule by assigning the node labels to one of the atoms {C, H, N, O} based on their valency or degree. In the above procedure, we only use masking on the weight (i.e., type of bond) distributions both during training and sampling to ensure that the valence of the nodes at both ends is valid at all times, i.e., , where is the current valence of node and is the maximum valence per node. Moreover, during sampling, if there is no valid weight value for a sampled edge, we reject it.
We compare the quality of the molecules generated by our VAE for graphs and the molecules generated by several state-of-the-art competing methods: (i) GraphVAE [2], (ii) GrammarVAE [26], and (iii) ChemicalVAE [13]. The last two methods utilize chemical SMILES while GraphVAE uses molecular graphs. To this end, we use the following evaluation metrics:
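The valence mask on the bond type distribution can be sketched as follows. This is an illustration under stated assumptions: bond orders 1, 2 and 3 stand for single, double and triple bonds, valences are tracked in plain dictionaries, and the function names and toy values are hypothetical.

```python
def bond_type_mask(valence, max_valence, u, v, bond_orders=(1, 2, 3)):
    """Binary mask over bond orders for edge (u, v).

    An order is allowed only if neither endpoint would exceed its maximum valence.
    """
    room = min(max_valence[u] - valence[u], max_valence[v] - valence[v])
    return [1 if order <= room else 0 for order in bond_orders]


def add_bond(valence, max_valence, u, v, order):
    """Add a bond if the mask allows it; return False to signal a rejected edge."""
    mask = bond_type_mask(valence, max_valence, u, v)
    if not any(mask):
        return False  # no valid weight for this edge: reject it
    if mask[order - 1] == 0:
        raise ValueError("bond order is masked out")
    valence[u] += order
    valence[v] += order
    return True


max_valence = {0: 4, 1: 2, 2: 1}  # toy maximum valences per node
valence = {0: 0, 1: 0, 2: 0}      # current valences, updated as bonds are added
first_mask = bond_type_mask(valence, max_valence, 0, 1)
accepted = add_bond(valence, max_valence, 0, 1, 2)   # double bond between 0 and 1
rejected = add_bond(valence, max_valence, 1, 2, 1)   # node 1 is now saturated
```

In the usage above, the first edge admits single and double bonds only, and once node 1 is saturated any further edge touching it is rejected.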

Validity: we use this metric to evaluate to what degree a method generates chemically valid molecules. (We have used the open-source cheminformatics suite RDKit (http://www.rdkit.org) to check whether a generated molecule is chemically valid.) It is formally defined as follows:
$\text{Validity} = |C_s| / n_s$ (6)
where $n_s$ is the number of generated molecules, $C_s$ is the set of generated molecules which are chemically valid, and note that $\text{Validity} \in [0, 1]$.

Novelty: we use this metric to evaluate to which degree a method generates novel molecules, i.e., molecules which were not present in the (training) dataset. It is formally defined as follows:
$\text{Novelty} = 1 - |C_s \cap D| / |C_s|$ (7)
where $C_s$ is the set of generated molecules which are chemically valid, $D$ is the original dataset, which comprises both training and testing data, and $\text{Novelty} \in [0, 1]$.
Here, note that the higher the values of the above metrics, the more useful a method will be.
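Both metrics reduce to set operations once a validity check is available. In the sketch below, the check is abstracted as a user-supplied predicate `is_valid` (in our experiments this role is played by RDKit); the toy predicate, the molecule strings and the function name are assumptions made only for illustration.

```python
def validity_and_novelty(generated, dataset, is_valid):
    """Compute validity and novelty for a list of generated molecules.

    generated: list of molecule identifiers (e.g. canonical strings)
    dataset: iterable of identifiers for the training and test molecules
    is_valid: hypothetical predicate standing in for a chemical validity check
    """
    valid = {m for m in generated if is_valid(m)}  # set of valid generated molecules
    validity = len(valid) / len(generated)
    # Novelty is only defined over the valid molecules.
    novelty = 1.0 - len(valid & set(dataset)) / len(valid) if valid else float("nan")
    return validity, novelty


# Toy stand-in for a validity check: balanced parentheses in the string.
toy_is_valid = lambda s: s.count("(") == s.count(")")
generated = ["CCO", "C(C", "CC", "O=C=O"]
dataset = ["CC", "CCC"]
validity, novelty = validity_and_novelty(generated, dataset, toy_is_valid)
```

Since novelty is normalized by the number of valid molecules, a method can reach a perfect novelty score while producing very few valid molecules, so the two metrics should be read together.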
Table 4: Novelty and validity of the molecules generated by our model and by competing methods on ZINC and QM9.
Columns: Dataset | Sampling type | Our model | Our model (no mask) | GraphVAE | GrammarVAE | ChemicalVAE
Novelty
ZINC | – | 1.00 | 0.98
QM9 | 1.00 | 0.90 0.00
Validity
ZINC | 0.13 0.00 | 0.31 0.07 | 0.17 0.05
QM9 | 0.30 0.01 | 0.10 0.05
Quality of the generated molecules. Table 4 compares our VAE for graphs to the state-of-the-art methods above in terms of validity and novelty for both datasets. For GraphVAE, we report the results given in the paper since there is no publicly available implementation of their method. For ChemicalVAE and GrammarVAE, however, we run their publicly available implementations on the same set of molecules that we used. The results reveal several interesting findings:

Our method achieves an almost perfect validity score, beating the state of the art by large margins. Remarkably, even if we train our VAE without masks, it significantly outperforms the alternatives;

Both our method and all baselines except for the GraphVAE are able to always generate novel molecules; and,

We hypothesize that the subpar performance of ChemicalVAE and GrammarVAE, even though the latter uses a grammar to favor valid molecules, may be due to the limited expressiveness of text based methods.
Note that novelty is only defined over chemically valid molecules. Therefore, despite having perfect novelty scores, both ChemicalVAE and GrammarVAE generate fewer novel molecules than our method.
Smooth latent space of molecules. In this section, we show that our encoder, once trained, creates a latent space representation of molecules with powerful semantics. In particular, we can smoothly interpolate between molecules and, since each node in a molecule has a latent representation, we can make fine-grained node-level interpolations. To this end, we proceed as follows.
First, we select two molecular graphs with nodes from each of the ZINC and QM9 datasets. Given their corresponding graphs, node features and edge weights, , and , we sample their latent representations and . Then, we sample new molecular graphs from latent values in between these latent representations using a fine-grained node-level linear interpolation, i.e., , where and the node labels, which define the matching between pairs of nodes in both graphs, are arbitrary. Figure 11 provides several examples for both datasets, which show that our variational autoencoder and latent space representation can be used to smoothly interpolate between molecules at a node level. Here, note that the interpolation is smooth both in terms of graph structure and relevant chemical properties, e.g., synthetic accessibility score (SA score) and molecular weight (MW).
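A node-level interpolation differs from a graph-level one in that each node can move between the two latent representations at its own pace. The sketch below illustrates one way to do this, assuming one interpolation coefficient per node; the schedule that moves one node at a time, the toy latents and the function name are hypothetical.

```python
def interpolate_nodes(z0, z1, coeffs):
    """Per-node interpolation: node u moves from z0[u] to z1[u] by its own coefficient."""
    return [
        [(1.0 - a) * p + a * q for p, q in zip(zu0, zu1)]
        for zu0, zu1, a in zip(z0, z1, coeffs)
    ]


z0 = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]  # toy latents of the first molecule
z1 = [[4.0, 4.0], [5.0, 5.0], [6.0, 6.0]]  # toy latents of the second molecule
# Move one node at a time from the first molecule to the second.
schedule = [[1.0 if u < step else 0.0 for u in range(3)] for step in range(4)]
frames = [interpolate_nodes(z0, z1, coeffs) for coeffs in schedule]
```

Each frame differs from the previous one in the latent variable of a single node, which is what makes the interpolation fine-grained.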


6 Related Work
Our model is conceptually related to previous work on random graph models, network representation learning and recent advances in variational autoencoders and their applications to complex data.
Random graph models. Since the seminal random graph model by Erdős and Rényi [12], where each pair of nodes is connected by an edge with probability $p$, there has been a long history of random graph models in the literature [40, 41, 7, 21, 29, 28, 18]. Most of these random graph models aim to generate graphs that match one or more properties observed in real-world graphs, e.g., heavy-tailed node degree distributions [7, 21], small diameter [40, 41], or densification power law [29, 28]. However, most of the above random graph models share one or more of the following limitations: (i) they make strong (parametric) assumptions about the underlying generative process of the graphs they model; (ii) their parameters are difficult to learn from real-world graph data; and, (iii) they have limited expressive power and are unable to generate a diverse set of graphs. Only very recently, a handful of works have proposed deep generative models for graphs [25, 1, 2, 31]. Among them, the most closely related works [25, 2], which also propose variational autoencoders for graphs, are not invariant to permutation of the node labels, cannot generate (or be trained with) graphs with different numbers of nodes, and cannot guarantee a set of local structural and functional properties in the generated graphs.
Network representation learning. In recent years, there has been a flurry of research that seeks to learn low-dimensional representations (embeddings) that encode the structural information about a graph. These embeddings can then be used as input features for several downstream machine learning tasks such as link prediction or node classification. Most of the approaches in the literature are based on matrix factorization [8, 4, 11], random walks [15, 34], or neighborhood aggregation functions [16], which include graph convolutional networks (GCNs) [24, 39, 9, 30, 43]. Hamilton et al. [17] provide a comprehensive survey. However, this line of work is orthogonal to our work since it does not provide a random graph model to generate new, plausible graphs as we do.
Variational autoencoders (VAEs). The framework of variational autoencoders has emerged as a powerful technique to design complex, data-driven nonlinear generative models that are able to generate new plausible data. However, VAEs for graphs are still in their infancy [25, 2], as discussed above. Currently, VAEs have been mostly designed for image, audio and text generation [10, 19, 42, 35]. Very recently, the development of VAEs for text generation with a grammar [13, 33, 26] has facilitated their use in molecule discovery. However, such models use a string representation of molecules (e.g., SMILES) and, as a consequence, are more prone to generating invalid molecules than our variational autoencoder for graphs.
7 Conclusions
In this work, we have introduced a variational autoencoder for graphs, whose encoder and decoder are specially designed to account for the non-Euclidean structure of graphs, be invariant to the permutation of the node labels of the graphs they are trained with, and allow for graphs with different numbers of nodes and edges. Moreover, the decoder is able to guarantee a set of local structural and functional properties in the generated graphs by using a mask in the edge distribution definition, which can prevent the generation of certain undesirable edges during the decoding process. Finally, we have shown that our variational autoencoder can learn and mimic the generative process of well-known random graph models and can also be used to discover valid and diverse molecules more effectively than several state-of-the-art methods.
Our work also opens many interesting avenues for future work. For example, in the design of our variational autoencoder, we have assumed graphs to be static; however, it would be interesting to extend our design to dynamic graphs, in which nodes and edges arrive and leave over time, by, e.g., incorporating a recurrent neural network or long short-term memory (LSTM) units. In our design, each node has a latent representation; however, it would be interesting to consider latent representations for multiple nodes and edges, e.g., graphlets, which could improve the scalability. In our real-world application setting, it would be interesting to use Bayesian optimization over the latent representations to identify molecules with particular desirable properties, as in previous work [26]. Finally, we have performed experiments on a single real-world application, i.e., automatic chemical design; however, it would be very interesting to use our autoencoder in other applications [5].
References
 [1] Graphgan: Generating graphs via random walks. OpenReview 2018.
 [2] Graphvae: Towards generation of small graphs using variational autoencoders. OpenReview 2018.
 [3] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
 [4] A. Ahmed, N. Shervashidze, S. Narayanamurthy, V. Josifovski, and A. J. Smola. Distributed large-scale natural graph factorization. In WWW, 2013.
 [5] M. Allamanis, M. Brockschmidt, and M. Khademi. Learning to represent programs with graphs. In ICLR, 2018.
 [6] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
 [7] A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
 [8] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, 2002.
 [9] R. v. d. Berg, T. N. Kipf, and M. Welling. Graph convolutional matrix completion. arXiv preprint arXiv:1706.02263, 2017.
 [10] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
 [11] S. Cao, W. Lu, and Q. Xu. Grarep: Learning graph representations with global structural information. In CIKM, 2015.
 [12] P. Erdős and A. Rényi. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5(1):17–60, 1960.
 [13] R. Gómez-Bombarelli, D. Duvenaud, J. M. Hernández-Lobato, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. arXiv preprint arXiv:1610.02415, 2016.
 [14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
 [15] A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In KDD, 2016.
 [16] W. L. Hamilton, R. Ying, and J. Leskovec. Inductive representation learning on large graphs. In NIPS, 2017.
 [17] W. L. Hamilton, R. Ying, and J. Leskovec. Representation learning on graphs: Methods and applications. IEEE Data Engineering Bulletin, 2017.
 [18] P. W. Holland and S. Leinhardt. An exponential family of probability distributions for directed graphs. Journal of the American Statistical Association, 76(373):33–50, 1981.
 [19] Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing. Toward controlled generation of text. In ICML, 2017.
 [20] J. J. Irwin, T. Sterling, M. M. Mysinger, E. S. Bolstad, and R. G. Coleman. ZINC: a free tool to discover chemistry for biology. Journal of Chemical Information and Modeling, 52(7):1757–1768, 2012.
 [21] B. Karrer and M. E. Newman. Stochastic blockmodels and community structure in networks. Physical Review E, 83(1):016107, 2011.
 [22] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 [23] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 [24] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. 2016.
 [25] T. N. Kipf and M. Welling. Variational graph autoencoders. arXiv preprint arXiv:1611.07308, 2016.
 [26] M. J. Kusner, B. Paige, and J. M. Hernández-Lobato. Grammar variational autoencoder. arXiv preprint arXiv:1703.01925, 2017.
 [27] T. Lei, W. Jin, R. Barzilay, and T. Jaakkola. Deriving neural architectures from sequence and graph kernels. In ICML, 2017.
 [28] J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani. Kronecker graphs: An approach to modeling networks. Journal of Machine Learning Research, 11(Feb):985–1042, 2010.
 [29] J. Leskovec, J. Kleinberg, and C. Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In KDD, 2005.
 [30] R. Li, S. Wang, F. Zhu, and J. Huang. Adaptive graph convolutional neural networks. In AAAI, 2018.
 [31] Y. Li, O. Vinyals, C. Dyer, R. Pascanu, and P. Battaglia. Learning deep generative models of graphs. In ICLR, 2018.
 [32] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
 [33] M. Olivecrona, T. Blaschke, O. Engkvist, and H. Chen. Molecular de novo design through deep reinforcement learning. arXiv preprint arXiv:1704.07555, 2017.
 [34] B. Perozzi, R. Al-Rfou, and S. Skiena. DeepWalk: Online learning of social representations. In KDD, 2014.
 [35] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 [36] R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. von Lilienfeld. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1:140022, 2014.
 [37] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
 [38] L. Ruddigkeit, R. Van Deursen, L. C. Blum, and J.-L. Reymond. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. Journal of Chemical Information and Modeling, 52(11):2864–2875, 2012.
 [39] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. v. d. Berg, I. Titov, and M. Welling. Modeling relational data with graph convolutional networks. arXiv preprint arXiv:1703.06103, 2017.
 [40] D. J. Watts and S. H. Strogatz. Collective dynamics of “small-world” networks. Nature, 393(6684):440–442, 1998.
 [41] B. M. Waxman. Routing of multipoint connections. IEEE Journal on Selected Areas in Communications, 6(9):1617–1622, 1988.
 [42] Z. Yang, Z. Hu, R. Salakhutdinov, and T. Berg-Kirkpatrick. Improved variational autoencoders for text modeling using dilated convolutions. arXiv preprint arXiv:1702.08139, 2017.
 [43] M. Zhang, Z. Cui, M. Neumann, and Y. Chen. An end-to-end deep learning architecture for graph classification. In AAAI, 2018.