
# Physics-Constrained Predictive Molecular Latent Space Discovery with Graph Scattering Variational Autoencoder

## Abstract

Recent advances in artificial intelligence have propelled the development of innovative computational materials modeling and design techniques. In particular, generative deep learning models have been used for molecular representation, discovery, and design, with applications ranging from drug discovery to solar cell development. In this work, we assess the predictive capabilities of a molecular generative model developed based on variational inference and graph theory. The encoder network is based on the scattering transform, which allows for better generalization of the model in the presence of limited training data. The scattering layers incorporate adaptive spectral filters which are tailored to the training dataset based on the molecular graphs’ spectra. The decoding network is a one-shot graph generative model that conditions atom types on the molecular topology. We present a quantitative assessment of the latent space in terms of its predictive ability for organic molecules in the QM9 dataset. To account for the limited size of the training dataset, a Bayesian formalism is considered that allows us to capture the uncertainties in the predicted properties.


## I Introduction

Optimizing molecules for desired properties is a challenging task. Molecular property optimization has applications in a broad range of fields, from drug discovery Drews (2000) to organic solar cells Hoppe and Sariciftci (2004). Computer simulations of molecular systems remain the most prominent method for guiding molecular design Jorgensen (2009). In drug design, molecular dynamics (MD) methods are used to predict how strongly a given molecule would bind to a biological target, such as a protein or a nucleic acid. These computationally expensive methods usually rely on density functional theory (DFT).

With the recent advances in Artificial Intelligence (AI), deep learning techniques have become the leading contender for the future of molecular design optimization Elton et al. (2019). AI approaches give accurate predictions of molecular properties faster than computer simulations Faber et al. (2017). Moreover, deep learning methods have enabled automatic exploration of chemical space for molecular design by providing an underlying low-dimensional latent representation of the molecular space through generative models Gómez-Bombarelli et al. (2018). These advances have significantly reduced the resources required for synthesizing new drugs and molecules.

The first attempts at such methods Gómez-Bombarelli et al. (2018) were based on the Simplified Molecular-Input Line-Entry System (SMILES) Weininger (1988) and used variational autoencoders to learn a latent representation of molecules. One major problem with such methods was the validity of the generated molecules. To overcome this problem, Kusner et al. Kusner et al. (2017) proposed the grammar Variational AutoEncoder (VAE), which took advantage of parse trees to represent SMILES string structures and subsequently learned these parse trees instead of the SMILES strings. While this helped with syntactic validity, that is, the set of rules that define a correct combination of symbols for a particular representation, it does not guarantee semantic validity, which is the set of physical constraints that define a chemically viable molecule. With the recent advances in geometric deep learning Bronstein et al. (2017), however, researchers have shifted their focus to models utilizing graph-based molecular representations Gilmer et al. (2017). This shift was influenced by the richness of the graph representation in capturing the topological information of molecules and the flexibility of this representation in adding desired chemical parameters to each node. Moreover, molecules can be uniquely represented by a graph.

In a molecular graph, each atom is represented by a vertex and each covalent bond is defined by an edge that connects the corresponding atoms. Furthermore, weighted graphs can account for the proximity and strength of bonds (valence of bonds).

Various deep learning frameworks have been used for the generation of molecular graphs. These include Generative Adversarial Networks (GAN) De Cao and Kipf (2018) and VAEs Simonovsky and Komodakis (2018). More recently, a flow-based generative model Dinh et al. (2014) was used Madhawa et al. (2019) for generating molecules. Depending on the molecular representation used, there have been many encoding networks proposed in recent years. These include Cartesian Thomas et al. (2018); Weiler et al. (2018); Kondor (2018), SMILES Gómez-Bombarelli et al. (2018); Jastrzębski et al. (2016); Segler et al. (2018), and non-Euclidean (geometric deep learning) Son et al. (2019); Chen et al. (2019); Coley et al. (2019) representations.

The wavelet scattering transform Mallat (2012) is a deep convolutional neural network that uses a cascade of wavelet filters followed by a nonlinearity operator to generate invariant features from an input signal. These features are constructed by linearizing the variations within predefined local symmetry groups Mallat (2016). A map of these local invariants to a linear space can handle learning tasks such as regression and classification. In these networks, the number of nodes, layers, filters, and non-linearity layers is predefined. This eliminates the need for training the network and hence limits the uncertainty introduced by the encoder. Moreover, scattering improves the generalization of a network in the presence of limited data Oyallon et al. (2018). In recent years, the scattering transform has been extended to non-Euclidean domains including graphs Chen et al. (2014); Zou and Lerman (2019); Gama et al. (2018). A graph scattering network takes the weight matrix of the graph and the signals residing on its vertices as input and transforms them into features that are invariant to permutation and stable to graph and signal manipulation Zou and Lerman (2019). This transform typically uses spectral-design wavelets to perform convolutions; that is, wavelets designed in the eigenspace defined by the matrix representation of the graph. Gama et al. Gama et al. (2018) used diffusion wavelets Coifman and Maggioni (2006), which are defined using diffusion processes on graphs as filters. On the other hand, Zou and Lerman Zou and Lerman (2019) used Hammond et al.’s Hammond et al. (2011) spectral wavelets for feature extraction. These wavelets, however, are only adapted to the length of the graph spectrum. Therefore, in cases where the graph eigenvalues are irregularly spaced, this may result in highly-correlated wavelets.

The graph VAEs, which generate a latent space from input graphs, can roughly be divided into two categories based on their decoding network. The first group are the autoregressive models Jin et al. (2018); Li et al. (2018), which generate a molecule by sequentially adding components. The second group includes one-shot models Simonovsky and Komodakis (2018); Bresson and Laurent (2019), where the model simultaneously outputs the weighted adjacency matrix and the signals residing on its nodes. In the former, each iteration is conditioned on the previous iterations. These models become harder to train as the length of the sequence increases. The latter is not suitable for generating large graphs; however, it is faster and computationally more efficient. These models, however, need the maximum graph size to be predetermined Simonovsky and Komodakis (2018). Many recent works on one-shot graph VAEs assume independence of nodes and edges in their graph model. Despite their success in many applications, these implementations may not learn the validity constraints that dictate certain combinations of nodes and edges. As a result, they do not guarantee the semantic validity of the generated molecular graph, i.e., a single connected graph which complies with the valency of the nodes. Simonovsky and Komodakis Simonovsky and Komodakis (2018) investigated a remedy by defining the node presence probability as a function of the existence probabilities of the incident edges. While this helps with problems such as isolated nodes, it does not consider the valency of the atoms. Furthermore, methods have been proposed to impose chemical constraints on the decoding network Ma et al. (2018). In that work, Ma et al. reformulated a constrained optimization as an unconstrained regularization problem by imposing validity constraints through penalty terms on the VAE’s standard objective function, which penalize the network for semantically invalid generated outputs. Although this method can significantly increase the validity of the generated molecules, it considerably reduces their uniqueness.

Bayesian approaches have been used to model predictive uncertainty in neural networks Schöberl et al. (2019). It is common to approximate the full posterior over model parameters using the Laplace approximation MacKay (2003). In practice, however, approximating the distribution of the model parameters of a graph generative model with a Gaussian distribution may not be feasible, as the sampled model could suffer from low validity of the outputs. In order to develop a predictive method for cases in which an accurate model over parameters is not available, Harris Harris (1989) proposed a bootstrap predictive distribution as an approximation to the predictive distribution. Fushiki et al. Fushiki et al. (2005) further extended this to the non-parametric bootstrap distribution. Fushiki recently proposed Bayesian bootstrap prediction Fushiki (2010), which takes advantage of Rubin’s Rubin (1981) Bayesian bootstrap by imposing a non-informative prior over the bootstrap sampling weights.

In this work, we are interested in computing the statistics of molecular properties given a small training dataset. We use a deep generative model based on variational inference and perform accurate Monte Carlo estimation of molecular properties. We consider Bayesian bootstrap resampling to yield a predictive distribution over the estimated properties of the generated molecules, quantifying our probabilistic confidence in the estimated properties. As physical constraints are hard to satisfy when sampling nodes and edges independently, in this work we formulate the probabilistic output as the joint distribution of nodes and edges, where the probabilities of the nodes are defined as conditional distributions given the edges. This provides a direct map from the edge type scores to the node probabilities. This map implicitly encourages learning the physical constraints on consistent node and edge types without imposing explicit physical constraints. We have developed a hybrid neural network constructed of a graph scattering network coupled with a fully-connected network (FCN). Furthermore, in order to remedy the issues arising from the non-uniform spectra of molecular graphs, we take advantage of adaptive spectral filters Shuman et al. (2015) to construct our inference network. These filters are designed to adapt to the given training dataset, which increases the discriminatory power of the encoding network.

The rest of this work is structured as follows. In Section II we describe the inference problem and the general setup. Section III discusses the encoding and decoding network architectures used. Lastly, in Section IV, we train the network to discover a latent representation for the QM9 molecular database and to generate new molecules from samples in the latent space. We further assess the uncertainty of the network providing probabilistic evaluation of a number of properties.

## II Problem definition

Graphs Cayley (1874); Sylvester (1878), along with SMILES Weininger (1988), are two of the most common ways to represent a molecule in computational chemistry. A graph $G$ consists of a set of vertices $\mathcal{V}$, with $|\mathcal{V}| = N$, and a set of edges $\mathcal{E}$, defined by distinctive pairs of vertices $(v_i, v_j)$ with $v_i, v_j \in \mathcal{V}$ and $i \neq j$. A weighted graph is made of weights $W_{i,j}$ assigned to each edge $(v_i, v_j)$. The assigned value can represent the proximity or strength of the relationship between the pair of vertices. In a molecular graph, atoms are represented by graph vertices while edges represent the atomic bonds. In this setting, the type of the bond is shown by the weights assigned to the edges. The graph is a domain on which we can represent a signal $f$. In a molecular graph, the signals are the types of the atoms sitting on the corresponding vertices. In some applications, these signals are extended to include extra atomic features, including hybridization, hydrogen neighbors, etc.
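As a concrete illustration of this representation, the following minimal numpy sketch encodes a three-carbon fragment as a weight matrix and a signal vector. The atom-type labels and the use of bond order as the edge weight are hypothetical choices for this example, not the paper's exact encoding.

```python
import numpy as np

# Hypothetical encoding: atom-type label C=1 on each vertex; bond order as edge weight.
# A propene-like fragment C=C-C: one double bond and one single bond.
N = 3                                   # number of vertices (atoms)
f = np.array([1, 1, 1])                 # signal vector: atom-type label per vertex
W = np.zeros((N, N))                    # weighted adjacency (bond-order) matrix
W[0, 1] = W[1, 0] = 2.0                 # double bond between atoms 0 and 1
W[1, 2] = W[2, 1] = 1.0                 # single bond between atoms 1 and 2

assert np.allclose(W, W.T)              # undirected molecular graph => symmetric W
```

The symmetry of `W` reflects that covalent bonds are undirected; the zero diagonal encodes the absence of self-loops.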

Variational Graph AutoEncoders Kipf and Welling (2016b) are designed to analyze molecular data defined on a graph domain. In the present work, the encoding network has a hybrid architecture which takes advantage of the graph scattering transform Gama et al. (2018); Zou and Lerman (2019) in combination with a fully-connected neural network to extract features from the given input graphs. The graph scattering transform is a type of deep neural network with predefined parameters that takes the graph as input and performs convolutions using spectral graph filters, combined with modulus non-linearity to generate features Mallat (2012); Chen et al. (2014). These features are invariant to permutation and stable with respect to graph and signal manipulation Zou and Lerman (2019).

For a molecular graph $G$, we define the structure of the graph using a weight matrix $W$ and the data on the graph using a signal vector $f$. Each element $f_i$ of the signal vector represents the atom label for the corresponding vertex of the graph. A categorical probability $\tilde{f}_i$ for the type of the atom on node $i$ of the graph is a vector of probability values for each atom type class, such that

$$f_i = \arg\max(\tilde{f}_i). \qquad (1)$$

Similarly, each element $W_{i,j}$ of the weight matrix is a sample of a discrete variable that gives the label for the possible weight values. These values correspond to the type of the covalent bond. If $\tilde{W}_{i,j}$ is the categorical probability distribution for $W_{i,j}$, then it gives a vector of probability values for each bond type, such that

$$W_{i,j} = \arg\max(\tilde{W}_{i,j}). \qquad (2)$$
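The argmax decoding of Eqs. 1 and 2 can be sketched in a couple of lines; the probability values below are toy numbers over four hypothetical atom-type classes.

```python
import numpy as np

# Toy categorical outputs over four atom-type classes for two nodes; Eq. (1) maps
# each probability vector to its most probable class (labels here are hypothetical).
f_tilde = np.array([[0.1, 0.7, 0.1, 0.1],
                    [0.2, 0.2, 0.5, 0.1]])
f = f_tilde.argmax(axis=1)   # one label per node
# f is [1, 2]: node 0 gets class 1, node 1 gets class 2
```

The same `argmax` over the class axis of the edge-probability tensor yields the weighted adjacency matrix of Eq. 2.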

With these two components in hand, we can write the probability distribution of a molecular graph as a joint distribution of atom and bond types for all the nodes and edges in the molecular graph as follows:

$$p(G) = p(f, W). \qquad (3)$$

Given a finite-size dataset of molecular graphs $\{G^{(i)}\}_{i=1}^{K}$, the objective is to reveal an underlying latent representation for this dataset. Denoting the latent space variable with $z$ and given a prior distribution $p(z)$, the joint distribution over the observed data and latent variable can be written as

$$p(G, z) = p(G|z)\, p(z). \qquad (4)$$

In Eq. 4, $p(G|z)$ represents the probability of the molecular graph conditioned on its low-dimensional latent representation $z$. A generative model for the molecular graph can be achieved by marginalizing the joint distribution over the latent representation:

$$p(G) = \int p(G, z)\, dz = \int p(G|z)\, p(z)\, dz. \qquad (5)$$

To address the intractable calculation of the marginal likelihood $p_\theta(G)$, where the model is parameterized by $\theta$, we introduce standard variational inference in the context of a VAE framework and maximize the following lower-bound on the marginal log-likelihood:

$$\mathcal{L}(\theta, \phi; G) = \sum_{i=1}^{K} \mathcal{L}(\theta, \phi; G^{(i)}) = \sum_{i=1}^{K} \mathbb{E}_{q_\phi(z^{(i)}|G^{(i)})}\left[\log p_\theta(G^{(i)}|z^{(i)})\right] - \sum_{i=1}^{K} D_{KL}\left[q_\phi(z^{(i)}|G^{(i)}) \,\big\|\, p_\theta(z^{(i)})\right]. \qquad (6)$$

Here, $p_\theta(G|z)$ is a decoding distribution parametrized by $\theta$, $q_\phi(z|G)$ represents the variational distribution of the encoding network that approximates the posterior of $z$, and $D_{KL}$ refers to the Kullback-Leibler divergence between two distributions. The first term in the lower-bound represents the negative expected reconstruction error, while the second term introduces regularization by forcing the posterior of the latent variables to be close to the prior $p(z)$, here taken as a standard Gaussian. For a Gaussian $q_\phi(z|G)$, it can be shown Kingma and Welling (2013) that

$$-D_{KL}\left[q_\phi(z|G) \,\big\|\, p_\theta(z)\right] = \frac{1}{2} \sum_{j=1}^{J} \left( 1 + \log(\sigma_j^2) - \mu_j^2 - \sigma_j^2 \right),$$

where $\mu_j$ and $\sigma_j$ are the outputs of the encoder, $J$ is the dimension of the latent space, and the prior model is a standard Gaussian distribution. The reconstruction loss term in Eq. 6 is discussed in Section III.2.
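The closed-form KL term above is a one-liner in practice. A minimal numpy sketch, using the common convention of parameterizing the encoder output by the log-variance:

```python
import numpy as np

def neg_kl_std_normal(mu, log_var):
    """-D_KL[N(mu, diag(sigma^2)) || N(0, I)], the closed-form Gaussian KL term."""
    return 0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

# Sanity check: the KL term vanishes when the posterior equals the prior.
mu = np.zeros(4)
log_var = np.zeros(4)          # sigma = 1
assert abs(neg_kl_std_normal(mu, log_var)) < 1e-12
```

Any posterior that deviates from the standard normal makes this term strictly negative, which is what pulls the latent codes toward the prior during training.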

Maximizing the lower-bound will lead to the desired maximum likelihood estimate for the parameters $\phi$ and $\theta$ of the variational posterior $q_\phi(z|G)$ and decoding distribution $p_\theta(G|z)$, respectively. In a Variational AutoEncoder (VAE) setting, $q_\phi(z|G)$ provides the embedding of molecular graphs, whereas $p_\theta(G|z)$ allows the generation of molecular graphs using different samples of $z$ in the latent space.

## III Model

With the rise of signal processing methods on graphs Shuman et al. (2013); Bronstein et al. (2017); Ortega et al. (2018), various graph-based networks have been introduced for encoding in VAEs. Similarly, there are various decoding networks used for graph generation. We discuss next the encoding and decoding networks developed in our VAE framework.

### iii.1 Encoding

In this work, we use a hybrid scattering network for the embedding of molecular graphs. The encoding network is initialized with layers of scattering transform Mallat (2012) which is a generic feature extraction network that uses predefined (non-trainable) parameters to construct feature maps from the input. These parameters include multiresolution filters and modulus non-linearity. These layers are followed by an FCN. In this sense, the encoding network has a hybrid structure, where the graph scattering transform acts as an interface between graph input and the conventional FCN layers that output a latent representation of the molecular graph.

Graph filters can roughly be divided into two categories: (i) vertex-design filters, which are defined on the vertices of the graph as the linear combination of the k-hop neighbors of each vertex, and (ii) spectral-design filters, which are defined in the spectral domain of the graph. The latter is used in this work to define the kernels. The spectral domain of a graph is defined based on the graph Laplacian $L$, which is a symmetric, positive semidefinite matrix defined as

$$L := \Delta - W, \qquad (7)$$

where $\Delta$ is the vertex-degree matrix, with diagonal elements defined by the sum of the weights of the edges incident to each node and off-diagonal elements equal to zero. The eigen-decomposition of the Laplacian matrix yields real, non-negative eigenvalues $\lambda_\ell$ and the corresponding eigenvectors $\chi_\ell$, which compose the graph spectral domain.
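Eq. 7 and its eigen-decomposition are straightforward to compute; a sketch on a toy three-node path graph (a stand-in for a small molecule with unit-weight bonds):

```python
import numpy as np

# Weight matrix of a 3-node path graph with unit bonds (a hypothetical toy example).
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
Delta = np.diag(W.sum(axis=1))          # vertex-degree matrix
L = Delta - W                           # combinatorial graph Laplacian, Eq. (7)

lam, chi = np.linalg.eigh(L)            # eigenvalues lam, eigenvectors chi
assert lam.min() > -1e-12               # L is positive semidefinite
```

For this path graph the spectrum is `[0, 1, 3]`; the zero eigenvalue corresponds to the constant eigenvector, as for any connected graph.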

An important application of graph spectral theory is to provide tools for adapting the Fourier transform to graphs. In the graph domain, the Fourier transform is defined using the correspondence between the eigen-decomposition of the Laplace operator in the Euclidean domain and of the Laplacian matrix in the graph domain Shuman et al. (2013):

$$\hat{f}(\ell) := \sum_{i=1}^{N} \chi^*_\ell(i)\, f(i), \qquad (8)$$

where $i$ is the index in the vertex domain and $\ell$ is the index in the spectral domain. The corresponding inverse transform is defined as

$$f(i) := \sum_{\ell=0}^{N-1} \hat{f}(\ell)\, \chi_\ell(i). \qquad (9)$$

The convolution of a signal $f$ with a filter $g$ in the vertex domain is equivalent to the multiplication of their Fourier transforms $\hat{f}$ and $\hat{g}$ in the frequency domain. This can be used to reformulate the convolution in the spectral domain as follows:

$$f * g = \chi\, \hat{g}(\Lambda)\, \chi^* f = \hat{g}(L)\, f. \qquad (10)$$

Here, the Laplacian is decomposed as $L = \chi \Lambda \chi^*$, $\Lambda$ is the diagonal matrix of the eigenvalues of $L$, and the diagonal filtering matrix $\hat{g}(\Lambda)$ is defined as

$$\hat{g}(\Lambda) = \begin{bmatrix} \hat{g}(\lambda_0) & & \\ & \ddots & \\ & & \hat{g}(\lambda_{N-1}) \end{bmatrix}. \qquad (11)$$
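Spectral filtering per Eqs. 10 and 11 can be sketched directly from the eigen-decomposition. The exponential kernel below is a hypothetical low-pass choice for illustration, not one of the paper's filter-bank kernels.

```python
import numpy as np

# Spectral filtering (Eqs. 10-11) on a toy 3-node path graph.
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
L = np.diag(W.sum(axis=1)) - W
lam, chi = np.linalg.eigh(L)              # L = chi Lambda chi^T

g_hat = lambda lam: np.exp(-lam)          # hypothetical low-pass spectral kernel
filt = chi @ np.diag(g_hat(lam)) @ chi.T  # the matrix g_hat(L) of Eq. (10)

f = np.array([1., 0., -1.])
f_filt = filt @ f                         # convolution computed in the spectral domain
```

A constant signal lies in the null space of $L$ (eigenvalue zero), so a kernel with $\hat{g}(0) = 1$ passes it through unchanged, which is a quick correctness check for any such implementation.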

The filter-bank used in this work consists of a list of filters localized around different frequencies. We adapt the method introduced in Shuman et al., 2015 to construct band-pass filters by translating a main window in the spectral domain. We select a half-cosine kernel to define the main window as

$$\hat{g}'(\lambda) := \sum_{k=0}^{K} d_k \left[ \cos\left( 2\pi k \left( \frac{\lambda}{a} - \frac{1}{2} \right) \right) \cdot \mathbb{1}_{\{0 \le \lambda < a\}} \right], \qquad (12)$$

where

$$a = \frac{R\gamma}{J - R + 1}, \qquad (13)$$

is the dilation factor, $R$ is the kernel overlap which tunes the kernel’s width in the spectral domain, $J$ is the number of filters in the filter-bank, and $\gamma$ is the largest eigenvalue in the spectrum. With the main window selected, we define the system of filters as

$$\hat{g}'_j(\lambda) := \hat{g}'\left( \lambda - \frac{a(j - R + 1)}{R} \right), \qquad (14)$$

which are uniform translates of one another in the spectral domain. Note that the use of $\gamma$ in Eq. 13 adapts these filters to the length of the spectrum.
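Eqs. 12–14 can be sketched as follows. The coefficient choice `d = (0.5, 0.5)` (a raised-cosine, Hann-type window) and the values of `J` and `R` are assumptions for this example; the paper tunes these to the dataset.

```python
import numpy as np

# Half-cosine main window (Eq. 12) and its uniform translates (Eq. 14); a sketch.
def main_window(lam, a, d=(0.5, 0.5)):
    lam = np.asarray(lam, dtype=float)
    out = sum(dk * np.cos(2 * np.pi * k * (lam / a - 0.5)) for k, dk in enumerate(d))
    return out * ((lam >= 0) & (lam < a))   # indicator restricts support to [0, a)

def filter_bank(lam, lam_max, J=6, R=3):
    a = R * lam_max / (J - R + 1)           # dilation factor, Eq. (13)
    return np.stack([main_window(lam - a * (j - R + 1) / R, a)
                     for j in range(1, J + 1)])

lam_grid = np.linspace(0, 2.0, 201)         # a stand-in spectrum of length 2
bank = filter_bank(lam_grid, lam_max=2.0)
assert bank.shape == (6, 201)
```

With `d = (0.5, 0.5)` the window evaluates to $0.5 - 0.5\cos(2\pi\lambda/a)$ on its support, rising smoothly from 0 at the edges to 1 at the center, so neighboring translates overlap smoothly.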

As the eigenvalues of molecular graphs are unevenly spaced throughout the spectrum of the dataset, Eq. 14 may result in kernels that are highly correlated with those at neighboring nodes and scales. To overcome this issue, we adapt the kernels to the spectrum of the molecular graphs by means of a warping function. The warping function $\omega$ is defined as an approximation of the cumulative spectral density function of the union of the eigenvalues of the molecular graphs from the dataset. In Shuman et al., 2015, this method is used for designing filters on a single graph or a family of random graphs with known empirical spectral distribution. In this work, given the training data, we use kernel density estimation (KDE) to approximate the empirical spectral distribution of the Laplacian eigenvalues of the molecular graphs. Using the warping function $\omega$, the adaptive filter-bank is defined as

$$\hat{g}_j(\lambda) := \hat{g}'_j(\omega(\lambda)). \qquad (15)$$

Consequently, we use the length of the warped spectrum as $\gamma$ in Eq. 13. Note that while the filter-bank is not adapted to each individual training graph, overall, the density of the kernels is higher in the dense regions of the spectral domain. This effectively boosts the discriminatory capabilities of the encoding network. Figure 1 shows the histogram of the eigenvalues, the empirical spectral cumulative distribution function, and the resulting filter-bank kernels.
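A minimal sketch of such a warping function: the empirical spectral CDF of the pooled eigenvalues, smoothed with a Gaussian kernel (the bandwidth and the toy eigenvalues are assumptions; the paper fits a KDE to the full training spectrum).

```python
import numpy as np
from math import erf

def warp(lam, train_eigs, bw=0.1):
    """Approximate cumulative spectral density omega(lambda) from pooled eigenvalues,
    via a Gaussian-kernel CDF estimate (bandwidth bw is a hypothetical choice)."""
    z = (lam - train_eigs[:, None]) / bw
    return 0.5 * (1 + np.vectorize(erf)(z / np.sqrt(2))).mean(axis=0)

eigs = np.array([0.0, 0.1, 0.15, 1.9, 2.0])   # toy pooled spectrum, clustered at the ends
grid = np.array([0.0, 1.0, 2.0])
omega = warp(grid, eigs)
assert omega[0] < omega[1] < omega[2]          # monotone, as a CDF should be
```

Because $\omega$ rises steeply where eigenvalues cluster, composing the uniform translates with $\omega$ (Eq. 15) concentrates kernels exactly in those dense spectral regions.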

In the vertex domain, kernels are localized on each vertex by filtering a Kronecker delta function $\delta_i$ placed on the specified vertex $i$, leading to the dictionary of kernels defined as

$$g_{i,j} = \sqrt{N}\, \delta_i * g_j = \sqrt{N} \sum_{\ell=0}^{N-1} \hat{g}_j(\lambda_\ell)\, \chi^*_\ell(i)\, \chi_\ell. \qquad (16)$$

Using this, we find the analysis coefficients for a convolution with the kernel $g_j$ of scale $j$ as

$$\{\langle f, g_{i,j} \rangle\}_{i=1}^{N} = \hat{g}_j(L)\, f, \qquad (17)$$

where $\hat{g}_j(L)$ is a matrix function which defines a frame for the associated spectral filter, as shown in Eqs. 10 and 11. For a filter-bank of $J$ kernels, the frame consists of a collection of frames for each kernel:

$$\hat{g}(L) = \begin{bmatrix} \hat{g}_1(L) \\ \vdots \\ \hat{g}_J(L) \end{bmatrix}. \qquad (18)$$

Analysis with a filter-bank of kernels extracts features from the signal. Given Eq. 18, we can extend Eq. 17 to all kernels in a filter-bank

$$\{\langle f, g_{i,j} \rangle\}_{i=1,\dots,N,\; j=1,\dots,J} = \hat{g}(L)\, f, \qquad (19)$$

where $L$ is the graph Laplacian. Alternatively, one can use the normalized Laplacian

$$\tilde{L} = \Delta^{-1/2} L\, \Delta^{-1/2}, \qquad (20)$$

to define the Fourier basis Shuman et al. (2015). By using the latter, the strength of filtering is not affected by the degree of the node; hence, in this work we take advantage of the normalized Laplacian matrix as the tool for performing convolutions on the input signal.

After defining the spectral filters, we can describe the scattering layers. Mallat Mallat (2012) introduced scattering networks as networks that use wavelet transform building blocks to generate invariants with respect to different groups of symmetry. In each layer, the scaling function creates feature maps of the input data, and the wavelet filters extract higher frequency information from the input and propagate it forward to the next layer after application of a non-linear function. In the graph domain Zou and Lerman (2018); Gama et al. (2018, 2019), we are interested in graph-level feature maps. Hence, instead of a scaling function, we adapt the average pooling operator from Gama et al., 2019 to construct feature maps at each layer:

$$\eta = \frac{1}{N} \mathbf{1}_N^T. \qquad (21)$$

Thus, the zeroth-order scattering coefficient is achieved by averaging the signals across all the nodes of the graph.

In order to create more feature maps, higher frequency information is retrieved from the signal using the spectrum-adapted band-pass filters. Zou and Lerman Zou and Lerman (2018) perform this using spectral wavelets Hammond et al. (2011), while Gama et al. Gama et al. (2019) use tight frame wavelets and a scaling function Shuman et al. (2015). Unlike Gama et al., 2019, we do not include a scaling function while retrieving the higher frequency information and do so with the band-pass filters (Eqs. 12-15) that are localized in the vertex and spectral domains. Moreover, in our work the filters are tailor-made for the particular dataset so as to minimize the correlation among neighboring filters. This high frequency data is propagated to the next layer after application of the non-linearity operator to generate more feature maps. The non-linearity prevents the network from generating trivial feature maps by averaging oscillatory outputs of the convolution.

In summary, layer $m$ of the scattering transform can be written as

$$\mathcal{W}_m f = \{S_{m-1} f,\; U_m f\} = \left\{ \eta(U_{m-1} f),\; \rho\!\left(\hat{g}_j(\tilde{L})\, U_{m-1} f\right) \right\}_{j \in \Gamma}, \qquad (22)$$

where $S_{m-1}f$ denotes the $(m{-}1)$th-order scattering coefficients, $U_m f$ contains the remaining high frequencies that are propagated to the next layer of the network after applying the non-linearity $\rho$, and $\Gamma$ is the set of indices for the filters in the filter-bank.
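One scattering layer per Eq. 22 can be sketched end-to-end on a toy graph. The two spectral kernels below are hypothetical stand-ins for the adapted filter-bank; the modulus non-linearity and the averaging operator follow the definitions above.

```python
import numpy as np

# One scattering layer (Eq. 22) on a toy 3-node path graph, using the normalized
# Laplacian of Eq. (20); the two kernels are hypothetical, not the paper's bank.
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
Delta = np.diag(W.sum(axis=1))
L_norm = np.linalg.inv(np.sqrt(Delta)) @ (Delta - W) @ np.linalg.inv(np.sqrt(Delta))
lam, chi = np.linalg.eigh(L_norm)

def spectral_op(g_hat):                       # build ghat_j(L~) as a matrix (Eqs. 10-11)
    return chi @ np.diag(g_hat(lam)) @ chi.T

eta = lambda x: x.mean()                      # average-pooling operator, Eq. (21)
rho = np.abs                                  # modulus non-linearity
bank = [spectral_op(lambda l: np.exp(-l)),    # two hypothetical kernels
        spectral_op(lambda l: l * np.exp(-l))]

f = np.array([1., -1., 1.])
S0 = eta(f)                                   # zeroth-order scattering coefficient
U1 = [rho(G @ f) for G in bank]               # propagated first-layer signals
S1 = [eta(u) for u in U1]                     # first-order scattering coefficients
```

Because `rho` is the modulus, the averaged first-order outputs stay non-negative even for oscillatory filter responses, which is exactly why the non-linearity prevents trivial (zero-mean) feature maps.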

Fig. 2 illustrates the feature space generated by the scattering transform, which is the hidden layer input to the FCN layers. To visualize this space, we take advantage of principal component analysis (PCA) and project the high-dimensional feature space into a two-dimensional space.

In Table 1 we compare the proposed method with benchmark methods Zou and Lerman (2019); Gama et al. (2019) within a regression problem setting. We generate features from molecular graphs using each method and use these features to predict molecular properties using ridge regression. We compute errors using $k$-fold cross-validation. In the present work, node features consist of atom types. It is noteworthy that we can improve the performance of the prediction model by including additional atom features; however, this is out of the scope of the current work. The reader is referred to Ref. Faber et al., 2017 for a more detailed discussion.

After extracting features in the scattering layers, we pass them to conventional FCN layers followed by batch normalization. In this sense, the encoder is regarded as a hybrid network. In the first FCN layer, a linear layer followed by a batch normalization and an activation function are used to extract features from the input scattering coefficients. In other words, the additional trainable layer learns features that were not extracted in the predefined scattering layers.

Finally, two linear layers are used to find the hyperparameters for the variational approximate posterior distribution of the latent variable . These layers take the high-dimensional data from the previous hidden layer and project it to a lower-dimensional latent space. In this sense, these linear layers learn a probabilistic projection of the extracted feature maps to the latent space. It is worth emphasizing that the parameters for the classical FCN layers of the encoder are learned during the training, whereas scattering layers do not require training.

### iii.2 Decoding

After discussing the encoding architecture of the model, we focus on detailing the decoding network. In this work, the decoding network takes samples from the latent space and generates molecular graphs. Hence, the output includes two components: (i) a weight matrix that represents the topology of the graph and (ii) a signal vector that indicates the atoms of the molecule. To fulfill this task, the decoder is constructed of two generators, $D_1$ and $D_2$, that are jointly trained on a graph dataset. The input to the decoder is a sample $z$ from the latent space constructed by the encoder. Given a sample from the latent space, $D_1$ generates a graph by means of its weighted adjacency matrix $W$. Given the output from this generator, the signal decoder $D_2$ generates a signal $f$ which describes the atom sitting on each node. The output from $D_1$ provides a domain for the signal from $D_2$ to reside on.

The decoder $D_1$ defines the discrete probability distribution $\tilde{W}$ of the discrete variable $W$. This can be seen as the joint distribution of the bond types for all the edges in the graph. The categorical distribution gives the probability value for each possible category of the covalent bond between a pair of atoms:

$$\tilde{W} = p_\theta(W|z) = \prod_{i=1}^{N} \prod_{\substack{j=1 \\ j \neq i}}^{N} p_\theta(W_{i,j}|z). \qquad (23)$$

Then, the argmax function in Eq. 2 is used to turn these values into a weighted adjacency matrix by mapping each discrete variable $W_{i,j}$ to the highest probability class of weights.

The decoder $D_2$ yields a conditional probability distribution of the discrete variable $f$ given the discrete variable $W$ for each point $z$ in the latent space. A categorical distribution $\tilde{f}_i$ contains the probability values for each class of the variable $f_i$:

$$\tilde{f} = p_\theta(f|z, W) = \prod_{i=1}^{N} p_\theta(f_i|z, W). \qquad (24)$$

Finally, the argmax function (Eq. 1) is used to convert the discrete probability vectors into a set of one-hot vectors . This maps variable to the class with the highest probability, which is the mode of the categorical distribution. Here, an FCN followed by a softmax layer is used as the atom generator. As the FCN layers find the unnormalized scores for each category, this softmax layer turns these scores that can take any positive or negative values into normalized probability values.

The weight matrix decoding network is constructed in three steps. First, an FCN takes samples from the latent space and outputs a non-symmetric tensor. Then, unnormalized score values for each class are constructed by

$$\bar{W}(z) = \sigma\!\left( \hat{D}_1(z)\, \hat{D}_1(z)^T \right), \qquad (25)$$

where $\sigma$ is a non-linear layer. The output from the first FCN layer, $\hat{D}_1(z)$, is multiplied by its transpose to ensure symmetry of the final output probability tensor of the weight matrix for the undirected molecular graph. Lastly, a softmax layer turns these scores into probability values $\tilde{W}$.
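The symmetrization step of Eq. 25 can be sketched with random stand-in weights. The shapes, the random linear map standing in for the first FCN, the per-class outer product, and the `tanh` non-linearity are all assumptions for illustration; only the symmetry-by-construction idea is taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_z, d_h, C_W = 4, 8, 5, 3                     # nodes, latent dim, hidden dim, bond classes

A = rng.standard_normal((d_z, N * d_h * C_W))     # stand-in for the first FCN layer
z = rng.standard_normal(d_z)                      # a latent sample

D1 = (z @ A).reshape(N, d_h, C_W)                 # per-node, per-class embeddings
scores = np.einsum('ihc,jhc->ijc', D1, D1)        # D1(z) D1(z)^T per class: symmetric in (i, j)
scores = np.tanh(scores)                          # sigma: a hypothetical non-linear layer

e = np.exp(scores - scores.max(-1, keepdims=True))
W_tilde = e / e.sum(-1, keepdims=True)            # softmax over bond classes

assert np.allclose(W_tilde, W_tilde.transpose(1, 0, 2))   # symmetric edge probabilities
assert np.allclose(W_tilde.sum(-1), 1.0)                  # valid categorical per edge
```

Since the class-wise softmax acts elementwise over a symmetric score tensor, symmetry of the edge probabilities is guaranteed by construction rather than learned.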

To sum up, $D_1$, along with $D_2$, defines a probabilistic mapping from the latent space representation to the molecular graph domain:

$$p_\theta(G|z) = p_\theta(W, f|z) = p_\theta(f|z, W)\, p_\theta(W|z). \qquad (26)$$

This yields a probability distribution from which we take graph samples

$$W, f \sim p_\theta(G|z). \qquad (27)$$

When considering a one-to-one mapping, we map the variable to the most probable class in the categorical distribution for each possible node and edge in the graph using Eqs. 1 and 2 as follows:

$$G = \left( \arg\max(\tilde{W}),\; \arg\max(\tilde{f}) \right). \qquad (28)$$

In essence, the decoding network tackles a classification problem for every node and edge in the graph. When dealing with molecular graphs, for the node signal, the target classes include the heavy atom types in the dataset along with the case of an empty node. An empty node, or null vertex, means that no atom resides on this node and the molecule has fewer atoms than the predefined maximum possible number of atoms. In a similar manner, the classes for each edge include the possible types of the covalent bond between the respective atoms plus null, which means that there is no covalent bond between the corresponding pair of atoms.

Using the probabilistic graph model (Eq. 26), we can write the negative expected reconstruction loss term in Eq. 6 as

$$\begin{aligned} \mathbb{E}_{q_\phi}\left[\log p_\theta(G|z)\right] &= \mathbb{E}_{q_\phi}\left[\log p_\theta(W, f|z)\right] = \mathbb{E}_{q_\phi}\left[\log\left(p_\theta(W|z)\, p_\theta(f|z, W)\right)\right] \\ &= \mathbb{E}_{q_\phi}\left[ \sum_{i=1}^{N} \sum_{\substack{j=1 \\ j \neq i}}^{N} \log p_\theta(W_{i,j}|z) + \sum_{i=1}^{N} \log p_\theta(f_i|z, W) \right] \\ &= \sum_{i=1}^{N} \sum_{\substack{j=1 \\ j \neq i}}^{N} \mathbb{E}_{q_\phi}\left[\log p_\theta(W_{i,j}|z)\right] + \sum_{i=1}^{N} \mathbb{E}_{q_\phi}\left[\log p_\theta(f_i|z, W)\right]. \end{aligned} \qquad (29)$$

Note that computing the loss for each node and each edge is a multi-class classification problem, where we want to find the correct class for each node and edge in the graph. We take advantage of the generalized form of cross-entropy for multi-class problems to compute the reconstruction error. This computes the relative entropy between the predicted probability and the true probability over the node and edge classes. Given the probability vector $\tilde{W}_{i,j}$ for edge $(i,j)$, we can write the negative reconstruction error term for the edge as

$$\log p_\theta(W_{i,j}|z) = -H(t^{\varepsilon}_{i,j}, \tilde{W}_{i,j}) = \sum_{c=1}^{C_W} t^{\varepsilon}_{i,j,c} \log(\tilde{W}_{i,j,c}), \qquad (30)$$

where $H$ denotes the cross-entropy between the target distribution $t^{\varepsilon}_{i,j}$ and the predicted distribution $\tilde{W}_{i,j}$ for edge $(i,j)$, index $c$ denotes the class index, and $C_W$ is the total number of classes for edges. Furthermore, given the probability vector $\tilde{f}_i$ for node $i$, we compute the negative reconstruction error term for the node as

 \log p_\theta(f_i|z, W) = -H(t^{v}_{i}, \tilde{f}_i) = \sum_{c=1}^{C_f} t^{v}_{i,c} \log(\tilde{f}_{i,c}), (31)

where $t^{v}_{i}$ and $\tilde{f}_i$ are the target and predicted distributions for node $i$, respectively, and $C_f$ is the total number of classes for nodes.

Hence, we can write the reconstruction loss of the decoding network by summing the loss over all possible nodes and edges in the graph. Since the graph contains $N(N-1)$ edge terms but only $N$ node terms, this biases the loss toward the edges. To overcome this issue, we average the edge reconstruction losses and the node reconstruction losses separately and then add these two terms to obtain the reconstruction loss for the whole network. Hence, the reconstruction loss for data point $t$ has the form

 \mathcal{L}^{(t)}(D_1, D_2) = \frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C_f} -t^{v}_{i,c}\log\big(D_2(z^{(t)})_{i,c}\big) + \frac{1}{N(N-1)}\sum_{i=1}^{N}\sum_{\substack{j=1 \\ j\neq i}}^{N}\sum_{c=1}^{C_W} -t^{\varepsilon}_{i,j,c}\log\big(D_1(z^{(t)})_{i,j,c}\big). (32)
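The loss of Eq. 32 can be sketched numerically as follows; the array shapes and one-hot target encoding are assumptions for illustration:

```python
import numpy as np

def reconstruction_loss(t_v, t_e, f_pred, W_pred, eps=1e-12):
    """Node/edge cross-entropy of Eq. 32, averaged separately so that the
    N(N-1) edge terms do not dominate the N node terms.

    t_v, f_pred : (N, C_f)    one-hot targets / predicted node probabilities
    t_e, W_pred : (N, N, C_W) one-hot targets / predicted edge probabilities
    """
    N = t_v.shape[0]
    node_loss = -(t_v * np.log(f_pred + eps)).sum() / N
    off_diag = ~np.eye(N, dtype=bool)                 # exclude i == j terms
    edge_ce = -(t_e * np.log(W_pred + eps)).sum(axis=-1)
    edge_loss = edge_ce[off_diag].sum() / (N * (N - 1))
    return node_loss + edge_loss

# perfect predictions drive the loss to (numerically) zero
t_v = np.eye(2)                           # N = 2 nodes, C_f = 2 node classes
t_e = np.zeros((2, 2, 2)); t_e[..., 1] = 1.0
loss = reconstruction_loss(t_v, t_e, f_pred=t_v, W_pred=t_e)
```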

## IV Results

In this section, we provide details of the training of the encoder and the graph and signal decoders. The implementation will be made available at https://github.com/zabaras/GSVAE after publication. The latent representation of the molecular dataset is obtained, and the generative model is used to produce realistic molecules. We assess the model in terms of the chemical validity of the generated molecules, their novelty, and their uniqueness. In addition, we provide probabilistic estimates of molecular properties. Our focus is on computing these statistics using a limited number of training data points.

In this work, we use a subset of molecular graphs from the QM9 dataset to train our network. The QM9 database Ramakrishnan et al. (2014); Ruddigkeit et al. (2012) consists of small drug-like organic molecules. These molecules contain a maximum of nine heavy atoms, which include Carbon, Oxygen, Nitrogen, and Fluorine. To visualize the latent space, we use a subset of molecular graphs from the test set, which were not seen by the network during training. As the latent space dimension is higher than two, we cannot directly visualize this space on the plane. For illustration purposes, we perform PCA to transform the data to a two-dimensional space. These points are then colored based on various corresponding molecular properties.

Furthermore, we have computed a number of molecular properties to show the performance of the model. The physicochemical properties used here include the Polar Surface Area (PSA) Ertl et al. (2000), a measure of the polarity of a molecule computed as the sum of the surface areas of all polar atoms in the molecule; the Molecular Weight (MolWt), the sum of the atomic weights of the atoms of the molecule, where each atomic weight is the weighted average over the atomic isotopes based on their natural abundance; and the octanol-water partition coefficient (LogP), which accounts for the lipophilicity of the molecule.
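As a minimal illustration of the MolWt definition above (the atomic weights are standard abundance-weighted values; the property values reported in this work are computed with cheminformatics tooling, not this sketch):

```python
# Standard atomic weights (abundance-weighted isotope averages, g/mol)
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "F": 18.998}

def molecular_weight(atoms):
    """MolWt: sum of atomic weights over the atoms of a molecule."""
    return sum(ATOMIC_WEIGHT[a] for a in atoms)

# e.g. methanol CH3OH = 1 C + 4 H + 1 O
mw = molecular_weight(["C", "H", "H", "H", "O", "H"])  # ~32.042 g/mol
```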

#### Model specification

We provide here details of the generative model and the approximate variational posterior inference network. The generative model consists of a prior $p_\theta(z)$ on the latent space variable $z$ and a mapping $p_\theta(G|z)$ from the latent space to the graph domain. The variational posterior $q_\phi(z|G)$ approximates the posterior on $z$.

In this model, we assume a standard normal prior distribution over $z$,

 pθ(z)=N(z;0,I). (33)

The probabilistic mapping from the latent space to the molecular domain consists of two components: (i) a map from the latent space to the graph signals and (ii) a projection from the latent space to the weighted graph adjacency matrix. These are modeled with the probabilities

 p_\theta(W|z) = \mathrm{softmax}\big(h^{W}_{\theta}(z)\big) \quad \text{and} \quad p_\theta(f|z, W) = \mathrm{softmax}\big(h^{f}_{\theta}(z)\big), (34)

where the non-linear mapping $h_\theta$ is parameterized by a deep neural network and the superscript indicates the corresponding output. Moving to the inference network, we parameterize the variational approximation of the posterior by a Gaussian distribution

 qϕ(z|G)=N(z;μϕ(G),Sϕ(G)). (35)

We assume that the covariance matrix is a diagonal matrix. The hyperparameters of the distribution are found using the networks

 \mu_\phi(G) = h^{\mu}_{\phi}(G) \quad \text{and} \quad \log\sigma^2_\phi(G) = h^{\sigma}_{\phi}(G). (36)

Note that the variance has to be positive by definition. In order to make training easier, we train the model to find $\log\sigma^2_\phi$ instead and use its exponential to define the covariance matrix as $S_\phi(G) = \mathrm{diag}\big(\exp(\log\sigma^2_\phi(G))\big)$.

Using Eq. 36, we can generate a latent space sample $z$. We reformulate the variational approximation model in Eq. 35 as

 z=μϕ(G)+Sϕ(G)⊙ϵ, (37)

where $\epsilon$ has a standard normal distribution

 p(ϵ)=N(ϵ;0,I), (38)

with $\odot$ denoting the element-wise product. Given the input latent space variable $z$, the decoding network is trained to generate the graph weight matrix $W$. To this end, we incorporate the following structure
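The reparameterization of Eqs. 36-38 can be sketched as follows, writing the diagonal covariance through the learned log-variance; the function and variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps (cf. Eq. 37) with eps ~ N(0, I).

    Training the network to output log sigma^2 keeps the variance
    positive after exponentiation (cf. Eq. 36).
    """
    sigma = np.exp(0.5 * log_var)          # std dev from the log-variance
    eps = rng.standard_normal(mu.shape)    # p(eps) = N(eps; 0, I)
    return mu + sigma * eps

z = reparameterize(np.zeros(4), np.zeros(4), rng)  # here z ~ N(0, I)
```

Because the noise enters through a deterministic, differentiable map, gradients can flow back to the encoder parameters during training.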

 h^{W}_{\theta}(z) = \big(a^{(5)} \circ h_\theta h_\theta^{T}\big)(z), (39)

where

 h_\theta(z) = \big(a^{(4)} \circ l^{(4)}_{\theta} \circ a^{(3)} \circ l^{(3)}_{\theta} \circ a^{(2)} \circ l^{(2)}_{\theta} \circ a^{(1)} \circ l^{(1)}_{\theta}\big)(z). (40)

The term $h_\theta h_\theta^{T}$ ensures the symmetry of the weight matrix for the output undirected graph. In the next step, given the weight matrix, the network computes the signal residing on the graph's nodes, as illustrated in Fig. 3. As $h^{f}_{\theta}$ shares layers with $h^{W}_{\theta}$, we can write

 h^{f}_{\theta}(z) = \big(a^{(6)} \circ l^{(6)}_{\theta} \circ h^{W}_{\theta}\big)(z). (41)

After going through a cascade of convolutions with spectral kernels $\hat{g}$ and the modulus non-linear activation function $\rho$, information from different layers is aggregated in the layer $\eta$, where an average pooling operation constructs invariant feature maps from the input graphs. For scattering with coefficients up to second order, the output can be formulated as

 A(G) = \big((\eta \circ \rho \circ \hat{g} \circ \rho \circ \hat{g}) \frown (\eta \circ \rho \circ \hat{g}) \frown \eta\big)(G), (42)

where $\frown$ denotes the concatenation which gathers the 0th, 1st, and 2nd order scattering coefficients. As discussed earlier, the scattering layers use predefined filters to extract feature maps and act as a bridge between the traditional FCN encoder and the input molecular graphs. Moreover, the scattering layers reduce the need for a deeper FCN. As a result, we merely incorporate a single linear layer $l^{(1)}_{\phi}$ followed by a batch normalization layer $n^{(1)}_{\phi}$ and a non-linear layer $\tilde{a}^{(1)}$ to learn features that were not captured in the scattering layers. After this, two linear layers $l^{(2)}_{\phi}$ and $l^{(3)}_{\phi}$, the former of which is followed by a batch normalization layer $n^{(2)}_{\phi}$, are used to compute the parameters $\log\sigma^2_\phi$ and $\mu_\phi$ of the variational posterior of the variable $z$. We can summarize these networks as

 h^{\mu}_{\phi}(G) = \big(l^{(3)}_{\phi} \circ h_\phi\big)(G) \quad \text{and} \quad h^{\sigma}_{\phi}(G) = \big(n^{(2)}_{\phi} \circ l^{(2)}_{\phi} \circ h_\phi\big)(G), (43)

with the shared network structured as

 h_\phi(G) = \big(\tilde{a}^{(1)} \circ n^{(1)}_{\phi} \circ l^{(1)}_{\phi} \circ A\big)(G). (44)

In these equations, a linear layer $l$ represents the operation $l(x) = \Theta x + b$, where $\Theta$ and $b$ are learnable parameters, and a batch normalization layer $n$ represents $n(\hat{x}) = \gamma \odot \hat{x} + \beta$, where $\hat{x}$ is the normalized input, $\gamma$ and $\beta$ are learnable parameters, and $\odot$ is the element-wise product. The indices $\theta$ and $\phi$ distinguish between the weights and biases of the decoder and the encoder, respectively. $a$ denotes the non-linear activation functions of the decoding network and $\tilde{a}$ denotes the non-linear layers of the encoding network. Fig. 3 summarizes the encoding and decoding network architectures and the training procedure.
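The symmetry argument around Eq. 39 can be illustrated with a small sketch; the per-class channel handling and shapes here are hypothetical simplifications of the actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decode_edge_probs(h, C_W=2):
    """Pairwise edge logits built from outer products of node features are
    symmetric in (i, j), so the decoded graph is undirected (cf. Eq. 39).

    h : (N, d) node-wise features standing in for h_theta(z); splitting the
    d columns into C_W channels is a hypothetical choice for illustration.
    """
    logits = np.stack([h[:, c::C_W] @ h[:, c::C_W].T for c in range(C_W)],
                      axis=-1)              # (N, N, C_W), symmetric per class
    return softmax(logits, axis=-1)         # a(5): softmax over edge classes

rng = np.random.default_rng(1)
P = decode_edge_probs(rng.standard_normal((4, 6)))
```

Whatever the hidden layers produce, the product $h h^{T}$ guarantees $P_{i,j} = P_{j,i}$, so no post-hoc symmetrization of the sampled adjacency is needed.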

In order to visualize the latent space, we start by projecting molecular graphs to the latent space using the encoding network. These points, however, lie in a higher-dimensional feature space and cannot be shown on the 2D plane. Hence, we use Principal Component Analysis (PCA) to map the latent representations onto a two-dimensional plane. In Fig. 4, we have trained the model, encoded molecular graphs from the test set into the latent feature space, and used the principal axes to represent the feature space on the plane. We notice that in Fig. 4-b, molecules with different numbers of heavy atoms are distinct in the shades of green, where molecules of a particular number of heavy atoms are clustered in the center with a light green color. As we move in the positive direction of the first principal axis, molecules have a lower number of heavy atoms. Fig. 8 compares the latent space of the model trained with two different training data sizes. We can observe a pattern similar to Fig. 4-b for the model trained with the smaller number of molecular graphs.

In order to observe the variability within each cluster corresponding to a particular number of heavy atoms, in Fig. 5 we have isolated the visualization of molecules with a fixed number of heavy atoms from Fig. 4. Comparing Fig. 5 to Fig. 4, we see that the patterns in Figs. 5-a and 5-b are repeated for each molecule size in Figs. 4-a and 4-c, respectively.

Comparing these results with the latent representation of SMILES VAEs in Gómez-Bombarelli et al., 2018, we notice that, unlike SMILES VAEs, which do not have an organized latent space and need to be coupled with a property predictor to arrange the molecules based on their properties, graph VAEs have some degree of organization in their latent representation. This is because graph VAEs also incorporate the graph that represents the underlying structure of the molecule, which provides some of the information needed to estimate the molecular properties. This can be improved further by incorporating additional atomic features such as hybridization.

We further analyze the feature space by constructing a latent contour map in Fig. 6. We construct a two-dimensional grid and project it onto the latent space using a projection matrix. In order to find the projection map, we encode the training data into the latent space and find a projection matrix to the two-dimensional space using PCA. We use the transpose of this matrix to yield points in the feature space whose projection is a grid on the plane of the principal axes. In addition, this ensures that the grid represents the same region of the latent space in which the molecules are present. We then pass the lifted grid points to the decoding network, which in return outputs molecular graphs. As discussed later in this section, there is no guarantee that the generated molecules have a chemically valid combination of atoms and bonds. As a result, some grid points might correspond to molecules that are chemically invalid, for which the physicochemical properties cannot be estimated. After filtering out the invalid molecules, we estimate the property values, which gives us the property map. Note that since the grid points corresponding to the invalid molecules have been eliminated, Gaussian Process (GP) regression is used to provide a smooth contour map.
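The grid construction described above can be sketched with a PCA computed via the SVD; the latent dimension, grid size, and span below are placeholder values:

```python
import numpy as np

def latent_contour_grid(Z, n=5, span=3.0):
    """Build a 2-D grid on the first two principal axes of the encoded
    data Z (K, d) and lift it back to the d-dimensional latent space
    using the transpose of the PCA projection matrix.
    """
    mu = Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Z - mu, full_matrices=False)
    P = Vt[:2]                                   # (2, d) projection matrix
    g = np.linspace(-span, span, n)
    grid2d = np.stack(np.meshgrid(g, g), -1).reshape(-1, 2)   # (n*n, 2)
    return grid2d @ P + mu                       # (n*n, d) points to decode

rng = np.random.default_rng(2)
Z = rng.standard_normal((100, 8))                # stand-in for encoded data
grid = latent_contour_grid(Z)                    # decode these to molecules
```

Each lifted grid point would then be passed through the decoder; the GP smoothing over the surviving (valid) points is not shown.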

The chemical space of the molecules defined by their physicochemical properties plays an important role in drug design Kwon (2001). As an example, Lipinski’s rule of five (Ro5) Lipinski et al. (1997) suggests limits on LogP, molecular weight, and hydrogen bond donors and acceptors to ensure that the drug-like properties are maintained during the optimization of the molecular structure. As a result, it is important for the predictions of the model to give a reliable estimate of these chemical spaces. Figure 7 compares the chemical space distribution of the training set constructed by LogP and molecular weight with the one for the predicted molecules. The last column is the chemical space of the database, which is constructed by samples from the QM9 database.

In Fig. 7, we further compare the models trained with various sizes of the training set. We observe that the model performs well even with small training sets and can yield a reliable estimate of the chemical space of the database. Figure 8 shows a comparison between the latent spaces of the models trained with two different training sizes.

Finally, in Fig. 9 we examine the latent space by interpolating between the feature space representation of two molecules. We randomly select two molecules from the dataset and project them onto the latent space using the encoding network. Then, we take samples in equal intervals on the path that connects the two representations. These latent space samples are projected to the graph domain using the decoding network. Comparing the generated molecules we see the transition from one molecule to the other in the latent space.

As noted earlier, not all the molecular graphs sampled using the generative model have chemically valid structures. To assess the quality of the results, we use three quality scores: validity, uniqueness, and novelty. The validity score indicates the portion of predicted graphs that have a valid molecular representation

 Validity =|valid(¯G)||¯G|, (45)

where $\bar{G}$ denotes the collection of sampled molecules. Note that a valid molecule refers to a single connected graph with no violation of the valency of the atoms. Figure 10 illustrates examples of valid molecules sampled using the generative model. Moreover, the uniqueness score indicates what portion of the molecular graphs are unique among the sampled outputs

 Uniqueness =∣∣¯G∗∣∣|valid(¯G)|. (46)

Here, $\bar{G}^{*}$ denotes the set of unique valid molecules among the samples. Lastly, the novelty score indicates whether the generated molecular graphs were present in the training dataset or are novel molecules

 Novelty =|valid(¯G)|−|G∩valid(¯G)||valid(¯G)|, (47)

where $G$ stands for the set of training molecules.
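Eqs. 45-47 can be computed directly from collections of canonical molecule identifiers; the toy strings and validity predicate below are purely illustrative:

```python
def quality_scores(sampled, training, is_valid):
    """Validity, uniqueness and novelty scores of Eqs. 45-47.

    sampled, training : lists of canonical molecule identifiers
    is_valid          : predicate for chemical validity (connectivity, valency)
    """
    valid = [m for m in sampled if is_valid(m)]
    validity = len(valid) / len(sampled)
    uniqueness = len(set(valid)) / len(valid) if valid else 0.0
    novel = [m for m in valid if m not in set(training)]
    novelty = len(novel) / len(valid) if valid else 0.0
    return validity, uniqueness, novelty

# toy run with SMILES-like strings; "X" marks an invalid sample
v, u, n = quality_scores(["CCO", "CCO", "CCN", "X"], ["CCO"],
                         lambda m: m != "X")
```

In practice the identifiers would be canonical SMILES strings so that duplicate and training-set molecules compare equal.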

In Table 3, we examine the quality scores for molecules generated by the trained model. In this table, we have distinguished the percentage of the molecules with each validity issue, namely connectivity and valency. Note that the percentages of valid molecules and of molecules with each validity issue might not sum to 100%, as some molecules might have both issues. Table 4 further compares the present work with benchmark models. It can be seen that the current model outperforms the benchmarks in novelty and uniqueness scores, while it is second only to MolGAN De Cao and Kipf (2018) in validity score. We believe that a balance between the quality scores of the model is preferable to a high score in one category and a low score in another, as in De Cao and Kipf, 2018.

We further examine the performance of the model under physical constraints. We use the method proposed by Ma et al., 2018 to impose validity constraints, including connectivity and valency. As can be seen in Table 3, using these constraints generally results in a decrease in the uniqueness score. In addition, Ma et al., 2018 use a trial-and-error approach to tweak the Lagrange multipliers to yield the best validity score, while this does not guarantee satisfaction of the Karush-Kuhn-Tucker (KKT) conditions. Addressing these issues is subject to further investigation.

#### Model uncertainty

Bayesian inference has been used for quantifying uncertainty in VAE models Schöberl et al. (2019). Quantifying uncertainties in the model parameters results in error bars over the estimated physicochemical properties, reflecting our confidence in them.

Given a set of i.i.d. observations $G = \{G^{(k)}\}_{k=1}^{K}$ sampled from $p_{\mathrm{target}}(G)$, we are interested in finding a model $p(G)$, parameterized by $\theta$, that closely resembles $p_{\mathrm{target}}(G)$. This can be achieved by minimizing the KL distance between $p_{\mathrm{target}}(G)$ and $p(G)$:

 DKL(ptarget(G)||p(G))=−∫ptarget(G)logp(G)dG+∫ptarget(G)logptarget(G)dG. (48)

It can be shown that we can equivalently train the model by maximizing the marginal log-likelihood

 \theta_{\mathrm{MLE}} = \operatorname*{arg\,max}_{\theta} \sum_{k=1}^{K} \log p_\theta(G^{(k)}). (49)

Computation of the marginal log-likelihood requires an intractable integration. To overcome this, Eq. 6 defines a lower bound on the marginal log-likelihood by introducing an auxiliary density parameterized by $\phi$. Therefore, we find the MLE estimate by maximizing the lower bound on the marginal log-likelihood

 \theta_{\mathrm{MLE}}, \phi_{\mathrm{MLE}} = \operatorname*{arg\,max}_{\theta,\phi} \mathcal{L}(\theta,\phi;G) = \operatorname*{arg\,max}_{\theta,\phi} \sum_{k=1}^{K} \mathcal{L}(\theta,\phi;G^{(k)}). (50)

To study the uncertainty in the trained model, we can take advantage of the predictive distribution

 p(¯G∣G)=∫p(¯G∣θ)p(θ∣G)dθ. (51)

In Eq. 51, quantifying uncertainties in $\theta$ enables us to capture the epistemic uncertainties introduced by the limited training data. We can summarize the procedure in the following steps:

• Draw a posterior sample $\theta^{(s)} \sim p(\theta \mid G)$.

• Obtain predictive samples $\bar{G}^{(s,m)} \sim p(\bar{G} \mid \theta^{(s)})$, with $m = 1, \dots, M$, given $\theta^{(s)}$.

As noted above, this requires a full posterior over the model parameters $\theta$. Finding this posterior is computationally impractical. One common approach is to approximate the posterior distribution with a Gaussian distribution using the Laplace method. This method requires the computation of the Hessian of the log-posterior.

In some problems, this normal distribution gives a poor approximation to the full posterior of the model parameters. We observed that models with parameters drawn from the approximated Gaussian distribution suffer from extremely low validity of the sampled molecules. Additionally, the computation of the Hessian can be overwhelmingly expensive. Taking these into consideration, Newton and Raftery Newton and Raftery (1994) proposed the weighted likelihood bootstrap (WLB) as a way to simulate approximately from the posterior distribution. This method is a direct extension of Rubin's Bayesian bootstrap Rubin (1981).

Given the dataset $G$, the bootstrap method Efron (1992) generates multiple samples by sampling from $G$ with replacement. In the classical bootstrap, the sampling weight associated with data point $G^{(k)}$ is drawn from the discrete set $\{0/K, 1/K, \dots, K/K\}$, where the numerator is the number of times that data point appears in the resampled dataset and the denominator $K$ is the size of the dataset.

Rubin Rubin (1981) presented a Bayesian analog of the bootstrap by treating the sampling weights $\pi$ as unknown variables and drawing them from a posterior distribution over $\pi$. By imposing an improper, non-informative Dirichlet prior distribution

 p(π)=Dir(π;α),with α=[0,…,0]∈RK. (52)

over $\pi$ and observing the sample $G$, Bayes' rule can be used to derive the posterior distribution over the sampling weights

 p(\pi|G) \propto p(G|\pi)\,p(\pi) \propto \prod_{k}\pi_k^{n_k}\prod_{k}\pi_k^{\alpha_k-1} \propto \prod_{k}\pi_k^{n_k+\alpha_k-1}, (53)

where, for each observation $G^{(k)}$, $n_k = 1$ since each data point is observed once. From Eqs. 52 and 53, the posterior over the bootstrap sampling weights follows a Dirichlet distribution over the bounded finite-dimensional space

 p(π|G)=Dir
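In the weighted likelihood bootstrap, each training run re-weights the per-datum log-likelihood terms with such a draw; a minimal sketch of sampling the weights implied by Eq. 53 with $n_k = 1$:

```python
import numpy as np

rng = np.random.default_rng(3)

def bayesian_bootstrap_weights(K, rng):
    """Draw sampling weights pi ~ Dir(1, ..., 1): the posterior implied by
    the improper prior Dir(0, ..., 0) after observing each of the K data
    points once (n_k = 1 in Eq. 53).
    """
    return rng.dirichlet(np.ones(K))

w = bayesian_bootstrap_weights(5, rng)  # non-negative weights summing to 1
```

Unlike the classical bootstrap, the weights are continuous rather than multiples of $1/K$, so every data point retains some (possibly small) influence in each replicate.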