All SMILES Variational Autoencoder
Abstract
Variational autoencoders (VAEs) defined over SMILES string and graphbased representations of molecules promise to improve the optimization of molecular properties, thereby revolutionizing the pharmaceuticals and materials industries. However, these VAEs are hindered by the nonunique nature of SMILES strings and the computational cost of graph convolutions. To efficiently pass messages along all paths through the molecular graph, we encode multiple SMILES strings of a single molecule using a set of stacked recurrent neural networks, pooling hidden representations of each atom between SMILES representations, and use attentional pooling to build a final fixedlength latent representation. By then decoding to a disjoint set of SMILES strings of the molecule, our All SMILES VAE learns an almost bijective mapping between molecules and latent representations near the highprobabilitymass subspace of the prior. Our SMILESderived but moleculebased latent representations significantly surpass the stateoftheart in a variety of fully and semisupervised property regression and molecular property optimization tasks.
All SMILES Variational Autoencoder
Zaccary Alperstein Quadrant zac@quadrant.ai Artem Cherkasov Vancouver Prostate Centre, UBC acherkasov@prostatecentre.com Jason Tyler Rolfe Quadrant jason@quadrant.ai
noticebox[b]Preprint. Under review.\end@float
1 Introduction
The design of new pharmaceuticals, OLED materials, and photovoltaics all require optimization within the space of molecules PyzerKnapp et al. (2015). While wellknown algorithms ranging from gradient descent to the simplex method facilitate efficient optimization, they generally assume a continuous search space and a smooth objective function. In contrast, the space of molecules is discrete and sparse. Molecules correspond to graphs, with each node labeled by one of ninetyeight naturally occurring atoms, and each edge labeled as a single, double, or triple bond. Even within this discrete space, almost all possible combinations of atoms and bonds do not form chemically stable molecules, and so must be excluded from the optimization domain, yet there remain as many as small molecules to consider Reymond et al. (2010). Moreover, properties of interest are often sensitive to even small changes to the molecule Stumpfe & Bajorath (2012), so their optimization is intrinsically difficult.
Efficient, gradientbased optimization can be performed over the space of molecules given a map between a continuous space, such as or the sphere, and the space of molecules and their properties SanchezLengeling & AspuruGuzik (2018). Initial approaches of this form trained a variational autoencoder (VAE) Kingma & Welling (2013); Rezende et al. (2014) on SMILES string representations of molecules Weininger (1988) to learn a decoder mapping from a Gaussian prior to the space of SMILES strings (GómezBombarelli et al., 2018). A sparse Gaussian process on molecular properties then facilitates Bayesian optimization of molecular properties within the latent space Dai et al. (2018); GómezBombarelli et al. (2018); Kusner et al. (2017); Samanta et al. (2018), or a neural network regressor from the latent space to molecular properties can be used to perform gradient descent on molecular properties with respect to the latent space AumentadoArmstrong (2018); Jin et al. (2018); Liu et al. (2018); Mueller et al. (2017). Alternatively, semisupervised VAEs condition the decoder on the molecular properties Kang & Cho (2018); Lim et al. (2018), so the desired properties can be specified directly. Recurrent neural networks have also been trained to model SMILES strings directly, and tuned with transfer learning, without an explicit latent space or encoder Gupta et al. (2018); Segler et al. (2017).
SMILES, the simplified molecularinput lineentry system, defines a character string representation of a molecule by performing a depthfirst preorder traversal of a spanning tree of the molecular graph, emitting characters for each atom, bond, treetraversal decision, and broken cycle Weininger (1988). The resulting character string corresponds to a flattening of a spanning tree of the molecular graph, as shown in Figure 1. The SMILES grammar is restrictive, and most strings over the appropriate character set do not correspond to welldefined molecules. Rather than require the VAE decoder to explicitly learn this grammar, contextfree grammars Kusner et al. (2017), and attribute grammars Dai et al. (2018) have been used to constrain the decoder, increasing the percentage of valid SMILES strings produced by the generative model. Invalid SMILES strings and violations of simple chemical rules can be avoided entirely by operating on the space of molecular graphs, either directly De Cao & Kipf (2018); Li et al. (2018); Liu et al. (2018); Ma et al. (2018); Simonovsky & Komodakis (2018) or via junction trees Jin et al. (2018).
Every molecule is represented by many wellformed SMILES strings, corresponding to all depthfirst traversals of every spanning tree of the molecular graph. The distance between different SMILES strings of the same molecule can be much greater than that between SMILES strings from radically dissimilar molecules Jin et al. (2018), as shown in Figure 8 of Appendix A. A generative model of individual SMILES strings will tend to reflect this geometry, complicating the mapping from latent space to molecular properties and creating unnecessary local optima for property optimization Vinyals et al. (2015). To address this difficulty, sequencetosequence transcoders Sutskever et al. (2014) have been trained to map between different SMILES strings of a single molecule Bjerrum & Sattarov (2018); Winter et al. (2019a, b).
Reinforcement learning, often combined with adversarial methods, has been used to train progressive molecule growth strategies Guimaraes et al. (2017); Jaques et al. (2017); Olivecrona et al. (2017); Putin et al. (2018); You et al. (2018); Zhou et al. (2018). While these approaches have achieved stateoftheart optimization of simple molecular properties that can be evaluated quickly in silico, criticfree techniques generally depend upon property values of algorithmgenerated molecules (but see De Cao & Kipf (2018); Popova et al. (2018)), and so scale poorly to realworld properties requiring timeconsuming wetlab experiments.
Molecular property optimization would benefit from a generative model that directly captures the geometry of the space of molecular graphs, rather than SMILES strings, but efficiently infers a latent representation sensitive to spatially distributed molecular features. To this end, we introduce the All SMILES VAE, which uses recurrent neural networks (RNNs) on multiple SMILES strings to implicitly perform efficient message passing along and amongst many flattened spanning trees of the molecular graph in parallel. A fixedlength latent representation is distilled from the variablelength RNN output using attentional mechanisms. From this latent representation, the decoder RNN reconstructs a set of SMILES strings disjoint from those input to the encoder, ensuring that the latent representation only captures features of the molecule, rather than its SMILES realization. Simple property regressors jointly trained on this latent representation surpass the stateoftheart for molecular property prediction, and facilitate exceptional gradientbased molecular property optimization when constrained to the region of prior containing almost all probability. We further demonstrate that the latent representation forms a nearbijection with the space of molecules, and is smooth with respect to molecular properties, facilitating effective optimization.
2 Efficient molecular encoding with multiple SMILES strings
A variational autoencoder (VAE) defines a generative model over an observed space in terms of a prior distribution over a latent space and a conditional likelihood of observed states given the latent configuration Kingma & Welling (2013); Rezende et al. (2014). The true loglikelihood is intractable, so the evidence lower bound (ELBO), based upon a variational approximation to the posterior distribution, is maximized instead: The ELBO implicitly defines a stochastic autoencoder, with encoder and decoder .
Many effective molecule encoders rely upon graph convolutions: local message passing in the molecular graph, between either adjacent nodes or adjacent edges Duvenaud et al. (2015); Kearnes et al. (2016); Kipf & Welling (2016a); Li et al. (2015); Lusci et al. (2013). To maintain permutation symmetry, the signal into each node is a sum of messages from the adjacent nodes, but may be a function of edge type, or attentional mechanisms dependent upon the source and destination nodes Ryu et al. (2018). This sum of messages is then subject to a linear transformation and a pointwise nonlinearity. Messages are sometimes subject to gating Li et al. (2015), like in long shortterm memories (LSTM) Hochreiter & Schmidhuber (1997) and gated recurrent units (GRU) Cho et al. (2014), as detailed in Appendix B.1.
Message passing on molecular graphs is analogous to a traditional convolutional neural network applied to images Krizhevsky et al. (2012); LeCun et al. (1990), with constantresolution hidden layers He et al. (2016) and two kernels: a averagepooling kernel that sums messages from adjacent pixels (corresponding to adjacent nodes in a molecular graph), and a trainable kernel that transforms the message from each pixel (node) independently, before a pointwise nonlinearity. While convolutional networks with such small kernels are now standard in the visual domain, they use hundreds of layers to pass information throughout the image and achieve effective receptive fields that span the entire input Szegedy et al. (2016). In contrast, molecule encoders generally use between three and seven rounds of message passing Duvenaud et al. (2015); Gilmer et al. (2017); Jin et al. (2018); Kearnes et al. (2016); Liu et al. (2018); Samanta et al. (2018); You et al. (2018). This limits the computational cost, since molecule encoders cannot use highlyoptimized implementations of spatial 2D convolutions, but each iteration of message passing only propagates information a geodesic distance of one within the molecular graph.^{1}^{1}1Alltoall connections allow fast information transfer, but computation is quadratic in graph size Gilmer et al. (2017); Kearnes et al. (2016). Lusci et al. (2013) considered a set of DAGs rooted at every atom, with full message propagation in a single pass. In the case of the commonly used dataset of 250,000 druglike molecules GómezBombarelli et al. (2018), information cannot traverse these graphs effectively, as their average diameter is 11.1, and their maximum diameter is 24, as shown in Appendix A.
Nonlocal molecular properties, requiring longrange information propagation along the molecular graph, are of practical interest in domains including pharmaceuticals, photovoltaics, and OLEDs. The pharmacological efficacy of a molecule generally depends upon high binding affinity for a particular receptor or other target, and low binding affinity for other possible targets. These binding affinities are determined by the maximum achievable alignment between the molecule’s electromagnetic fields and those of the receptor. Changes to the shape or charge distribution in one part of the molecule affect the position and orientation at which it fits best with the receptor, inducing shifts and rotations that alter the binding of other parts of the molecule, and changing the binding affinity Clayden et al. (2001). Similarly, efficient nextgeneration OLEDs depend on properties, such as the singlettriple energy gap, that are directly proportional to the strength of longrange electronic interactions across the molecule Im et al. (2017). The latent representation of a VAE can directly capture these nonlocal, nonlinear properties only if the encoder passes information efficiently across the entire molecular graph.
Analogous to graph convolutions, gated RNNs defined directly on SMILES strings effectively pass messages, via the hidden state, through a flattened spanning tree of the molecular graph (see Figure 1). The message at each symbol in the string is a weighted sum of the previous message and the current input, followed by a pointwise nonlinearity and subject to gating, as reviewed in Appendix B.1. This differs from explicit graphbased message passing in that the molecular graph is flattened into a chain corresponding to a depthfirst preorder traversal of a spanning tree, and the set of adjacent nodes that affect a message only includes the preceding node in this chain. Rather than updating all messages in parallel, RNNs on SMILES strings move sequentially down the chain, so earlier messages influence all later messages, and information can propagate through all branches of a flattening of a spanning tree in a single pass. With a wellchosen spanning tree, information can pass the entire width of the molecular graph in a single RNN update. The relationship between RNNs on SMILES strings and graphbased architectures is further explored in Appendix B.
3 Model architecture
To marry the latent space geometry induced by graph convolutions to the information propagation efficiency of RNNs on SMILES strings, the All SMILES encoder combines these architectures. It takes multiple distinct SMILES strings of the same molecule as input, and applies RNNs to them in parallel. This implicitly realizes a representative set of message passing pathways through the molecular graph, corresponding to the depthfirst preorder traversals of the spanning trees underlying the SMILES strings. Between each layer of RNNs, the encoder pools homologous messages between parallel representations, so that information flows along the union of the implicit SMILES pathways.
The characters of the multiple SMILES strings are linearly embedded, and each string is preprocessed by a BiGRU Cho et al. (2014), followed by a linear transformation, to produce the initial hidden representation for each SMILES string . The encoder then applies a stack of modules, each of which pools between homologous atoms in the parallel representations, followed by layer norm, concatenation with the linearly embedded SMILES input, and a GRU applied to the parallel representations independently, as shown in Figures 2 and 3. Each such parallel representation comprises a sequence of vectors, one for each character in the original SMILES string.
Multiple SMILES strings representing a single molecule need not have the same length, and syntactic characters indicating branching and ring closures rather than atoms and bonds do not generally match. However, the set of atoms is always consistent, and a bijection can be defined between homologous atom characters. At the beginning of each encoder module (Figure 2), the parallel inputs corresponding to a single, common atom of the original molecule are pooled, as shown in Figure 3. This harmonized representation of the atom replaces the original in each of the input streams for the subsequent layer normalizations and GRUs, reversing the information flow of Figure 3. While we experimented with average and max pooling, we found elementwise sigmoid gating to be most effective Dauphin et al. (2017); Li et al. (2015); Ryu et al. (2018): , where is the concatenation of vectors and , and the logistic function is applied elementwise. The pooling effectively sums messages propagated from many adjacent nodes in the molecular graph, analogous to a graph convolution, but the GRUs efficiently transfer information through many edges in each layer, rather than just one. The hidden representations associated with nonatom, syntactic input characters, such as parentheses and digits, are left unchanged by the pooling operation.
The approximating posterior distills the resulting variablelength encodings into a fixedlength hierarchy of autoregressive Gaussian distributions Rolfe (2016). The mean and logvariance of the first layer of the approximating posterior, , is parametrized by maxpooling the terminal hidden states of the final encoder GRUs, followed by batch renormalization Ioffe (2017) and a linear transformation, as shown in Figure 4. Succeeding hierarchical layers use Bahdanaustyle attention Bahdanau et al. (2014) (reviewed in Appendix B.2) over the pooled final atom vectors, with the query vector defined by a onehiddenlayer network of rectified linear units (ReLUs) given the concatenation of the previous latent layers as input. This is analogous to the orderinvariant encoding of set2set, but an output is produced at each step, and processing is not gated Vinyals et al. (2015). The attentional mechanism is also effectively available to property regressors that take the fixedlength latent representation as input, allowing them to aggregate contributions from across the molecule. The output of the attentional mechanism is subject to batch renormalization and a linear transformation to compute the conditional mean and logvariance of the layer. The prior has a similar autoregressive structure, but uses neural networks of ReLUs in place of Bahdanaustyle attention, since it does not have access to the atom vectors. For molecular optimization tasks, we usually scale up the term in the ELBO by the number of SMILES strings in the decoder, analogous to multiple singleSMILES VAEs in parallel; we leave this KL term unscaled for property prediction.
The decoder is a singlelayer LSTM, for which the initial cell state is computed from the latent representation by a neural network, and a linear transformation of the latent representation is concatenated onto each input. It is trained with teacher forcing to reconstruct a set of SMILES strings disjoint from those provided to the encoder, but representing the same molecule. Grammatical constraints Dai et al. (2018); Kusner et al. (2017) can naturally be enforced within this LSTM by parsing the unfolding character sequence with a pushdown automaton, and constraining the final softmax of the LSTM output at each time step to grammatically valid symbols. This is detailed in Appendix D, although we leave the exploration of this technique to future work.
Since the SMILES inputs to the encoder are different from the targets of the decoder, the decoder is effectively trained to assign equal probability to all SMILES strings of the encoded molecule. The latent representation must capture the molecule as a whole, rather than any particular SMILES input to the encoder. To accommodate this intentionally difficult reconstruction task, facilitate the construction of a bijection between latent space and molecules, and following prior work Kang & Cho (2018); Winter et al. (2019a), we use a width5 beam search decoder to map from the latent representation to the space of molecules at testtime. Further architectural details are presented in Appendix B.
3.1 Latent space optimization
Unlike many models that apply a sparse Gaussian process to fixed latent representations to predict molecular properties Dai et al. (2018); Jin et al. (2018); Kusner et al. (2017); Samanta et al. (2018), the All SMILES VAE jointly trains property regressors with the generative model Liu et al. (2018).^{2}^{2}2GómezBombarelli et al. (2018) jointly train a regressor, but still optimize using a Gaussian process. We use linear regressors for the log octanolwater partition coefficient (logP) and molecular weight (MW), which have unbounded values; and logistic regressors for the quantitative estimate of druglikeness (QED) Bickerton et al. (2012) and twelve binary measures of toxicity Huang et al. (2016); Mayr et al. (2016), which take values in . We then perform gradientbased optimization of the property of interest with respect to the latent space, and decode the result to produce an optimized molecule.
Naively, we might either optimize the predicted property without constraints on the latent space, or find the maximum a posteriori (MAP) latent point for a conditional likelihood over the property that assigns greater probability to more desirable values. However, the property regressors and decoder are only accurate within the domain in which they have been trained: the region assigned high probability mass by the prior. For a dimensional standard Gaussian prior, almost all probability mass lies in a practical support comprising a thin spherical shell of radius (Blum et al., 2017, Gaussian Annulus Theorem). With linear or logistic regressors, predicted property values increase monotonically in the direction of the weight vector, so unconstrained property maximization diverges from the origin of the latent space. Conversely, MAP optimization with a Gaussian prior is pulled towards the origin, where the density of the prior is greatest. Both unconstrained and MAP optimization thus deviate from the practical support in each layer of the hierarchical prior, resulting in large prediction errors and poor optimization.
We can use the reparametrization trick Kingma & Welling (2013); Rezende et al. (2014) to map our autoregressive prior back to a standard Gaussian. The image of the thin spherical shell through this reparametrization still contains almost all of the probability mass. We therefore constrain optimization to the reparametrized dimensional sphere of radius for each dimensional layer of the hierarchical prior by optimizing the angle directly.^{3}^{3}3This is generalizes the slerp interpolations of GómezBombarelli et al. (2018) to optimization. Although the reparametrization from the standard Gaussian prior to our autoregressive prior is not volume preserving, this hierarchical radius constraint holds us to the center of the image of the thin spherical shell. The distance to which the image of the thin spherical shell extends away from the dimensional sphere at its center is a highly nonlinear function of the previous layers.
4 Results
We evaluate the All SMILES VAE on standard 250,000 and 310,000 element subsets GómezBombarelli et al. (2018); Kang & Cho (2018) of the ZINC database of small organic molecules Irwin et al. (2012); Sterling & Irwin (2015). We also evaluate on the Tox21 dataset Huang et al. (2016); Mayr et al. (2016) in the DeepChem package Wu et al. (2018), comprising binarized binding affinities of 7831 compounds against 12 proteins. For further details, see Appendix A. Additional experiments, including ablations of novel model components, are described in Appendix C.
The full power of continuous, gradientbased optimization can be brought bear on molecular properties given a bijection between molecules and contractible regions of a latent space, along with a regressor from the latent space to the property of interest that is differentiable almost everywhere. Such a bijection is challenging to confirm, since it is difficult to find the full latent space preimage of a molecule implicitly defined by a mapping from latent space to SMILES strings, such as our beam search decoder. As a necessary condition, we confirm that it is possible to map from the space of molecules to latent space and back again, and that random samples from the prior distribution in the latent space map to valid molecules. The former is required for injectivity, and the latter for surjectivity, of the mapping from molecules to contractible regions of the latent space.
Using the approximating posterior as the encoder, but always selecting the mean of each conditional Gaussian distribution (the maximum conditional a posteriori point), and a using beam search over the conditional likelihood as the decoder, 87.4% 1% of a heldout test set of ZINC250k (80/10/10 train/val/test split) is reconstructed accurately. With the same beam search decoder, 98.5% 0.1% of samples from the prior decode to valid SMILES strings. We expect that enforcing grammatical constraints in the decoder LSTM, as described in Appendix D, would further increase these rates. All molecules decoded from a set of 50,000 independent samples from the prior were unique, 99.958% were novel relative to the training dataset, and their average synthetic accessibility score Ertl & Schuffenhauer (2009) was , compared to in the ZINC250k dataset used for training.
4.1 Property prediction
Ultimately, we would like to optimize molecules for complicated physical properties, such as fluorescence quantum yield, solar cell efficiency, binding affinity to selected receptors, and low toxicity. Networks can only be trained to predict such physical properties if their true values are known on an appropriate training dataset. While proxy properties can be accurately computed from first principles, properties like drug efficacy arise from highly nonlinear, poorly characterized processes, and can only be accurately determined through timeconsuming and expensive experimental measurements. Since such experiments can only be performed on a tiny fraction of the druglike molecules, we evaluate the ability of the All SMILES VAE to perform semisupervised property prediction.
As Figure 5 and Table 4 in Appendix C demonstrate, we significantly improve the stateoftheart in the semisupervised prediction of simple molecular properties, including the log octanolwater partition coefficient (logP), molecular weight (MW), and quantitative estimate of druglikeness (QED) Bickerton et al. (2012), against which many algorithms have been benchmarked. We achieve a similar improvement in fully supervised property prediction, as shown in Table 1, where we compare to extended connectivity fingerprints (ECFP) Rogers & Hahn (2010), the character VAE (CVAE) GómezBombarelli et al. (2018), and graph convolutions Duvenaud et al. (2015). Table 3 in Appendix C documents an even larger improvement compared to models that use a sparse Gaussian process for property prediction. We also surpass the stateoftheart in toxicity prediction on the Tox21 dataset Huang et al. (2016); Mayr et al. (2016), as shown in Table 1, despite refraining from ensembling our model, or engineering features using expert chemistry knowledge, as in previous stateoftheart methods Zaslavskiy et al. (2019).
Accurate property prediction only facilitates effective optimization if the true property value is smooth with respect to the latent space. In Figure (a)a, we plot the true (not predicted) logP over a densely sampled 2D slice of the latent space, where the y axis is aligned with the logP linear regressor.
4.2 Molecular optimization
We maximize the output of our linear and logistic property regressors, plus a logprior regularizer, with respect to the latent space, subject to a hierarchical radius constraint. After optimizing in the latent space with ADAM, we project back to a SMILES representation of a molecule with the decoder. Following prior work, we optimize QED and logP penalized by the synthetic accessibility score and the number of large rings Dai et al. (2018); Jin et al. (2018); Kusner et al. (2017); Samanta et al. (2018); You et al. (2018); Zhou et al. (2018). Figure (b)b depicts the predicted and true logP over an optimization trajectory, while Table 2 compares the top three values found amongst 100 such trajectories to the previous stateoftheart.^{4}^{4}4Recently, Winter et al. (2019b) reported molecules with penalized logP as large as 26.1, but train on an enormous, nonstandard dataset of 72 million compounds aggregated from the ZINC15 and PubChem databases. The molecules realizing these property values are shown in Figure 7. Leaving the KL term in the ELBO unscaled by the number of SMILES strings in the decoder reduces the regularization of the latent space embeddings, allowing latent space optimization to search a wider space of molecules that are less similar to the training set, as shown in Figure 14 of Appendix C.3. Unlike reinforcement learning methods that progressively evaluate the properties of novel molecules generated during training, including Graph Convolutional Policy Networks (GCPN) and Molecule Deep QNetworks (MolDQN) You et al. (2018); Zhou et al. (2018), the All SMILES VAE only requires a fixed training dataset. This is critical when optimizing properties, including pharmacological efficacy, toxicity, and OLED efficiency, for which accurate in silico approximations do not exist.


5 Conclusion
For each molecule, the All SMILES encoder uses stacked, pooled RNNs on multiple SMILES strings to efficiently pass information throughout the molecular graph. The decoder targets a disjoint set of SMILES strings of the same molecule, forcing the latent space to develop a consistent representation for each molecule. Attentional mechanisms in the approximating posterior summarize spatially diffuse features into a fixedlength, nonfactorial approximating posterior, and construct a latent representation on which linear regressors achieve stateoftheart semi and fullysupervised property prediction. Gradientbased optimization of these regressor outputs with respect to the latent representation, constrained to a subspace near almost all probability in the prior, produces stateoftheart optimized molecules when coupled with a simple RNN decoder.
Acknowledgements
References
 Appleyard et al. (2016) Appleyard, J., Kocisky, T., and Blunsom, P. Optimizing performance of recurrent neural networks on gpus. arXiv preprint arXiv:1604.01946, 2016.
 AumentadoArmstrong (2018) AumentadoArmstrong, T. Latent molecular optimization for targeted therapeutic design. arXiv preprint arXiv:1809.02032, 2018.
 Bahdanau et al. (2014) Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
 Bickerton et al. (2012) Bickerton, G. R., Paolini, G. V., Besnard, J., Muresan, S., and Hopkins, A. L. Quantifying the chemical beauty of drugs. Nature chemistry, 4(2):90, 2012.
 Bjerrum & Sattarov (2018) Bjerrum, E. and Sattarov, B. Improving chemical autoencoder latent space and molecular de novo generation diversity with heteroencoders. Biomolecules, 8(4):131, 2018.
 Blum et al. (2017) Blum, A., Hopcroft, J., and Kannan, R. Foundations of Data Science. June 2017. URL https://www.microsoft.com/enus/research/publication/foundationsofdatascience2/.
 Cho et al. (2014) Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using rnn encoderdecoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
 Clayden et al. (2001) Clayden, J., Greeves, N., Warren, S., and Wothers, P. Organic chemistry. Oxford University Press, 2001.
 Dai et al. (2018) Dai, H., Tian, Y., Dai, B., Skiena, S., and Song, L. Syntaxdirected variational autoencoder for structured data. arXiv preprint arXiv:1802.08786, 2018.
 Dauphin et al. (2017) Dauphin, Y. N., Fan, A., Auli, M., and Grangier, D. Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 933–941. JMLR. org, 2017.
 De Cao & Kipf (2018) De Cao, N. and Kipf, T. Molgan: An implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973, 2018.
 Duvenaud et al. (2015) Duvenaud, D. K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., AspuruGuzik, A., and Adams, R. P. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pp. 2224–2232, 2015.
 Ertl & Schuffenhauer (2009) Ertl, P. and Schuffenhauer, A. Estimation of synthetic accessibility score of druglike molecules based on molecular complexity and fragment contributions. Journal of cheminformatics, 1(1):8, 2009.
 Feinberg et al. (2018) Feinberg, E. N., Sur, D., Wu, Z., Husic, B. E., Mai, H., Li, Y., Sun, S., Yang, J., Ramsundar, B., and Pande, V. S. Potentialnet for molecular property prediction. ACS central science, 4(11):1520–1530, 2018.
 Gilmer et al. (2017) Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 1263–1272. JMLR. org, 2017.
 GómezBombarelli et al. (2018) GómezBombarelli, R., Wei, J. N., Duvenaud, D., HernándezLobato, J. M., SánchezLengeling, B., Sheberla, D., AguileraIparraguirre, J., Hirzel, T. D., Adams, R. P., and AspuruGuzik, A. Automatic chemical design using a datadriven continuous representation of molecules. ACS central science, 4(2):268–276, 2018.
 Guimaraes et al. (2017) Guimaraes, G. L., SanchezLengeling, B., Outeiral, C., Farias, P. L. C., and AspuruGuzik, A. Objectivereinforced generative adversarial networks (organ) for sequence generation models. arXiv preprint arXiv:1705.10843, 2017.
 Gupta et al. (2018) Gupta, A., Müller, A. T., Huisman, B. J., Fuchs, J. A., Schneider, P., and Schneider, G. Generative recurrent networks for de novo drug design. Molecular informatics, 37(12):1700111, 2018.
 Hammond et al. (2011) Hammond, D. K., Vandergheynst, P., and Gribonval, R. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011.
 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
 Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long shortterm memory. Neural computation, 9(8):1735–1780, 1997.
 Huang et al. (2016) Huang, R., Xia, M., Nguyen, D.T., Zhao, T., Sakamuru, S., Zhao, J., Shahane, S. A., Rossoshek, A., and Simeonov, A. Tox21challenge to build predictive models of nuclear receptor and stress response pathways as mediated by exposure to environmental chemicals and drugs. Frontiers in Environmental Science, 3:85, 2016.
 Im et al. (2017) Im, Y., Kim, M., Cho, Y. J., Seo, J.A., Yook, K. S., and Lee, J. Y. Molecular design strategy of organic thermally activated delayed fluorescence emitters. Chemistry of Materials, 29(5):1946–1963, 2017.
 Ioffe (2017) Ioffe, S. Batch renormalization: Towards reducing minibatch dependence in batchnormalized models. In Advances in neural information processing systems, pp. 1945–1953, 2017.
 Irwin et al. (2012) Irwin, J. J., Sterling, T., Mysinger, M. M., Bolstad, E. S., and Coleman, R. G. Zinc: a free tool to discover chemistry for biology. Journal of chemical information and modeling, 52(7):1757–1768, 2012.
 Jaques et al. (2017) Jaques, N., Gu, S., Bahdanau, D., HernándezLobato, J. M., Turner, R. E., and Eck, D. Sequence tutor: Conservative finetuning of sequence generation models with klcontrol. In Proceedings of the 34th International Conference on Machine Learning, pp. 1645–1654. JMLR.org, 2017.
 Jin et al. (2018) Jin, W., Barzilay, R., and Jaakkola, T. Junction tree variational autoencoder for molecular graph generation. arXiv preprint arXiv:1802.04364, 2018.
 Kang & Cho (2018) Kang, S. and Cho, K. Conditional molecular design with deep generative models. arXiv preprint arXiv:1805.00108, 2018.
 Kearnes et al. (2016) Kearnes, S., McCloskey, K., Berndl, M., Pande, V., and Riley, P. Molecular graph convolutions: moving beyond fingerprints. Journal of computeraided molecular design, 30(8):595–608, 2016.
 Kim et al. (2018) Kim, S., Chen, J., Cheng, T., Gindulyte, A., He, J., He, S., Li, Q., Shoemaker, B. A., Thiessen, P. A., Yu, B., et al. Pubchem 2019 update: improved access to chemical data. Nucleic acids research, 47(D1):D1102–D1109, 2018.
 Kingma & Welling (2013) Kingma, D. P. and Welling, M. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 Kipf & Welling (2016a) Kipf, T. N. and Welling, M. Semisupervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016a.
 Kipf & Welling (2016b) Kipf, T. N. and Welling, M. Variational graph autoencoders. arXiv preprint arXiv:1611.07308, 2016b.
 Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
 Kusner et al. (2017) Kusner, M. J., Paige, B., and HernándezLobato, J. M. Grammar variational autoencoder. arXiv preprint arXiv:1703.01925, 2017.
 Landrum et al. (2006) Landrum, G. et al. Rdkit: Opensource cheminformatics, 2006.
 LeCun et al. (1990) LeCun, Y., Boser, B. E., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W. E., and Jackel, L. D. Handwritten digit recognition with a backpropagation network. In Advances in neural information processing systems, pp. 396–404, 1990.
 Li et al. (2017) Li, J., Cai, D., and He, X. Learning graphlevel representation for drug discovery. arXiv preprint arXiv:1709.03741, 2017.
 Li et al. (2015) Li, Y., Tarlow, D., Brockschmidt, M., and Zemel, R. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.
 Li et al. (2018) Li, Y., Vinyals, O., Dyer, C., Pascanu, R., and Battaglia, P. Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324, 2018.
 Lim et al. (2018) Lim, J., Ryu, S., Kim, J. W., and Kim, W. Y. Molecular generative model based on conditional variational autoencoder for de novo molecular design. Journal of cheminformatics, 10(1):31, 2018.
 Liu et al. (2018) Liu, Q., Allamanis, M., Brockschmidt, M., and Gaunt, A. L. Constrained graph variational autoencoders for molecule design. arXiv preprint arXiv:1805.09076, 2018.
 Lusci et al. (2013) Lusci, A., Pollastri, G., and Baldi, P. Deep architectures and deep learning in chemoinformatics: the prediction of aqueous solubility for druglike molecules. Journal of chemical information and modeling, 53(7):1563–1575, 2013.
 Ma et al. (2018) Ma, T., Chen, J., and Xiao, C. Constrained generation of semantically valid graphs via regularizing variational autoencoders. In Advances in Neural Information Processing Systems, pp. 7113–7124, 2018.
 Mayr et al. (2016) Mayr, A., Klambauer, G., Unterthiner, T., and Hochreiter, S. Deeptox: toxicity prediction using deep learning. Frontiers in Environmental Science, 3:80, 2016.
 Mueller et al. (2017) Mueller, J., Gifford, D., and Jaakkola, T. Sequence to better sequence: continuous revision of combinatorial structures. In International Conference on Machine Learning, pp. 2536–2544, 2017.
 Olivecrona et al. (2017) Olivecrona, M., Blaschke, T., Engkvist, O., and Chen, H. Molecular denovo design through deep reinforcement learning. Journal of cheminformatics, 9(1):48, 2017.
 Popova et al. (2018) Popova, M., Isayev, O., and Tropsha, A. Deep reinforcement learning for de novo drug design. Science advances, 4(7):eaap7885, 2018.
 Putin et al. (2018) Putin, E., Asadulaev, A., Ivanenkov, Y., Aladinskiy, V., SanchezLengeling, B., AspuruGuzik, A., and Zhavoronkov, A. Reinforced adversarial neural computer for de novo molecular design. Journal of chemical information and modeling, 58(6):1194–1204, 2018.
 PyzerKnapp et al. (2015) PyzerKnapp, E. O., Suh, C., GómezBombarelli, R., AguileraIparraguirre, J., and AspuruGuzik, A. What is highthroughput virtual screening? a perspective from organic materials discovery. Annual Review of Materials Research, 45:195–216, 2015.
 Reymond et al. (2010) Reymond, J.L., Van Deursen, R., Blum, L. C., and Ruddigkeit, L. Chemical space as a source for new drugs. MedChemComm, 1(1):30–38, 2010.
 Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pp. 1278–1286, 2014.
 Rogers & Hahn (2010) Rogers, D. and Hahn, M. Extendedconnectivity fingerprints. Journal of chemical information and modeling, 50(5):742–754, 2010.
 Rolfe (2016) Rolfe, J. T. Discrete variational autoencoders. arXiv preprint arXiv:1609.02200, 2016.
 Ryu et al. (2018) Ryu, S., Lim, J., and Kim, W. Y. Deeply learning molecular structureproperty relationships using graph attention neural network. arXiv preprint arXiv:1805.10988, 2018.
 Samanta et al. (2018) Samanta, B., De, A., Jana, G., Chattaraj, P. K., Ganguly, N., and GomezRodriguez, M. NeVAE: a deep generative model for molecular graphs. arXiv preprint arXiv:1802.05283, 2018.
 SanchezLengeling & AspuruGuzik (2018) SanchezLengeling, B. and AspuruGuzik, A. Inverse molecular design using machine learning: Generative models for matter engineering. Science, 361(6400):360–365, 2018.
 Segler et al. (2017) Segler, M. H., Kogej, T., Tyrchan, C., and Waller, M. P. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS central science, 4(1):120–131, 2017.
 Simonovsky & Komodakis (2018) Simonovsky, M. and Komodakis, N. Graphvae: Towards generation of small graphs using variational autoencoders. In International Conference on Artificial Neural Networks, pp. 412–422. Springer, 2018.
 Snelson & Ghahramani (2006) Snelson, E. and Ghahramani, Z. Sparse gaussian processes using pseudoinputs. In Advances in neural information processing systems, pp. 1257–1264, 2006.
 Sterling & Irwin (2015) Sterling, T. and Irwin, J. J. Zinc 15–ligand discovery for everyone. Journal of chemical information and modeling, 55(11):2324–2337, 2015.
 Stumpfe & Bajorath (2012) Stumpfe, D. and Bajorath, J. Exploring activity cliffs in medicinal chemistry: miniperspective. Journal of medicinal chemistry, 55(7):2932–2942, 2012.
 Sutskever et al. (2014) Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112, 2014.
 Szegedy et al. (2016) Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826, 2016.
 Vinyals et al. (2015) Vinyals, O., Bengio, S., and Kudlur, M. Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391, 2015.
 Weininger (1988) Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of chemical information and computer sciences, 28(1):31–36, 1988.
 Winter et al. (2019a) Winter, R., Montanari, F., Noé, F., and Clevert, D.A. Learning continuous and datadriven molecular descriptors by translating equivalent chemical representations. Chemical Science, 2019a.
 Winter et al. (2019b) Winter, R., Montanari, F., Steffen, A., Briem, H., Noe, F., and Clevert, D. Efficient multiobjective molecular optimization in a continuous latent space. 2019b.
 Wu et al. (2018) Wu, Z., Ramsundar, B., Feinberg, E. N., Gomes, J., Geniesse, C., Pappu, A. S., Leswing, K., and Pande, V. MoleculeNet: A benchmark for molecular machine learning. Chemical science, 9(2):513–530, 2018.
 You et al. (2018) You, J., Liu, B., Ying, R., Pande, V., and Leskovec, J. Graph convolutional policy network for goaldirected molecular graph generation. arXiv preprint arXiv:1806.02473, 2018.
 Zaslavskiy et al. (2019) Zaslavskiy, M., Jégou, S., Tramel, E. W., and Wainrib, G. Toxicblend: Virtual screening of toxic compounds with ensemble predictors. Computational Toxicology, 10:81–88, 2019.
 Zhou et al. (2018) Zhou, Z., Kearnes, S., Li, L., Zare, R. N., and Riley, P. Optimization of molecules via deep reinforcement learning. arXiv preprint arXiv:1810.08678, 2018.
Appendix A Datasets
SMILES strings, as well as the true values of the log octanolwater partition coefficient (logP), molecular weight (MW), and the quantitative estimate of druglikeness (QED), are computed using RDKit Landrum et al. (2006).
(a) COCOC1CNCC(C)N1  (b) CCCCC(CCCC)CCCC 
CC1NC(CNC1)OCOC 
a.1 Zinc
For molecular property optimization and fully supervised property prediction, we train the All SMILES VAE on the ZINC250k dataset of 250,000 organic molecules with between 6 and 38 heavy atoms, and penalized logPs^{5}^{5}5The log octanolwater partition coefficient minus the synthetic accessibility score and the number of rings with more than six atoms. from 13 to 5 GómezBombarelli et al. (2018). This dataset is curated from the full ZINC12 dataset Irwin et al. (2012), and available from https://github.com/aspuruguzikgroup/chemical_vae. The distribution of molecular diameters in ZINC250k is shown in Figure 9.
For semisupervised property prediction on logP, MW, and QED, we train on the ZINC310k dataset of 310,000 organic molecules with between 6 and 38 heavy atoms Kang & Cho (2018). This dataset is curated from the full ZINC15 dataset Sterling & Irwin (2015), and available from https://github.com/nyudl/conditionalmoleculardesignssvae.
a.2 Tox21
For the semisupervised prediction of twelve forms of toxicity, we train on the Tox21 dataset Huang et al. (2016); Mayr et al. (2016), accessed through the DeepChem package Wu et al. (2018), with the provided random train/validation/test set split. This dataset contains binarized binding affinities against up to 12 proteins for 6264 training, 783 validation, and 784 test molecules. Tox21 contains molecules with up to 140 atoms, ranging from large peptides to lanthanide, actinide and other metals. Many of these metal atoms are not present in any of the standard molecular generative modeling datasets, and there are metal atoms in the validation and test set that never appear in the training set. To address these difficulties, we curated an unsupervised dataset of 1.5 million molecules from the PubChem database Kim et al. (2018), which we will make available upon publication. To maintain commensurability with prior work, this additional unsupervised dataset is only used on the Tox21 prediction task.
Appendix B Extended model architecture
The full All SMILES VAE architecture is summarized in Figure 10. The evidence lower bound (ELBO) of the loglikelihood,
(1) 
is the sum of the conditional loglikelihoods of in Figure 10, minus the KullbackLeibler divergence between the approximating posterior, , computed by node AP in Figure 10, and the prior depicted in Figure 11.
In all experiments, we use encoder stacks of depth three, with 512 hidden units in each GRU. The approximating posterior uses four layers of hierarchy, with 128 hidden units in the onehiddenlayer neural network that computes the attentional query vector. In practice, separate GRUs were used to produce the final hidden state for and the atom representations for . The singlelayer LSTM decoder has 2048 hidden units. Training was performed using ADAM, with a decaying learning rate and KL annealing. In all multiple SMILES strings architectures, we use 5 SMILES strings for encoding and decoding.
In contrast to many previous molecular VAEs, we do not scale down the term in the ELBO by the number of latent units Dai et al. (2018); Kusner et al. (2017). However, our loss function does include separate reconstructions for multiple SMILES strings of a single molecule. For molecular optimization tasks, we usually scale up this KL term by the number of SMILES strings in the decoder, analogous to multiple singleSMILES VAEs in parallel; we leave the KL term unscaled for property prediction.
b.1 Gated recurrent neural networks
Convolutional neural networks on images Krizhevsky et al. (2012); LeCun et al. (1990) leverage the inherent geometry of the visual domain to perform local message passing. At every spatial location of each layer, a convolutional network computes a message comprising the weighted sum of messages from the surrounding region in the preceding layer, followed by a pointwise nonlinearity. Each round of messages propagates information a distance equal to the convolutional kernel diameter multiplied by the current spatial resolution.
Recurrent neural networks, such as long shortterm memories (LSTMs) Hochreiter & Schmidhuber (1997) and gated recurrent units (GRUs) Cho et al. (2014), model text, audio, and other onedimensional sequences in an analogous manner. The kernel, comprising the weights on the previous hidden state and the current input, has a width of only two. However, the messages (i.e., the hidden states) are updated consecutively along the sequence, so information can propagate through the entire network in a single pass, substantially reducing the number of layers required. LSTMs and GRUs are ubiquitous in natural language processing tasks, and efficient GPU implementations have been developed Appleyard et al. (2016).
Gated recurrent units (GRUs) are defined by Cho et al. (2014):
where , , and are rowvectors, denotes the columnwise concatenation of and , and the logistic function and hyperbolic tangent are applied elementwise to vector argument . The hidden state , comprising the message from node , is a gated, weighted sum of the previous message and the current input , both subject to an elementwise linear transformation and nonlinear (sigmoid) transformation. Specifically, the sum of the message from the input, , and the gated message from the previous node, , is subject to a linear transformation and a pointwise nonlinearity. This is then gated and added to a gated residual connection from the previous node.
Long shortterm memories (LSTMs) are defined similarly Hochreiter & Schmidhuber (1997):
where is the forget gate, is the input gate, and is the output gate. LSTMs impose a second hyperbolic tangent and gating unit on the nonlinear recurrent message, but nevertheless still follow the form of applying widthtwo kernels and pointwise nonlinearities to the input and hidden state.
In contrast, message passing in graphs is defined by Duvenaud et al. (2015); Kearnes et al. (2016); Kipf & Welling (2016a); Li et al. (2015):
where is the set of neighbors of node , for which there is an edge between and , and is a pointwise nonlinearity such as a logistic function or rectified linear unit. This message passing, also called graph convolutions, can be understood as a firstorder approximation to spectral convolutions on graphs Hammond et al. (2011). Kipf & Welling (2016a) additionally normalize each message by the square root of the degree of each node before and after the sum over neighboring nodes. Kearnes et al. (2016) maintain separate messages for nodes and edges, with the neighborhood of a node comprising the connected edges and the neighborhood of an edge comprising the connected nodes. Li et al. (2015) add gating analogous to a GRU.
An LSTM, taking a SMILES string as input, can realize a subset of the messages passed by graph convolutions. For instance, input gates and forget gates can conspire to ignore openparentheses, which indicate the beginning of a branch of the depthfirst spanning tree traversal. If they similarly ignore the digits that close broken rings, the messages along each branch of the flattened spanning tree are not affected by the extraneous SMILES syntax. Input and forget gates can then reset the LSTM’s memory at closeparentheses, which indicate the end of a branch of the depthfirst spanning tree traversal, and the return to a previous node, ensuring that messages only propagate along connected paths in the molecular graph.
A set of LSTMs on multiple SMILES strings of a single molecule, with messages exchanged between the LSTMs, can generate all of the messages produced by a graph convolution. Atombased pooling between LSTMs on multiple SMILES strings of the same molecule combines the messages produced in each flattened spanning tree, allowing every LSTM to access all messages produced by a graph convolution. While an LSTM decoder generating SMILES strings faces ambiguity regarding which set of SMILES strings representing a molecule to produce, this is analogous to the problem faced by graphbased decoders, as discussed in Appendix D.2
b.2 Bahdanaustyle attention
The layers of the hierarchical approximating posterior after the first define a conditional Gaussian distribution, the mean and logvariance of which are parametrized by an attentional mechanism of the form proposed by Bahdanau et al. (2014). The final encoder hidden vectors for each atom comprise the key vectors , whereas the query vector is computed by a onehiddenlayer network of rectified linear units given the concatenation of the previous latent layers as input. The final output of the attentional mechanism, , is computed via:
The output of the attentional mechanism is subject to batch renormalization and a linear transformation to compute the conditional mean and logvariance of the layer.
b.3 Latent space optimization
To further ensure that the optimization is constrained to welltrained regions of latent space, we add to the objective function, where is a hyperparameter. Finally, to moderate the strictly monotonic nature of linear regressors, we apply an elementwise hard tanh to all latent variables before the regressor, with a linear region that encompasses all values observed in the training set.
To compare with previous work as fairly as possible, we optimize 1000 random samples from the prior to convergence, collecting the last point from each trajectory with a valid SMILES decoding. From these 1000 points, we evaluate the true molecular property on the 100 points for which the predicted property value is largest. Of these 100 values, we report the three largest. However, optimization within our latent space is computationally inexpensive, and requires no additional property measurement data. We could somewhat improve molecular optimization at minimal expense by constructing additional optimization trajectories in latent space, and evaluating the true molecular properties on the best points from this larger set.
Molecular optimization is quite robust to hyperparameters. We considered ADAM learning rates in and .
b.4 Summary of novel contributions
Starting with the work of GómezBombarelli et al. (2018), previous molecular variational autoencoders have used one consistent SMILES string as both the input to the RNN encoder and the target of the RNN decoder. Any single SMILES string explicitly represents only a subset of the pathways in the molecular graph. Correspondingly, the recurrent neural networks in these encoders implicitly propagated information through only a fraction of the possible pathways. Kipf & Welling (2016b), Liu et al. (2018), and Simonovsky & Komodakis (2018), amongst others, trained molecular VAEs with graph convolutional encoders, which pass information through all graph pathways in parallel, but at considerable computational expense. None of these works used enough layers of graph convolutions to transfer information across the diameter of the average molecule in standard drug design datasets. The All SMILES VAE introduces the use of multiple SMILES strings of a single, common molecule as input to a RNN encoder, with pooling of homologous messages amongst the hidden representations associated with different SMILES strings. This allows information to flow through all pathways of the molecular graph, but can efficiently propagate information across the entire width of the molecule in a single layer.
Bjerrum & Sattarov (2018) and Winter et al. (2019a) trained sequencetosequence transcoders to map between different SMILES strings of the same molecule. These transcoders do not define an explicit generative model over molecules, and their latent spaces have no prior distributions. The All SMILES VAE extends this approach to variational autoencoders, and thereby learns a SMILESderived generative model of molecules, rather than SMILES strings. The powerful, learned, hierarchical prior of the All SMILES VAE regularizes molecular optimization and property prediction. To ensure that molecular property optimization searches within the practical support of the prior, containing almost all of its probability mass, we introduce a hierarchical radius constraint on optimization with respect to the latent space.
Appendix C Extended results
We compare the performance of the All SMILES VAE to a variety of stateoftheart algorithms that have been evaluated on standard molecular property prediction and optimization tasks. In particular, we compare to previously published results on the character/chemical VAE (CVAE) GómezBombarelli et al. (2018) (with results reported in Kusner et al. (2017)), grammar VAE (GVAE) Kusner et al. (2017), syntaxdirected VAE (SDVAE) Dai et al. (2018), junction tree VAE (JTVAE) Jin et al. (2018), NeVAE Samanta et al. (2018), semisupervised VAE (SSVAE) Kang & Cho (2018), graph convolutional policy network (GCPN) You et al. (2018), molecule deep Qnetwork (MolDQN) Zhou et al. (2018), and the DeepChem Wu et al. (2018) implementation of extended connectivity fingerprints (ECFP) Rogers & Hahn (2010) and graph convolutions (GraphConv) Duvenaud et al. (2015); Kearnes et al. (2016); Wu et al. (2018). Extended connectivity fingerprints are a fixedlength hash of local fragments of the molecule, used as input to conventional machine learning techniques such as random forests, support vector machines, and nonconvolutional neural networks Wu et al. (2018). For toxicity prediction, we also compare to PotentialNet Feinberg et al. (2018), ToxicBlend Zaslavskiy et al. (2019), and the results of Li et al. (2017).
c.1 Reconstruction accuracy and validity
Previous molecular variational autoencoders have been evaluated using the percentage of molecules that are correctly reconstructed when sampling from both the approximating posterior and the conditional likelihood (reconstruction accuracy), and the percentage of samples from the prior and conditional likelihood that are valid SMILES strings (validity). While these measure have intuitive appeal, they reflect neither the explicit training objective (the ELBO), nor the requirements of molecular optimization. In particular, when optimizing molecules via the latent space, a deterministic decoder ensures that each point in latent space is associated with a single set of welldefined molecular properties.
The All SMILES VAE is trained on a more difficult task than previous molecular VAEs, since the reconstruction targets are different SMILES encodings than those input to the approximating posterior. This ensures that the latent representation only captures the molecule, rather than its particular SMILES encoding, but it requires the decoder LSTM to produce a complex, highly multimodal distribution over SMILES strings. As a result, samples from the decoder distribution are less likely to correspond to the input to the encoder, either due to syntactic or semantic errors.
To compensate for this unusually difficult decoding task, we evaluate the All SMILES VAE using a beam search over the decoder distribution.^{6}^{6}6The full decoder distribution is still used for training. That is, we decode to the single SMILES string estimated to be most probable under the conditional likelihood . This has the added benefit of defining an unambiguous decoding for every point in the latent space, simplifying the interpretation of optimization in the latent space (Section 4.2). However, it renders the reconstruction and validity reported in Section 4 incommensurable with much prior work, which use stochastic encoders and decoders.
c.2 Property prediction
Rather than jointly modeling the space of molecules and their properties, some earlier molecular variational autoencoders first trained an unsupervised VAE on molecules, extracted their latent representations, and then trained a sparse Gaussian process over molecular properties as a function of these fixed latent representations Dai et al. (2018); Jin et al. (2018); Kusner et al. (2017); Samanta et al. (2018). Sparse Gaussian processes are parametric regressors, with the location and value of the inducing points trained based upon the entire supervised dataset Snelson & Ghahramani (2006). They have significantly more parameters, and are corresponding more powerful, than linear regressors.
Molecular properties are only a smooth function of the VAE latent space when the property regressor is trained jointly with the generative model GómezBombarelli et al. (2018). Results using a sparse Gaussian process on the latent space of an unsupervised VAE are very poor compared to less powerful regressors trained jointly with the VAE. Our property prediction is two orders of magnitude more accurate than sparse Gaussian process regression on an unsupervised VAE latent representation, as shown in Table 3.
Model  RMSE 

Character VAE (CVAE) GómezBombarelli et al. (2018); Kusner et al. (2017)  1.504 
Grammar VAE (GVAE) Kusner et al. (2017)  1.404 
Syntaxdirected VAE (SDVAE) Dai et al. (2018)  1.366 
Junction tree VAE (JTVAE) Jin et al. (2018)  1.290 
NeVAE Samanta et al. (2018)  1.23 
All SMILES  0.011 0.001 
We report numerical results on semisupervised property prediction, as well as comparisons from Kang & Cho (2018), in Table 4. Our mean absolute error is at least three times smaller than comparison algorithms on the log octanolwater partition coefficient (logP) and molecular weight (MW).
Model  % labeled  MAE logP  MAE MW  MAE QED 

ECFP  50%  0.180 0.003  9.012 0.184  0.023 0.000 
GraphConv  50%  0.086 0.012  4.506 0.279  0.018 0.001 
SSVAE  50%  0.047 0.003  1.05 0.164  0.01 0.001 
All SMILES  50%  0.007 0.002  0.21 0.07  0.0064 
ECFP  20%  0.249 0.004  12.047 0.168  0.033 0.001 
GraphConv  20%  0.112 0.015  4.597 0.419  0.021 0.002 
SSVAE  20%  0.071 0.007  1.008 0.370  0.016 0.001 
All SMILES  20%  0.009 0.002  0.33 0.06  0.0079 
ECFP  10%  0.335 0.005  15.057 0.358  0.045 0.001 
GraphConv  10%  0.148 0.016  5.255 0.767  0.028 0.003 
SSVAE  10%  0.090 0.004  1.444 0.618  0.021 0.001 
All SMILES  10%  0.014 0.002  0.30 0.06  0.0126 0.0006 
ECFP  5%  0.380 0.009  17.713 0.396  0.053 0.001 
GraphConv  5%  0.187 0.015  6.723 2.116  0.034 0.004 
SSVAE  5%  0.120 0.006  1.639 0.577  0.028 0.001 
All SMILES  5%  0.036 0.004  0.4 0.1  0.0217 0.0003 
As a visual demonstration of the accuracy of property prediction, in Figure 12 we show the predicted logP of a 2D slice of latent space subject to the hierarchical radius constraint, alongside the true logP of the molecules decoded from this slice (identical to Figure (a)a).
Pathways on which activity (active or inactive) is assessed for the Tox21 dataset include seven nuclear receptor signaling pathways: androgen receptor, full (NRAR) androgen receptor, LBD (NRARLBD); aryl hydrocarbon receptor (NRAHR); aromatase (NRAROMATASE); estrogen receptor alpha, LBD (NRERLBD); estrogen receptor alpha, full (NRER); and peroxisome proliferatoractivated receptor gamma (NRPPARGAMMA). The Tox21 dataset also includes activity assessments for five stress response pathways: nuclear factor (erythroidderived 2)like 2/antioxidant responsive element (SRARE); ATAD5 (SRATAD5); heat shock factor response element (SRHSE); mitochondrial membrane potential (SRMMP); and p53 (SRp53). We report the area under the receiver operating characteristic curve (AUCROC) on each assay independently in Table 5. The average of these AUCROCs is reported in Table 1. We do not include the result of Kearnes et al. (2016) in Table 1, since it is not evaluated on the same train/validation/test split of the Tox21 dataset, and so is not commensurable.
NRAR  NRARLBD  NRAHR  NRAROMATASE  NRER  NRERLBD 

0.868  0.907  0.889  0.907  0.714  0.830 
NRPPARGAMMA  SRARE  SRATAD5  SRHSE  SRMMP  SRp53 
0.911  0.863  0.870  0.901  0.914  0.888 
c.3 Molecular optimization
We present an optimization trajectory for the quantitative estimate of druglikeness (QED) in Figure 13. For the molecules depicted in Figure 7, we scaled ) in the ELBO (Equation 1) of the All SMILES VAE by the number of SMILES strings in the decoder. This renders the loss function analogous to that of many parallel singleSMILES VAEs, but with message passing between encoders leading to a shared latent representation. If we leave the KL term unscaled, latent space embeddings are subject to less regularization forcing them to match the prior distribution. Optimization of molecular properties with respect to the latent space therefore searches over a wider space of molecules, which are less similar to the training set. In Figure 14, we show that such an optimization for penalized log P finds very long aliphatic chains, with penalized log P values as large as .
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCOCCCCCCCCCCCCCCCCCCCCCCCOC  CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCOCCOC  CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCOCOCCCCCCCCCCCCCCCCCCCCCCOC 
c.4 Ablation of model components
The All SMILES VAE passes messages along and between the flattened spanning trees of molecular graphs encoded by multiple SMILES strings of a shared molecule. In the base implementation, this is realized via a stack of GRUs, with pooling across SMILES strings amongst the hidden representation of each atom in each layer, and further max pooling of the final hidden states. In Table 6, we progressively ablate model components to demonstrate that this computational architecture builds a powerful fixedlength representation of molecules, rather than their particular SMILES string instantiations. We evaluate the effect of these model ablations on the mean absolute error of predictions of the log octanolwater partition coefficient (logP) and the quantitative estimate of druglikeness (QED), as well as the percentage of samples from the prior that decode to valid SMILES strings (Val) and the percentage of molecules in a heldout test set that are reconstructed accurately by the encoder and decoder (Rec acc). In all cases, we use the layerwise maximum a posterior configuration of the encoder (the mean of each conditional Gaussian distribution), and a beamsearch decoder.
No atombased pooling removes the pooling between each instance of an atom across SMILES strings, depicted in Figure 3. As a result, the multiple SMILES inputs are processed independently until the final max pooling over GRU hidden states, which serves as input to the first layer of the latent variable hierarchy. A random SMILES string is chosen to serve as input to the attentional mechanism for the remaining layers of the latent variable hierarchy. The effect of this ablation on toxicity prediction for the Tox21 dataset is reported in Table 7. We extend this process in One SMILES enc by only feeding a single SMILES string to the encoder, although the decoder still reconstructs multiple disjoint SMILES strings. One SMILES enc/dec () further reduces the size of the decoder set to one, but the encoded and decoded SMILES strings are distinct representations of the molecule. Finally, One SMILES enc/dec () encodes and decodes a single, shared SMILES string.
Except for One SMILES enc/dec (), all of these ablations primarily disrupt the flow of messages between the flattened spanning trees, and induce a similar, significant decay in performance. One SMILES enc/dec () further permits the latent representation to encode the details of the particular SMILES string, rather than forcing the representation of only the underlying molecule, and causes a further reduction in performance.
We also observe a meaningful contribution from the hierarchical approximating posterior, with its attentional pooling mechanism. In No posterior hierarchy, we move all latent variables to the first layer of the hierarchy, removing the succeeding layers. The remaining prior is a standard Gaussian, and there is no attentional pooling over the atom representations.
Table 8 shows that the hierarchical radius constraint significantly improves molecular optimization. In contrast to Table 2, optimization is performed on penalized logP alone, without a log prior regularizer. This produces better results without the radius constraint, and so constitutes a more conservative ablation experiment.
Ablation  MAE logP  MAE QED  Val  Rec acc 

Full model  0.005 0.0006  0.0052 0.0001  98.5 0.1  87.4 1.0 
No atombased pooling  0.008 0.004  0.0076 0.0005  97.6 0.2  84.0 0.4 
One SMILES enc  0.008 0.0005  0.0073 0.0002  98.4 0.1  82.3 0.4 
One SMILES enc/dec ()  0.009 0.001  0.0091 0.0003  
One SMILES enc/dec ()  0.025 0.003  0.0115 0.0004  
No posterior hierarchy  0.010 0.003  0.0051 0.0001 
Ablation  AUCROC 

Full model  0.871 
No atombased pooling  0.864 
Ablation  1st best logP  2nd best logP  3rd best logP 

With radius constraint  17.0 3.0  16.0 2.0  14.8 0.3 
Without radius constraint  8.5044 0.0  6.9526 0  5.36 0.05 
Appendix D SMILES grammar can be enforced with a pushdown automaton
The subset of the SMILES grammar Weininger (1988) captured by Dai et al. (2018) and Kusner et al. (2017) is equivalent to the contextfree grammar shown in Figure 15. This subset does not include the ability to represent multiple disconnected molecules in a single SMILES string, multiple fragments that are only connected by ringbonds, or wildcard atoms. element_symbols includes symbols for every element in the periodic table, including the aliphatic_organic symbols.
chain  
rest_of_chain  
bond  
branched_atom  
ringbond  
branch  
atom  
aliphatic_organic  
aromatic_organic  
bracket_atom  
isotope  
symbol  
aromatic_symbols  
chiral  
hcount  
charge  
class  
digit 
Productions generally begin with a unique, defining symbol or set of symbols. Exceptions include bond and charge (both can begin with ), and aromatic_organic and aromatic_symbols (both include c, n, o, s, and p), but these pairs of productions never occur in the same context, and so cannot be confused. The particular production for chiral can only be resolved by parsing characters up to the next production, but the end of chiral and the identity of the subsequent production can be inferred from its first symbol of the production after chiral. Alternatively, the strings of chiral can be encoded as monolithic tokens.
Whenever there is a choice between productions, the true production is uniquely identified by the next symbols. The only aspect of the SMILES grammar that requires more than a few bits of memory is the matching of parentheses, which can be performed in a straightforward manner with a pushdown automaton. As a result, parse trees Dai et al. (2018); Kusner et al. (2017) need not be explicitly constructed by the decoder to enforce the syntactic restrictions of SMILES strings. Rather, the SMILES grammar can be enforced with a pushdown automaton running in parallel with the decoder RNN. The state of the pushdown automaton tracks progress within the representation of each atom, and the sequence of atoms and bonds. The set output symbols available to the decoder RNN is restricted to those consistent with the current state of the pushdown automaton. ( and [ are pushed onto the stack when are emitted, and must be popped from the top of the stack in order to emit ) or ] respectively.
For example, in addition to simple aliphatic organic (B, C, N, O, S, P, F, Cl, Br, or I) or aromatic organic (b, c, n, o, s, or p) symbols, an atom may be represented by a pair of brackets (requiring parentheses matching) containing a sequence of isotope number, atom symbol, chiral symbol, hydrogen count, charge, and class. With the exception of the atom symbol, each element of the sequence is optional, but is easily parsed by a finite state machine. isotope, symbol, chiral, hcount, charge, and class can all be distinguished based upon their first character, so the position in the progression can be inferred trivially.^{7}^{7}7symbol and hcount can both start with ‘H’, but symbol is mandatory, so there is no ambiguity.
When parsing branched_atom, all productions after the initial atom are ringbonds until the first (, which indicates the beginning of a branch. After observing a ), and popping the complementary ( off of the stack, the SMILES string is necessarily in the third component of a branched_atom, since only a branched_atom can emit a branch, and only branch produces the symbol ). The next symbol must be a (, indicating the beginning of another branch, or one of the first symbols of rest_of_chain, since this must follow the branched_atom in the chain production.
d.1 Ringbond and valence shell semantic constraints
Similarly, the semantic restrictions of ringbond matching and valence shell constraints can be enforced during feedforward production of a SMILES string using a pushdown stack and a small (100element) random access memory. Our approach depends upon the presence of matching bond labels at both sides of a ringbond, which is allowed but not required in standard SMILES syntax. We assume the trivial extention of the SMILES grammar to include this property.
ringbonds are constrained to come in pairs, with the same bond label on both sides. Whenever a given ringbond is observed, flip a bit in the random access memory corresponding to the ring number (the set of digits after the bond). When the ringbond bit is flipped on, record the associated bond in the random access memory associated with the ring number; when the ringbond bit is flipped off, require that the new bond matches the recorded bond, and clear the random access memory of the bond. The molecule is only allowed to terminate (rest_of_chain produces rather than bond? chain) when all ringbond bits are off (parity is even). The decoder may receive as input which ringbonds are open, and the associated bond type, so it can preferentially close them.
The set of nested atomic contexts induced by chain, branched_atom, and branch can be arbitrarily deep, corresponding to the depth of branching in the spanning tree realized by a SMILES string. As a result, the set of SMILES symbols describing bonds to a single atom can be arbitrarily far away from t=he associated atom. However, once a branch is entered, it must be traversed in its entirety before the SMILES string can return to the parent atom. For each atom, it is sufficient to push the valence shell information onto the stack as it is encountered. If the SMILES string enters a branch while processing an atom, simply push on a new context, with a new associated root atom. Once the branch is completed, pop this context off the stack, and return to the original atom.
More specifically, each atom in the molecule is completely described by a single branched_atom and the bond preceding it (from the rest_of_chain that produced the branched_atom). Within each successive pair of bond and branched_atom, track the sum of the incoming rest_of_chainbond, the internal ringbond and branch bonds, and outgoing rest_of_chain bond (from the succeeding rest_of_chain) on the stack. That is, each time a new bond is observed from the atom, pop off the old valence shell count and push on the updated count. Require that the total be less than a bound set by the atom; any remaining bonds are filled by implicity hydrogen atoms. Provide the number of available bonds as input to the decoder RNN, and mask additional ringbonds and branches once the number of remaining available bonds reaches one (if there are still open ringbonds) or zero (if all ringbonds are closed). Mask the outgoing bond, or require that rest_of_chain produce , based upon the number of remaining available bonds.
d.2 Redundancy in graphbased and SMILES representations of molecules
To avoid the degeneracy of SMILES strings, for which there are many encodings of each molecule, some authors have advocated the use of graphbased representations Li et al. (2018); Liu et al. (2018); Ma et al. (2018); Simonovsky & Komodakis (2018). While graphbased processing may produce a unique representation in the encoder, it is not possible to avoid degeneracy in the decoder. Parse trees Dai et al. (2018); Kusner et al. (2017), junction trees Jin et al. (2018), lists of nodes and edges Li et al. (2018); Liu et al. (2018); Samanta et al. (2018), and vectors/matrices of node/edge labels De Cao & Kipf (2018); Ma et al. (2018); Simonovsky & Komodakis (2018) all imply an ordering amongst the nodes and edges, with many orderings describing the same graph. Canonical orderings can be defined, but unless they are obvious to the decoder, they make generative modeling harder rather than easier, since the decoder must learn the canonical ordering rules. Graph matching procedures can ensure that probability within a generative model is assigned to the correct molecule, regardless of the order produced by the decoder Simonovsky & Komodakis (2018). However, they do not eliminate the degeneracy in the decoder’s output, and the generative loss function remains highly multimodal.