Scaffoldbased molecular design using graph generative model
Abstract
Searching new molecules in areas like drug discovery often starts from the core structures of candidate molecules to optimize the properties of interest. The way as such has called for a strategy of designing molecules retaining a particular scaffold as a substructure. On this account, our present work proposes a scaffoldbased molecular generative model. The model generates molecular graphs by extending the graph of a scaffold through sequential additions of vertices and edges. In contrast to previous related models, our model guarantees the generated molecules to retain the given scaffold with certainty. Our evaluation of the model using unseen scaffolds showed the validity, uniqueness, and novelty of generated molecules as high as the case using seen scaffolds. This confirms that the model can generalize the learned chemical rules of adding atoms and bonds rather than simply memorizing the mapping from scaffolds to molecules during learning. Furthermore, despite the restraint of fixing core structures, our model could simultaneously control multiple molecular properties when generating new molecules.
KAIST] Department of Chemistry, KAIST, Daejeon, South Korea \altaffiliationContributed equally to this work KAIST] Department of Chemistry, KAIST, Daejeon, South Korea \altaffiliationContributed equally to this work KAIST] Department of Chemistry, KAIST, Daejeon, South Korea KAIST] Department of Chemistry, KAIST, Daejeon, South Korea KAIST] Department of Chemistry, KAIST, Daejeon, South Korea \alsoaffiliation[Second University] KI for Artificial Intelligence, KAIST, Daejeon, South Korea
1 Introduction
The ultimate goal of drug discovery is to find novel compounds with desirable pharmacological properties. Currently, it is an extremely challenging task, requiring years of development and numerous times of trials and failures.^{1} This is mainly because of the huge size and complexity of chemical space. For instance, the number of potential drug candidates is estimated about to , whereas only molecules have ever been synthesized.^{2, 3} Furthermore, the discrete nature of molecules makes searching in chemical space even harder.
Molecular generative models are attracting great attention as a promising in silico molecular design tool for assisting drug discovery.^{4, 5} In previous works on molecular design, deep learning techniques with SMILES representation^{6} of molecules were shown to be effective.^{7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18} Various deep generative models such as variational autoencoder^{7, 8, 9}, language models^{10, 11, 12}, generative adversarial network,^{15} and adversarial autoencoder^{18} have been utilized to develop SMILESbased molecular generative models. The models have demonstrated their utility for generating new molecules and controlling their molecular properties.
Despite the success of the SMILESbased molecular generative models, the SMILES has fundamental limitations in fully capturing molecular structures.^{19} Molecules having a high molecular similarity may have completely different SMILES representations. In this case, SMILESbased molecular generative models may fail to construct a smooth latent space. In addition, to generate valid SMILES strings, the models have to learn the grammar of SMILES, which increases the difficulty of the learning process and makes SMILES less preferable.
In contrast to the SMILES, graphs can naturally express molecular similarity and validity. However, graphbased molecular generative models have been relatively less studied than SMILESbased models because generating meaningful graphs imposes more difficulties than generating sequences does. Recent improvement of graph generative models on such problem opens a new possibility of molecule generation with graph representations of molecules.
Li et al. proposed a graph generative model that predicts a sequence of graph building actions.^{20} Starting from an empty graph, the model adds new nodes and edges in a successive manner. You et al. also developed a sequential graph generative model based on a deep autoregressive model.^{21} Another approach of generating graphs is to utilize generative adversarial nets.^{22} Wang et al. developed GraphGAN, which is composed of a generator predicting node pairs that are most likely to be connected, and a discriminator classifying real connected pairs from fake ones.^{23}
Focusing on molecular design, Simonovsky et al. proposed GraphVAE, specialized for generating graphs of small molecules.^{24} GraphVAE directly generates an adjacency matrix, a node set, and an edge set of a fullyconnected graph, and then recovers the original graph through a graph matching algorithm based on graph similarity. Jin et al. proposed a twostage molecular graph generative model named JTVAE.^{19} In JTVAE, a graph tree structure composed of chemical substructures is constructed, and then the full graph of a molecule is produced by connecting the chemical substructures. Li et al. used a conditional graph generative model to design a dual inhibitor against cJun Nterminal kinase 3 and glycogen synthase kinase3 beta, which are potential targets of Alzheimer’s disease.^{25}
In realworld molecular design, a common strategy is first identifying initial candidates and then modifying their side chains while maintaining their scaffolds. This strategy is particularly effective in designing protein inhibitors because the structural arrangement between a scaffold and protein residues is a main source of proteinligand binding. Despite its importance, such scaffoldbased molecular design has drawn surprisingly less attention in developing molecular generative models.
One possible way of retaining a scaffold of generated molecules is searching the latent space constrained to the neighborhood of the scaffold.^{9, 19} Alternatively, conditional generation of molecules can be used for controlling the scaffold of generated molecules by embedding the scaffold information in the latent space.^{25, 20} However, those methods cannot assure with certainty that the generated molecules include a given scaffold as a substructure. Also, the performance of the methods may vary depending on the types of scaffolds, or the methods may not be capable of incorporating scaffolds which were not in training set. Furthermore, to our best knowledge, no work has shown to control the scaffold and the molecular properties of generated molecules at the same time.
In this regard, we developed a graphbased molecular generative model which can design new molecules having a given scaffold as a substructure. Our primary contribution is proposing a scheme of generating molecular graphs from the graph of a given scaffold. Our model extends a scaffold graph by sequentially adding new nodes and edges. As a result, the model guarantees the generated molecules to include a given scaffold as a substructure. Our model can generate molecules from arbitrary scaffolds, whether they were used for training the model or not. This shows that our model can actually learn to build new molecules rather than memorize the patterns between the molecules and their scaffolds in the training dataset.
We tested whether our model can generate molecules with desirable properties under the constraint of fixing a scaffold. Conditional molecule generation has been already reported in other molecular generative models.^{8, 9, 19, 25} However, property control becomes more challenging in scaffoldbased molecule generation because fixing a scaffold confines chemical space, decreasing the possibility of finding desirable molecules. Nevertheless, our model can efficiently control molecular properties with high validity and success rate of molecule generation. Moreover, the model can control multiple properties simultaneously with comparable performance to the singleproperty result.
2 Method
Notation  Description 

An arbitrary or wholemolecule graph, depending on the context  
A molecular scaffold graph  
The node set of a graph  
The edge set of a graph  
A node feature vector  
An edge feature vector  
A readout vector summarizing  
A lantent vector to be decoded  
The vector of molecular properties of a wholemolecule  
The vector of molecular properties of a scaffold 
Overall process and model architecture. Our purpose is to generate molecules with target properties while retaining a given scaffold as a substructure. To this end, we set our generative model to be such that accepts a graph representation of a molecular scaffold and generates a graph that is a supergraph of . The underlying distribution of can be expressed as . Our notation here intends to manifest the particular relation, i.e., the supergraph–subgraph relation, between and . We also emphasize that is a distribution of alone; acts as a parametric argument, explicitly confining the domain of the distribution. Molecular properties are introduced as a condition, by which the model can define conditional distributions , where and are the vectors containing the property values of a molecule and its scaffold, respectively. Often in other works of molecule generation ^{25, 20}, a substructure moiety is imposed as a condition, hence defining a conditional distribution . In such case, the distribution can have nonzero probabilities on graphs that are not supergraphs of . On the other hand, the molecules that our model generates according to always include as a substructure. Before we proceed further, we refer the reader to Table 1 for the notations we will use in what follows. Also, when clear distinction is necessary, we will call a molecule a “wholemolecule” to distinguish it from a scaffold.
The learning object of our model is a strategy of extending a graph to larger graphs whose distribution follows that of real molecules. We achieve this by training our model to recover the molecules in a dataset from their scaffolds. The scaffold of a molecule can be defined in a deterministic way such as that by Bemis and Murcko^{26}, which is what we used in our experiments. The construction of a target graph is done by making successive decisions of node and edge additions. The decision at each construction step is drawn from the node features and edge features of the graph at the step. The node features and edge features are recurrently updated to reflect the construction history of the previous steps. The construction process will be further detailed below.
We realized our model as a variational autoencoder (VAE),^{27} with the architecture depicted in Figure 1. The architecture consists of an encoder and a decoder , parametrized by and , respectively. The encoder encodes a graph to an encoding vector , and the decoder decodes to recover . The decoder requires a scaffold graph as an additional input, and the actual decoding process runs by sequentially adding nodes and edges to . The encoding vector plays its role by consistently affecting its information in updating the node and edge features of a transient graph being processed. Similarly to , our notation indicates that candidate generations of the decoder are always a supergraph of . As for the encoder notation , we emphasize that the encoder also has a dependence on the scaffold because of the joint optimization of and .
Graph encoding. The goal of graph encoding is to generate a latent vector of the entire graph of a wholemolecule. Given the graph of any wholemolecule, we first associate each node with a node feature vector and each edge with an edge feature vector . For the initial node and edge features, we choose the atom types and bond types of the molecule. We then embed the initial feature vectors in new vectors with a higher dimension so that the vectors have enough capacity to express deep information in and between the nodes and edges. To fully encode the structural information of the molecule, we want every node embedding vector to contain not only the sole information of its own node but also the relation of to its neighborhood. This can be done by propagating each node’s information to the other nodes in the graph. A large variety of related methods have been devised, each being a particular realization of a graph neural network.
In this work, we implemented the encoder as a variant of the interaction network^{28, 29}. Our network’s algorithm consists of a propagation phase and a readout phase, which we write as
(1)  
(2) 
The propagation phase itself consists of two stages. The first stage calculates an aggregated message between each node and its neighbors as
(3) 
with a message function . The second stage updates the node vectors using the aggregated messages as
(4) 
with an update function . Updating every node feature vector in results in an updated set , as written in Eq. 1. We iterate the propagation phase a fixed number of times whenever applied, using different sets of parameters at different iteration steps. After the propagation, the readout phase (Eq. 2) computes a weighted sum of the node feature vectors, generating one vector representation that summarizes the graph as a whole. Then finally, a latent vector is sampled from a normal distribution whose mean and variance are inferred from .
The graph propagation can be conditioned by incorporating an additional vector in calculating aggregated messages. In such case, the functions and (accordingly) accept as an additional argument (i.e., they become and ). When encoding input graphs, we choose to be the concatenation of the property vectors and to enable propertycontrolled generation. During graph decoding, we use the concatenation of , , and the latent vector as the condition vector (see below).
Graph decoding. The goal of graph decoding is to reconstruct the graph of a wholemolecule from the latent vector sampled in the graph encoding phase. Our graph decoding process is motivated by the sequential generation strategy of Li \latinet al.^{20} In our work, we build the wholemolecule graph from the scaffold graph (extracted from by the BemisMurcko method^{26}) by successively adding nodes and edges. Here, denotes the initial scaffold graph, and we will write to denote any transient (or completed) graph constructed from .
Our graph decoding starts with preparing and propagating the initial node features of . As we do for , we prepare the initial feature vectors of by embedding the atom types and bond types of the scaffold molecule. This initial embedding is done by the same network ( in Figure 1) used for wholemolecules. The initial feature vectors of are then propagated a fixed number of times by another interaction network. As the propagation finishes, the decoder extends by processing it through a loop of node additions and accompanying (inner) loops of edge additions. A concrete description of the process is as follows:

node addition. Choose an atom type or terminate the building process with estimated probabilities. If an atom type is chosen, add a new node, say , with the chosen type to the current transient graph and proceed to Stage 2. Otherwise, terminate the building process and return the graph.

edge addition. Given the new node, choose a bond type or return to Stage 1 with estimated probabilities. If a bond type is chosen, proceed to Stage 3.

node selection. Select a node, say , from the existing nodes except with estimated probabilities. Then, add a new edge to with the bond type chosen in Stage 2. Continue the edge addition from Stage 2.
The flow of the whole process is depicted in the right side of Figure 1. Excluded from Stages 1–3 is the final stage of selecting a proper isomer, about which we describe separately below.
In every stage, the model draws an action by estimating a probability vector on candidate actions. Depending on whether the current stage should add an atom or not (Stage 1), add an edge or not (Stage 2), or select an atom to connect (Stage 3), the probability vector is computed by the corresponding one among the following:
(5)  
(6)  
(7) 
The first probability vector is a length vector, where its elements to correspond to the probabilities on atom types, and is the termination probability. As for , a vector of size , its elements to correspond to the probabilities on bond types, and is the probability of stopping edge addition. Lastly, the th element of the third vector is the probability of connecting the th existing node with the lastly added one.
When the model decides to add a new node, say , a corresponding feature vector should be added to . To that end, the model prepares an initial feature vector by representing the atom type of and then incorporates it with the existing node features in to compute a proper . Similarly, when a new edge, say , is added, the model computes from and to update to , where represents the bond type of . The corresponding modules for initializing new nodes and edges are as follows:
(8)  
(9) 
The graph building modules , , and include a preceding step of propagating node features. For instance, the actual operation done in is
(10) 
where denotes the function composition. According to the righthand side, the module updates node feature vectors through times of graph propagation, then computes a readout vector, then concatenates it with , and finally outputs through a multilayer perceptron . Likewise, both and start with iterated applications of . In this way, node features are recurrently updated every time the transient graph evolves, and the prediction of every building event becomes dependent on the history of the preceding events.
As shown in Eq. 10, the graph propagation in (and and ) incorporates the latent vector , which encodes a wholemolecule graph . This makes our model refer to while making graph building decisions and ultimately reconstruct by decoding . If the model is to be conditioned on wholemolecule properties and scaffold properties , one can understand Eqs. 5–7 and 10 as incorporating instead of .
Molecule generation. When generating new molecules, one needs a scaffold as an input, and a latent vector is sampled from the standard normal distribution. Then the decoder generates a new molecular graph as a supergraph of . If one desires to generate molecules with designated molecular properties, the corresponding property vectors and should be provided to condition the building process.
Isomer selection. Molecules can have stereoisomers, which have the same connectivity between atoms but different threedimensional geometries. Consequently, the complete generation of a molecule should also specify the molecule’s stereoisomerism. We determine the stereochemical configuration of atoms and bonds after a molecular graph is constructed from .^{19} The isomer selection module prepares the graphs of all possible stereoisomers, enumerated by the RDKit software,^{30} whose twodimensional structures without stereochemical labels are the same as that of . All the prepared include the stereochemical configuration of atoms and bonds in the node and edge features. Then the module estimates the selection probabilities as
(11) 
where the elements of the vector are the estimated probabilities of selecting respective .
Objective function. Our objective function has a form of the loglikelihood of an ordinary VAE:
(12) 
where is the KullbackLeibler divergence, and is the standard normal prior. In actual learning, we have a scaffold dataset , and for each scaffold we have a corresponding wholemolecule dataset . Note that any set of molecules can produce a scaffold set and a collection of wholemolecule sets: once the scaffolds of all molecules in a pregiven set are defined, producing , the molecules of the set can be grouped into the collection . Using those datasets, our objective is to find the optimal values of the parameters and that maximize the righthand side of Eq. 12, hence maximizing .
3 Results and discussion
3.1 Datasets and experiments
We obtained our dataset from the molecular library (version March 2018) provided by InterBioScreen Ltd. The raw dataset contained the SMILES strings of organic compounds composed of H, C, N, O, F, P, S, Cl, and Br atoms. We filtered out the strings containing disconnected ions or fragments and those that cannot be read by RDKit. Our preprocess resulted in training molecules and test molecules. The number of heavy atoms was 27 on average with a maximum of 132, and the average molecular weight was . The number of scaffold kinds was in the training set and in the test set.
Our experiments include the training and evaluation of our scaffoldbased graph generative model using the stated dataset. For the conditional molecule generation, we used molecular weight (MW), topological polar surface area (TPSA), and octanol–water partition coefficient (LogP). We used one, two, or all of the three properties to singly or jointly condition the model. We set the learning rate to and trained all instances of the model up to 20 epochs. The other hyperparameters such as the layer dimensions are stated in Appendix. We used RDKit to calculate the properties of molecules. In what follows, we will omit the units of MW () and TPSA () for simplicity.
3.2 Validity, uniqueness, and novelty analysis
The validity, uniqueness, and novelty of generated molecules are basic evaluation metrics of molecular generative models. For the exact meanings of the three metrics, we conform to the following definitions:
where we define a graph to be valid if it satisfies basic chemical requirements such as valency. In practice, we use RDKit to determine the validity of generated graphs. It is particularly important for our model to check the metrics above because generating molecules from a scaffold restricts the space of candidate products. We evaluated the models that are singly conditioned on MW, TPSA, or LogP by randomly selecting 100 scaffolds from the dataset and generating 100 molecules from each. The target values (100 for each property) were randomly sampled from each property’s distribution over the dataset. For MW, generating molecules whose MW is smaller than the MW of its scaffold is unnatural, so we excluded those cases from our evaluation.
Table 2 summarizes the validity, uniqueness, and novelty of the molecules generated by our models and the results of other molecular generative models for comparison. Note that the comparison here is only approximate because the models were trained by different datasets. Despite the strict restriction imposed by scaffolds, our models show high validity, uniqueness, and novelty, comparable to those of the other molecular generative models. The high uniqueness and novelty are particularly meaningful considering the fact that most of the scaffolds in our training set have only a few wholemolecules. For instance, among the scaffolds in the training set, scaffolds have less than ten wholemolecules. Therefore, it is unlikely that our model achieved such a high performance by simply memorizing the training set, and we can conclude that our model learns the chemical rules general in extending arbitrary scaffolds.
Model  Validity  Uniqueness  Novelty 

(%)  (%)  (%)  
Ours (MW)  
Ours (TPSA)  
Ours (LogP)  
GraphVAE^{31}  
MolGAN^{31}  
JTVAE^{19}  –  –  
MolMP^{25}  95.2–97.0  –  91.2–95.1 
SMILES VAE^{25}  –  
SMILES RNN^{25}  – 
3.3 Singleproperty control
For the next analysis, we tested whether our scaffoldbased graph generative model can generate molecules having a specific scaffold and desirable properties simultaneously. Although several molecular generative models have been developed for controlling molecular properties of generated molecules, it would be more challenging to control molecular properties under the constraint imposed by a given scaffold. We set the target values as 80, 100, and 120 for MW, 300, 350, and 400 for TPSA, and 5, 6, and 7 for LogP. For all the nine cases, we used the same 100 scaffolds used for the result in Sec. 3.2 and generated 100 molecules for each scaffold.
Figure 2 shows the property distributions of generated molecules. We see that the property distributions are well centered around the target values. This shows that despite the narrowed search space, our model successfully generated new molecules with desirable properties. To see how our model extends a given scaffold according to designated property values, we drew some of the generated molecules in Figure 3. For the target conditions MW = 400, TPSA = 120, and LogP = 7, we sampled nine random examples using three different scaffolds. The molecules in each row were generated from the same scaffold. We see that the model generates new molecules with designated properties by adding proper side chains: for instance, the model added hydrophobic groups to the scaffolds to generate highLogP molecules, while it added polar functional groups to generate highTPSA molecules.
3.4 Scaffold dependence
Our molecular design process starts from a given scaffold with sequentially adding nodes and edges. So the performance of our model can be affected by the kind of scaffolds. Accordingly, we tested whether our model retains its performance of generating desirable molecules when new scaffolds are given. Specifically, we prepared a set of 100 new scaffolds (henceforth “unseen” scaffolds) that were not included in the training set and an additional set of 100 scaffolds (henceforth “seen” scaffolds) from the training set. We then generated 100 molecules for each scaffold with randomly designated property values. The process is repeated for MW, TPSA, and LogP.
Table 3 summarizes the validity, uniqueness, and MAD of molecules generated from the seen and unseen scaffolds. Here, MAD denotes the mean absolute difference between designated property values and the property values of generated molecules. The result shows no significant difference of the three metrics between the two sets of scaffolds. This shows that our model achieves generalization over arbitrary scaffolds in generating valid molecules with controlled properties.
Property  Validity  Uniqueness  MAD 

(%)  (%)  
MW (seen scaffolds)  
MW (unseen scaffolds)  
TPSA (seen scaffolds)  
TPSA (unseen scaffolds)  
LogP (seen scaffolds)  
LogP (unseen scaffolds) 
3.5 Multiproperty control
Designing new molecules seldom requires only one specific molecular property to be controlled. Among others, drug design particularly involves simultaneous control of a multitude of molecular properties. In this regard, we first tested our model’s ability of simultaneously controlling two of MW, TPSA, and LogP. We trained three instances of the model, each being jointly conditioned on MW and TPSA, MW and LogP, and LogP and TPSA. We then specified each property with two target values (350 and 450 for MW, 50 and 100 for TPSA, and 2 and 5 for LogP) and combined them to prepare four generation conditions for each pair. Under every generation condition, we used the randomly sampled 100 scaffolds that we used for the results of Secs. 3.2 and 3.3 and generated 100 molecules from each scaffold. We excluded those generations whose target MW is smaller than the used scaffold MW.
Figure 4 shows the result of the generations conditioned on MW and TPSA, MW and LogP, and LogP and TPSA. Plotted are the joint distributions of the property values over the generated molecules. Gaussian kernels were used for the kernel density estimation. We see that the modes of the distributions are well located near the point of the target values. As an exception, the distribution by the target (LogP, TPSA) = (2, 50) shows a relatively long tail over larger LogP and TPSA values. This is because LogP and TPSA have by definition a negative correlation between each other and thus requiring a small value for both can make the generation task unphysical. Intrinsic correlation in molecular properties can even cause seemingly feasible targets to result in dispersed property distributions. An example of such can be the result of another target (LogP, TPSA) = (5, 50), but we note that in the very case the outliers (in LogP > 5.5 and TPSA > 65 for example) amount to only a minor portion of the total generations, as the contours show.
We further tested the conditioned generation by incorporating all the three properties. We used the same target values of MW, TPSA, and LogP as above, resulting in total eight conditions of generation. The rest of the settings, including the scaffold set and the number of generations, were retained. The result is shown in Figure 5, where we plotted the MW, TPSA, and LogP values of the generated molecules. The plot shows that the distributions from different target conditions are well separated from one another. As with the doubleproperty result, all the distributions are well centered around their target values.
We also compared the generation performance of our model for single and multiproperty controls in a quantitative way. Table 4 shows the performance statistics of single, double, and tripleproperty controls in terms of MAD, validity, and novelty. Using the same 100 scaffolds, we generated 100 molecules from each, each time under a randomly designated target condition. As the number of incorporated properties increases from one to two and to three, the overall magnitudes of the descriptors are well preserved. Regarding the slight increases in the MAD values, we attribute them to the additional confinement of chemical space forced by intrinsic correlations between multiple properties. Nevertheless, the magnitudes of the worsening are small compared to the mean values of the properties (389 for MW, 77 for TPSA, and 3.6 for LogP).
Properties  MW MAD  TPSA MAD  LogP MAD  Validity  Novelty 
(%)  (%)  
MW^{a}  –  –  
TPSA^{a}  –  –  
LogP^{a}  –  –  
MW & TPSA  –  
MW & LogP  –  
TPSA & LogP  –  
MW & TPSA & LogP  
^{a}From the result in Sec. 3.2 
4 Conclusion
In this work, we proposed a scaffoldbased molecular graph generative model. The model generates new molecules that retain a desired substructure, i.e., scaffold, by sequentially adding new atoms and bonds to the graph of the scaffold. In contrast to other related methods such as conditional generation, the strategy guarantees that the generated molecules naturally have the scaffold as a substructure.
We evaluated our model by examining the validity, uniqueness, and novelty of generated molecules. Despite the constraint on the search space imposed by scaffolds, the model showed comparable results with regard to previous SMILESbased and graphbased molecular generative models. Our model consistently worked well in terms of the three metrics when new scaffolds which were not in the training set were given. This means that the model achieved good generalization rather than memorizing the pairings between the scaffolds and molecules in the training set. In addition, while retaining the given scaffolds, our model successfully generated new molecules with desirable degrees of molecular properties such as molecular weights, topological polar surface areas, and octanolwater partition coefficients. The propertycontrolled generation could also incorporate multiple molecular properties simultaneously. We believe that our scaffoldbased molecular graph generative model provides a practical way of optimizing the functionality of molecules with fixed core structures.
Appendix. Implementation details
We here describe our implementation of the model. The full process of encoding and decoding is shown in Algorithm 1.
Graph representation of molecules. In our graph representation of a molecule, the nodes represent the atoms, and the edges represent the bonds. We regard each node as attributed with an atom type and each edge as attributed with a bond type.
In the present work, we used the atom types in an indexed family (C, N, O, F, P, S, Cl, Br) and the bond types in another indexed family (singlebond, doubledone, triplebond). The symbols in indicate the corresponding elements in the periodic table. For the initial representation of wholemolecules and scaffolds, we use an extended family , which includes all the elements of and additionally chirality (R, S, or none), formal charge, and aromaticity; also, we use an extended family , which includes the three bond types and stereoisomerism (E, Z, cis, trans, or none). We used RDKit^{30} to preprocess molecules into graphs.
To prepare node feature vectors and edge feature vectors , we embed node and edge types via two networks:
(13)  
(14) 
is a raw feature vector representing the type of based on , and similarly is a raw feature vector of based on . For each of and , we used a single linear layer with output dimension 128. The result of embedding all elements of a graph becomes
(15) 
We use the same module to embed all wholemolecules, scaffolds, and stereoisomers.
Graph propagation and readout. The graph propagation module
(16) 
consists of the following processes:
(17)  
(18)  
(19) 
where is the function composition, is the rectified linear unit^{32}, is a condition vector, and is a gated recurrent unit cell^{33} (accepting as the input and as the hidden state). For we used one linear layer with output dimension 128. We had and use a different set of parameters in different rounds of iterated propagation.
The readout module summarizes node features via the gated pooling:
(20) 
where is the sigmoid function, and is the elementwise product. For each of and , we used a single linear layer.
We had and in different modules have different sets of parameters. For instance, all involved in , , , and have different and . In addition, we used two different output dimensions for : when readingout the node features of any transient graph in a building process, we set the dimension equal to that of a node feature vector (i.e., 128); when encoding a wholemolecule graph, we set double.
Encoding. With a wholemolecule graph , we sample a latent vector by applying the reparametrization trick:^{27}
(21)  
(22)  
(23)  
(24) 
where is the standard normal distribution. In the last line we omitted the graph dependence of for simplicity. For each of and , we used one linear layer with output dimension 128. Note that in Algorithm 1 we used to concisely express the sampling process.
Decoder modules. The node addition module computes atom type probabilities as
(25) 
where is the transient graph in a building process. For , we used three linear layers with activations. The output dimensions of the layers were all 128. The length of the vector is . The computed probabilities define a categorical distribution (), from which we sample an index such that
(26) 
If , the model adds a new node with the th chemical element , or else the building process terminates.
If a new node is to be added to a transient graph , the node initialization module computes a corresponding feature vector as follows:
(27) 
In the last line, is a raw feature representing the new node’s type based on (note the absence of an asterisk, unlike the one in Eq. 13). We used one linear layer for each of and .
The edge addition module computes in the same way as Eq. 25 but with an MLP of different weights. If the sampled index is less than or equal to , the model adds a new edge with the th bond type . If , the model stops the edge addition.
To describe the node selection, let us suppose a new node was added to a transient graph so that and . If a new edge is determined to be added, the module first updates the node features as
(28) 
and computes the selection probability for each existing node through the following steps:
(29)  
(30) 
Then from the resulting categorical distribution , the model samples a node and connects it with (i.e., add the resulting edge to ).
The edge initialization module computes edge feature vectors in the same way as Eq. 27. The differences are that different MLPs are used and that is replaced by a raw representation of the chosen bond type.
A complete specification of a molecular graph should assign the extended types in and to its elements. Motivated by the strategy of Jin et al.,^{19} our model assigns only the basic types in and during graph building and specifies stereoisomerism at the final stage of generation. Given a graph , the isomer selection module prepares the set of all possible stereoisomers of enumerated by RDKit. The graphs in the resulting set consist of nodes and edges that are fully typed according to and . For each isomeric graph , estimates the selection probability through
(31)  
(32)  
(33)  
(34) 
Note that multiple can be valid for one . For instance, there can be some whose stereocenters are only partially labelled by its data source, and in such case isomers with different labels on the same unlabelled stereocenters can all be regarded as valid. When generating a new molecule, however, we want our model to predict one isomer without ambiguity. Therefore, in the generation phase, the model normalizes the probabilities by and then chooses one plausible isomer from the resulting categorical distribution.
Learning. We used the molecule dataset described in Sec. 3.1 for learning. The dataset consists of a set of scaffold molecules, , and a collection of each scaffold’s wholemolecules, , where is a set of wholemolecules of scaffold . We can arrange and into an indexed family , each of whose element is a pair of a scaffold and one of its wholemolecules. There can be duplicates of scaffolds or wholemolecules over different indices, but each pair is unique.
The individual loss due to each pair is a weighted sum of three losses: the graph building loss , the isomer selection loss , and the posterior approximation loss . The second of the three is set to be
(35) 
where is the true probability of selecting . The third of the three reads^{27}
(36) 
where and are the th elements of and , respectively (Eqs. 21 and 22).
To describe the graph building loss, let us express the stepwise transitions from to by a finite sequence , where and . The transition from to conforms to the true probability vector , which has one unity value for the correct building action and zeros for the others. During learning, the model reconstructs each from by estimating a sequence . Then the individual graph building loss can be defined by
(37) 
We minimized to optimize our model and maximize the loglikelihood in Eq. 12. We used 0.1 for the weight . As for , the number of iterations of , we set for the initial propagation of wholemolecule graphs and scaffold graphs and for , , and .
Finally, we remark the effect of node and edge orderings. Sequential generation of graphs requires their elements to be ordered. Different orderings amount to different sequences of graph transitions for the same pair. Similarly to Li et al.,^{20} we trained two models using a fixed ordering for one and using random orderings for the other. We evaluated the two models in terms of the descriptors used in Sec. 3 and confirmed that different orderings cause no significance change in performance. Therefore, we used a fixed ordering (assigned by RDKit when reading SMILES data) for all the results in Sec. 3.
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT)(NRF2017R1E1A1A01078109). We also thank Prof. MuHyun Baik of the Center for Catalytic Hydrocarbon Functionalizations, Institute for Basic Science (IBS) and Department of Chemistry at KAIST, for providing the computing resources.
References
 Hoelder \latinet al. 2012 Hoelder, S.; Clarke, P. A.; Workman, P. Discovery of small molecule cancer drugs: Successes, challenges and opportunities. Molecular Oncology 2012, 6, 155–176.
 Polishchuk \latinet al. 2013 Polishchuk, P. G.; Madzhidov, T. I.; Varnek, A. Estimation of the size of druglike chemical space based on GDB17 data. Journal of ComputerAided Molecular Design 2013, 27, 675–679.
 Kim \latinet al. 2016 Kim, S.; Thiessen, P. A.; Bolton, E. E.; Chen, J.; Fu, G.; Gindulyte, A.; Han, L.; He, J.; He, S.; Shoemaker, B. A.; Wang, J.; Yu, B.; Zhang, J.; Bryant, S. H. PubChem Substance and Compound databases. Nucleic Acids Research 2016, 44, D1202–D1213.
 Chen \latinet al. 2018 Chen, H.; Engkvist, O.; Wang, Y.; Olivecrona, M.; Blaschke, T. The rise of deep learning in drug discovery. Drug Discovery Today 2018, 23, 1241–1250.
 SanchezLengeling and AspuruGuzik 2018 SanchezLengeling, B.; AspuruGuzik, A. Inverse molecular design using machine learning: Generative models for matter engineering. Science 2018, 361, 360–365.
 Weininger 1988 Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Modeling 1988, 28, 31–36.
 GÃ³mezBombarelli \latinet al. 2018 GÃ³mezBombarelli, R.; Wei, J. N.; Duvenaud, D.; HernÃ¡ndezLobato, J. M.; SÃ¡nchezLengeling, B.; Sheberla, D.; AguileraIparraguirre, J.; Hirzel, T. D.; Adams, R. P.; AspuruGuzik, A. Automatic Chemical Design Using a DataDriven Continuous Representation of Molecules. ACS Central Science 2018, 4, 268–276.
 Kang and Cho 2019 Kang, S.; Cho, K. Conditional Molecular Design with Deep Generative Models. Journal of Chemical Information and Modeling 2019, 59, 43–52.
 Lim \latinet al. 2018 Lim, J.; Ryu, S.; Kim, J. W.; Kim, W. Y. Molecular generative model based on conditional variational autoencoder for de novo molecular design. Journal of Cheminformatics 2018, 10, 31.
 Segler \latinet al. 2018 Segler, M. H. S.; Kogej, T.; Tyrchan, C.; Waller, M. P. Generating Focused Molecule Libraries for Drug Discovery with Recurrent Neural Networks. ACS Central Science 2018, 4, 120–131.
 Gupta \latinet al. 2018 Gupta, A.; Müller, A. T.; Huisman, B. J. H.; Fuchs, J. A.; Schneider, P.; Schneider, G. Generative Recurrent Networks for De Novo Drug Design. Molecular Informatics 2018, 37, 1700111.
 Bjerrum and Sattarov 2018 Bjerrum, E. J.; Sattarov, B. Improving Chemical Autoencoder Latent Space and Molecular De Novo Generation Diversity with Heteroencoders. Biomolecules 2018, 8.
 Popova \latinet al. 2018 Popova, M.; Isayev, O.; Tropsha, A. Deep reinforcement learning for de novo drug design. Science Advances 2018, 4.
 Olivecrona \latinet al. 2017 Olivecrona, M.; Blaschke, T.; Engkvist, O.; Chen, H. Molecular denovo design through deep reinforcement learning. Journal of Cheminformatics 2017, 9, 48.
 Lima Guimaraes \latinet al. 2017 Lima Guimaraes, G.; SanchezLengeling, B.; Outeiral, C.; Cunha Farias, P. L.; AspuruGuzik, A. ObjectiveReinforced Generative Adversarial Networks (ORGAN) for Sequence Generation Models. arXiv eprints 2017, arXiv:1705.10843.
 Jaques \latinet al. 2017 Jaques, N.; Gu, S.; Bahdanau, D.; HernándezLobato, J. M.; Turner, R. E.; Eck, D. Sequence Tutor: Conservative FineTuning of Sequence Generation Models with KLcontrol. Proceedings of the 34th International Conference on Machine Learning. International Convention Centre, Sydney, Australia, 2017; pp 1645–1654.
 Neil \latinet al. 2018 Neil, D.; Segler, M.; Guasch, L.; Ahmed, M.; Plumbley, D.; Sellwood, M.; Brown, N. Exploring Deep Recurrent Models with Reinforcement Learning for Molecule Design. 6th International Conference on Learning Representations. 2018.
 Polykovskiy \latinet al. 2018 Polykovskiy, D.; Zhebrak, A.; Vetrov, D.; Ivanenkov, Y.; Aladinskiy, V.; Mamoshina, P.; Bozdaganyan, M.; Aliper, A.; Zhavoronkov, A.; Kadurin, A. Entangled Conditional Adversarial Autoencoder for de Novo Drug Discovery. Molecular Pharmaceutics 2018, 15, 4398–4405.
 Jin \latinet al. 2018 Jin, W.; Barzilay, R.; Jaakkola, T. Junction Tree Variational Autoencoder for Molecular Graph Generation. Proceedings of the 35th International Conference on Machine Learning. StockholmsmÃ¤ssan, Stockholm Sweden, 2018; pp 2323–2332.
 Li \latinet al. 2018 Li, Y.; Vinyals, O.; Dyer, C.; Pascanu, R.; Battaglia, P. Learning Deep Generative Models of Graphs. 6th International Conference on Learning Representations. 2018.
 You \latinet al. 2018 You, J.; Liu, B.; Ying, Z.; Pande, V.; Leskovec, J. In Advances in Neural Information Processing Systems 31; Bengio, S., Wallach, H., Larochelle, H., Grauman, K., CesaBianchi, N., Garnett, R., Eds.; Curran Associates, Inc., 2018; pp 6410–6421.
 Goodfellow \latinet al. 2014 Goodfellow, I.; PougetAbadie, J.; Mirza, M.; Xu, B.; WardeFarley, D.; Ozair, S.; Courville, A.; Bengio, Y. In Advances in Neural Information Processing Systems 27; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., Weinberger, K. Q., Eds.; Curran Associates, Inc., 2014; pp 2672–2680.
 Wang \latinet al. 2018 Wang, H.; Wang, J.; Wang, J.; Zhao, M.; Zhang, W.; Zhang, F.; Xie, X.; Guo, M. GraphGAN: Graph Representation Learning with Generative Adversarial Nets. Proceedings of the ThirtySecond AAAI Conference on Artificial Intelligence. 2018.
 Simonovsky and Komodakis 2018 Simonovsky, M.; Komodakis, N. GraphVAE: Towards Generation of Small Graphs Using Variational Autoencoders. arXiv eprints 2018, arXiv:1802.03480.
 Li \latinet al. 2018 Li, Y.; Zhang, L.; Liu, Z. Multiobjective de novo drug design with conditional graph generative model. Journal of Cheminformatics 2018, 10, 33.
 Bemis and Murcko 1996 Bemis, G. W.; Murcko, M. A. The Properties of Known Drugs. 1. Molecular Frameworks. Journal of Medicinal Chemistry 1996, 39, 2887–2893, PMID: 8709122.
 Kingma and Welling 2013 Kingma, D. P.; Welling, M. AutoEncoding Variational Bayes. arXiv eprints 2013, arXiv:1312.6114.
 Battaglia \latinet al. 2016 Battaglia, P.; Pascanu, R.; Lai, M.; Jimenez Rezende, D.; koray Kavukcuoglu, In Advances in Neural Information Processing Systems 29; Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc., 2016; pp 4502–4510.
 Gilmer \latinet al. 2017 Gilmer, J.; Schoenholz, S. S.; Riley, P. F.; Vinyals, O.; Dahl, G. E. Neural Message Passing for Quantum Chemistry. arXiv eprints 2017, arXiv:1704.01212.
 30 RDKit: OpenSource Cheminformatics. rdkit.org.
 De Cao and Kipf 2018 De Cao, N.; Kipf, T. MolGAN: An implicit generative model for small molecular graphs. arXiv eprints 2018, arXiv:1805.11973.
 Nair and Hinton 2010 Nair, V.; Hinton, G. E. Rectified Linear Units Improve Restricted Boltzmann Machines. Proceedings of the 27th International Conference on International onference on Machine Learning. USA, 2010; pp 807–814.
 Cho \latinet al. 2014 Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN EncoderDecoder for Statistical Machine Translation. arXiv eprints 2014, arXiv:1406.1078.