Graph Convolutional Policy Network for Goal-Directed Molecular Graph Generation
Generating novel graph structures that optimize given objectives while obeying some given underlying rules is fundamental for chemistry, biology and social science research. This is especially important in the task of molecular graph generation, whose goal is to discover novel molecules with desired properties such as drug-likeness and synthetic accessibility, while obeying physical laws such as chemical valency. However, designing models to find molecules that optimize desired properties while incorporating highly complex and non-differentiable rules remains a challenging task. Here we propose Graph Convolutional Policy Network (GCPN), a general graph convolutional network based model for goal-directed graph generation through reinforcement learning. The model is trained to optimize domain-specific rewards and adversarial loss through policy gradient, and acts in an environment that incorporates domain-specific rules. Experimental results show that GCPN improves on state-of-the-art baselines in chemical property optimization while generating molecules that resemble known molecules, and also improves on them in the constrained property optimization task.
Jiaxuan You1*, Bowen Liu2*, Rex Ying1, Vijay Pande2, Jure Leskovec1
*The two first authors made equal contributions.
Preprint. Work in progress.
1Department of Computer Science, 2Department of Chemistry
Stanford, CA, 94305
1 Introduction
Many important problems in drug discovery and material science are based on the principle of designing molecular structures with specific desired properties. However, this remains a challenging task due to the large size of chemical space. For example, the range of drug-like molecules has been estimated to be between $10^{23}$ and $10^{60}$. Additionally, chemical space is discrete, and molecular properties are highly sensitive to small changes in the molecular structure. An increase in the effectiveness of the design of new molecules with application-driven goals would significantly accelerate developments in novel medicines and materials.
Recently, there have been significant advances in applying deep learning models to molecule generation [14, 37, 7, 9, 21, 4, 30, 26, 33, 40]. However, the generation of novel and valid molecular graphs that can directly optimize various desired physical, chemical and biological property objectives remains a challenging task, since these property objectives are highly complex and non-differentiable. Furthermore, the generation model should be able to actively explore the vast chemical space, as the distribution of the molecules that possess those desired properties does not necessarily match the distribution of molecules from existing datasets.
Present Work. In this work, we propose Graph Convolutional Policy Network (GCPN), an approach to generate molecules where the generation process can be guided towards specified desired objectives, while restricting the output space based on underlying chemical rules. To address the challenge of goal-directed molecule generation, we utilize and extend three ideas, namely graph representation, reinforcement learning and adversarial training, and combine them in a single unified framework. Graph representation learning is used to obtain vector representations of the state of generated graphs, adversarial loss is used as reward to incorporate prior knowledge specified by a dataset of example molecules, and the entire model is trained end-to-end in the reinforcement learning framework.
Graph representation. We represent molecules directly as molecular graphs, which are more robust than intermediate representations such as the simplified molecular-input line-entry system (SMILES), a text-based representation that is widely used in previous works [9, 21, 4, 14, 37, 26, 33]. For example, a single character perturbation in a text-based representation of a molecule can lead to significant changes to the underlying molecular structure or even invalidate it. Additionally, partially generated molecular graphs can be interpreted as substructures, whereas partially generated text representations in many cases are not meaningful. As a result, we can perform chemical checks, such as valency checks, on a partially generated molecule when it is represented as a graph, but not when it is represented as a text sequence.
Reinforcement learning. A reinforcement learning approach to goal-directed molecule generation presents several advantages compared to learning a generative model over a dataset. Firstly, desired molecular properties such as drug-likeness [1, 28] and molecule constraints such as valency are complex and non-differentiable, thus they cannot be directly incorporated into the objective function of graph generative models. In contrast, reinforcement learning is capable of directly representing hard constraints and desired properties through the design of environment dynamics and reward function. Secondly, reinforcement learning allows active exploration of the molecule space beyond samples in a dataset. Alternative deep generative model approaches [9, 21, 4, 15] show promising results on reconstructing given molecules, but their exploration ability is restricted by the training dataset.
Adversarial training. Incorporating prior knowledge specified by a dataset of example molecules is crucial for molecule generation. For example, a drug molecule is usually relatively stable in physiological conditions, non-toxic, and possesses certain physiochemical properties [27]. Although it is possible to hand-code the rules or train a predictor for one of the properties, precisely representing the combination of these properties is extremely challenging. Adversarial training addresses the challenge through a learnable discriminator adversarially trained with a generator [10]. After the training converges, the discriminator implicitly incorporates the information of a given dataset and guides the training of the generator.
GCPN is designed as a reinforcement learning agent (RL agent) that operates within a chemistry-aware graph generation environment. A molecule is successively constructed by either connecting a new substructure or an atom with an existing molecular graph or adding a bond to connect existing atoms. GCPN predicts the action of the bond addition, and is trained via policy gradient to optimize a reward composed of molecular property objectives and adversarial loss. The adversarial loss is provided by a graph convolutional network [19, 5] based discriminator trained jointly on a dataset of example molecules. Overall, this approach allows direct optimization of application-specific objectives, while ensuring that the generated molecules are realistic and satisfy chemical rules.
We evaluate GCPN in three distinct molecule generation tasks that are relevant to drug discovery and materials science: molecule property optimization, property targeting and conditional property optimization. We use the ZINC dataset [13] to provide GCPN with example molecules, and train the policy network to generate molecules with a high property score, molecules with a pre-specified range of target property score, or molecules containing a specific substructure while having a high property score. In all tasks, GCPN achieves state-of-the-art results. GCPN generates molecules with property scores higher than the best baseline method, and outperforms the baseline models in the constrained optimization setting by 184% on average.
2 Related Work
Yang et al. and Olivecrona et al. proposed a recurrent neural network (RNN) SMILES string generator with molecular properties as the objective, optimized using Monte Carlo tree search and policy gradient, respectively. Guimaraes et al. [26] and Sanchez-Lengeling et al. further added an adversarial loss to the reinforcement learning reward to enforce similarity to a given molecule dataset. In contrast, instead of using a text-based molecular representation, our approach uses a graph-based molecular representation, which leads to many important benefits as discussed in the introduction. Jin et al. [15] proposed to use a variational autoencoder (VAE) framework, where the molecules are represented as junction trees of small clusters of atoms. This approach can only indirectly optimize molecular properties in the learned latent embedding space before decoding to a molecule, whereas our approach can directly optimize molecular properties of the molecular graphs. You et al. used an auto-regressive model to maximize the likelihood of the graph generation process, but it cannot be used to generate attributed graphs. Li et al. [24] and Li et al. [25] described sequential graph generation models where conditioning labels can be incorporated to generate molecules whose molecular properties are close to specified target scores. However, these approaches are also unable to directly perform optimization on desired molecular properties. Overall, modeling the goal-directed graph generation task in a reinforcement learning framework is still largely unexplored.
3 Proposed Method
In this section we formulate the problem of graph generation as learning an RL agent that iteratively adds substructures and edges to the molecular graph in a chemistry-aware environment. We describe the problem definition, the environment design, and the Graph Convolutional Policy Network that predicts a distribution of actions which are used to update the graph being generated.
3.1 Problem Definition
We represent a graph $G$ as $(A, E, F)$, where $A \in \{0,1\}^{n \times n}$ is the adjacency matrix, and $F \in \mathbb{R}^{n \times d}$ is the node feature matrix assuming each node has $d$ features. We define $E \in \{0,1\}^{b \times n \times n}$ to be the (discrete) edge-conditioned adjacency tensor, assuming there are $b$ possible edge types. $E_{i,j,k} = 1$ if there exists an edge of type $i$ between nodes $j$ and $k$, and $A = \sum_{i=1}^{b} E_i$. Our primary objective is to generate graphs that maximize a given property function $S(G) \in \mathbb{R}$, i.e., maximize $\mathbb{E}_{G'}[S(G')]$, where $G'$ is the generated graph, and $S$ could be one or multiple domain-specific statistics of interest.
It is also of practical importance to constrain our model with two main sources of prior knowledge. (1) Generated graphs need to satisfy a set of hard constraints. (2) We provide the model with a set of example graphs $G \sim p_{data}(G)$, and would like to incorporate such prior knowledge by regularizing the property optimization objective with $\mathbb{E}_{G, G'}[J(G, G')]$ under distance metric $J(\cdot, \cdot)$. In the case of molecule generation, the set of hard constraints is described by chemical valency while the distance metric is an adversarially trained discriminator.
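As a small illustration of the edge-conditioned adjacency tensor described above, the following sketch builds the tensor for a three-atom toy graph and collapses it into a plain adjacency matrix by summing over edge types. The function names and the toy bond list are hypothetical and are not taken from the authors' code.

```python
# Edge types assumed for this sketch (kekulized molecules use these three).
EDGE_TYPES = ("single", "double", "triple")

def build_edge_tensor(num_nodes, bonds):
    """Return E as nested lists of shape (b, n, n) with 0/1 entries."""
    b = len(EDGE_TYPES)
    E = [[[0] * num_nodes for _ in range(num_nodes)] for _ in range(b)]
    for j, k, bond_type in bonds:
        i = EDGE_TYPES.index(bond_type)
        E[i][j][k] = 1
        E[i][k][j] = 1  # molecular graphs are undirected
    return E

def adjacency_from_tensor(E):
    """Collapse edge types: A[j][k] is the sum over i of E[i][j][k]."""
    n = len(E[0])
    return [[sum(E_i[j][k] for E_i in E) for k in range(n)] for j in range(n)]

# Toy graph: atom 0 - atom 1 (single bond), atom 1 = atom 2 (double bond).
bonds = [(0, 1, "single"), (1, 2, "double")]
E = build_edge_tensor(3, bonds)
A = adjacency_from_tensor(E)
```

Because any pair of atoms carries at most one bond, the summed matrix stays binary.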
3.2 Graph Generation as Markov Decision Process
A key task for building our model is to specify a generation procedure. We designed an iterative graph generation process and formulated it as a general decision process $M = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where $\mathcal{S} = \{s_i\}$ is the set of states that consists of all possible intermediate and final graphs, $\mathcal{A} = \{a_i\}$ is the set of actions that describe the modification made to the current graph at each time step, $P$ is the transition dynamics that specifies the possible outcomes of carrying out an action, $p(s_{t+1} \mid s_t, \dots, s_0, a_t)$, $R(s_t)$ is a reward function that specifies the reward after reaching state $s_t$, and $\gamma$ is the discount factor. The procedure to generate a graph can then be described by a trajectory $(s_0, a_0, r_0, \dots, s_n, a_n, r_n)$, where $s_n$ is the final generated graph. The modification of a graph at each time step can be viewed as a state transition distribution: $p(s_{t+1} \mid s_t, \dots, s_0) = \sum_{a_t} p(a_t \mid s_t, \dots, s_0)\, p(s_{t+1} \mid s_t, \dots, s_0, a_t)$, where $p(a_t \mid s_t, \dots, s_0)$ is usually represented as a policy network $\pi_\theta$.
Recent graph generation models add nodes and edges based on the full trajectory $(s_t, \dots, s_0)$ of the graph generation procedure [41, 24] using recurrent units, which tends to “forget” initial steps of generation quickly. In contrast, we design a graph generation procedure that can be formulated as a Markov Decision Process (MDP), which requires the state transition dynamics to satisfy the Markov property: $p(s_{t+1} \mid s_t, \dots, s_0) = p(s_{t+1} \mid s_t)$. Under this property, the policy network only needs the intermediate graph state $s_t$ to derive an action (see Section 3.4). The action is used by the environment to update the intermediate graph being generated (see Section 3.3).
3.3 Molecule Generation Environment
In this section we discuss the setup of molecule generation environment. On a high level, the environment builds up a molecular graph step by step through a sequence of bond or substructure addition actions given by GCPN. Figure 1 illustrates the 5 main components that come into play in each step, namely state representation, policy network, action, state transition dynamics and reward. Note that this environment can be easily extended to graph generation tasks in other settings.
State Space. We define the state of the environment $s_t$ as the intermediate generated graph $G_t$ at time step $t$, which is fully observable by the RL agent. Figure 1 (a)(e) depicts the partially generated molecule state before and after an action is taken. At the start of generation, we assume $G_0$ contains a single node that represents a carbon atom.
Action Space. In our framework, we define a distinct, fixed-dimension and homogeneous action space amenable to reinforcement learning. We design an action analogous to link prediction, which is a well-studied problem in network science. We first define a set of scaffold subgraphs $\{C_1, \dots, C_s\}$ to be added during graph generation, and the collection is defined as $C = \bigcup_{i=1}^{s} C_i$. Given a graph $G_t$ at step $t$, we define the corresponding extended graph as $G_t \cup C$. Under this definition, an action can either correspond to connecting a new subgraph $C_i$ to a node in $G_t$ or connecting existing nodes within graph $G_t$. Once an action is taken, the remaining disconnected scaffold subgraphs are removed. In our implementation, we adopt the most fine-grained version where $C$ consists of all the different single-node graphs, one for each atom type. Note that $C$ can be extended to contain molecule substructure scaffolds, which allows the specification of preferred substructures but sacrifices the flexibility of the model. In Figure 1(b), a link is predicted between the green and yellow atoms. We will discuss in detail the link prediction algorithm in Section 3.4.
State Transition Dynamics. Domain-specific rules are incorporated in the state transition dynamics. The environment carries out actions that obey the given rules. Infeasible actions proposed by the policy network are rejected and the state remains unchanged. For the task of molecule generation, the environment incorporates rules of chemistry. In Figure 1(d), both actions pass the valency check, and the environment updates the (partial) molecule according to the actions. Note that unlike a text-based representation, the graph-based molecular representation enables us to perform this step-wise valency check, as it can be conducted even for incomplete molecular graphs.
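The rejection behaviour described above can be sketched as a transition function that applies a bond-addition action only if it passes a valency check, and otherwise returns the state unchanged. This is a minimal sketch with a hand-written valence table; the names `MAX_VALENCE`, `used_valence` and `step` are hypothetical, and the paper's environment relies on RDKit rather than on a table like this one.

```python
# Illustrative subset of maximum valences (an assumption of this sketch).
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}
BOND_ORDER = {"single": 1, "double": 2, "triple": 3}

def used_valence(bonds, atom_idx):
    """Sum of bond orders already attached to the given atom."""
    return sum(BOND_ORDER[t] for j, k, t in bonds if atom_idx in (j, k))

def step(atoms, bonds, action):
    """Apply (first, second, bond_type) and return (atoms, bonds, accepted).

    `second` is either the index of an existing atom or an element symbol,
    in which case a new atom of that element is attached.
    """
    first, second, bond_type = action
    new_atoms = list(atoms)
    if isinstance(second, str):
        new_atoms.append(second)
        second = len(new_atoms) - 1
    already_bonded = any({j, k} == {first, second} for j, k, _ in bonds)
    if first == second or already_bonded:
        return atoms, bonds, False
    order = BOND_ORDER[bond_type]
    for idx in (first, second):
        if used_valence(bonds, idx) + order > MAX_VALENCE[new_atoms[idx]]:
            return atoms, bonds, False  # infeasible action: state unchanged
    return new_atoms, bonds + [(first, second, bond_type)], True

atoms, bonds = ["C"], []                                    # start from one carbon
atoms, bonds, ok1 = step(atoms, bonds, (0, "O", "double"))  # C=O is accepted
atoms, bonds, ok2 = step(atoms, bonds, (1, "C", "single"))  # oxygen is saturated
```

The check runs on a partial molecule, which is the property of the graph representation that the paragraph above relies on.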
Reward design. Both intermediate rewards and final rewards are used to guide the behaviour of the RL agent. We define the final rewards as a sum over domain-specific rewards and adversarial rewards. The domain-specific rewards consist of the (combination of) final property scores, such as octanol-water partition coefficient (logP), druglikeness (QED) [1] and molecular weight (MW). Domain-specific rewards also include penalization of unrealistic molecules according to various criteria, such as excessive steric strain and the presence of functional groups that violate ZINC functional group filters [13]. The intermediate rewards include step-wise validity rewards and adversarial rewards. A small positive reward is assigned if the action does not violate valency rules, otherwise a small negative reward is assigned. As an example, the second row of Figure 1 shows the scenario in which a termination action is taken. When the environment updates according to a terminating action, both a step reward and a final reward are given, and the generation process terminates.
To ensure that the generated molecules resemble a given set of molecules, we employ the Generative Adversarial Network (GAN) framework [10] to define the adversarial rewards

$$\min_{\theta} \max_{\phi} V(\pi_\theta, D_\phi) = \mathbb{E}_{x \sim p_{data}}[\log D_\phi(x)] + \mathbb{E}_{x \sim \pi_\theta}[\log(1 - D_\phi(x))] \tag{1}$$

where $\pi_\theta$ is the policy network, $D_\phi$ is the discriminator network, $x$ represents an input graph, and $p_{data}$ is the underlying data distribution, which is defined either over final graphs (for final rewards) or intermediate graphs (for intermediate rewards). However, only $D_\phi$ can be trained with stochastic gradient descent, as $x$ is a graph object that is non-differentiable with respect to the policy parameters $\theta$. Instead, we use $-V(\pi_\theta, D_\phi)$ as an additional reward together with other rewards, and optimize the total rewards with policy gradient methods (Section 3.5). The discriminator network employs the same structure as the policy network (Section 3.4) to calculate the node embeddings, which are then aggregated into a graph embedding and cast into a scalar prediction.
3.4 Graph Convolutional Policy Network
Having illustrated the graph generation environment, we outline the architecture of GCPN, a policy network learned by the RL agent to act in the environment. GCPN takes the intermediate graph $G_t$ and the collection of scaffold subgraphs $C$ as inputs, and outputs the action $a_t$, which predicts a new link to be added, as described in Section 3.3.
Computing node embeddings. In order to perform link prediction in $G_t \cup C$, our model first computes the node embeddings of an input graph using Graph Convolutional Networks (GCN) [19, 5, 17, 35, 8], a well-studied technique that achieves state-of-the-art performance in representation learning for molecules. We use the following variant that supports the incorporation of categorical edge types. The high-level idea is to perform message passing over each edge type for a total of $L$ layers. At the $l$-th layer of the GCN, we aggregate all messages from different edge types to compute the next layer node embedding $H^{(l+1)} \in \mathbb{R}^{(n+c) \times k}$, where $n$, $c$ are the sizes of $G_t$ and $C$ respectively, and $k$ is the embedding dimension. More concretely,

$$H^{(l+1)} = \mathrm{AGG}\left(\mathrm{ReLU}\left(\left\{\tilde{D}_i^{-\frac{1}{2}} \tilde{E}_i \tilde{D}_i^{-\frac{1}{2}} H^{(l)} W_i^{(l)}\right\}, \forall i \in (1, \dots, b)\right)\right) \tag{2}$$

where $E_i$ is the $i$-th slice of the edge-conditioned adjacency tensor $E$, $\tilde{E}_i = E_i + I$, and $\tilde{D}_i$ is the diagonal degree matrix with entries $(\tilde{D}_i)_{jj} = \sum_k \tilde{E}_{i,j,k}$. $W_i^{(l)}$ is a trainable weight matrix for the $i$-th edge type, and $H^{(l)}$ is the node representation learned in the $l$-th layer. We use $\mathrm{AGG}(\cdot)$ to denote an aggregation function that could be one of $\{\mathrm{MEAN}, \mathrm{MAX}, \mathrm{SUM}, \mathrm{CONCAT}\}$ [11]. This variant of the GCN layer allows for parallel implementation while remaining expressive for aggregating information among different edge types. We apply an $L$-layer GCN to the extended graph $G_t \cup C$ to compute the final node embedding matrix $X = H^{(L)}$.
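The per-edge-type message passing described above can be sketched in plain Python as follows. This is a minimal sketch, not the authors' implementation: it assumes summation as the aggregation function, adds self-loops before symmetric degree normalization, and uses hypothetical helper names (`matmul`, `normalize`, `gcn_layer`) with nested lists in place of tensors.

```python
import math

def matmul(P, Q):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)]
            for row in P]

def normalize(E_i):
    """Return D^{-1/2} (E_i + I) D^{-1/2} for one edge-type slice."""
    n = len(E_i)
    E_tilde = [[E_i[j][k] + (1 if j == k else 0) for k in range(n)]
               for j in range(n)]
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in E_tilde]
    return [[inv_sqrt_deg[j] * E_tilde[j][k] * inv_sqrt_deg[k]
             for k in range(n)] for j in range(n)]

def gcn_layer(E, H, W):
    """One layer: sum over edge types i of ReLU(normalize(E_i) H W_i)."""
    n, k_out = len(H), len(W[0][0])
    out = [[0.0] * k_out for _ in range(n)]
    for E_i, W_i in zip(E, W):
        msg = matmul(matmul(normalize(E_i), H), W_i)
        for j in range(n):
            for c in range(k_out):
                out[j][c] += max(0.0, msg[j][c])  # ReLU, then SUM aggregation
    return out

# Two nodes, two edge types; only the first edge type has an edge.
E = [[[0, 1], [1, 0]], [[0, 0], [0, 0]]]
H = [[1.0, 0.0], [0.0, 1.0]]
W = [[[1.0, 0.0], [0.0, 1.0]]] * 2  # identity weights for both edge types
H_next = gcn_layer(E, H, W)
```

With identity features and weights, each node keeps its own signal through the self-loops and receives half of its neighbour's through the first edge type.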
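Action prediction. The link prediction based action $a_t$ at time step $t$ is a concatenation of four components: selection of two nodes, prediction of edge type, and prediction of termination. Concretely, each component is sampled according to a predicted distribution governed by Equation 3.

$$
\begin{aligned}
a_t &= \mathrm{CONCAT}(a_{first}, a_{second}, a_{edge}, a_{stop}) \\
f_{first}(s_t) &= \mathrm{SOFTMAX}(m_f(X)), & a_{first} &\sim f_{first}(s_t) \in \{0,1\}^{n} \\
f_{second}(s_t) &= \mathrm{SOFTMAX}(m_s(X_{a_{first}}, X)), & a_{second} &\sim f_{second}(s_t) \in \{0,1\}^{n+c} \\
f_{edge}(s_t) &= \mathrm{SOFTMAX}(m_e(X_{a_{first}}, X_{a_{second}})), & a_{edge} &\sim f_{edge}(s_t) \in \{0,1\}^{b} \\
f_{stop}(s_t) &= \mathrm{SOFTMAX}(m_t(\mathrm{AGG}(X))), & a_{stop} &\sim f_{stop}(s_t) \in \{0,1\}
\end{aligned} \tag{3}
$$

We use $m_f$ to denote a Multilayer Perceptron (MLP) that maps $X_{0:n} \in \mathbb{R}^{n \times k}$ to an $\mathbb{R}^{n}$ vector, which represents the probability distribution of selecting each node. The information from the first selected node $a_{first}$ is incorporated in the selection of the second node by concatenating its embedding $X_{a_{first}}$ with that of each node in $G_t \cup C$. The second MLP $m_s$ then maps the concatenated embedding to the probability distribution of each potential node to be selected as the second node. Note that when selecting two nodes to predict a link, the first node to select, $a_{first}$, should always belong to the currently generated graph $G_t$, whereas the second node to select, $a_{second}$, can be either from $G_t$ (forming a cycle), or from $C$ (adding a new substructure). To predict a link, $m_e$ takes $X_{a_{first}}$ and $X_{a_{second}}$ as inputs and maps them to a categorical edge type using an MLP. Finally, the termination probability is computed by first aggregating the node embeddings into a graph embedding using an aggregation function $\mathrm{AGG}$, and then mapping the graph embedding to a scalar using an MLP $m_t$.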
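The four-part sampling procedure described above can be sketched as follows. The MLPs are replaced by stand-in score lists and callables, so every score source here is hypothetical; the self-loop mask on the second node is also an assumption of this sketch rather than something stated in the text.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax; a score of -inf gets probability 0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def sample_action(first_scores, second_scores_fn, edge_scores_fn,
                  stop_scores, n, rng):
    """Sample (first, second, edge, stop) for a graph with n existing nodes."""
    # The first node must belong to the current graph: use the first n scores.
    a_first = sample(softmax(first_scores[:n]), rng)
    # The second node may be an existing node or a scaffold node.
    second_scores = list(second_scores_fn(a_first))
    second_scores[a_first] = float("-inf")  # assumed mask: no self-loops
    a_second = sample(softmax(second_scores), rng)
    a_edge = sample(softmax(edge_scores_fn(a_first, a_second)), rng)
    a_stop = sample(softmax(stop_scores), rng)
    return a_first, a_second, a_edge, a_stop

rng = random.Random(0)
n, c = 2, 3  # two existing atoms, three candidate scaffold nodes
action = sample_action(
    first_scores=[0.1, 0.4, 0.0, 0.0, 0.0],
    second_scores_fn=lambda first: [0.0] * (n + c),
    edge_scores_fn=lambda first, second: [1.0, 0.5, 0.1],
    stop_scores=[2.0, 0.0],
    n=n,
    rng=rng,
)
```

Conditioning the second choice on the first is what lets a single fixed-size output cover both ring closure and atom addition.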
3.5 Policy Gradient Training
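Policy gradient based methods are widely adopted for optimizing policy networks. Here we adopt Proximal Policy Optimization (PPO), one of the state-of-the-art policy gradient methods. The objective function of PPO is defined as follows

$$\max_{\theta} L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right], \quad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \tag{4}$$

where $r_t(\theta)$ is the probability ratio that is clipped to the range $[1-\epsilon, 1+\epsilon]$, making $L^{CLIP}(\theta)$ a lower bound of the conservative policy iteration objective [16], and $\hat{A}_t$ is the estimated advantage function, which involves a learned value function $V_\omega(\cdot)$ to reduce the variance of estimation. In GCPN, $V_\omega(\cdot)$ is an MLP that takes as input the graph embedding computed according to Section 3.4.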
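The clipped surrogate objective of PPO can be written in a few lines. The sketch below averages the clipped term over a batch of probability ratios and advantage estimates; the function name is hypothetical, and the clipping range of 0.2 is an assumed value, since the text only says that default hyperparameters are used.

```python
def ppo_clip_objective(ratios, advantages, epsilon=0.2):
    """Mean over samples of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    terms = []
    for r, adv in zip(ratios, advantages):
        clipped = min(max(r, 1.0 - epsilon), 1.0 + epsilon)
        terms.append(min(r * adv, clipped * adv))
    return sum(terms) / len(terms)

# A large ratio with a positive advantage is capped at (1 + eps) * A.
capped = ppo_clip_objective([1.5], [1.0])
# A large ratio with a negative advantage is not capped (pessimistic bound).
uncapped = ppo_clip_objective([1.5], [-1.0])
```

Taking the minimum of the clipped and unclipped terms is what makes the objective a pessimistic bound: the policy gains nothing from moving the ratio outside the clipping range in the favourable direction.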
It is known that pretraining a policy network with expert policies, if they are available, leads to better training stability and performance [23]. In our setting, any ground truth molecule could be viewed as an expert trajectory for pretraining GCPN. This expert imitation objective can be written as $\min_\theta L^{EXPERT}(\theta) = -\log(\pi_\theta(a_t \mid s_t))$, where $(s_t, a_t)$ pairs are obtained from ground truth molecules. Specifically, given a molecule dataset, we randomly sample a molecular graph $G$, and randomly select one connected subgraph $G'$ of $G$ as the state $s_t$. At state $s_t$, any action that adds an atom or bond in $G \setminus G'$ can be taken in order to generate the sampled molecule. Hence, we randomly sample $a_t \in G \setminus G'$, and use the pair $(s_t, a_t)$ to supervise the expert imitation objective.
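One possible way to draw such an expert state-action pair is sketched below: grow a random connected subgraph bond by bond, then pick any remaining bond that touches it as the expert action. The function names are hypothetical, the molecule is assumed to be connected with at least one bond, and the sampling is not claimed to be uniform over connected subgraphs or to match the authors' procedure.

```python
import random

def frontier(bonds, state, nodes):
    """Bonds not yet in the state that touch at least one state node."""
    return [e for e in bonds
            if e not in state and (e[0] in nodes or e[1] in nodes)]

def sample_expert_pair(bonds, rng):
    """Return (state_nodes, state_bonds, action) from a connected molecule."""
    nodes = {rng.choice(bonds)[0]}    # random seed atom
    state = []
    keep = rng.randrange(len(bonds))  # keep 0 .. len(bonds) - 1 bonds
    while len(state) < keep:
        bond = rng.choice(frontier(bonds, state, nodes))
        state.append(bond)
        nodes.update(bond[:2])
    # At least one bond is left out, and connectivity guarantees it touches
    # the state, so the frontier is never empty here.
    action = rng.choice(frontier(bonds, state, nodes))
    return sorted(nodes), state, action

# Toy four-atom chain given as (atom, atom, bond_type) triples.
molecule = [(0, 1, "single"), (1, 2, "single"), (2, 3, "double")]
state_nodes, state_bonds, action = sample_expert_pair(molecule, random.Random(0))
```

The sampled action is then the supervision target whose log-likelihood under the policy is maximized.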
4 Experiments
To demonstrate the effectiveness of goal-directed search for molecules with desired properties, we compare our method with state-of-the-art molecule generation algorithms in the following tasks.
Property Optimization. The task is to generate novel molecules whose specified molecular properties are optimized. This can be useful in many applications such as drug discovery and materials science, where the goal is to identify molecules with highly optimized properties of interest.
Property Targeting. The task is to generate novel molecules whose specified molecular properties are as close to the target scores as possible. This is crucial in generating virtual libraries of molecules with properties that are generally suitable for a desired application. For example, a virtual molecule library for drug discovery should have high drug-likeness and synthesizability.
Constrained Property Optimization. The task is to generate novel molecules whose specified molecular properties are optimized, while also containing a specified molecular substructure. This can be useful in lead optimization problems in drug discovery and materials science, where we want to make modifications to a promising lead molecule and improve its properties [2].
4.1 Experimental Setup
We outline our experimental setup in this section. Further details are provided in the appendix.
Dataset. For the molecule generation experiments, we utilize the ZINC250k molecule dataset [13], which contains 250,000 drug-like, commercially available molecules whose maximum atom number is 38. We use the dataset for both expert pretraining and adversarial training.
Molecule environment. We set up the molecule environment as an OpenAI Gym environment [3] using RDKit [22] and adapt it to the ZINC250k dataset. Specifically, the maximum atom number is set to be 38. There are 9 atom types and 3 edge types, as molecules are represented in kekulized form. We design the reward so that, empirically, the intermediate valency reward, intermediate adversarial reward, final adversarial reward, final chemical filter reward and final chemical property reward stay in a roughly fixed ratio to one another when those rewards reach their optimal values.
GCPN Setup. We use a 3-layer GCPN, as defined in Section 3.4, as the policy network, with node embeddings of the same dimension in all hidden layers, and batch normalization [12] is applied after each layer. Another 3-layer GCN with the same architecture is used for discriminator training. We observe comparable performance among different aggregation functions and select one of them for all experiments. We found both the expert pretraining and the RL objective important for generating high quality molecules, thus both of them are kept throughout training. Specifically, we use the PPO algorithm to train the RL objective with the default hyperparameters, and the learning rate is set to 0.001. The expert pretraining objective is trained with a learning rate of 0.00025. Both objectives are trained using the Adam optimizer [18] with batch size 32.
Baselines. We compare our method with the following state-of-the-art baselines. Junction Tree VAE (JT-VAE) [15] is a state-of-the-art algorithm that combines graph representation and a VAE framework for generating molecular graphs, and uses Bayesian optimization over the learned latent space to search for molecules with optimized property scores. JT-VAE has been shown to outperform previous deep generative models for molecule generation, including Character-VAE [9], Grammar-VAE [21], SD-VAE [4] and GraphVAE. We also compare our approach with ORGAN [26], a state-of-the-art RL-based molecule generation algorithm using a text-based representation of molecules. We run both baselines using their released code and tune the hyper-parameters such that all the tasks can be finished within manageable time.
4.2 Molecule Generation Results
Property optimization. In this task, we focus on generating molecules with the highest possible penalized logP and QED [1] scores. Penalized logP is a logP score that also accounts for ring size and synthetic accessibility [6], while QED is an indicator of drug-likeness. Note that penalized logP has an unbounded range, while QED has a range of $[0, 1]$ by definition, thus directly comparing the percentage improvement of QED may not be meaningful. We adopt the same evaluation method as previous approaches [21, 4, 15], reporting the best 3 property scores found by each model and the fraction of molecules that satisfy chemical validity. Table 1 summarizes the best property scores of molecules found by each model, and the statistics of ZINC250k are also shown for comparison. Our method consistently performs significantly better than previous methods when optimizing penalized logP, with substantial average improvements over both JT-VAE and ORGAN. Our method outperforms all the baselines in the QED optimization task as well.
Compared with ORGAN, our model can achieve a perfect validity ratio due to the molecular graph representation that allows for step-wise chemical valency check. Compared to JT-VAE, our model can reach a much higher score owing to the fact that RL allows for direct optimization of a given property score and is able to easily extrapolate beyond the given dataset. Visualizations of generated molecules with optimized logP and QED scores are displayed in Figure 2(a) and (b) respectively. Although most generated molecules are realistic, in some very rare cases our method can generate unrealistic molecules with astonishingly high penalized logP, such as the one on the bottom-right of Figure 2(a). Unlike previous methods, GCPN is able to find these exceptional cases that reveal the flaw of the property score.
Property Targeting. In this task, we specify a target range for molecular weight (MW) and logP, and report the percentage of generated molecules with property scores within the range, as well as the diversity of generated molecules. The diversity of a set of molecules is defined as the average pairwise Tanimoto distance between the Morgan fingerprints of the molecules. The RL reward for this task is the L1 distance between the property score of a generated molecule and the range center. To increase the difficulty, we set the target ranges such that few molecules in the ZINC250k dataset fall within them, which tests the ability of the methods to extrapolate when optimizing for a given target. Four target ranges are considered; they are listed with the results in Table 2.
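The diversity measure defined above can be sketched as follows, with each fingerprint represented as the set of its on-bit indices. The function names and the toy fingerprints are hypothetical; in practice the bit sets would come from Morgan fingerprints computed by a cheminformatics toolkit.

```python
from itertools import combinations

def tanimoto_distance(fp_a, fp_b):
    """1 - |A intersect B| / |A union B| for two sets of on-bit indices."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(fp_a & fp_b) / union

def diversity(fingerprints):
    """Average pairwise Tanimoto distance; needs at least two fingerprints."""
    pairs = list(combinations(fingerprints, 2))
    return sum(tanimoto_distance(a, b) for a, b in pairs) / len(pairs)

# Toy fingerprints: the first two overlap, the third shares no bits.
fps = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
score = diversity(fps)
```

A score near 1 means the generated molecules share few substructures, while a score near 0 indicates that the policy has collapsed onto near-duplicates.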
As shown in Table 2, GCPN has a significantly higher success rate in generating molecules with properties within the target range, compared to baseline methods. In addition, GCPN is able to generate molecules with high diversity, indicating that it is capable of learning a general stochastic policy to generate molecular graphs that fulfill the property requirements.
Constrained Property Optimization. In this experiment, we optimize the penalized logP while constraining the generated molecules to contain one of the 800 ZINC molecules with low penalized logP, following the evaluation in JT-VAE. Since JT-VAE cannot constrain the generated molecule to have a certain structure, we adopt their evaluation method, where the constraint is relaxed such that the molecule similarity between the original and modified molecules is above a threshold $\delta$.
We train a fixed GCPN in an environment whose initial state is randomly set to be one of the 800 ZINC molecules, then conduct the same training procedure as in the property optimization task. Over the 800 molecules, the mean and standard deviation of the highest property score improvement and the corresponding similarity between the original and modified molecules are reported in Table 3. Our model significantly outperforms JT-VAE with 184% higher penalized logP improvement on average, and consistently succeeds in discovering molecules with higher logP scores. Also note that JT-VAE performs separate optimization steps for each given molecule constraint. In contrast, GCPN can generalize well: it learns a general policy to improve property scores, and applies the same policy to all 800 molecules. Figure 2(c) shows that GCPN can modify ZINC molecules to achieve a high penalized logP score while still containing the substructure of the original molecule.
5 Conclusion
We introduced GCPN, a graph generation policy network using graph state representation and adversarial training, and applied it to the task of goal-directed molecular graph generation. GCPN consistently outperforms other state-of-the-art approaches in the tasks of molecular property optimization and targeting, and at the same time, maintains validity and resemblance to realistic molecules. Furthermore, the application of GCPN can extend well beyond molecule generation. The algorithm can be applied to generate graphs in many contexts, such as electric circuits and social networks, and to explore graphs that can optimize certain domain-specific properties.
-  G. R. Bickerton, G. V. Paolini, J. Besnard, S. Muresan, and A. L. Hopkins. Quantifying the chemical beauty of drugs. Nature chemistry, 4(2):90, 2012.
-  K. H. Bleicher, H.-J. Böhm, K. Müller, and A. I. Alanine. Hit and lead generation: beyond high-throughput screening. Nature Reviews Drug Discovery, 2:369–378, 2003.
-  G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. Openai gym. CoRR, abs/1606.01540, 2016.
-  H. Dai, Y. Tian, B. Dai, S. Skiena, and L. Song. Syntax-directed variational autoencoder for structured data. arXiv preprint arXiv:1802.08786, 2018.
-  D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, 2015.
-  P. Ertl. Estimation of synthetic accessibility score of drug-like molecules. J. Cheminform, 2009.
-  P. Ertl, R. Lewis, E. J. Martin, and V. Polyakov. In silico generation of novel, drug-like chemical matter using the LSTM neural network. CoRR, abs/1712.07449, 2017.
-  J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry, 2017.
-  R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 2016.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, 2014.
-  W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, 2017.
-  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 448–456, Lille, France, 07–09 Jul 2015. PMLR.
-  J. J. Irwin, T. Sterling, M. M. Mysinger, E. S. Bolstad, and R. G. Coleman. Zinc: a free tool to discover chemistry for biology. Journal of chemical information and modeling, 52(7):1757–1768, 2012.
-  E. Jannik Bjerrum and R. Threlfall. Molecular Generation with Recurrent Neural Networks (RNNs). arXiv preprint arXiv:1705.04612, 2017.
-  W. Jin, R. Barzilay, and T. Jaakkola. Junction tree variational autoencoder for molecular graph generation. arXiv preprint arXiv:1802.04364, 2018.
-  S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, 2002.
-  S. Kearnes, K. McCloskey, M. Berndl, V. Pande, and P. Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of Computer-Aided Molecular Design, 30:595–608, Aug. 2016.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.
-  T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2016.
-  P. Kirkpatrick and C. Ellis. Chemical space. Nature, 432:823, Dec 2004.
-  M. J. Kusner, B. Paige, and J. M. Hernández-Lobato. Grammar variational autoencoder. In D. Precup and Y. W. Teh, editors, International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
-  G. Landrum. RDKit: Open-source cheminformatics. 2006.
-  S. Levine and V. Koltun. Guided policy search. In International Conference on Machine Learning, 2013.
-  Y. Li, O. Vinyals, C. Dyer, R. Pascanu, and P. Battaglia. Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324, 2018.
-  Y. Li, L. Zhang, and Z. Liu. Multi-Objective De Novo Drug Design with Conditional Graph Generative Model. ArXiv e-prints, Jan. 2018.
-  G. Lima Guimaraes, B. Sanchez-Lengeling, C. Outeiral, P. L. Cunha Farias, and A. Aspuru-Guzik. Objective-Reinforced Generative Adversarial Networks (ORGAN) for Sequence Generation Models. ArXiv e-prints, May 2017.
-  J. H. Lin and A. Y. H. Lu. Role of pharmacokinetics and metabolism in drug discovery and development. Pharmacological Reviews, 49(4):403–449, 1997.
-  C. A. Lipinski, F. Lombardo, B. W. Dominy, and P. J. Feeney. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Advanced Drug Delivery Reviews, 23(1):3–25, 1997.
-  B. Liu, B. Ramsundar, P. Kawthekar, J. Shi, J. Gomes, Q. Luu Nguyen, S. Ho, J. Sloane, P. Wender, and V. Pande. Retrosynthetic reaction prediction using neural sequence-to-sequence models. ACS Central Science, 3(10):1103–1113, 2017.
-  M. Olivecrona, T. Blaschke, O. Engkvist, and H. Chen. Molecular de-novo design through deep reinforcement learning. Journal of Cheminformatics, 9(1):48, Sep 2017.
-  P. G. Polishchuk, T. I. Madzhidov, and A. Varnek. Estimation of the size of drug-like chemical space based on GDB-17 data. Journal of Computer-Aided Molecular Design, 27(8):675–679, Aug 2013.
-  D. Rogers and M. Hahn. Extended-connectivity fingerprints. Journal of Chemical Information and Modeling, 50(5):742–754, 2010.
-  B. Sanchez-Lengeling, C. Outeiral, G. L. Guimaraes, and A. Aspuru-Guzik. Optimizing distributions over molecular space. An Objective-Reinforced Generative Adversarial Network for Inverse-design Chemistry (ORGANIC). ChemRxiv e-prints, 8 2017.
-  J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
-  K. T. Schütt, F. Arbabzadah, S. Chmiela, K. R. Müller, and A. Tkatchenko. Quantum-chemical insights from deep tensor neural networks. Nature Communications, 8:13890, Jan 2017.
-  M. D. Segall. Multi-parameter optimization: Identifying high quality compounds with a balance of properties. Current Pharmaceutical Design, 18(9):1292–1310, 2012.
-  M. H. S. Segler, T. Kogej, C. Tyrchan, and M. P. Waller. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Central Science, 4(1):120–131, 2018.
-  M. Simonovsky and N. Komodakis. GraphVAE: Towards generation of small graphs using variational autoencoders. arXiv preprint arXiv:1802.03480, 2018.
-  D. Weininger. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1):31–36, 1988.
-  X. Yang, J. Zhang, K. Yoshizoe, K. Terayama, and K. Tsuda. ChemTS: An Efficient Python Library for de novo Molecular Generation. ArXiv e-prints, Sept. 2017.
-  J. You, R. Ying, X. Ren, W. L. Hamilton, and J. Leskovec. GraphRNN: A deep generative model for graphs. arXiv preprint arXiv:1802.08773, 2018.
-  L. Yu, W. Zhang, J. Wang, and Y. Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, 2017.