ChemoVerse: Manifold traversal of latent spaces for novel molecule discovery
To design a more potent and effective chemical entity, it is essential to identify molecular structures with the desired chemical properties. Generative models based on neural networks are now widely used by emerging startups and researchers in this domain to design virtual libraries of drug-like compounds. Although these models help scientists produce novel molecular structures rapidly, intelligently exploring the latent spaces of generative models, and thereby reducing the randomness of the generative procedure, remains a challenge. In this work we present a manifold traversal method with heuristic search to explore the latent chemical space. Different heuristics and scores such as the Tanimoto coefficient, synthetic accessibility, binding activity, and QED drug-likeness can be incorporated to increase the validity of the generated molecules and their proximity to desired molecular properties. To evaluate the manifold traversal, we produce the latent chemical space using various generative models such as grammar variational autoencoders (with and without attention), as they address the randomized generation and validity of compounds. With this novel traversal method, we are able to find more unseen compounds and more specific regions to mine in the latent space. Finally, these components are brought together in a simple platform allowing users to search, visualize, and select novel generated compounds.
Designing a new chemical entity is a time-consuming, expensive, and error-prone task. Pharmaceutical companies invest billions of dollars in screening vast libraries of chemical compounds for hit and lead identification. The past few years have seen the rise of deep generative models that can operate over large spaces of molecular structures and embed their chemical properties into a vector space. By decoding from this ’latent’ space of chemical structure we can generate new, previously unidentified chemical compounds.
In the domain of computational chemistry, the Simplified Molecular-Input Line-Entry System (SMILES) is a common text-based method for encoding and representing molecular structures. This facilitates the use of models more commonly used in natural language processing. SMILES strings have therefore been used as raw input to generative models, which are given the task of encoding and decoding the SMILES string directly. Advances were made by employing variational autoencoders (VAEs), neural networks composed of an encoder that transforms a compound’s representation into a compressed latent space, and a decoder that generates compounds from the latent space. However, directed search of the resulting latent space is difficult. To counter this, conditional variational autoencoders (CVAEs) were used to facilitate the generation of new molecules with specified molecular properties. This is achieved by incorporating the molecular properties of a compound into the encoder layer, which helps in the generation of more drug-like molecules. Generative adversarial networks (GANs) have also been applied in the same manner, and have recently been combined with reinforcement learning and graph representations of molecules to optimize the generation of molecules with specified molecular properties.
Discovery in the latent space generated by these models is often performed using random sampling and linear interpolation, primarily because these methods are easy to implement. However, this is not suitable for most generative models, as their latent spaces are generally high dimensional and sparse. A traversal can therefore pass through regions where the data is poorly represented. In other words, it can enter a ’dead zone’, as the molecular samples in the training dataset occupy only a subset of the latent space. Decoding a point from such a region will return noisy or invalid results. It can also be challenging to incorporate contextual domain information during search, and as a result discovery of compounds with specific properties is often very inconsistent.
In this work we implemented various flavours of autoencoders as the generative model for producing sets of latent spaces, and in particular show that our novel implementation of Grammar VAE with an additional attention mechanism performs well, with a low rate of invalid molecules generated. We also introduce a novel manifold interpolation method employing the Riemannian metric in conjunction with a set of molecular property heuristics to perform directed search and interpolation of these latent spaces in order to design novel molecules with desired properties. This combination of generation and exploration of latent space has enabled us not only to design molecules which have not been seen before but also to explore new regions of latent chemical space where more potent chemical compounds may exist.
2 System Architecture
In this section we describe the various components of our system architecture: generation of latent spaces and our algorithm for manifold traversal.
We used a dataset of 250,000 molecules drawn from the ZINC dataset, and an additional 100,000 drawn from the ChEMBL dataset. These two datasets comprise commercially available drug molecules and have been used in related work using models like variational autoencoders (VAEs). Molecules are represented in canonical SMILES string format, and are further processed into 1) a one-hot character encoding and 2) a set of context-free grammar (CFG) rules. Grammar rules are obtained from the OpenSMILES specification, which specifies how a SMILES representation is formed from these rules. This CFG consists of 76 production rules; we add seven rules and modify a further nine in order to represent the more complex ChEMBL dataset.
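The one-hot character encoding mentioned above can be illustrated with a minimal pure-Python sketch. It is not the preprocessing code used in this work: the helper names `build_vocab` and `one_hot`, the padding convention (index 0), and the purely character-level tokenization (which splits two-letter atoms such as `Cl` into two symbols) are all assumptions made for illustration.

```python
def build_vocab(smiles_list):
    """Map every distinct character to an index; index 0 is reserved for padding."""
    chars = sorted({ch for s in smiles_list for ch in s})
    return {ch: i + 1 for i, ch in enumerate(chars)}


def one_hot(smiles, vocab, max_len):
    """Encode a SMILES string as a max_len x (len(vocab) + 1) matrix of 0/1 rows.

    Positions past the end of the string are encoded as the padding symbol.
    """
    if len(smiles) > max_len:
        raise ValueError("SMILES string is longer than max_len")
    width = len(vocab) + 1
    rows = []
    for pos in range(max_len):
        row = [0] * width
        index = vocab[smiles[pos]] if pos < len(smiles) else 0
        row[index] = 1
        rows.append(row)
    return rows


# Hypothetical usage on two small molecules (ethanol and benzene).
vocab = build_vocab(["CCO", "c1ccccc1"])
matrix = one_hot("CCO", vocab, max_len=5)
```

Each row contains exactly one non-zero entry, so the matrix can be fed directly to a convolutional or recurrent encoder.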
2.2 Latent Space Generation
Three models are implemented in our system: a VAE, a Grammar VAE, and a Grammar VAE with self-attention. As our search algorithm is model agnostic, latent spaces can be substituted with minimal effort. However, results will vary depending on the underlying data, model architecture, and training parameters used to generate each latent space.
The ChEMBL dataset is less standardized and contains more complex molecules; we therefore perform transfer learning by first training each model on the ZINC dataset for 50 epochs and then switching to the ChEMBL dataset for a further 50 epochs. Training, validation, and test sets of 85%, 10%, and 5% respectively were used, with the test set composed entirely of ChEMBL molecules. We use the Adam optimizer with an initial learning rate of 0.001 and a learning rate scheduler, with a factor of 0.1, that is activated after 15 epochs.
The encoder is composed of three 1D convolutional layers with filters of size 9, 10, and 11 respectively, while the decoder is composed of three gated recurrent unit (GRU) layers of 501 units each. The structural validity of the generated compounds is checked using the open-source RDKit library. Examples of test set compounds that have been encoded and decoded are shown in Table 1.
As noted in the original Grammar VAE work, vanilla VAE architectures that encode SMILES strings directly generally produce a very low valid decode rate on larger molecular datasets: just 17% using a conditional VAE under their Bayesian optimization search methodology. By instead using a Grammar VAE to generate production rules of a grammar rather than SMILES strings directly, a much higher rate of valid compounds was attained. However, because the OpenSMILES grammar is context-free, it is still unable to model certain features of SMILES that are not context-free, such as paired ring bonds; for example, the SMILES string ’c1ccccc1C2CCCC2’ would be decoded as ’c1ccccc1C2CCCC’, incorrectly dropping the final paired digit. Incorporating a self-attention layer in the Grammar VAE architecture mitigated this effect and increased the validity of decoded test set molecules from 61% to 70%.
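The ring-bond failure described above can be detected with a simple pairing check. The sketch below is only an illustration of that one failure mode and is not a substitute for the RDKit validity check used in this work; the function name `unpaired_ring_labels` is hypothetical, and the check deliberately ignores every other SMILES validity rule.

```python
def unpaired_ring_labels(smiles):
    """Return the ring-closure labels that are opened but never closed.

    Single digits and two-digit '%nn' labels are tracked. Text inside
    bracket atoms is skipped, because digits there denote isotopes,
    hydrogen counts, or charges rather than ring bonds.
    """
    open_labels = set()
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            end = smiles.find("]", i)
            if end == -1:
                break  # unterminated bracket atom: stop scanning
            i = end + 1
            continue
        two = smiles[i + 1:i + 3]
        if ch == "%" and len(two) == 2 and two.isdigit():
            label = two
            i += 3
        elif ch.isdigit():
            label = ch
            i += 1
        else:
            i += 1
            continue
        # The first occurrence opens a ring bond, the second one closes it.
        open_labels ^= {label}
    return open_labels


intact = unpaired_ring_labels("c1ccccc1C2CCCC2")
truncated = unpaired_ring_labels("c1ccccc1C2CCCC")
```

An empty result means every ring-bond digit is paired; the truncated string from the example above leaves label `2` open.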
Table 1: Examples of test set compounds that have been encoded and decoded.

| Actual compound | Generated compound | Similarity |
|---|---|---|

Table 2: Samples of compounds generated by manifold traversal between the diabetes and lung cancer regions of the latent space.

| Generated compound | Activity range | SA score | Molecular weight | Potential label |
|---|---|---|---|---|
| CS1(C2)CC(C(C3C(C))C)[C@]1n1C2n1C3C4CC4C | Less than 5 | 6.186 | 299.89 | DIABETES |
| CCC1(C)C(C)CCCCS1C(C1(O([C@+2](C1))))S | Between 5 and 7 | 6.251 | 274.49 | DIABETES |
| CSCCCC2(C=C(O))S(C=O)C2C1[C@@](C1(C))CC | Between 5 and 7 | 5.990 | 301.49 | DIABETES |
| CC1(N(S))NN1CCCNc1ncnc2CCn2c1CC | Greater than 7 | 5.233 | 299.89 | LUNG CANCER |
| CS1(C2)CC(C(C3C(C))C)[C@]1n1C2n1C3C4CC4Cl | Less than 5 | 6.186 | 309.44 | LUNG CANCER |
| CCCC1CCC(S)N(C1)C(O)NCC | Less than 5 | 4.355 | 232.93 | DIABETES |
2.3 Manifold Traversal of Latent Space
Once a latent space has been generated we can apply our manifold traversal algorithm to generate interpolative paths in the latent space in order to decode molecules with desired properties. A common approach here is to use linear or spherical interpolation; however, both approaches assume that the latent space is Euclidean and flat, and generally produce noisier results. In our algorithm, source and destination points are selected in the latent space; these can be either latent space encodings of single molecules or cluster centroids of molecules labeled with a desired property. Points of interest are the set of known molecules with desired properties, for example all molecules used in the treatment of a specific condition. The goal is to define a path from source to destination points in the latent space through regions of interest, factoring in any additional user-specified heuristics such as synthetic accessibility, binding activity, or drug-likeness that will augment generated molecules.
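For reference, the two baseline schemes mentioned above, linear and spherical interpolation, can be sketched in a few lines of pure Python. This is an illustration rather than the code used in this work; the names `lerp`, `slerp`, and `interpolate` are illustrative, latent vectors are assumed to be plain lists of floats with non-zero norm, and at least two points are requested.

```python
import math


def lerp(a, b, t):
    """Linear interpolation between latent vectors a and b at t in [0, 1]."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def slerp(a, b, t):
    """Spherical interpolation along the great arc between a and b."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))
    sin_omega = math.sin(omega)
    if abs(sin_omega) < 1e-8:
        # Nearly parallel or anti-parallel vectors: fall back to lerp.
        return lerp(a, b, t)
    weight_a = math.sin((1.0 - t) * omega) / sin_omega
    weight_b = math.sin(t * omega) / sin_omega
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


def interpolate(a, b, n, method=lerp):
    """Return n >= 2 evenly spaced points from a to b using the given method."""
    return [method(a, b, i / (n - 1)) for i in range(n)]


path = interpolate([1.0, 0.0], [0.0, 1.0], 5, method=slerp)
```

Neither scheme looks at the decoder, which is why both can cross regions of the latent space that decode to invalid molecules.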
Interpolation is performed by first calculating the Jacobian distances for all points of interest. This helps us to understand how much each latent space point differs from another based on the representation learnt by the model, and understand the stretching and rotational transformations of the local neighborhood of each point with respect to other points.
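One way to make the Jacobian distance concrete is the pullback metric G(z) = J(z)^T J(z), where J is the Jacobian of the decoder at latent point z; the length of a latent segment is then the accumulated value of sqrt(dz^T G dz). The sketch below is a hedged illustration under several assumptions: `toy_decoder` is a hypothetical stand-in for a trained decoder, the Jacobian is approximated by central finite differences rather than automatic differentiation, and the path between two points is taken to be a straight latent segment.

```python
import math


def toy_decoder(z):
    """Hypothetical decoder mapping a 2D latent point onto a curved surface."""
    x, y = z
    return [x, y, x * x + y * y]


def jacobian(f, z, eps=1e-6):
    """Central-difference Jacobian of f at z, as rows of d f_k / d z_i."""
    columns = []
    for i in range(len(z)):
        z_plus, z_minus = list(z), list(z)
        z_plus[i] += eps
        z_minus[i] -= eps
        f_plus, f_minus = f(z_plus), f(z_minus)
        columns.append([(p - m) / (2.0 * eps) for p, m in zip(f_plus, f_minus)])
    n_out = len(columns[0])
    return [[columns[i][k] for i in range(len(z))] for k in range(n_out)]


def metric(f, z):
    """Pullback metric G = J^T J at latent point z."""
    J = jacobian(f, z)
    d = len(z)
    return [[sum(row[i] * row[j] for row in J) for j in range(d)] for i in range(d)]


def jacobian_distance(f, a, b, steps=50):
    """Length of the straight latent segment a -> b measured under G."""
    dz = [(bi - ai) / steps for ai, bi in zip(a, b)]
    total = 0.0
    for s in range(steps):
        midpoint = [ai + (s + 0.5) * d for ai, d in zip(a, dz)]
        G = metric(f, midpoint)
        quad = sum(dz[i] * G[i][j] * dz[j]
                   for i in range(len(dz)) for j in range(len(dz)))
        total += math.sqrt(max(quad, 0.0))
    return total


distance = jacobian_distance(toy_decoder, [0.0, 0.0], [1.0, 0.0])
```

For this toy decoder the Euclidean latent distance between the two points is 1, while the distance under the metric is larger because the decoded surface curves upward along the segment.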
A k-dimensional (k-d) tree is built using the resulting Jacobian distances as edge weights between compounds in the points of interest, thereby placing compounds with greater structural similarity in closer proximity on the tree. A k-d tree is chosen because it halves the search domain at each level, so a node can be found in logarithmic time, which makes the data structure efficient at run time. Edges can also be weighted by user-specified domain heuristics, adding a weighted cost to augment the paths produced so that the generated molecules are more relevant for a specific target. These heuristics are listed below:
Fingerprint Similarity: A fingerprint is a series of binary digits (bits) that represent the presence or absence of particular substructures in the molecule. The similarity can be tested using cosine or Tanimoto distance metrics. The Tanimoto similarity takes into account the structural properties whereas cosine does not.
Synthetic Accessibility (SA Score): A molecule’s synthetic accessibility is a score between 1 (easy to produce) and 10 (very difficult to produce). It is calculated based on fragment contributions and molecular complexity. The absolute difference between the scores of two molecules is taken into account.
Drug-likeness: This is a score which reflects whether the molecule is ’drug-like’. It is evaluated using several parameters such as molecular weight, solubility in water, or lipophilic efficiency. The absolute difference between the scores of two molecules is taken into account.
Binding activity: This measures the potency of a potential drug compound against a target; less than 5 is considered inactive, 5-7 of intermediate activity, and greater than 7 active.
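The heuristics listed above can be combined into a single edge cost in many ways. The sketch below shows one simple possibility and should be read as an assumption, not as the weighting used in this work: fingerprints are represented as sets of on-bit indices, the additive weighting scheme and the dictionary keys `fp`, `sa`, and `qed` are hypothetical, and only the activity thresholds are taken from the text.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def activity_class(activity):
    """Bucket a binding activity value using the thresholds given above."""
    if activity < 5:
        return "inactive"
    if activity <= 7:
        return "intermediate"
    return "active"


def edge_cost(base_distance, mol_a, mol_b, weights):
    """Add weighted heuristic penalties to a base latent-space distance.

    mol_a and mol_b are dicts with hypothetical keys 'fp' (set of on-bits),
    'sa' (synthetic accessibility score), and 'qed' (drug-likeness score).
    """
    cost = base_distance
    cost += weights.get("fingerprint", 0.0) * (1.0 - tanimoto(mol_a["fp"], mol_b["fp"]))
    cost += weights.get("sa", 0.0) * abs(mol_a["sa"] - mol_b["sa"])
    cost += weights.get("qed", 0.0) * abs(mol_a["qed"] - mol_b["qed"])
    return cost


# Hypothetical molecules and weights, for illustration only.
mol_a = {"fp": {1, 2, 3}, "sa": 3.0, "qed": 0.5}
mol_b = {"fp": {2, 3, 4}, "sa": 5.0, "qed": 0.75}
cost = edge_cost(1.0, mol_a, mol_b, {"fingerprint": 2.0, "sa": 0.5})
```

Setting a weight to zero, or omitting it, switches the corresponding heuristic off, which keeps the search usable when only some properties are known.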
Yen’s algorithm in combination with the A* algorithm is then applied on the k-d tree to find the shortest path from source to destination given the user constraints. Once the shortest path is found, we interpolate along this path equidistantly and decode the resulting latent space points using the generator to generate compounds. Multiple paths can be found by taking the shortest path and either perturbing it or changing the number of interpolation points between the source and the destination; intuitively, this increases the overall number of novel generated compounds.
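The path search and equidistant interpolation steps can be sketched as follows. To keep the example short it uses plain Dijkstra search (equivalent to A* with a zero heuristic) for a single shortest path, rather than the combination of Yen’s algorithm and A* described above, and it spaces the interpolation points evenly in Euclidean latent coordinates. The graph layout, node names, and function names are hypothetical.

```python
import heapq
import math


def shortest_path(graph, source, target):
    """Dijkstra search on graph = {node: {neighbour: cost}}; None if unreachable."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == target:
            break
        for neighbour, cost in graph.get(node, {}).items():
            candidate = d + cost
            if candidate < dist.get(neighbour, math.inf):
                dist[neighbour] = candidate
                prev[neighbour] = node
                heapq.heappush(heap, (candidate, neighbour))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


def resample(points, n):
    """Return n >= 2 points spaced evenly by arc length along a polyline."""
    segments = [math.dist(p, q) for p, q in zip(points, points[1:])]
    total = sum(segments)
    result = []
    for k in range(n):
        remaining = total * k / (n - 1)
        i = 0
        while i < len(segments) - 1 and remaining > segments[i]:
            remaining -= segments[i]
            i += 1
        t = min(1.0, remaining / segments[i]) if segments[i] > 0 else 0.0
        p, q = points[i], points[i + 1]
        result.append([a + t * (b - a) for a, b in zip(p, q)])
    return result


# Hypothetical three-node graph with 2D latent coordinates.
graph = {"A": {"B": 1.0, "C": 5.0}, "B": {"C": 1.0}, "C": {}}
coords = {"A": [0.0, 0.0], "B": [1.0, 0.0], "C": [1.0, 1.0]}
route = shortest_path(graph, "A", "C")
latent_points = resample([coords[name] for name in route], 5)
```

Each entry of `latent_points` would then be passed to the decoder; perturbing `coords` or changing the number of points gives additional candidate paths.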
Manifold traversal is more useful than linear or spherical interpolation because it gives users greater flexibility in path exploration under various conditions, and empirically it yields a much higher rate of valid decoded molecules. For example, for the diabetes and lung cancer centroids, linear interpolation with 100 equidistant points decoded along the path between the centroids generated just 3 compounds with valid structures. In contrast, applying manifold traversal with fingerprint similarity as the heuristic, together with Yen’s algorithm, and perturbing the source and destination points produced 4 different paths. These four paths generated a total of 156 valid, novel compounds along the interpolated manifold between the latent regions of diabetes and lung cancer labeled molecules. Samples of these generated compounds can be seen in Table 2. Specific regions to mine within the latent space can be found by plotting different paths, bounding them, and finding the overlap region where compounds with the right structure and specific characteristics can be found.
3 Conclusions and Future Work
In this work we presented a model-agnostic platform for performing manifold traversal on latent spaces with user-specified domain heuristics. This interpolation method allows us to add more context and direction to the search and discovery of molecules in the latent space. Methods for exploring latent spaces generated from datasets of millions of molecules provide an extremely valuable tool for virtual drug screening and can facilitate rapid drug discovery.
Some avenues for future work in this domain include: implementation of alternative models to produce latent spaces of various characteristics; more sophisticated methods for curve fitting in high dimensional spaces, such as Bézier curves and Gaussian regression; and alternative search methods, such as evolutionary and genetic algorithms on the latent space. Future work would also focus on implementing latent space evaluation metrics using this interpolation method to understand the underlying aspects of these spaces.
- This work was performed during an internship programme at Accenture Labs, Dublin, Ireland.
- (1987) SMILES, a line notation and computerized interpreter for chemical structures. Environmental research brief, U.S. Environmental Protection Agency, Environmental Research Laboratory.
- (2017) Latent space oddity: on the curvature of deep generative models. arXiv preprint arXiv:1710.11379.
- (2016) BDDCS, the rule of 5 and drugability. Advanced Drug Delivery Reviews 101, pp. 89–98.
- (2018) MolGAN: an implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973.
- (2018) How artificial intelligence is changing drug discovery. Nature 557 (1), pp. S55–S57.
- (2017) The ChEMBL database in 2017. Nucleic Acids Research 45 (D1), pp. D945–D954.
- (2018) Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science 4 (2), pp. 268–276.
- (2012) ZINC: a free tool to discover chemistry for biology. Journal of Chemical Information and Modeling 52 (7), pp. 1757–1768.
- (2016) OpenSMILES specification.
- (2018) Conditional molecular design with deep generative models. Journal of Chemical Information and Modeling 59 (1), pp. 43–52.
- (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
- (2017) Grammar variational autoencoder. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1945–1954.
- RDKit: open-source cheminformatics.
- (2018) Molecular generative model based on conditional variational autoencoder for de novo molecular design. Journal of Cheminformatics 10 (1), pp. 1–9.
- (2020) Mol-CycleGAN: a generative model for molecular optimization. Journal of Cheminformatics 12 (1), pp. 1–18.
- (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
- (2016) Sampling generative networks. arXiv preprint arXiv:1609.04468.
- (1971) Finding the k shortest loopless paths in a network. Management Science 17 (11), pp. 712–716.