Finding the main product of a chemical reaction is one of the important problems of organic chemistry. This paper describes a method of applying a neural machine translation model to the prediction of organic chemical reactions. In order to translate ‘reactants and reagents’ to ‘products’, a gated recurrent unit based sequence–to–sequence model was built, together with a parser that generates input tokens for the model from reaction SMILES strings. The training sets are composed of reactions from a patent database and of reactions generated by applying the elementary reactions in Wade’s organic chemistry textbook. The trained models were tested on examples and problems from the textbook. The prediction process requires no manually encoded rules (e.g., SMARTS transformations) to predict products; it only needs sufficiently large training reaction sets to learn new types of reactions.
Linking the Neural Machine Translation and the Prediction of Organic Chemistry Reactions
Predicting the major products of chemical reactions is a basic issue in organic chemistry. Because the ability to make accurate predictions of products plays a key role in applications such as designing syntheses, enhancing this ability has been one of the major objectives of organic chemistry curricula. Computational methods for this task enable highly efficient planning of organic syntheses. There is a strong link between reaction prediction and the problem of retrosynthesis, as the two are inverse processes of each other. Therefore, various methods for predicting reactions and for retrosynthesis have been developed over the past few decades. These prediction methods are widely covered in recent reviews of computer-aided organic synthesis planning.1, 2
Current computational methods for predicting organic reactions are generally classified into three categories.3, 4 The first category predicts reactions according to rules encoded by humans. Starting from seminal works in this area such as the CAMEO5 and EROS6 systems, algorithms based on this approach have been developed over the years; some of them, for instance, identify reactive sites.7, 8 Recently, Chen et al.9, 10 presented a prediction system based on reaction mechanisms, using manually composed transformation rules for each mechanistic step. These methods perform well on target reactions covered by the composed rules but need further encoding whenever newly discovered reactions fall outside those rules. Because of this need for continual manual encoding, many early projects in this area have become outdated.
The second category is based on physical calculations.11, 12, 13, 14, 15, 16, 17, 18 To predict the products, these methods calculate the energies of transition states along plausible reaction pathways. Methods for choosing reaction coordinates or searching for the relevant transition states have been developed.18 However, because most of the energy calculations involve quantum mechanics, these methods usually incur a high computational cost. Various approaches have been developed to mitigate this problem: for instance, ToyChem12 calculates energies with a simplified version of Extended Hückel Theory, while ROBIA16 implements a rule–based decision process to filter reactive sites.
The last category predicts reactions using machine learning. Although there were several early approaches,19, 20 this area of reaction prediction has been revisited thanks to recent developments in machine learning algorithms. For example, ReactionPredictor3, 4 predicts reactions at the level of individual mechanistic steps. In that work, the interactions between molecular orbitals were marked with recursive notations to express reaction mechanisms, and neural networks were applied to filter source/sink molecular orbitals and to rank the generated pairs of molecular orbitals, yielding possible mechanistic steps. From the perspective of general reaction prediction, this kind of method requires a tree search to discover the final product, since its scope is the prediction of single mechanistic steps. However, in some reactions each step is not the most plausible mechanism given by theory, and some reactions involve complicated mechanistic pathways; hence, predicting the overall reaction path can be computationally demanding.
Another approach utilizing machine learning is to generate a reaction fingerprint to predict the reaction class by considering the whole reaction, not the detailed mechanism. Schneider et al.21 developed a fingerprinting method for the classification of reactions based on molecular fingerprints such as AtomPairs,22 Morgan23 (ECFP24), and TopologicalTorsions.25 Reactions were classified according to the reaction types included in the RSC’s RXNO ontology.26 Focusing on classification, this work used the products to predict the reaction types. Recently, Wei et al.27 developed a reaction type classifier that depends only on reactants and reagents, not products. They applied the differentiable neural fingerprinting method28 to generate reaction fingerprints, and combined the classifier with SMARTS29 transformations to yield the product molecules. Predicting products via reaction classes and their corresponding SMARTS makes predictions relatively easy for reactions matching those classes. On the other hand, this method needs manually encoded SMARTS transformations for new kinds of reactions. In addition, reaction classification sometimes suffers from ambiguous reaction classes. Due to these drawbacks, an alternative approach that predicts reactions without classification is needed.
Although the reactant, reagent and product molecules involved in reactions are three-dimensional entities, they can be represented as linear SMILES30 strings, which can be decomposed into a list of atoms, bonds and several kinds of symbols. Hence, from a linguistic perspective, SMILES can be regarded as a language with grammatical specifications. In this sense, the problem of predicting products can be regarded as one of translating ‘reactants and reagents’ into ‘products’. In this work, a sequence–to–sequence translation model is applied to predict reactions. A parser to tokenize the SMILES strings was constructed, and two kinds of training and test sets were generated: one reaction set from manually encoded rules, and another from a real reaction database. An encoder–decoder neural translation model was trained on these reaction sets to predict the correct products. An outline of the prediction system is shown in Figure 1.
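As a concrete illustration of such a tokenizer (a minimal sketch; the exact token inventory used in this work is not specified here, so the pattern below is an assumption), a single regular expression can split a SMILES string into atom and bond tokens, provided multi-character tokens are matched before single-character ones:

```python
import re

# Minimal SMILES tokenizer sketch. Multi-character tokens (bracket atoms,
# Cl, Br, @@, two-digit ring closures like %12) must precede the
# single-character alternatives, or 'Cl' would tokenize as 'C' + error.
TOKEN_RE = re.compile(
    r"(\[[^\]]+\]|Br|Cl|@@|%\d{2}|[bcnops]|[BCNOFPSI]|[=#@$()./\\+\->]|\d)"
)

def tokenize(smiles):
    """Split a SMILES string into a token list; fail on unknown characters."""
    tokens = TOKEN_RE.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError("untokenizable characters in: " + smiles)
    return tokens
```

A reaction SMILES can then be tokenized part by part (reactants, reagents, products), keeping ‘>’ as a separator token between the parts.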
2 Results and Discussion
After applying the training set generation process explained in the Methods section, two training sets were obtained: one from the patent database, and another from the reaction templates in Wade’s organic chemistry textbook32. These training sets are subsequently referred to as the ‘real’ and ‘gen’ training sets, respectively. Using these reaction sets, two reaction prediction models were built: one trained on the ‘gen’ set alone, and another trained on both the ‘gen’ and ‘real’ sets. The two models are compared to investigate the effect of the ‘real’ training set on the reaction prediction model.
2.1 Performance on textbook questions
To test the trained models, problems from Wade32 were used, following the method of Wei et al.27 Ten problem sets from the textbook were applied. Every problem is treated as a product prediction problem, and problems outside the scope of this work, such as simple deprotonation, were excluded. Each problem set consists of 6 to 15 reactions. For every problem, the reaction is converted into a reaction SMILES string and the product part is removed. This product–less SMILES string is fed as input to the two models (‘gen’ and ‘real+gen’), which produce product SMILES strings. The produced product is compared with the original product to evaluate each model. The ratio of correct answers and the average Tanimoto33 similarity between the Morgan fingerprints of the predicted products and those of the real products were used as evaluation metrics. Product generation is based on tokenized SMILES symbols, so this process can sometimes generate invalid SMILES strings, for example by not closing opened branches (mismatched parentheses). If a generated product SMILES string contains such errors, the score for the corresponding prediction was set to 0. The overall prediction results are shown in Figure 2.
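The Tanimoto metric itself is straightforward. A minimal sketch on sets of fingerprint on-bits follows; in the actual evaluation, Morgan fingerprints computed with a cheminformatics toolkit such as RDKit would supply these bit sets, and the `prediction_score` helper is a hypothetical name, not the authors’ code:

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity |A ∩ B| / |A ∪ B| between two on-bit sets."""
    if not bits_a and not bits_b:
        return 1.0  # two empty fingerprints are conventionally identical
    return len(bits_a & bits_b) / len(bits_a | bits_b)

def prediction_score(predicted_bits, reference_bits, valid_smiles):
    """Score one prediction; invalid generated SMILES score 0, as in the text."""
    if not valid_smiles:
        return 0.0
    return tanimoto(predicted_bits, reference_bits)
```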
Comparing the two models, the data in Figure 2 show that the prediction ability of the ‘real+gen’ model is better than that of the ‘gen’ model in most cases. It is evident from the results that the training set from real patent reactions facilitates product prediction. The training set of generated reactions does not include reactants with more than 10 atoms or with multiple functional groups. The test problem sets do include such reactants, and the reasonably good performance on these problems indicates the generalizability of the model. Problem set 15-30 concerns Diels–Alder reactions and 17-44 concerns reactions with benzene as the reactant, so the reactions in these problem sets are outside the scope of the generated training reactions. For problem set 15-30, although neither model predicted fully correct answers for all 6 problems, the ‘real+gen’ model achieved a better average Tanimoto score. The low correct-answer ratio on Diels–Alder reactions could be due to the lack of simple training data for those reactions: although Diels–Alder reactions are included in the patent training set, they are rather complex, so the features of Diels–Alder type reactions may be suppressed when training the model on these reactions. The ‘real+gen’ model’s better Tanimoto score could be explained by its lower ratio of invalid product SMILES strings, since it was trained on a larger number of reactions than the ‘gen’ model; a larger training set may have resulted in the decoder network generating more valid SMILES strings. For problem set 17-44, the ‘gen’ model correctly answered two, and the ‘real+gen’ model four, of the eleven test problems. Reactions of aromatic compounds are included only in the ‘real’ training set, so it is reasonable that the ‘real+gen’ model yielded better prediction results.
Nevertheless, the ‘gen’ model correctly predicted two reactions, implying that this prediction model can even extrapolate to unencoded reaction patterns.
2.2 Scalability of the model
To further test the scalability of the models, test reactions were generated with a method similar to that used for the training set of manually composed reactions. The substrates used for generating the test reactions are molecules with 11 to 16 atoms and a single functional group, whereas the training set includes molecules with fewer than 11 atoms. Accordingly, the GDB-1734 molecule dataset was used instead of GDB-11 to generate substrates. The generated test set consists of a total of 2,400 reactions, 400 for each substrate atom count from 11 to 16. The evaluation metrics are the ratio of correct answers, the average Tanimoto score calculated as for the textbook problem test set, the average cross-entropy loss of the models, and the ratio of errors (invalid SMILES strings). The prediction results are shown in Figure 3.
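A per-atom-count evaluation of this kind can be sketched as a small aggregation helper. This is a hypothetical illustration (the record layout and function name are assumptions, not the authors’ code):

```python
from collections import defaultdict

def binned_metrics(records):
    """Aggregate (atom_count, exact_match, tanimoto, valid_smiles) records
    into per-atom-count correct-answer, Tanimoto and error ratios."""
    bins = defaultdict(list)
    for atoms, exact, tani, valid in records:
        bins[atoms].append((exact, tani, valid))
    summary = {}
    for atoms, rows in sorted(bins.items()):
        n = len(rows)
        summary[atoms] = {
            "correct_ratio": sum(1 for e, _, _ in rows if e) / n,
            "avg_tanimoto": sum(t for _, t, _ in rows) / n,
            # errors are predictions whose SMILES failed to parse
            "error_ratio": sum(1 for _, _, v in rows if not v) / n,
        }
    return summary
```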
As illustrated in Figures 3b and 3d, the ‘real+gen’ model generally performs better than the ‘gen’ model when predicting from longer sequences of reactants and reagents. The ‘real+gen’ model has a considerably lower error ratio of invalid SMILES string generation than the ‘gen’ model, and it maintains a Tanimoto score around 0.7 and an error ratio around 0.4 as the number of atoms in the substrate molecules increases. Despite this, the ratio of fully correct predictions decreases quickly as the number of substrate atoms increases. Since the SMILES string lengths of the reactants and products are proportional to the substrate atom count, this result reveals that the recurrent network models in this work make more prediction mistakes as the input sequences grow longer.
The vectors (in the decoder cells) that generate the attention weights are visualized in Figure 4. As illustrated in the figure, the values corresponding to the first few encoder cells are significantly higher than the other values in the vectors. Hence, when the softmax function is applied to generate the attention weights, the first elements of the weight vectors become close to 1 and the rest close to 0, so the decoder cells attend only to the first encoder cells. The attention mechanism is generally adopted in neural translation models to enhance ‘alignment’, i.e., the decoder attending to the related encoder cells. If the decoder cells that generate the tokens representing the unreactive sites in the reactant molecules could attend to the corresponding encoder cells, analogously to atom mapping, reaction prediction with longer input sequences or larger numbers of reactant atoms could be improved. Additionally, as this work focuses on training and testing a translation model for predicting the main product(s) of a reaction, the chemical nature of the participating molecules is overlooked. For instance, the embedding vectors of the tokens in the encoder (reactants and reagents) and decoder (product) are not related to their chemical features, as visualized in Figure 5. Also, the scope of ‘main product(s)’ can be ambiguous, for example whether or not to include the protecting groups released in deprotection reactions. Future work on this type of reaction product prediction algorithm will have to address these issues.
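The saturation effect described above follows directly from the softmax: when one pre-attention score is much larger than the rest, its weight approaches 1 and the others vanish. A small numerical sketch:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# One dominant score saturates the weight vector toward one-hot, so the
# decoder effectively attends to a single encoder cell, as in Figure 4.
weights = softmax([10.0, 0.0, 0.0, 0.0])
```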
This work has dealt with the application of neural machine translation to organic chemistry reaction prediction. Two models (the ‘gen’ model and the ‘real+gen’ model) were composed, and the comparison between them showed that training on real reactions improves the prediction ability of the model. The models predicted the products of reactions with reasonably high precision, and the ‘gen’ model could extrapolate its prediction ability to untrained types of reactions (reactions with aromatic substrates). While the test sets used to acquire quantitative results contained elementary reactions, the ‘real+gen’ model was able to predict some higher-level reactions because it was trained on recent patent reactions.
Compared with previous works applying machine learning to the reaction prediction task, the mechanism-based model of Kayala et al.4 performs better on reactions with a single mechanistic step, but only a few multistep reactions were shown in their work, as such reactions need a tree-search algorithm to discover the mechanistic pathways to the final products. This work used a training set generation procedure and evaluation metrics similar to those of the fingerprint-based model of Wei et al.,27 and the present model performed better on product generation for the test set of organic chemistry textbook questions. Moreover, this algorithm generates product SMILES strings from tokens, so manual input of SMARTS transformations is not needed. This makes the overall process significantly more flexible, as the method only requires sufficient reaction data to train on. However, it also creates some problems: the model sometimes composes invalid product SMILES strings, and reactions with multiple pathways—for instance, substitution and elimination—are hard to handle with the present model structure. Future versions of this algorithm should address these problems.
Machine learning based reaction prediction algorithms can generate predictions much faster than methods using quantum calculations, making them suitable for synthesis planning tasks.1, 27 The reaction prediction models could be extended to account for reagent quantities, reaction temperature and time to obtain more elaborate predictions, or to predict the yields of the possible products. Linking this method of reaction prediction with machine-assisted chemistry36 could open new areas of automated chemical systems.
4.1 Training reaction set composition
Two training sets were generated to train the reaction prediction model. The first set is based on real reactions. Reaction databases such as CASREACT37, Reaxys38, and SPRESI39 exist, but they are commercial, and the reactions included in them cannot be extracted in a form appropriate for this work. Hence, the reaction database collected from patents by Lowe40, 41 was used. Schneider et al.21 also used this database to train a reaction classification system. In this work, reactions extracted from 2001–2013 USPTO applications were used. First, atom mappings were removed from the reaction SMILES, as they are unnecessary for the translation model. To filter out reactions inappropriate for the translation model, the following were excluded: (1) reactions whose reactants-and-reagents part (the string before the second ‘>’ in the reaction SMILES) is longer than 150 characters, (2) reactions whose product part (the string after the second ‘>’) is longer than 80 characters, and (3) reactions with four or more products. A total of 1,094,235 reactions were collected.
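The three exclusion criteria can be expressed as a small predicate over a reaction SMILES string (a sketch; it assumes the standard ‘reactants>reagents>products’ layout, and the function name is an illustrative choice):

```python
def keep_reaction(rxn_smiles):
    """Apply the three exclusion criteria to one reaction SMILES string."""
    reactants, reagents, products = rxn_smiles.split(">")
    # (1) reactants-and-reagents part (before the second '>'): at most 150 chars
    if len(reactants) + 1 + len(reagents) > 150:
        return False
    # (2) product part (after the second '>'): at most 80 chars
    if len(products) > 80:
        return False
    # (3) fewer than four product molecules ('.'-separated)
    if len(products.split(".")) >= 4:
        return False
    return True
```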
Because those reactions are relatively recent, this reaction set lacks elementary reactions. Hence, following the method of Wei et al.,27 a second reaction set was composed from the elementary reactions in the undergraduate organic chemistry textbook by Wade32. A total of 75 reaction types covering five types of substrate molecules (acid derivatives, alcohols, aldehydes and ketones, alkenes, alkynes) were considered. For each reaction type, reactions were generated by iterating over the reactant molecules matching the reaction template, specified as a SMARTS transformation. Reactant molecules with 1–10 atoms were extracted from the molecule database GDB-11.42, 43 As all halides in GDB-11 are fluorides, F was substituted with Cl, Br, and I in each halide to generate alkyl halide reactants. Molecules with multiple functional groups or bulky groups such as the neopentyl group were excluded. RDKit44 was used to collect matching reactant molecules and generate reactions from the reaction templates. A total of 865,118 reactions were generated in this way.
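Template-based generation of this kind can be sketched with RDKit’s reaction SMARTS machinery. The template below (an SN2-style substitution of an alkyl bromide by hydroxide) is an illustrative assumption, not one of the 75 actual templates used in this work:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Illustrative (assumed) template: substitution of an alkyl bromide
# by hydroxide; atom map 1 carries the carbon into the product.
TEMPLATE = AllChem.ReactionFromSmarts("[C:1][Br]>>[C:1][OH]")

def apply_template(substrate_smiles):
    """Run the template on one substrate; return canonical product SMILES."""
    mol = Chem.MolFromSmiles(substrate_smiles)
    products = set()
    for product_set in TEMPLATE.RunReactants((mol,)):
        product = product_set[0]
        Chem.SanitizeMol(product)  # fix valences/aromaticity after editing
        products.add(Chem.MolToSmiles(product))
    return sorted(products)
```

Iterating such a template over every matching GDB-derived substrate, and emitting each substrate/product pair as a reaction SMILES, yields a generated reaction set of the kind described above.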
4.2 Reaction-predicting translation model
This work was in part supported by the Korea Foundation for the Advancement of Science & Creativity. The authors acknowledge Jennifer N. Wei for allowing the use of her code and for useful discussions, Minho Kim for extensive support on the logical structure and English of this manuscript, and Hanjun Lee for useful discussions.
- Szymkuć et al. 2016 Szymkuć, S.; Gajewska, E. P.; Klucznik, T.; Molga, K.; Dittwald, P.; Startek, M.; Bajczyk, M.; Grzybowski, B. A. Angewandte Chemie International Edition 2016, 55, 5904–5937.
- Todd 2005 Todd, M. H. Chemical Society Reviews 2005, 34, 247–266.
- Kayala et al. 2011 Kayala, M. A.; Azencott, C.-A.; Chen, J. H.; Baldi, P. Journal of Chemical Information and Modeling 2011, 51, 2209–2222.
- Kayala and Baldi 2012 Kayala, M. A.; Baldi, P. Journal of Chemical Information and Modeling 2012, 52, 2526–2540.
- Jorgensen et al. 1990 Jorgensen, W. L.; Laird, E. R.; Gushurst, A. J.; Fleischer, J. M.; Gothe, S. A.; Helson, H. E.; Paderes, G. D.; Sinclair, S. Pure and Applied Chemistry 1990, 62, 1921–1932.
- Hollering et al. 2000 Hollering, R.; Gasteiger, J.; Steinhauer, L.; Schulz, K.; Herwig, A. Journal of Chemical Information and Computer Sciences 2000, 40, 482–494.
- Satoh and Funatsu 1995 Satoh, H.; Funatsu, K. Journal of Chemical Information and Modeling 1995, 35, 34–44.
- Sello 1992 Sello, G. Journal of Chemical Information and Modeling 1992, 32, 713–717.
- Chen and Baldi 2008 Chen, J. H.; Baldi, P. Journal of Chemical Education 2008, 85, 1699.
- Chen and Baldi 2009 Chen, J. H.; Baldi, P. Journal of Chemical Information and Modeling 2009, 49, 2034–2043.
- Behn et al. 2011 Behn, A.; Zimmerman, P. M.; Bell, A. T.; Head-Gordon, M. Journal of Chemical Physics 2011, 135, 224108.
- Benkö et al. 2003 Benkö, G.; Flamm, C.; Stadler, P. F. Journal of Chemical Information and Computer Sciences 2003, 43, 1085–1093.
- Chaffey-Millar et al. 2012 Chaffey-Millar, H.; Nikodem, A.; Matveev, A. V.; Krüger, S.; Rösch, N. Journal of Chemical Theory and Computation 2012, 8, 777–786.
- Olsen et al. 2004 Olsen, R. A.; Kroes, G. J.; Henkelman, G.; Arnaldsson, A.; Jónsson, H. Journal of Chemical Physics 2004, 121, 9776–9792.
- Plessow 2013 Plessow, P. Journal of Chemical Theory and Computation 2013, 9, 1305–1310.
- Socorro et al. 2005 Socorro, I. M.; Taylor, K.; Goodman, J. M. Organic Letters 2005, 7, 3541–3544.
- Wang et al. 2016 Wang, L. P.; McGibbon, R. T.; Pande, V. S.; Martinez, T. J. Journal of Chemical Theory and Computation 2016, 12, 638–649.
- Zimmerman 2013 Zimmerman, P. M. Journal of Computational Chemistry 2013, 34, 1385–1392.
- Gelernter et al. 1990 Gelernter, H.; Rose, J. R.; Chen, C. Journal of Chemical Information and Modeling 1990, 30, 492–504.
- Röse and Gasteiger 1990 Röse, P.; Gasteiger, J. Analytica Chimica Acta 1990, 235, 163–168.
- Schneider et al. 2015 Schneider, N.; Lowe, D. M.; Sayle, R. A.; Landrum, G. A. Journal of Chemical Information and Modeling 2015, 55, 39–53.
- Carhart et al. 1985 Carhart, R. E.; Smith, D. H.; Venkataraghavan, R. Journal of Chemical Information and Computer Sciences 1985, 25, 64–73.
- Morgan 1965 Morgan, H. L. Journal of Chemical Documentation 1965, 5, 107–113.
- Rogers and Hahn 2010 Rogers, D.; Hahn, M. Journal of Chemical Information and Modeling 2010, 50, 742–754.
- Nilakantan et al. 1987 Nilakantan, R.; Bauman, N.; Dixon, J. S.; Venkataraghavan, R. Journal of Chemical Information and Computer Sciences 1987, 27, 82–85.
- 26 RXNO: reaction ontologies. https://github.com/rsc-ontologies/rxno.
- Wei et al. 2016 Wei, J. N.; Duvenaud, D.; Aspuru-Guzik, A. ACS Central Science 2016, 2, 1–20.
- Duvenaud et al. 2015 Duvenaud, D. K.; Maclaurin, D.; Iparraguirre, J.; Bombarell, R.; Hirzel, T.; Aspuru-Guzik, A.; Adams, R. P. Advances in Neural Information Processing Systems 28 2015, 2215–2223.
- 29 James, C. A.; Weininger, D.; Delany, J. Daylight Theory Manual. http://daylight.com/dayhtml/doc/theory/index.html.
- Weininger 1988 Weininger, D. Journal of Chemical Information and Computer Sciences 1988, 28, 31–36.
- Cho et al. 2014 Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) 2014, 1724–1734.
- Wade 2013 Wade, L. G. Organic chemistry, 6th ed.; Pearson: Upper Saddle River, NJ, USA, 2013.
- Bajusz et al. 2015 Bajusz, D.; Rácz, A.; Héberger, K. Journal of Cheminformatics 2015, 7.
- Ruddigkeit et al. 2012 Ruddigkeit, L.; Van Deursen, R.; Blum, L. C.; Reymond, J. L. Journal of Chemical Information and Modeling 2012, 52, 2864–2875.
- Van Der Maaten and Hinton 2008 Van Der Maaten, L.; Hinton, G. Journal of Machine Learning Research 2008, 9, 2579–2605.
- Ley et al. 2015 Ley, S. V.; Fitzpatrick, D. E.; Ingham, R. J.; Myers, R. M. Angewandte Chemie International Edition 2015, 54, 3449–3464.
- Blake and Dana 1990 Blake, J. E.; Dana, R. C. Journal of Chemical Information and Modeling 1990, 30, 394–399.
- 38 Chemical Data Reaxys. https://www.elsevier.com/solutions/reaxys.
- Roth 2005 Roth, D. L. Journal of Chemical Information and Modeling 2005, 45, 1470–1473.
- 40 Lowe, D. Patent Reaction Extraction. https://bitbucket.org/dan2097/patent-reaction-extraction.
- Lowe 2012 Lowe, D. Extraction of chemical structures and reactions from the literature. Ph.D. thesis, 2012.
- Fink et al. 2005 Fink, T.; Bruggesser, H.; Reymond, J. L. Angewandte Chemie International Edition 2005, 44, 1504–1508.
- Fink and Reymond 2007 Fink, T.; Reymond, J. L. Journal of Chemical Information and Modeling 2007, 47, 342–353.
- 44 RDKit: Open-Source Cheminformatics Software. http://rdkit.org/.
- Bahdanau et al. 2014 Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. 2014; http://arxiv.org/abs/1409.0473.
- Jean et al. 2015 Jean, S.; Cho, K.; Memisevic, R.; Bengio, Y. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) 2015, 1–10.
- Johnson et al. 2016 Johnson, M.; Schuster, M.; Le, Q. V.; Krikun, M.; Wu, Y.; Chen, Z.; Thorat, N.; Viégas, F.; Wattenberg, M.; Corrado, G.; Hughes, M.; Dean, J. Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. 2016; http://arxiv.org/abs/1611.04558.
- Luong et al. 2014 Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; Zaremba, W. Addressing the Rare Word Problem in Neural Machine Translation. 2014; http://arxiv.org/abs/1410.8206.
- Sutskever et al. 2014 Sutskever, I.; Vinyals, O.; Le, Q. V. V. Sequence to sequence learning with neural networks. Advances in neural information processing systems. 2014; pp 3104–3112.
- Vinyals et al. 2014 Vinyals, O.; Kaiser, L.; Koo, T.; Petrov, S.; Sutskever, I.; Hinton, G. Grammar as a Foreign Language. 2014.
- Wu et al. 2016 Wu, Y. et al. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. 2016; http://arxiv.org/abs/1609.08144.
- Hochreiter and Schmidhuber 1997 Hochreiter, S.; Schmidhuber, J. Neural Computation 1997, 9, 1–32.
- Gómez-Bombarelli et al. 2016 Gómez-Bombarelli, R.; Duvenaud, D.; Hernández-Lobato, J. M.; Aguilera-Iparraguirre, J.; Hirzel, T. D.; Adams, R. P.; Aspuru-Guzik, A. 2016.
- Jastrzȩbski et al. 2016 Jastrzȩbski, S.; Leśniak, D.; Czarnecki, W. M. Learning to SMILE(S). 2016; http://arxiv.org/abs/1602.06289.
- Ford 2004 Ford, B. Parsing Expression Grammars: A Recognition-Based Syntactic Foundation. Principles of Programming Languages (POPL). New York, New York, USA, 2004; pp 111–122.
- 56 pyPEG – a PEG Parser-Interpreter in Python. https://fdik.org/pyPEG/.
- 57 Smidge: Lightweight SMILES Parser. http://metamolecular.com/smidge/.
- 58 Craig A. James, OpenSMILES specification. http://opensmiles.org/opensmiles.html.
- 59 Sequence-to-Sequence Models. https://www.tensorflow.org/versions/r0.10/tutorials/seq2seq/index.html.
- 60 Abadi, M. et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems.