An Intermediate Level of Abstraction for Computational Systems Chemistry
Abstract
Computational techniques are required for narrowing down the vast space of possibilities to plausible prebiotic scenarios, since precise information on the molecular composition, the dominant reaction chemistry, and the conditions for that era are scarce. The exploration of large chemical reaction networks is a central aspect in this endeavour. While quantum chemical methods can accurately predict the structures and reactivities of small molecules, they are not efficient enough to cope with large-scale reaction systems. The formalization of chemical reactions as graph grammars provides a generative system, well grounded in category theory, at the right level of abstraction for the analysis of large and complex reaction networks. An extension of the basic formalism into the realm of integer hyperflows allows for the identification of complex reaction patterns, such as auto-catalysis, in large reaction networks using optimization techniques.
1 Introduction
From a fundamental physics point of view chemical systems, or more precisely molecules and their reactions, are just time dependent multi-particle quantum systems, completely described by the fundamental principles of quantum field theory (QFT) [38]. At this level of description almost all questions of interest to a chemist are not tractable in practice, however. A hierarchy of approximations and simplifications is employed therefore to reach models of more practical value. These are guided at least in part by conceptual notions that distinguish chemistry from other quantum systems. Among these are constraints such as the immutability of atomic nuclei and the idea that chemical reactions comprise only a redistribution of electrons. On the formal side the Born-Oppenheimer approximation [7] stipulates a complete separation of the wave function of nuclei and electrons and leads to the concept of the potential energy surface (PES) that explains molecular geometries and provides a consistent – if not completely accurate – view of chemical reactions as classical paths of nuclear coordinates on the PES. The PES itself is the result of solving the Schrödinger Equation with nuclear coordinates and charges as parameters [32, 21]. Quantum chemistry (QC) has developed a plethora of computational schemes for this purpose, typically trading off accuracy for computational resource consumption. Among them in particular are the so-called semi-empirical methods that use the fact the chemical bonds are usually formed by pairs of electrons to decompose the electronic wave function into contributions of electron pairs.
Molecular modelling (MM) and molecular dynamics (MD) [30, 9] abandon quantum mechanics altogether and instead treat chemical bond akin to classical springs. Sacrificing accuracy, MM and MD can treat macro-molecules and supra-molecular complexes that are well outside the reach of exact and even semi-empirical quantum-chemical methods. For special classes of molecules, even coarser approximations have been developed. Many properties of aromatic ring systems, for instance, can be explained in terms of graph-theoretical models known as Hückel theory [26, 24]. For nucleic acids, on the other hand, models have been developed that aggregate molecular building blocks (nucleotides) into elementary objects so that Watson-Crick base pairs become edges in the graph representation [41].
A common theme in the construction of coarser approximations is that more and more external information needs to be supplied to the model. While QFT does not require more than a few fundamental constants, nuclear masses and charges are given in practical QC computations. Semi-empirical methods, in addition, require empirically determined parameters for electron correlation effects. MM and MD models use extensive tables of parameters that specify properties of localised bonds as a function of bond type and incident atoms. Similarly, RNA folding depends on a plethora of empirically determined energy contributions for base pair stacking and loop regions [36]. A second feature of coarse-grained methods is that they are specialised to answer different types of questions, or that different classes of systems resort to different, mutually incompatible approximations.
This well-established hierarchy of internally consistent models of molecular structures is in stark contrast to our present capability to model chemical reactions. While transition state theory [17] does provide a means to infer reaction rate constants from PES, it requires the prior knowledge of the educt and product states.
A systematic investigation of large chemical reaction systems requires the development of a theoretical framework that is sufficiently coarse-grained to be computationally tractable. Any such model must satisfy consistency conditions that are inherited from the underlying physics.
A recent study [33] on complex chemical reaction network supports this view and concludes, that on the structure and reactivity level of small molecules efficent QM based computational approaches exist, but on the large-scale network level heuristic approaches are indispensable. Here we argue that chemistry offers a coarse-grained level of description that allows the construction of mathematically sound and consistent formal models that, nevertheless, are conceptually and structurally different from the formalism of quantum chemistry. Much of chemistry is taught in terms of abstracted molecular structures and rules (named reactions) that transform molecular graphs into each other. In the following section we show how this level of modelling can indeed be made mathematically precise and how it can accommodate key concepts of chemistry such as transition state theory and fundamental conservation laws inherited from the underlying physics.
2 Graph Grammar Chemistry
The starting point of an inherently discrete model of chemistry is a simplified, graphical representation of molecules. This coarse-graining maps the atoms and bonds of a molecule to vertices and edges of the corresponding graph. All type information (atom types C, H, N, O, etc. and bond multiplicity single, double, triple bond) are mapped to labels on the respective vertices or edges of the graph. In this setting all information on properties which are tied to the three-dimensional space, such as chirality, and cis/trans isomerism of double bonds is neglected.
It is possible, however, to extend the model to retain the local geometric information and thus capture the part of stereochemistry which is tied to stereogenic centers. Helicity, for instance, can not be expressed by the extended model since this property is generated by an extended spacial arrangement around an axis and not a single point or center. The basis for the extended model is the valence-shell electron-pair repulsion (VSEPR) model, which albeit comprising a set of simple rules, has a firm grounding in quantum chemical modelling [19]. VSEPR theory determines approximate bond angles around an atom depending on the incident bond types, i.e., in terms of information conveyed by the labelled graph representation. Stereochemical information involving chiral centres as well as cis/trans isomerism thus can be encoded simply in the order in which bonds are listed, and augmenting the labelled graph with local permutation groups.
Still certain aspects of chemistry cannot be described in this form. For instance, the concept of bonds as edges fails in multi-centre bonds since three or more atoms share a pair of bonding electrons. These are frequently observed in boranes or organometallic compounds such as ferrocenes. Non-local chirality, found for instance in helical molecules does not rely on local, atom-centred symmetries and thus is not captured by local orientation information.
With molecules represented as graphs, the mechanism of a chemical reaction is naturally expressed as a graph transformation rule. Graph transformation thus retains the semantics familiar from organic chemistry textbooks. As a research discipline in computer science, graph transformation dates back to the 1970s. Graph transformation has been studied extensively in the context of formal language theory, pattern recognition, software engineering, concurrency theory, compiler construction, verification among other fields in computer science [34]. Several formalisms have been developed in order to formalize and implement the process of transforming graphs. Algebraic approaches are of particular interest for modelling chemistry, where multiple variations based on category theory exist. For example, different semantics can be expressed using either the Single Pushout (SPO) approach, the more restrictive Double Pushout (DPO) approach, or the recently developed Sesqui-Pushout (SqPO) approach [14].
In the context of chemical reactions, DPO graph transformation is the formalism of choice because it facilitates the construction of transformation rules that are chemical in nature. DPO guarantees that all chemical reactions are reversible [4]. The conservation of atoms translates to a simple formal condition (formally, the graph morphisms relating the context to the left hand and right hand side of a rule must be bijections for vertices). In turn this requirement guarantees the existence of well-defined atom maps.
Figure 2 illustrates how the Meisenheimer rearrangement [31], a temperature induced rearrangement of aliphatic amine oxides into N-alkoxylamines, translates to the DPO formalism. The reaction transforms educt graph (amine oxide) into product graph (N-alcoxylamine).
All arrows in the diagram are morphisms. The reaction centre, i.e., the subset of atoms and bonds of the reactant molecules directly involved in the bond breaking/forming steps of the chemical reaction, is expressed as a graph transformation rule. The information of how to change the connectivity and the charges of the atoms is specified by three graphs . The left graph (resp. right graph ) expresses the local state of molecules before (resp. after) applying the reaction rule. The context graph encodes the invariant part of the reaction centre and mathematical relates and to each other. The left graph is thus the precondition for application of the rule (i.e., it can only be applied if there exists a subgraph match that embeds in the host graph ; see the red and blue highlighted part of graph in Figure 2). In this case can be replaced by the right graph (see the green and blue highlighted part of graph in Figure 2), which transforms the educt graph into the product graph .
A computationally very demanding step when performing graph transformation (e.g., for generating large chemical spaces) is the enumeration of subgraph matches. Deciding if a single subgraph match exists in a host graph is known to be an -complete problem [11, 18]. Better theoretical results exists for certain classes of graphs, e.g., for the so-called partial -trees of bounded degree (to which almost all molecule graphs belong [40, 25, 1]) where the subgraph matching problem can be solved in polynomial time [29, 15]. In practice it is however faster to use simpler algorithms, e.g., VF2 [13, 12].
Chemical reactions are often compositions of elementary reactions. In the latter, the reaction centre can always be expressed as a cycle[22, 23], with an even number of vertices for homovalent reactions and an odd number from ambivalent reactions, better known as redox reactions. Graph transformations have a natural mechanism for rule composition that allows the expression of multi-step reactions (e.g., enzyme-mediated reactions or even complete metabolic pathways) as compositions of elementary transformation rules. The properties of elementary rules in terms of mass conservation or atom-to-atom mapping nicely carry over to the composed “overall transformation rules”. Since the action of chemical reactions is to redistribute atoms along complex reaction sequences, rule composition can be used to study the trace of individual atoms along these reaction sequences in a chemically as well as mathematically correct fashion. Rule composition can be completely automatized and thus opens the possibility for model reduction (see [5] for further details). We illustrate rule composition in the context of prebiotic chemistry in Figure 3.
3 Network View of Chemistry
Classical synthetic chemistry traditionally has been concerned with the step-wise application of chemical reactions in carefully crafted synthesis plans. In living organisms, in contrast, complex networks of intertwined reactions are active concurrently. These intricate reaction webs harbour complex reaction patterns such as branch points, autocatalytic cycles, and interferences between reaction sequences. The emerging field of Systems Chemistry has set out to leverage the systemic, network-centred view as a framework also for synthetic chemistry. Consequently, large-scale chemical networks are no longer just a subject of analysis in the context of understanding the working of a living cell’s metabolism, but are becoming a prerequisite to understanding the possibilities within chemical spaces, i.e., the universe of chemical compounds and the possible chemical reactions connecting them. The formulation of a predictive theory of chemical space requires it to be rooted in a strict mathematical formalization and abstraction of the overwhelming amount of anecdotal knowledge, that has been collected on the single reaction and functional subnetwork level, into generalizing principles.
Graph transformation systems, as discussed earlier, provide the basis for such a formalism that allow for a systematic and step-wise construction of arbitrary chemical spaces. A chemical system is then specified as a formal graph grammar that encapsulate a set of transformation rules, encoding the reaction chemistry, together with a set of molecules which provide the starting points for rule application. The iteration of the graph grammar yields reaction networks in the form of directed hypergraphs as explicit instantiations of the chemical space. Usually a simple iterative expansion of the chemical space leads to a combinatorial explosion in the number of novel molecules. Therefore a sophisticated strategy framework for the targeted exploration of the parts of interest of the chemical space has been developed [6]. Such strategies are indispensable if for example polymerization/cyclization spaces are the subject of investigation. These type of spaces are for example found in the important natural product classes of polyketides and terpenes, and in prebiotic chemistry [3]. The strategy framework allows the guidance of chemical space exploration not only using physico-chemical properties of the generated molecules, but also using experimental data such as mass spectra. Importantly, the hypergraphs (reaction networks) are generated automatically annotated with atom-to-atom maps, as defined implicitly in the underlying graph grammar. For large and complex reaction networks it is thus possible to construct atom flow networks in an automated fashion, even including corrections for molecule and subnetwork symmetries, as required for the interpretation of isotope labelling experiments [8].
The origin of life can be viewed as an intricate process which has been shaped by external constraints provided by early Earth’s environment and intrinsic constraints stemming from reaction chemistry itself. Higher order chemical transformation motifs, such as network auto-catalysis, are believed to have played a key role in the amplification of the building blocks of life [27, 28, 37]. A combination of the constructive graph grammar approach with techniques from combinatorial optimization sets the proper formal stage for attacking some of these origin of life related questions. The key idea here is to rephrase the topological requirements for a particular chemical behaviour, e.g., network auto-catalysis, as an optimization problem on the underlying reaction network (hypergraph). An example of this is the enumeration of pathways with specific properties, which can be formally modelled as a constrained hyperflow problem. Many of these problems are theoretically computationally hard [2], though in practice methods such as Integer Linear Programming (ILP) can be successfully used to identify such transformation motifs in arbitrary chemical spaces. The enumeration of transformation motifs is the first step in computer-assisted large-scale analysis of reaction networks. Other mathematical formalisms, such as Petri nets, and in general concurrency theory, can subsequently be used to model properties of chemical systems on an even higher-level. Complicated chemical spaces, such as the one formed by the formose process [10], can thus be dismantled into coupled functional modules, advancing the understanding of how a particular reaction chemistry induces specified behaviour on the reaction network level. More generally speaking and emphasizing the need for new approaches, it is well foreseeable that the future of chemistry is strongly bundled with a deeper understanding of complex chemical systems [39, 20], and the necessary skills to analyze such systems will become more and more important.
4 Discussion and Concluding Remarks
Narrowing down potential pathways for prebiotic scenarios indispensably requires novel systemic approaches that allow for the investigation of large chemical reaction systems. While the development of mathematically well grounded methods for abstraction and coarse-graining of (concurrent) systems is a very active research area in computer science, the interdisciplinary endeavor to integrate these approaches with chemistry is more often treated as a conceptual possibility rather than as a predictive approach. Many of the problems to be solved in this process are computationally hard (i.e., -hard) but still allow for practical in silico solutions. This discrepancy led to a relatively new and successful subfield in computer science called algorithmic engineering [35], in which one of the goals is to bridge the gap between theoretical results and practical solutions to hard problems. Clearly, results from that field should be taken into account when large chemical systems with a plethora of underlying hard problems have to be solved.
As an illustration of the integrative potential we sketch an example in Figure 3 (see http://mod.imada.sdu.dk for further examples). It shows how graph transformation based chemical space exploration (rooted in graph theory, category theory, and concurrency theory) with subsequent solution enumeration (using diverse optimization techniques) can be applied to a reaction schema presented in [16], in which Eschenmoser describes how aldehydes act as catalysts for the hydrolysis of groups of the -tetramer. Given a set of chemical reactions (encoded as graph grammar rules) and a set of initial molecules, the iterative application of these rules (potentially with an underlying strategy for the space expansion) leads to a chemical space encoded as hypergraph. This hypergraph is the source for solving the subsequent problem of inferring and enumerating declaratively defined reaction motifs or pathways. In Figure 3 we do not illustrate the expansion step, the depicted mechanism (left column) is however found automatically as one of the potentially many solutions for the general question of enumerating hydrolysis pathways of -polymers that use glyoxylate (GLX: glyoxylate, CID 760) as catalyst. Formally such a solution is encoded as an integer hyperflow within the underlying hypergraph. Given the depicted reaction sequence, the possibility for transformation rule composition is utilised. The overall, (more coarse-grained) rule is thus automatically inferred by consecutive composition of the (simpler) transformation rules . All intermediate steps are depicted in the right column of Figure 3. Note, that the automated coarse-graining implemented by rule composition allows for keeping track of the possibilities of different atom traces, expressed as the atom-atom mapping from educt to product in composed rules. An obvious reachable next step is therefore the analysis as well as the design of isotope labeling experimentation based on the in silico generative chemistry approach with subsequent trace analysis.
Clearly, the illustration in Figure 3 is serving only as an example. The modelling of essential chemical parameters including kinetic components and thermodynamics are still missing. Nevertheless, the approach is already highly automated, and will bring wetlab and in silico experiments closer together. We argue that the intermediate-level theory outlined here holds promise in many fields of chemistry. In particular, we suggest that it is a plausible substrate for a predictive theory of prebiotic chemistry.
Acknowledgements
This work is supported by the Danish Council for Independent Research, Natural Sciences, the COST Action CM1304 “Emergence and Evolution of Complex Chemical Systems”, and the ELSI Origins Network (EON), which is supported by a grant from the John Templeton Foundation.The opinions expressed in this publication are those of the authors and do not necessarily reflect the views of the John Templeton Foundation.
Footnotes
- Email: daniel@imada.sdu.dk
References
- Tatsuya Akutsua and Hiroshi Nagamochi. Comparison and enumeration of chemical graphs. Comput Struct Biotechnol J, 5:e201302004, 2013.
- J. L. Andersen, C. Flamm, D. Merkle, and P. F. Stadler. Maximizing output and recognizing autocatalysis in chemical reaction networks is NP-complete. J. Systems Chem., 3:1, 2012.
- Jakob L Andersen, Tommy Andersen, Christoph Flamm, Martin Hanczyc, Daniel Merkle, and Peter F Stadler. Navigating the chemical space of HCN polymerization and hydrolysis: Guiding graph grammars by mass spectrometry data. Entropy, 15:4066–4083, 2013.
- Jakob L Andersen, Christoph Flamm, Daniel Merkle, and Peter F Stadler. Inferring chemical reaction patterns using graph grammar rule composition. J. Syst. Chem., 4:4, 2013.
- Jakob L Andersen, Christoph Flamm, Daniel Merkle, and Peter F Stadler. 50 Shades of rule composition: From chemical reactions to higher levels of abstraction. In François Fages and Carla Carla Piazza, editors, Proceedings of the 1st International Conference on Formal Methods in Macro-Biology, volume 8738 of LNCS, pages 117–135. Springer-Verlag, Berlin Heidelberg, 2014.
- Jakob L Andersen, Christoph Flamm, Daniel Merkle, and Peter F Stadler. Generic strategies for chemical space exploration. Int J Comp Biol Drug Des, 7:225–258, 2014.
- Max Born and J Robert Oppenheimer. Zur Quantentheorie der Molekülen. Ann. Physik, 389:457–484, 1927.
- Joerg M Buescher, Maciek R Antoniewicz, Laszlo G Boros, Shawn C Burgess, Henri Brunengraber, Clary B Clish, Ralph J DeBerardinis, Olivier Feron, Christian Frezza, Bart Ghesquiere, Eyal Gottlieb, Karsten Hiller, Russell G Jones, Jurre J Kamphorst, Richard G Kibbey, Alec C Kimmelman, Jason W Locasale, Sophia Y Lunt, Oliver D K Maddocks, Craig Malloy, Christian M Metallo, Emmanuelle J Meuillet, Joshua Munger, Katharina Nöh, Joshua D Rabinowitz, Markus Ralser, Uwe Sauer, Gregory Stephanopoulos, Julie St-Pierre, Daniel A Tennant, Christoph Wittmann, Matthew G Vander Heiden, Alexei Vazquez, Karen Vousden, Jamey D Young, Nicola Zamboni, and Sarah-Maria Fendt. A roadmap for interpreting C metabolite labeling patterns from cells. Curr Opin Biotechnol, 34:189–201, 2015.
- U Burkert and NL Allinger. Molecular Mechanics, volume 177. American Chemical Society, Washington, 1982.
- AM Butlerov. Einiges über die chemische Structur der Körper. Z Chem, 4:549–560, 1861.
- Stephen A. Cook. The complexity of theorem-proving procedures. In Proceedings of the Third Annual ACM Symposium on Theory of Computing, STOC ’71, pages 151–158, New York, NY, USA, 1971. ACM.
- L.P. Cordella, P. Foggia, C. Sansone, and M. Vento. A (sub) graph isomorphism algorithm for matching large graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(10):1367, 2004.
- Luigi Pietro Cordella, Pasquale Foggia, Carlo Sansone, and Mario Vento. An improved algorithm for matching large graphs. In Proc. of the 3^{rd} IAPR-TC15 Workshop on Graph-based Representations in Pattern Recognition, pages 149–159, 2001.
- A Corradini, T Heindel, F Hermann, and B König. Sesqui-pushout rewriting. In Graph Transformations, Third International Conference, ICGT 2006, Proceedings, volume 4178 of Lecture Notes in Computer Science, pages 30–45. Springer, 2006.
- A Dessmark, A Lingas, and A Proskurowski. Faster algorithms for subgraph isomorphism of k-connected partial k-trees. Algorithmica, 27:337–347, 2000.
- Albert Eschenmoser. On a hypothetical generational relationship between HCN and constituents of the reductive citric acid cycle. Chem. Biodivers., 4:554–573, 2007.
- Henry Eyring. The activated complex in chemical reactions. J. Chem. Phys., 3:107–114, 1935.
- MR Garey and DS Johnson. Computers and Intractability. A Guide to the Theory of Completeness. Freeman, San Francisco, 1979.
- RJ Gillespie. Fifty years of the VSEPR model. Coord. Chem. Rev., 252:1315–1327, 2008.
- Chris M. Gothard, Siowling Soh, Nosheen A. Gothard, Bartlomiej Kowalczyk, Yanhu Wei, and Bartosz A. Grzybowski Bilge Baytekin. Rewiring chemistry: Algorithmic discovery and experimental validation of one-pot reactions in the network of organic chemistry. Angewandte Chemie, 51(32):7922–7927, 2012.
- Dietmar Heidrich, Wolfgang Kliesch, and Wolfgang Quapp. Properties of Chemically Interesting Potential Energy Surfaces, volume 56 of Lecture Notes in Chemistry. Springer-Verlag, Berlin, 1991.
- James B Hendrickson. Comprehensive system for classification and nomenclature of organic reactions. J Chem Inf Comput Sci, (37):852–860, 1997.
- James B Hendrickson. Systematic signatures for organic reactions. J Chem Inf Model, 50:1319–1329, 2010.
- Roald Hoffmann. An extended Hückel theory. I. Hydrocarbons. J Chem Phys, 39:1397–1412, 1963.
- Tamás Horváth and Jan Ramon. Efficient frequent connected subgraph mining in graphs of bounded tree-width. Theoret Comput Sci, 411:2784–2797, 2010.
- Ernst Hückel. Quantentheoretische Beiträge zum Benzolproblem. I. Die Elektronenkonfiguration des Benzols und verwandter Verbindungen. Z. Physik, 70:204–286, 1931.
- S. Kauffman. At Home in the Universe: The Search for the Laws of Self-Organization and Complexity. Oxford University Press, 1995.
- S. Kauffman. Question 1: Origin of life and the living state. Orig Life Evol Biosph, 37(4):315–322, 2007.
- Jiři Matoušek and Robin Thomas. On the complexity of finding iso- and other morphisms for partial k-trees. Disc Math, 108:343–364, 1992.
- JA McCammon, BR Gelin, and M Karplus. Dynamics of folded proteins. Nature, 267:585–590, 1977.
- J Meisenheimer. Über eine eigenartige Umlagerung des Methyl-allyl-anilin-N-oxyds. Chem Ber, 52:1667–1677, 1919.
- PG Mezey. Potential Energy Hypersurfaces. Elsevier, Amsterdam, 1987.
- Dmitrij Rappoport, Cooper J. Galvin, Dmitry Yu. Zubarev, and Alán Aspuru-Guzik. Complex chemical reaction networks from heuristics-aided quantum chemistry. J Chem Theory Comput, 10(3):897–907, 2014.
- Grzegorz Rozenberg, editor. Handbook of Graph Grammars and Computing by Graph Transformation. World Scientific, 1997.
- Peter Sanders. Algorithm engineering - an attempt at a definition. In Efficient Algorithms, volume 5760 of Lecture Notes in Computer Science, pages 321–340. Springer, 2009.
- DH Turner and DH Mathews. NNDB: the nearest neighbor parameter database for predicting stability of nucleic acid secondary structure. Nucleic Acids Res, 38:280–282, 2010.
- Robert Ulanowicz. Ecology, the Ascendent Perspective. Columbia Univ. Press, 1997.
- S. Weinberg. The Quantum Theory of Fields. Cambridge University Press, 2005.
- George M. Whitesides. Reinventing chemistry. Angew. Chem. Int. Ed., 54(11):3196–3209, 2015.
- Atsuko Yamaguchi, Kiyoko F Aoki, and Hiroshi Mamitsuka. Finding the maximum common subgraph of a partial k-tree and a graph with a polynomially bounded number of spanning trees. Inf Proc Lett, 92:57–63, 2004.
- M Zuker and P Stiegler. Optimal computer folding of larger RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res, 9:133–148, 1981.