Genetic optimization of training sets for improved machine learning models of molecular properties

Genetic optimization of training sets for improved machine learning models of molecular properties

Nicholas J. Browning    Raghunathan Ramakrishnan    O. Anatole von Lilienfeld    Ursula Röthlisberger National Center for Computational Design and Discovery of Novel Materials (MARVEL) Laboratory of Computational Chemistry and Biochemistry, Institute of Chemical Sciences and Engineering. Swiss Federal Institute of Technology (EPFL) CH-1015 Lausanne, Switzerland. Tata Institute of Fundamental Research, Centre for Interdisciplinary Sciences, 21 Brundavan Colony, Narsingi, Hyderabad 500075, India Institute of Physical Chemistry, Department of Chemistry, University of Basel, Klingelbergstrasse 80, CH-4056 Basel, Switzerland
July 5, 2019

The training of molecular models of quantum mechanical properties based on statistical machine learning requires large datasets which exemplify the map from chemical structure to molecular property. Intelligent a priori selection of training examples is often difficult or impossible to achieve as prior knowledge may be sparse or unavailable. Ordinarily representative selection of training molecules from such datasets is achieved through random sampling. We use genetic algorithms for the optimization of training set composition consisting of tens of thousands of small organic molecules. The resulting machine learning models are considerably more accurate with respect to small randomly selected training sets: mean absolute errors for out-of-sample predictions are reduced to 25% for enthalpies, free energies, and zero-point vibrational energy, to 50% for heat-capacity, electron-spread, and polarizability, and by more than 20% for electronic properties such as frontier orbital eigenvalues or dipole-moments. We discuss and present optimized training sets consisting of 10 molecular classes for all molecular properties studied. We show that these classes can be used to design improved training sets for the generation of machine learning models of the same properties in similar but unrelated molecular sets.

I Introduction

Experimentally accurate solutions to the time-independent non-relativistic electronic Schrödinger equation (SE) for electrons and a collection of atoms involve numerically challenging calculations Cramer (2004). This limits routine electronic structure elucidation and accurate high-throughput screening. Previous works Rupp et al. (2012); Ramakrishnan et al. (2015); Montavon et al. (2013) have found that the task of repetitiously solving the SE can be mapped onto a computationally efficient, data-driven supervised machine-learning (ML) problem instead. In these models, expectation values of quantum-mechanical (QM) operators are inferred in the subset of chemical space spanned by a set of reference molecular graphs, enabling a speedup of several orders of magnitude von Lilienfeld (2013) for predicting relevant molecular properties such as enthalpies, polarizabilities and electronic excitations Montavon et al. (2013); Hansen et al. (2015); Ramakrishnan et al. (2015). Here, QM reference calculations provide training examples , where are molecular structures and the expectation values for a chemical property, for interpolation in a 4 dimensional space, aka. chemical space. After training, accurate property predictions for new as of yet unseen molecules can be obtained at the base cost of the underlying ML model, provided that the new query molecule lies close to the space spanned by the reference data.

A key issue in the validation of ML models is the selection of appropriate data to use for training. Here, we will tackle this problem in the context of ML models of quantum chemistry–based estimates of molecular properties. Training examples are typically chosen from a uniform random distribution, thus there is no guarantee that the selected data will produce an optimal model. In this work we study the effect of optimizing the composition of the set of training examples used for learning, by maximizing the predictive power of the underlying ML model for a given training set size.

There are numerous alternative strategies to design ML models with reduced relative mean absolute errors (RMAE) of property prediction. In previous work k-fold cross-validation has been used Hansen et al. (2013) to reduce the predictive error of the ML model by optimizing the hyperparameters and . The molecular representation could also be exchanged without any loss of generality. For this study we chose to rely on the sorted Coulomb matrix representation – a well established and generic representation which meets the crucial uniqueness criterion von Lilienfeld et al. (2015). We note that more advantageous descriptors, such as the Bag-of-Bonds (BoB) Hansen et al. (2015), or Bonds-and-Angles Machine Learning (BAML) models Huang and von Lilienfeld (2016) could have been used just as well. Finally, instead of focussing on properties directly, -learning Ramakrishnan et al. (2015) can be used to focus on the difference between a baseline and higher level of theory. In this study we include this model for learning the differences between PM7 and B3LYP atomization enthalpies.

This work is based on a quantum chemistry database published in 2014 Ramakrishnan et al. (2014), which contains relaxed geometries and 13 molecular properties computed at the DFT/B3LYP/6-31(2df,p) level of theory for 133’885 small organic molecules of up to 9 heavy atoms, extracted from the GDB-17 list of 166B SMILES strings Ruddigkeit et al. (2012), and will herein be referred to as the GDB9 database. In this work we use both this and a smaller subset, denoted as the GDB8 database, containing molecules of up to 8 heavy atoms, resulting in 21800 molecules. For the GDB8 subset, the following properties are considered for training set composition optimization: enthalpy and free energy of atomization; heat capacity ; isotropic molecular polarizability ; electronic spatial extent ; harmonic zero-point vibrational energy ZPVE; energy of the highest occupied and lowest unoccupied molecular orbitals; HOMO-LUMO gap and dipole moment . In addition, the modeling of electronic spectra of GDB8 has been studied in Ramakrishnan et al. (2015).

Figure 1: GA procedure to optimize molecular sets for training the ML model. A population of trial training sets containing an identical number of molecules are randomly sampled from the molecular database. This initial population of trial training sets serves as the first iteration of the algorithm, and is denoted as the parent population. An ML model is trained for each training set, and the mean out-of-sample prediction error is assigned to the training set as a fitness metric. The population then undergoes fitness-based selection and variation operators to create a child population. A portion of the worst children are replaced by the best parents. Finally the children are labelled as parents, and the algorithm repeats until there exists limited information difference between subsequent iterations.

Ii Theory

ii.1 Machine Learning Model

Similarly to the Hamiltonian used for electronic structure calculation, molecular information in the ML model has Cartesian coordinate and nuclear charge information inherently encoded through sorted Coulomb matrices Dral et al. (2015); Hansen et al. (2013), which are used to represent molecular structures for training and property prediction. The norm then serves as a metric measure of similarity between any two molecules and in the set of all sorted Coulomb matrices . Note that sorted Coulomb matrices encode (except among enantiomers) the external potential of any given molecule uniquely, such that it is invariant with respect to molecular translation, rotation, and atom-indexing. The ML model attempts to construct a non-linear mapping between molecular characteristics and molecular properties. Here, we model the property of a new out-of-sample molecule M as a linear combination of weighted Laplacian kernel functions , located on each training instance :


where runs over all molecules characterized by sorted Coulomb matricies M in the training set of size , and is the Laplacian kernel. For the Coulomb matrix descriptor, the combination of the Laplacian kernel and norm has been shown to yield an optimal combination of low computational cost and good predictive accuracy for models of atomization enthalpies Hansen et al. (2013). The regression coefficient vector is obtained by training on . Note that each molecule contributes to the property estimate not only according to its distance, but also according to its regression weight . The global hyperparameter corresponds to the kernel-width. The regression coefficients are the solutions to the kernel ridge regression (KRR) Hastie et al. (2001) minimization problem for a given kernel width and regularization parameter ,


where is the Lagrangian to be minimized with respect to the coefficient vector and is the vector of all reference training data. The solution for the coefficient vector then reads , with being the kernel matrix, and the identity matrix. Frequently, 5-fold cross validation is used to estimate optimal values of hyperparameters and , however, this is computationally expensive to perform during an optimization procedure. Instead, we employ the single-kernel method Ramakrishnan and von Lilienfeld (2015), using constant values of and .

ii.2 Genetic Optimization of Training Sets

Simple systematic enumeration to find training sets which minimize the out-of-sample predictive error of the ML model is a computationally demanding optimization problem. Finding 1000 optimal training molecules to train a machine which can best reproduce the properties of the remaining 20800 molecules in the GDB8 database requires the training of machines for a complete search. Clearly, complete optimization of training set composition in such a nearly-infinitely large space is impossible, thus an intelligent search method is required to find near-optimal solutions. Here we employ a genetic algorithm (GA) Turing (1995); Barricelli (1954); Fraser (1957), a biologically inspired meta-heuristic optimization technique, which has proved a successful optimization scheme in highly-dimensional and/or large spaces Godefroid and Khurshid (2002). The GA optimization process is pictorially summarized in figure 1. Bounded chemical compound space representations for the ML model, i.e training sets, are represented as a collection of unique molecular pointers,

X is termed a position vector, where each element uniquely maps to each Coulomb matrix , and is the training set size. Each position vector is used to train the ML model and therefore facilitates the calculation of the mean absolute error (MAE) of predicting out-of-sample properties. Starting with a population of training sets with unique molecules sampled from a uniform distribution, individuals are stochastically selected using MAE as a fitness criterion. These initial training sets are then successively evolved by applying selection, variation and replacement. Selection determines which training sets should remain in the population to produce the training sets of the following generation. Variation results in new representations of chemical compound space and includes two operations. The first, termed crossover, mixes two training sets uniquely such that two new "child" training sets are produced which contain training molecules from both "parents". The second operator, termed mutation, randomly changes training molecules to introduce new information into the population. Finally, a portion of the worst children are replaced by the best parents to create a new parent population. After a number of iterations, genetic homogeneity is reached within the population and out-of-sample MAEs improve no further. These training sets are termed GA-optimized training sets and are considered to be near-optimal.

ii.3 Training and Validation

For the GA optimization of training sets in the GDB8 database, the entire database of 21800 molecules is partitioned into independent training and validation databases, each consisting of 10900 molecules. GA optimization procedures for all properties and training set sizes sample solely from the training database, while the validation set is used for calculating out-of-sample MAEs, used to guide the GA optimization procedure. We also define a third database in which all GDB8 molecules are removed from the GDB9 database, and name it the GDB9-8 database. The GDB9 and GDB9-8 databases have been similarly partitioned into training and validation databases of  65k and  55k molecules each, respectively, which are used for rule development detailed in section III.4.

ii.4 Computational Details

Genetic optimization is performed using a population of 2000 trial training sets for each property and training set size, for a maximum number of 500 iterations. Tournament selection without replacement Sastry and Goldberg (2001) is used with a tournament size of 2. Crossover is performed by partially-matched uniform crossover Spears and Anand (1991); Spears and Dejong (1991); Eshelman (1990) with a per-gene crossover probability of 0.5. Mutation is performed by partially-matched polynomial mutation Deb and Goyal (1996) with a distribution index of 7 and a per-gene mutation probability of where is the training set size. Finally, population replacement is achieved through fitness-proportionate elitism with an elitist ratio of 0.7. These parameters result from rigorous benchmarking, targeting maximal RMAE reduction, and typically result in convergence after 300 iterations.

Iii Results

iii.1 GDB8 Learning

Figure 2: Enthalpy of atomization learning curves using direct-learning (green), GA-optimized direct-learning (yellow), B3LYP-PM7 delta-learning (blue) and GA-optimized delta-learning (red). Dashed horizontal lines show training set sizes required to reach given accuracies. Left inset: scatter plot of B3LYP-PM7 reference energies of atomization and predicted values using both a randomly generated training set (blue), and the GA-optimized counterpart (red). Right inset: typical error distribution over B3LYP-PM7 reference atomization enthalpies for aforementioned random (blue) and GA-optimized (red) training sets.
P (P) N
10 50 100 500 1k 2k 3k
H 113.0 (31.6) 48.0 (18.3) 33.3 (14.3) 14.8 (7.5) 10.2 (5.8) 6.8 (4.5) 5.1 (3.9)
G 101.8 (28.8) 44.0 (17.7) 31.4 (14.1) 14.3 (7.5) 9.9 (5.6) 6.7 (4.3) 5.0 (3.9)
C 27.3 (14.5) 18.2 (9.4) 14.6 (7.8) 7.4 (4.0) 5.2 (2.9) 3.4 (2.3) 2.5 (2.0)
ZPVE 353.2 (83.2) 150.4 (38.4) 97.9 (28.0) 31.5 (14.0) 20.9 (10.5) 13.9 (7.0) 10.5 (3.5)
<> 168.5 (92.2) 117.0 (44.2) 85.6 (33.2) 35.7 (19.1) 25.7 (15.5) 18.3 (12.7) 14.5 (11.6)
11.3 (8.5) 10.3 (7.7) 9.9 (7.4) 8.4 (6.3) 7.5 (5.7) 6.2 (5.1) 5.2 (4.7)
40.8 (16.3) 23.1 (12.0) 18.5 (10.8) 11.8 (7.8) 9.6 (6.5) 7.2 (5.4) 5.8 (4.9)
13.0 (9.0) 11.2 (8.1) 10.4 (7.3) 7.7 (5.2) 6.3 (4.5) 4.9 (3.8) 4.0 (3.5)
22.3 (15.8) 18.8 (12.7) 17.0 (11.1) 11.9 (8.0) 9.7 (6.7) 7.4 (5.6) 5.9 (5.0)
gap 24.0 (17.8) 20.8 (15.0) 19.5 (13.5) 14.3 (9.8) 11.8 (8.1) 9.0 (6.8) 7.3 (6.2)
6.6 (5.0) 6.0 (4.4) 5.7 (4.1) 4.6 (3.2) 4.1 (2.6) 3.4 (2.1) 3.1 (1.9)
Table 1: Randomized and GA-optimized out-of-sample RMAEs for all properties. All target chemical accuracies are 1 kcal mol, except for ZPVE, dipole moment and isotropic polarizability, which have target accuracies of 10cm, 0.1 and 0.1 respectively. GA-optimized RMAEs are denoted by while randomly generated training set RMAEs are denoted as . Final row corresponds to out-of-sample RMAEs for enthalpy of atomization using -learning (aka ), and bracketed the GA optimized counterpart, referred to as and respectively.

All reported relative mean absolute errors (RMAEs) refer to out-of-sample prediction of molecules from a machine trained on in-sample molecules. Target accuracies (for which RMAE = 1) for thermochemical properties and orbital energies are 1 kcal mol. For ZPVE a target accuracy of 10cm was selected, comparable to the average accuracy of coupled cluster methods with converged basis sets Tew et al. (2007) for predicting harmonic vibrational frequencies of small molecules. For isotropic polarizability and norm of dipole moment, target accuracies of and D were used. Again, these values are within the predictive uncertainty using CC level of theory Hickey and Rowley (2014).

Figure 2 displays learning curves of randomly generated and GA-optimized training sets for out-of-sample predictions of B3LYP level enthalpies of atomization, , using both direct learning as well as the -ML model Ramakrishnan et al. (2015). Due to KRR based ML errors decaying as inverse powers of training set size, we present RMAEs as a function of training set size on a log-log scale. Upon GA optimization we note a substantial lowering of the learning curves for both properties and all training set sizes. The combination of GA optimization with -ML model is the most promising: For machines trained on 3k molecules the GA optimization reduces the RMAE from 3.1 to 1.9. Corresponding scatterplots and error distribution plots are on display as insets in figure 2. They also indicate that the GA-optimized -ML models approach the ideal model much faster than randomly sampled training sets. It is particularly encouraging to note that the relative gain in predictive power, obtained by using GA, appears to converge towards a constant rather than to vanish.

RMAEs for ML models with and without using GA are summarized in table 1 for various training set sizes and all the aforementioned properties. GA optimization of training set composition systematically improves RMAEs for all properties and training set sizes. Some properties experience greater improvement than others. The reduction in RMAE is most prevalent for smaller training set sizes, particularly of size =10. More specifically, properties related to chemical bonding, such as enthalpies and free energies of atomization as well as ZPVE improve by 75%. Other extensive properties, such as heat capacity or polarizability, improve by 50%. In contrast, intensive electronic properties experience much less improvement. Overall, the smallest RMAE reduction is found for the norm of the dipole moment. While percentage wise, seemingly small, the error reduction for is still very relevant due to its outstanding accuracy in absolute terms. This finding could possibly be also due to the comparatively more demanding target accuracies, the standard deviation of energetic properties in the data set is orders of magnitude larger than the standard deviation of electronic properties, such as the dipole moment. See figure 4 for comparison.

Overall, these results clearly amount to numerical evidence that the choice of how to select training set members can have a dramatic effect on the predictive power of the resulting ML model. It follows that substantially fewer training examples would be needed for the generation of ML models which reach the same accuracy as ML models trained on a much larger training set sampled at random. In order for this insight to be useful, however, one would also require access to the solution of the selection optimization process in an a priori fashion, and not a posteriori as it is the case in this study.

Figure 3: Molecular classes I-X for all properties listed in table 1. These classes correspond to the N=10 column, where each molecule and its respective training set is collectively presumed to optimally represent the entire space of validation molecules. Each training set is sorted by its index in the GDB8 database, thus chemical weight approximately increases from left to right.

iii.2 Analysis of GA-Optimized GDB8 Training Sets

The additional GA optimization layer can be seen as providing the underlying machine learning model with the capacity to intelligently select its own data for optimal out-of-sample prediction. This in turn naturally means that the machine, through the kernel, distance metric and CM representation, is finding optimal maps from molecular structure and composition to property with respect to the diverse chemical space of the validation database. As such, it is interesting to inspect the outcome of this optimization. While this is rapidly overwhelming for larger training sets, for =10 training sets the most essential chemistry encoded can easily be grasped. Figure 3 shows the optimized sets for all properties.

It is important to keep in mind that the molecules shown likely do not represent the global optimum, rather they represent one of many near-optimal combinations. We have obtained them by first resampling unique ensembles of =10 training sets many times, with a biased probability proportional to the distance from each molecule in the best GA-optimized training set for each property. All training sets sampled in this way produced significantly better RMAEs comparatively to random sampling. Thus, each GA-optimal molecule should be interpreted as being representative for very similar compounds which could be selected in optimized training sets, and are reported in figure 3. We therefore find it appropriate to relabel each index in the GA-optimal training sets as molecular classes I-X. Note though that, due to lack of an iterative optimisation procedure, ML models based on such biased training sets composed of these similar molecules yield slightly worse performance than ML models resulting from the underlying GA-optimized training sets.

From inspection it is obvious that the chemistry, represented by these classes, is very rich, including linear, planar, cage like, branched, and even strained structures. Variety in chemical composition is maintained through the occasional replacement of carbon units by functional groups containing oxygen or nitrogen. Furthermore, all hybridization states, , , and are represented. It can also be seen that many properties share very similar molecular classes, with some molecules being shared across optimal training sets for different properties. For example classes VII and VIII for ZPVE and , or classes VIII and IV for and , respectively. We note that for energetic properties in particular, very stable and unstrained saturated molecules are selected as well as rather exotic unsaturated and strained systems, suggesting a bias towards the extremes of the distribution. Additionally, these exotic molecules tend to appear more so for more complex properties, such as HOMO-LUMO gap or . However, we do not find it obvious to further rationalize the specific selection of chemistries.

Figure 4: Left: Principal moment of inertia density plots for the training database (grayscale) and 1000 GA-optimized N=10 training sets (colored) for enthalpy of atomization, H. Color corresponds to selection probability upon GA optimization. Right: training data cumulative density plots for the properties: H, , , and ZPVE. Normalised selection probabilities inferred from the ensemble of GA optimisations for =10 shown in black, with a corresponding 8th order polynomial fit shown in purple. Circle sizes and colors also correspond to selection probabilities.

iii.3 GDB8 Selection Pressures

While the visual interpretation of the chemical classes discussed previously is not obvious, we can use the statistics from the GA optimization runs to systematically identify the effect of the bias. More specifically, to understand selection pressures we analyzed the regions of chemical space which produce optimal training sets for a given training set size. To this end, we have run 1000 identical GA optimizations for training sets of size =10 to obtain sufficient statistics to infer selection probabilities. In figure 4 we plot these results in terms of the principal moment of inertia density of molecular structures (left) for enthalpies of atomization, , as well as in terms of the cumulative density plots (right) for properties , , and ZPVE. Additionally on each cumulative density plot, we overlay the inferred selection probabilities of all molecules within the training database. It can be seen from the principal moment of inertia density plot, that while the training database disproportionately contains linear and oblate/prolate structures, GA optimization does not necessarily drive training set composition towards regions of high density within the GDB8 database. Indeed, many of the molecules which are frequently selected during training set composition optimization are from medium to low density regions and appear to rather homogeneously cover the entire plane.

In contrast, from the cumulative property density plots it can be seen that GA optimization preferentially selects regions of low property density (tails), while regions of high density (linear) are less likely to be sampled. We have investigated if introducing a deliberate bias into the training set sample, through importance sampling of the property distribution alone (shown in figure 4) to specifically over-represent these regions, affords ML models with increased predictive power. Unfortunately, the introduction of such bias does not necessarily yield any improvement over random sampling (and indeed can even be worse). Thus, while there are systematic trends upon GA optimization, there do not seem to be simple biasing rules purely based on properties, and the effective change in sampling chemical space has to be taken into account. We illustrate the effect of the GA-optimization on training set distribution in figure 5: We plot the difference in distance distribution between an average of 1000 randomly generated training sets of various sizes and their GA-optimized counterparts for the same properties. It can be seen that there is a small yet systematic GA-induced outward shift, i.e. to predominantly sample training molecules which are further apart from each other across chemical space. As with the reduction in RMAE upon GA-optimization these changes are not identical across properties, and appear to slowly converge to a constant per property. In view of this, we note that we have additionally tried to optimize training sets through maximal spread and thus generate training sets which are more chemically diverse, however, this yielded no improvement over random sampling.

Figure 5: GA-optimization tends to increase molecular distances: Normalised distance distribution differences (GA-optimized training set distribution minus randomly selected counterpart) for properties (top left, green), (top right, purple), (bottom left, cyan) and ZPVE (bottom right, orange) as a function of training set size .

iii.4 Design and Application of =10 Training Sets for GDB9-8 ML Models

In order to investigate the transferability of the above insights gained upon the molecular classes we have studied the effect of designing an artificially constructed training set, derived from the classes discovered through the use of GA on the GDB8 database. More specifically, we define "projection" rules to sample the larger and independent GDB9-8 database for which no GA has ever been performed. For each of the GDB8 molecules used to train the ML model of , on display in figure 3, we individually replace each hydrogen with the heavy substituents -OH, NH and CH. Figure  6 illustrates the procedure for the projection of GDB8 molecule -V in figure 3 to GDB9-8. This procedure leads to hundreds of pseudo-GDB9-8 molecules, all having the same number of heavy elements as the other molecules present in GDB9-8, containing a total of 100k molecules. For each of these new, pseudo-GDB9-8 structures, we search the GDB9-8 training database (discussed in section II.3) for the closest set of molecules which represent each of the ten classes. With the resulting molecules we trained new ML models for which are applicable to GDB9-8.

Figure 6: Exemplification of applying the projection rule GDB8 GDB9-8 to the molecular class -V (lower left, see figure  3) for -OH substitution. The box contains the pseudo-GDB9-8 molecules obtained by substitution of each alkyl hydrogen (shown in orange). For each of these pseudo molecules the entire GDB9-8 database is queried and the closest molecule is selected to represent the projection of this molecular class into the GDB9-8 database (lower right).

While there are many pseudo-GDB9-8 structures found during the search which do exist (with some deviation in due to small geometrical differences), many of them do not have real GDB9-8 counterparts and in these cases a closest analogue is selected. When creating training sets for use in the GDB9-8 database, we do not weight these unfavourably and instead simply generate them with molecules uniformly sampled from each projected molecular class. On average, the resulting ML model affords a reduction of RMAE for out-of-sample predictions by more than 20% (decrease from 116.2 to 90.4) with respect to randomly generated training sets. Furthermore, upon use of the projected molecular classes, the error of the most extreme outlier in GDB9-8 is reduced from 275.4 to 116.1, while the error of the best training set reduces from 75.3 to 61.4. One should keep in mind that this still represents a substantial reduction: Due to the exponential scaling of chemical space with molecular size, GDB9-8 is roughly an order of magnitude larger than GDB8. As such, one should not expect that a simple projection from the 10 best molecules in GDB8 towards GDB9-8 results in the optimal set of 10 molecules out of GDB9-8, but rather in substantial sub-optimality.

Iv Conclusions

Here, genetic algorithms (GA) have been employed to optimize the composition of molecular training sets that can be used for the training of machine learning models of molecular properties. Application of GA to a training set of a given size improves the performance of the ML model substantially, when compared to a model trained on randomly selected molecules coming from the GDB subspace of organic molecules that follows a distribution that is comprehensive with respect to chemical intuition. Conversely, for achieving the same accuracy, dramatically less training examples are necessary—provided that the user has the possibility to bias the selection of training examples prior to training. Ensembles of GA optimization procedures have shown that, while there is evidence of systematic bias towards low density regions of property distributions, attempting to construct a beneficially biased training instance distribution for improved ML models is not obvious. However, the design of improved training sets containing structures identified through the use of simple molecular projection rules applied to GA-optimized training sets seems to be possible. We have exemplified this for 100k molecules in GDB9-8 using constructed training set molecules obtained by GA for 20k molecules in GDB8.

It is worth noting that all optimized training sets reported are unlikely to correspond to global optima, but instead are rather near-optimal. The collection of training molecules producing such optimized RMAEs is heavily dependent on the choice of GA parameters, but in particular upon population size, i.e the extent of training set sampling. Conversely the speed of optimization is directly linked to population and training set sizes, thus there is a trade-off between RMAE reduction and computation time. Nevertheless this error can be further reduced using larger population sizes, or more training data. Tightening GA convergence criteria may also yield some improvement. Furthermore, the distribution of optimal training instances depends strongly on the chosen fitting function, in our case the specific combination of Slater-type kernel function, Manhattan distance, and Coulomb-matrix. Modification of the model (through representation/regressor/hyperparameter-selection) will likely influence the details of the selection pressure of particular molecules for a given training set size and property. Any other ML model is likely to lead to another optimal distribution. However, the overall insight that substantial model improvements can be obtained through GA optimization should be general since the selection bias present in training sets is independent of model details. The exact nature of the relationship between ML model specifics and selection bias remains to be elucidated in future work.

Finally, we note that while we have investigated how to remove the bias in a training set of a certain size for a given ML model, the inherent bias in the entire data set has not been explored—it is an open question, however, if that is even possible, due to the practically infinite number of possible molecular compositions and geometries. Future work will show if our conclusions also apply to ML models trained on other subsets of chemical space, e.g. corresponding to ZINC-molecules or the “representative set of molecules”  Virshup et al. (2013).

V Acknowledgments

This research was supported by the NCCR MARVEL, funded by the Swiss National Science Foundation. OAvL acknowledges funding from the Swiss National Science Foundation (No. PP00P2_138932). Some calculations were performed at sciCORE ( Scientific Computing Core Facility at the University of Basel and at the Swiss National Supercomputing Centre (CSCS,


  • Cramer (2004) Cramer, C. J. Essentials of Computational Chemistry: Theories and Models (2nd Edition); John Wiley & Sons: West Sussex, England, 2004.
  • Rupp et al. (2012) Rupp, M.; Tkatchenko, A.; Müller, K.-R.; von Lilienfeld, O. A. Phys. Rev. Lett. 2012, 108, 058301.
  • Ramakrishnan et al. (2015) Ramakrishnan, R.; Dral, P. O.; Rupp, M.; von Lilienfeld, O. A. J. Chem. Theory Comput. 2015, 11, 2087–2096.
  • Montavon et al. (2013) Montavon, G.; Rupp, M.; Gobre, V.; Vazquez-Mayagoitia, A.; Hansen, K.; Tkatchenko, A.; Müller, K.-R.; von Lilienfeld, O. A. New J. Phys. 2013, 15, 095003.
  • von Lilienfeld (2013) von Lilienfeld, O. A. International Journal of Quantum Chemistry 2013, 113, 1676–1689.
  • Hansen et al. (2015) Hansen, K.; Biegler, F.; Ramakrishnan, R.; Pronobis, W.; von Lilienfeld, O. A.; Müller, K.-R.; Tkatchenko, A. J. Phys. Chem. Lett. 2015,
  • Ramakrishnan et al. (2015) Ramakrishnan, R.; Hartmann, M.; Tapavicza, E.; von Lilienfeld, O. A. J. Chem. Phys. 2015, 143, 084111.
  • Hansen et al. (2013) Hansen, K.; Montavon, G.; Biegler, F.; Fazli, S.; Rupp, M.; Scheffler, M.; von Lilienfeld, O. A.; Tkatchenko, A.; Müller, K.-R. Journal of Chemical Theory and Computation 2013, 9, 3404–3419.
  • von Lilienfeld et al. (2015) von Lilienfeld, O. A.; Ramakrishnan, R.; Rupp, M.; Knoll, A. International Journal of Quantum Chemistry 2015, 115, 1084–1093.
  • Hansen et al. (2015) Hansen, K.; Biegler, F.; Ramakrishnan, R.; Pronobis, W.; von Lilienfeld, O. A.; Müller, K.-R.; Tkatchenko, A. The Journal of Physical Chemistry Letters 2015, 6, 2326–2331.
  • Huang and von Lilienfeld (2016) Huang, B.; von Lilienfeld, O. A. The Journal of Chemical Physics 2016, 145.
  • Ramakrishnan et al. (2014) Ramakrishnan, R.; Dral, P. O.; Rupp, M.; von Lilienfeld, O. A. Scientific Data 2014, 1, 140022 EP –.
  • Ruddigkeit et al. (2012) Ruddigkeit, L.; Deurson, R. v.; Blum, L. C.; Reymond, J.-L. Journal of Chemical Information and Modelling 2012, 52.
  • Dral et al. (2015) Dral, P. O.; von Lilienfeld, O. A.; Thiel, W. Journal of Chemical Theory and Computation 2015, 11, 2120–2125, PMID: 26146493.
  • Hastie et al. (2001) Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning; Springer Series in Statistics; Springer New York Inc.: New York, NY, USA, 2001.
  • Ramakrishnan and von Lilienfeld (2015) Ramakrishnan, R.; von Lilienfeld, O. A. Chimia 2015, 69, 182–186.
  • Turing (1995) Turing, A. M. In Computers & Thought; Feigenbaum, E. A., Feldman, J., Eds.; MIT Press: Cambridge, MA, USA, 1995; Chapter Computing Machinery and Intelligence, pp 11–35.
  • Barricelli (1954) Barricelli, N. Methodos 1954, 45–68.
  • Fraser (1957) Fraser, A. Aus. J. Bio. Sci. 1957, 484–491.
  • Godefroid and Khurshid (2002) Godefroid, P.; Khurshid, S. In Tools and Algorithms for the Construction and Analysis of Systems; Katoen, J.-P., Stevens, P., Eds.; Lecture Notes in Computer Science; Springer Berlin Heidelberg, 2002; Vol. 2280; pp 266–280.
  • Sastry and Goldberg (2001) Sastry, K.; Goldberg, D. E. Modeling Tournament Selection With Replacement Using Apparent Added Noise; 2001.
  • Spears and Anand (1991) Spears, W.; Anand, V. In Methodologies for Intelligent Systems; Ras, Z., Zemankova, M., Eds.; Lecture Notes in Computer Science; Springer Berlin Heidelberg, 1991; Vol. 542; pp 409–418.
  • Spears and Dejong (1991) Spears, W. M.; Dejong, K. A. Foundations of Genetic Algorithms 1991,
  • Eshelman (1990) Eshelman, L. J. The CHC Adaptive Search Algorithm: How to Have Safe Search When Engaging in Nontraditional Genetic Recombination. Foundations of Genetic Algorithms. 1990; pp 265–283.
  • Deb and Goyal (1996) Deb, K.; Goyal, M. Computer Science and Informatics 1996, 26, 30–45.
  • Tew et al. (2007) Tew, D. P.; ; Klopper, W.; Heckert, M.; ; Gauss, J. The Journal of Physical Chemistry A 2007, 111, 11242–11248.
  • Hickey and Rowley (2014) Hickey, A. L.; Rowley, C. N. The Journal of Physical Chemistry A 2014, 118, 3678–3687.
  • Virshup et al. (2013) Virshup, A. M.; Contreras-García, J.; Wipf, P.; Yang, W.; Beratan, D. N. Journal of the American Chemical Society 2013, 135, 7296–7303.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description