Structure-based Sampling and Self-correcting Machine Learning for Accurate Calculations of Potential Energy Surfaces and Vibrational Levels

Structure-based Sampling and Self-correcting Machine Learning for Accurate Calculations of Potential Energy Surfaces and Vibrational Levels

Pavlo O. Dral Max-Planck-Institut für Kohlenforschung, Kaiser-Wilhelm-Platz 1, 45470 Mülheim an der Ruhr, Germany    Alec Owens Department of Physics and Astronomy, University College London, Gower Street, WC1E 6BT London, United Kingdom Max-Planck-Institut für Kohlenforschung, Kaiser-Wilhelm-Platz 1, 45470 Mülheim an der Ruhr, Germany    Sergei N. Yurchenko Department of Physics and Astronomy, University College London, Gower Street, WC1E 6BT London, United Kingdom    Walter Thiel Max-Planck-Institut für Kohlenforschung, Kaiser-Wilhelm-Platz 1, 45470 Mülheim an der Ruhr, Germany
July 26, 2019

We present an efficient approach for generating highly accurate molecular potential energy surfaces (PESs) using self-correcting, kernel ridge regression (KRR) based machine learning (ML). We introduce structure-based sampling to automatically assign nuclear configurations from a pre-defined grid to the training and prediction sets, respectively. Accurate high-level ab initio energies are required only for the points in the training set, while the energies for the remaining points are provided by the ML model with negligible computational cost. The proposed sampling procedure is shown to be superior to random sampling and also eliminates the need for training several ML models. Self-correcting machine learning has been implemented such that each additional layer corrects errors from the previous layer. The performance of our approach is demonstrated in a case study on a published high-level ab initio PES of methyl chloride with 44,819 points. The ML model is trained on sets of different size and then used to predict the energies for tens of thousands of nuclear configurations within seconds. The resulting datasets are utilized in variational calculations of the vibrational energy levels of CHCl. By using both structure-based sampling and self-correction, the size of the training set can be kept small (e.g. 10% of the points) without any significant loss of accuracy. In ab initio rovibrational spectroscopy, it is thus possible to reduce the number of computationally costly electronic structure calculations through structure-based sampling and self-correcting KKR-based machine learning by up to 90%.


I Introduction

Molecular potential energy surfaces (PESs) are often computed using high-level ab initio theory, notably in theoretical ro-vibrational spectroscopy where sub-wavenumber accuracy is targeted for calculated transition frequencies. Such sophisticated quantum mechanical (QM) calculations can be extremely costly, and they have to be repeated for tens of thousands of nuclear geometries. Particularly challenging are larger systems with many degrees of freedom that suffer from the so-called “curse of dimensionality”. Machine learning (ML) techniques offer a means to significantly reduce the necessary computational time Behler (2016).

It has been shown that neural networks (NNs) can be used to interpolate between points along the coordinate of simple reactions Gastegger and Marquetand (2015); Kolb et al. (2016), to improve the accuracy of semiempirical QM/MM calculationsShen, Wu, and Yang (2016), and to calculate accurate energies of conformations and clusters of organic and inorganic species Bholoa, Kenny, and Smith (2007); Malshe et al. (2009); Artrith and Behler (2012); Schütt et al. (2017); Smith, Isayev, and Roitberg (2017). Kernel ridge regression (KRR) can reduce errors of low-level QM methods such that atomization enthalpies of organic isomers can be reproduced with an accuracy close to higher-level QM methods Dral, von Lilienfeld, and Thiel (2015); Ramakrishnan et al. (2015a). Gaussian process models are able to sample low-energy regions of the PES Toyoura et al. (2016), correct density functional theory energies of water Bartók et al. (2013), predict energies and forces in Si clusters Bartók, Kondor, and Csányi (2013), and construct global PESs for N Cui and Krems (2016). ML can also be employed for accurate yet computationally fast PES representation; for example, NNs have been used to compute molecular vibrational energy levels Manzhos and Carrington (2006); Manzhos et al. (2006); Manzhos, Dawes, and Carrington (2014); Majumder et al. (2015); Shao et al. (2016), reaction probabilities Shao et al. (2016), and reaction rates Blank et al. (1995). Whilst potentials based on KRR Chmiela et al. (2016), NNs Hobday, Smith, and Belbruno (1999); Hobday, Smith, and BelBruno (1999); Raff et al. (2005); Behler and Parrinello (2007); Behler et al. (2008); Pukrittayakamee et al. (2009); Eshet et al. (2010); Khaliullin et al. (2010); Behler (2011); Khaliullin et al. (2011); Morawietz, Sharma, and Behler (2012); Behler (2015) or Gaussian process models Bartók et al. (2010) have been utilized in molecular dynamics (MD) simulations.

ML performs best in the interpolation regime and can fail spectacularly when used for extrapolation. For relatively large training sets and high error tolerance, as in many chemical applications, training on a random sample of data points can give useful results Rupp et al. (2012); Hansen et al. (2013); Dral, von Lilienfeld, and Thiel (2015); Ramakrishnan et al. (2015a); Cui and Krems (2016). Random sampling is straightforward and has minimal computational cost involved but it does not ensure per se that ML avoids extrapolation. Thus, it is preferable to ensure that ML is applied only for interpolation by using better sampling techniques. Even a simple manual choice of critical points on the PES can improve the accuracy of ML significantly Gastegger and Marquetand (2015).

Automated sampling is necessary for more complex PESs and several approaches have been reported in the literature. When training ML potentials, structures have been screened on their input vector elements to see if they are within the minimum and maximum values of the training set, which can then be expanded to include ‘outside-the-range’ structures Behler (2011); Botu and Ramprasad (2015). NN PESs of simple reactions have been re-fitted with points when the initial NN potential has encountered problems Kolb et al. (2016). Similarly, the error of the ML potential can be monitored during dynamics and ML can be re-trained on-the-fly by including additional points when errors become too large Li, Kermode, and Vita (2015). In the case of very simple systems like \ceH3, the training set can be generated on a coarse grid Kolb et al. (2016).

Because of their inherent flexibility different NNs can give very different energies for the same structure, and hence training different NNs on the same data can also highlight problem structures that should be incorporated into the training set Morawietz, Sharma, and Behler (2012); Artrith and Behler (2012). Another approach exploits the fact that the most important regions of the PES are likely to be covered by standard MD simulations. Snapshots from the MD trajectories can thus be used in the training set to produce a more robust ML PES Raff et al. (2005); Pukrittayakamee et al. (2009); Geiger and Dellago (2013); Suzuki, Tamura, and Miyazaki (2016); Schütt et al. (2017); Chmiela et al. (2016). For better sampling of vibrational phase space, previous studies have employed stretched bond distances in MD simulations Pukrittayakamee et al. (2009) or a random displacement to sample along the vibrational normal modes Smith, Isayev, and Roitberg (2017). In other work, structures have been chosen from MD trajectories based on various clustering approaches, which reduce the redundancies in the NN training set Häse et al. (2016). For calculations of rotation-vibration spectra where low-energy regions of the PES are more important, probability distribution functions that favor structures near equilibrium have been used for pseudo-random sampling Manzhos and Carrington (2006); Manzhos et al. (2006); Manzhos, Dawes, and Carrington (2014).

Here we report a new and efficient method for generating highly accurate molecular PESs which uses KRR-based machine learning. In our approach, we initially define a set of grid points (nuclear configurations) that covers the relevant low-energy part of the PES, which can be done using relatively inexpensive energy estimates (see Section II.3). Thereafter, our sampling relies on an interpolation between structures rather than on energies, i.e. the sampling procedure itself does not require energy calculations or the training of several ML models. We investigate the effect of structure-based sampling on the accuracy of the ML model in comparison with random sampling. We also explore what error reduction can be achieved through ML self-correction.

To illustrate our KRR-based ML model we utilize a high-level ab initio PES for methyl chloride (\ceCH3^35Cl) Owens et al. (2015). This PES, denoted CBS-35, is based on extensive explicitly correlated coupled cluster calculations with extrapolation to the complete basis set (CBS) limit and incorporates a range of higher-level additive energy corrections. These include: core-valence electron correlation, higher-order coupled cluster terms, scalar relativistic effects, and diagonal Born-Oppenheimer corrections. A considerable amount of computational time was spent generating this PES. For example, a single point of the CBS-35 PES at the equilibrium geometry required 9 separate calculations and a combined total of 26.7 hours on a single core of an Intel Xeon E5-2690 v2 GHz processor. Building a reliable PES requires tens of thousands of nuclear geometries for polyatomic species and the computational time increases for distorted configurations due to slower energy convergence. Efficient ML techniques that can reduce the necessary computational effort are therefore highly desirable.

The paper is structured as follows: In Section II we present the KRR-based self-correcting ML model, the molecular descriptor, and the sampling procedure. Details on the variational calculation of vibrational energy levels are also given. In Section III the ML model is evaluated for different sizes of training set, and vibrational energies are computed using several ML model PESs. Concluding remarks are offered in Section IV and an outlook is provided in Section V.

Ii Methods

ii.1 Machine Learning

The chosen ML technique is based on KRR Hastie, Tibshirani, and Friedman (2009) and is similar to the approach used for ML across chemical compound space Rupp et al. (2012); von Lilienfeld (2013); Hansen et al. (2013); Ramakrishnan et al. (2015a, b) and for calculating relative stabilities of organic isomers Ramakrishnan et al. (2015a); Dral, von Lilienfeld, and Thiel (2015). Below we give details of the KRR approach relevant to this work. All ML calculations were performed using our own program package MLatom Dral (2017).

Some property value of a nuclear configuration represented by the molecular descriptor (discussed in Section II.2) is calculated using the expression Hastie, Tibshirani, and Friedman (2009),


where the sum runs over configurations represented by the molecular descriptors of the training set. The regression coefficient refers to the configuration , and is the kernel. Here we use the Gaussian kernel which is equivalent to the form often employed in the literature Rupp et al. (2012); Hansen et al. (2013),


where is the kernel width, and is the Euclidean distance (the norm) between descriptors and (vectors of size ) defined as


This definition of distances between molecular structures Rupp et al. (2012); von Lilienfeld (2013); Hansen et al. (2013) and a similar definition based on the norm (instead of the norm used here) Ramakrishnan et al. (2014); Hansen et al. (2013); Dral, von Lilienfeld, and Thiel (2015); Ramakrishnan et al. (2015a); Bereau, Andrienko, and von Lilienfeld (2015); Rupp, Ramakrishnan, and von Lilienfeld (2015); Ramakrishnan et al. (2015b); Faber et al. (2016) were used previously to compare molecular geometries Ramakrishnan et al. (2014) and learn various molecular properties Rupp et al. (2012); von Lilienfeld (2013); Hansen et al. (2013); Dral, von Lilienfeld, and Thiel (2015); Ramakrishnan et al. (2015a); Bereau, Andrienko, and von Lilienfeld (2015); Rupp, Ramakrishnan, and von Lilienfeld (2015); Ramakrishnan et al. (2015b); Faber et al. (2016). We have also tested the Laplacian kernel employing the norm, but the PESs obtained with this kernel showed much larger errors than those obtained with the Gaussian kernel.

Training the ML model involves finding the regression coefficients , i.e. solving the minimization problem,


which has a known analytical solution Hastie, Tibshirani, and Friedman (2009)


where is the identity matrix, is the kernel matrix with the elements calculated using Eq. 2 for all pairs of training set points, is a vector with reference property values, and is the so-called regularization parameter that ensures the transferability of the model to new nuclear configurations Rupp et al. (2012); von Lilienfeld (2013).

Two additional improvements were made to the approach outlined above which were not considered in earlier work: Rupp et al. (2012); von Lilienfeld (2013); Hansen et al. (2013); Ramakrishnan et al. (2015a); Dral, von Lilienfeld, and Thiel (2015) (i) we sample data to ensure that ML interpolates between the training points and does not extrapolate beyond them (see Section II.3), and (ii) we use a nested, self-correcting ML approach. For the latter, ML is first trained on the reference ab initio energies ( in Eq. (5)) and then used to make a first-layer estimate of the deformation energy ( in Eq. (1)). The next layer corrects errors of the energies estimated in the previous layer, and so on. For example, the second layer ML model is trained on and is then used to calculate corrections for the prediction set which are summed up with layer 1 predicted energies to obtain layer 2 energies, i.e. . The dependence of the performance of the ML model on the number of layers is discussed in Section III.1.

In order to determine the optimal values of the hyperparameters and for each layer, we sample 80% of the training set points into a sub-training set using the same sampling procedure that was employed for the training set (see Section II.3). The points with deformation energies less than 10,000 cm are taken from the remaining 20% into the validation set. Using the ML model trained on the sub-training set, we search for values of the hyperparameters which give the lowest root-mean-square error (RMSE) for the validation set. This is performed using a simple logarithmic grid search Hansen et al. (2013); Rupp (2015). By optimizing the hyperparameters such as to obtain a better description below 10,000 cm we ensure an adequate treatment of the spectroscopically most relevant part of the PES.111On only one occasion (one of the ML models trained on 50% randomly drawn grid points) the RMSE of the training set in the first-layer became unreasonably high due to overfitting caused by very small value of the parameter . In this case we set to for the first layer and re-trained the ML model.

The above self-correcting scheme is similar in spirit to a two-stage NN model that has been introduced to remedy peculiarities of NNs Manzhos et al. (2006); Manzhos, Dawes, and Carrington (2014), such as the presence of ill-defined regions (“holes”) in NN PESs as a result of overfitting. In this previous work, the NN potential was first fit with as few nodes as possible to eliminate holes, which resulted in a large RMSE; to obtain a PES of reasonable quality, the NN potential was then re-fit by incorporating more nodes to eliminate residual errors.

In our study, overfitting is primarily prevented by using the regularization parameter and the problem of holes on the PES is addressed by our sampling procedure (discussed in Section II.3). The KRR approach makes fitting straightforward as there is no need to manually adjust any of the parameters. As a result, the RMSE of our one-layer ML model is already quite low, while multi-layer ML models possess still slightly smaller RMSEs (see Section III.1). This is in contrast to two-stage NN models where the RMSE of the first fit is one order of magnitude larger than that of the second fit Manzhos et al. (2006).

ii.2 Molecular descriptor

The success of ML is largely determined by the choice of an appropriate molecular descriptor. Many of the molecular descriptors proposed in literature are functions of the internuclear distances, for example the Coulomb matrixRupp et al. (2012); Hansen et al. (2013), the Bag of BondsHansen et al. (2015), the atom-centered symmetry functionsBehler and Parrinello (2007); Behler (2016), the bispectrum of the neighbor densityBartók et al. (2010), the smooth overlap of atomic positionsBartók, Kondor, and Csányi (2013), and othersMajumder et al. (2015). We designed and tested many descriptors for \ceCH_3Cl but overall the most accurate was a vector with ten elements corresponding to the ten interatomic pairs. Each element is defined as the corresponding internuclear distance in the near-equilibrium reference geometry of \ceCH_3Cl () divided by the current distance (), e.g. for the \ceC-Cl atomic pair. This form of descriptor ensures that the ML model is rotationally and translationally invariant.

Since the molecular descriptor also has to be atom index invariant, we sort the hydrogen nuclei by the sum of their internuclear repulsions with the four other nuclei for structure-based sampling (see Section II.3). Simple sorting of hydrogen nuclei in the molecular descriptor may however lead to instabilities in regions where the order of hydrogen nuclei changes. To avoid this problem we employ a normalized permutational invariant kernel in our ML calculations (Section II.1) as suggested in the literature:Bartók et al. (2013); Bartók and Csányi (2015)


where permutes the order of hydrogen nuclei. There are permutations of three hydrogen nuclei. We found that results obtained using the normalized permutational invariant kernel are superior to those obtained using a sorted molecular descriptor.

ii.3 Grid points and sampling

The target PES needs to be evaluated on a large number of pre-defined grid points (nuclear configurations). These grid points can be determined through rather inexpensive initial energy computations. In our previous work on \ceCH_3Cl, this was done by creating a sparse grid along each one-dimensional cut of the PES, calculating single-point energies using a reliable but relatively cheap ab initio method (e.g. CCSD(T)-F12b/VTZ-F12), fitting a polynomial function to get rough energy estimates, and randomly selecting all remaining points using an energy-weighted Monte Carlo type sampling algorithm to cover the desired low-energy PES regionOwens et al. (2015). This procedure, which was inexpensive computationally, produced a grid of 44,819 geometries with energies up to  50,000 cm ( is the Planck constant and is the speed of light). In our present proof-of-principle study we use this grid, which was employed for the final CBS-35 PES of \ceCH_3ClOwens et al. (2015), and partition it into training and test sets using the structure-based sampling procedure described below; the energies for the training set are taken from the available ab initio resultsOwens et al. (2015) while those for the remaining grid points are predicted essentially for free using ML.

In our sampling procedure we select nuclear configurations for the training set based on the relative distance of the molecular descriptors. This is done such that all remaining points used for prediction are within the boundaries of the training set or very close to it. The first point of the training set is taken closest to equilibrium. The second point of the training set is the one among all grid points that has the largest distance from the first point. For each remaining point on the grid we calculate its distance to all points in the training set using Eq. (3) and determine the shortest distance to any point in the training set. We then include the grid point that has the ‘longest’ shortest distance into the training set, as illustrated in Figure 1. This procedure is repeated until the required number of points are selected for the training set. The other remaining points are used for prediction, and by construction they lie within the training set or very close to it; at least one of their distances to the points in the training set should be shorter than the shortest distance between points in the training set.

Figure 1: Illustration of sampling 3 points into the training set out of 6 points in two-dimensional space. Point 1 is closest to the equilibrium, while point 2 is farthest apart from point 1, so they are both included into the training set. Point 3 has the shortest distance to point 1 of the current training set, which is longer than any shortest distance of points 4, 5, and 6 to the points in the current training set (1 and 2). Thus, point 3 is included into the training set (red), while the remaining points 4–6 are left for the prediction set (blue).

This sampling procedure is closely related to the farthest-point traversal iterative procedure used to select points such that they are as distant as possible from the previously selected points. In this respect structure-based sampling can be also viewed as a way to obtain a training set as diverse as possible.

The sampling procedure outlined above selects a training set of predetermined size from a larger set of predefined structures. The same sampling principles can be applied to test whether additional structures (beyond the initially chosen set) should be included into the training set. If this is the case, the ML model needs to be re-trained (similar to an approach described in Ref. 31).

ii.4 Variational calculations

In this work we use the nuclear motion program TROVE Yurchenko, Thiel, and Jensen (2007); Yachmenev and Yurchenko (2015) for computing vibrational energy levels. Since ro-vibrational calculations have previously been reported for CHCl Owens et al. (2015, 2016), we summarize only the key aspects relevant for the present study.

In variational calculations the PES must be represented analytically. To do this we introduce the coordinates,


where for the C–Cl bond length , and for the three C–H bond lengths and . For the angular terms,


where , , are the interbond angles, and , , are the dihedral angles between adjacent planes containing HCCl and HCCl. Here , and are the reference equilibrium structural parameters of CHCl.

The potential function (maximum expansion order ) is given by the expression,


and contains the terms,


which are symmetrized combinations of different permutations of the vibrational coordinates , and transform according to the representation of the molecular symmetry group Bunker and Jensen (1998).

The expansion coefficients are determined through a least-squares fitting to the ab initio and/or ML data. Weight factors of the form Partridge and Schwenke (1997)


are used in the fittings. Here and the normalization constant (all values in cm). Larger weights () are assigned to lower deformation energies (), which correspond to more spectroscopically important regions of the PES. As shown in Figure 2, this form ensures that structures with energies up to 10,000 cm above equilibrium have weights near unity, whilst other configurations are significantly downweighted with increasing energy.

Figure 2: Decay of unitless weight factors calculated using Eq. (14) with increasing deformation energy in cm.

To ensure a reliable comparison the CBS-35 PES expansion parameter set was used to fit the ab initio and ML-generated datasets. This contained 414 parameters and included linear expansion terms. For this reason we fixed the values of , , and , to be the same as those used for the CBS-35 PES. Each fit employed Watson’s robust fitting scheme Watson (2003), which reduces the weights of outliers and lessens their influence in determining the final set of parameters.

Iii Results and discussion

In this section we investigate the accuracy of the ML model s trained on 50%, 25%, and 10% of the available 44,819 deformed structures and their ab initio energies used to construct the CBS-35 PES Owens et al. (2015).

To evaluate the accuracy of predicted energies () we employ standard error measures as well as the weighted root-mean-square error (wRMSE):


Weights are calculated using Eq. 14.

iii.1 Optimal machine learning model

We first examine how the accuracy of ML calculated deformation energies for the prediction set depends on the number of ML layers and the sampling procedure. The results in Table 1 indicate that the errors are significantly lowered by adding the second layer for all models except for the one trained on 10% points selected with structure-based sampling. Including a third layer significantly reduces the errors in a few cases, while a fourth layer does not yield any noticeable improvements. We therefore expect that adding further layers will make no difference.

Number of layers
1 2 3 4
Structure-based sampling from unsliced data
22,409 (50%) 22 3,985 6.57 0.37 0.37 0.37
11,204 (25%) 1 1,033 3.21 1.87 1.60 1.60
4,481 (10%) 1 195 5.00 4.99 4.83 4.83
Structure-based sampling from data sliced into three regions
22,408 (50%) 131 7,215 2.31 0.62 0.62 0.62
11,203 (25%) 19 3,348 2.90 2.59 2.58 2.58
4,480 (10%) 1 1,191 4.42 3.63 3.63 3.63
Random sampling
22,409 (50%)
11,204 (25%)
4,481 (10%)
  for the entire grid of 44,819 points.
  for the entire grid of 44,819 points.
 Standard deviations were calculated for 16, 20, and 30 various randomly drawn
training sets for ML trained on 50%, 25%, and 10% grid points, respectively.
Table 1: (Continued)

For practical purposes the three-layer model appears to be sufficient. However, in the following we have applied the four-layer model because the computational cost of including additional layers is rather low and we have not observed any significant accumulation of numerical noise. It takes around 4 hours to optimize the ML hyperparameters using a fairly inefficient optimization algorithm, around a minute to train and only a couple of seconds to make predictions with the 50%-ML model on 20 cores of an Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz processor.

As shown in Table 1, random sampling is clearly inferior to structure-based sampling; all four-layer ML models trained on randomly drawn points have wRMSEs significantly higher than those of the four-layer, structure-based sampling ML models. Interestingly, self-correction works well even for random sampling: it reduces the wRMSEs by 32–44% and the standard deviations by 70–90%. Despite this the remaining error of  cm for the randomly sampled, four-layer ML models trained on 50% of grid points is still much higher than the wRMSE of 0.37 cm for the four-layer, structure-based sampling ML model trained on the same number of grid points. Standard deviations for random sampling are also relatively high and increase from 0.87 to 1.67 cm when going from the 50% to 10% training sets. There is no such problem with structure-based sampling which provides a unique training set — an important practical advantage for high-accuracy applications. As for the computational cost of sampling, it takes ca. 9 hours to sample 50% from 44,819 data points on 12 Intel(R) Xeon(R) CPU X5680 @ 3.33GHz processors.

As the training set becomes relatively small, structure-based sampling may lead to an under-representation of the training points in the low-energy region, e.g. 10% training points drawn from the entire grid contain only 195 structures with deformation energies below 10,000 cm (compare the number of configurations with deformation energies below 1,000 and 10,000 cm, columns and in Table 1). Sorting the geometries by their distance to the equilibrium structure (which correlates strongly with the deformation energies, Figure 3), followed by slicing the data into several regions and sampling points from each of these regions leads to a more energy-balanced training set. Looking at Figure 3, one can argue that splitting the set into three regions with equal number of structures should be close to optimal. The first region includes the most important structures with deformation energies below 10,000 cm and a significant portion of structures with energies between 10,000 and 20,000 cm. The second region mainly includes configurations with energies between 10,000 and 20,000 cm but also a considerable number of geometries above and below this region. The third slice includes all remaining high-energy structures. Structure-based sampling from each of the above regions leads to close to those expected from random sampling, e.g. 10% training points drawn from the sliced grid contain 1,191 structures with deformation energies below 10,000 cm (Table 1). As a result, wRMSE of the ML model trained on the latter training set (3.63 cm) is lower than wRMSE of the ML model trained on 10% points drawn from unsliced data (4.83 cm). However, such slicing does not generate training sets that are as diverse as possible, and therefore the errors of the ML models become higher for the sliced grids as the training set increases in size (training sets with 25% and 50% grid points, Table 1). Thus we recommend slicing only for very small training sets where low-energy structures are under-represented.

Figure 3: Correlation between ab initio deformation energies in cm and unitless distances to the near-equilibrium structure calculated using Eq. (3). A linear trend line is shown in red with its value (0.83). Orange vertical lines slice the data into three regions with equal numbers of data points in the training set. Each data point is represented by a blue dot with a black edge, hence the most densely populated areas are black.

In the following we will discuss four-layer ML models trained on 50%, 25%, and 10% points drawn using structure-based sampling from the available unsliced 44,819 grid points. We refer to these ML models as 50%-ML, 25%-ML, and 10%-ML, respectively. We also compare with one of the randomly sampled, four-layer ML models referred to as r50%-ML which has been chosen at random from the ML models trained on 50% of grid points. In addition, we compare with the four-layer ML model trained on 10% points drawn using structure-based sampling from each region of the dataset sliced into three regions. This model is referred to as s10%-ML in the following.

A more detailed analysis of the ML model errors, listed in Table 2, reveals that r50%-ML has the largest outliers with a wRMSE of 4.14 cm, which is more than twice as large as that of 25%-ML (wRMSE of 1.60 cm) for their respective prediction sets. The non-weighted RMSE of r50%-ML ( 167.19 cm) is more than four times higher than the non-weighted RMSE of 10%-ML ( 39.63 cm). Moreover, the RMSE of r50%-ML for energies below 1,000  cm is practically the same as the respective RMSE of 50%-ML and for energies below 10,000 cm is higher than the respective RMSE of 50%-ML, despite the fact that many more points from these regions are included into the training set (compare and for these two models in Table 1). These observations provide strong evidence for the superiority of structure-based sampling. s10%-ML has RMSEs for energies below 1,000 and 10,000 cm close to those of 25%-ML, while the respective RMSEs of 10%-ML are much higher. On the other hand, 10%-ML has a much lower non-weighted RMSE, MAE and MSE than s10%-ML. Thus, slicing clearly improves the description of the low-energy region at the cost of other regions. Apparently, for sparse training data, the benefits of a better description of the low-energy region achieved by slicing outweigh the disadvantage of an overall worse description of the PES, which is exemplified by the lower wRMSE of s10%-ML compared to the wRMSE of 10%-ML.

50%-ML r50%-ML 25%-ML 10%-ML s10%-ML
Training set
22,409 22,409 11,204 4,481 4,480
LPO 0.00 4.18 0.00 0.00 0.00
LNO 0.00 0.00 0.00 0.00
MSE 0.00 0.00 0.00 0.00 0.00
MAE 0.00 0.03 0.00 0.00 0.00
RMSE (all) 0.00 0.11 0.00 0.00 0.00
RMSE (10,000) 0.00 0.03 0.00 0.00 0.00
RMSE (1,000) 0.00 0.07 0.00 0.00 0.00
wRMSE 0.00 0.03 0.00 0.00 0.00
Prediction set
22,410 22,410 33,615 40,338 40,339
LPO 319.75 2,015.44 1,035.63 1,617.61 1,481.69
MSE 0.02 0.18 0.19 1.85
MAE 0.82 25.12 3.47 11.27 19.54
RMSE (all) 6.23 167.19 16.12 39.63 61.96
RMSE (10,000) 0.20 1.38 1.12 4.26 1.19
RMSE (1,000) 0.08 0.07 0.16 1.40 0.25
wRMSE 0.37 4.14 1.60 4.83 3.63
Entire grid
LPO 319.75 2,015.44 1,035.63 1,617.61 1,481.69
MSE 0.01 0.14 0.17 1.66
MAE 0.41 12.57 2.60 10.14 17.58
RMSE (all) 4.41 118.22 13.96 37.60 58.79
RMSE (10,000) 0.17 0.98 1.09 4.24 1.14
RMSE (1,000) 0.08 0.07 0.16 1.40 0.25
wRMSE 0.26 2.93 1.39 4.59 3.44
Table 2: (Continued)

From Figure 4 we see that deformation energies predicted by 50%-ML, r50%-ML, 25%-ML, and s10%-ML models correlate nicely with the reference ab initio energies; the value is always larger than 0.999. Deformation energies predicted by 10%-ML model (not shown in Figure 4) correlate better () with the reference energies than the energies predicted by the s10%-ML model (consistent with the above conclusions). Clearly, 50%-ML is superior to all other models and has the best correlation with far fewer outliers and smaller residual errors. This is particularly relevant for high-accuracy work as we will see in Section III.2.

Figure 4: Correlation between reference ab initio deformation energies and deformation energies predicted by 50%-ML, r50%-ML, 25%-ML, and s10%-ML for their respective prediction sets. Linear trend lines are shown in red with their value. Each data point is represented by a blue dot with a black edge, hence the most densely populated areas are black.

As for the effect of training set size, illustrated in Figure 5, wRMSEs in the prediction set drop from 144 cm for the 1%-ML model to 0.05–0.06 cm for the 85–99%-ML models. Interestingly, the 1%-ML model trained on only 448 grid points may still be regarded as a chemically meaningful representation of the PES since its non-weighted RMSE is only 0.77 kcal/mol ( 271 cm). The error drops very quickly to 12.29 cm for the 5%-ML model and is below 1.00 cm for 35% and above, finally becoming smaller than 0.5 cm for training sets with 50% or more of all configurations.

Regarding the four structure-based sampling ML models that were tested extensively, wRMSEs grow significantly from 50%-ML ( 0.37 cm) to 25%-ML ( 1.60 cm) to s10%-ML (3.63 cm) to 10%-ML ( 4.83 cm) . We investigate further the effect of the training set size for high-resolution spectroscopy applications in Section III.2, where we report vibrational energies using PESs based on the 50%-ML, r50%-ML, 25%-ML, 10%-ML, and s10%-ML models.

Figure 5: Dependence of wRMSE in the prediction set (in cm) of the four-layer, structure-based sampling ML models as a function of the training set size (in %). In all cases sampling was done from unsliced data. The plot starts with a 1% training set size and ends at 99%. The plot in the inset starts with a 10% training set size.

The analytic representation employed for the CBS-35 (discussed in Sections II.4 and III.2) was fitted with a wRMSE of 3.00 cm for the entire grid (see Section III.2). For 50%-ML the training set, prediction set, and entire grid of 44,819 points were reproduced with wRMSEs of 0.00, 0.37 and 0.26 cm, respectively (Table 2). Our ML approach, which has a defined analytic form, could therefore provide a more accurate description of the PES based on fewer training points and could possibly be employed directly in variational calculations. However, this is beyond the scope of this work and application of our ML technique in variational calculations will be the focus of future research.

As indicated by one reviewer, further fine-tuning of the ML models is possible by using the anisotropic Gaussian kernel with multiple kernel widths instead of a single parameter (as for instance in Ref. 13). In test calculations we have found that such ML models may reduce errors somewhat (e.g. by 10% for 10%-ML), but there is a substantial increase in the complexity of the ML model and the computational cost for large training sets (due to parameter optimization), and hence we decided to use one single kernel width parameter in this study.

iii.2 Vibrational energy levels

For the 50%-ML, r50%-ML, 25%-ML, 10%-ML, and s10%-ML PESs, the expansion coefficients of the potential function given in Eq. (12) were determined in a least-squares fitting to the 44,819 grid points. The results of the fittings are listed in Table 3. In addition, PESs were determined for the five, associated training sets and the results are also included in Table 2). We see that the fits of the CBS-35 PES, the ML model PESs, and the training set PESs are of a similar accuracy, with the exception of the 10%-ML fits which exhibit significantly larger errors. For the other fits, the RMSEs below 10,000 cm range between 1.17 and 2.08 cm, with wRMSE values between 3.01 and 3.59 cm for the ML-based PESs and up to 3.60 cm for the training set PESs. The mean errors are particularly low for the 50%-ML PES, its training set PES, and 25%-ML PES (0.16, 0.24, and 0.02 cm, respectively).

CBS-35 50%-ML r50%-ML 25%-ML 10%-ML s10%-ML
ML model PESs
No. of parameters 414 414 414 412 402 409
fitting wRMSE 0.82 0.83 0.89 0.99 1.39 1.13
LPO 2,717.33 2,718.84 2,546.10 2,700.54 2,506.24 2,868.59
MSE 0.20 0.16 0.56 0.02 0.81 7.38
MAE 20.82 20.83 21.01 21.83 31.73 24.63
RMSE (all) 102.24 102.28 102.13 102.69 113.78 101.39
RMSE (10,000) 1.18 1.18 1.22 1.22 3.66 1.37
RMSE (1,000) 0.33 0.33 0.33 0.34 1.08 0.40
wRMSE 3.00 3.01 3.09 3.19 5.40 3.59
Training set PESs
No. of parameters 414 414 414 410 411
fitting wRMSE 0.98 0.82 0.94 0.50 0.99
LPO 2,548.77 2,813.94 2,368.88 1,963.23 2,431.71
MSE 0.24 0.75 1.50 1.35
MAE 21.03 21.00 21.95 24.11 23.94
RMSE (all) 103.82 102.55 105.37 105.88 109.00
RMSE (10,000) 1.17 1.18 1.30 2.08 1.18
RMSE (1,000) 0.35 0.33 0.33 0.62 0.32
wRMSE 3.01 3.02 3.15 3.60 3.58
Fitting wRMSEs are relative to the PES being fitted and not to the CBS-35 data.
Note also that the weights differ slightly from Eq. 14 because Watson’s
robust fitting scheme Watson (2003) was employed (see Section II.4).
Table 3: (Continued)

In TROVE calculations the Hamiltonian was represented as a power series expansion around the equilibrium geometry in terms of nine vibrational coordinates and was constructed numerically using an automatic differentiation method Yachmenev and Yurchenko (2015). The coordinates used were identical to those given in Eqs. (7) to (11), except for the kinetic energy operator where linear expansion terms, e.g. , replace the Morse oscillator functions for the stretching modes. The kinetic and potential energy operators were truncated at 6th and 8th order, respectively, and atomic mass values were employed throughout. Calculations were carried out using a medium-sized vibrational basis set with a polyad truncation number of (see Refs. 43; 58 for further details). The basis set was constructed using a multi-step contraction scheme and contained 16,829 vibrational basis functions.

In Figure 6 we plot residual errors, , of computed vibrational energy levels using the 50%-ML, r50%-ML, 25%-ML, and s10%-ML model PESs with respect to the CBS-35 PES values. The RMSE and mean-absolute-deviation (MAD) for energies up to 5,000 and 10,000 cm are also listed for each model in Table 4. Comparing 50%-ML with r50%-ML, it is clear that structure-based sampling produces results that are far more reliable than random sampling. The residual errors are consistently smaller and more uniform for the energy range considered. The 25%-ML and s10%-ML models still perform reasonably well but errors steadily increase with energy. The 10%-ML model (not shown in Figure 6) has deteriorated and no longer gives accurate predictions (errors are much higher than 1 cm, Table 4).

Figure 6: Residual errors () of computed vibrational energy levels using the 50%-ML, r50%-ML, 25%-ML, and s10%-ML PESs, with respect to the CBS-35 PES values. Note the different scale for 10%-ML.
50%-ML r50%-ML 25%-ML 10%-ML s10%-ML
ML model PESs
RMSE (5,000 cm) 0.02 0.10 0.09 1.61 0.14
MAD (5,000 cm) 0.01 0.08 0.07 1.29 0.10
RMSE (10,000 cm) 0.04 0.18 0.16 1.75 0.28
MAD (10,000 cm) 0.03 0.15 0.12 1.44 0.21
Training set PESs
RMSE ( 5,000 cm) 0.06 0.08 0.12 0.30 0.12
MAD ( 5,000 cm) 0.05 0.06 0.10 0.25 0.10
RMSE (10,000 cm) 0.12 0.14 0.19 0.74 0.32
MAD (10,000 cm) 0.11 0.09 0.14 0.61 0.24
Table 4: Root-mean-square error (RMSE) and mean-absolute-deviation (MAD) of computed vibrational energy levels for the ML model PESs and training set PESs up to 5,000 and 10,000 cm (166 and 3,606 levels, respectively), with respect to the original CBS-35 PES values.

TROVE assigns quantum numbers to the eigenvalues by analyzing the contribution from the basis functions. This is how we match energy levels computed with different ML model PESs. However, given the approximate nature of the labeling scheme these assignments can occasionally differ between surfaces. This tends to happen mostly at higher energies (above 10,000 cm) but does not necessarily mean that the energy levels are mismatched — they have simply been labeled differently. This occurs for 2% of the computed values in the case of 50%-ML, 9% for r50%-ML, 12% for 25%-ML, and 18% for s10%-ML. For 10%-ML this percentage rises dramatically to 46% providing further evidence that this ML model is no longer reliable for high-accuracy applications.

We also computed vibrational energies using the training set PESs (also listed in Table 4). The errors are reasonably small and structure-based sampling again performs better than random sampling. For 50%-ML, 25%-ML, and s10%-ML the predicted spectra are more reliable than the results of the respective training set PESs (constructed from 50%, 25%, and 10% of the entire CBS-35, respectively), but this does not hold for the r50%-ML model, and there is even a marked deterioration in the case of 10%-ML.

Iv Conclusions

We propose a procedure for building highly accurate PESs using KRR-based machine learning. Our approach employs structure-based sampling to ensure that machine learning is applied in the interpolation regime where it performs best. Data slicing in terms of the energy distribution is recommended for a balanced representation of the PES for very small training sets. Self-correction capabilities are introduced into the ML model by including additional ML layers.

In a pilot study, we explored the merits of our ML model using a recently published high-level ab initio PES of CHCl as an example. Several ML models were built and trained using training sets of different size and a different number of data slices, and their performance was assessed by comparisons with the original ab initio energies at the grid points in the training set and the prediction set. Excellent agreement was found for the 50%-ML model, which reproduces the PES with sub-wavenumber accuracy.

For five selected ML-based PESs, the vibrational energy levels of CHCl were computed using variational TROVE calculations, in complete analogy to the published ab initio work Owens et al. (2015). The results clearly show that structure-based sampling produces more accurate ML models than random sampling. The structure-based sampling 50%-ML model gives excellent agreement with the ab initio reference data for the vibrational energy levels (RMSE of 0.04 cm in the range up to 10,000 cm). The accuracy deteriorates slightly for the 25%-ML model (RMSE of 0.16 cm) and for the s10%-ML model (RMSE of 0.28 cm, training set sampled from the dataset sliced into three regions) and quite strongly for the 10%-ML model (RMSE of 1.75 cm, training set sampled from the unsliced dataset).

The evidence from the present pilot study suggests that the number (and computational cost) of electronic structure calculations in high-level ab initio studies of rovibrational spectroscopy may be reduced by up to 90% (depending on the user needs and the initial grid size) by using structure-based sampling and self-correcting ML models, with minimal loss of accuracy. We expect that this will also hold for small molecules other than CHCl (in the absence of obvious reasons to suspect anything else). Of course, this should be examined in future work, which should also aim at establishing standard protocols for such ML studies. Finally, given the fact that our ML model is available in analytic form, it seems worthwhile to explore whether it can be used directly in TROVE-type variational calculations (without an intermediate fit to a standard polynomial form).

V Outlook

The objective of this study was not to improve upon the existing ab initio PES of \ceCH3Cl but to demonstrate how the computational cost of building such a PES can be substantially reduced by performing fewer ab initio calculations and by interpolating efficiently with KRR-based ML. We have shown that this can be done by the following procedure.

  1. Generate a large and dense grid of deformed structures using established techniques (as outlined in Section II.3 and Ref. 43).

  2. Select points from this grid into the training set by using structure-based sampling.

  3. Calculate the energy for each point in the training set as accurately as possible using a high-level ab initio method.

  4. Train the self-correcting ML model on the training set geometries and the high-level ab initio energies.

  5. Predict the energies for the remaining grid points using self-correcting machine learning.

  6. Calculate rovibrational levels variationally using TROVE.

  7. Repeat Steps 2 to 6 by including more points from the grid into the training set using structure-based sampling until the calculated rovibrational levels converge.

We plan to apply this procedure to generate accurate PESs for other small molecules. During these studies we will also investigate additional ways to reduce even further the computational cost of generating accurate PESs by combining our procedure with the so-called -ML approachRamakrishnan et al. (2015a). In this approach, ML is trained on differences between high-level and less demanding but still reliable low-level ab initio results; the -ML model is then applied to correct low-level energies for the remaining grid points (similar to previous work on the water PESBartók et al. (2013)). We anticipate that this approach will allow us to investigate larger systems such as organic molecules with several carbon atoms.

PD thanks Georgios Gerogiokas for software performance related advice and Alexey Dral for discussions. The authors thank the reviewers for their valuable comments.


Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description