Structurebased Sampling and Selfcorrecting Machine Learning for Accurate Calculations of Potential Energy Surfaces and Vibrational Levels
Abstract
We present an efficient approach for generating highly accurate molecular potential energy surfaces (PESs) using selfcorrecting, kernel ridge regression (KRR) based machine learning (ML). We introduce structurebased sampling to automatically assign nuclear configurations from a predefined grid to the training and prediction sets, respectively. Accurate highlevel ab initio energies are required only for the points in the training set, while the energies for the remaining points are provided by the ML model with negligible computational cost. The proposed sampling procedure is shown to be superior to random sampling and also eliminates the need for training several ML models. Selfcorrecting machine learning has been implemented such that each additional layer corrects errors from the previous layer. The performance of our approach is demonstrated in a case study on a published highlevel ab initio PES of methyl chloride with 44,819 points. The ML model is trained on sets of different size and then used to predict the energies for tens of thousands of nuclear configurations within seconds. The resulting datasets are utilized in variational calculations of the vibrational energy levels of CHCl. By using both structurebased sampling and selfcorrection, the size of the training set can be kept small (e.g. 10% of the points) without any significant loss of accuracy. In ab initio rovibrational spectroscopy, it is thus possible to reduce the number of computationally costly electronic structure calculations through structurebased sampling and selfcorrecting KKRbased machine learning by up to 90%.
pacs:
I Introduction
Molecular potential energy surfaces (PESs) are often computed using highlevel ab initio theory, notably in theoretical rovibrational spectroscopy where subwavenumber accuracy is targeted for calculated transition frequencies. Such sophisticated quantum mechanical (QM) calculations can be extremely costly, and they have to be repeated for tens of thousands of nuclear geometries. Particularly challenging are larger systems with many degrees of freedom that suffer from the socalled “curse of dimensionality”. Machine learning (ML) techniques offer a means to significantly reduce the necessary computational time Behler (2016).
It has been shown that neural networks (NNs) can be used to interpolate between points along the coordinate of simple reactions Gastegger and Marquetand (2015); Kolb et al. (2016), to improve the accuracy of semiempirical QM/MM calculationsShen, Wu, and Yang (2016), and to calculate accurate energies of conformations and clusters of organic and inorganic species Bholoa, Kenny, and Smith (2007); Malshe et al. (2009); Artrith and Behler (2012); Schütt et al. (2017); Smith, Isayev, and Roitberg (2017). Kernel ridge regression (KRR) can reduce errors of lowlevel QM methods such that atomization enthalpies of organic isomers can be reproduced with an accuracy close to higherlevel QM methods Dral, von Lilienfeld, and Thiel (2015); Ramakrishnan et al. (2015a). Gaussian process models are able to sample lowenergy regions of the PES Toyoura et al. (2016), correct density functional theory energies of water Bartók et al. (2013), predict energies and forces in Si clusters Bartók, Kondor, and Csányi (2013), and construct global PESs for N Cui and Krems (2016). ML can also be employed for accurate yet computationally fast PES representation; for example, NNs have been used to compute molecular vibrational energy levels Manzhos and Carrington (2006); Manzhos et al. (2006); Manzhos, Dawes, and Carrington (2014); Majumder et al. (2015); Shao et al. (2016), reaction probabilities Shao et al. (2016), and reaction rates Blank et al. (1995). Whilst potentials based on KRR Chmiela et al. (2016), NNs Hobday, Smith, and Belbruno (1999); Hobday, Smith, and BelBruno (1999); Raff et al. (2005); Behler and Parrinello (2007); Behler et al. (2008); Pukrittayakamee et al. (2009); Eshet et al. (2010); Khaliullin et al. (2010); Behler (2011); Khaliullin et al. (2011); Morawietz, Sharma, and Behler (2012); Behler (2015) or Gaussian process models Bartók et al. (2010) have been utilized in molecular dynamics (MD) simulations.
ML performs best in the interpolation regime and can fail spectacularly when used for extrapolation. For relatively large training sets and high error tolerance, as in many chemical applications, training on a random sample of data points can give useful results Rupp et al. (2012); Hansen et al. (2013); Dral, von Lilienfeld, and Thiel (2015); Ramakrishnan et al. (2015a); Cui and Krems (2016). Random sampling is straightforward and has minimal computational cost involved but it does not ensure per se that ML avoids extrapolation. Thus, it is preferable to ensure that ML is applied only for interpolation by using better sampling techniques. Even a simple manual choice of critical points on the PES can improve the accuracy of ML significantly Gastegger and Marquetand (2015).
Automated sampling is necessary for more complex PESs and several approaches have been reported in the literature. When training ML potentials, structures have been screened on their input vector elements to see if they are within the minimum and maximum values of the training set, which can then be expanded to include ‘outsidetherange’ structures Behler (2011); Botu and Ramprasad (2015). NN PESs of simple reactions have been refitted with points when the initial NN potential has encountered problems Kolb et al. (2016). Similarly, the error of the ML potential can be monitored during dynamics and ML can be retrained onthefly by including additional points when errors become too large Li, Kermode, and Vita (2015). In the case of very simple systems like \ceH3, the training set can be generated on a coarse grid Kolb et al. (2016).
Because of their inherent flexibility different NNs can give very different energies for the same structure, and hence training different NNs on the same data can also highlight problem structures that should be incorporated into the training set Morawietz, Sharma, and Behler (2012); Artrith and Behler (2012). Another approach exploits the fact that the most important regions of the PES are likely to be covered by standard MD simulations. Snapshots from the MD trajectories can thus be used in the training set to produce a more robust ML PES Raff et al. (2005); Pukrittayakamee et al. (2009); Geiger and Dellago (2013); Suzuki, Tamura, and Miyazaki (2016); Schütt et al. (2017); Chmiela et al. (2016). For better sampling of vibrational phase space, previous studies have employed stretched bond distances in MD simulations Pukrittayakamee et al. (2009) or a random displacement to sample along the vibrational normal modes Smith, Isayev, and Roitberg (2017). In other work, structures have been chosen from MD trajectories based on various clustering approaches, which reduce the redundancies in the NN training set Häse et al. (2016). For calculations of rotationvibration spectra where lowenergy regions of the PES are more important, probability distribution functions that favor structures near equilibrium have been used for pseudorandom sampling Manzhos and Carrington (2006); Manzhos et al. (2006); Manzhos, Dawes, and Carrington (2014).
Here we report a new and efficient method for generating highly accurate molecular PESs which uses KRRbased machine learning. In our approach, we initially define a set of grid points (nuclear configurations) that covers the relevant lowenergy part of the PES, which can be done using relatively inexpensive energy estimates (see Section II.3). Thereafter, our sampling relies on an interpolation between structures rather than on energies, i.e. the sampling procedure itself does not require energy calculations or the training of several ML models. We investigate the effect of structurebased sampling on the accuracy of the ML model in comparison with random sampling. We also explore what error reduction can be achieved through ML selfcorrection.
To illustrate our KRRbased ML model we utilize a highlevel ab initio PES for methyl chloride (\ceCH3^35Cl) Owens et al. (2015). This PES, denoted CBS35, is based on extensive explicitly correlated coupled cluster calculations with extrapolation to the complete basis set (CBS) limit and incorporates a range of higherlevel additive energy corrections. These include: corevalence electron correlation, higherorder coupled cluster terms, scalar relativistic effects, and diagonal BornOppenheimer corrections. A considerable amount of computational time was spent generating this PES. For example, a single point of the CBS35 PES at the equilibrium geometry required 9 separate calculations and a combined total of 26.7 hours on a single core of an Intel Xeon E52690 v2 GHz processor. Building a reliable PES requires tens of thousands of nuclear geometries for polyatomic species and the computational time increases for distorted configurations due to slower energy convergence. Efficient ML techniques that can reduce the necessary computational effort are therefore highly desirable.
The paper is structured as follows: In Section II we present the KRRbased selfcorrecting ML model, the molecular descriptor, and the sampling procedure. Details on the variational calculation of vibrational energy levels are also given. In Section III the ML model is evaluated for different sizes of training set, and vibrational energies are computed using several ML model PESs. Concluding remarks are offered in Section IV and an outlook is provided in Section V.
Ii Methods
ii.1 Machine Learning
The chosen ML technique is based on KRR Hastie, Tibshirani, and Friedman (2009) and is similar to the approach used for ML across chemical compound space Rupp et al. (2012); von Lilienfeld (2013); Hansen et al. (2013); Ramakrishnan et al. (2015a, b) and for calculating relative stabilities of organic isomers Ramakrishnan et al. (2015a); Dral, von Lilienfeld, and Thiel (2015). Below we give details of the KRR approach relevant to this work. All ML calculations were performed using our own program package MLatom Dral (2017).
Some property value of a nuclear configuration represented by the molecular descriptor (discussed in Section II.2) is calculated using the expression Hastie, Tibshirani, and Friedman (2009),
(1) 
where the sum runs over configurations represented by the molecular descriptors of the training set. The regression coefficient refers to the configuration , and is the kernel. Here we use the Gaussian kernel which is equivalent to the form often employed in the literature Rupp et al. (2012); Hansen et al. (2013),
(2) 
where is the kernel width, and is the Euclidean distance (the norm) between descriptors and (vectors of size ) defined as
(3) 
This definition of distances between molecular structures Rupp et al. (2012); von Lilienfeld (2013); Hansen et al. (2013) and a similar definition based on the norm (instead of the norm used here) Ramakrishnan et al. (2014); Hansen et al. (2013); Dral, von Lilienfeld, and Thiel (2015); Ramakrishnan et al. (2015a); Bereau, Andrienko, and von Lilienfeld (2015); Rupp, Ramakrishnan, and von Lilienfeld (2015); Ramakrishnan et al. (2015b); Faber et al. (2016) were used previously to compare molecular geometries Ramakrishnan et al. (2014) and learn various molecular properties Rupp et al. (2012); von Lilienfeld (2013); Hansen et al. (2013); Dral, von Lilienfeld, and Thiel (2015); Ramakrishnan et al. (2015a); Bereau, Andrienko, and von Lilienfeld (2015); Rupp, Ramakrishnan, and von Lilienfeld (2015); Ramakrishnan et al. (2015b); Faber et al. (2016). We have also tested the Laplacian kernel employing the norm, but the PESs obtained with this kernel showed much larger errors than those obtained with the Gaussian kernel.
Training the ML model involves finding the regression coefficients , i.e. solving the minimization problem,
(4) 
which has a known analytical solution Hastie, Tibshirani, and Friedman (2009)
(5) 
where is the identity matrix, is the kernel matrix with the elements calculated using Eq. 2 for all pairs of training set points, is a vector with reference property values, and is the socalled regularization parameter that ensures the transferability of the model to new nuclear configurations Rupp et al. (2012); von Lilienfeld (2013).
Two additional improvements were made to the approach outlined above which were not considered in earlier work: Rupp et al. (2012); von Lilienfeld (2013); Hansen et al. (2013); Ramakrishnan et al. (2015a); Dral, von Lilienfeld, and Thiel (2015) (i) we sample data to ensure that ML interpolates between the training points and does not extrapolate beyond them (see Section II.3), and (ii) we use a nested, selfcorrecting ML approach. For the latter, ML is first trained on the reference ab initio energies ( in Eq. (5)) and then used to make a firstlayer estimate of the deformation energy ( in Eq. (1)). The next layer corrects errors of the energies estimated in the previous layer, and so on. For example, the second layer ML model is trained on and is then used to calculate corrections for the prediction set which are summed up with layer 1 predicted energies to obtain layer 2 energies, i.e. . The dependence of the performance of the ML model on the number of layers is discussed in Section III.1.
In order to determine the optimal values of the hyperparameters and for each layer, we sample 80% of the training set points into a subtraining set using the same sampling procedure that was employed for the training set (see Section II.3). The points with deformation energies less than 10,000 cm are taken from the remaining 20% into the validation set. Using the ML model trained on the subtraining set, we search for values of the hyperparameters which give the lowest rootmeansquare error (RMSE) for the validation set. This is performed using a simple logarithmic grid search Hansen et al. (2013); Rupp (2015). By optimizing the hyperparameters such as to obtain a better description below 10,000 cm we ensure an adequate treatment of the spectroscopically most relevant part of the PES.^{1}^{1}1On only one occasion (one of the ML models trained on 50% randomly drawn grid points) the RMSE of the training set in the firstlayer became unreasonably high due to overfitting caused by very small value of the parameter . In this case we set to for the first layer and retrained the ML model.
The above selfcorrecting scheme is similar in spirit to a twostage NN model that has been introduced to remedy peculiarities of NNs Manzhos et al. (2006); Manzhos, Dawes, and Carrington (2014), such as the presence of illdefined regions (“holes”) in NN PESs as a result of overfitting. In this previous work, the NN potential was first fit with as few nodes as possible to eliminate holes, which resulted in a large RMSE; to obtain a PES of reasonable quality, the NN potential was then refit by incorporating more nodes to eliminate residual errors.
In our study, overfitting is primarily prevented by using the regularization parameter and the problem of holes on the PES is addressed by our sampling procedure (discussed in Section II.3). The KRR approach makes fitting straightforward as there is no need to manually adjust any of the parameters. As a result, the RMSE of our onelayer ML model is already quite low, while multilayer ML models possess still slightly smaller RMSEs (see Section III.1). This is in contrast to twostage NN models where the RMSE of the first fit is one order of magnitude larger than that of the second fit Manzhos et al. (2006).
ii.2 Molecular descriptor
The success of ML is largely determined by the choice of an appropriate molecular descriptor. Many of the molecular descriptors proposed in literature are functions of the internuclear distances, for example the Coulomb matrixRupp et al. (2012); Hansen et al. (2013), the Bag of BondsHansen et al. (2015), the atomcentered symmetry functionsBehler and Parrinello (2007); Behler (2016), the bispectrum of the neighbor densityBartók et al. (2010), the smooth overlap of atomic positionsBartók, Kondor, and Csányi (2013), and othersMajumder et al. (2015). We designed and tested many descriptors for \ceCH_3Cl but overall the most accurate was a vector with ten elements corresponding to the ten interatomic pairs. Each element is defined as the corresponding internuclear distance in the nearequilibrium reference geometry of \ceCH_3Cl () divided by the current distance (), e.g. for the \ceCCl atomic pair. This form of descriptor ensures that the ML model is rotationally and translationally invariant.
Since the molecular descriptor also has to be atom index invariant, we sort the hydrogen nuclei by the sum of their internuclear repulsions with the four other nuclei for structurebased sampling (see Section II.3). Simple sorting of hydrogen nuclei in the molecular descriptor may however lead to instabilities in regions where the order of hydrogen nuclei changes. To avoid this problem we employ a normalized permutational invariant kernel in our ML calculations (Section II.1) as suggested in the literature:Bartók et al. (2013); Bartók and Csányi (2015)
(6) 
where permutes the order of hydrogen nuclei. There are permutations of three hydrogen nuclei. We found that results obtained using the normalized permutational invariant kernel are superior to those obtained using a sorted molecular descriptor.
ii.3 Grid points and sampling
The target PES needs to be evaluated on a large number of predefined grid points (nuclear configurations). These grid points can be determined through rather inexpensive initial energy computations. In our previous work on \ceCH_3Cl, this was done by creating a sparse grid along each onedimensional cut of the PES, calculating singlepoint energies using a reliable but relatively cheap ab initio method (e.g. CCSD(T)F12b/VTZF12), fitting a polynomial function to get rough energy estimates, and randomly selecting all remaining points using an energyweighted Monte Carlo type sampling algorithm to cover the desired lowenergy PES regionOwens et al. (2015). This procedure, which was inexpensive computationally, produced a grid of 44,819 geometries with energies up to 50,000 cm ( is the Planck constant and is the speed of light). In our present proofofprinciple study we use this grid, which was employed for the final CBS35 PES of \ceCH_3ClOwens et al. (2015), and partition it into training and test sets using the structurebased sampling procedure described below; the energies for the training set are taken from the available ab initio resultsOwens et al. (2015) while those for the remaining grid points are predicted essentially for free using ML.
In our sampling procedure we select nuclear configurations for the training set based on the relative distance of the molecular descriptors. This is done such that all remaining points used for prediction are within the boundaries of the training set or very close to it. The first point of the training set is taken closest to equilibrium. The second point of the training set is the one among all grid points that has the largest distance from the first point. For each remaining point on the grid we calculate its distance to all points in the training set using Eq. (3) and determine the shortest distance to any point in the training set. We then include the grid point that has the ‘longest’ shortest distance into the training set, as illustrated in Figure 1. This procedure is repeated until the required number of points are selected for the training set. The other remaining points are used for prediction, and by construction they lie within the training set or very close to it; at least one of their distances to the points in the training set should be shorter than the shortest distance between points in the training set.
This sampling procedure is closely related to the farthestpoint traversal iterative procedure used to select points such that they are as distant as possible from the previously selected points. In this respect structurebased sampling can be also viewed as a way to obtain a training set as diverse as possible.
The sampling procedure outlined above selects a training set of predetermined size from a larger set of predefined structures. The same sampling principles can be applied to test whether additional structures (beyond the initially chosen set) should be included into the training set. If this is the case, the ML model needs to be retrained (similar to an approach described in Ref. 31).
ii.4 Variational calculations
In this work we use the nuclear motion program TROVE Yurchenko, Thiel, and Jensen (2007); Yachmenev and Yurchenko (2015) for computing vibrational energy levels. Since rovibrational calculations have previously been reported for CHCl Owens et al. (2015, 2016), we summarize only the key aspects relevant for the present study.
In variational calculations the PES must be represented analytically. To do this we introduce the coordinates,
(7) 
(8) 
where for the C–Cl bond length , and for the three C–H bond lengths and . For the angular terms,
(9) 
(10) 
(11) 
where , , are the interbond angles, and , , are the dihedral angles between adjacent planes containing HCCl and HCCl. Here , and are the reference equilibrium structural parameters of CHCl.
The potential function (maximum expansion order ) is given by the expression,
(12) 
and contains the terms,
(13) 
which are symmetrized combinations of different permutations of the vibrational coordinates , and transform according to the representation of the molecular symmetry group Bunker and Jensen (1998).
The expansion coefficients are determined through a leastsquares fitting to the ab initio and/or ML data. Weight factors of the form Partridge and Schwenke (1997)
(14) 
are used in the fittings. Here and the normalization constant (all values in cm). Larger weights () are assigned to lower deformation energies (), which correspond to more spectroscopically important regions of the PES. As shown in Figure 2, this form ensures that structures with energies up to 10,000 cm above equilibrium have weights near unity, whilst other configurations are significantly downweighted with increasing energy.
To ensure a reliable comparison the CBS35 PES expansion parameter set was used to fit the ab initio and MLgenerated datasets. This contained 414 parameters and included linear expansion terms. For this reason we fixed the values of , , and , to be the same as those used for the CBS35 PES. Each fit employed Watson’s robust fitting scheme Watson (2003), which reduces the weights of outliers and lessens their influence in determining the final set of parameters.
Iii Results and discussion
In this section we investigate the accuracy of the ML model s trained on 50%, 25%, and 10% of the available 44,819 deformed structures and their ab initio energies used to construct the CBS35 PES Owens et al. (2015).
To evaluate the accuracy of predicted energies () we employ standard error measures as well as the weighted rootmeansquare error (wRMSE):
(15) 
Weights are calculated using Eq. 14.
iii.1 Optimal machine learning model
We first examine how the accuracy of ML calculated deformation energies for the prediction set depends on the number of ML layers and the sampling procedure. The results in Table 1 indicate that the errors are significantly lowered by adding the second layer for all models except for the one trained on 10% points selected with structurebased sampling. Including a third layer significantly reduces the errors in a few cases, while a fourth layer does not yield any noticeable improvements. We therefore expect that adding further layers will make no difference.
Number of layers  

1  2  3  4  
Structurebased sampling from unsliced data  
22,409 (50%)  22  3,985  6.57  0.37  0.37  0.37 
11,204 (25%)  1  1,033  3.21  1.87  1.60  1.60 
4,481 (10%)  1  195  5.00  4.99  4.83  4.83 
Structurebased sampling from data sliced into three regions  
22,408 (50%)  131  7,215  2.31  0.62  0.62  0.62 
11,203 (25%)  19  3,348  2.90  2.59  2.58  2.58 
4,480 (10%)  1  1,191  4.42  3.63  3.63  3.63 
Random sampling  
22,409 (50%)  
11,204 (25%)  
4,481 (10%)  
for the entire grid of 44,819 points.  
for the entire grid of 44,819 points.  
Standard deviations were calculated for 16, 20, and 30 various randomly drawn  
training sets for ML trained on 50%, 25%, and 10% grid points, respectively. 
For practical purposes the threelayer model appears to be sufficient. However, in the following we have applied the fourlayer model because the computational cost of including additional layers is rather low and we have not observed any significant accumulation of numerical noise. It takes around 4 hours to optimize the ML hyperparameters using a fairly inefficient optimization algorithm, around a minute to train and only a couple of seconds to make predictions with the 50%ML model on 20 cores of an Intel(R) Xeon(R) CPU E52687W v3 @ 3.10GHz processor.
As shown in Table 1, random sampling is clearly inferior to structurebased sampling; all fourlayer ML models trained on randomly drawn points have wRMSEs significantly higher than those of the fourlayer, structurebased sampling ML models. Interestingly, selfcorrection works well even for random sampling: it reduces the wRMSEs by 32–44% and the standard deviations by 70–90%. Despite this the remaining error of cm for the randomly sampled, fourlayer ML models trained on 50% of grid points is still much higher than the wRMSE of 0.37 cm for the fourlayer, structurebased sampling ML model trained on the same number of grid points. Standard deviations for random sampling are also relatively high and increase from 0.87 to 1.67 cm when going from the 50% to 10% training sets. There is no such problem with structurebased sampling which provides a unique training set — an important practical advantage for highaccuracy applications. As for the computational cost of sampling, it takes ca. 9 hours to sample 50% from 44,819 data points on 12 Intel(R) Xeon(R) CPU X5680 @ 3.33GHz processors.
As the training set becomes relatively small, structurebased sampling may lead to an underrepresentation of the training points in the lowenergy region, e.g. 10% training points drawn from the entire grid contain only 195 structures with deformation energies below 10,000 cm (compare the number of configurations with deformation energies below 1,000 and 10,000 cm, columns and in Table 1). Sorting the geometries by their distance to the equilibrium structure (which correlates strongly with the deformation energies, Figure 3), followed by slicing the data into several regions and sampling points from each of these regions leads to a more energybalanced training set. Looking at Figure 3, one can argue that splitting the set into three regions with equal number of structures should be close to optimal. The first region includes the most important structures with deformation energies below 10,000 cm and a significant portion of structures with energies between 10,000 and 20,000 cm. The second region mainly includes configurations with energies between 10,000 and 20,000 cm but also a considerable number of geometries above and below this region. The third slice includes all remaining highenergy structures. Structurebased sampling from each of the above regions leads to close to those expected from random sampling, e.g. 10% training points drawn from the sliced grid contain 1,191 structures with deformation energies below 10,000 cm (Table 1). As a result, wRMSE of the ML model trained on the latter training set (3.63 cm) is lower than wRMSE of the ML model trained on 10% points drawn from unsliced data (4.83 cm). However, such slicing does not generate training sets that are as diverse as possible, and therefore the errors of the ML models become higher for the sliced grids as the training set increases in size (training sets with 25% and 50% grid points, Table 1). Thus we recommend slicing only for very small training sets where lowenergy structures are underrepresented.
In the following we will discuss fourlayer ML models trained on 50%, 25%, and 10% points drawn using structurebased sampling from the available unsliced 44,819 grid points. We refer to these ML models as 50%ML, 25%ML, and 10%ML, respectively. We also compare with one of the randomly sampled, fourlayer ML models referred to as r50%ML which has been chosen at random from the ML models trained on 50% of grid points. In addition, we compare with the fourlayer ML model trained on 10% points drawn using structurebased sampling from each region of the dataset sliced into three regions. This model is referred to as s10%ML in the following.
A more detailed analysis of the ML model errors, listed in Table 2, reveals that r50%ML has the largest outliers with a wRMSE of 4.14 cm, which is more than twice as large as that of 25%ML (wRMSE of 1.60 cm) for their respective prediction sets. The nonweighted RMSE of r50%ML ( 167.19 cm) is more than four times higher than the nonweighted RMSE of 10%ML ( 39.63 cm). Moreover, the RMSE of r50%ML for energies below 1,000 cm is practically the same as the respective RMSE of 50%ML and for energies below 10,000 cm is higher than the respective RMSE of 50%ML, despite the fact that many more points from these regions are included into the training set (compare and for these two models in Table 1). These observations provide strong evidence for the superiority of structurebased sampling. s10%ML has RMSEs for energies below 1,000 and 10,000 cm close to those of 25%ML, while the respective RMSEs of 10%ML are much higher. On the other hand, 10%ML has a much lower nonweighted RMSE, MAE and MSE than s10%ML. Thus, slicing clearly improves the description of the lowenergy region at the cost of other regions. Apparently, for sparse training data, the benefits of a better description of the lowenergy region achieved by slicing outweigh the disadvantage of an overall worse description of the PES, which is exemplified by the lower wRMSE of s10%ML compared to the wRMSE of 10%ML.
50%ML  r50%ML  25%ML  10%ML  s10%ML  

Training set  
22,409  22,409  11,204  4,481  4,480  
LPO  0.00  4.18  0.00  0.00  0.00 
LNO  0.00  0.00  0.00  0.00  
MSE  0.00  0.00  0.00  0.00  0.00 
MAE  0.00  0.03  0.00  0.00  0.00 
RMSE (all)  0.00  0.11  0.00  0.00  0.00 
RMSE (10,000)  0.00  0.03  0.00  0.00  0.00 
RMSE (1,000)  0.00  0.07  0.00  0.00  0.00 
wRMSE  0.00  0.03  0.00  0.00  0.00 
Prediction set  
22,410  22,410  33,615  40,338  40,339  
LPO  319.75  2,015.44  1,035.63  1,617.61  1,481.69 
LNO  
MSE  0.02  0.18  0.19  1.85  
MAE  0.82  25.12  3.47  11.27  19.54 
RMSE (all)  6.23  167.19  16.12  39.63  61.96 
RMSE (10,000)  0.20  1.38  1.12  4.26  1.19 
RMSE (1,000)  0.08  0.07  0.16  1.40  0.25 
wRMSE  0.37  4.14  1.60  4.83  3.63 
Entire grid  
44,819  
LPO  319.75  2,015.44  1,035.63  1,617.61  1,481.69 
LNO  
MSE  0.01  0.14  0.17  1.66  
MAE  0.41  12.57  2.60  10.14  17.58 
RMSE (all)  4.41  118.22  13.96  37.60  58.79 
RMSE (10,000)  0.17  0.98  1.09  4.24  1.14 
RMSE (1,000)  0.08  0.07  0.16  1.40  0.25 
wRMSE  0.26  2.93  1.39  4.59  3.44 
From Figure 4 we see that deformation energies predicted by 50%ML, r50%ML, 25%ML, and s10%ML models correlate nicely with the reference ab initio energies; the value is always larger than 0.999. Deformation energies predicted by 10%ML model (not shown in Figure 4) correlate better () with the reference energies than the energies predicted by the s10%ML model (consistent with the above conclusions). Clearly, 50%ML is superior to all other models and has the best correlation with far fewer outliers and smaller residual errors. This is particularly relevant for highaccuracy work as we will see in Section III.2.
As for the effect of training set size, illustrated in Figure 5, wRMSEs in the prediction set drop from 144 cm for the 1%ML model to 0.05–0.06 cm for the 85–99%ML models. Interestingly, the 1%ML model trained on only 448 grid points may still be regarded as a chemically meaningful representation of the PES since its nonweighted RMSE is only 0.77 kcal/mol ( 271 cm). The error drops very quickly to 12.29 cm for the 5%ML model and is below 1.00 cm for 35% and above, finally becoming smaller than 0.5 cm for training sets with 50% or more of all configurations.
Regarding the four structurebased sampling ML models that were tested extensively, wRMSEs grow significantly from 50%ML ( 0.37 cm) to 25%ML ( 1.60 cm) to s10%ML (3.63 cm) to 10%ML ( 4.83 cm) . We investigate further the effect of the training set size for highresolution spectroscopy applications in Section III.2, where we report vibrational energies using PESs based on the 50%ML, r50%ML, 25%ML, 10%ML, and s10%ML models.
The analytic representation employed for the CBS35 (discussed in Sections II.4 and III.2) was fitted with a wRMSE of 3.00 cm for the entire grid (see Section III.2). For 50%ML the training set, prediction set, and entire grid of 44,819 points were reproduced with wRMSEs of 0.00, 0.37 and 0.26 cm, respectively (Table 2). Our ML approach, which has a defined analytic form, could therefore provide a more accurate description of the PES based on fewer training points and could possibly be employed directly in variational calculations. However, this is beyond the scope of this work and application of our ML technique in variational calculations will be the focus of future research.
As indicated by one reviewer, further finetuning of the ML models is possible by using the anisotropic Gaussian kernel with multiple kernel widths instead of a single parameter (as for instance in Ref. 13). In test calculations we have found that such ML models may reduce errors somewhat (e.g. by 10% for 10%ML), but there is a substantial increase in the complexity of the ML model and the computational cost for large training sets (due to parameter optimization), and hence we decided to use one single kernel width parameter in this study.
iii.2 Vibrational energy levels
For the 50%ML, r50%ML, 25%ML, 10%ML, and s10%ML PESs, the expansion coefficients of the potential function given in Eq. (12) were determined in a leastsquares fitting to the 44,819 grid points. The results of the fittings are listed in Table 3. In addition, PESs were determined for the five, associated training sets and the results are also included in Table 2). We see that the fits of the CBS35 PES, the ML model PESs, and the training set PESs are of a similar accuracy, with the exception of the 10%ML fits which exhibit significantly larger errors. For the other fits, the RMSEs below 10,000 cm range between 1.17 and 2.08 cm, with wRMSE values between 3.01 and 3.59 cm for the MLbased PESs and up to 3.60 cm for the training set PESs. The mean errors are particularly low for the 50%ML PES, its training set PES, and 25%ML PES (0.16, 0.24, and 0.02 cm, respectively).
CBS35  50%ML  r50%ML  25%ML  10%ML  s10%ML  

ML model PESs  
No. of parameters  414  414  414  412  402  409 
fitting wRMSE  0.82  0.83  0.89  0.99  1.39  1.13 
LPO  2,717.33  2,718.84  2,546.10  2,700.54  2,506.24  2,868.59 
LNO  
MSE  0.20  0.16  0.56  0.02  0.81  7.38 
MAE  20.82  20.83  21.01  21.83  31.73  24.63 
RMSE (all)  102.24  102.28  102.13  102.69  113.78  101.39 
RMSE (10,000)  1.18  1.18  1.22  1.22  3.66  1.37 
RMSE (1,000)  0.33  0.33  0.33  0.34  1.08  0.40 
wRMSE  3.00  3.01  3.09  3.19  5.40  3.59 
Training set PESs  
No. of parameters  414  414  414  410  411  
fitting wRMSE  0.98  0.82  0.94  0.50  0.99  
LPO  2,548.77  2,813.94  2,368.88  1,963.23  2,431.71  
LNO  
MSE  0.24  0.75  1.50  1.35  
MAE  21.03  21.00  21.95  24.11  23.94  
RMSE (all)  103.82  102.55  105.37  105.88  109.00  
RMSE (10,000)  1.17  1.18  1.30  2.08  1.18  
RMSE (1,000)  0.35  0.33  0.33  0.62  0.32  
wRMSE  3.01  3.02  3.15  3.60  3.58  
Fitting wRMSEs are relative to the PES being fitted and not to the CBS35 data.  
Note also that the weights differ slightly from Eq. 14 because Watson’s  
robust fitting scheme Watson (2003) was employed (see Section II.4). 
In TROVE calculations the Hamiltonian was represented as a power series expansion around the equilibrium geometry in terms of nine vibrational coordinates and was constructed numerically using an automatic differentiation method Yachmenev and Yurchenko (2015). The coordinates used were identical to those given in Eqs. (7) to (11), except for the kinetic energy operator where linear expansion terms, e.g. , replace the Morse oscillator functions for the stretching modes. The kinetic and potential energy operators were truncated at 6^{th} and 8^{th} order, respectively, and atomic mass values were employed throughout. Calculations were carried out using a mediumsized vibrational basis set with a polyad truncation number of (see Refs. 43; 58 for further details). The basis set was constructed using a multistep contraction scheme and contained 16,829 vibrational basis functions.
In Figure 6 we plot residual errors, , of computed vibrational energy levels using the 50%ML, r50%ML, 25%ML, and s10%ML model PESs with respect to the CBS35 PES values. The RMSE and meanabsolutedeviation (MAD) for energies up to 5,000 and 10,000 cm are also listed for each model in Table 4. Comparing 50%ML with r50%ML, it is clear that structurebased sampling produces results that are far more reliable than random sampling. The residual errors are consistently smaller and more uniform for the energy range considered. The 25%ML and s10%ML models still perform reasonably well but errors steadily increase with energy. The 10%ML model (not shown in Figure 6) has deteriorated and no longer gives accurate predictions (errors are much higher than 1 cm, Table 4).
50%ML  r50%ML  25%ML  10%ML  s10%ML  

ML model PESs  
RMSE (5,000 cm)  0.02  0.10  0.09  1.61  0.14 
MAD (5,000 cm)  0.01  0.08  0.07  1.29  0.10 
RMSE (10,000 cm)  0.04  0.18  0.16  1.75  0.28 
MAD (10,000 cm)  0.03  0.15  0.12  1.44  0.21 
Training set PESs  
RMSE ( 5,000 cm)  0.06  0.08  0.12  0.30  0.12 
MAD ( 5,000 cm)  0.05  0.06  0.10  0.25  0.10 
RMSE (10,000 cm)  0.12  0.14  0.19  0.74  0.32 
MAD (10,000 cm)  0.11  0.09  0.14  0.61  0.24 
TROVE assigns quantum numbers to the eigenvalues by analyzing the contribution from the basis functions. This is how we match energy levels computed with different ML model PESs. However, given the approximate nature of the labeling scheme these assignments can occasionally differ between surfaces. This tends to happen mostly at higher energies (above 10,000 cm) but does not necessarily mean that the energy levels are mismatched — they have simply been labeled differently. This occurs for 2% of the computed values in the case of 50%ML, 9% for r50%ML, 12% for 25%ML, and 18% for s10%ML. For 10%ML this percentage rises dramatically to 46% providing further evidence that this ML model is no longer reliable for highaccuracy applications.
We also computed vibrational energies using the training set PESs (also listed in Table 4). The errors are reasonably small and structurebased sampling again performs better than random sampling. For 50%ML, 25%ML, and s10%ML the predicted spectra are more reliable than the results of the respective training set PESs (constructed from 50%, 25%, and 10% of the entire CBS35, respectively), but this does not hold for the r50%ML model, and there is even a marked deterioration in the case of 10%ML.
Iv Conclusions
We propose a procedure for building highly accurate PESs using KRRbased machine learning. Our approach employs structurebased sampling to ensure that machine learning is applied in the interpolation regime where it performs best. Data slicing in terms of the energy distribution is recommended for a balanced representation of the PES for very small training sets. Selfcorrection capabilities are introduced into the ML model by including additional ML layers.
In a pilot study, we explored the merits of our ML model using a recently published highlevel ab initio PES of CHCl as an example. Several ML models were built and trained using training sets of different size and a different number of data slices, and their performance was assessed by comparisons with the original ab initio energies at the grid points in the training set and the prediction set. Excellent agreement was found for the 50%ML model, which reproduces the PES with subwavenumber accuracy.
For five selected MLbased PESs, the vibrational energy levels of CHCl were computed using variational TROVE calculations, in complete analogy to the published ab initio work Owens et al. (2015). The results clearly show that structurebased sampling produces more accurate ML models than random sampling. The structurebased sampling 50%ML model gives excellent agreement with the ab initio reference data for the vibrational energy levels (RMSE of 0.04 cm in the range up to 10,000 cm). The accuracy deteriorates slightly for the 25%ML model (RMSE of 0.16 cm) and for the s10%ML model (RMSE of 0.28 cm, training set sampled from the dataset sliced into three regions) and quite strongly for the 10%ML model (RMSE of 1.75 cm, training set sampled from the unsliced dataset).
The evidence from the present pilot study suggests that the number (and computational cost) of electronic structure calculations in highlevel ab initio studies of rovibrational spectroscopy may be reduced by up to 90% (depending on the user needs and the initial grid size) by using structurebased sampling and selfcorrecting ML models, with minimal loss of accuracy. We expect that this will also hold for small molecules other than CHCl (in the absence of obvious reasons to suspect anything else). Of course, this should be examined in future work, which should also aim at establishing standard protocols for such ML studies. Finally, given the fact that our ML model is available in analytic form, it seems worthwhile to explore whether it can be used directly in TROVEtype variational calculations (without an intermediate fit to a standard polynomial form).
V Outlook
The objective of this study was not to improve upon the existing ab initio PES of \ceCH3Cl but to demonstrate how the computational cost of building such a PES can be substantially reduced by performing fewer ab initio calculations and by interpolating efficiently with KRRbased ML. We have shown that this can be done by the following procedure.

Select points from this grid into the training set by using structurebased sampling.

Calculate the energy for each point in the training set as accurately as possible using a highlevel ab initio method.

Train the selfcorrecting ML model on the training set geometries and the highlevel ab initio energies.

Predict the energies for the remaining grid points using selfcorrecting machine learning.

Calculate rovibrational levels variationally using TROVE.

Repeat Steps 2 to 6 by including more points from the grid into the training set using structurebased sampling until the calculated rovibrational levels converge.
We plan to apply this procedure to generate accurate PESs for other small molecules. During these studies we will also investigate additional ways to reduce even further the computational cost of generating accurate PESs by combining our procedure with the socalled ML approachRamakrishnan et al. (2015a). In this approach, ML is trained on differences between highlevel and less demanding but still reliable lowlevel ab initio results; the ML model is then applied to correct lowlevel energies for the remaining grid points (similar to previous work on the water PESBartók et al. (2013)). We anticipate that this approach will allow us to investigate larger systems such as organic molecules with several carbon atoms.
Acknowledgements.
PD thanks Georgios Gerogiokas for software performance related advice and Alexey Dral for discussions. The authors thank the reviewers for their valuable comments.References
 Behler (2016) J. Behler, J. Chem. Phys. 145, 170901 (2016), see also references therein.
 Gastegger and Marquetand (2015) M. Gastegger and P. Marquetand, J. Chem. Theory Comput. 11, 2187 (2015).
 Kolb et al. (2016) B. Kolb, B. Zhao, J. Li, B. Jiang, and H. Guo, J. Chem. Phys. 144, 224103 (2016).
 Shen, Wu, and Yang (2016) L. Shen, J. Wu, and W. Yang, J. Chem. Theory Comput. 12, 4934 (2016).
 Bholoa, Kenny, and Smith (2007) A. Bholoa, S. D. Kenny, and R. Smith, Nucl. Instrum. Meth. B 255, 1 (2007).
 Malshe et al. (2009) M. Malshe, R. Narulkar, L. M. Raff, M. Hagan, S. Bukkapatnam, P. M. Agrawal, and R. Komanduri, J. Chem. Phys. 130, 184102 (2009).
 Artrith and Behler (2012) N. Artrith and J. Behler, Phys. Rev. B 85, 045439 (2012).
 Schütt et al. (2017) K. T. Schütt, F. Arbabzadah, S. Chmiela, K. R. Müller, and A. Tkatchenko, Nat. Comm. 8, 13890 (2017).
 Smith, Isayev, and Roitberg (2017) J. S. Smith, O. Isayev, and A. E. Roitberg, Chem. Sci. 8, 3192 (2017).
 Dral, von Lilienfeld, and Thiel (2015) P. O. Dral, O. A. von Lilienfeld, and W. Thiel, J. Chem. Theory Comput. 11, 2120 (2015).
 Ramakrishnan et al. (2015a) R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. von Lilienfeld, J. Chem. Theory Comput. 11, 2087 (2015a).
 Toyoura et al. (2016) K. Toyoura, D. Hirano, A. Seko, M. Shiga, A. Kuwabara, M. Karasuyama, K. Shitara, and I. Takeuchi, Phys. Rev. B 93, 054112 (2016).
 Bartók et al. (2013) A. P. Bartók, M. J. Gillan, F. R. Manby, and G. Csányi, Phys. Rev. B 88, 054104 (2013).
 Bartók, Kondor, and Csányi (2013) A. P. Bartók, R. Kondor, and G. Csányi, Phys. Rev. B 87, 184115 (2013).
 Cui and Krems (2016) J. Cui and R. V. Krems, J. Phys. B: At., Mol. Opt. Phys. 49, 224001 (2016).
 Manzhos and Carrington (2006) S. Manzhos and T. Carrington, J. Chem. Phys. 125, 084109 (2006).
 Manzhos et al. (2006) S. Manzhos, X. G. Wang, R. Dawes, and T. Carrington, J. Phys. Chem. A 110, 5295 (2006).
 Manzhos, Dawes, and Carrington (2014) S. Manzhos, R. Dawes, and T. Carrington, Int. J. Quantum Chem. 115, 1012 (2014).
 Majumder et al. (2015) M. Majumder, S. E. Hegger, R. Dawes, S. Manzhos, X.G. Wang, C. Tucker Jr., J. Li, and H. Guo, Mol. Phys. 113, 1823 (2015), see also references therein.
 Shao et al. (2016) K. Shao, J. Chen, Z. Zhao, and D. H. Zhang, J. Chem. Phys. 145, 071101 (2016).
 Blank et al. (1995) T. B. Blank, S. D. Brown, A. W. Calhoun, and D. J. Doren, J. Chem. Phys. 103, 4129 (1995).
 Chmiela et al. (2016) S. Chmiela, A. Tkatchenko, H. E. Sauceda, I. Poltavsky, K. Schütt, and K.R. Müller, ArXiv eprints , 1611.04678 (2016), arXiv:1611.04678 [physics.chemph] .
 Hobday, Smith, and Belbruno (1999) S. Hobday, R. Smith, and J. Belbruno, Model. Simul. Mater. Sci. 7, 397 (1999).
 Hobday, Smith, and BelBruno (1999) S. Hobday, R. Smith, and J. BelBruno, Nucl. Instrum. Meth. B 153, 247 (1999).
 Raff et al. (2005) L. M. Raff, M. Malshe, M. Hagan, D. I. Doughan, M. G. Rockley, and R. Komanduri, J. Chem. Phys. 122, 084104 (2005).
 Behler and Parrinello (2007) J. Behler and M. Parrinello, Phys. Rev. Lett. 98, 146401 (2007).
 Behler et al. (2008) J. Behler, R. Martonak, D. Donadio, and M. Parrinello, Phys. Rev. Lett. 100, 185501 (2008).
 Pukrittayakamee et al. (2009) A. Pukrittayakamee, M. Malshe, M. Hagan, L. M. Raff, R. Narulkar, S. Bukkapatnum, and R. Komanduri, J. Chem. Phys. 130, 134101 (2009).
 Eshet et al. (2010) H. Eshet, R. Z. Khaliullin, T. D. Kühne, J. Behler, and M. Parrinello, Phys. Rev. B 81, 184107 (2010).
 Khaliullin et al. (2010) R. Z. Khaliullin, H. Eshet, T. D. Kühne, J. Behler, and M. Parrinello, Phys. Rev. B 81, 100103(R) (2010).
 Behler (2011) J. Behler, Phys. Chem. Chem. Phys. 13, 17930 (2011), see also references therein.
 Khaliullin et al. (2011) R. Z. Khaliullin, H. Eshet, T. D. Kühne, J. Behler, and M. Parrinello, Nat. Mater. 10, 693 (2011).
 Morawietz, Sharma, and Behler (2012) T. Morawietz, V. Sharma, and J. Behler, J. Chem. Phys. 136, 064103 (2012).
 Behler (2015) J. Behler, Int. J. Quantum Chem. 115, 1032 (2015), see also references therein.
 Bartók et al. (2010) A. P. Bartók, M. C. Payne, R. Kondor, and G. Csányi, Phys. Rev. Lett. 104, 136403 (2010).
 Rupp et al. (2012) M. Rupp, A. Tkatchenko, K.R. Müller, and O. A. von Lilienfeld, Phys. Rev. Lett. 108, 058301 (2012).
 Hansen et al. (2013) K. Hansen, G. Montavon, F. Biegler, S. Fazli, M. Rupp, M. Scheffler, O. A. von Lilienfeld, A. Tkatchenko, and K.R. Müller, J. Chem. Theory Comput. 9, 3404 (2013).
 Botu and Ramprasad (2015) V. Botu and R. Ramprasad, Int. J. Quantum Chem. 115, 1074 (2015).
 Li, Kermode, and Vita (2015) Z. Li, J. R. Kermode, and A. D. Vita, Phys. Rev. Lett. 114, 096405 (2015).
 Geiger and Dellago (2013) P. Geiger and C. Dellago, J. Chem. Phys. 139, 164105 (2013).
 Suzuki, Tamura, and Miyazaki (2016) T. Suzuki, R. Tamura, and T. Miyazaki, Int. J. Quantum Chem. 117, 33 (2016).
 Häse et al. (2016) F. Häse, S. Valleau, E. PyzerKnapp, and A. AspuruGuzik, Chem. Sci. 7, 5139 (2016).
 Owens et al. (2015) A. Owens, S. N. Yurchenko, A. Yachmenev, J. Tennyson, and W. Thiel, J. Chem. Phys. 142, 244306 (2015).
 Hastie, Tibshirani, and Friedman (2009) T. Hastie, R. Tibshirani, and J. Friedman (SpringerVerlag, 2009) 2nd ed., p. 763.
 von Lilienfeld (2013) O. A. von Lilienfeld, Int. J. Quantum Chem. 113, 1676 (2013).
 Ramakrishnan et al. (2015b) R. Ramakrishnan, M. Hartmann, E. Tapavicza, and O. A. von Lilienfeld, J. Chem. Phys. 143, 084111 (2015b).
 Dral (2017) P. O. Dral, MLatom: a Package for Atomistic Simulations with Machine Learning, version 0.9, MaxPlanckInstitut für Kohlenforschung, Mülheim an der Ruhr, Germany (2013–2017).
 Ramakrishnan et al. (2014) R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. von Lilienfeld, Sci. Data 1, 140022 (2014).
 Bereau, Andrienko, and von Lilienfeld (2015) T. Bereau, D. Andrienko, and O. A. von Lilienfeld, J. Chem. Theory Comput. 11, 3225 (2015).
 Rupp, Ramakrishnan, and von Lilienfeld (2015) M. Rupp, R. Ramakrishnan, and O. A. von Lilienfeld, J. Phys. Chem. Lett. 6, 3309 (2015).
 Faber et al. (2016) F. A. Faber, A. Lindmaa, O. A. von Lilienfeld, and R. Armiento, Phys. Rev. Lett. 117, 135502 (2016).
 Rupp (2015) M. Rupp, Int. J. Quantum Chem. 115, 1058 (2015).
 (53) On only one occasion (one of the ML models trained on 50% randomly drawn grid points) the RMSE of the training set in the firstlayer became unreasonably high due to overfitting caused by very small value of the parameter . In this case we set to for the first layer and retrained the ML model.
 Hansen et al. (2015) K. Hansen, F. Biegler, R. Ramakrishnan, W. Pronobis, O. A. von Lilienfeld, K.R. Müller, and A. Tkatchenko, J. Phys. Chem. Lett. 6, 2326 (2015).
 Bartók and Csányi (2015) A. P. Bartók and G. Csányi, Int. J. Quantum Chem. 115, 1051 (2015).
 Yurchenko, Thiel, and Jensen (2007) S. N. Yurchenko, W. Thiel, and P. Jensen, J. Mol. Spectrosc. 245, 126 (2007).
 Yachmenev and Yurchenko (2015) A. Yachmenev and S. N. Yurchenko, J. Chem. Phys. 143, 014105 (2015).
 Owens et al. (2016) A. Owens, S. N. Yurchenko, A. Yachmenev, J. Tennyson, and W. Thiel, J. Quant. Spectrosc. Radiat. Transf. 184, 100 (2016).
 Bunker and Jensen (1998) P. R. Bunker and P. Jensen, Molecular Symmetry and Spectroscopy, 2nd ed. (NRC Research Press, Ottawa, 1998).
 Partridge and Schwenke (1997) H. Partridge and D. W. Schwenke, J. Chem. Phys. 106, 4618 (1997).
 Watson (2003) J. K. G. Watson, J. Mol. Spectrosc. 219, 326 (2003).