Tree based machine learning framework for predicting ground state energies of molecules

Tree based machine learning framework for predicting ground state energies of molecules

Burak Himmetoglu Center for Scientific Computing, University of California, Santa Barbara, CA 93106 Enterprise Technology Services, University of California, Santa Barbara, CA 93106
July 19, 2019

We present an application of the boosted regression tree algorithm for predicting ground state energies of molecules made up of C, H, N, O, P, and S (CHNOPS). The PubChem chemical compound database has been incorporated to construct a dataset of 16,242 molecules, whose electronic ground state energies have been computed using density functional theory. This dataset is used to train the boosted regression tree algorithm, which allows a computationally efficient and accurate prediction of molecular ground state energies. Predictions from boosted regression trees are compared with neural network regression, a widely used method in the literature, and shown to be more accurate with significantly reduced computational cost. The performance of the regression model trained using the CHNOPS set is also tested on a set of distinct molecules that contain additional Cl and Si atoms. It is shown that the learning algorithms lead to a rich and diverse possibility of applications in molecular discovery and materials informatics.

I Introduction

Advances in electronic structure theory in combination with ever increasing computing power have made quantum mechanical simulations of molecules and solids rather commonplace. As a consequence, vast amounts of compounds have been studied by various electronic structure methods. In the last decade, efforts to collect and catalog these simulation data, in addition to those obtained from experimental studies have led to the ability to analyze and screen an extensive space of chemical compounds and materials Jain et al. (2013); Curtarolo et al. (2012). Automated screening of the experimental and theoretical compound spaces has become a powerful tool not only for discovering new systems, but also for rational design of chemicals and materials for targeted applications Hart et al. (2013); Hautier et al. (2010); Pizzi et al. (2016). Moreover, with the availability of various databases, analyses based on powerful statistical/machine learning methods have become feasible. Despite the availability of a vast number of systems in these databases, screening for new systems which have not yet been reported before requires a large number of new simulations to be performed. Density functional theory Hohenberg and Kohn (1964) (DFT), based on the effective single-particle Kohn-Sham equations Kohn and Sham (1965) has been a popular choice for performing accurate, yet computationally inexpensive simulations. Despite the relatively low cost of DFT simulations, screening the whole space of compounds and materials require electronic structure predictions at a much lower computational cost, ideally without performing new simulations for each system. Machine learning algorithms are a perfect match for this task, since these algorithms could learn from a given database of electronic structure calculations, and predict the electronic properties of a new set of systems (not included in the database) without the need of performing new simulations. Such a task requires a set of features (a.k.a descriptors) for each system that define their electronic makeup. Then, the electronic structure can be described by a general nonlinear function of these features, which the learning algorithm determines by a sophisticated fit to a given database (i.e. training the learning algorithm). As a result, predictions on new systems can be readily obtained through the trained model parameters.

The idea of predicting electronic structure using data has already been investigated in the literature. Various learning algorithms and choice of features have been proposed to predict electronic structures of molecules and solids. Earliest investigations considered a combined DFT and machine learning based modeling for predicting potential energy surfaces, ground state energies and formation enthalpies of molecules. These studies used neural networks Behler (2011); Hu et al. (2003); Balabin and Lomakina (2009); Sun et al. (2014) and support vector regression Balabin and Lomakina (2011) as learning algorithms. In case of solids, features were constructed from calculated electronic band gaps, cohesive energies, and crystalline volumes for a set of inorganic compounds and used for predicting a new set of band gaps using support vector regression Lee et al. (2016). In principle, apart from the approximation to the exchange-correlation functional (E), DFT calculations use atomic species and their positions as the only inputs for ground state electronic structure predictions 111In practice, the situation is complicated further by the choice of pseudopotentials, basis sets, kinetic energy cut-offs, boundary conditions (especially for molecular systems) and the simulation code Lejaeghere et al. (2016). In a similar manner, features for learning algorithms can be constructed based only on atomic coordinates and types, which provide an improvement over features based on calculated properties. For this objective, Coulomb matrices have been proposed as a robust set of features for a complete description of molecular systems Rupp et al. (2012). With Coulomb matrices as features, various learning algorithms have been tested, ranging from nonlinear kernels to neural networks, which resulted in accurate predictions for atomization energies of molecules Hansen et al. (2013).

Given the success of learning algorithms based on Coulomb matrices as features, it is therefore imperative to analyze available datasets Hill et al. (2016) with electronic structures determined by various DFT methods and implementations. For this purpose, we construct a database of electronic structure calculations based on the molecular structure data publicly available through the PubChem Substance and Compound database Kim et al. (2015). The electronic structure calculations are based on the highly scalable implementation of DFT which employs plane-waves and pseudopotentials. For the machine learning algorithm, we propose the adoption of boosted regression trees Friedman (2001), as a more computationally efficient alternative to previously used methods in the literature. As a demonstration, we show that the boosted regression trees outperform neural networks in predicting atomization energies, while significantly reducing the cost of model training. Our framework, based on Coulomb matrices as features and boosted regression trees as the machine learning algorithm, provide an accurate and efficient pathway to screen the chemical compound space for analyzing data and discovering new molecules.

The paper is organized as follows: We provide details about the computational methods used and the construction of the electronic structure database in section II. In section III, we provide a summary of the machine learning algorithms and the techniques used for model training. In section IV, we present the main results and provide a discussion on the performance of the methods used. Finally, in section V, we provide some concluding remarks.

Ii Methods

ii.1 Obtaining Data and Computational Methods

Molecular structures used in our study are generated from the PubChem Substance and Compound database Kim et al. (2015). The chemical structures which are probed have the substance identifier number (SID) ranging from 1 to 75,000. Using the structural information from this database, a subset of molecules are extracted based on the number and types of atoms, and the size of the molecules. This subset contains molecules that satisfy the following criteria: (i) Each molecule has to be composed of a subset of the elements from the set C, H, N, O, P and S (CHNOPS). (ii) Each molecule must have at least 2, at most 50 atoms. (iii) The maximum distance between two atoms in a molecule must not exceed 25 a ( 0.529 Å, i.e. Bohr radius), for convergence of plane-wave calculations, where each molecule is placed in a cubic box of side length 30 a. (iv) There must be an even number of electrons in the molecule. Applying these criteria to the first 75,000 entries in the PubChem database leads to a subset of 16,242 molecules, whose structure data files (SDF) are converted into input files for electronic structure calculations.

The electronic structure calculations are performed using the plane-waves pseudopotential implementation of DFT in the PWSCF code of the Quantum ESPRESSO package Giannozzi et al. (2009). The exchange-correlation energy is approximated using the generalized gradient approximation (GGA) with the Perdew-Burke-Ernzherof (PBE) parametrization Perdew et al. (1996). The choice of the PBE functional is purely due to its computational efficiency in the plane-waves basis set. Since the aim of this work is to predict simulation results, PBE is sufficient for our purposes. Functionals with exact-exchange that perform better can also be used for calculation of ground state energies which can be fed into the same machine learning algorithms used in this study. All of the atoms are represented by ultrasoft pseudopotentials Vanderbilt (1990). The electronic wavefunctions and charge density are expanded up to kinetic energy cutoffs of 30 Ry and 300 Ry, respectively. The calculations are performed in a cubic box of side length 30 a with a single k-point at . These choices lead to the calculation of the ground state energy with a numerical error of about 1 kcal/mol for the largest molecule in the dataset.

For each molecule in the dataset, we compute the pseudo-atomization energy (E), which is the quantity that serves as the prediction (outcome) for the learning algorithms. We compute E using


where E is the calculated DFT ground state energy, N is the number of atoms in the molecule, the index specifies the type of the atom (belonging to the CHNOPS set), n is the number of atoms of type , and E is the pseudo-energy of the isolated atom of type calculated during pseudopotential generation. All pesudopotentials are generated from the PSLibrary repository, provided in the Quantum ESPRESSO distribution. The histogram for the calculated pseudo-atomization energies from the dataset of 16,242 molecules are shown in Fig. 1.

Figure 1: Histogram of pseudo-atomization energies (E). The mean value of is 11.18 Ry (3506.37 kcal/mol) (indicated by the dashed vertical line) and the standard deviation is 3.66 Ry (1147.89 kcal/mol). A Gaussian probability density function with the same mean and standard deviation is also plotted for comparison.

The total variability in the dataset, quantified by the standard deviation of E, is 3.66 Ry (1147.89 kcal/mol), which is larger than the variability reported in some of the earlier works in the literature Hansen et al. (2013); Montavon et al. (2013), indicating that a much wider range of molecular systems being included in the current study. The consequences of this difference will be discussed in section IV.

ii.2 Data Description and Visualization

In order to build models for predictions from machine learning algorithms, construction of feature vectors (a.k.a descriptors) are needed to represent each molecule. We use the intermolecular Coulomb repulsion operators (will be referred to as Coulomb matrices from now on) introduced in Ref. Rupp et al., 2012, which are defined as


where Z are atomic numbers, are atomic coordinates, and indices run over the atoms in a given molecule. The off-diagonal terms correspond to ionic repulsion between atoms and and the diagonal terms (both the pre-factor and power) are obtained from a fit of the atomic numbers to the energies of isolated atoms Rupp et al. (2012). The Coulomb matrices represent the full set of parameters that DFT calculations take as inputs (Z and ), aside from the approximation to E, which are used to calculate the ground state energy. Therefore, the problem of predicting ground state energies (or atomization energies) using C is well defined. However, C does not provide a unique description, since a given molecule can be represented by more than one matrix that can be obtained by reshuffling the indices of the atoms. There are several ways to overcome this problem as described in Ref. Hansen et al., 2013. Here, aside from C itself, we will use its eigenspectrum (which is one of the unique representations proposed in Ref. Hansen et al., 2013) for a given molecule. Since we limited the number of atoms in a given molecule by 50, the Coulomb matrices we use are matrices. Molecules with less than 50 atoms have their Coulomb matrices appended by columns and rows of 0 to complete them to have dimensions of .

A given molecule in the dataset numbered with index is represented by a -dimensional feature vector , where is the total number of unique entries in the Coulomb matrix (i.e. the upper triangular part of the symmetric matrix C, unrolled into a 1275 dimensional vector) or the number of eigenvalues (i.e. 50 dimensional vector of eigenvalues). The whole dataset is then cast in a data matrix of dimensions , where N is the number of data points (16,242). In this representation, molecules are listed in rows, and each column is an entry in the p-dimensional feature vector (for the molecule). Namely, for a given molecule with index ,


when the full Coulomb matrix is used, while


when the eigenvalues of are used. Finally, the data matrix is explicitly given in terms of the feature vectors by


The outcomes, namely the pseudo-atomization energies E, are also cast in a N-dimensional vector explicitly given by


As will be explained in the next section, the objective of the machine learning problem is to find a nonlinear function that learns a relationship between the data and the outcomes .

Fig. 1 depicts the overall variability in the outcome vector , namely the distribution of E in the dataset. However, it is also useful to explore the features and how they relate to E. This is not a straightforward task, since the data matrix have a very large number of columns (1275 when data is represented by , 50 when data is represented by ). Instead of plotting E as a function of all possible features, a smaller set of features that describe the overall trends in the data can be constructed using principal components analysis (PCA) Friedman et al. (2001). PCA enables one to obtain linear combinations of features (i.e. columns of ) that explain the largest variation in the data. Before utilizing PCA, the data matrix is centered via the transformation


This transformation ensures that each feature has zero mean, i.e. the column sums of are 0. With the centered data, the covariance matrix takes a simple form


The principal components are constructed from the eigenvectors of the covariance matrix, and are ordered with increasing eigenvalues. For instance, the first principal component corresponds to the linear combination of columns of with the largest eigenvalue, hence has the largest variance. Formally, the principal component vector is given by




In the above equation, is the eigenvalue corresponding to principal component vector . Using this recipe, we have constructed the principal components of using the eigenspectrum (Eqn.(4)). Among the 50 principal components, the first two account for 32 % of the variability in the data (i.e., ).

Figure 2: E as a function of the first two principal components Z and Z. The reported values within each rectangular region are the mean values of E, which are determined by the regression tree algorithm outlined in section III.

Fig. 2 illustrates E as a function of the first two principal components ( and ), which display a peculiar nonlinear dependence on the two features. In the next section, we will summarize some of the learning techniques that we have used to accurately model this nonlinear behavior of E based on the data.

Iii Learning Methods

iii.1 Overview

The main purpose of training a learning algorithm in the context of regression is to fit a nonlinear function using the data to predict the outcome . This is achieved via the minimization of a loss function which is a measure of the difference between the actual data and the fit. For instance, in the case where is parameterized by a p-dimensional vector , and the predictions of the outcomes are given by , is obtained by optimizing


where are residuals and the loss function used for optimization is the residual sums squared (RSS). The term is the regularization term, and prevents overfitting Friedman et al. (2001). For example, in the well-known case of LASSO Friedman et al. (2001), a linear function is used for predictions , and the regularization term is . While the parameter vector is obtained by minimizing Eqn.(11), the regularization parameter is obtained by cross-validation. In this article, we consider two types of cross-validation approaches: (i) validation set, (ii) k-fold cross-validation. In both approaches, the full dataset is randomly divided into a training and test set. In this study, 70 % of the data is randomly selected as the training set, and 30 % as the test set. The training set is used to train the learning algorithm, i.e. to obtain values of the regularization parameter(s), while the test set, as an independent piece of the data, is used to report the accuracy of the trained model. The validation set and k-fold cross-validation approaches are schematically illustrated in Fig. 3.

Figure 3: (i) Validation set and (ii) k-fold cross-validation approaches used for training a model. In case of LASSO, a single regularization parameter is picked from the model with lowest cross-validation error. Final accuracy is reported on the independent testing set.

In the validation set approach, the training data is further split (randomly) into two, yielding a second level training set and a validation set. The model is trained on the training set for a range of parameters that determines the regularization term. Then, the model which results in the smallest mean squared error (MSE) is chosen. The MSE is the mean value of the error in the fit given by


In case of LASSO, is the parameter determined from cross-validation. Finally, the first level training set (the original one with 70 % of the data) is used to re-fit the model (with the regularization parameters fixed) and the final accuracy of the model is reported by the performance on the test set. While this method provides a clear path to determining the regularization parameters, the results generally depend on the way the data is split. The k-fold cross-validation approach attempts to resolve this issue by randomly diving the training set into k groups of approximately equal size. For each division (i.e. fold), the validation set approach is repeated: the (k-1) folds are used as the second level training set and the left out fold is used as the validation set. After obtaining k estimates for the MSE, the results are averaged and the model which leads to the smallest cross-validation error (CVE) is picked, as depicted in Fig. 3. k-fold cross-validation usually resolves the dependence of the model parameters on the splits used, at the expense of increased computational cost. We use 5-fold cross-validation for training learning algorithms in this study, except the case of neural networks when the full ’s are used as features (a validation set is used in that case).

iii.2 Regression Trees

The idea behind regression trees is to divide the space of features into regions where in each region, the value of the function is the mean of the observations inside. For example, Fig. 2 illustrates the result of a regression tree fitted using two features that are the principal components . In each of the rectangular regions of the feature space spanned by , the prediction for is the average of the observations (values inside each rectangle in Fig. 2). Namely,


where is a region in the feature space and is the number of data points in . The regions are determined by optimum splits in the feature space that leads to the smallest RSS. Formally, the following function is minimized to obtain the tree


where the first term is the RSS, and the second term act as the regularization. is the number of terminal nodes (or leaves) in a tree which coincide with the number of regions the feature space is split. The larger the number of terminal nodes , the more complex the structure of will be, which may lead to overfitting. The regularization term adds a penalty for complex trees to prevent overfitting. An example of a tree with 8 terminal nodes is shown in Fig. 4.

Figure 4: Regression tree trained on the dataset with the first two principal components used as features. The blue boxes contain the value of the average value of E as well as the percentage of data points in the region they belong to. These regions can also be seen in Fig. 2.

As can be seen, the regression tree is grown recursively. The largest reduction in RSS is obtained at the split , then in splits , and so on.

While the regression tree algorithm is simple and has low computational cost, it has low predictive accuracy. The tree shown in Fig. 4 leads to a very rough description of the nonlinear behavior of E as a function of the features, as can be seen in the regions of Fig. 2. An approach to more accurately describe the nonlinearities is to train an ensemble of trees and combine the predictions, also known as boosting Friedman (2001). The boosted tree algorithm starts with a null prediction () and sequentially train trees on the residuals from the previous tree. Namely,

with the final prediction


where denotes that the tree at iteration is trained on the residuals from the prediction of the tree. The parameter is known as shrinkage, which determines the weight of the predictions that are added on at each iteration. Both the shrinkage and the number of sequentially trained trees are parameters to be determined by cross-validation. Fig. 5 illustrates the process of boosting trees with as features.

Figure 5: Predicted using boosted regression trees for (left panel) and (right panel). While a single tree () results in a rough division of the feature space, boosting many trees () results in much higher accuracy.

While a single tree divides the feature space into rough regions (left panel in Fig. 5), with 100 trees (right panel of Fig. 5), the predicted E values are almost indistinguishable to the eye from the original data (Fig. 2).

In this study, we use the computationally efficient and scalable implementation of the boosting algorithm XGBoost Chen and Guestrin (2016). Apart from the shrinkage (), number of trees (), and the regularization term (Eqn.(14)), XGBoost has several other parameters which are optimized to achieve highly accurate predictions. The parameters we have chosen to determine using cross-validation, with their short description are listed in Table 1, while a full description can be found in Ref. Chen and Guestrin, 2016.

Parameter Description
R Number of trees grown
Regularization term
MD Maximum number of
terminal nodes () in a tree
CST Subsample ratio of randomly chosen
features while training a tree
MCW Minimum number of data points
in a region of feature space in each tree
Table 1: XGBoost parameters

iii.3 Neural Networks

While neural networks are more popular for applications in classification problems, they can also be used for regression Friedman et al. (2001). As input, neural networks take the feature vectors and using an activation function, nonlinear features are created and used for regression. The activation function connects the linear input through layers of neurons to the nonlinear output. The parameters used for connecting the layers in a neural network are obtained by optimizing the RSS (in regression setting) using the training data. A structure of a neural network with one (hidden) layer is shown in Fig. 6. Given the input vector of features , a nonlinear output is obtained by the following equations


In the first line, the vector is constructed from the input features by adding 1, which accounts for the constant term in regression. Then, the input vector is transformed into a derived feature vector in the hidden layer (with size ) using the nonlinear activation function . The transformation is parameterized by the coefficients which are elements of a matrix. The output layer, which serve as the prediction vector ( dimensional), results from a linear transformation via which is an dimensional vector. The parameters and are obtained by minimizing Eqn.(11) with the following regularization term:

Figure 6: A schematic representation of single layer neural network. Each line represents a term that connects a feature () to a derived feature () or a derived feature () to the output (). The activation functions use the parameters between the input and hidden layer, while a linear transformation using is used between the hidden layer and output.

There are several choices for the activation function, but we adopt here the most widely used sigmoid function which is given by


It is possible to obtain more complex nonlinear relationships between the features and the output layer, by including more hidden layers. While more hidden layers may result in higher accuracy, because of the added computational cost, we have limited our study to single-hidden-layer neural networks. In fact, the computational cost of training even a single-layer neural network is much higher than a boosted regression tree, and its discussion is included for the sake of assessing the prediction accuracy of the latter method.

Iv Results and Discussion

iv.1 Model training and Accuracy

Using the cross-validation techniques outlined in the previous section, we have trained both boosted regression trees and neural networks. Either the Coulomb matrices () or their eigenspectrum () were used as features for training. The training set, 70% of the full dataset chosen randomly, is used to determine the parameters of the models. The parameters obtained for boosted regression trees using 5-fold cross-validation is shown in Table. 2.

600 0.0156 0.0 16 0.4 10
400 0.0625 0.0 6 0.2 10
Table 2: Parameters determined for boosted regression trees using 5-fold cross-validation for and based features. Definitions of the parameters are given in Table. 1.

As an illustration, we show in Fig. 7 the results of the 5-fold cross-validation with as features. Since the space of parameters to be optimized is multi-dimensional (Table. 1), we present root mean squared error (RMSE) (i.e. the square root of MSE in Eqn.(12)), when MCW, CS and are fixed to their optimized values from Table. 2.

Figure 7: RMSE (in Ry) from 5-fold cross-validation for boosted regression trees for used as predictors as a function of number of trees (R) and maximum number of trees (MD). Some of the parameters are fixed in the plots: MCW = 10, CS = 0.4, = 0.0.

For neural networks, a 5-fold cross-validation approach is used for training the model when are used as features, while the validation set approach is used when are used as features due to heavy computational cost in this case. In the validation set approach, the initial training set (70% of the original dataset) is split further into a validation set (40% of initial training set) and a second level training set (60% of initial training set) randomly. The cross-validation results in (size of the hidden layer) and (the regularization parameter) when are used, while and is obtained when are used. The resulting cross-validation (or validation set) and test errors, measured by RMSE, for the trained models are summarized in Table. 3. The best performance is obtained with the boosted regression tree algorithm when the Coulomb matrices are used as features. The use of the eigenspectrum as features only slightly increase the RMSE, while the computational cost is much smaller. Therefore, it is reasonable to use the instead of the full . In Fig. 8 we show the difference between predicted and actual E evaluated on the test set, using .

As we pointed out earlier, the boosted regression tree algorithm is computationally more efficient than neural networks. For example, with as features, the elapsed time to train the boosted regression tree model (the model determined by cross-validation whose parameters are listed in Table. 2) is 12.1 seconds (on a laptop with 8 CPU cores). Instead, the elapsed time to train the neural network model (the model with and determined by cross-validation) is 258.2 seconds. In case of the elapsed time to obtain the fit to the testing data (after the models are trained), the neural network model takes 0.02 seconds, while the boosted regression tree model takes 0.01 seconds. In case of as features, the training times are 17.3 seconds for the boosted regression tree and 16917.2 seconds for the neural network, while the fitting times to the test data are 0.01 and 0.2 seconds, respectively. While the fitting time for neural networks are almost the same as boosted regression trees, the vast difference between training times, in addition to increased accuracy, makes boosted regression tree the preferred algorithm.

Previous studies which analyzed the GDB database Fink et al. (2005); Fink and Reymond (2007) of molecules, have found much smaller test RMSE for their best achieving learning methods Hansen et al. (2013); Montavon et al. (2013). For example, Ref. Hansen et al., 2013 reported a RMSE of 20.29 kcal/mol for multi-layer neural networks and 13.18 kcal/mol for a nonlinear kernel method when were used as features. These are smaller values of RMSE compared to our values of 41.81 and 60.06 kcal/mol for boosted trees and single layer neural networks, respectively. The main difference is due to the fact that the variability in the GDB database is much smaller than our dataset which is based on the PubChem data. Ref. Hansen et al., 2013 used a subset of the GDB database that contains 7165 molecular structures which has maximum 23 atoms per molecule. The reported standard deviation of the atomization energies (RMSE when mean value of is used as the prediction) was 223.92 kcal/mol. In our study, we have 16,242 structures with a maximum of 50 atoms per molecule, while the standard deviation of the atomization energies is 1147.89 kcal/mol, which indicates that the dataset used in this study encompasses a larger range of molecules. As a result of the larger variance in the training data (almost five times of Ref. Hansen et al., 2013), the learning algorithms result in higher RMSE values. Notice also that the number of molecules in this study is much larger, leading to a computationally more expensive model training when a method like neural network regression is used. It is possible to obtain much better accuracies by including more data with the use of the full PubChem database, instead of the first 75,000 entries as we did in this work. In addition, using several copies of the same molecule by representing them via randomly re-ordered Coulomb matrices (a method introduced in Ref. Hansen et al., 2013 to address the uniqueness problem of ) would reduce the variance in the dataset, leading to better accuracies. While the reported RMSE values are higher than what would be desired (e.g. the accuracy of a few kcal/mol), inclusion of more data presents a clear path to reach more accurate models. As an example, we have tested the accuracy of the randomly re-ordered Coulomb matrices and found that with only 4 random re-orderings included (yielding a training set four times larger than original), the RMSE of 36.63 kcal/mol (Table. 3) reduces to 27.74 kcal/mol. It is also possible to use an ensemble method Friedman and Popescu (2008), where predictions of boosted trees and neural networks are combined. In each method, the worst predictions are on different molecules, and by combining the two on the regions where they perform the best, improved accuracies can be obtained. Another approach has been proposed in a more recent work, where feature learning increased prediction accuracies Dai et al. (2016). While all of these approaches would lead to increased accuracy, they also come with added computational cost and are beyond the scope of this work. Therefore, we leave these tasks for a future study.

Feature Method 5-fold CV (or validation set) error test error
Boosted Tree 44.09 41.81
N. Network 62.15 60.06
Boosted Tree 38.75 36.63
N. Network 45.25 58.39
Table 3: 5-fold CV (validation set in case of neural network with ) and test RMSE in kcal/mol. The difference between 5-fold CV and test RMSE is smaller than that of validation set and test RMSE as expected, since 5-fold CV reduces the dependence of RMSE to the way the data is split.
Figure 8: The difference between actual and predicted E on the test set. Predictions are made using the boosted regression tree with as features. Color coding represents the number of molecules that fall into each bin in the histogram.

iv.2 Predictions on independent datasets

In the previous subsection, we have trained models to predict E based on a dataset of molecules made up of elements in the CHNOPS set. Namely, each molecule contains at least one of the atoms in the set . While a random division of the dataset into training and test sets allowed us to evaluate the accuracy of the models, another possible measure for performance would be based on a set of molecules outside the CHNOPS set. For this purpose, we have constructed two distinct datasets in order to test the models trained in the previous subsection: (i) Cl-set, which contains at least one Cl atom in addition to at least one of CHNOPS, (ii) Si-set, which contains at least one Si atom in addition to at least one of CHNOPS. From the first 75,000 entries in the PubChem database, we have found that there are 159 molecules in the Si-set and 4114 in the Cl-set. Similar to the original CHNOPS set, each molecule in both Si and Cl sets have a maximum of 50 atoms per molecule, and even number of electrons. Using the boosted regression trees we have trained in the previous subsection, we predict E in Cl and Si sets, for which the results are presented in Table. 4. Since the models are not trained with molecules comprising of Si or Cl, this test assesses applicability of our method when predictions on molecules with new elements need to be obtained.

Set RMSE () RMSE ()
Cl 75.18 114.20
Si 89.13 86.53
Table 4: Test RMSE (kcal/mol) on the Cl and Si sets. Predictions are based on the boosted regression tree algorithm trained on the CHNOPS set.

As expected, the RMSE values are higher for the Cl and Si sets, than that of the test errors reported in Table. 3. Therefore the models can only be used as exploratory tools when predictions on completely separate datasets are needed.

V Conclusions

In this work, we have proposed the use of boosted regression trees as a higher accuracy and computationally much more efficient alternative to some other machine learning methods proposed for electronic structure prediction. We have tested the performance of boosted regression trees using the PubChem database, and shown that it outperforms single-layer neural networks in predicting ground state energies (equivalently E). Due to the ability to grow many trees in parallel, boosted regression trees are much faster than neural networks, which require large matrix operations. We have also shown that the trained algorithms can be used for predicting electronic structure of molecules containing elements other than the ones included in the training set. While the prediction accuracy is reduced in this case, the method is still applicable for exploratory studies.

Machine learning techniques provide a rich possibility of applications for quantum mechanical simulations in computational chemistry and materials science Rupp (2015). In fact, there has been several other compelling applications of learning algorithms for predicting electronic properties other than atomization energies for molecules Montavon et al. (2013), prediction of electronic structure of solids Schütt et al. (2014); Faber et al. (2015), finding DFT functionals Snyder et al. (2012) and determining potential energy surfaces Lorenz et al. (2006, 2004); Khorshidi and A. (2016). The adoption of boosted regression trees for the learning method, as proposed here, would reduce the cost of model training compared to computationally heavier algorithms like neural networks and support vector regression, without sacrificing, and possibly increasing prediction accuracies.

With the ability to predict electronic properties without performing new simulations for each molecule, machine learning techniques open up exciting pathways for rational design of new compounds. Combined with numerous efforts to catalog and standardize datasets, these methods will be invaluable for many scientific and technological applications.

Vi Acknowledgements

We acknowledge support from the Center for Scientific Computing from the CNSI, MRL: an NSF MRSEC (DMR-1121053) and NSF CNS-0960316.


  • Jain et al. (2013) A. Jain, S. P. Ong, G. Hautier, W. Chen, W. D. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder,  and K. A. Persson, APL Mater. 1, 011002 (2013).
  • Curtarolo et al. (2012) S. Curtarolo, W. Setyawan, G. L. W. Hart, J. M, R. V. Chepulskii, R. H. Taylor, S. Wang, J. Xue, K. Yang, L. O., M. J. Mehl, H. T. Stokes, D. O. Demchenko,  and D. Morgan, Comput. Mater. Sci. 58, 218 (2012).
  • Hart et al. (2013) G. L. W. Hart, S. Curtarolo, T. B. Massalski,  and O. Levy, Phys. Rev. X 3, 041035 (2013).
  • Hautier et al. (2010) G. Hautier, C. C. Fischer, A. Jain, T. Mueller,  and G. Ceder, Chem. Mater. 22, 3762 (2010).
  • Pizzi et al. (2016) G. Pizzi, A. Cepellotti, R. Sabatini, N. Marzari,  and B. Kozinsky, Comput. Mater. Sci. 111, 218 (2016).
  • Hohenberg and Kohn (1964) P. Hohenberg and W. Kohn, Phys. Rev. 136, B864 (1964).
  • Kohn and Sham (1965) W. Kohn and L. J. Sham, Phys. Rev. 140, A1133 (1965).
  • Behler (2011) J. Behler, Phys. Chem. Chem. Phys. 13, 17930 (2011).
  • Hu et al. (2003) L. Hu, X. Wang, L. Wong,  and G. Chen, J. Chem. Phys. 119, 11501 (2003).
  • Balabin and Lomakina (2009) R. M. Balabin and E. I. Lomakina, J. Chem. Phys. 131, 074104 (2009).
  • Sun et al. (2014) J. Sun, J. Wu, T. Song, L. Hu, K. Shan,  and G. Chen, J. Phys. Chem. A 118, 9120 (2014).
  • Balabin and Lomakina (2011) R. M. Balabin and E. I. Lomakina, Phys. Chem. Chem. Phys. 13, 11710 (2011).
  • Lee et al. (2016) J. Lee, A. Seko, K. Shitara, K. Nakayama,  and I. Tanaka, Phys. Rev. B 93, 115104 (2016).
  • (14) In practice, the situation is complicated further by the choice of pseudopotentials, basis sets, kinetic energy cut-offs, boundary conditions (especially for molecular systems) and the simulation code Lejaeghere et al. (2016).
  • Rupp et al. (2012) M. Rupp, A. Tkatchenko, K.-R. Müller,  and O. A. von Lilienfeld, Phys. Rev. Lett. 108, 058301 (2012).
  • Hansen et al. (2013) K. Hansen, G. Montavon, F. Biegler, S. Fazli, M. Rupp, M. Scheffler, O. A. von Lilienfeld, A. Tkatchenko,  and K.-R. Müller, J. Chem. Theory Comput. 9, 3404 (2013).
  • Hill et al. (2016) J. Hill, G. Mulholland, K. Persson, R. Seshadri, C. Wolverton,  and B. Meredig, MRS Bull. 41, 399 (2016).
  • Kim et al. (2015) S. Kim, P. A. Thiessen, E. E. Bolton, J. Chen, G. Fu, A. Gindulyte, L. Han, J. He, S. He, B. A. Shoemaker, J. Wang, B. Yu, J. Zhang,  and S. H. Bryant, Nucleic Acids Res. , gkv951 (2015).
  • Friedman (2001) J. H. Friedman, Ann. Statist. 29, 1189 (2001).
  • Giannozzi et al. (2009) P. Giannozzi, S. Baroni, N. Bonini, M. Calandra, R. Car, C. Cavazzoni, D. Ceresoli, G. Chiarotti, M. Cococcioni, I. Dabo, et al., J. Phys. Condens. Matter 21, 395502 (2009).
  • Perdew et al. (1996) J. P. Perdew, K. Burke,  and M. Ernzerhof, Phys. Rev. Lett. 77, 3865 (1996).
  • Vanderbilt (1990) D. Vanderbilt, Phys. Rev. B 41, 7892 (1990).
  • Montavon et al. (2013) G. Montavon, M. Rupp, V. Gobre, A. Vazquez-Mayagoitia, K. Hansen, A. Tkatchenko, K.-R. Müller,  and O. A. von Lilienfeld, New J. Phys. 15, 095003 (2013).
  • Friedman et al. (2001) J. Friedman, T. Hastie,  and R. Tibshirani, The elements of statistical learning (Springer series in statistics Springer, Berlin, 2001).
  • Chen and Guestrin (2016) T. Chen and C. Guestrin, arXiv preprint arXiv:1603.02754  (2016).
  • Fink et al. (2005) T. Fink, H. Bruggesser,  and J.-L. Reymond, Angew. Chem. Int. Ed. 44, 1504 (2005).
  • Fink and Reymond (2007) T. Fink and J.-L. Reymond, J. Chem. Inf. Model. 47, 342 (2007).
  • Friedman and Popescu (2008) J. H. Friedman and B. E. Popescu, Ann. Appl. Stat. 2, 916 (2008).
  • Dai et al. (2016) H. Dai, B. Dai,  and L. Song, arXiv preprint arXiv:1603.05629  (2016).
  • Rupp (2015) M. Rupp, Int. J. Quantum Chem. 115, 1058 (2015).
  • Schütt et al. (2014) K. T. Schütt, H. Glawe, F. Brockherde, A. Sanna, K. R. Müller,  and E. K. U. Gross, Phys. Rev. B 89, 205118 (2014).
  • Faber et al. (2015) F. Faber, A. Lindmaa, O. A. von Lilienfeld,  and R. Armiento, Int. J. Quantum Chem. 115, 1094 (2015).
  • Snyder et al. (2012) J. C. Snyder, M. Rupp, K. Hansen, K.-R. Müller,  and K. Burke, Phys. Rev. Lett. 108, 253002 (2012).
  • Lorenz et al. (2006) S. Lorenz, M. Scheffler,  and A. Gross, Phys. Rev. B 73, 115431 (2006).
  • Lorenz et al. (2004) S. Lorenz, A. Gross,  and M. Scheffler, Chem. Phys. Lett. 395, 210 (2004).
  • Khorshidi and A. (2016) A. Khorshidi and P. A. A., Computer Physics Communications 207, 310 (2016).
  • Lejaeghere et al. (2016) K. Lejaeghere, G. Bihlmayer, T. Björkman, P. Blaha, S. Blügel, V. Blum, D. Caliste, I. E. Castelli, S. J. Clark, A. Dal Corso, et al., Science 351, aad3000 (2016).
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description