Constant Size Molecular Descriptors For Use With Machine Learning
Abstract
A set of molecular descriptors whose length is independent of molecular size is developed for machine learning models that target thermodynamic and electronic properties of molecules. These features are evaluated by monitoring performance of kernel ridge regression models on wellstudied data sets of small organic molecules. The features include connectivity counts, which require only the bonding pattern of the molecule, and encoded distances, which summarize distances between both bonded and nonbonded atoms and so require the full molecular geometry. In addition to having constant size, these features summarize information regarding the local environment of atoms and bonds, such that models can take advantage of similarities resulting from the presence of similar chemical fragments across molecules. Combining these two types of features leads to models whose performance is comparable to or better than the current state of the art. The features introduced here have the advantage of leading to models that may be trained on smaller molecules and then used successfully on larger molecules.
CMUChem]Department of Chemistry, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213 United States CMUML]Machine Learning Department, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213 United States University of Basel]Department of Chemistry, Institute of Physical Chemistry and National Center for Computational Design and Discovery of Novel Materials (MARVEL), University of Basel, 4056 Basel, Switzerland CMUChem]Department of Chemistry, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213 United States
1 Introduction
Cheminformatics has a long tradition of using the techniques of machine learning to predict properties of molecules. A particular focus has been properties that emerge from interactions of the molecule with complex external systems, such as arise in drug discovery^{1, 2}. More recently, informatics has been successfully used to develop models of the properties of the molecules themselves. This includes thermodynamic properties^{3, 4, 5, 6, 7, 8, 9, 10, 11, 12}, electronic^{13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26} properties and molecular structure^{27, 28, 29, 30}. Successful predictions of such molecular properties provides an alternative to quantum chemical computations, leading to useful predictions at a small fraction of the cost of quantum computations^{4, 5}. Because the target properties involve only the molecule itself, as opposed to interactions with complex external systems, the descriptors that bring the most explanatory power to the informatics tasks may differ from those developed for targets involving external systems. This paper explores a number of descriptors, or features, for predicting the properties of individual molecules.
A key dimension along which informatics models of molecular properties differ is the degree to which results from chemical computations are used as features for the prediction task. On one end of this spectrum, the model corrects predictions from standard computational chemistry methods, so that they better match more expensive calculations^{31, 17, 18} or experiments^{3, 32}. At the other end of the spectrum, the only information used is what one might get from a line drawing or SMILES^{33} string. At the latter level, there have been many examples of using simple counts of atoms and bonds to make predictions of molecular properties^{34, 5}. The low computational cost of such models makes them useful for searching molecular databases^{35}. Their restricted applicability to potential energy minima or known bonding patterns; however, reduces their predictive power.
Between the above two extremes are models that utilize the 3D molecular geometry to make predictions. Models that use features based on the Coulomb matrix, such as the recent Bag of Bonds (BoB) feature vector, have been shown to be highly effective with this level of information^{34, 5}. However, the length of these feature vectors scales formally as the square of the number of atoms in the largest molecule of interest^{36}. There have been attempts to rectify this issue, but they have been unable to achieve the same level of performance^{20}.
A primary goal of the work presented here is to develop feature vectors whose length is independent of molecular size and that subsequently lead to models that generalize well from smaller to larger systems. This should be possible if we assume: a) an effective chemical periodicity where large systems consist of smaller fragments which resemble exemplary chemistries present in the data base used for training, and b) locality implying additivity. For certain systems, these assumptions can break down; in particular, the latter one when strong electron delocalization effects are present as in metals or in the bonds of conjugated polymers. To achieve our goal, two classes of features are explored.
The first class of features are based on connectivity counts, which have been explored in a number of different ways.^{8, 37, 38, 39, 40} Here, these features include counts of atom types and bond types, which we refer to as rank1 and rank2 features since they summarize patterns related to one and two atoms, respectively. Higherrank features, that count occurrences of patterns related to three or more bonded atoms, are also explored.
The second class of features are encoded distances. These summarize the pairwise distances between atoms in a molecule through a feature vector whose length is independent of molecular size. The feature vector is a smoothed histogram of distances between atoms, including nonbonded atoms.
Both classes of features summarize information regarding the local environment of atoms and bonds in the molecule. The resulting feature vectors therefore contain information on the fragments present in the molecule. In particular, when two molecules possess the same molecular fragment, the interactions amongst the atoms in that fragment make similar contributions to the feature vector. Use of these feature vectors to predict similarity between two molecules, as in the kernel methods below, therefore incorporates similarity resulting from the presence of similar fragments. In addition, the feature vectors include information regarding interactions amongst the fragments. The features may therefore enable models to generalize from smaller to larger molecules.
2 Related Work
The features studied here are somewhat related to features developed for neural net models of potentials^{41, 10, 42, 43}. In those models, the total energy is written as a sum over atomic energies, with the atomic energies being functions of the local environment^{44}. The environment is described using features based on two and three body correlation functions around the atom. The encoded distance features explored here may be viewed as smoothed versions of twobody correlation functions between each pair of elements. We note that the latter lack uniqueness, a crucial feature property for guaranteeing error convergence with training set size^{20}. In particular, they will fail to distinguish homometric molecules.
Below, we use kernelbased learning methods to explore the utility of the features introduced here. Our studies therefore have some similarities to the Regularized Entropy Match Smooth Overlap of Atomic Positions (REMatchSOAP) kernel, which provides an alternative means for predicting molecular properties using only 3D molecular geometries^{23}. REMatchSOAP does not introduce a molecular feature vector and, instead, provides a means to directly compute the similarity between two molecular structures, as needed for kernel methods. The approach begins by computing the similarity between each of the atoms in the two molecules being compared, where the similarity measure includes the molecular environment. The molecular similarity is then obtained from a regularized entropy match of the resulting atomic similarities.
In contrast to REMatchSOAP, the features introduced here are not limited to kernel methods and, when used in kernel methods, allow us to separately consider the influences of three factors: the feature vector, the distance metric used to connect features to the similarity between molecules, and the kernel function. Our emphasis here is on the features themselves. We use cross validation to select the kernel, including the width parameter for an exponential kernel function and the choice of a Laplacian kernel, which uses an L1 distance metric, versus a Gaussian kernel, which uses an L2 distance metric. The performance of the resulting models on well studied data sets is examined, with particular attention paid to the ability of the method to scale between different size molecules.
3 Data Sets
This work compares performance of various molecular features on data sets that have been the subject of past work, demonstrating the predictive power of machine learning models to predict molecular properties. These data sets utilize molecules selected from the following two data sets, each of which contains only the bonding pattern as summarized in a SMILES string. The GDB13 data set contains 977 million molecules made of C, H, O, N, S, and Cl, with up to 13 heavy atoms^{45}, and the GDB17 data set contains 166 billion molecules made up of C, H, O, N, and F with up to 17 heavy atoms^{46}.
The QM7 data set corresponds to a subset of the GDB13 molecules, consisting of all 7,165 molecules that contain 7 or fewer heavy atoms of elements C, N, O, or S for which optimized structures and atomization energies were calculated with the PerdewBurkeErnzerhof hybrid functional (PBE0)^{4}. Models for the atomization energies in QM7 have been extensively studied, primarily using features derived from the Coulomb matrix^{4, 5, 6, 34, 11}. As such, this data set provides a useful benchmark for testing models and comparing to previous work.
The QM7b data set extends the QM7 data set to molecules containing chlorine, and includes a broader range of properties such as HOMO and LUMO energies, excitation energies, and polarizabilities calculated at various levels of theory^{16}. QM7b allows performance of machine learning models to be explored on a wider variety of properties. Past works have also used QM7b to explore multitarget regression methods which, by simultaneously predicting various properties, can potentially benefit from relationships between the target properties^{6, 34, 16}.
The QM9 data set^{47} contains data for a subset of the GDB17 molecules, consisting of 133,885 molecules that contain nine or fewer heavy atoms. (The name QM9 is used here to be consistent with that of QM7.) The QM9 data set has been used in several studies^{34, 19, 21, 20} and includes optimized structures and 18 different properties calculated using B3LYP/631g(2df,p). Below, QM9 is used to explore the degree to which models training on smaller molecules can transfer to larger molecules. The large amount of data in QM9 also allows more extensive studies on the degree to which inclusion of additional data improves model performance.
4 Machine learning
A variety of machine learning methods have already been used to predict molecular properties, including linear regression, linear ridge regression, support vector regression, kernel ridge regression, regression trees, knearest neighbors, and neural nets/deep learning^{16, 5, 10}. For this study, we have elected to use two standard machine learning methods^{48}, linear ridge regression (LRR) and kernel ridge regression (KRR) with the Gaussian and Laplacian kernels. These methods were chosen due to their relative simplicity, compared for instance to deep learning, and because past work has shown them to be effective on the data sets studied here^{5, 34}.
The models involve very many parameters which, during “model training”, are adjusted to obtain predictions that agree with a given set of examples. In addition, both LRR and KRR involve hyperparameters that serve to define further the model form. The wide range of values for the hyperparameters leads to a wide range of possible models. “Model selection” involves searching through these to find a model, or equivalently a set of hyperparameters, that leads to good performance. Here, we do an exhaustive search over a grid of hyperparameter values using cross validation as described below.
In LRR, there is one hyperparameter, , which is the weight on the ridge term. When is zero, the model produces the same result as linear regression. Finite values for penalize larger regression parameters and so using finite values for favors simpler models (those with smaller regression parameters). For LRR, the following values were scanned over: . KRR retains the parameter of LRR and adds two more. The choice of kernel for KRR can be considered as a discrete hyperparameter and here the Gaussian () and Laplacian () kernels are included in the search. For each of these kernels, there is an additional continuous hyperparameter, , which sets the width of the kernel. The search over and is exhaustive on a square grid with values .
Model selection and testing use fold cross validation (CV). In fold CV, the examples in the data set are split into equally sized sets. One of these folds is then held out, and the examples in the remaining folds are used to train a model. The mean absolute error (MAE) for predictions made on the examples in the hold out set is then computed. This process is repeated, with each one of the folds being the hold out set, leading to different MAEs. The average of these MAEs is then taken as a measure of model performance.
Here, model selection is based on a loop over =5 splits. Within this loop, one split is ignored and the remaining 80% of data is used to evaluate each set of hyperparameters, via 5fold cross validation. This generates 5 figures of merit for each set of hyperparameters. The hyperparameters that have the best average performance are then selected as the final model. The MAE for this model, reported below and in the Supporting Information, is the average from a final 5fold cross validation performed on the entire data set, using the selected hyperparameters. (See Supporting Information for the hyperparameters of each model.)
5 Features
The primary goal of this work is a comparison of models developed using different types of molecular features. Ideally, identical molecular geometries should lead to identical feature vectors, implying that the features are invariant with respect to molecular translations and rotations, and to reordering of the atoms.
For models trained on smaller molecules to transfer to larger molecules, a number of other properties of the feature vectors are desirable. One such property is that the length of the feature vector be independent of molecular size. For the Coulomb matrix and BoB features, this property does not hold, as the number of elements in the feature vector scales quadratically with the number of atoms in the largest molecule in the data set. To support models of differently sized molecules, such feature vectors are expanded to a length that can accommodate the largest molecules in the data set, with zeros added to pad the extended regions for smaller molecules. For the remaining features considered below, the feature vectors have a length that depends on the molecular diversity, e.g. atom and bond types, but does not depend on molecular size.
5.1 Null model
A “null” model is used to provide a baseline measure of the difficulty of the prediction task. The null model always predicts the mean value of the training data. The MAE of the null model therefore reflects the spread of the data.
5.2 Coulomb Matrix and Bag of Bonds
A class of features that has led to wellperforming models of molecular properties are those derived from the Coulomb matrix^{4, 5, 6, 16, 17, 18, 19, 21},
(1) 
where and are the atomic number and Cartesian position of the atom. This feature is appealing in that it summarizes the molecular structure in terms of Coulomb interactions between atoms. Alternative forms for the interaction, other than Coulomb interactions between bare nuclei, have been tried but were not found to enhance model performance^{34}.
While the Coulomb matrix is invariant to molecular rotations and translations, it is not invariant to reordering of the atoms. A number of means have been explored to address the dependence on atom order, including sorting the matrix based on the magnitude of the norm of the rows, using randomly permuted sets of matrices, and using the eigenvalues of the matrix^{5, 6, 16}. (The results generated here do not sort the Coulomb matrix, and simply arrange the lower triangle of the symmetric matrix into a vector.) A recent successful approach rearranges the Coulomb matrix into a “Bag of Bonds”, BoB, feature ^{34}. In BoB, offdiagonal elements of the matrix are first divided into bags, based on the elements involved in the Coulomb interaction (CH, CC, CO, etc). The values in each bag are then sorted from high to low values. The maximum size of each bag, across the molecules in the data set, is then determined and zeros are added to each bag so that all molecules have bags of the same length. The bags are then concatenated to make a single feature vector that is invariant to reordering of the atoms. Because the BoB model is just a reordering of Coulomb matrix elements, the length of the vector still grows quadratically with the size of the largest molecule in the data set. BoB has also recently been extended to include threebody and higher terms^{49}, leading to feature vectors who length grows as the third or higher power of the number of atoms in the largest molecule.
5.3 Connectivity Counts
This class of features summarizes the bonding pattern of a molecule through features that count occurrences of some pattern regarding the bonding between atoms. Each feature may be assigned a rank, based on the number of distinct atoms that are examined when testing for the presence of the pattern. Rank1 features count atom types, such as element or element plus coordination number, leading to a vector whose length is equal to the number of atom types. Rank2 features count bond types (single, aromatic, double, triple), leading to a vector whose length is the number of bond types. Rank3 features count triplets of bonded atoms, and so on.
For rank 1, each element of the feature vector, , holds the number of atoms in the molecule that have type ,
(2) 
where the sum is over atoms, , is a function that returns the atom type, and is the Kronecker symbol. Two atom typing schemes are explored here, leading to two different rank1 feature vectors denoted as and . For , atom type is defined by element only. For , atom type is defined by a combination of element and coordination number. Coordination number is taken as the number of bonds in which the atom participates.
For rank 2, each element of the feature vector, , holds the number of bonds in the molecule that have type ,
(3) 
where the sum is over atoms, and is a function that returns the bond type between atoms and , or null if a bond is not present. Two bond typing schemes are explored here, leading to two different rank2 feature vectors denoted as and . For , bond type is defined by the two elements participating in the bond. A bond is assumed to be present if the separation between atoms is less than the cutoffs listed in the Supporting Information. For , bond type is defined by the two elements participating in the bond and the bond order, with bond orders assigned based on bond length as discussed in the Supporting Information.
For rank 3, each element of the feature vector corresponds to a tuple of two bond types, , with a value
(4) 
where the sum is over atoms. The feature counts only situations where atoms and are bonded to a common atom, . The summation is thus over all triples of atoms to which a bond angle would typically be ascribed. Higherrank patterns are generalizations of Eq. (4) to higher order. For example, the rank 4 summation is
(5) 
with the summation being over all sets of four atoms to which a dihedral angle would typically be ascribed.
5.4 Encoded distances
The connectivity count features of Section 5.3 summarize the bonding pattern of a molecule but do not include information about the bond lengths, bond angles, or dihedral angles. The encoded features developed here allow information on the 3D structure of the molecule to be used while keeping the length of the feature vector independent of the size of the molecules included in the data set. We include only rank2 encodings, which summarize information regarding atomatom distances. Extension to higher ranks would lead to a substantial increase in the length of the total feature vector and are not explored here.
The encoded distance features may be viewed as generalizations of the above rank2 connectivity features. Because the bond types of Eq. (3) are based on atomatom distances, the rank2 connectivity counts of Section 5.3 may be viewed as encoding atomatom distances in a coarse grained manner. To introduce the notation we will use for encoding, we rewrite Eq. (3) as
(6) 
where and are the lower and upper limits of the distance ranges that define the bond type, , for the corresponding pair of elements, and is the distance between atoms and . Eq. (6) may be viewed as creating a histogram for each pair of elements. For feature vector , the histogram has just one bin and reports the number of bonds between those elements in the molecule. For feature vector , the number of bins in the histogram is the number of allowed bond orders between those elements.
A simple approach to adding more information to the feature vectors is to add more bins to the histogram. The information encoded can also be expanded to include distances between nonbonded atoms. To accomplish this, we create a uniformly spaced grid,
(7)  
For each pair of elements, we then get a set of features,
(8) 
The feature vector of Eq. (8) has two disadvantages. The first is that a small change in can lead to a discontinuous change in the feature vector when is near a grid point, . In addition, for values of that are large compared to the number of pairs of elements in the molecules of interest, the feature vector becomes sparse, making the learning prone to overfitting.
To overcome these disadvantages of the histogramlike features, we generalize Eq. (8) by replacing of Eq. (8) with an encoding function, ,
(9) 
returns a value in the range and may depend on a parameter, , that is used to alter the smoothness of the function. We consider two general classes of encoding functions: probability density functions (PDF) and cumulative distribution functions (CDF). For both of these, we consider three types of functions: the normal distribution, the logistic distribution, and the spike distribution (Table 1).
Name  Name2  in Eq. (9) 

Normal PDF  
Normal CDF  
Logistic PDF  
Logistic CDF  
Spike PDF  
Spike CDF 
For the PDF class of functions, the feature vector may be viewed as adding noise to each of the atomatom distances, , with noise distributed according to the distribution of the encoding function. The CDF class of functions integrate the corresponding PDFs, leading to a feature vector that increases monotonically with distance and has a slope that indicates the presence of atomatom separations near .
Empirical results, discussed below, suggest that model performance may be improved by limiting the geodesic distance, , between atoms included when computing the feature,
(10) 
where is an integer specifying the number of bonds along the shortest path between the atom and atom . The last term limits the summation to pairs of atoms that are separated by up to bonds. Other works have taken a similar approach by setting a cutoff based on the Euclidean distance between atoms ^{41, 10, 42, 43, 11}. In the current studies, model performance was found to be somewhat better when using a cutoff based on geodesic distance. Examples of encoded features for benzene can be seen in Figure 1.
The parameters specifying the encoded distance features were tuned by evaluating performance on the QM7 data set. The Supporting Information discusses the choice of and of Eq. (9). For the results reported below, was , and the grid consists of 100 points between and . The choice of cutoff distance, of Eq. (10), is based on the results in Figure 2, which show how model performance varies with for various encoded features. Results for the Coulomb matrix and BoB features are also shown, in which case is implemented by setting to zero those values of the Coulomb matrix that correspond to atoms separated by a geodesic distance greater than . For nearly all features, the best performance is obtained with of 2 or 3, corresponding to distances over which angles and dihedrals would be assigned to the molecular structure. This is true of all features, when LRR is used, and nearly all features, when KRR is used. The exceptions are KRR with (results not shown as the MAE is greater than 50 kcal/mol) and KRR with the Coulomb matrix. Note, however, that when the Coulomb matrix elements are reordered to form BoB features vectors, the optimal is in the 23 range. The observation that is in this range for a broad range of feature types and for both KRR and LRR models suggests that the atomization energy has a local character, whereby the energy of a given bond is primarily a function of its local environment. For the remainder of this work, is set to 3 for the encoded features. To allow comparison with past work, the BoB features do not limit .
Because the KRR models outperform LRR by more than 0.5 kcal/mol, we also discuss only results from KRR. (Full results from both LRR and KRR are tabulated in the Supporting Information.)
6 Results
Throughout this section, we summarize results by listing the features, with notation as in Section 5, followed in parentheses by the MAE of a KRR model developed as in Section 4. For example, (7.69 kcal/mol) indicates that a KRR model using a feature vector leads to an MAE of 7.69 kcal/mol. The Supporting Information contains an extensive list of results from both LRR and KRR models based on different feature sets. Selected results are discussed here.
We first consider models that use a single type of feature from Section 5 to predict the atomization energies of the QM7 data set. For connectivity counts, the best performing single features are (7.69 kcal/mol) and (6.36 kcal/mol). The use of the 3D molecular structure via encoded distance features lowers the error substantially, with the best feature being (2.45 kcal/mol). The other encoding functions perform nearly as well, except for (146 kcal/mol) which is essentially a histogram and so gives a discontinuous and sparse feature vector. This suggests that smooth encoding functions lead to enhanced performance.
Model performance can be enhanced by concatenating features of different ranks. The effects of adding features of increasing rank are shown in Figure 3, with selected results shown in Table 2. The upper branch in Figure 3 (solid lines) shows results when only connectivity counts are included and all feature vectors begin with . These have the advantage of requiring only a line drawing or SMILES string of the molecule. Addition of higherrank connectivity counts improves performance up though rank 4, with addition of rank 5 leading to a small degradation. Figure 3 also examines whether inclusion of bond order improves performance, e.g. whether (red line) or (black line) leads to a lower MAE. For feature vectors including only connectivity counts, inclusion of bond order enhances performance at rank 3, but degrades performance at higher ranks. The lowest MAE for concatenated connectivity counts is (3.40 kcal/mol).
The lower branch (dashed lines) of Figure 3 shows results from feature vectors that begin with , and so include encoded distances from 3D molecular structures. For feature vectors that include encoded distances, inclusion of rank 3 leads to substantial improvements in model performance and inclusion of rank 5 degrades performance. Inclusion of rank 4 either slightly improves or slightly degrades performance, depending on both the type of rank2 encoding employed and the property being predicted. In contrast to the connectivity only features, inclusion of bond order leads to better performance at both rank 3 and rank 4. The best performance on QM7 atomization energies is (1.19 kcal/mol). Inclusion of rank 4, as (1.25 kcal/mol), slightly degrades performance. The performance is significantly better than the 2.40 kcal/mol from BoB, which was the previous stateoftheart for the QM7 data set^{34}.
First thousand  
molecules of QM9  
feature  QM7  CV  20  50  133 
Null  179.01  189.45  290.57  322.72  425.12 
Coulomb matrix  3.37  3.83  43.78  66.56  106.09 
BoB  2.40  2.43  9.68  22.21  29.99 
14.58  15.46  18.20  21.65  20.99  
7.69  7.52  10.42  16.67  15.73  
6.36  6.84  8.26  15.39  15.73  
6.88  6.37  8.75  11.11  11.17  
5.28  4.19  6.53  10.03  9.99  
4.51  4.25  6.34  8.36  8.59  
3.64  3.68  5.70  9.18  9.89  
4.13  3.81  6.10  9.90  10.51  
3.40  3.56  5.43  8.64  9.42  
3.87  3.69  5.92  9.53  10.25  
2.74  2.10  3.26  5.83  5.76  
2.45  1.65  2.65  5.18  4.71  
1.68  1.57  2.42  4.20  4.09  
1.61  1.44  2.32  4.13  3.95  
1.43  1.28  2.03  3.72  3.60  
1.73  1.38  2.21  4.36  4.21  
1.46  1.34  1.89  3.77  3.52  
1.56  1.24  1.98  3.93  3.77  
1.45  1.31  1.83  3.66  3.41  
1.45  1.14  2.03  3.84  3.51  
1.19  1.05  1.63  3.37  3.05  
1.25  1.16  1.62  3.44  3.30 
We next use the QM9 data set to explore the extent to which models trained on smaller molecules may be transferred to larger molecules. KRR models, using the features of Table 2, were trained to the 3,993 molecules of the QM9 data set that have seven or fewer heavy atoms. Twofold cross validation was used to select hyperparameters and the average MAE of the two folds, computed using the selected hyperparameters, is listed in the column labeled CV (for cross validation) in Table 2. The remaining columns list the MAE as molecules from the QM9 data set are added to the small molecules used to train the model. Because the molecules are sorted by size, the average molecular size increases as molecules are added. A useful quantitative measure of this increase is provided by the MAE of the null model, which reports on the spread of the atomization energies. The MAE of the null model for the entire QM9 data set is 2.4 times that of the smaller molecules of QM7.
We first consider models that use only connectivity features. For simple models based on features that count elements (feature ) or bond types (feature ) the MAE for QM9 is 1.4 and 2.0 times that of QM7, respectively. This is less than the factor of 2.4 seen for the null model (Table 2) and suggests that the average atomization energy per element, or average strength of a given bond type, does not change substantially with molecular size. Inclusion of higherrank connectivity counts substantially improves model performance, with leading to the best performance on both QM7 and QM9. With , the MAE of QM9 is 2.8 times that of QM7, only slightly larger than the 2.4 increase in the null model.
Not surprisingly, models that use the Coulomb matrix and BoB features do not transfer well to larger molecules. The MAEs of such models increase by over a factor of ten as one moves from small molecules to the entire QM9 data set. In addition, the MAEs on the full QM9 data set are larger than those from models based either on atom counts, , or bond counts, .
Good transfer to larger molecules is, on the other hand, obtained for models based on encoded distance features. For models based on the or feature alone, the MAE for QM9 relative to QM7 increases by factors of 2.1 and 1.9, respectively, which is comparable to or less than the 2.4 increase of the null model. This advantage is retained upon inclusion of connectivity features of different rank. For both and , the MAE of QM9 is 2.6 times that of QM7, which is only slightly larger than the 2.4 increase in the null model.
Mean Absolute Errors  

Property  Null  CM^{16}  BoB\tnotextnote:qm7b10  SOAP^{23}  QM  
E (PBE0) (eV)  7.69  0.04  0.05  0.16  0.08  0.04  0.090.23\tnotextnote:qm7b1 
(PBE0) ()  1.04  0.06  0.07  0.11  0.08  0.05  0.040.27\tnotextnote:qm7b2 
(SCS) ()  1.15  0.05  0.06  0.08  0.04  0.02  0.040.27\tnotextnote:qm7b2 
HOMO (GW) (eV)  0.54  0.13  0.14  0.16  0.14  0.12  – 
HOMO (PBE0) (eV)  0.49  0.12  0.12  0.15  0.13  0.11  2.08\tnotextnote:qm7b6 
HOMO (ZINDO) (eV)  0.78  0.13  0.13  0.15  0.14  0.13  0.79\tnotextnote:qm7b7 
LUMO (GW) (eV)  0.31  0.13  0.13  0.13  0.15  0.12  – 
LUMO (PBE0) (eV)  0.53  0.09  0.09  0.12  0.11  0.08  1.30\tnotextnote:qm7b7 
LUMO (ZINDO) (eV)  1.08  0.10  0.10  0.11  0.14  0.10  0.93\tnotextnote:qm7b7 
IP (ZINDO) (eV)  0.78  0.16  0.18  0.17  0.19  0.19  0.20, 0.15\tnotextnote:qm7b4 
EA (ZINDO) (eV)  1.17  0.09  0.11  0.11  0.15  0.13  0.16\tnotextnote:qm7b8, 0.11\tnotextnote:qm7b4 
(ZINDO) (eV)  1.54  0.23  0.28  0.13  0.20  0.18  0.18\tnotextnote:qm7b8, 0.21\tnotextnote:qm7b9 
(ZINDO) (eV)  2.62  1.23  1.27  1.06  1.30  1.56  – 
(ZINDO) (Arb.)  0.15  0.06  0.07  0.07  0.08  0.08  – 

0.15 eV from PBE0, MAE of formation enthalpy for the G3/99 set^{50, 51}.
0.23 eV from PBE0, MAE of atomization energy for six small molecules^{52, 53}.
0.090.22 eV from B3LYP, MAE of atomization energy from various studies^{54}. 
0.050.27 from B3LYP, MAE from various studies^{54}.
0.040.14 from MP2, MAE from various studies^{54}. 
B3LYP, MAE from various studies^{54}.

MAE from GW values^{16}.

ZINDO, MAE for a set of 17 retinal analogues^{55}.

PBE0, MAE for the G3/99 set^{50, 51}.

TDDFT(PBE0), MAE for a set of 17 retinal analogues^{55}.

Calculations done in this work for comparison.
The degree to which various feature vectors are useful for creating models of other target properties is explored using the QM7b data set. Table 3 lists the MAE from BoB models developed here, along with results reported from use of the Coulomb matrix ^{16} and REMatchSOAP kernel^{23}. These are compared to models developed using , which gives the best performance on QM7 atomization energies (Table 2), and , which extends this to rank 4 and gives comparable results on QM7b. For the remaining studies reported below, which include larger molecules and larger data sets, we use the longer feature. The results show that the performance with and is comparable to BoB and REMatchSOAP, all of which lead to predictions whose MAEs are at least as low as the typical deviation of the quantum methods from experiment used to generate the QM7b data set (QM in Table 3).
The learning rate of models based on various feature vectors are shown in Figure 4. Similar learning rates are obtained across feature vectors, with the decrease in MAE for models based on simple bond counting, , being similar to those of BoB, , and . Figure 5 provides additional insight by showing the MAE for both the training and test data. The point at which the train and test error converge indicates the limit beyond which inclusion of additional data may no longer be expected to improve model performance. The model based on the concatenated feature approaches this convergence significantly more slowly than the single or features, suggesting that the increase in performance of comes at the cost of a somewhat slower learning rate. For BoB, the training error is nearly zero, suggesting that inclusion of additional data has the potential to further reduce error.
Sampled Molecules  

Property  Null  500  1000  2000  4000  8000  16000 
(Debye)  1.19  0.92(1)  0.86(1)  0.83(1)  0.76(1)  0.68(0)  0.63(0) 
(Bohr)  6.30  0.98(4)  0.77(2)  0.66(1)  0.60(4)  0.49(2)  0.41(1) 
HOMO (eV)  0.44  0.25(0)  0.22(1)  0.19(0)  0.17(0)  0.15(0)  0.12(0) 
LUMO (eV)  1.05  0.39(1)  0.32(2)  0.26(0)  0.22(0)  0.18(0)  0.15(0) 
Gap (eV)  1.08  0.44(0)  0.39(2)  0.32(0)  0.28(0)  0.24(0)  0.19(0) 
(Bohr)  202.52  86(3)  71(2)  59.41(91)  50.40(54)  44.94(35)  36.31(19) 
zpve (kcal/mol)  16.63  0.29(1)  0.21(1)  0.17(0)  0.14(0)  0.11(0)  0.09(0) 
U (kcal/mol)  188.47  7.28(33)  4.67(10)  3.64(13)  2.75(5)  2.08(5)  1.58(2) 
U (kcal/mol)  190.18  7.31(33)  4.69(10)  3.63(11)  2.77(5)  2.09(5)  1.59(2) 
H (kcal/mol)  191.55  7.32(33)  4.69(10)  3.63(11)  2.77(5)  2.09(5)  1.59(2) 
G (kcal/mol)  173.30  7.08(31)  4.58(7)  3.57(11)  2.73(4)  2.06(5)  1.56(2) 
C (cal/(mol K))  4.95  0.32(0)  0.24(0)  0.20(0)  0.16(0)  0.12(0)  0.10(0) 
The performance and learning rate of models, based on the feature vector, that target a range of properties in the QM9 data set are shown in Table 4. Performance and learning rate on all thermodynamic energies are roughly equivalent, giving MAEs of 1.6 kcal/mol. The heat capacity is also well predicted, with an MAE of 0.10 cal/(mol K) that is 2% that of the null model. Models of the HOMO and LUMO energies, and the HOMOLUMO gap, also lead to reasonable performance, with MAEs below 0.2 eV.
The performance on dipole moment, , is relatively poor, with the model reducing the MAE of the null model by only about one half. This may be due to having a strong dependence on the overall geometry of the molecule. Significantly better performance is obtained for the polarizability, , where MAE of the model is 6.5% that of the null model. Although polarizability is a global property, it is primarily dependent on the total volume of the molecule and thus is likely well modeled as a sum of fragment properties. The dipole moment, on the other hand, depends on the detailed global arrangement of the atoms.
7 Discussion
This paper introduces feature vectors for machine learning of molecular properties whose length depends on the diversity of molecules in the data set, e.g. the number of elements and bond types, but is independent of the size of the molecules. These features lead to models whose performance on atomization energies, and other properties tabulated in well studied data sets, is comparable to or better than previous models based on the Coulomb matrix or BoB features. In contrast to these previous models, the features introduced here allow models trained on small molecules to be applied successfully to larger molecules.
Although models based solely on connectivity counts would be ideal, because they require only the information present in a SMILES string, such models do not perform well. This reflects the fact that chemistry is more complex than simple Lewis structures—even for organic molecules. When information regarding the 3D geometry of the molecule is included via encoded distance features, the model performance improves substantially and this performance enhancement is retained upon transfer from smaller to larger molecules.
The current work focuses on KRR models, for which the features are used to compute distances between pairs of molecules. These distances are then passed through a kernel for use in the regression. The above results suggest that the features developed here are useful for computing distances between molecules, when coupled to the standard kernels employed in machine learning. The development of kernels that better describe molecular similarity is left to future work. The degree to which the features introduced here lead to improved performance in deep learning and other ML models also remains to be explored; although, we note that the models with the best current performance on the data sets studied here use KRR.
8 Acknowledgements
The authors thank Haichen Li for helpful discussions. The authors would also like to thank Raghunathan Ramakrishnan for helpful discussions on details regarding the BoB model and the data sets studied here. This work was supported by the National Science Foundation under grant CHE1027985.
9 Supporting Information Available
Supporting information includes the following: distances between atoms used to define the bond types of Section 5.3; details regarding the implementation of the encoded distance features of Section 5.4; and a listing of the MAE and hyperparameters from LRR and KRR models of the QM7 data set, obtained using a wide variety of different feature vectors.
References
 Lill 2007 Lill, M. A. Drug Discovery Today 2007, 12, 1013–1017.
 Lusci et al. 2013 Lusci, A.; Pollastri, G.; Baldi, P. J. Chem. Inf. Model. 2013, 53, 1563–1575.
 Wang et al. 2004 Wang, X.; Wong, L.; Hu, L.; Chan, C.; Su, Z.; Chen, G. J. Phys. Chem. A 2004, 108, 8514–8525.
 Rupp et al. 2012 Rupp, M.; Tkatchenko, A.; Müller, K.R.; von Lilienfeld, O. A. Phys. Rev. Lett. 2012, 108, 058301.
 Hansen et al. 2013 Hansen, K.; Montavon, G.; Biegler, F.; Fazli, S.; Rupp, M.; Scheffler, M.; von Lilienfeld, O. A.; Tkatchenko, A.; Müller, K.R. J. Chem. Theory Comput. 2013, 9, 3404–3419.
 Montavon et al. 2012 Montavon, G.; Hansen, K.; Fazli, S.; Rupp, M.; Biegler, F.; Ziehe, A.; Tkatchenko, A.; Lilienfeld, A. V.; Müller, K.R. Learning Invariant Representations of Molecules for Atomization Energy Prediction. Advances in Neural Information Processing Systems. 2012; pp 440–448.
 Bartók et al. 2013 Bartók, A. P.; Kondor, R.; Csányi, G. Phys. Rev. B: Condens. Matter Mater. Phys. 2013, 87, 184115.
 Qu et al. 2013 Qu, X.; Latino, D. A.; Airesde Sousa, J. J. Cheminf. 2013, 5, 34.
 Meredig et al. 2014 Meredig, B.; Agrawal, A.; Kirklin, S.; Saal, J. E.; Doak, J.; Thompson, A.; Zhang, K.; Choudhary, A.; Wolverton, C. Phys. Rev. B: Condens. Matter Mater. Phys. 2014, 89, 094104.
 Behler 2011 Behler, J. Phys. Chem. Chem. Phys. 2011, 13, 17930–17955.
 Barker et al. 2016 Barker, J.; Bulin, J.; Hamaekers, J.; Mathias, S. 2016,
 Hirn et al. 2015 Hirn, M.; Poilvert, N.; Mallat, S. arXiv preprint arXiv:1502.02077 2015,
 OlivaresAmaya et al. 2011 OlivaresAmaya, R.; AmadorBedolla, C.; Hachmann, J.; AtahanEvrenk, S.; SánchezCarrera, R. S.; Vogt, L.; AspuruGuzik, A. Energy Environ. Sci. 2011, 4, 4849–4861.
 Hachmann et al. 2014 Hachmann, J.; OlivaresAmaya, R.; Jinich, A.; Appleton, A. L.; BloodForsythe, M. A.; Seress, L. R.; RománSalgado, C.; Trepte, K.; AtahanEvrenk, S.; Er, S.; Shrestha, S.; Mondal, R.; Sokolov, A.; Bao, Z.; AspuruGuzik, A. Energy Environ. Sci. 2014, 7, 698–704.
 O’Boyle et al. 2011 O’Boyle, N. M.; Campbell, C. M.; Hutchison, G. R. J. Phys. Chem. C 2011, 115, 16200–16210.
 Montavon et al. 2013 Montavon, G.; Rupp, M.; Gobre, V.; VazquezMayagoitia, A.; Hansen, K.; Tkatchenko, A.; Müller, K.R.; von Lilienfeld, O. A. New J. Phys. 2013, 15, 095003.
 Ramakrishnan et al. 2015 Ramakrishnan, R.; Dral, P. O.; Rupp, M.; von Lilienfeld, O. A. J. Chem. Theory Comput. 2015, 11, 2087–2096.
 Ramakrishnan et al. 2015 Ramakrishnan, R.; Hartmann, M.; Tapavicza, E.; von Lilienfeld, O. A. J. Chem. Phys. 2015, 143.
 Ramakrishnan and von Lilienfeld 2015 Ramakrishnan, R.; von Lilienfeld, O. A. CHIMIA Internationaltional Journal for Chemistry 2015, 69, 182–186.
 von Lilienfeld et al. 2015 von Lilienfeld, O. A.; Ramakrishnan, R.; Rupp, M.; Knoll, A. Int. J. Quantum Chem. 2015, 115, 1084–1093.
 Rupp et al. 2015 Rupp, M.; Ramakrishnan, R.; von Lilienfeld, O. A. J. Phys. Chem. Lett. 2015, 6, 3309–3313.
 LopezBezanilla and von Lilienfeld 2014 LopezBezanilla, A.; von Lilienfeld, O. A. Phys. Rev. B: Condens. Matter Mater. Phys. 2014, 89, 235411.
 De et al. 2016 De, S.; Bartók, A. P.; Csányi, G.; Ceriotti, M. Phys. Chem. Chem. Phys. 2016, 18, 13754–13769.
 Ghiringhelli et al. 2015 Ghiringhelli, L. M.; Vybiral, J.; Levchenko, S. V.; Draxl, C.; Scheffler, M. Phys. Rev. Lett. 2015, 114, 105503.
 Kusne et al. 2014 Kusne, A. G.; Gao, T.; Mehta, A.; Ke, L.; Nguyen, M. C.; Ho, K.M.; Antropov, V.; Wang, C.Z.; Kramer, M. J.; Long, C.; Takeuchi, I. Sci. Rep. 2014, 4.
 Isayev et al. 2015 Isayev, O.; Fourches, D.; Muratov, E. N.; Oses, C.; Rasch, K.; Tropsha, A.; Curtarolo, S. Chem. Mater. 2015, 27, 735–743.
 Sadowski and Baldi 2013 Sadowski, P.; Baldi, P. J. Chem. Inf. Model. 2013, 53, 3127–3130.
 Nagata et al. 2012 Nagata, K.; Randall, A.; Baldi, P. Proteins: Struct., Funct., Bioinf. 2012, 80, 142–153.
 Carr et al. 2016 Carr, S.; Garnett, R.; Lo, C. BASC: Applying Bayesian Optimization to the Search for Global Minima on Potential Energy Surfaces. International Conference on Machine Learning. 2016; pp 898–907.
 Wong and Kolter 2015 Wong, E.; Kolter, J. Z. An SVD and Derivative Kernel Approach to Learning from Geometric Data. Association for the Advancement of Artificial Intelligence. 2015; pp 1889–1895.
 Ediz et al. 2009 Ediz, V.; Monda, A. C.; Brown, R. P.; Yaron, D. J. J. Chem. Theory Comput. 2009, 5, 3175–3184.
 Sun et al. 2014 Sun, J.; Wu, J.; Song, T.; Hu, L.; Shan, K.; Chen, G. J. Phys. Chem. A 2014, 118, 9120–9131.
 Weininger 1988 Weininger, D. J. Chem. Inf. Comput. Sci. 1988, 28, 31–36.
 Hansen et al. 2015 Hansen, K.; Biegler, F.; Ramakrishnan, R.; Pronobis, W.; von Lilienfeld, O. A.; Müller, K.R.; Tkatchenko, A. J. Phys. Chem. Lett. 2015, 6, 2326–2331.
 Morgan 1965 Morgan, H. L. J. Chem. Doc. 1965, 5, 107–113.
 Moussa 2012 Moussa, J. E. Phys. Rev. Lett. 2012, 109, 059801.
 Huan et al. 2015 Huan, T. D.; MannodiKanakkithodi, A.; Ramprasad, R. Phys. Rev. B: Condens. Matter Mater. Phys. 2015, 92, 014106.
 Rücker and Rücker 1993 Rücker, G.; Rücker, C. J. Chem. Inf. Comput. Sci. 1993, 33, 683–695.
 Rogers and Hahn 2010 Rogers, D.; Hahn, M. J. Chem. Inf. Comput. Sci. 2010, 50, 742–754.
 Mahé et al. 2005 Mahé, P.; Ueda, N.; Akutsu, T.; Perret, J.L.; Vert, J.P. J. Chem. Inf. Comput. Sci. 2005, 45, 939–951.
 Behler and Parrinello 2007 Behler, J.; Parrinello, M. Phys. Rev. Lett. 2007, 98, 146401.
 Gastegger and Marquetand 2015 Gastegger, M.; Marquetand, P. J. Chem. Theory Comput. 2015, 11, 2187–2198.
 Gastegger et al. 2016 Gastegger, M.; Kauffmann, C.; Behler, J.; Marquetand, P. J. Chem. Phys. 2016, 144.
 Seko et al. 2015 Seko, A.; Takahashi, A.; Tanaka, I. Phys. Rev. B: Condens. Matter Mater. Phys. 2015, 92, 054113.
 Blum and Reymond 2009 Blum, L. C.; Reymond, J.L. J. Am. Chem. Soc. 2009, 131, 8732.
 Ruddigkeit et al. 2012 Ruddigkeit, L.; van Deursen, R.; Blum, L. C.; Reymond, J.L. J. Chem. Inf. Comput. Sci. 2012, 52, 2864–2875.
 Ramakrishnan et al. 2014 Ramakrishnan, R.; Dral, P. O.; Rupp, M.; von Lilienfeld, O. A. Sci. Data 2014, 1.
 Pedregosa et al. 2011 Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; Duchesnay, E. Journal of Machine Learning Research 2011, 12, 2825–2830.
 Bing and von Lilienfeld 2016 Bing, H.; von Lilienfeld, O. A. arXiv preprint arXiv:1502.02077 2016,
 Staroverov et al. 2003 Staroverov, V. N.; Scuseria, G. E.; Tao, J.; Perdew, J. P. J. Chem. Phys. 2003, 119, 12129–12137.
 Curtiss et al. 2005 Curtiss, L. A.; Redfern, P. C.; Raghavachari, K. J. Chem. Phys. 2005, 123, 124107.
 Zhao et al. 2004 Zhao, Y.; Pu, J.; Lynch, B. J.; Truhlar, D. G. Phys. Chem. Chem. Phys. 2004, 6, 673–676.
 Lynch and Truhlar 2003 Lynch, B. J.; Truhlar, D. G. J. Phys. Chem. A 2003, 107, 8996–8999.
 Koch and Holthausen 2015 Koch, W.; Holthausen, M. C. A Chemist’s Guide to Density Functional Theory; John Wiley & Sons, 2015.
 López et al. 2006 López, C. S.; Faza, O. N.; Estévez, S. L.; de Lera, A. R. J. Comput. Chem. 2006, 27, 116–123.