Measuring the Similarity between Materials with an Emphasis on the Materials’ Distinctiveness
Abstract
In this study, we establish a basis for selecting similarity measures when applying machine learning techniques to solve materials science problems. This selection is considered with an emphasis on the distinctiveness between materials, which reflects their nature well. We perform a case study with a dataset of rare-earth transition metal crystalline compounds represented using the Orbital Field Matrix descriptor and the Coulomb Matrix descriptor. We predict formation energies using k-nearest neighbors regression, ridge regression, and kernel ridge regression. Through detailed analyses of the resulting prediction accuracy, we examine the relationship between the characteristics of the material representations and similarity measures, and the complexity of the energy function they can capture. Empirical experiments and theoretical analysis reveal that similarity measures and kernels that minimize the loss of materials’ distinctiveness improve the prediction performance.
Keywords: descriptor; similarity measure; learning method; material distinctiveness
1 Introduction
A small change in the chemical composition or structure of a material can lead to a significant change in its properties. For example, differences in the chirality of a honeycomb network of carbon atoms can lead to a distinctive difference in the physical properties of nanotubes. The distinctiveness of materials, which results in the diversity of materials in nature, is a main characteristic of material data. Thus, this feature needs to be represented in a metric that allows for a comparison of materials in a reliable, efficient, and useful way.
The main target of machine learning systems when mining material data is to determine a likely function $\hat{f}$, which indicates the relation between the materials’ attributes and their physical properties. Typically, these systems include two main components: (i) data representations (i.e., descriptors); and (ii) operators (including similarity measures between materials and learning methods) for mining the physical and chemical properties of materials. These components are designed with the aim of reflecting domain knowledge and the nature of material data.
To render computational methods tractable for materials in datasets, the geometrical, topological, or electronic characteristics of the materials need to be represented in the form of numerical variables. Descriptors commonly encode the information of a material by a vector whose number of dimensions, $d$, and values in each dimension depend on the information selected to describe the materials with a specific purpose for mining tasks. To represent material structures, several descriptors have been proposed. Behler et al. utilized atom-distribution-based symmetry functions to represent the local chemical environment of atoms [1]. Rupp et al. proposed the Coulomb matrix (CM), which represents materials via the Coulomb repulsion between all possible nuclei in the material [2]. In addition, Isayev et al. used the band structure and density of states (DOS) fingerprint vectors as descriptors of materials to visualize material space [3]. Zhu et al. introduced another fingerprint representation for crystals and used this to define the configurational distance between crystalline structures [4]. Pham et al. proposed a descriptor for encoding atomic orbital information, called the orbital field matrix (OFM) [5, 6].
Similarity measures are mathematically implemented as scalar-valued functions that take two vectors $x$ and $x'$ representing materials as input: $s(x, x') \in \mathbb{R}$. The use of these measures is subjective insofar as they depend on a specific domain or application. Conventionally, materials science studies begin by grouping similar materials in order to explore the patterns and rules in these materials. Consequently, measuring material similarity is considered a key technique in materials informatics [7]. The advantages and disadvantages of many similarity measures were addressed in [8], and the argument that similar structures lead to similar properties was offered in [9, 10]. However, the validity of this argument was reconsidered by Maggiora et al., who showed that small chemical modifications can lead to significant changes in biological activity [11]. Because the nature of materials is fundamentally diverse, Riniker et al. addressed the problem of partially losing the transparency among fingerprint types by using fuzzier similarity methods [12]. In addition, Maldonado et al. optimized measures of molecular similarity and diversity based on selecting and classifying descriptors [13]. Moreover, several methods have been proposed for comparing crystalline materials [14, 4].
Most similarity measures estimate the difference between two materials represented by vectors according to each vector’s dimension, and then provide an average of these differences. This can make the local differences (i.e., the difference in each dimension) fainter (or fuzzier). However, small modifications in materials can induce significant changes in the materials’ properties, as mentioned in previous studies [11, 12]. This poses a key problem of how to select a similarity measure—that is, whether the loss in the materials’ distinctiveness is acceptable when comparing materials in a specific context.
This study aims to establish the basis for choosing appropriate similarity measures between materials in a given context by bridging fundamental concepts in machine learning with the nature of material data. We focus on modeling materials’ distinctiveness, and we explore whether an association exists between this property and the quality of approximating the energy function. By analyzing the characteristics of the energy function and descriptors, we propose novel quantitative protocols for selecting similarity measures.
The remainder of the paper is organized as follows:

In Section 2, we study the orbital field matrix and Coulomb matrix as descriptors to represent materials in vector space. These descriptors can effectively predict materials’ formation energies, as we explain based on previous studies.

In Subsection 5.1, we propose a method for investigating how similarity measures of interest minimize the loss of materials’ distinctiveness. From Subsection 5.2 to Subsection 5.7, by analyzing several learning methods that use similarity measures to predict crystal formation energies, we demonstrate that similarity measures need to be selected such that they fit with the characteristics of the descriptors and learning methods, as illustrated in Figure 1.
Our experimental results indicate that all of these methods show improved prediction performance when they capture the complexity of the energy function of crystals. Theoretical and empirical interpretations of these results from multiple perspectives reveal that descriptors that reflect the materials’ distinctiveness (or identity) and similarity measures that minimize the loss of materials’ distinctiveness help to improve the performance of formation energy prediction.
2 Material descriptor
In this study, we aim to explore the rules of using similarity measures for materials by interpreting the formation energy prediction performance. Therefore, we consider two recently proposed vector representations of materials, which showed effective predictions of materials’ formation energies: the Coulomb matrix (CM) and the orbital field matrix (OFM).
2.1 Orbital field matrix
The OFM is a novel descriptor that was proposed recently [5, 6]. It uses the valence atomic configuration to represent the structure of materials. In the OFM representation, a material is assumed to be composed of building blocks that are called local structures. Each local structure includes a central atom and its environmental (neighboring) atoms. First, each atom is represented by a one-hot vector based on a dictionary of subshell orbitals: $D = \{s^1, s^2, p^1, \ldots, p^6, d^1, \ldots, d^{10}, f^1, \ldots, f^{14}\}$. We denote the vector of the central atom by $\vec{o}_c$, and the vector of the $k$-th neighboring atom by $\vec{o}_{n_k}$. Second, the vector representing the environment of each atom $p$ in a structure, $\vec{O}^{(p)}_{env}$, is computed as follows:
$\vec{O}^{(p)}_{env} = \sum_{k=1}^{n_p} w_k \, \vec{o}_{n_k}$  (1)
where the weight $w_k$ measures the contribution of the $k$-th neighboring atom, and $n_p$ is the number of neighboring atoms. The local structure is represented by a matrix $X^{(p)}$, in which element $X^{(p)}_{ij}$ represents the number of environmental atomic orbitals of type $j$ coordinated with a central atomic orbital of type $i$. Hence, the representation matrix of a local structure is
$X^{(p)} = \vec{o}_c^{\,\top} \vec{O}^{(p)}_{env} = \sum_{k=1}^{n_p} \frac{\theta_k}{\theta_{\max}} \, \vec{o}_c^{\,\top} \vec{o}_{n_k}$  (2)
where $w_k = \theta_k / \theta_{\max}$ is the weight representing the contribution from atom $k$ to the coordination number of the central atom; $\theta_k$ is the solid angle determined by the face of the Voronoi polyhedron that separates atom $k$ and the central atom; and $\theta_{\max}$ is the maximum of all solid angles determined by this Voronoi polyhedron.
The distance between the central atom and the neighboring atom is incorporated in the representation of local structures as follows:
$X^{(p)} = \sum_{k=1}^{n_p} \frac{\theta_k}{\theta_{\max}} \, \zeta(r_k) \, \vec{o}_c^{\,\top} \vec{o}_{n_k}$  (3)
where $\zeta(r_k)$ is a distance-dependent weight function of the distance $r_k$ between the central atom and neighboring atom $k$. Finally, the descriptor for the entire material is the mean of the descriptors of its local structures.
In an extension of the OFM, the information regarding the central atom is incorporated by simply concatenating $\vec{o}_c$ to the matrix $X^{(p)}$ as a new column, as follows:
$\tilde{X}^{(p)} = \left( X^{(p)} \,\middle|\, \vec{o}_c^{\,\top} \right)$  (4)
In this study, we use this extension to the OFM to predict crystals’ formation energies.
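The construction above can be sketched numerically: the local-structure matrix of Eq. (2) is a weighted sum of outer products of one-hot orbital vectors. The following is a minimal illustration; the tiny four-orbital dictionary and the solid-angle weights are illustrative assumptions, not the paper’s full 32-orbital dictionary or real Voronoi geometry.

```python
import numpy as np

ORBITALS = ["s1", "s2", "p1", "p2"]  # toy dictionary (the paper uses 32 subshell orbitals)

def one_hot(orbitals):
    """Represent an atom by a one-hot vector over the orbital dictionary."""
    v = np.zeros(len(ORBITALS))
    for o in orbitals:
        v[ORBITALS.index(o)] = 1.0
    return v

def local_ofm(central, neighbors, weights):
    """X^(p) = sum_k w_k * o_c^T o_{n_k}, with w_k standing in for theta_k / theta_max."""
    o_c = one_hot(central)
    X = np.zeros((len(ORBITALS), len(ORBITALS)))
    for nk, w in zip(neighbors, weights):
        X += w * np.outer(o_c, one_hot(nk))
    return X

# Central atom with orbital s1; two neighbors with weights 1.0 and 0.5.
X = local_ofm(["s1"], [["p1"], ["s2"]], [1.0, 0.5])
print(X[0, 2], X[0, 1])  # contributions land in the central-orbital row
```

Each neighbor contributes its weight to the matrix cell indexed by (central orbital, neighbor orbital), which is exactly the coordination-counting interpretation given above.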
2.2 Coulomb matrix
The CM [2, 15] is a descriptor that encodes the structure of a material using nuclear charges and the 3D coordinates of each constituent atom in the material, as follows:
$C_{ij} = \begin{cases} 0.5\, Z_i^{2.4} & \text{if } i = j \\ \dfrac{Z_i Z_j}{|\mathbf{R}_i - \mathbf{R}_j|} & \text{if } i \neq j \end{cases}$  (5)

where $Z_i$ is the nuclear charge of atom $i$ and $\mathbf{R}_i$ is its position.
To deal with the atom-ordering problem in the CM, the authors used (i) the eigenspectrum representation, which first obtains the eigenvalues of each Coulomb matrix and then uses the sorted eigenvalues (i.e., the spectrum) as the representation, and (ii) sorted Coulomb matrices, which choose the permutation of atoms whose associated Coulomb matrix satisfies $\|C_i\| \geq \|C_{i+1}\|$ for all $i$, where $C_i$ is the $i$-th row of the Coulomb matrix. In practice, padding the Coulomb matrices with zero-valued entries is required in order to avoid differences in matrix size induced by differences in the number of atoms in each material.
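The eigenspectrum variant with zero-padding can be sketched as follows. The charges and coordinates below are a toy H2-like diatomic chosen for illustration, not data from the study.

```python
import numpy as np

def coulomb_matrix(Z, R):
    """Build the Coulomb matrix of Eq. (5) from charges Z and positions R."""
    n = len(Z)
    C = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                C[i, j] = 0.5 * Z[i] ** 2.4              # diagonal: 0.5 * Z_i^2.4
            else:
                C[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    return C

def eigenspectrum(C, pad_to):
    """Sorted (descending) eigenvalues, zero-padded to a fixed length."""
    ev = np.sort(np.linalg.eigvalsh(C))[::-1]
    return np.pad(ev, (0, pad_to - len(ev)))

C = coulomb_matrix([1, 1], np.array([[0.0, 0.0, 0.0], [0.74, 0.0, 0.0]]))
spec = eigenspectrum(C, pad_to=5)
print(spec)
```

Padding all spectra to a common length makes materials with different atom counts directly comparable in one vector space, as required by the similarity measures discussed next.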
3 Similarity measures of interest
In materials informatics, the similarity between materials is quantified through several measures that mostly estimate the difference between materials. These measures are called distances if they satisfy all conditions of a metric (i.e., non-negativity, identity of indiscernibles, symmetry, and the triangle inequality). If they do not fully satisfy these conditions, they are called dissimilarity measures.
Let $x, x' \in \mathbb{R}^d$ be two vectors. Similarity measures are mathematical functions taking $x$ and $x'$ as their input with a scalar as output. In this study, we investigate several well-known similarity measures that are commonly utilized in vector space, as follows:

Hamming distance: $d_{H}(x, x') = \sum_{i=1}^{d} \mathbb{1}[x_i \neq x'_i]$; the Hamming distance between two vectors is computed by counting the number of differing elements in these vectors. Although OFM- and CM-based material vectors are numerical rather than binary, they share many common zero-valued elements; hence, their Hamming distance remains informative.

$p$-norm distance: $d_p(x, x') = \left( \sum_{i=1}^{d} |x_i - x'_i|^p \right)^{1/p}$ with $p \geq 1$, in which the 1-norm and 2-norm are known as the Manhattan and Euclidean distances, respectively.

Cosine distance: $d_{cos}(x, x') = 1 - \dfrac{x \cdot x'}{\|x\|_2 \, \|x'\|_2}$.

Bray-Curtis (BC) dissimilarity: $d_{BC}(x, x') = \dfrac{\sum_{i=1}^{d} |x_i - x'_i|}{\sum_{i=1}^{d} |x_i + x'_i|}$; this is not a distance measure because it does not obey the triangle inequality.
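All four measures are available in SciPy and can be compared on two toy vectors; the vectors below are illustrative. Note that SciPy’s `hamming` returns the *fraction* of differing elements, so we multiply by the dimension to obtain the raw count used here.

```python
import numpy as np
from scipy.spatial import distance

x  = np.array([0.0, 1.0, 2.0, 0.0])
xp = np.array([0.0, 1.0, 0.5, 3.0])

d_hamming = distance.hamming(x, xp) * len(x)   # count of differing dimensions
d_1   = distance.minkowski(x, xp, p=1)         # 1-norm (Manhattan)
d_2   = distance.minkowski(x, xp, p=2)         # 2-norm (Euclidean)
d_cos = distance.cosine(x, xp)                 # 1 - cos(x, x')
d_bc  = distance.braycurtis(x, xp)             # sum|x - x'| / sum|x + x'|
print(d_hamming, d_1, d_bc)
```

For these vectors, two dimensions differ, the Manhattan distance is 1.5 + 3.0 = 4.5, and the BC dissimilarity is 4.5 / 7.5 = 0.6.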
4 Data
This study utilizes a dataset of magnetic materials based on rare-earth transition-metal alloys extracted from the Open Quantum Materials Database (OQMD) [16, 17]. The dataset consists of 5967 crystals. Crystals containing rare-earth and transition elements are considered because the diversity of their structures induces a diverse range of electronic properties on account of the internal magnetic degrees of freedom [18, 19]. In other words, the distinctiveness of crystals and their properties, as well as the importance of considering small changes in crystals, are the main characteristics of this dataset, which makes it useful for our study.
5 Similarity measure selection based on an analysis of descriptors and model complexity
5.1 Quantitative evaluation of the loss of materials’ distinctiveness under similarity measures
Most similarity measures used in vector space take an average of the difference in each vector dimension, leading to a loss of the materials’ distinctiveness. To preserve the materials’ distinctiveness when measuring material similarity, we should accumulate all the differences in every vector dimension. The Hamming distance, which counts the minimum number of substitutions required to change one vector into the other, can be considered the simplest measure that fits this purpose.
In fact, totally preserving the materials’ distinctiveness is meaningless in terms of discovering knowledge because it prevents the generalization of information when seeking patterns or rules. Therefore, some loss of the distinctiveness between materials can be tolerated. However, the extent to which this is tolerated must be adjusted based on the nature of the data and the specific purposes of mining. To estimate the loss of materials’ distinctiveness when using similarity measures, we compute the correlation between the pairwise similarity of materials produced by these measures and that produced by the Hamming distance. In other words, a measure is considered to preserve the materials’ distinctiveness if, whenever two materials are distant according to the Hamming distance, they are also distant according to this measure.
We estimate affinity matrices of materials represented by OFM and CM with the Hamming distance, the $p$-norm distances, the cosine distance, and the BC dissimilarity. Next, we calculate the correlation coefficients between the flattened forms of these matrices, as shown in Table 1. The table shows that, with both descriptors, the 1-norm (Manhattan) distance and BC dissimilarity correlate more strongly with the Hamming distance than the other measures do. Therefore, the use of these two measures results in only a small loss of the materials’ distinctiveness.
Descriptor  Dissimilarity measure  Correlation

OFM  1-norm  0.624
OFM  2-norm  0.45
OFM  3-norm  0.37
OFM  cosine  0.569
OFM  BC      0.564
CM   1-norm  0.57
CM   2-norm  0.394
CM   3-norm  0.345
CM   cosine  0.308
CM   BC      0.551
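The correlation protocol above can be sketched as follows: build pairwise dissimilarity matrices under two measures, flatten their upper triangles, and take the Pearson correlation against the Hamming-based matrix. Random sparse vectors stand in for the OFM/CM descriptors here, so the resulting coefficient is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Sparse toy descriptors: many shared zero-valued entries, as with OFM/CM.
X = rng.random((30, 8)) * (rng.random((30, 8)) > 0.5)

def pairwise(X, d):
    """Flattened upper triangle of the pairwise dissimilarity matrix."""
    iu = np.triu_indices(len(X), k=1)
    return np.array([d(X[i], X[j]) for i, j in zip(*iu)])

hamming   = pairwise(X, lambda a, b: np.sum(a != b))
manhattan = pairwise(X, lambda a, b: np.sum(np.abs(a - b)))
corr = np.corrcoef(hamming, manhattan)[0, 1]
print(round(corr, 3))
```

Repeating this for each candidate measure and reading off the coefficients reproduces the structure of Table 1.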
5.2 K-nearest neighbors regression
K-nearest neighbors (KNN) is known as a “lazy learning” algorithm. It predicts the target value of an instance by averaging the target values of this instance’s nearest neighbors, without any assumption regarding the relation between the instance and its target value. As such, KNN is useful when exploring the nature of data because real-world data does not obey any typical theoretical assumption. In KNN, two hyperparameters need to be defined beforehand: (i) the number of nearest neighbors, denoted by $K$, and (ii) an appropriate similarity measure between instances. In the case study predicting crystal formation energies, we aim to clarify the importance of taking the descriptors and the number of nearest neighbors into account when selecting similarity measures.
Let $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ denote sample data generated from a function $f$, with $y_i = f(x_i)$. For a new instance $x$, KNN estimates the function value for this instance as
$\hat{f}(x) = \frac{1}{K} \sum_{x_i \in N_K(x)} y_i$  (6)
where $N_K(x)$ is the set of $K$ nearest neighbors of $x$.
In KNN, we do not extract generalized patterns or models from the instances in $D$. This method utilizes all data points to approximate the function $f$. The model structure is characterized by the number of nearest neighbors, $K$, and the similarity measure between instances.
Owing to this lack of generalization, overfitting can occur in KNN, particularly when $K$ is too small. Overfitting occurs when the model aptly predicts instances in the existing data but poorly predicts new instances. Therefore, in some cases, high prediction performance is irrelevant when exploring the nature of data because it results from regularization, which tolerates prediction errors on existing data in order to better predict new instances. In KNN, using a large $K$ value reduces the risk of overfitting. However, a large $K$ may not yield a suitable approximation of the existing data if the function is a complex curve with many extreme points, as illustrated in Appendix A.
This study focuses on exploring the nature of a given material dataset by interpreting empirical prediction results. Thus, overfitting must be avoided to prevent confusion in the interpretation. If the data is generated from multiple distributions, we separately fit the model for each group of instances assumed to be generated from a distribution.
To effectively approximate the energy function, we need to understand characteristics of this function. Visualizing the energy function is a simple way to derive an intuitive understanding of this function. This process is discussed in Subsection 5.2.1. In Subsection 5.2.2, we present the experimental results from predicting crystals’ formation energies using KNN, and we offer several remarks on the number of neighbors and similarity measures. Our interpretation of the number of neighbors, used to fit the energy function, reveals the complexity of this function. The details of this interpretation are presented in Subsection 5.2.3. Relying on the characteristics of the energy function and descriptors, we can select appropriate similarity measures to obtain high formation energy prediction performance, as described in Subsections 5.2.4 and 5.2.5.
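The KNN setup described above can be sketched with scikit-learn, which accepts the measures compared in Section 5.1 by name (`manhattan`, `braycurtis`). The quadratic toy target below stands in for the formation-energy function; the data, split, and hyperparameters are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
X = rng.random((200, 5))
y = (X ** 2).sum(axis=1)          # toy "energy" function of the descriptors

for metric in ["manhattan", "braycurtis"]:
    # Small K, as suggested by the analysis of the rough energy surface.
    knn = KNeighborsRegressor(n_neighbors=2, metric=metric).fit(X[:150], y[:150])
    pred = knn.predict(X[150:])
    mae = np.abs(pred - y[150:]).mean()
    print(metric, round(mae, 3))
```

Sweeping `n_neighbors` and `metric` in this loop, with cross-validation instead of a fixed split, mirrors the experimental protocol of the following subsections.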
5.2.1 Formation energy surface visualization
In this study, $f$ is the energy function of crystals. Visualizing the energy surface of crystals can help with a preliminary and intuitive understanding of the properties of this function. To visualize this, we use principal component analysis (PCA) to project data instances onto a 2D subspace. Next, we plot the energy surface with the projected data in Figure 2.
Figure 2 shows that the data is diverse because it includes several groups of instances under different energy functions. To avoid overfitting, we separate the data into groups and consider each group individually. We divide the dataset (comprising 5967 crystals) into three groups to make the number of instances in each group sufficiently large to train the model. After clustering, we obtain six samples, denoted by CMG0, CMG1, CMG2, OFMG0, OFMG1, and OFMG2. The energy surface of each sample is also plotted in Figure 2. The visualization shows that the energy surface in all samples is complex, with many extreme points in small vicinities.
5.2.2 Remarks from experimental results
To find the most appropriate value of $K$ and a similarity measure for the different representations of the data, we perform tenfold cross-validation with OFM and CM for the six samples, as shown in Figures 3 and 4. The prediction performance is evaluated using the root mean squared error (RMSE), mean absolute error (MAE), and $R^2$ (the coefficient of determination).
The experimental results show that a small $K$ results in the highest accuracy for most samples. In addition, increasing the value of $K$ degrades the prediction performance. As an exception, in sample OFMG2, a larger $K$ results in better performance than smaller values of $K$. In sample OFMG1, larger $K$ values result in better performance in terms of the RMSE and $R^2$.
The 1-norm distance and BC dissimilarity, which minimize the loss of materials’ distinctiveness, as explained in Section 5.1, result in higher accuracy than the other measures for most samples. Comparing the two descriptors under these two measures, we find that using these measures with OFM yields a greater improvement than using them with CM. In addition, the 1-norm distance and BC dissimilarity perform worse than the other measures in sample CMG1. This shows that the choice of similarity measure depends on the descriptor.
5.2.3 The optimal value of K reveals the complexity of the energy surface
As mentioned above, KNN locally approximates $f$ by averaging the energies of the $K$ nearest neighbors of each data point. For a data point $x$, let $N_K^{+}(x)$ be the subset of the $K$ nearest neighbors whose target values are greater than $f(x)$: $N_K^{+}(x) = \{ x_i \in N_K(x) \mid y_i > f(x) \}$, with $|N_K^{+}(x)| = K^{+}$. Let $N_K^{-}(x)$ be the subset of neighbors whose target values are smaller than or equal to $f(x)$: $N_K^{-}(x) = \{ x_i \in N_K(x) \mid y_i \leq f(x) \}$, with $|N_K^{-}(x)| = K^{-}$. We note that $K^{+} + K^{-} = K$. The formula for estimating the target value of $x$ is rewritten as follows:
$\hat{f}(x) = f(x) + \frac{1}{K} \left[ \sum_{x_i \in N_K^{+}(x)} \big( y_i - f(x) \big) + \sum_{x_i \in N_K^{-}(x)} \big( y_i - f(x) \big) \right]$  (7)
Relying on Equation 7 to precisely estimate $f(x)$ requires that neither $N_K^{+}(x)$ nor $N_K^{-}(x)$ is empty and that the positive and negative residuals cancel each other. Therefore, instances that are extreme points of the energy function (see Appendix A) are difficult to fit using KNN. The nearest neighbors of these extreme points often lie in their vicinity, and the energies $y_i$ of these neighbors are only smaller or only greater than $f(x)$ (i.e., $N_K^{+}(x)$ or $N_K^{-}(x)$ is empty). Thus, using a small number of neighbors can help to reduce the residual of the estimation in such a situation. Therefore, a small $K$ value, which is optimal for most of the samples (as shown in Figures 3 and 4), is consistent with the fact that the energy surface is rough, with many extreme points within small vicinities, as shown in Figure 2.
The experimental results show that large values of $K$ perform better in sample OFMG1. The underlying reason for this is that the sample can be divided into three smaller groups under three different functions, as bounded by the red circles in Figure 2. Thus, choosing a small $K$ value can lead to overfitting in this sample and degrade the prediction performance. In this case, using a large $K$ plays the role of regularization when dealing with overfitting. Therefore, using a large $K$ in sample OFMG1 does not conflict with our hypothesis regarding the relation between $K$ and the complexity of the energy function.
5.2.4 Similarity measure selection based on the energy function complexity
The analysis of the value of $K$ presented above reveals the complexity of the energy function, which serves as the basis for investigating similarity measures. Suppose that $x$ is an extreme point of the function and that we need to approximate the energy value at this point; $x$ is called a query point. The closest point to the query point is determined, and the distance between these two points is denoted by $d_{\min}$. To identify other neighbors of $x$, we enlarge the region surrounding $x$ to a radius $\lambda \cdot d_{\min}$ (with $\lambda \geq 1$), called the neighboring region of $x$. Data points within this region are considered neighbors of the query point (see Figure 5).
Alternatively, rather than determining $K$ nearest neighbors, KNN can take an average over all neighbors belonging to the neighboring region of each point in the data (the query point), determined by a distance threshold. This method is called fixed-radius nearest-neighbors regression [20]. To estimate the energy at $x$, we average over all data points falling in its neighboring region. Of course, the number of neighbors of $x$ depends on which similarity measure is used. As mentioned above, to predict the energy at $x$ precisely, we need to select similarity measures that identify a small number of neighbors in the vicinity of $x$. Although different measures produce values in different ranges, they share the common factor $d_{\min}$. Thus, $\lambda$ can be understood as a relative radius under each measure, which makes it possible to compare these measures.
For each crystal represented by OFM and CM, we determine its neighboring region according to each similarity measure and $\lambda$. Next, for each crystal, we count the number of neighbors in its neighboring region. We take the average of the number of neighbors over all crystals in the dataset for each similarity measure and $\lambda$, as shown in Figure 6. This figure shows that the 1-norm distance and BC dissimilarity determine fewer neighbors than the other measures with both descriptors. Therefore, these measures are more appropriate for approximating the energy function. In fact, the experimental results also show the improvement in prediction obtained by using the 1-norm distance and BC dissimilarity. Indeed, the small number of neighbors in a fixed radius determined by the 1-norm distance and BC dissimilarity also indicates how these measures preserve the materials’ distinctiveness.
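The neighbor-counting protocol above can be sketched as follows: for each point, find its nearest-neighbor distance $d_{\min}$ under a given measure, then count the points within radius $\lambda \cdot d_{\min}$ and average over the dataset. Random vectors stand in for the descriptors, and the value of $\lambda$ is an illustrative assumption.

```python
import numpy as np
from scipy.spatial.distance import cdist

def mean_neighbor_count(X, metric, lam=1.5):
    """Average number of neighbors within radius lam * d_min per point."""
    D = cdist(X, X, metric=metric)
    np.fill_diagonal(D, np.inf)                  # exclude self-distances
    d_min = D.min(axis=1)                        # distance to closest point
    counts = (D <= lam * d_min[:, None]).sum(axis=1)
    return counts.mean()

rng = np.random.default_rng(2)
X = rng.random((100, 6))                         # toy descriptor vectors
for metric in ["cityblock", "euclidean", "braycurtis"]:
    print(metric, mean_neighbor_count(X, metric))
```

A measure that yields lower averages in this count identifies tighter neighborhoods, which is the property argued above to suit a rough energy surface.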
5.2.5 Similarity measure selection based on descriptors’ characteristics
By collating the experimental results shown in Figure 3 and Figure 4, we see that there is a significant improvement when using the 1-norm distance and BC dissimilarity with OFM. Meanwhile, the improvement is not significant for CM, and these measures even perform worse than the other measures in sample CMG1. In other words, similarity measures must be chosen strictly for OFM because the choice strongly affects the prediction performance, whereas for CM an inappropriate selection is more tolerable. The underlying reason for this pertains to how distinct the instances are when they are represented by the descriptors. The distinctiveness of data instances is indicated via (i) the number of dimensions in the representation, and (ii) whether the information encoded in the representations can distinguish instances, which can be measured by estimating the variance of the data.
Regarding (i), as shown in Table 2, there are more dimensions in the OFM-based representation (1056 dimensions) than in the CM-based representation. Obviously, increasing the number of dimensions enhances the instances’ distinctiveness. Therefore, the emphasis on the materials’ distinctiveness in the OFM is stronger than it is in the CM. This results in the need to strictly select appropriate similarity measures for the OFM.
Regarding (ii), we examine the variance of the data under the different representations. Let $X_1, \ldots, X_d$ be random variables that correspond to the features of a representation. The data variance is computed as $\sum_{i=1}^{d} \operatorname{Var}(X_i)$, which is shown in Table 2. The variance of the data when represented by OFM is higher than when represented by CM. This also shows that materials represented by the OFM are more distinct than those represented by the CM, and that the OFM is more appropriate for representing materials in this case.
Descriptor  # dimensions  Data variance

OFM  1056  0.013
CM   68    0.004
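The data-variance criterion in (ii) can be sketched numerically, assuming it is computed as the sum of per-feature variances (the trace of the covariance matrix). The higher-dimensional synthetic "OFM-like" sample below is constructed so that it yields a larger total variance than the compact "CM-like" one; the dimensions match Table 2, but the data itself is random.

```python
import numpy as np

rng = np.random.default_rng(3)
ofm_like = rng.normal(scale=0.11, size=(500, 1056))  # many moderately spread features
cm_like  = rng.normal(scale=0.24, size=(500, 68))    # few, individually wider features

def total_variance(X):
    """Sum of per-feature variances, sum_i Var(X_i)."""
    return X.var(axis=0).sum()

print(total_variance(ofm_like) > total_variance(cm_like))
```

Even though each CM-like feature is individually wider here, the far larger number of OFM-like dimensions dominates the total, illustrating how dimensionality and spread jointly drive the distinctiveness of the representation.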
Both the complexity of the energy function and the selection of similarity measures depend on the characteristics of the descriptors, as presented in Figure 1. If the representation essentially indicates the distinctiveness among instances, similarity measures that minimize the loss of instances’ distinctiveness are more suitable for fitting a complex function of instances.
5.3 Ridge regression
Ridge regression is a parametric model that approximates the energy function by a linear function. In this method, the linear coefficients $\beta$ are estimated to minimize the penalized residual sum of squares, as follows:
$\hat{\beta} = \underset{\beta}{\arg\min} \left\{ \| y - X\beta \|_2^2 + \lambda \| \beta \|_2^2 \right\}$  (8)
where the matrix X is the input data, and $\lambda$ is a predefined parameter indicating the amount of coefficient shrinkage towards zero (weight decay). Ridge regression has the following closed-form solution:
$\hat{\beta} = (X^{\top} X + \lambda I)^{-1} X^{\top} y$  (9)
where I is the identity matrix.
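The closed-form solution of Eq. (9) is a few lines of NumPy. The toy data below is generated from a known linear relation so the recovered coefficients can be checked; the sample size, coefficients, and near-zero shrinkage value are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.random((100, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true                    # noiseless linear target

lam = 1e-6                           # near-zero shrinkage to recover beta_true
# beta = (X^T X + lam I)^{-1} X^T y, solved without forming the inverse explicitly
beta = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(np.round(beta, 3))
```

With larger `lam`, the coefficients shrink toward zero, which is the weight-decay behavior described above.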
This differs from locally approximating models such as the KNN model, insofar as parametric models are more generalized; they do not require all instances in the dataset to fit the energy function. Because of this generalization, it is possible to explore the nature of data from experimental results when fitting the model for the whole dataset. This incurs less risk of overfitting than the KNN interpretation.
We compare the performance (via RMSE, MAE, and $R^2$) of ridge regression with the OFM and CM descriptors when predicting crystal formation energies. The most likely hyperparameter $\lambda$ is chosen by grid search, using tenfold cross-validation. The optimal values of $\lambda$ for OFM and CM are 0.01 and 1.0, respectively. The prediction accuracies are presented in Table 3. From the table, we see that with ridge regression, the OFM descriptor outperforms the CM descriptor.
Metric  OFM  CM

RMSE  0.239 ± 0.002  0.914 ± 0.01
MAE   0.184 ± 0.002  0.722 ± 0.006
R²    0.97 ± 0.01    0.556 ± 0.01
5.4 Kernel ridge regression
KRR is the dual form of the ridge regression solution (see Appendix B). KRR aims to improve the performance of linear methods by mapping instances from the original space to a higher-dimensional Hilbert space in which patterns become linearly separable. Let $\phi$ be the mapping function. The pairwise dot products of instances in the new space are computed by kernel functions, $k(x, x') = \phi(x) \cdot \phi(x')$, which form kernel matrices K (i.e., Gram matrices). The radial basis function (RBF) kernel and Laplacian kernel, which are constructed from the 2-norm and 1-norm distances, respectively, have been widely used. The formulas for these kernels are as follows:

RBF: $k_{RBF}(x, x') = \exp\!\left( -\gamma \, d_2(x, x')^2 \right)$, where $d_2(x, x')$ is the 2-norm distance between $x$ and $x'$, and $\gamma$ is a predefined scalar.

Laplacian: $k_{Lap}(x, x') = \exp\!\left( -\gamma \, d_1(x, x') \right)$, where $d_1(x, x')$ is the 1-norm distance between $x$ and $x'$.
In addition, we can modify existing kernels by replacing the 1-norm and 2-norm distances with other similarity measures. Accordingly, we consider several derivations of the RBF and Laplacian kernels using the 3-norm distance, the cosine distance, and the BC dissimilarity, as follows:

$k_{3norm}(x, x') = \exp\!\left( -\gamma \, d_3(x, x') \right)$, where $d_3(x, x')$ is the 3-norm distance between $x$ and $x'$.

$k_{cos}(x, x') = \exp\!\left( -\gamma \, d_{cos}(x, x') \right)$, where $d_{cos}(x, x')$ is the cosine distance between $x$ and $x'$.

$k_{BC}(x, x') = \exp\!\left( -\gamma \, d_{BC}(x, x') \right)$, where $d_{BC}(x, x')$ is the Bray-Curtis dissimilarity between $x$ and $x'$.
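Such non-standard kernels can be sketched by building the Gram matrix explicitly from the chosen dissimilarity and passing it to scikit-learn as a precomputed kernel. The data, split, `gamma`, and `alpha` below are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(5)
X = rng.random((120, 4)) + 0.1          # positive entries keep BC well-behaved
y = X.sum(axis=1) ** 2                  # toy nonlinear "energy" target

gamma = 2.0
Xtr, Xte, ytr = X[:100], X[100:], y[:100]

# k_BC(x, x') = exp(-gamma * d_BC(x, x')), as a precomputed Gram matrix.
K_tr = np.exp(-gamma * cdist(Xtr, Xtr, metric="braycurtis"))
K_te = np.exp(-gamma * cdist(Xte, Xtr, metric="braycurtis"))

model = KernelRidge(alpha=1e-3, kernel="precomputed").fit(K_tr, ytr)
pred = model.predict(K_te)
print(pred.shape)
```

Swapping the `metric` argument (e.g., to `"minkowski"` with `p=3` or `"cosine"`) yields the other derived kernels in the list above.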
We compare the crystal formation energy prediction performance of KRR with the different kernel functions listed above and the two descriptors. In KRR, the most likely hyperparameters $\gamma$ and $\lambda$ are selected by a grid search with tenfold cross-validation; the selected values are given in Table 4. The prediction results are given in Table 5, which shows that the Laplacian kernel and $k_{BC}$ outperform the others with both the OFM and CM descriptors.
Descriptor  Kernel  γ  λ

OFM  RBF
OFM  Laplacian
CM   RBF
CM   Laplacian
Descriptor  Kernel  RMSE  MAE  R²

OFM  RBF        0.158 ± 0.002  0.113 ± 0.001  0.987 ± 0.001
OFM  Laplacian  0.108 ± 0.002  0.067 ± 0.001  0.994 ± 0.001
OFM  k_3norm    0.429 ± 0.006  0.323 ± 0.004  0.902 ± 0.003
OFM  k_cos      0.729 ± 0.19   0.289 ± 0.046  0.521 ± 0.272
OFM  k_BC       0.109 ± 0.003  0.067 ± 0.001  0.994 ± 0.001
CM   RBF        0.394 ± 0.018  0.245 ± 0.006  0.916 ± 0.008
CM   Laplacian  0.319 ± 0.008  0.194 ± 0.003  0.946 ± 0.003
CM   k_3norm    0.395 ± 0.013  0.246 ± 0.005  0.917 ± 0.005
CM   k_cos      1.071 ± 0.138  0.671 ± 0.01   0.285 ± 0.208
CM   k_BC       0.328 ± 0.011  0.19 ± 0.01    0.942 ± 0.004
5.5 Model complexity
The complexity of a model can be quantitatively interpreted through its degrees of freedom. The degrees of freedom are denoted by $df$ and defined as the number of freely varying parameters in the model (or function). In terms of model complexity, the greater the number of free parameters, the more complex the model. For computation, the degrees of freedom are defined as the trace of the first derivative of $\hat{y}$ with respect to y, as follows:
$df = \operatorname{tr}\!\left( \frac{\partial \hat{y}}{\partial y} \right)$  (10)
where y and $\hat{y}$ are the real target values and the estimated target values, respectively [21].
In ridge regression, because $\hat{\beta}$ is estimated with Equation 9, we have
$\hat{y} = X \hat{\beta} = X (X^{\top} X + \lambda I)^{-1} X^{\top} y$  (11)
Hence, the model’s degrees of freedom with a predefined $\lambda$, denoted by $df(\lambda)$, are estimated as $df(\lambda) = \operatorname{tr}\!\left( X (X^{\top} X + \lambda I)^{-1} X^{\top} \right)$.
In KRR, because $\hat{y} = K (K + \lambda I)^{-1} y$ (see Equation 20 in Appendix B), we obtain
$\frac{\partial \hat{y}}{\partial y} = K (K + \lambda I)^{-1}$  (12)
Therefore, the model’s degrees of freedom in KRR are estimated as $df(\lambda) = \operatorname{tr}\!\left( K (K + \lambda I)^{-1} \right)$.
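The KRR degrees-of-freedom formula can be checked numerically on a toy RBF Gram matrix: $df(\lambda)$ shrinks toward 0 as $\lambda$ grows and approaches the sample size as $\lambda \to 0$. The data size and $\gamma = 1$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.random((50, 3))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
K = np.exp(-sq)                                       # RBF Gram matrix, gamma = 1

def dof(K, lam):
    """df(lambda) = tr(K (K + lambda I)^{-1})."""
    n = len(K)
    return np.trace(K @ np.linalg.inv(K + lam * np.eye(n)))

print(round(dof(K, 1e-8), 2), round(dof(K, 10.0), 2))
```

Since $df(\lambda) = \sum_i \sigma_i / (\sigma_i + \lambda)$ for the eigenvalues $\sigma_i$ of K, larger $\lambda$ monotonically reduces the effective number of free parameters, which is the regularization view used in the following subsections.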
The model’s degrees of freedom in ridge regression depend on the descriptor and the hyperparameter $\lambda$. Meanwhile, in KRR, they depend on the descriptor, the similarity measure (used in the kernel function), and the hyperparameters $\gamma$ and $\lambda$. Utilizing the most likely hyperparameters, as presented in Section 5.3 and Table 4, we estimate the model’s degrees of freedom in ridge regression and KRR for each descriptor and kernel function, as shown in Table 6.
Method  Descriptor and kernel  df

Ridge  OFM              441.87
Ridge  CM               34.97
KRR    OFM + RBF        2132.02
KRR    OFM + Laplacian  4087.08
KRR    OFM + k_3norm    1790.23
KRR    OFM + k_cos      1981.61
KRR    OFM + k_BC       4843.64
KRR    CM + RBF         711.06
KRR    CM + Laplacian   3018.84
KRR    CM + k_3norm     2819.88
KRR    CM + k_cos       47.85
KRR    CM + k_BC        3957.95
5.6 Descriptor selection based on model complexity
As discussed in Sections 5.2.1 and 5.2.3, the energy function of crystals is complex. Hence, to appropriately fit the energy function, we should choose models with high complexity [22]. This means we need to select models with high degrees of freedom.
Relying on the degrees of freedom of ridge regression, we can evaluate the appropriateness of the OFM and CM descriptors for approximating the energy function. From Table 6, we see that the model’s degrees of freedom in ridge regression when using OFM (441.87) are greater than when using CM (34.97). Therefore, the linear model approximated from the OFM representation of materials has a higher complexity than that approximated from the CM representation. This may explain why representing materials by OFM results in better performance when predicting the formation energies of materials compared to representing materials by CM, as shown in Table 3.
5.7 Kernel selection based on model complexity
By estimating the degrees of freedom of KRR, we can select not only the appropriate descriptor but also the appropriate kernel. In KRR, the use of the Laplacian and BC-based kernels results in higher model complexity than the use of the other kernels. This explains why the Laplacian and BC-based kernels perform better than the others with both OFM and CM in terms of predicting crystal formation energies, as shown in Table 5. As mentioned in Section 5.4, the kernel function maps instances in the data to a higher-dimensional space, which enhances the distinctiveness among materials. In fact, kernel functions can be treated as generalized similarity measures in the new space of these instances [23]. Therefore, if we fit the energy function based on instances in the higher-dimensional space by KNN, kernel functions that minimize the loss of materials' distinctiveness will result in better performance, as discussed in Section 5.2. The Laplacian and BC-based kernels, which are monotonic functions of the 1-norm distance and the BC dissimilarity, respectively, also minimize the loss of materials' distinctiveness. Thus, these kernels are more appropriate than the others for fitting the energy function in the new space. KRR approximates this function in the same manner as KNN, but the former uses a model with a closed-form expression rather than locally fitted models; hence the advantages of using the Laplacian and BC-based kernels.
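As an illustration of a kernel that is a monotone function of the 1-norm distance, the following scikit-learn sketch fits KRR with a Laplacian kernel. The descriptor vectors, target function, and hyperparameters here are synthetic placeholders, not those used in the study:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics.pairwise import laplacian_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))                          # stand-in descriptor vectors
y = np.abs(X).sum(axis=1) + 0.1 * rng.normal(size=80)  # toy target

# laplacian_kernel(x, x') = exp(-gamma * ||x - x'||_1) decreases monotonically
# with the 1-norm distance, so it preserves the ranking of pairwise distances
# between instances (minimal loss of distinctiveness).
K = laplacian_kernel(X, gamma=0.1)

model = KernelRidge(kernel="laplacian", gamma=0.1, alpha=1e-2).fit(X, y)
pred = model.predict(X)
```

Because the kernel is a monotone transform of the distance, two materials that are far apart in descriptor space remain clearly distinguished in the kernel-induced space.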
Relying on the model complexity estimates presented in Table 6 for the Laplacian and BC-based kernels, the model using OFM has higher complexity than the model using CM. This explains the better performance of KRR with OFM compared with CM.
Through the interpretations presented above, we clarify the association between the emphasis on materials' distinctiveness (which involves descriptors reflecting the distinctiveness of instances and similarity measures minimizing the loss of this distinctiveness) and the model complexity. This forms the basis for selecting appropriate descriptors and kernel functions (or similarity measures) to effectively mine material data.
6 Conclusion
This paper introduces a protocol for interpreting experimental results when mining material data, in a case study predicting crystal formation energies by KNN, ridge regression, and KRR. Through empirical and theoretical interpretations of the prediction performance from multiple perspectives, we identified dependencies among descriptors, similarity measures, and learning methods. This forms the basis for model selection to effectively mine materials data. When these factors simultaneously reflect the nature of the data, high performance can be obtained in mining tasks. Through the case study, we found that descriptors that suitably reflect the materials' distinctiveness, and similarity measures that minimize the loss of this distinctiveness, result in better performance when predicting the formation energies of materials. This research can serve as groundwork for future studies on using machine learning methods to mine material data.
Acknowledgements
This work was partly supported by PRESTO and the "Materials Research by Information Integration" initiative (MII) project of the Support Program for Starting Up Innovation Hub, by the Japan Science and Technology Agency (JST) and the Elements Strategy Initiative Project under the auspices of MEXT, and also by MEXT as a social and scientific priority issue (Creation of New Functional Devices and High-Performance Materials to Support Next-Generation Industries; CDMSI) to be tackled using a post-K computer.
Disclosure statement
The authors declare that they have no conflict of interest.
References
 [1] Behler J. Atom-centered symmetry functions for constructing high-dimensional neural network potentials. The Journal of Chemical Physics. 2011;134(7):074106.
 [2] Rupp M, Tkatchenko A, Müller KR, et al. Fast and accurate modeling of molecular atomization energies with machine learning. Physical Review Letters. 2012;108(5):058301.
 [3] Isayev O, Fourches D, Muratov EN, et al. Materials cartography: representing and mining materials space using structural and electronic fingerprints. Chemistry of Materials. 2015;27(3):735–743.
 [4] Zhu L, Amsler M, Fuhrer T, et al. A fingerprint based metric for measuring similarities of crystalline structures. The Journal of Chemical Physics. 2016;144(3):034203.
 [5] Lam Pham T, Kino H, Terakura K, et al. Machine learning reveals orbital interaction in materials. Science and Technology of Advanced Materials. 2017;18(1):756–765.
 [6] Pham TL, Nguyen ND, Nguyen VD, et al. Learning structure-property relationship in crystalline materials: a study of lanthanide–transition metal alloys. The Journal of Chemical Physics. 2018;148(20):204106.
 [7] Bender A, Glen RC. Molecular similarity: a key technique in molecular informatics. Organic & Biomolecular Chemistry. 2004;2(22):3204–3218.
 [8] Maggiora GM, Shanmugasundaram V. Molecular similarity measures. 2004:1–50.
 [9] Barbosa F, Horvath D. Molecular similarity and property similarity. Current Topics in Medicinal Chemistry. 2004;4(6):589–600.
 [10] Willett P. The calculation of molecular structural similarity: principles and practice. Molecular Informatics. 2014;33(6–7):403–413.
 [11] Maggiora G, Vogt M, Stumpfe D, et al. Molecular similarity in medicinal chemistry: miniperspective. Journal of Medicinal Chemistry. 2013;57(8):3186–3204.
 [12] Riniker S, Landrum GA. Similarity maps – a visualization strategy for molecular fingerprints and machine-learning methods. Journal of Cheminformatics. 2013;5(1):43.
 [13] Maldonado AG, Doucet J, Petitjean M, et al. Molecular similarity and diversity in chemoinformatics: from theory to applications. Molecular Diversity. 2006;10(1):39–79.
 [14] Kupriyanov A, Kirsh D. Estimation of the crystal lattice similarity measure by three-dimensional coordinates of lattice nodes. Optical Memory and Neural Networks. 2015;24(2):145–151.
 [15] Montavon G, Hansen K, Fazli S, et al. Learning invariant representations of molecules for atomization energy prediction. In: Advances in Neural Information Processing Systems; 2012. p. 440–448.
 [16] Kirklin S, Saal JE, Meredig B, et al. The Open Quantum Materials Database (OQMD): assessing the accuracy of DFT formation energies. npj Computational Materials. 2015;1:15010.
 [17] Saal JE, Kirklin S, Aykol M, et al. Materials design and discovery with high-throughput density functional theory: the Open Quantum Materials Database (OQMD). JOM. 2013;65(11):1501–1509.
 [18] Hinatsu Y. Diverse structures of mixed-metal oxides containing rare earths and their magnetic properties. Journal of the Ceramic Society of Japan. 2015;123(1441):845–852.
 [19] Lynch DW, Cowan R. Effect of hybridization on 4d→4f spectra in light lanthanides. Physical Review B. 1987;36(17):9228.
 [20] Chen GH, Shah D, et al. Explaining the success of nearest neighbor methods in prediction. Foundations and Trends in Machine Learning. 2018;10(5–6):337–588.
 [21] Krämer N, Braun ML. Kernelizing PLS, degrees of freedom, and efficient model selection. In: Proceedings of the 24th International Conference on Machine Learning; 2007. p. 441–448.
 [22] Exterkate P. Modelling issues in kernel ridge regression. 2011.
 [23] Pekalska E, Paclik P, Duin RP. A generalized kernel approach to dissimilarity-based classification. Journal of Machine Learning Research. 2001;2(Dec):175–211.
Appendices
Appendix B Example of approximating complex functions using KNN regression
KNN regression approximates a global function of all data points by averaging the neighbors of each point. To illustrate how KNN approximates a flexible and complex function, we consider a univariate dataset in which the distance between data points is estimated by the absolute difference |x − x′|, where x and x′ are scalars. Suppose that we generate a sample of data points from the following function:
(13) y = f(x) + ε,
where x denotes a scalar data point, f is the underlying function, and ε is noise generated from a normal distribution. For each data point, we use KNN to approximate its target value, where the number of neighbors k is 4, 8, or 10. Setting k to even numbers ensures that the numbers of neighbors on both sides of each data point are equal. The accuracy at different values of k is evaluated via RMSE and MAE, and the results are given in Figure 7.
Two remarks are worth noting regarding the approximation of a complex function by KNN: (i) it is difficult to estimate the function values precisely at data points that are extreme points of the function; (ii) the use of a smaller value of k gives a better approximation than the use of a larger value, and a small k is particularly suitable for approximating extreme points.
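This experiment can be reproduced in a few lines with scikit-learn. The sketch below uses a noisy sine curve as a stand-in for Equation 13; the actual function and noise level in the paper may differ:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
x = np.linspace(0, 4 * np.pi, 200).reshape(-1, 1)
# noisy oscillating target standing in for Equation 13
y = np.sin(x).ravel() + rng.normal(scale=0.1, size=200)

# Fit KNN regression for each neighborhood size and record the training RMSE.
rmse = {}
for k in (4, 8, 10):
    knn = KNeighborsRegressor(n_neighbors=k).fit(x, y)
    rmse[k] = float(np.sqrt(np.mean((knn.predict(x) - y) ** 2)))
```

On this toy data the RMSE grows with k, consistent with remark (ii): larger neighborhoods average across the peaks and troughs and smooth out the extreme points.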
Appendix C Kernel ridge regression – dual form of ridge regression
We rewrite the optimization problem for ridge regression as
(14) min_{w,r} rᵀr + λwᵀw
subject to r = y − Xw.
The solution is equivalent to
(15) min_w max_α L(w, r, α),
where L(w, r, α) = rᵀr + λwᵀw + αᵀ(y − Xw − r) is the Lagrangian function. We solve the minimization problem by setting to zero the first derivatives of the Lagrangian function with respect to w and r:
(16) ∂L/∂w = 2λw − Xᵀα = 0 ⟹ w = (1/(2λ))Xᵀα;  ∂L/∂r = 2r − α = 0 ⟹ r = (1/2)α.
Plugging w and r into the Lagrangian function obtains
(17) L(α) = αᵀy − (1/(4λ))αᵀXXᵀα − (1/4)αᵀα.
Now, the dual problem is max_α L(α), which is equivalent to the following (noting that K = XXᵀ):
(18) max_α αᵀy − (1/(4λ))αᵀKα − (1/4)αᵀα,
where K = XXᵀ is called the kernel matrix. To obtain α, we also set the first derivative of the dual objective function to zero, to obtain
(19) y − (1/(2λ))Kα − (1/2)α = 0 ⟹ α = 2λ(K + λI)⁻¹y.
Based on Equation 16, we obtain
(20) w = (1/(2λ))Xᵀα = Xᵀ(K + λI)⁻¹y, and hence ŷ = Xw = K(K + λI)⁻¹y.
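The equivalence of the primal ridge solution and the dual (kernel) solution can be checked numerically. A small sketch with random data; the shapes and regularization strength are chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = rng.normal(size=30)
lam = 0.5

# Primal ridge solution: w = (X^T X + lam*I)^{-1} X^T y
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# Dual solution: w = X^T (K + lam*I)^{-1} y with the linear kernel K = X X^T
K = X @ X.T
w_dual = X.T @ np.linalg.solve(K + lam * np.eye(30), y)
```

The two weight vectors agree, and the fitted values Xw equal K(K + λI)⁻¹y, which is the hat matrix used in the degrees-of-freedom estimate for KRR.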