A Chemical Bond-Based Representation of Materials
Abstract
This paper introduces a new representation method for materials that is based mainly on the chemical bonds among atoms. Each chemical bond and its surrounding atoms are considered a unified unit, a local structure that is expected to reflect part of the material's nature. A material is first separated into local structures, each of which is then represented as a matrix computed from information about the corresponding chemical bond and the orbital-field matrices of the two bonded atoms. Finally, all local structures of the material are summarized from a statistical point of view. In the experiment, the new method was applied to a materials informatics task of predicting atomization energies on the QM7 data set. The results show that the new method is more effective than two state-of-the-art representation methods in most cases.
1 Introduction
As remarked in [14], a key element of developing advanced materials is to learn from materials knowledge and available materials data to guide the next experiments or calculations in order to focus on materials with targeted properties. Traditionally, materials knowledge has been discovered by experimental studies. In the last few decades, the knowledge has also been discovered by a conventional approach, called computational materials science, whose scope is to model or predict the behavior of materials based on their composition, microstructure, process history, and interactions.
Recently, the development of materials informatics [1, 23], known as the combination of materials science and data science, has opened up a new opportunity for accelerating the discovery of materials knowledge. In the literature, data science [6] is a field of study that employs a wide range of data-driven techniques from many research fields, such as applied mathematics, statistics, computational science, information science, and computer science, in order to understand and analyze data. In materials informatics, data-driven techniques are applied to existing materials data for the purpose of automatically discovering new materials knowledge such as hidden features, hidden chemical and physical rules, and new patterns [10, 11, 25]. Remarkably, materials informatics is expected not only to provide foundations for a new paradigm of materials discovery [22], but also to be the next generation of exploring new materials [27].
Over the years, a large volume of materials data has been generated [15]. These data are commonly described by a set of atoms with their coordinates and periodic unit-cell vectors and are categorized as unstructured data [20]. In practice, data-driven techniques can hardly be applied directly to materials data. Before applying data-driven techniques, materials data have to be transformed into new representations (or descriptors). The representations need to reflect the nature of materials and the actuating mechanisms of chemical and physical phenomena. In addition, operations such as comparisons and calculations must be possible on the representations.
So far, various methods for representing materials have been developed. Behler and coworkers [3, 7, 8] utilized atom-centered symmetry functions to represent the local chemical environment of atoms and employed a multilayer perceptron to map this representation to atomic energy. The arrangement of structural fragments has also been used to represent materials in order to predict the physical properties of molecular and crystalline systems [21]. Isayev and coworkers [11] used band-structure and density-of-states fingerprint vectors as a representation of materials to visualize the material space. Rupp and coworkers developed a representation known as the Coulomb matrix for the prediction of atomization energies and formation energies [9, 16, 24]. In [20], the authors pointed out that the distribution of valence orbitals (electrons) of atoms in materials is important information that should be included in the representation of materials; they also proposed a representation method, called the orbital-field matrix, which exploits this distribution.
It is well known that the properties of most materials are determined by chemical bonds, which may result from the electrostatic force of attraction between atoms with opposite charges or from the sharing of electrons. In addition, chemical bonds hold an enormous amount of energy, and building and breaking chemical bonds is part of the energy cycle. Therefore, in this research, we aim at developing a new representation method that is based mainly on chemical bonds. In short, the main contributions of this research are (1) a new method to exploit the chemical bonds of atoms in materials and (2) a new method to utilize the local structures of a material by adopting a statistical point of view.
2 The proposed representation method
Generally, a material is composed of chemical bonds that connect atoms together. Let us consider a material, denoted by $M$, which consists of $n$ chemical bonds denoted by $b_1, b_2, \ldots, b_n$. Assume that a chemical bond $b_i$, with $i \in \{1, 2, \ldots, n\}$, is generated by a connection between two atoms $A$ and $B$, and that this bond is surrounded by several other atoms, each of which connects to atom $A$ or atom $B$, as illustrated in Figure 1. The surrounding atoms generate a chemical environment that holds chemical bond $b_i$ in a stable state.

Chemical bond $b_i$ and its chemical environment can be considered a unified unit corresponding to a local structure of material $M$. In other words, material $M$ can be separated into $n$ local structures corresponding to its $n$ chemical bonds and their chemical environments.
Atoms are represented by 32-dimensional vectors, called one-hot vectors [20], which are generated from the set of valence subshell orbitals (e.g., $s^2$ indicates that the valence $s$ orbital holds two electrons in the electron configuration). In addition, we adopt the orbital-field matrix [20] for representing the two atoms $A$ and $B$. Let $X_A$ and $X_B$ denote the two orbital-field matrices corresponding to atoms $A$ and $B$, respectively. These two matrices are defined by

\[
X_A = o_A^{\top} e_A, \qquad X_B = o_B^{\top} e_B, \qquad (1)
\]

where $o_A$ and $o_B$ are the one-hot vectors of atoms $A$ and $B$, and $e_A$ and $e_B$ are two vectors representing the chemical environments of these two atoms [20]. The two vectors $e_A$ and $e_B$ are defined by

\[
e_A = \sum_{k=1}^{n_A} w_k\, o_k, \qquad e_B = \sum_{k=1}^{n_B} w_k\, o_k, \qquad (2)
\]

where $n_A$ and $n_B$ are the total numbers of atoms connecting to atoms $A$ and $B$, respectively, $o_k$ is the one-hot vector of neighboring atom $k$, and $w_k$ is a coefficient representing the importance of atom $k$. The weight coefficient $w_k$ is defined by

\[
w_k = \frac{\theta_k}{\theta_{\max}} \cdot \frac{1}{r_k}, \qquad (3)
\]

where $\theta_k$ is the solid angle determined by the face of the Voronoi polyhedron [2] that separates atom $k$ and its connected atom (atom $A$ or atom $B$), $\theta_{\max}$ is the maximum solid angle among the solid angles corresponding to the atoms that connect to that connected atom, and $r_k$ is the distance between atom $k$ and its connected atom.
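To make Eqs. (1)-(3) concrete, the following is a minimal NumPy sketch of the local chemical-environment construction. The 32 subshell labels follow the encoding of [20]; the helper names and the 1/r distance factor in the weight are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

# Assumed 32 valence-subshell labels (s^1-s^2, p^1-p^6, d^1-d^10, f^1-f^14),
# following the one-hot encoding described in [20].
ORBITALS = ([f"s{i}" for i in range(1, 3)] + [f"p{i}" for i in range(1, 7)] +
            [f"d{i}" for i in range(1, 11)] + [f"f{i}" for i in range(1, 15)])

def one_hot(subshells):
    """32-dimensional one-hot vector, e.g. ['s2', 'p2'] for a 2s^2 2p^2 valence shell."""
    o = np.zeros(len(ORBITALS))
    for label in subshells:
        o[ORBITALS.index(label)] = 1.0
    return o

def weight(theta_k, theta_max, r_k):
    """Eq. (3): importance of neighbor k; the 1/r_k distance factor is an assumption."""
    return (theta_k / theta_max) / r_k

def environment_vector(neighbor_one_hots, weights):
    """Eq. (2): weighted sum of the neighbors' one-hot vectors."""
    return sum(w * o for w, o in zip(weights, neighbor_one_hots))

def orbital_field_matrix(o_atom, e_atom):
    """Eq. (1): 32x32 outer product of the atom's one-hot vector and its environment vector."""
    return np.outer(o_atom, e_atom)
```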
Chemical bond $b_i$ and its chemical environment are then represented by a matrix $S_i$ as follows:

\[
S_i = \alpha_A X_A + \alpha_B X_B, \qquad (4)
\]
where $\alpha_A$ and $\alpha_B$ are coefficients representing the importance of atoms $A$ and $B$ in chemical bond $b_i$, respectively. The coefficients $\alpha_A$ and $\alpha_B$ should be selected according to the specific application; here, we propose computing them by the following equation:

\[
\alpha_A = \frac{Z_A}{r_{AB}}, \qquad \alpha_B = \frac{Z_B}{r_{AB}}, \qquad (5)
\]

where $Z_A$ and $Z_B$ are the atomic numbers of the two atoms $A$ and $B$, respectively, and $r_{AB}$ is the distance between these two atoms.
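A short sketch of Eqs. (4)-(5) is given below; the coefficient form (atomic number divided by bond length) mirrors the reconstruction in Eq. (5) and should be read as an assumption rather than a definitive formula.

```python
def bond_coefficients(z_a, z_b, r_ab):
    """Eq. (5): importance of atoms A and B in the bond (assumed form Z / r_AB)."""
    return z_a / r_ab, z_b / r_ab

def local_structure_matrix(x_a, x_b, z_a, z_b, r_ab):
    """Eq. (4): matrix S_i describing bond b_i and its chemical environment."""
    alpha_a, alpha_b = bond_coefficients(z_a, z_b, r_ab)
    return alpha_a * x_a + alpha_b * x_b
```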
Because material $M$ contains $n$ chemical bonds, this material is separated into $n$ local structures corresponding to the matrices $S_1, S_2, \ldots, S_n$. From a statistical point of view, the set consisting of the number of local structures, the mean of the local structures, and their standard deviation can be used to describe material $M$. Here, the mean and standard deviation of the local structures, denoted by $\bar{S}$ and $\sigma_S$, are defined element-wise as follows:

\[
\bar{S} = \frac{1}{n} \sum_{i=1}^{n} S_i, \qquad \sigma_S = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( S_i - \bar{S} \right)^2}. \qquad (6)
\]

We propose using this set to represent material $M$. Furthermore, in order to apply data-driven techniques, the representation of material $M$ needs to be transformed into a vector or a matrix. Therefore, the mean and standard-deviation matrices are raveled and then combined with the number of chemical bonds to form a vector. In other words, material $M$ is represented by a vector $v_M$ as follows:

\[
v_M = \big[\, n,\ \operatorname{vec}(\bar{S}),\ \operatorname{vec}(\sigma_S) \,\big]. \qquad (7)
\]
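A minimal sketch of Eqs. (6)-(7), assuming element-wise statistics and simple concatenation of the raveled matrices after the bond count:

```python
import numpy as np

def material_vector(local_structure_matrices):
    """Eqs. (6)-(7): bond count plus element-wise mean and standard deviation, raveled."""
    s = np.stack(local_structure_matrices)        # shape (n, 32, 32)
    n = s.shape[0]
    mean, std = s.mean(axis=0), s.std(axis=0)
    return np.concatenate(([n], mean.ravel(), std.ravel()))  # length 1 + 2 * 32 * 32
```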
Let us consider two materials represented by vectors $v_{M_1}$ and $v_{M_2}$, respectively. One can employ various types of distance measures to quantify the similarity between these two materials, such as the Euclidean [5], Manhattan [12], cosine [26], Bray–Curtis [4], and Canberra [13] distances.
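The distance measures listed above are all available in SciPy; the helper below is a small illustration of how two material vectors could be compared (the function name is ours, not part of the proposed method).

```python
from scipy.spatial import distance

def material_distances(v1, v2):
    """Distances between two material vectors, matching the measures used in Section 3."""
    return {
        "Euclidean":   distance.euclidean(v1, v2),
        "Manhattan":   distance.cityblock(v1, v2),
        "Cosine":      distance.cosine(v1, v2),
        "Bray-Curtis": distance.braycurtis(v1, v2),
        "Canberra":    distance.canberra(v1, v2),
    }
```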
3 Experiment
To evaluate the new representation method, we applied it to a materials informatics application that aims at predicting atomization energies by using machine learning [18]. For analyzing the materials data in this application, we selected the regression setting [18] with two learning algorithms, k-nearest neighbors (KNN) [18] and kernel ridge (KR) [18]. Additionally, we selected the QM7 data set [24] for the application. This data set contains 7165 materials (molecules), each of which is composed of at most 23 atoms of the elements C, N, O, S, and H. The coordinates of atoms are given in the Cartesian coordinate system. Information about the Coulomb matrix and the atomization energies of the materials is available in the data set, and the atomization energies range from 800 to 2000 kcal/mol in magnitude. To determine the chemical bonds between atoms in the materials, we employed pymatgen [19], an open-source library for analyzing materials; however, Voronoi polyhedra [19] could not be determined for 250 materials, so these were eliminated from the data set. As a consequence, 6195 materials were actually used in the experiments.
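The paper does not detail how pymatgen was driven; one plausible sketch, assuming each QM7 molecule is placed in a large cubic box so that Voronoi polyhedra can be constructed, is shown below. Treating the returned weight as $\theta_k / \theta_{\max}$ and the choice of box size are assumptions, and the neighbor analysis can indeed fail for some molecules, as reported above.

```python
from pymatgen.core import Lattice, Structure
from pymatgen.analysis.local_env import VoronoiNN

def voronoi_bonds(symbols, cart_coords, box=30.0):
    """Return (i, j, weight, distance) tuples for Voronoi neighbors of an isolated molecule."""
    structure = Structure(Lattice.cubic(box), symbols, cart_coords,
                          coords_are_cartesian=True)
    vnn = VoronoiNN()
    bonds = []
    for i in range(len(structure)):
        for nn in vnn.get_nn_info(structure, i):  # may fail for degenerate geometries
            bonds.append((i, nn["site_index"], nn["weight"],
                          structure[i].distance(nn["site"])))
    return bonds
```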
For comparison, we selected two state-of-the-art representation methods, the orbital-field matrix [20] and the Coulomb matrix (eigenspectrum) [17, 24], as baselines. For measuring the performance of predicting atomization energies, we used three well-known assessment methods [18]: the mean absolute error (MAE), the root-mean-square error (RMSE), and the coefficient of determination (R²). Moreover, we applied five-times-repeated 10-fold cross-validation in the experiments.
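A scikit-learn sketch of this evaluation protocol (five-times-repeated 10-fold cross-validation with MAE, RMSE, and R²) is given below. The hyperparameter values of the two learners are illustrative assumptions, not the settings reported in the paper.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(model, X, y, n_splits=10, n_repeats=5, seed=0):
    """Average MAE, RMSE, and R^2 over repeated k-fold cross-validation."""
    cv = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=seed)
    maes, rmses, r2s = [], [], []
    for train, test in cv.split(X):
        model.fit(X[train], y[train])
        pred = model.predict(X[test])
        maes.append(mean_absolute_error(y[test], pred))
        rmses.append(np.sqrt(mean_squared_error(y[test], pred)))
        r2s.append(r2_score(y[test], pred))
    return np.mean(maes), np.mean(rmses), np.mean(r2s)

# Illustrative learners (hyperparameters are assumptions, not the paper's settings):
knn = KNeighborsRegressor(n_neighbors=3, metric="euclidean")
kr = KernelRidge(kernel="laplacian", alpha=1e-8, gamma=1e-4)
```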
In order to measure the impact of the distances between atoms in chemical bonds (that is, the lengths of chemical bonds) on the performance of predicting atomization energies, we varied the maximum distance at which two atoms are considered bonded and used the KNN learning algorithm with a fixed number of nearest neighbors and the Euclidean distance. The performance according to the R² assessment method is presented in Figure 2. As the figure shows, the performance first increases and then decreases as this distance grows, and the application achieves high prediction accuracy when the distance takes values from 3 to 5.
Next, we measured the prediction performance of the proposed representation method with both learning algorithms, KNN and KR. For KNN, we selected the Euclidean distance, and for KR, we selected the Laplacian kernel [18]. The results are illustrated in Figure 3, where parts (a) and (b) show the prediction performance of the KNN and KR learning algorithms, respectively. As can be observed, the performance of KR is better than that of KNN.
Table 1: Prediction performance of the three representation methods with the KNN learning algorithm and different distance measures. Columns marked (*), (**), and (***) correspond to the chemical bond-based (proposed), orbital-field matrix, and Coulomb matrix (eigenspectrum) representations, respectively.

Distance       MAE                        RMSE                        R²
measure        (*)      (**)     (***)    (*)       (**)     (***)    (*)      (**)     (***)
Euclidean      12.877   14.411   78.721   24.071    30.015   102.528  0.988    0.981    0.790
Manhattan      11.447   14.102   68.664   22.967    30.218   90.181   0.989    0.980    0.838
Cosine         26.690   42.836   85.885   55.503    97.061   111.835  0.934    0.798    0.751
Bray-Curtis    11.684   14.346   68.829   23.665    30.839   90.347   0.988    0.980    0.8372
Canberra       71.527   47.010   18.832   110.528   72.887   25.526   0.738    0.886    0.987
To compare the proposed representation method with the two selected baselines, we kept the same setting of the representation as above. The results of the comparison are summarized in Tables 1 and 2. In these tables, each assessment method for each representation method is shown in a separate column. As detailed in Table 1, the proposed representation method outperforms the two baselines for the first four distance measures, whereas the Coulomb matrix representation is more effective than the proposed method and the other baseline under the Canberra distance. In addition, as can be seen in Table 2, the proposed method achieves the best performance according to the MAE criterion, and the Coulomb matrix representation obtains the best performance according to the RMSE and R² criteria. Nevertheless, the performance of the proposed method remains comparable with that of the Coulomb matrix representation.
Table 2: Prediction performance of the three representation methods with the KR learning algorithm and the Laplacian kernel. Columns marked (*), (**), and (***) correspond to the chemical bond-based (proposed), orbital-field matrix, and Coulomb matrix (eigenspectrum) representations, respectively.

Kernel         MAE                       RMSE                      R²
               (*)      (**)     (***)   (*)      (**)     (***)   (*)      (**)     (***)
Laplacian      9.934    13.942   9.960   15.106   24.769   13.886  0.995    0.987    0.996
4 Conclusion
In this paper, we have proposed a new method for representing materials in materials informatics applications. This method focuses on exploiting information about the chemical bonds among atoms in materials, and it also inherits the benefit of the orbital-field matrix representation, which is based on the distribution of valence-shell electrons. Additionally, we have demonstrated that different similarity measures can be integrated with the proposed method. Note that the proposed method can be applied to a large diversity of atomic compositions and structures and can facilitate learning and predicting targeted properties of molecular and crystalline systems.
In the experiment, the proposed method was tested on an application that aims to predict atomization energies, and the results indicate that the proposed method is more effective than the two selected baselines in most cases. In the near future, we plan to further evaluate the proposed method on different materials data sets as well as other materials informatics applications.
References
 [1] Ankit Agrawal and Alok Choudhary. Perspective: Materials informatics and big data: Realization of the “fourth paradigm” of science in materials science. APL Mater., 4(053208):1–10, 2016.
 [2] Franz Aurenhammer. Voronoi diagrams – a survey of a fundamental geometric data structure. ACM Comput. Surv., 23(3), 1991.
 [3] Jörg Behler. Atom-centered symmetry functions for constructing high-dimensional neural network potentials. J. Chem. Phys., 134:074106, 2011.
 [4] J. Roger Bray and J. T. Curtis. An ordination of the upland forest communities of southern Wisconsin. Ecol. Monogr., 27(4):325–349, 1957.
 [5] Michel M. Deza and Elena Deza. Encyclopedia of Distances. Springer Berlin Heidelberg, 2009.
 [6] Vasant Dhar. Data science and prediction. Commun. ACM, 56(12):64–73, 2013.
 [7] Hagai Eshet, Rustam Z. Khaliullin, Thomas D. Kühne, Jörg Behler, and Michele Parrinello. Ab initio quality neural-network potential for sodium. Phys. Rev. B, 81:184107, 2010.
 [8] Hagai Eshet, Rustam Z. Khaliullin, Thomas D. Kühne, Jörg Behler, and Michele Parrinello. Microscopic origins of the anomalous melting behavior of sodium under high pressure. Phys. Rev. Lett., 108:115701, 2012.
 [9] Felix Faber, Alexander Lindmaa, O. Anatole von Lilienfeld, and Rickard Armiento. Crystal structure representations for machine learning models of formation energies. Int. J. Quantum Chem., 115(16):1094–1101, 2015.
 [10] Luca M. Ghiringhelli, Jan Vybiral, Sergey V. Levchenko, Claudia Draxl, and Matthias Scheffler. Big data of materials science: Critical role of the descriptor. Phys. Rev. Lett., 114:105503, 2015.
 [11] Olexandr Isayev, Denis Fourches, Eugene N. Muratov, Corey Oses, Kevin Rasch, Alexander Tropsha, and Stefano Curtarolo. Materials cartography: Representing and mining materials space using structural and electronic fingerprints. Chem. Mater., 27:735, 2015.
 [12] Eugene F. Krause. Taxicab Geometry: An Adventure in Non-Euclidean Geometry. Dover Publications, 1987.
 [13] G. N. Lance and W. T. Williams. Computer programs for hierarchical polythetic classification (‘similarity analyses’). Comput. J., 9(1):60–64, 1966.
 [14] Turab Lookman, Francis J. Alexander, and Alan R. Bishop. Perspective: Codesign for materials science: An optimal learning approach. APL Mater., 4(5):053501, 2016.
 [15] Turab Lookman, Francis J. Alexander, and Krishna Rajan. Information Science for Materials Discovery and Design. Springer Publishing Company, 2015.
 [16] Matthias Rupp. Machine learning for quantum mechanics in a nutshell. Int. J. Quantum Chem., 115(16):1058–1073, 2015.
 [17] Grégoire Montavon, Katja Hansen, Siamac Fazli, Matthias Rupp, Franziska Biegler, Andreas Ziehe, Alexandre Tkatchenko, O. Anatole von Lilienfeld, and Klaus-Robert Müller. Learning invariant representations of molecules for atomization energy prediction. In Proceedings of NIPS'12, Volume 1, pages 440–448, 2012.
 [18] Kevin P. Murphy, editor. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
 [19] Shyue P. Ong, William D. Richards, Anubhav Jain, Geoffroy Hautier, Michael Kocher, Shreyas Cholia, Dan Gunter, Vincent L. Chevrier, Kristin A. Persson, and Gerbrand Ceder. Python materials genomics (pymatgen): A robust, open-source Python library for materials analysis. Comput. Mater. Sci., 68:314–319, 2013.
 [20] Tien L. Pham, Hiori Kino, Kiyoyuki Terakura, Takashi Miyake, Koji Tsuda, Ichigaku Takigawa, and Hieu C. Dam. Machine learning reveals orbital interaction in materials. Sci. Technol. Adv. Mater., 18(1):756–765, 2017.
 [21] Ghanshyam Pilania, Chenchen Wang, Xun Jiang, Sanguthevar Rajasekaran, and Ramamurthy Ramprasad. Accelerating materials property predictions using machine learning. Sci. Rep., 3:2810, 2013.
 [22] Krishna Rajan. Materials informatics: The materials “Gene” and big data. Annu. Rev. Mater. Sci., 45:153–169, 2015.
 [23] John R. Rodgers and David Cebon. Materials informatics. MRS Bull., 31(12):975–980, 2006.
 [24] Matthias Rupp, Alexandre Tkatchenko, Klaus-Robert Müller, and O. Anatole von Lilienfeld. Fast and accurate modeling of molecular atomization energies with machine learning. Phys. Rev. Lett., 108:058301, 2012.
 [25] Yousef Saad, Da Gao, Thanh Ngo, Scotty Bobbitt, James R. Chelikowsky, and Wanda Andreoni. Data mining for materials: Computational experiments with AB compounds. Phys. Rev. B, 85:104104, 2012.
 [26] Amit Singhal. Modern information retrieval: A brief overview. IEEE Data Eng. Bull., 24(4):35–43, 2001.
 [27] Keisuke Takahashi and Yuzuru Tanaka. Materials informatics: A journey towards material design and synthesis. Dalton Trans., 45:10497–10499, 2016.