A Chemical Bond-Based Representation of Materials

Van-Doan Nguyen(a), Le Dinh Khiet(a), Pham Tien Lam(a,b), Dam Hieu Chi(a,b,c,*)

(a)Japan Advanced Institute of Science and Technology
1-1 Asahidai, Nomi, Ishikawa 923-1211, Japan.
(b)Elements Strategy Initiative Center for Magnetic Materials,
National Institute for Materials Science,
1-2-1 Sengen, Tsukuba, Ibaraki 305-0047, Japan.
(c)Center for Materials Research by Information Integration,
Research and Services Division of Materials Data and Integrated System,
National Institute for Materials Science,
1-2-1 Sengen, Tsukuba, Ibaraki 305-0047, Japan.

Email: (*)dam@jaist.ac.jp

This paper introduces a new representation method that is mainly based on the chemical bonds among atoms in materials. Each chemical bond and its surrounding atoms are considered a unified unit, or local structure, that is expected to reflect part of the material's nature. First, a material is separated into local structures, each of which is represented as a matrix computed from information about the corresponding chemical bond and the orbital-field matrices of the two bonded atoms. Then, all local structures of the material are summarized from a statistical point of view. In the experiment, the new method was applied to a materials informatics application that aims at predicting atomization energies using the QM7 data set. The results show that the new method is more effective than two state-of-the-art representation methods in most cases.


1 Introduction

As remarked in [14], a key element of developing advanced materials is to learn from materials knowledge and available materials data to guide the next experiments or calculations in order to focus on materials with targeted properties. Traditionally, materials knowledge has been discovered by experimental studies. In the last few decades, the knowledge has also been discovered by a conventional approach, called computational materials science, whose scope is to model or predict the behavior of materials based on their composition, micro-structure, process history, and interactions.

Recently, the development of materials informatics [1, 23], a combination of materials science and data science, has opened up a new opportunity for accelerating the discovery of new materials knowledge. According to the literature, data science [6] is a field of study that employs a wide range of data-driven techniques from many research fields, such as applied mathematics, statistics, computational science, information science, and computer science, in order to understand and analyze data. In materials informatics, data-driven techniques are applied to existing materials data for the purpose of automatically discovering new materials knowledge such as hidden features, hidden chemical and physical rules, and new patterns [10, 11, 25]. Remarkably, materials informatics is expected not only to provide foundations for a new paradigm of materials discovery [22], but also to be the next generation of exploring new materials [27].

Over the years, a large volume of materials data has been generated [15]. These data are commonly described by a set of atoms with their coordinates and periodic unit cell vectors, and are categorized as unstructured data [20]. In practice, data-driven techniques can hardly be applied directly to materials data. Before such techniques are applied, materials data have to be transformed into new representations (or descriptors). The representations need to reflect the nature of materials and the actuating mechanisms of chemical and physical phenomena. In addition, the representations must support operations such as comparison and calculation.

So far, various methods for representing materials have been developed. Behler and co-workers [3, 7, 8] utilized atom-distribution-based symmetry functions to represent the local chemical environment of atoms and employed a multilayer perceptron to map this representation to atomic energy. The arrangement of structural fragments has also been used to represent materials in order to predict the physical properties of molecular and crystalline systems [21]. Isayev and co-workers used band structure and density-of-states fingerprint vectors as a representation of materials to visualize material space [11]. Rupp and co-workers developed a representation known as the Coulomb matrix for the prediction of atomization energies and formation energies [9, 16, 24]. In [20], the authors pointed out that the distribution of valence orbitals (electrons) of atoms in materials is important information that should be included in the representation of materials. The authors of [20] also proposed a representation method, called the orbital-field matrix, which exploits this distribution.

It is well known that the properties of most materials are determined by their chemical bonds, which may result from the electrostatic attraction between atoms with opposite charges or from the sharing of electrons. In addition, chemical bonds hold an enormous amount of energy, and forming and breaking chemical bonds is part of the energy cycle. Therefore, in this research, we aim at developing a new representation method that is mainly based on chemical bonds. In short, the main contributions of the research are (1) a new method to exploit the chemical bonds of atoms in materials and (2) a new method to utilize the local structures of a material by adopting a statistical point of view.

2 The proposed representation method

Generally, a material is composed of chemical bonds that connect atoms together. Let us consider a material, denoted by , which consists of chemical bonds denoted by . Assume that a chemical bond with is generated by a connection between two atoms and , and that this bond is surrounded by several other atoms, each of which can connect to atom or atom , as illustrated in Figure 1. The surrounding atoms generate a chemical environment that holds chemical bond in a stable state.

Figure 1: Chemical bond and its chemical environment.

Chemical bond and its chemical environment can be considered as a unified unit corresponding to a local structure of material . In other words, material can be separated into local structures corresponding to chemical bonds and their chemical environments.

Atoms are represented by 32-dimensional vectors, called one-hot vectors [20], which are generated by using a set of valence subshell orbitals (e.g. indicates that the valence orbital holds 2 electrons in the electron configuration). In addition, we adopt the orbital-field matrix method [20] for representing the two atoms and . Let and denote the two orbital-field matrices corresponding to atoms and , respectively. The two matrices and are defined by


where and are two one-hot vectors corresponding to atoms and , and and are two vectors representing chemical environments of these two atoms [20]. Two vectors and are defined by:


where and are the total numbers of atoms connected to atoms and respectively, and is a coefficient representing the importance of atom . Weight coefficient is defined by


where is the solid angle determined by the face of the Voronoi polyhedron [2] that separates atom and its connected atom (atom or atom ), is the maximum solid angle among the solid angles corresponding to the atoms that connect to the connected atom, and is the distance between atom and its connected atom.
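The construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: it assumes, following the orbital-field matrix of [20], that an atom's matrix is the outer product of its own one-hot vector with its environment vector, and that the environment vector is the weighted sum of the neighbours' one-hot vectors. The weights are passed in as plain numbers rather than computed from Voronoi solid angles, the function names are hypothetical, and a 4-dimensional toy vector stands in for the 32-dimensional one.

```python
def environment_vector(neighbor_vectors, weights):
    """Weighted sum of the neighbours' one-hot vectors (assumed form)."""
    if not neighbor_vectors:
        raise ValueError("at least one neighbour is required")
    if len(neighbor_vectors) != len(weights):
        raise ValueError("one weight per neighbour is required")
    dim = len(neighbor_vectors[0])
    env = [0.0] * dim
    for vec, w in zip(neighbor_vectors, weights):
        for i, x in enumerate(vec):
            env[i] += w * x
    return env


def orbital_field_matrix(center_vector, env_vector):
    """Outer product: entry (i, j) pairs orbital i of the central atom
    with orbital j of its environment."""
    return [[c * e for e in env_vector] for c in center_vector]


# Toy data: 4-dimensional vectors instead of the 32-dimensional ones.
center = [1, 0, 1, 0]
neighbors = [[1, 0, 0, 0], [0, 1, 1, 0]]
weights = [1.0, 0.5]  # placeholder weights, not derived from geometry

env = environment_vector(neighbors, weights)
ofm = orbital_field_matrix(center, env)
```

Rows of the resulting matrix are non-zero only for the orbitals present in the central atom, which is what lets the matrix record which central orbitals face which neighbouring orbitals.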

Chemical bond and its chemical environment are then represented by a matrix as follows:


where and are the coefficients representing the importance of atoms and in chemical bond , respectively. Coefficients and should be selected according to the specific application; here, we propose that these coefficients be computed by the following equation:


where and are the atomic numbers of two atoms and respectively, and is the distance between these two atoms.

Because material contains chemical bonds, this material is separated into local structures corresponding to matrices . From a statistical point of view, the set containing the number of local structures together with their mean and standard deviation can be used to describe material . Here, the mean and standard deviation of the local structures, denoted by and , are defined as follows:


We propose using this set to represent material . Furthermore, in order to apply data-driven techniques, the representation of material needs to be transformed into a vector or matrix. Therefore, the mean and standard deviation matrices are raveled (flattened) and then combined with the number of chemical bonds in order to form a vector. In other words, material is represented by a vector as follows:
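As a rough illustration of this step, the sketch below builds such a vector from a list of equally shaped local-structure matrices. It is a minimal sketch under assumptions that are not fixed by the text: the standard deviation is taken as the population standard deviation, the bond count is placed first in the vector, and the function name `represent_material` is hypothetical.

```python
from statistics import fmean, pstdev


def represent_material(local_matrices):
    """Flatten the element-wise mean and standard deviation of the
    local-structure matrices and prepend the number of bonds.

    Assumptions: population standard deviation, count placed first.
    """
    if not local_matrices:
        raise ValueError("a material needs at least one local structure")
    # Ravel every matrix row by row into a flat list.
    flat = [[x for row in m for x in row] for m in local_matrices]
    # Group the values that share the same matrix position.
    columns = list(zip(*flat))
    mean = [fmean(col) for col in columns]
    std = [pstdev(col) for col in columns]
    return [float(len(local_matrices))] + mean + std


# Two toy 2x2 local structures (real ones would be 32x32).
bonds = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[3.0, 2.0], [1.0, 4.0]],
]
vector = represent_material(bonds)
```

For matrices of size 32x32 the resulting vector has 1 + 2 * 1024 entries, independent of how many bonds the material contains, which is what makes materials of different sizes directly comparable.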


Let us consider two materials represented as and respectively. One can employ various distance measures to quantify the similarity between these two materials, such as those listed below:

  • Euclidean distance [5], denoted by :

  • Manhattan distance [12], denoted by :

  • Cosine distance [26], denoted by :

  • Bray-Curtis distance [4], denoted by :

  • Canberra distance [13], denoted by :
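
The five measures listed above can be sketched in plain Python. The definitions below are the standard textbook forms, which is an assumption about the exact variants used here; in particular, cosine distance is taken as one minus the cosine similarity, and Canberra terms whose denominator is zero are skipped. The function names are illustrative.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine(u, v):
    """One minus the cosine similarity (assumed variant)."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (norm_u * norm_v)


def bray_curtis(u, v):
    denom = sum(abs(a + b) for a, b in zip(u, v))
    if denom == 0:
        raise ValueError("Bray-Curtis distance is undefined here")
    return sum(abs(a - b) for a, b in zip(u, v)) / denom


def canberra(u, v):
    total = 0.0
    for a, b in zip(u, v):
        denom = abs(a) + abs(b)
        if denom != 0:  # skip 0/0 terms
            total += abs(a - b) / denom
    return total


# Toy vectors standing in for two material representations.
u = [1.0, 0.0, 2.0]
v = [3.0, 0.0, 1.0]
distances = {f.__name__: f(u, v)
             for f in (euclidean, manhattan, cosine, bray_curtis, canberra)}
```

Because Canberra normalizes every coordinate separately, it weights small-valued entries far more heavily than the other four measures do, which may help explain why its ranking of the representations in Table 1 differs from the rest.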


3 Experiment

To evaluate the new representation method, we applied it to a materials informatics application that aims at predicting atomization energies by using machine learning [18]. For analyzing materials data in the application, we selected regression with two learning algorithms, k-nearest neighbors (KNN) [18] and kernel ridge (KR) [18]. Additionally, we selected the QM7 data set [24] for the application. This data set contains 7165 materials (molecules), each of which is composed of at most 23 atoms drawn from C, N, O, S, and H. The coordinates of atoms in the materials are given in the Cartesian coordinate system. The Coulomb matrices and atomization energies of the materials are available in the data set, and the atomization energies range from -800 to -2000 . To determine the chemical bonds between atoms in materials, we employed pymatgen [19], an open-source library for analyzing materials. However, Voronoi polyhedra [19] could not be determined for 250 materials, so these were eliminated from the data set. As a consequence, 6915 materials were actually used in the experiments.

For comparison, we selected two state-of-the-art representation methods, orbital-field matrix [20] and Coulomb matrix (eigenspectrum) [17, 24], as baselines. To measure the performance of predicting atomization energies, we used three well-known assessment methods [18]: mean absolute error (MAE), root-mean-square error (RMSE), and coefficient of determination (R²). Moreover, we applied 10-fold cross-validation repeated 5 times in the experiments.
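The evaluation protocol can be sketched as follows. This is a minimal sketch rather than the code used in the experiments: the coefficient of determination is assumed to be one minus the ratio of residual to total sum of squares, the folds are assumed to be drawn by reshuffling the sample indices on every repeat, and the function names and the seed are hypothetical.

```python
import math
import random


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_square_error(y_true, y_pred):
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def r_squared(y_true, y_pred):
    """Assumed definition: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def repeated_kfold(n_samples, n_folds=10, n_repeats=5, seed=0):
    """Yield (train_indices, test_indices) for every fold of every repeat."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # new random partition on each repeat
        for fold in range(n_folds):
            test = indices[fold::n_folds]
            held_out = set(test)
            train = [i for i in indices if i not in held_out]
            yield train, test


# Made-up numbers, only to exercise the functions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]
scores = (
    mean_absolute_error(y_true, y_pred),
    root_mean_square_error(y_true, y_pred),
    r_squared(y_true, y_pred),
)
splits = list(repeated_kfold(20, n_folds=10, n_repeats=5, seed=0))
```

With 10 folds and 5 repeats every material is predicted exactly five times as a held-out sample, and the reported scores would be averages over those 50 train/test splits.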

Figure 2: The impact of the lengths of chemical bonds on prediction performance according to assessment method .

In order to measure the impact of the distances between atoms in chemical bonds (that is, the lengths of chemical bonds) on the performance of predicting atomization energies, we chose the KNN learning algorithm with the number of nearest neighbors (denoted by ) and the Euclidean distance. The performance according to assessment method is presented in Figure 2. As can be seen in this figure, the performance increases when and then decreases when . The application also achieves high prediction accuracy when the values of range from 3 to 5.
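For readers unfamiliar with KNN regression, the sketch below shows the basic prediction step with the Euclidean distance. It is a minimal sketch, assuming an unweighted average over the nearest neighbours; the function name, the toy data, and the value of `k` are illustrative and are not the settings of the experiment.

```python
import math


def knn_predict(train_x, train_y, query, k):
    """Predict a target as the unweighted mean over the k training
    samples closest to `query` in Euclidean distance (assumed scheme)."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of samples")
    ranked = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    return sum(y for _, y in ranked[:k]) / k


# Toy representation vectors and made-up target energies.
train_x = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
train_y = [-10.0, -20.0, -30.0, -100.0]

pred_k2 = knn_predict(train_x, train_y, [0.0, 0.0], k=2)
pred_k3 = knn_predict(train_x, train_y, [0.0, 0.0], k=3)
```

Swapping `math.dist` for any other distance function changes the neighbourhood and therefore the prediction, which is why the choice of distance measure is evaluated separately in Table 1.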

Figure 3: Comparison of predicted atomization energies by using KNN (part a) and KR (part b) learning algorithms and reference atomization energies calculated by using DFT.

Next, we measured the prediction performance of the proposed representation method with and both learning algorithms, KNN and KR. For KNN, we selected and the Euclidean distance; for KR, we selected the Laplacian kernel [18]. The prediction results are illustrated in Figure 3. In this figure, parts (a) and (b) show the prediction performance of the KNN and KR learning algorithms, respectively. As can be observed, KR performs better than KNN.

                           MAE                       RMSE                         R²
Distance          (*)     (**)    (***)      (*)     (**)    (***)      (*)     (**)    (***)
Euclidean      12.877   14.411   78.721   24.071   30.015  102.528    0.988    0.981    0.790
Manhattan      11.447   14.102   68.664   22.967   30.218   90.181    0.989    0.980    0.838
Cosine         26.690   42.836   85.885   55.503   97.061  111.835    0.934    0.798    0.751
Bray-Curtis    11.684   14.346   68.829   23.665   30.839   90.347    0.988    0.980   0.8372
Canberra       71.527   47.010   18.832  110.528   72.887   25.526    0.738    0.886    0.987

  • (*) Chemical bond-based

  • (**) Orbital-field matrix

  • (***) Coulomb matrix (eigenspectrum)

Table 1: Cross-validated MAE, RMSE, and R² in the prediction of the atomization energies obtained by using learning algorithm KNN with the selected distance measurement methods.

To compare the proposed representation method with the two selected baselines, we also selected . The results of the comparison are summarized in Tables 1 and 2. In these tables, each assessment method for a representation method is shown in a column, and the bold values indicate the best performance in each row according to the corresponding assessment method. As detailed in Table 1, the proposed representation method is better than the two baselines with the first four distance measures, whereas the Coulomb matrix representation is more effective than the proposed method and the other baseline with the Canberra distance. In addition, as can be seen in Table 2, the proposed method achieves the best performance according to MAE, and the Coulomb matrix representation obtains the best performance according to RMSE and R². However, as observed in Table 2, the performance of the proposed method is comparable with that of the Coulomb matrix representation.

                           MAE                       RMSE                         R²
Kernel            (*)     (**)    (***)      (*)     (**)    (***)      (*)     (**)    (***)
Laplacian       9.934   13.942    9.960   15.106   24.769   13.886    0.995    0.987    0.996

  • (*) Chemical bond-based

  • (**) Orbital-field matrix

  • (***) Coulomb matrix (eigenspectrum)

Table 2: Cross-validated MAE, RMSE, and R² in the prediction of the atomization energies obtained by learning algorithm KR.

4 Conclusion

In this paper, we have proposed a new method for representing materials in materials informatics applications. This method focuses on exploiting information about the chemical bonds among atoms in materials and also inherits the benefit of the orbital-field matrix representation, which is based on the distribution of valence shell electrons. Additionally, we have demonstrated that different similarity measures can be integrated with the proposed method. Note that the proposed method can be applied to a large diversity of atomic compositions and structures, and can facilitate learning and predicting targeted properties of molecular and crystalline systems.

In the experiment, the proposed method was tested with an application that aims to predict atomization energies, and the results indicate that the proposed method is more effective than the two selected baselines in most cases. In the near future, we plan to further evaluate the proposed method by using different materials data as well as other materials informatics applications.


  • [1] Ankit Agrawal and Alok Choudhary. Perspective: Materials informatics and big data: Realization of the “fourth paradigm” of science in materials science. APL Mater., 4(053208):1–10, 2016.
  • [2] Franz Aurenhammer. Voronoi diagrams - a survey of a fundamental geometric data structure. ACM Comput. Surv., 23(3), 1991.
  • [3] Jörg Behler. Atom-centered symmetry functions for constructing high-dimensional neural network potentials. J. Chem. Phys., 134:074106, 2011.
  • [4] J. Roger Bray and J. T. Curtis. An ordination of the upland forest communities of Southern Wisconsin. Ecol. Monogr., 27(4):325–349, 1957.
  • [5] Michel M. Deza and Elena Deza. Encyclopedia of Distances. Springer Berlin Heidelberg, 2009.
  • [6] Vasant Dhar. Data science and prediction. Commun. ACM, 56(12):64–73, 2013.
  • [7] Hagai Eshet, Rustam Z. Khaliullin, Thomas D. Kühne, Jörg Behler, and Michele Parrinello. Ab initio quality neural-network potential for sodium. Phys. Rev. B, 81:184107, 2010.
  • [8] Hagai Eshet, Rustam Z. Khaliullin, Thomas D. Kühne, Jörg Behler, and Michele Parrinello. Microscopic origins of the anomalous melting behavior of sodium under high pressure. Phys. Rev. Lett, 108:115701, 2012.
  • [9] Felix Faber, Alexander Lindmaa, O. Anatole von Lilienfeld, and Rickard Armiento. Crystal structure representations for machine learning models of formation energies. Int. J. Quantum Chem., 115(16):1094–1101, 2015.
  • [10] Luca M. Ghiringhelli, Jan Vybiral, Sergey V. Levchenko, Claudia Draxl, and Matthias Scheffler. Big data of materials science: Critical role of the descriptor. Phys. Rev. Lett., 114:105503, 2015.
  • [11] Olexandr Isayev, Denis Fourches, Eugene N. Muratov, Corey Oses, Kevin Rasch, Alexander Tropsha, and Stefano Curtarolo. Materials cartography: Representing and mining materials space using structural and electronic fingerprints. Chem. Mater, 27:735, 2015.
  • [12] Eugene F. Krause. Taxicab Geometry: An Adventure in Non-Euclidean Geometry. Dover Publications, 1987.
  • [13] G. N. Lance and W. T. Williams. Computer programs for hierarchical polythetic classification (‘similarity analyses’). Comput. J., 9(1):60–64, 1966.
  • [14] Turab Lookman, Francis J. Alexander, and Alan R. Bishop. Perspective: Codesign for materials science: An optimal learning approach. APL Mater., 4(5):053501, 2016.
  • [15] Turab Lookman, Francis J. Alexander, and Krishna Rajan. Information Science for Materials Discovery and Design. Springer Publishing Company, 2015.
  • [16] Matthias Rupp. Machine learning for quantum mechanics in a nutshell. Int. J. Quantum Chem., 115(16):1058–1073, 2015.
  • [17] Grégoire Montavon, Katja Hansen, Siamac Fazli, Matthias Rupp, Franziska Biegler, Andreas Ziehe, Alexandre Tkatchenko, O. Anatole von Lilienfeld, and Klaus-Robert Müller. Learning invariant representations of molecules for atomization energy prediction. In Proceedings of NIPS’12 - Volume 1, pages 440–448, 2012.
  • [18] Kevin P. Murphy, editor. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
  • [19] Shyue P. Ong, William D. Richards, Anubhav Jain, Geoffroy Hautier, Michael Kocher, Shreyas Cholia, Dan Gunter, Vincent L. Chevrier, Kristin A. Persson, and Gerbrand Ceder. Python materials genomics (pymatgen): A robust, open-source python library for materials analysis. Comp. Mater. Sci, 68:314 – 319, 2013.
  • [20] Tien L. Pham, Hiori Kino, Kiyoyuki Terakura, Takashi Miyake, Koji Tsuda, Ichigaku Takigawa, and Hieu C. Dam. Machine learning reveals orbital interaction in materials. Sci. Technol. Adv. Mater., 18(1):756–765, 2017.
  • [21] Ghanshyam Pilania, Chenchen Wang, Xun Jiang, Sanguthevar Rajasekaran, and Ramamurthy Ramprasad. Accelerating materials property predictions using machine learning. Sci. Rep., 3:2810, 2013.
  • [22] Krishna Rajan. Materials informatics: The materials “Gene” and big data. Annu. Rev. Mater. Sci., 45:153–169, 2015.
  • [23] John R. Rodgers and David Cebon. Materials informatics. MRS Bull., 31(12):975–980, 2006.
  • [24] Matthias Rupp, Alexandre Tkatchenko, Klaus-Robert Müller, and O. Anatole von Lilienfeld. Fast and accurate modeling of molecular atomization energies with machine learning. Phys. Rev. Lett., 108:058301, 2012.
  • [25] Yousef Saad, Da Gao, Thanh Ngo, Scotty Bobbitt, James R. Chelikowsky, and Wanda Andreoni. Data mining for materials: Computational experiments with compounds. Phys. Rev. B, 85:104104, 2012.
  • [26] Amit Singhal. Modern information retrieval: A brief overview. IEEE Data Eng. Bull., 24(4):35–43, 2001.
  • [27] Keisuke Takahashi and Yuzuru Tanaka. Materials informatics: A journey towards material design and synthesis. Dalton Trans., 45:10497–10499, 2016.