DScribe: Library of Descriptors for Machine Learning in Materials Science
Abstract
DScribe is a software package for machine learning that provides popular feature transformations (“descriptors”) for atomistic materials simulations. DScribe accelerates the application of machine learning for atomistic property prediction by providing userfriendly, offtheshelf descriptor implementations. The package currently contains implementations for Coulomb matrix, Ewald sum matrix, sine matrix, Manybody Tensor Representation (MBTR), Atomcentered Symmetry Function (ACSF) and Smooth Overlap of Atomic Positions (SOAP). Usage of the package is illustrated for two different applications: formation energy prediction for solids and ionic charge prediction for atoms in organic molecules. The package is freely available under the opensource Apache License 2.0.
keywords:
machine learning, materials science, descriptor, python, open sourcenumbers,sort&compress
PROGRAM SUMMARY
Program Title: DScribe
Licensing provisions: Apache2.0
Programming language: Python/C/C++
Supplementary material: Supplementary Information as PDF
Nature of problem: The application of machine learning for materials science is hindered by the lack of consistent software implementations for feature transformations. These feature transformations, also called descriptors, are a key step in building machine learning models for property prediction in materials science.
Solution method: We have developed a library for creating common descriptors used in machine learning applied to materials science. We provide an implementation the following descriptors: Coulomb matrix, Ewald sum matrix, sine matrix, Manybody Tensor Representation (MBTR), Atomcentered Symmetry Functions (ACSF) and Smooth Overlap of Atomic Positions (SOAP). The library has a python interface with computationally intensive routines written in C or C++. The source code, tutorials and documentation are provided online. A continuous integration mechanism is set up to automatically run a series of regression tests and check code coverage when the codebase is updated.
1 Introduction
Machine learning of atomistic systems is a highly active, interdisciplinary area of research. The power of machine learning lies in the ability of interpolating existing calculations with surrogate models to accelerate predictions for new systems Takahashi and Tanaka (2016); Zdeborová (2017); Gubernatis and Lookman (2018); Butler et al. (2018). The set of possible applications is very rich, including highthroughput search of stable compounds with machine learning based energy predictions for solids Ward et al. (2017); Huo and Rupp (2017); Faber et al. (2015); Seko et al. (2017), accelerated molecular property prediction Rupp et al. (2012); Stuke et al. (2018); Ghosh et al. (2019), creation of new forcefields based on quantum mechanical training data Smith et al. (2017); Bartók et al. (2010, 2013); Chmiela et al. (2017); Behler (2011, 2016); Li et al. (2017a); Jacobsen et al. (2018), search for catalytically active sites in nanoclusters Li et al. (2017b); Goldsmith et al. (2018); Chowdhury et al. (2018); Jäger et al. (2018); Zahrt et al. (2019) and efficient optimization of complex structures Kiyohara et al. (2016); Zalake et al. (2017); Todorović et al. (2019).
Atomistic machine learning establishes a relationship between the atomic structure of a system and its properties. This so called structureproperty relation is illustrated in Fig. 1. It is analogous to the structureproperty relation in quantum mechanics. For a set of nuclear charges and atomic positions of a system, the solution of the Schrödinger equation yields the properties of the system since both the Hamiltonian and the wave function depend only on and . Atomistic machine learning bypasses the computationally costly step of solving the Schrödinger equation^{1}^{1}1Of course in practice the Schrödinger equation is not solved directly but by approximate methods. Also approximate solutions of the Schrödinger equation are computationally expensive, compared to surrogate machine learning models. by training a surrogate model. Once trained, the surrogate model is typically very fast to evaluate facilitating almost instant structureproperty predictions.
Unlike for the Schrödinger equation, the nuclear charges and atomic positions are not a suitable input representation of atomistic systems for machine learning. They are, for example, not rotationally or translationally invariant. If presented with atomic positions, the machine learning method would have to learn rotational and translational invariance for every data set, which would significantly increase the amount of required training data. For this reason, the input data has to be transformed into a representation that is suitable for machine learning. This transformation step is often referred to as feature engineering and the selected features are called a descriptor Ghiringhelli et al. (2015)^{2}^{2}2Sometimes the separation between a descriptor and a learning model is however blurred. For example, in various models based on neural networks Xie and Grossman (2018); Arbabzadah et al. (2017); Gilmer et al. (2017); Schütt et al. (2018) the description of the atomistic structure is embedded inside the weights of the network, essentially mixing the descriptor and the learning model together.. Various feature engineering approaches have been proposed Rupp et al. (2012); Montavon et al. (2012); Hansen et al. (2015); Faber et al. (2015); Bartók et al. (2013); Huo and Rupp (2017); Behler (2011); Gastegger et al. (2018); Ward et al. (2017); Isayev et al. (2017); Faber et al. (2016, 2018); Pronobis et al. (2018); Seko et al. (2017), and often multiple approaches have to be tested to find a suitable representation for a specific task Faber et al. (2017). Features are often based on the atomic structure, but it is also common to extend the input to other system properties Ghiringhelli et al. (2015); Ouyang et al. (2018); Isayev et al. (2017); Ward et al. (2017).
There are several requirements for good descriptors in atomistic machine learning Huo and Rupp (2017); Faber et al. (2015). We identify the following properties to be most important for an ideal descriptor:

Invariant with respect to spatial translation of the coordinate system: isometry of space.

Invariant with respect to rotation of the coordinate system: isotropy of space.

Invariant with respect to permutation of atomic indices: changing the enumeration of atoms does not affect the physical properties of the system.

Unique: there is a single way to construct a descriptor from an atomic structure and the descriptor itself corresponds to a single property.

Continuous: small changes in the atomic structure should translate to small changes in the descriptor.

Compact: the descriptor should contain sufficient information of the system for performing the prediction while keeping the feature count to minimum.

Computationally cheap: the computation of the descriptor should be significantly faster than any existing computational model for directly calculating the physical property of interest.
In this article we present the DScribe package that can be used to transform atomic structures into machinelearnable input features. The aim of this software is to provide a coherent and easily extendable implementation for atomistic machine learning and fast prototyping of descriptors. Currently in the DScribe package we include descriptors that can be represented in a vectorial form and are not dependent on any specific learning model. By decoupling the descriptor creation from the machine learning model, the user can experiment in parallel with various descriptor/model combinations and has the possibility of directly applying emerging learning models on existing data. This freedom to switch between machine learning models becomes important because currently no universally best machine model exists for every problem, as stated by the “No Free Lunch Theorem” Wolpert and Macready (1997). In practice this means that multiple models have to be tested to find optimal performance. Furthermore, vectorial features provide easier insight into the importance of certain features and facilitate the application of unsupervised learning methods, such as clustering and subsequent visualization with informative “materials maps” Ceriotti et al. (2011); De et al. (2016); Isayev et al. (2015).
Descriptors that encode an atomic structure are typically designed to either depict a local atomic environment, or the structure as a whole. Global descriptors encode information about the whole atomic structure. These global descriptors can be used to predict properties related to the structure as a whole, such as molecular energies Rupp et al. (2012), formation energies Ward et al. (2017) or band gaps Isayev et al. (2017). In this work we cover four such global descriptors: the Coulomb matrix Rupp et al. (2012), the Ewald sum matrix Faber et al. (2015), the sine matrix Faber et al. (2015) and the ManyBody Tensor Representation (MBTR) Huo and Rupp (2017). Local descriptors are instead used to represent a localized region in an atomic structure, and are thus suitable for predicting localized properties, like atomic forces Bartók et al. (2010), adsorption energiesJäger et al. (2018), or properties that can be summed from local contributions. In this article we discuss two local descriptors, Atomcentered Symmetry functions (ACSFs) Behler (2011) and the Smooth Overlap of Atomic Positions (SOAP) Bartók et al. (2013).
We first introduce the descriptors that have been implemented in the DScribe package and then we discuss the structure and usage of the package. After this we illustrate the applicability of the package by showing results for formation energy prediction of periodic crystals and partial charge prediction for molecules. We conclude, by addressing the impact and future extensions of this package.
2 Descriptors
Here we briefly introduce the different descriptors that are currently implemented in DScribe. In some cases, we have deviated from the original literature due to computational or other reasons, and if so this is explicitly mentioned. For more indepth presentations of the descriptors we refer the reader to the original research papers. At the end of this section we also discuss methods for organizing the descriptor output so that it can be effectively used in typical machine learning applications.
2.1 Coulomb matrix
The Coulomb matrix Rupp et al. (2012) encodes the atomic species and interatomic distances of a finite system in a pairwise, twobody matrix inspired by the form of the Coulomb potential. The elements of this matrix are given by:
where is the atomic number, and is the Euclidean distance between atoms and . The form of the diagonal terms was determined by fitting the potential energy of neutral atoms Ramakrishnan et al. (2015).
2.2 Ewald sum matrix
The Ewald sum matrix Faber et al. (2015) can be viewed as a logical extension of the Coulomb matrix for periodic systems. In periodic systems each atom is infinitely repeated in the three crystal lattice vector directions, , and and the electrostatic interaction between two atoms becomes
(1) 
where is the sum over all lattice vectors .
For , this sum converges only conditionally and will become infinite if the system is not charge neutral. In the Ewald sum matrix, the Ewald summation technique Ewald (1921); Toukmaji and Board (1996) and a neutralizing background charge Hub et al. (2014) is used to force this sum to converge. One can separate the total Ewald energy into pairwise components, which will result in the following matrix:
where the terms are given by
(2)  
(3)  
(4)  
(5) 
Here the primed notation means that when the pairs are not taken into account. is the screening parameter controlling the size of the gaussian charge distributions used in the Ewald method, is a reciprocal space lattice vector with an implicit prefactor and is the volume of the cell. A more detailed derivation is given in the supplementary information. By default we use the value Jackson and Catlow (1988), where is the number of atoms in the unit cell.
It is important to notice that the offdiagonal contribution given here differs from the original work. In the original formulation this sum was defined as Faber et al. (2015) . Our correction makes the total matrix elements independent of the screening parameter , which is not the case in the original formulation.
2.3 Sine matrix
The Ewald sum matrix encodes the correct Coulomb interaction between atoms, but can become computationally heavy especially for large systems. The sine matrix Faber et al. (2015) captures some important features of interacting atoms in a periodic system with a much reduced computational cost. The matrix elements are defined by
where
(6) 
Here is a matrix formed by the lattice vectors and are the cartesian unit vectors. This functional form has no physical interpretation, but it captures some of the properties of the Coulomb interaction, such as the periodicity of the crystal lattice and an infinite energy when two atoms overlap.
The Coulomb, Ewald sum and sine matrices for diamond are depicted in Fig. 2. Notice that the matrices given here are not unique, as different cell sizes can be used for a periodic crystal, and the indexing of the rows and columns depends on the ordering of atomic indices in the structure. Section 2.7 discusses some ways to overcome the issues related to this nonuniqueness.
By construction the Coulomb matrix is not periodic as manifested by the unequivalent row elements in the matrix (one carbon in the system has four bonded neighbours, three carbons have two neighbours and four carbons have a single bonded neighbour). Conversely, both the Ewald sum and the sine matrix are periodic and correctly encode the identical environment of the carbon atoms in the lattice. As a result, each row and each column has the same matrix elements, but neighbouring rows and columns are shifted by one element relative to each other. Unlike the other matrices, Ewald sum matrix often contains negative elements due to the interaction of the positive atomic nuclei with the added uniform negative background charge. This energetically favourable interaction shifts the offdiagonal elements down in energy compared to the other two matrices. Moreover, the diagonal elements of the Ewald sum matrix encode the physical selfinteraction of atoms with their periodic copies, instead of the potential energy of the neutral atoms.
2.4 Manybody Tensor Representation
The manybody tensor representation (MBTR) Huo and Rupp (2017) encodes finite or periodic structures by breaking them down into distributions of differently sized structural motifs and grouping these distributions by the involved chemical elements. In MBTR, a geometry function is used to transform a configuration of atoms into a single scalar value representing that particular configuration. Our implementation provides terms up to , and as the geometry functions uses (atomic numbers), (inverse distances) and (cosines of angles). These scalar values are then broadened by using kernel density estimation with a gaussian kernel, leading to the following distributions
(7) 
(8) 
(9) 
Here is the standard deviation of the gaussian kernel and runs over a predefined range of values covering the possible values for . A weighted sum of the distributions are then made separately for each possible combination of chemical species present in the dataset. For these distributions are given by
(10)  
(11)  
(12) 
where the sums for , and run over all atoms with the atomic number , or respectively, and is a weighting function that is used to control the importance of different terms. When calculating MBTR for periodic systems, the periodic copies of atoms in neighbouring cells are taken into account by extending the simulation cell periodically, and at least one of the indices has to belong to an atom in the original simulation cell. Unlike in the original formulation Huo and Rupp (2017), we don’t include the possible correlation between chemical elements directly in equations (10)–(12). We don’t however lose any generality, as the correlation between chemical elements can be introduced as a postprocessing step that combines information from the different species.
For , typically no weighting is used: . In the case of and , the weighting function can, however be used to give more importance to values that correspond to configuration where the atoms are closer together. For fully periodic systems, a weighting function must be used, as otherwise the sums in equations (10)–(12) do not converge. For we provide exponential weighting functions of the form
(13)  
(14) 
where the parameter can be used to effectively tune the cutoff distance. For computational purposes a cutoff of can be defined to ignore any contributions for which .
Some of the distributions, for example and , contain identical information. In our implementation this symmetry is taken into consideration by only calculating the distributions for which the last atomic number is bigger or equal to the first: in the case of or in the case of . This reduces the computational time and the number of features in the final descriptor without losing information. The final MBTR output for a water molecule is illustrated in Fig. 3.
There are multiple systemdependent parameters that have to be decided for the MBTR descriptor. At each level , the broadness of the distribution has to be chosen. A too small value for will lead to a deltalike distribution that is very sensitive to differences in system configurations. Conversely, a too large value will make it hard to resolve individual peaks as the distribution becomes broad and featureless. The choice of the weighting function also has a direct effect on the final distribution, as it controls how much importance is given to atom combinations that are physically far apart. When combining information from multiple terms, it can be beneficial to control the contribution from different terms. As the number of features related to higher values is bigger, the machine learning model may by default give more importance to these higher terms. For example if similarity between two MBTR outputs is measured by an Euclidean distance in the machine learning model, individually normalizing the total output for each term to unit length helps in equalizing the information content of the different terms.
2.5 Atomcentered Symmetry Functions
Atomcentered Symmetry Functions (ACSFs) Behler (2011) can be used to represent the local environment near an atom by using a fingerprint composed of the output of multiple two and threebody functions that can be customized to detect specific structural features. ACSF encodes the configuration of atoms around a single central atom with index by using so called symmetry functions. The presence of atoms neighbouring the central atom are detected by using three different twobody symmetry functions , and , which are defined as
where the summation for runs over all atoms with atomic number , , and are userdefined parameters, and is a smooth cutoff function defined as
where is a cutoff radius.
Additionally, threebody functions may be used to detect specific motifs defined by three atoms, one being the central atom. These threebody functions include a dependence on the angle between triplets of atoms within cutoff, as well as their mutual distance. The package implements the following functions
(15)  
(16) 
where the summation of and runs over all atoms with atomic numbers or respectively, , and are userdefined parameters and is the angle between the three atoms (th atom in the center).
In practice, multiple symmetry functions of each type are used in the descriptor, each with a different parametrization (, , , and ), encoding different portions of the chemical environment. While no optimal parameter set exists, the choice of parameters is guided by the generally desired properties of a descriptor previously listed (unique, continuous and compact), and various alternatives for different applications can be found in the literature Gerrits et al. (2019); Artrith and Urban (2016); Nguyen et al. (2018).
The final fingerprint for a single atom can be constructed by concatenating the output from differently parametrized symmetry functions with consistent ordering, as illustrated in Figure 4. The list starts with the twobody ACSFs, ordered by atom type . For each type, appears first, bringing only one value since it has no parameter dependence. Next we find all values of calculated with different parameter pairs given by the user. The values of for all are found last. This sequence is repeated for each atomic type, sorted from lighter to heavier. Threebody ACSFs appear afterward: for each unique combination of chemical elements, we find the values of and given by all specified triplets of .
2.6 Smooth Overlap of Atomic Orbitals
The Smooth Overlap of Atomic Positions (SOAP) Bartók et al. (2013) can be used to encode a local environment within an atomic structure by using an expansion of a gaussian smeared atomic density based on spherical harmonics and radial basis functions. In SOAP, the atomic structure is first transformed into atomic density fields for each species by using unnormalized gaussians centered on each atom
(17) 
Here the summation for runs over atoms with the atomic number to build a separate density for each atomic element and the width of the gaussian is controlled by .
A local region surrounding the point of interest set at the origin may then be expanded with a set of orthonormal radial basis functions and spherical harmonics as
(18) 
where the coefficients can be obtained through
(19) 
For the spherical harmonics, an orthonormalized convention typical for quantum mechanics is used
(20) 
where are the associated Legendre polynomials.
The final rotationally invariant output from our SOAP implementation is the partial power spectra De et al. (2016) vector where the individual vector elements are defined as:
(21) 
The vector is constructed by concatenating the elements for all unique atomic number pairs , all unique pairs of radial basis functions up to and the angular degree values up to .
Spherical harmonics are a natural orthogonal and complete set of functions for the angular degrees of freedom. For the radial degree of freedom the selection of the basis set is not as trivial and multiple approaches may be used. In our implementation we, by default, use a set of spherical primitive gaussian type orbitals as radial basis functions. These basis functions are defined as
(22)  
(23) 
This basis set allows analytical integration of the coefficients defined by equation (19). This provides a speedup over various other radial basis functions that require numerical integration. Our current implementation provides the analytical solutions up to , with the possibility of adding more in the future.
The decay parameters are chosen so that each nonorthonormalized function decays to a threshold value of at a cutoff radius taken on an evenly spaced grid from 1Å to with a step of . Thus the parameter controls the maximum reach of the basis and a better sampling can be obtained by increasing the number of basis functions .
The weights are chosen so that the radial basis functions are orthonormal. For each value of angular degree , the orthonormalizing weights can be calculated with Löwdin orthogonalization Löwdin (1950):
(24)  
(25) 
where is the overlap matrix and is a matrix containing the weights .
We also provide an option for using the radial basis consisting of cubic and higher order polynomials, as introduced in the original SOAP article Bartók et al. (2013). This basis set is defined as:
(26)  
(27) 
The calculations with this basis are performed with efficient numerical integration and currently support .
The two different basis sets are compared in Figure 5. Most notably the form of the gaussian type orbitals depend on the angular degree , whereas the polynomial basis is independent of this value. It is also good to notice that between these two radial basis functions the definition of is somewhat different – whereas the polynomial basis is guaranteed to decay to zero at , the gaussian basis only approximately decays near this value and the decay is also affected by the orthonormalization.
2.7 Descriptor usage as machine learning input
In this section we discuss some of the methods for organizing the output from descriptors so that it can be efficiently used as input for machine learning.
The descriptor invariance against permutations of atomic indices – property iii) in the introduction – is directly achieved in MBTR, SOAP and ACSF by stratifying the output according to the involved chemical elements. The output is always ordered by a predefined order determined by the chemical elements that are included in the dataset, making the output independent of the indexing of individual atoms. The three matrix descriptors – the Coulomb matrix, Ewald sum matrix, and sine matrix – are, however, not invariant with respect to permutation of atomic indices, as the matrix columns and rows are ordered by atomic indices. However, there are different approaches for enforcing invariance for these matrices. One way is to encode the matrices by their eigenvalues, which are invariant to changes in the column and row ordering Rupp et al. (2012). Another way is to order the rows and columns by a chosen norm, typically the Euclidean norm Montavon et al. (2012). A third approach is to augment the dataset by creating multiple slightly varying matrices for each structure. In this approach multiple matrices are drawn from a statistical set of sorted matrices where Gaussian noise is added to the row norms before sorting Montavon et al. (2012). When the learning algorithm is trained over this ensemble of matrices it becomes more robust against small sorting differences that can be considered noise. All of these three approaches are available in our implementation.
Machine learning algorithms also often require constantsized input. Once again the stratification of the descriptor output by chemical elements makes the output for MBTR, ACSF and SOAP constant size. For the matrix descriptors a common way to achieve a uniform size for geometries with different amount of atoms, is by introducing zeropadding. This means that we first have to determine the largest system in the dataset. If this system has , we allocate matrices of size or a vectors or size if using matrix eigenvectors. The descriptor for each system will fill the first or many entries, with the rest being set to zero. If the machinelearning algorithms expects a onedimensional vector as input, the twodimensional matrices can be flattened by concatenating the rows together into a single vector.
Local descriptors, such as ACSF and SOAP, encode only local spatial regions and cannot be directly used as input for performing predictions related to entire structures. There are, however, various ways for combining information from multiple local sites to form a prediction for an entire structure. The descriptor output for multiple local sites can simply be averaged, a custom kernel can be used to combine information from multiple sites Willatt et al. (2018); De et al. (2016) or the predicted property can in some cases be directly modeled as a sum of local contributions Bartók et al. (2010).
3 Software structure
We use python as the default interfacing language through which the user interacts with the library. This decision was motivated by the existence of various python libraries, including ase Larsen et al. (2017), pymatgen Ong et al. (2013) and quippy qui (2019), that supply tools for creating, reading, writing and manipulating atomic structures. Our python interface does not, however, restrict the implementation to be made entirely in python. Python can easily interact with software libraries written with highperformance, statically typed languages such as C, C++ and Fortran. We use this dual approach by performing some of the most computationally heavy calculations either in C or C++.
An example of creating a descriptor for an atomic structure with the library is demonstrated in Fig. 6. It demonstrates the workflow that is common to all descriptors in the package. For each descriptor we define a class, from which objects can be instantiated with different descriptor specific setups.
All the descriptors have the sparseparameter that controls whether the created output is a dense or a sparse matrix. The possibility for creating a sparse output is given so that large and sparsely filled output spaces can be handled, as typically encountered when a dataset contains large amounts of different chemical elements. Various machine learning algorithms can make use of this sparse matrix output with linear algebra routines specifically designed for sparse data structures.
Once created, the descriptor object is ready to be used and provides different methods for interacting with it. All of the descriptors implement two methods: get_number_of_features and create. The get_number_of_featuresmethod can be used for querying the final number of features for the descriptor, even before a structure has been provided. This dimension can be used for initializing and reserving storage space for the resulting output array. create accepts one or multiple atomistic structures as an argument, and possibly other descriptorspecific arguments. It returns the final descriptor output that can be used in machine learning applications. To define atomic structures we use the ase.Atomsobject from the ase packageLarsen et al. (2017). The Atomsobjects are easy to create from structure files or build with the utilities provided by ase.
As the creation of a descriptor for an atomic system is completely independent from the other systems, it can be easily parallelized with data parallelism. For convenience we provide a possibility of parallelizing the descriptor creation for multiple samples over multiple processes. This can be done by simply providing the number of parallel jobs to instantiate with the n_jobsparameter as demonstrated in Figure 6.
The DScribe package is structured such that new descriptors can easily be added. We provide a python baseclass that defines a standard interface for the descriptors through abstract classes. One of our design goals is to provide a codebase in which researchers can make their own descriptors available to the whole community. All descriptor implementations are accompanied by a test module that defines a set of standard tests. These tests include tests for rotational, translational and index permutation invariance, as well as other tests for checking the interface and functionality of the descriptor. We have adapted a continuous integration system that automatically runs a series of regression tests when changes in the code are introduced. The code coverage is simultaneously measured as a percentage of visited code lines in the python interface.
The source code is directly available in github at https://github.com/SINGROUP/dscribe and we have created a dedicated home page at https://singroup.github.io/dscribe/ that provides additional tutorials and a full code documentation. For easy installation the code is provided through the python package index (pip) under the name dscribe.
4 Results and discussion
The performance of the DScribe package is illustrated for formation energy predictions for inorganic crystal structures and ionic charge prediction for atoms in organic molecules. These examples demonstrate the usage of the package in supervised machine learning tasks, but the output vectors can be as easily used in other learning tasks. For example the descriptors can be used as input for unsupervised clustering algorithms such as Tdistributed stochastic neighbor embedding (TSNE) van der Maaten and Hinton (2008) or Sketchmap Ceriotti et al. (2011) to analyse structureproperty relations in structural and chemical landscapes.
For simplicity we here restrict the machine learning model to be kernel ridge regression (KRR) as implemented in the scikitlearn Pedregosa et al. (2011)package. However, the vectorial nature of the output from all the introduced descriptors does not impose any specific learning scheme, and many other regressors can be used, including neural networks, decision trees and support vector regression.
4.1 Formation energy prediction for inorganic crystals
We demonstrate the use of multiple descriptors on the task of predicting the formation energy of inorganic crystals. The data comes from the Open Quantum Materials Database (OQMD) 1.1 Saal et al. (2013). We selected structures with a maximum of 10 atoms per unit cell and a maximum of 6 different atomic elements. Unconverged systems were filtered by removing samples which have a formation energy that is more than two standard deviations away from the mean, resulting in the removal of 96 samples. After these selections, 222 215 samples were left. The predictions were performed on randomly selected subsets with sizes 1024, 2048, 4096, 8192 and 16384. The mean absolute errors as a function of training set size are given in Figure 7.
The Coulomb matrix, Ewald sum matrix and sine matrix are used for the prediction with matrix rows and columns sorted by their Euclidean norm, and using the unit cell that was used for performing the formation energy calculation. The Coulomb matrix does not take the periodicity of the structure into account, but is included as a baseline for the other methods. We include MBTR with different values of and for each we individually optimize and with grid search. Figure 8 shows the error for each tested MBTR term, and the best performing one is included in Figure 7. To test the energy prediction by combining information from multiple local descriptors, as discussed in 2.7, we also include results using a simple averaged SOAP output for all atoms in the simulation cell. For SOAP we use the gaussian type orbital basis and fix and , but optimize the cutoff and gaussian width individually with grid search.
The possible descriptor hyperparameters are optimized at a subset of samples with 5fold crossvalidation and 80%/20%training/test split. The KRR kernel width and the regularization parameter are also allowed to vary on a logarithmic grid during the descriptor hyperparameter search. The use of a smaller subset allows much quicker evaluation for the hyperparameters than optimizing the hyperparameters for each size individually, but the transferability of these optimized hyperparameters to different sizes may affect the results slightly.
After finding the optimal descriptor setup, it is used to train a model on dataset sizes of 1024, 2048, 4096, 8192 and 16384. The same crossvalidation setup as for the descriptor hyperparameter optimization is used, but now with finer grid for the KRR kernel width. The predictions are repeated three times on different training sets to estimate the variance of the results. The hyperparameter grids and optimal values for both the descriptors and kernel ridge regression are found in the Supplementary Information together with additional details.
4.2 Ionic charge prediction for organic molecules
To demonstrate the prediction of local properties with the DScribe package, a prediction of ionic charges in small organic molecules is performed. The dataset consists of Mulliken charges calculated at the CCSD level for the GDB9 dataset of 133 885 neutral molecules Ramakrishnan et al. (2014). The structures contain up to nine atoms and five different chemical species: hydrogen, oxygen, carbon, nitrogen and fluorine. The geometries have been relaxed at the B3LYP/631G(2df,p) level and no significant forces were present in the static CCSD calculation.
The prediction is performed with the two local descriptors included in the package, SOAP and ACSF. For SOAP we perform the prediction with both radial basis functions: the polynomial basis (SOAP) and the gaussian type orbital radial basis (SOAP). For them we fix and , but optimize the cutoff and Gaussian width with grid search. For ACSF we use 10 radial functions and 8 angular functions . The cutoff value is shared between the radial and angular functions and it is optimized with grid search. More details about the used ACSF symmetry functions are found in the Supplementary Information.
Descriptor hyperparameters are optimized with grid search separately for each species on a smaller set of 2500 sample atoms with 5fold crossvalidation and 80%/20%training/test split. Both the KRR kernel width and the regularization parameter are allowed to vary on a logarithmic grid during the descriptor hyperparameter search. After optimizing the descriptor hyperparameters, the model is retrained using the optimal descriptor setup and a larger dataset of 10 000 sample atoms for each species (except fluorine, for which only 3314 atoms were available in the dataset). The training is done with the same crossvalidation setup as for the descriptor hyperparameter optimization, but now with finer grid for the KRR kernel width. The parity plots for the charge prediction are shown in Figure 9. The hyperparameter grids and optimal values for both the descriptors and kernel ridge regression are found in the Supplementary Information together with additional details.
4.3 Discussion
The formation energy prediction demonstrates that our implementation performs consistently and offers insight into the performance of the different descriptors. Special care must be taken in interpreting the results, as there exist different variations of the different descriptors. For example, as discussed in section 2.7, there are different ways to combine information from multiple local SOAPoutputs, and different geometry functions and cutoff types may be used for the MBTR. The learning rates also depend on the chosen machine learning model.
Our results for the Ewald sum matrix and the sine matrix reflect the results reported earlier, where a formation energy prediction was performed for a similar set of data from the Materials Project Jain et al. (2013). They report MAE for the Ewald sum matrix to be 0.49 eV and for the sine matrix to be 0.37 eV Faber et al. (2015) with a training set of 3000 samples, whereas we find MAE for the Ewald sum matrix to be 0.36 eV and for the sine matrix to be 0.24 eV with a training set of 3276 samples. The performance improvement in our results can be explained by differences in the contents of the used dataset. We, however, recover the same trend of the sine matrix performing better, even when issues in the original formulation of the Ewald sum matrix (as discussed in section 2.2) were addressed. The low performance of the more accurate charge interaction in the Ewald model and the relatively small difference between the performance of the Coulomb and sine matrix may indicate that for this task the information of the potential energy of the neutral atoms – contained on the diagonal of both the sine and Coulomb matrix – largely controls the performance.
With respect to the individual performance of the different MBTR parts, the terms containing distance information performs best, whereas the angle information contained in and the simple composition information contained by lag behind. However, the best MBTR performance is achieved by combining the information from all of the terms. It is also surprising how well the simple averaging scheme for SOAP performs in the tested dataset range. When extrapolating the performance to larger datasets, it can however be seen that MBTR may provide better results.
The charge prediction test illustrates that the ionic charges of different species in organic molecules may be learned accurately on the CCSD level just by observing the local arrangement of atoms up to a certain radial cutoff. On average the mean absolute error is around 0.0050.01 when using a dataset of 10 000 samples for each species. The variance of the Mulliken charges in the training set depends on the species, which also results in speciesspecific variation of MAE for the predicted charges. As to be expected, the charge of the multivalent species – C, N, O – varies much more in the CCSD data and is much harder to predict than the charge of the low valence species H and F.
Our comparison shows that there is little difference between the predictive performance of the two radial bases used for SOAP. With our current implementation there is, however, a notable difference in the speed of creating these descriptors. For identical settings (, , , and = 0.1), the gaussian type orbital basis is over four times faster to calculate than the polynomial basis. This difference originates largely from the numerical radial integration, which is required for the polynomial basis but not for the gaussian type orbital basis. The prediction performance of ACSF does not fall far behind SOAP and it might be possible to achieve the same accuracy by using a more advanced parameter calibration for the symmetry functions. The symmetry functions used in ACSF are easier to tune for capturing specific structural properties, such as certain pairwise distances or angles formed by three atoms. This tuning can, however, be done only if such intuition is available a priori, and in general consistently improving the performance by changing the used symmetry functions can be hard.
5 Conclusions
The recent boom in creating machine learnable fingerprints for atomistic systems, or descriptors, has led to a plethora of available options for materials science. The software implementations for these descriptors is, however, often scattered across different libraries or missing altogether, making it difficult to test and compare different alternatives.
We have collected several descriptors in the DScribe software library. DScribe has an easytouse pythoninterface, with C/C++ extensions for the computationally intensive tasks. We use a set of regression tests to ensure the validity of the implementation, and provide the source code together with tutorials and documentation. We have demonstrated the applicability of the package with the supervised learning tasks of formation energy prediction for crystals and the charge prediction for molecules. The DScribe descriptors are compatible with generalpurpose machine learning algorithms, and can also be used for unsupervised learning tasks. In the future we plan to extend the package with new descriptors, such as the structural descriptor based on Voronoi tesselations Ward et al. (2017), and also welcome external contributors.
6 Acknowledgements
We acknowledge the computational resources provided by the Aalto ScienceIT project. This project has received funding from the Jenny and Antti Wihuri Foundation and the European Union’s Horizon 2020 research and innovation programme under grant agreements number no. 676580 NOMAD and no. 686053 CRITCAT.
numbers,sort&compress
References
 Takahashi and Tanaka (2016) K. Takahashi, Y. Tanaka, Materials informatics: a journey towards material design and synthesis, Dalton Trans. 45 (26) (2016) 10497–10499.
 Zdeborová (2017) L. Zdeborová, Machine learning: New tool in the box, Nat. Phys. 13 (5) (2017) 420–421.
 Gubernatis and Lookman (2018) J. E. Gubernatis, T. Lookman, Machine learning in materials design and discovery: Examples from the present and suggestions for the future, Phys. Rev. Materials 2 (12) (2018) 120301.
 Butler et al. (2018) K. T. Butler, D. W. Davies, H. Cartwright, O. Isayev, A. Walsh, Machine learning for molecular and materials science, Nature 559 (7715) (2018) 547–555.
 Ward et al. (2017) L. Ward, R. Liu, A. Krishna, V. I. Hegde, A. Agrawal, A. Choudhary, C. Wolverton, Including crystal structure attributes in machine learning models of formation energies via Voronoi tessellations, Phys. Rev. B 96 (2) (2017) 1–12.
 Huo and Rupp (2017) H. Huo, M. Rupp, Unified Representation of Molecules and Crystals for Machine Learning, arXiv eprints arXiv:1704.06439.
 Faber et al. (2015) F. Faber, A. Lindmaa, O. A. v. Lilienfeld, R. Armiento, Crystal structure representations for machine learning models of formation energies, Int. J. Quantum Chem. 115 (16) (2015) 1094–1101.
 Seko et al. (2017) A. Seko, H. Hayashi, K. Nakayama, A. Takahashi, I. Tanaka, Representation of compounds for machinelearning prediction of physical properties, Phys. Rev. B 95 (14) (2017) 144110.
 Rupp et al. (2012) M. Rupp, A. Tkatchenko, K.R. Müller, O. A. von Lilienfeld, Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning, Phys. Rev. Lett. 108 (2012) 058301.
 Stuke et al. (2018) A. Stuke, M. Todorović, M. Rupp, C. Kunkel, K. Ghosh, L. Himanen, P. Rinke, Chemical diversity in molecular orbital energy predictions with kernel ridge regression, arXiv eprints arXiv:1812.08576.
 Ghosh et al. (2019) K. Ghosh, A. Stuke, M. Todorović, P. B. Jørgensen, M. N. Schmidt, A. Vehtari, P. Rinke, Deep Learning Spectroscopy: Neural Networks for Molecular Excitation Spectra, Adv. Sci. 0 (0) (2019) 1801367.
 Smith et al. (2017) J. S. Smith, O. Isayev, A. E. Roitberg, ANI1: an extensible neural network potential with DFT accuracy at force field computational cost, Chem. Sci. 8 (2017) 3192–3203.
 Bartók et al. (2010) A. P. Bartók, M. C. Payne, R. Kondor, G. Csányi, Gaussian approximation potentials: The accuracy of quantum mechanics, without the electrons, Phys. Rev. Lett. 104 (13) (2010) 1–4.
 Bartók et al. (2013) A. P. Bartók, R. Kondor, G. Csányi, On representing chemical environments, Phys. Rev. B 87 (2013) 184115.
 Chmiela et al. (2017) S. Chmiela, A. Tkatchenko, H. E. Sauceda, I. Poltavsky, K. T. Schütt, K.R. Müller, Machine learning of accurate energyconserving molecular force fields, Sci. Adv. 3 (5) (2017) e1603015.
 Behler (2011) J. Behler, Atomcentered symmetry functions for constructing highdimensional neural network potentials, J. Chem. Phys. 134 (7) (2011) 074106.
 Behler (2016) J. Behler, Perspective: Machine learning potentials for atomistic simulations, J. Chem. Phys. 145 (17) (2016) 170901.
 Li et al. (2017a) Y. Li, H. Li, F. C. Pickard, B. Narayanan, F. G. Sen, M. K. Y. Chan, S. K. R. S. Sankaranarayanan, B. R. Brooks, B. Roux, Machine Learning Force Field Parameters from Ab Initio Data, J. Chem. Theory Comput. 13 (9) (2017a) 4492–4503.
 Jacobsen et al. (2018) T. L. Jacobsen, M. S. Jørgensen, B. Hammer, OntheFly Machine Learning of Atomic Potential in Density Functional Theory Structure Optimization, Phys. Rev. Lett. 120 (2) (2018) 026102.
 Li et al. (2017b) Z. Li, S. Wang, W. S. Chin, L. E. Achenie, H. Xin, Highthroughput screening of bimetallic catalysts enabled by machine learning, J. Mater. Chem. A 99 (2017b) 016105.
 Goldsmith et al. (2018) B. R. Goldsmith, J. Esterhuizen, J. X. Liu, C. J. Bartel, C. Sutton, Machine learning for heterogeneous catalyst design and discovery, AIChE Journal 64 (7) (2018) 2311–2323.
 Chowdhury et al. (2018) A. J. Chowdhury, W. Yang, E. Walker, O. Mamun, A. Heyden, G. A. Terejanu, Prediction of Adsorption Energies for Chemical Species on Metal Catalyst Surfaces Using Machine Learning, J. Phys. Chem. C 122 (49) (2018) 28142–28150.
 Jäger et al. (2018) M. O. J. Jäger, E. V. Morooka, F. F. Canova, L. Himanen, A. S. Foster, Machine learning hydrogen adsorption on nanoclusters through structural descriptors, npj Comput. Mater. 4 (2018) 37.
 Zahrt et al. (2019) A. F. Zahrt, J. J. Henle, B. T. Rose, Y. Wang, W. T. Darrow, S. E. Denmark, Prediction of higherselectivity catalysts by computerdriven workflow and machine learning, Science 363 (6424) (2019) eaau5631.
 Kiyohara et al. (2016) S. Kiyohara, H. Oda, T. Miyata, T. Mizoguchi, Prediction of interface structures and energies via virtual screening, Sci. Adv. 2 (11) (2016) e1600746.
 Zalake et al. (2017) P. Zalake, S. Ghosh, S. Narasimhan, K. G. Thomas, DescriptorBased Rational Design of TwoDimensional SelfAssembled Nanoarchitectures Stabilized by Hydrogen Bonds, Chem. Mater. 29 (17) (2017) 7170–7182.
 Todorović et al. (2019) M. Todorović, M. U. Gutmann, J. Corander, P. Rinke, Bayesian inference of atomistic structure in functional materials, npj Comput. Mater. 5 (2019) 35.
 Ghiringhelli et al. (2015) L. M. Ghiringhelli, J. Vybiral, S. V. Levchenko, C. Draxl, M. Scheffler, Big Data of Materials Science: Critical Role of the Descriptor, Phys. Rev. Lett. 114 (2015) 105503.
 Xie and Grossman (2018) T. Xie, J. C. Grossman, Crystal Graph Convolutional Neural Networks for an Accurate and Interpretable Prediction of Material Properties, Phys. Rev. Lett. 120 (14) (2018) 145301.
 Arbabzadah et al. (2017) F. Arbabzadah, S. Chmiela, K. R. Müller, A. Tkatchenko, Quantumchemical insights from deep tensor neural networks, Nat. Commun. 8 (2017) 6–13.
 Gilmer et al. (2017) J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, G. E. Dahl, Neural Message Passing for Quantum Chemistry, Proceedings of the 34th International Conference on Machine Learning (2017) 1263–1272.
 Schütt et al. (2018) K. T. Schütt, H. E. Sauceda, P. J. Kindermans, A. Tkatchenko, K. R. Müller, SchNet  A deep learning architecture for molecules and materials, J. Chem. Phys. 148 (24) (2018) 241722.
 Montavon et al. (2012) G. Montavon, K. Hansen, S. Fazli, M. Rupp, F. Biegler, A. Ziehe, A. Tkatchenko, A. V. Lilienfeld, K.R. Müller, Learning Invariant Representations of Molecules for Atomization Energy Prediction, in: F. Pereira, C. J. C. Burges, L. Bottou, K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 25, Curran Associates, Inc., 440–448, 2012.
 Hansen et al. (2015) K. Hansen, F. Biegler, R. Ramakrishnan, W. Pronobis, O. A. Von Lilienfeld, K. R. Müller, A. Tkatchenko, Machine learning predictions of molecular properties: Accurate manybody potentials and nonlocality in chemical space, J. Phys. Chem. Lett. 6 (12) (2015) 2326–2331.
 Gastegger et al. (2018) M. Gastegger, L. Schwiedrzik, M. Bittermann, F. Berzsenyi, P. Marquetand, WACSF  Weighted atomcentered symmetry functions as descriptors in machine learning potentials, J. Chem. Phys. 148 (24) (2018) 241709.
 Isayev et al. (2017) O. Isayev, C. Oses, C. Toher, E. Gossett, S. Curtarolo, A. Tropsha, Universal fragment descriptors for predicting properties of inorganic crystals, Nat. Commun. 8 (2017) 15679.
 Faber et al. (2016) F. A. Faber, A. Lindmaa, O. A. Von Lilienfeld, R. Armiento, Machine Learning Energies of 2 Million Elpasolite (ABC2D6) Crystals, Phys. Rev. Lett. 117 (13) (2016) 2–7.
 Faber et al. (2018) F. A. Faber, A. S. Christensen, B. Huang, O. A. von Lilienfeld, Alchemical and structural distribution based representation for universal quantum machine learning, J. Chem. Phys. 148 (24) (2018) 241717.
 Pronobis et al. (2018) W. Pronobis, A. Tkatchenko, K.R. Müller, ManyBody Descriptors for Predicting Molecular Properties with Machine Learning: Analysis of Pairwise and ThreeBody Interactions in Molecules, J. Chem. Theory Comput. 14 (6) (2018) 2991–3003.
 Faber et al. (2017) F. A. Faber, L. Hutchison, B. Huang, J. Gilmer, S. S. Schoenholz, G. E. Dahl, O. Vinyals, S. Kearnes, P. F. Riley, O. A. Von Lilienfeld, Prediction Errors of Molecular Machine Learning Models Lower than Hybrid DFT Error, J. Chem. Theory Comput. 13 (11) (2017) 5255–5264.
 Ouyang et al. (2018) R. Ouyang, S. Curtarolo, E. Ahmetcik, M. Scheffler, L. M. Ghiringhelli, SISSO: A compressedsensing method for identifying the best lowdimensional descriptor in an immensity of offered candidates, Phys. Rev. Materials 2 (2018) 083802.
 Wolpert and Macready (1997) D. H. Wolpert, W. G. Macready, No free lunch theorems for optimization, IEEE Trans. Evol. Comput. 1 (1) (1997) 67–82.
 Ceriotti et al. (2011) M. Ceriotti, G. A. Tribello, M. Parrinello, Simplifying the representation of complex freeenergy landscapes using sketchmap, Proc. Natl. Acad. Sci. U.S.A. 108 (32) (2011) 13023–13028.
 De et al. (2016) S. De, A. P. Bartók, G. Csányi, M. Ceriotti, Comparing molecules and solids across structural and alchemical space, Phys. Chem. Chem. Phys. 18 (20) (2016) 13754–13769.
 Isayev et al. (2015) O. Isayev, D. Fourches, E. N. Muratov, C. Oses, K. Rasch, A. Tropsha, S. Curtarolo, Materials cartography: Representing and mining materials space using structural and electronic fingerprints, Chem. Mater. 27 (3) (2015) 735–743.
 Ramakrishnan et al. (2015) R. Ramakrishnan, M. Hartmann, E. Tapavicza, O. A. von Lilienfeld, Electronic spectra from TDDFT and machine learning in chemical space, J. Chem. Phys. 143 (8) (2015) 084111.
 Ewald (1921) P. P. Ewald, Die Berechnung optischer und elektrostatischer Gitterpotentiale, Ann. Phys. 369 (3) (1921) 253–287.
 Toukmaji and Board (1996) A. Y. Toukmaji, J. A. Board, Ewald summation techniques in perspective: a survey, Comput. Phys. Commun. 95 (2) (1996) 73–92.
 Hub et al. (2014) J. S. Hub, B. L. de Groot, H. Grubmüller, G. Groenhof, Quantifying Artifacts in Ewald Simulations of Inhomogeneous Systems with a Net Charge, J. Chem. Theory Comput. 10 (1) (2014) 381–390.
 Jackson and Catlow (1988) R. A. Jackson, C. R. Catlow, Computer Simulation Studies of Zeolite Structure, Mol. Simul. 1 (4) (1988) 207–224.
 Gerrits et al. (2019) N. Gerrits, K. Shakouri, J. Behler, G.J. Kroes, Accurate Probabilities for Highly Activated Reaction of Polyatomic Molecules on Surfaces Using a HighDimensional Neural Network Potential: CHD3 + Cu(111), J. Phys. Chem. Lett. 10 (2019) 1763–1768.
 Artrith and Urban (2016) N. Artrith, A. Urban, An implementation of artificial neuralnetwork potentials for atomistic materials simulations: Performance for TiO2, Comput. Mater. Sci. 114 (2016) 135–150.
 Nguyen et al. (2018) T. T. Nguyen, E. Székely, G. Imbalzano, J. Behler, G. Csányi, M. Ceriotti, A. W. Götz, F. Paesani, Comparison of permutationally invariant polynomials, neural networks, and Gaussian approximation potentials in representing water interactions through manybody expansions, J. Chem. Phys. 148 (24) (2018) 241725.
 Löwdin (1950) P. Löwdin, On the NonOrthogonality Problem Connected with the Use of Atomic Wave Functions in the Theory of Molecules and Crystals, J. Chem. Phys. 18 (3) (1950) 365–375.
 Willatt et al. (2018) M. J. Willatt, F. Musil, M. Ceriotti, A DataDriven Construction of the Periodic Table of the Elements, arXiv eprints arXiv:1807.00236.
 Larsen et al. (2017) A. H. Larsen, J. J. Mortensen, J. Blomqvist, I. E. Castelli, R. Christensen, M. Dułak, J. Friis, M. N. Groves, B. Hammer, C. Hargus, E. D. Hermes, P. C. Jennings, P. B. Jensen, J. Kermode, J. R. Kitchin, E. L. Kolsbjerg, J. Kubal, K. Kaasbjerg, S. Lysgaard, J. B. Maronsson, T. Maxson, T. Olsen, L. Pastewka, A. Peterson, C. Rostgaard, J. Schiøtz, O. Schütt, M. Strange, K. S. Thygesen, T. Vegge, L. Vilhelmsen, M. Walter, Z. Zeng, K. W. Jacobsen, The atomic simulation environment – a Python library for working with atoms, Journal of Physics: Condensed Matter 29 (27) (2017) 273002.
 Ong et al. (2013) S. P. Ong, W. D. Richards, A. Jain, G. Hautier, M. Kocher, S. Cholia, D. Gunter, V. L. Chevrier, K. A. Persson, G. Ceder, Python Materials Genomics (pymatgen): A robust, opensource python library for materials analysis, Comput. Mater. Sci. 68 (2013) 314–319.
 qui (2019) QUIP and quippy documentation, http://libatoms.github.io/QUIP/, (accessed 16 April 2019), 2019.
 van der Maaten and Hinton (2008) L. van der Maaten, G. Hinton, Visualizing Data using tSNE, J. Mach. Learn. Res. 9 (2008) 2579–2605.
 Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikitlearn: Machine Learning in Python, J. Mach. Learn. Res. 12 (2011) 2825–2830.
 Saal et al. (2013) J. E. Saal, S. Kirklin, M. Aykol, B. Meredig, C. Wolverton, Materials Design and Discovery with HighThroughput Density Functional Theory: The Open Quantum Materials Database (OQMD), JOM 65 (11) (2013) 1501–1509.
 Ramakrishnan et al. (2014) R. Ramakrishnan, P. O. Dral, M. Rupp, O. A. von Lilienfeld, Quantum chemistry structures and properties of 134 kilo molecules, Sci. Data 1 (2014) 140022.
 Jain et al. (2013) A. Jain, S. P. Ong, G. Hautier, W. Chen, W. D. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder, K. a. Persson, The Materials Project: A materials genome approach to accelerating materials innovation, APL Mater. 1 (1) (2013) 011002.