Optimizing the energy with quantum Monte Carlo: A lower numerical scaling for Jastrow-Slater expansions

Optimizing the energy with quantum Monte Carlo: A lower numerical scaling for Jastrow-Slater expansions

Roland Assaraf assaraf@lct.jussieu.fr Sorbonne Universités, UPMC Univ Paris 06, CNRS, Laboratoire de Chimie Théorique (UMR7616), 4 place Jussieu F-75252 Paris, France    Saverio Moroni moroni@democritos.it CNR-IOM DEMOCRITOS, Istituto Officina dei Materiali, and SISSA Scuola Internazionale Superiore di Studi Avanzati, Via Bonomea 265, I-34136 Trieste, Italy    Claudia Filippi c.filippi@utwente.nl MESA+ Institute for Nanotechnology, University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands
Abstract

We present an improved formalism for quantum Monte Carlo calculations of energy derivatives and properties (e.g. the interatomic forces), with a multideterminant Jastrow-Slater function. As a function of the number of Slater determinants, the numerical scaling of per derivative we have recently reported is here lowered to for the entire set of derivatives. As a function of the number of electrons , the scaling to optimize the wave function and the geometry of a molecular system is lowered to , the same as computing the energy alone in the sampling process. The scaling is demonstrated on linear polyenes up to CH and the efficiency of the method is illustrated with the structural optimization of butadiene and octatetraene with Jastrow-Slater wave functions comprising as many as 200000 determinants and 60000 parameters.

I Introduction

Quantum Monte Carlo methods (QMC) are first-principle methods which can efficiently solve the Schrödinger equation. For fermionic systems, they are powerful variational approaches because they can handle a large variety of variational wave functions , where represents the coordinates of the electrons of the system. Here, the vector indicates the 3 spatial coordinates of the electron , and its spin component (). This flexibility stems from the fact that integrals are not computed analytically but from a stochastic sampling. For example, the variational energy is

(1)

where is the Hamiltonian and is normalized, and can be interpreted as the expectation value of a random variable, the so-called local energy on the probability density . QMC methods can be used as benchmark methods also for relatively large systems thanks to their favorable scaling with the number of particles . For a given parametrization of , is typically computed with a scaling in memory requirements and in CPU per Monte Carlo step. In practice, one needs to optimize the parameters of and the geometry of a molecular system. Despite the availability of stable wave function optimization methods Umrigar et al. (2007), such techniques remain costly and one of the main reasons is that a large number of derivatives of (typically ) has to be computed. Lowering the numerical scaling per derivative is therefore important. For single determinants, Sorella et al. have found that the low-variance estimators of the intermolecular forces can be calculated with a scaling instead of with the use of algorithmic differentiation techniques Sorella and Capriotti (2010). We have recently recovered the same reduction using transparent matricial formulas and extended it to the orbital coefficients Filippi, Assaraf, and Moroni (2016). For expansions over additional Slater determinants, , multiplied by a positive Jastrow correlation factor ,

(2)

Clark et al. have proposed a method to compute with a scaling and with a scaling  Clark et al. (2011) that we have further reduced to and extended to any derivative of Filippi, Assaraf, and Moroni (2016). The derivatives of are useful because they are involved in low-variance estimators for forces and observables Assaraf and Caffarel (1999); Filippi and Umrigar (2000); Assaraf and Caffarel (2003). At the origin of this reduction is the observation that the local energy can be written in terms of a first-order (logarithmic) derivative of the determinantal component, .

In this paper, we show that the scaling per derivative can be further improved to for any set of derivatives of and . The core observation is that the determinantal part is a function of the matrix elements where is an orbital and an electron index, and that any derivative of can be computed using a simple trace formula involving the matrix defined as the logarithmic gradient of with respect to . The first derivatives of the local energy can then be expressed as traces involving and one of its derivative : many derivatives of and are obtained efficiently because the matrices and are computed only once for the whole set of parameters . Consequently, the calculation of all derivatives of with respect to all parameters of the wave function (Jastrow parameters, orbital coefficients, the coefficients of the expansion , and all nuclear positions) has now the same scaling as the calculation of alone, opening the path to full optimization of large multideterminant expansions.

In the next Section, we outline the main idea and introduce the matrix . In Section III, we present a formula to compute at a cost and, in Section IV, discuss the formulas for the second derivative of and, specifically, the first derivatives of . In Section V, we demonstrate the scaling of the computation of interatomic forces with multideterminant wave functions on polyenes up to CH and, in the last Section, apply the scheme to the optimization of multideterminant wave functions and geometries of butadiene and octatetraene.

Ii Derivative of the determinantal expansion

The determinantal component in the Jastrow-Slater expansion of Eq. (2) is a linear combination of Slater determinants

(3)

For a system including electrons, the matrix is an Slater matrix, built from of the molecular spin-orbitals . Mathematically, comprises columns of the matrix defined as follows

(4)

In general, one needs to compute many derivatives of with respect to different parameters of . These parameters can be the electron coordinates, nuclei coordinates, orbital coefficients, basis-function parameters and so on. The derivative of with respect to a given parameter in is obtained from the chain rule

(5)

where a summation on repeated indices is implied and we have introduced , that is, the gradient of with respect to the matrix elements of

(6)

The trace formula (5) is at the core of greater efficiency in computing many derivatives of because the matrix depends only on and not on . For a given configuration in the Monte Carlo sample, is computed only once for all the set of derivatives. In addition, can be evaluated efficiently, at a cost as we will see in the next Section. Once is computed and stored, any new derivative requires to calculate besides the trace (5) at a cost . What is important here is that this scaling is independent on and leads to vast improvements over previous methods Filippi, Assaraf, and Moroni (2016); Clark et al. (2011) when and the number of derivatives are large.

Finally, also quantities like the local energy or the value of the wave function after one electron move, can be computed using this trace formula (5). This is because one-body operators can be also expressed as first order derivatives of when applied to a Jastrow-Slater expansion Filippi, Assaraf, and Moroni (2016).

Iii Efficient evaluation of the matrix

iii.1 Convenient expression for

The determinants of the Slater matrices can be computed efficiently because usually differs by a few columns from a reference Slater matrix . For example, let be the Slater matrix built with the orbitals :

(7)

where the notation stands for the column of . The Slater matrix of a double excitation is

(8)

Here, and differ only in the 2 last columns. The determinant of is

and

(9)

where a column of the identity matrix arises whenever and share the same column. The determinant of is readily evaluated:

(10)

More generally, the determinant of for a -order excitation is the determinant of a submatrix. Such a submatrix can always be written as follows

(11)

where, in our example,

(12)

and

(13)

In general, is such that are the columns of which differ from those of , and is such that . In other words (applied on the right of ) selects the columns of from which excitations are built, and (applied on the right of ) selects the columns of to which excitations are built. To summarize, the expression

(14)

enables to compute the determinant of a large matrix as the determinant of a small submatrix of . This expression can also be proven using the determinant lemma Clark et al. (2011); Filippi, Assaraf, and Moroni (2016). Finally, the convenient expression for to efficiently compute is:

(15)

iii.2 Convenient expression for

Introducing the matrix such that , the expression (15) is explicitly a function of . In particular, the summation on the r.h.s. of Eq. (15)

(16)

is a polynomial function depending on the matrix elements of

(17)

The order of this polynomial is the order of the highest-order exitation. It is usually low (typically ). Applying the chain rule and using the convention of summation over repeated indices, we obtain

(18)

where

(19)

It is simple to show that

(20)

The derivative of is given by

(21)

Finally, writing and using the cyclic property of the trace, we obtain

(22)

where

(23)

For example, if the occupied orbitals are the first ones, the matrix is

(24)

where the first line is a matrix. The second line is a matrix where represents the non-zero lines of , i.e. the last lines.

iii.3 One-body operators and first-order derivatives of

First-order derivatives of can be computed with the trace formula (5) which involves the matrix. One-body operators acting on the wave function can be also expressed as first-order derivatives of when applied to a Jastrow-Slater expansion as we have shown in Ref. Filippi, Assaraf, and Moroni, 2016. The local energy for example can be written as a first-order logarithmic derivative of the determinantal part where has been replaced by

(25)

and is an appropriate matrix depending on the orbitals, the Jastrow factor, and their derivatives. In particular, the reference Slater determinant has been replaced by . The determinantal part of the wave function is now

(26)

From this expression, one can compute the local energy

(27)

In the presence of the Jastrow factor, one recovers the same trace expression for the local energy of but with a matrix also depending on and its derivatives Filippi, Assaraf, and Moroni (2016).

Iv Second-order derivatives

The second derivative of can be written in terms of and its derivative as

(28)

Example of the derivative of the local energy

When computing improved estimators of derivatives of the energy , we need also the derivatives of the local energy . It follows from Eq. 27 that the derivative of the local energy with respect to a given parameter is

(29)

The order of the derivation has been chosen so that and not is differentiated with respect to . Consequently, the matrix does not depend on the parameter and has to be computed only once, whatever the number of second derivatives we need. Once has been computed, the calculation of involves (besides and ) two traces which can be computed at a cost . Importantly, such a calculation does not depend on contrary to what was presented in Ref. Filippi, Assaraf, and Moroni, 2016.

Efficient calculation of

The derivative of is

(30)

where

(31)

Applying the chain rule, we obtain

(32)

where

(33)
(34)

It follows from Eq. 20 that

(35)

We can compute the derivatives of avoiding the evaluation of inverse matrices. That will be presented in the appendix.

Derivatives with respect to the linear coefficients

The derivatives of a local quantity with respect to the expansion coefficients require instead to evaluate the action of the one-body operator on each excited determinant separately (Eq. 14). For instance, as we have shown in Ref. Filippi, Assaraf, and Moroni, 2016, the derivative of the local energy with respect to is given by

(36)

These quantities are needed in the optimization of the energy with respect to the linear coefficients and can be computed at a cost .

V Numerical scaling

In practice, for each step of the Monte Carlo algorithm, we need to compute , , and at a cost of at most (products and inversions of matrices). Then, we need to calculate the first and second derivatives of with respect to (Eqs. 20 and 35) at a cost (a few sums and products for each excitation). The related tensors and are also computed at a cost . is computed at a cost where is the total number of double excitations involved in any order excitation (), where of course . Finally, and are computed at a cost (product of matrices).

In particular, computing the components of the inter-atomic forces with improved estimators has a scaling

(37)

Assuming that , this scaling simplifies

(38)

This is significantly more efficient than the scaling 111or presented in our previous work Filippi, Assaraf, and Moroni (2016), in the large , or , , regimes. The term is no more present because here we avoid to compute . Regarding the sampling process, when one-electron moves are used (see appendix), the total numerical cost for a full sweep (all the electrons are moved once) is .

Figure 1: Ratio of the CPU time for a VMC calculation of the forces to the CPU time for the same simulation of the energy alone for the sequence of molecules CH with between 4 and 60 and an increasing number of determinants in the Jastrow-Slater wave function. The forces are calculated after moving all the electrons once.

In Fig. 1, we demonstrate this favorable scaling in the variational Monte Carlo (VMC) computation of the interatomic forces for multi-determinant Jastrow-Slater wave functions using the sequence of molecules CH with between 4 and 60. For each system, the ratio of the CPU time of computing all interatomic forces to the time of evaluating only the energy is initially constant and then decreases when the number of determinants exceeds about 100. For the largest CH, computing all interatomic gradients costs less than about 3 times a VMC simulation where one only evaluates the total energy. Finally, as it is shown in the Appendix, if we move one electron, many quantities can be updated so that, for each Monte Carlo step, the scaling is reduced to . This leads to an overall scaling when all the electrons have been moved. For an all-electron-move algorithm, the scaling is which could be more efficient when is large.

Vi Numerical results

We demonstrate the formulas above on the ground-state structural optimization in VMC of butadiene (CH) and octatetraene (CH) using large expansions in the determinantal component of the Jastrow-Slater wave function. All expansion coefficients, orbital and Jastrow parameters in the wave function are optimized together with the geometry. Given the large number of variational parameters (up to 58652) we employ the stochastic reconfiguration optimization method Sorella, Casula, and Rocca (2007) in a conjugate gradient implementation Neuscamman, Umrigar, and Chan (2012) which avoids building and storing large matrices. In most of our calculations, to remove occasional spikes in the forces, we use an improved estimator of the forces obtained by sampling the square of a modified wave function close to the nodes C. Attaccalite, and S. Sorella (2008). To optimize the geometry, we simply follow the direction of steepest descent and appropriately rescale the interatomic forces. We employ the CHAMP code Cha () with scalar-relativistic energy-consistent Hartree-Fock pseudopotentials and the corresponding cc-pVXZ Burkatzki, Filippi, and Dolg (2007); Hps () and aug-cc-pVXZ aug () basis sets with X=D,T, and Q. The Jastrow factor includes two-body electron-electron and electron-nucleus correlation terms. The starting determinantal component of the Jastrow-Slater wave functions before optimization is obtained in multiconfiguration-self-consistent-field calculations performed with the program GAMESS(US) Schmidt et al. (1993); Gordon and Schmidt (2011).

Figure 2: Bond length alternation (BLA) of CH optimized in VMC for different basis sets and choices in the determinantal part of the Jastrow-Slater wave function. The atomic positions and all parameters of the wave function (expansion coefficients, orbital and Jastrow parameters) are simultaneously optimized. The CCSD(T) BLA in the CBS limit computed with various corrections Feller and Craig (2009) and the best value obtained with a Jastrow-antisymmetrized geminal power (JAGP) Barborini and Guidoni (2015) are reported.
Expansion No. det No. param. C-C C=C BLA (Å)
1 det 1 1404 1.45513(12) 1.32482(05) 0.13031(16)
CAS(4,4) 20 1547 1.45211(10) 1.33347(07) 0.11865(15)
CAS(4,16) 7232 4995 1.45160(15) 1.33422(13) 0.11738(16)
CAS(4,20) 18100 9147 1.45143(16) 1.33409(07) 0.11734(24)
CAS(10,10) 15912 6890 1.45858(09) 1.33694(06) 0.12163(13)
RAS(10,22) 45644 11094 1.45705(17) 1.33760(15) 0.11945(29)
CCSD(T)/CBS 1.4548 1.3377 0.1171
CCSD(T)/CBS-corr 1.4549 1.3389 0.1160
Ref. Feller and Craig, 2009; Ref. Feller and Craig, 2009, including a CCSDT(Q)(FC)/cc-pVDZ correction.
Table 1: Optimal bond lengths and BLA values (Å) of butadiene computed in VMC with the cc-pVQZ basis set and various choices of Jastrow-Slater expansions. The numbers of determinants and optimized parameters in the wave function are listed.

We first focus on the VMC geometrical optimization of butadiene. Despite its small size and apparent simplicity, predicting the bond length alternation (BLA) of butadiene remains a challenging task for quantum chemical approaches which lead to a spread of BLA values, mainly clustered around either 0.115 or 0.125 Å (see Table 2 in Ref. Barborini and Guidoni, 2015 for a recent compilation of theoretical predictions). In particular, Barborini and Guidoni Barborini and Guidoni (2015) using VMC in combination with Jastrow-antisymmetrized geminal power (JAGP) wave functions find a best BLA value of 0.1244(6) Å, rather close to the BLA of 0.1251(7) Å they obtain using a single-determinant Jastrow-Slater wave function and clearly distinct from the CCSD(T) prediction of 0.116 Å computed in the complete basis set (CBS) limit and corrected for core-valence correlation, scalar-relativistic effects, and inclusion of quadruples Feller and Craig (2009).

To elucidate the origin of this difference, we consider here various expansions correlating the and electrons: a) a single determinant; b) the complete-active-space CAS(4,4), CAS(4,16), and CAS(4,20) expansions (20, 7232, and 18100 determinants, respectively) of the four electrons in the bonding and antibonding orbitals constructed from the , , , , and atomic orbitals; c) a CAS(10,10) correlating the six and four electrons of the carbon atoms in the corresponding bonding and antibonding and orbitals (15912 determinants); d) the same CAS(10,10) expansion augmented with single and double excitations in the external space of 12 orbitals and truncated with a threshold of 210 on the coefficients of the spin-adapted configuration state functions. This last choice results in a total of 45644 determinants and is denoted as a restricted-active-space RAS(10,22) expansion.

We start all runs from the same geometry and, after convergence, average the geometries over an additional 30-40 iterations. The results of these structural optimizations are summarized in Fig. 2. We find that the basis sets of triple- and quadruple- quality yield values of BLA which are compatible within 1-1.5 standard deviations, namely, to better than 510 Å. The further addition of augmentation does not change the BLA as shown in the one-determinant case. In the following, we therefore focus on the cc-pVQZ bond lengths and BLA values of butadiene, which are summarized in Table 1.

With a one-determinant wave function (case a), we obtain a BLA of 0.1303(2) Å  which is higher than the value of 0.1251(6) Å reported in Ref. Barborini and Guidoni, 2015, possibly due to their use of a basis set of quality inferior to triple-. Moving beyond a single determinant, we observe a strong dependence of the result on the choice of active space. The inclusion of - correlation within 4, 16, and 20 orbitals (case b) significantly decreases the BLA with respect to the one-determinant case with the CAS(4,16) and CAS(4,20) expansions yielding a BLA of 0.117 Å in apparent agreement with the CCSD(T)/CBS estimate of 0.116 Å. Accounting also for - and - correlations in a CAS(10,10) (case c) leads however to a more substantial lengthening of the single than the double bond and a consequent increase of BLA. Finally, allowing excitations out of the CAS(10,10) in 12 additional orbitals (case d) brings the double bond in excellent agreement with the CCSD(T)/CBS value and somewhat shortens the single bond, lowering the BLA to a final value of 0.119 Å. In summary, all choices of multi-determinant expansion in the Jastrow-Slater wave function represent a clear improvement with respect to the use of a single determinant, significantly lowering the value of BLA. Consequently, the agreement reported in Ref. Barborini and Guidoni, 2015 between the single-determinant and JAGP wave functions indicates that the JAGP ansatz does not have the needed variational flexibility to capture the subtle static correlation effects in butadiene.

Finally, in Fig. 3, we demonstrate the ability of our method to optimize the structure and the many wave function parameters for the larger molecule CH when using a very large determinantal expansion. For this purpose, we employ the simple cc-pVDZ basis set and consider all single, double, and triple excitations in an expansion denoted as SDT(22,22), correlating 22 electrons in the 22 and orbitals obtained from the carbon valence atomic orbitals. The wave function comprises a total of 201924 determinants and 58652 parameters. To illustrate the dependence of the energy on the choice of wave function, we also display the energy of the last iterations of a structural optimization of the same molecule with the minimal CAS(8,8) expansion over the orbitals. At each iteration, we update both the wave function parameters and the atomic positions, the former with one step of the stochastic reconfiguration method and the latter along the down-hill direction of the interatomic forces. The energy of the SDT(22,22) wave function is distinctly lower than the one obtained with the smaller active space and converged to better than 2 mHartree within about 80 iterations. The structural parameters converge much faster and reach stable values within the first 30 iterations.

Figure 3: Total energy for a VMC geometry optimization of CH using two different expansions in the Jastrow-Slater wave function, that is, a full CAS(8,8) with 2468 determinants, and all single, double, and triple excitations in an expansion correlating 22 electrons in 22 orbitals for a total of 201924 determinants. The atomic positions and all parameters of the wave function (expansion coefficients, orbital and Jastrow parameters) are simultaneously optimized.

Appendix A Efficient calculation of , ,

We demonstrate here that we do not need to compute explicitly the inverses of the submatrices as in Eqs. (16, 21, and 35) or in Refs. Clark et al., 2011; Filippi, Assaraf, and Moroni, 2016 to obtain and its derivatives. These can be computed efficiently using recursion formulas.

Suppose that contains only third-order excitations (the generalization to an arbitrary order is straightforward). Let us rewrite the expression of (Eq. 16) as

(39)

where stands for a permutation of the indices , and is the sign of the permutation. We note that this formula can also include first- and second-order excitations: a second-order excitation can be written as , and a first-order excitation as .

The starting point is that the tensor of second derivatives can be computed directly from the expression (39) as

(40)

where and are the permutations ordering and , respectively. Note that this tensor is antisymmetric with respect to the permutations of either the indices or the indices , and we only need to compute and store the elements such that , and . The tensor of first order derivatives is

(41)

and the value of is

(42)

In practice, sparse representations of these tensors should be used. The formula (40) involves at most nine products and nine sums per excitation. The formulas (41) and (42) require less than and operations (additions or multiplications), respectively. The method still scales like but with a reduced prefactor because no divisions are involved and the number of operations is smaller. For example, expression (42) involves at most multiplications and additions whereas (16) is a sum on terms ( can be of order if third-order excitations are included).

Appendix B One-electron-move algorithms

To sample the density , we use the Metropolis-Hastings method Hastings (1970); Toulouse, Assaraf, and Umrigar (2016) which is a stochastic dynamics in the space of configurations . For a given iteration, this method proposes a random move with a transition probability density . The proposed move is accepted with the probability

(43)

If only one electron is moved (here the first, for example), the new configuration is . The new extended Slater matrix differs from only in the first line.

We introduce the matrix such that the first line of and are the same but is zero elsewhere. Since is a linear function of the modified line

(44)

where we considered the following transformation . Using Eq. (18), we obtain

(45)

where we recall that and . The cost of this calculation is . When the first electron has been moved, can be updated using the Sherman Morrison formula at a cost Filippi, Assaraf, and Moroni (2016), and which depends on can be again computed at a cost . The total cost for a sweep (each electron has moved once) is . The matrix and all derivatives are computed after each sweep.

We note that, if one uses instead the expression involving to update the wave function,

(46)

one would need to update at each Monte Carlo step and incur the higher cost of for a full sweep. This is because updating requires the calculation of given in Eq. (30), where of course is replaced by . In this equation, the product scales like , unless is sparsely modified after one electron move (i.e. a few double excitations are involved).

Finally, also in the calculation of the drift of a single electron needed in the Monte Carlo sampling, it is better not to recompute but to use formula (45) with , where the matrix is zero except the row which equals . However, if the sampling is modified to use a finite distribution at the nodes following Ref. C. Attaccalite, and S. Sorella, 2008, the full drift has to be computed at each step. The resulting scaling is per sweep, using Eq. (45) or (46) alike.

Appendix C Simple Expression of for a Jastrow-Slater expansion

Here, we provide a simple (though not efficient) expression for and some mathematical properties.

Simple expression for

The determinantal contribution of the wave function written in Eq. (3) is

where is a list of columns of the generalized Slater matrix . We can then define a matrix such that

(47)

which gives an explicit expression of as a function of

(48)

For example, given a Slater matrix built on the orbitals

The derivative of the determinantal expansion with respect to a parameter is

Using the linearity and the cyclic properties of the trace, we find

(49)

where we can identify

(50)

In the expression (50), the application of on the left of dispatches the lines of in a larger matrix. Of course, a direct evaluation of (50) would be and would be too costly.

Properties of the matrix

is a right inverse of , i.e.

(51)

where is the identity matrix of order . The proof is simple

(52)
(53)

We now consider the matrix and resort to the transformation . The only non-zero column of the matrix is the column, which is the same as the column of . Therefore,

(54)

meaning that is the new value of the determinantal expansion when the orbital has been replaced by the orbital

(55)

In particular, if ,

(56)

In other words, the main diagonal of is made of restrictions of the summation in (3) to determinants containing a given orbital. As a by-product, if is common to all the determinants of the expansion, is equal to . If , is the expansion (3) restricted to Slater determinants occupied by and not by

(57)

In particular, if the orbital is common to all determinants, for any . In conclusion, if there are orbitals which can be excited (i.e. there are orbitals common to all determinants), the following property holds: contains a block which is zero with the exception of a square sub-block which is the identity matrix.

Appendix D Calculation of using the Sherman-Morrison-Woodbury formula

Here, we derive the expression (23) directly from the identity (50) using the Sherman-Morrsion-Woodbury formula. The algebra is a bit more tedious. First, we remind some notations useful to explicit the matrix and dependencies on . is the reference Slater matrix and is the matrix which selects the columns from which is made

(58)

is the matrix such that is the list of the columns of which differ from those of (see for example Eq. (12)). The matrix is a diagonal matrix: if is the index of a column which differ in and , , while otherwise. Consequently, the identity

(59)

holds. The list of excited orbitals are the columns of and can be selected from with the aid of the matrix such that

(60)

as in the example Eq. (13). With these definitions

(61)

and the matrix which selects the columns from which , is given by

(62)

Now, writing and applying the Sherman-Morrison-Woodbury formula, we obtain

(63)

so that

(64)

where we have introduced