Massive-Parallel Implementation of the Resolution-of-Identity Coupled-Cluster Approaches in the Numeric Atom-Centered Orbital Framework for Molecular Systems
Abstract
We present a massively parallel implementation of the resolution-of-identity (RI) coupled-cluster approach that includes single, double, and perturbative triple excitations, namely RI-CCSD(T), in the FHI-aims package for molecular systems. A domain-based distributed-memory algorithm in the hybrid MPI/OpenMP framework has been designed to effectively utilize the memory bandwidth and to significantly minimize the interconnect communication, particularly for the tensor contraction in the evaluation of the particle-particle ladder term. Our implementation features a rigorous avoidance of on-the-fly disk storage and excellent strong scaling up to 10,000 and more cores. Taking a set of molecules of different sizes, we demonstrate that the parallel performance of our CCSD(T) code is competitive with the CC implementations in state-of-the-art high-performance computing (HPC) computational chemistry packages. We also demonstrate that the numerical error due to the use of the RI approximation in our RI-CCSD(T) is negligibly small. Together with the correlation-consistent numeric atom-centered orbital (NAO) basis sets, NAO-VCC-$n$Z, the method is applied to produce accurate theoretical reference data for 22 bio-oriented weak interactions (S22), 10 relative energies of gaseous cysteine conformers (CYCONF), and 34 isomerization energies (ISO34).
I Introduction
Coupled-cluster (CC) theory is a well-established wave-function-based electronic-structure approach that originates in nuclear physics Coester [1958], Coester and Kümmel [1960], but flourishes in the quantum-chemistry community Čížek [1966], Paldus et al. [1972], Szabo and Ostlund [1996], Bartlett and Musiał [2007]. Coupled-cluster theory holds many theoretical advantages, for example size extensivity Bartlett and Musiał [2007], Hirata [2011] and orbital invariance Bartlett and Musiał [2007], Szabo and Ostlund [1996], both of which are crucial for a correct description of large and even extended systems. Compared to the widely used density-functional approximations, the CC hierarchy provides a systematic way to approach the exact description of electron-correlation effects, at least for systems with not too small HOMO-LUMO gaps. The coupled-cluster ansatz with single and double excitations (CCSD) Purvis and Bartlett [1982] with its perturbative consideration of triple excitations, known as CCSD(T) Raghavachari et al. [1989], has long been recognized as the “gold standard” in quantum chemistry. However, the price to pay for these potential benefits is a considerably increased numerical complexity, which manifests itself in an $O(N^6)$ computational scaling and an $O(N^4)$ memory requirement with system size $N$ for the CCSD method based on the optimal formulation proposed by Scuseria, Janssen, and Schaefer Scuseria et al. [1988]. CCSD(T) shares the same memory requirement, but with a one order of magnitude higher computational scaling, $O(N^7)$. This “curse of dimensionality” ed. [1961], together with the well-known slow basis-set convergence problem Dunning [1989], Tew et al. [2007], significantly hampers the application of the CC methods to large systems.
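To make these scaling relations concrete, the following Python sketch estimates the leading-order memory and FLOP terms; the unit prefactors and the chosen orbital counts are illustrative assumptions, not measured values.

def ccsd_costs(n_occ, n_virt):
    """Leading-order cost model: O(n_o^2 n_v^2) memory for the doubles
    amplitudes, O(n_o^2 n_v^4) FLOPs per CCSD iteration, and O(n_o^3 n_v^4)
    FLOPs for the perturbative (T) step (unit prefactors assumed)."""
    mem_t2_gb = 8.0 * n_occ**2 * n_virt**2 / 1e9  # double precision, no symmetry
    flop_ccsd_iter = n_occ**2 * n_virt**4
    flop_triples = n_occ**3 * n_virt**4
    return mem_t2_gb, flop_ccsd_iter, flop_triples

# Doubling all orbital counts (system size N -> 2N) multiplies the memory
# by 2^4 = 16, the CCSD iteration cost by 2^6 = 64, and (T) by 2^7 = 128.
for scale in (1, 2):
    print(ccsd_costs(50 * scale, 190 * scale))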
Several reduced-scaling approximations of CC methods have been proposed for molecular systems Sæbø and Pulay [1985], Scuseria and Ayala [1999], Schütz and Werner [2001], Riplinger et al. [2016], but a high-performance massively parallel implementation of the conventional CCSD and CCSD(T) methods – the goal of this paper – is still highly valuable. It also serves as the ultimate benchmark for moderate-size systems in the complete basis set (CBS) limit for the reduced-scaling formalisms Peng et al. [2016], and it paves the way to studying the applicability of the CC methods and their reduced-scaling variants in solids, which is quickly emerging as an active field in computational materials science Booth et al. [2013], Michaelides et al. [2015], Hummel et al. [2017].
The first massively parallel implementation of the conventional closed-shell CCSD energy was reported in 1992 by Rendell et al. Rendell et al. [1992], adopting Scuseria's formulation with optimal computational scaling Scuseria et al. [1988]. Rendell's algorithm forms the basis of most state-of-the-art CC codes aiming at massively parallel calculations. It has been well recognized that the major challenge towards an efficient massively parallel CCSD(T) implementation based on Rendell's algorithm is not the evaluation of the most expensive perturbative (T) part, but the handling of several large four-dimensional arrays, namely the intermediate data, in the CCSD iteration Rendell et al. [1992], Kobayashi and Rendell [1997], Anisimov et al. [2014], Harding et al. [2008]. These intermediate data include the electron repulsion integrals (ERIs), intermediate results of the two-step tensor contractions, and some CCSD amplitude tensors needed to construct new trial amplitudes (see section II for more details). For chemically interesting applications, these intermediate data can reach the terabyte scale and thus cannot be stored in the random access memory (RAM) available on a single compute node. The traditional strategy of storing these data arrays on hard disks, with its heavy disk input/output (I/O) traffic, is obviously not an option for efficient massively parallel computation.
A data strategy that stores (part of) the large arrays in the distributed RAM of the compute nodes has been utilized in several CC codes in recent high-performance computing (HPC) packages, for example NWChem Valiev et al. [2011], Anisimov et al. [2014], Q-Chem Shao et al. [2015], GAMESS Schmidt et al. [1993], Olson et al. [2007], Aquarius Solomonik et al. [2014], and MPQC Janssen et al. [2008], to name a few. In order to access the remote memory storage, the distributed-memory concept has to be implemented either in terms of the basic message passing interface (MPI) directly, or by employing sophisticated array distribution toolkits, like the global arrays (GA) shared-memory model Nieplocha et al. [1996], which has been used in NWChem Valiev et al. [2011] and GAMESS Asadchev and Gordon [2013]. Furthermore, some other versatile toolkits proposed recently integrate advanced tensor utilities into self-defined distributed-memory array frameworks, which can significantly simplify a massively parallel CC code while retaining high parallel efficiency. These toolkits include the so-called Cyclops Tensor Framework (CTF) Solomonik et al. [2014], used in the Aquarius Solomonik et al. [2014] and Q-Chem packages Shao et al. [2015], and the TiledArray tensor framework, used in the MPQC package Janssen et al. [2008].
The distributed-memory parallel closed-shell CCSD implementation using the GA toolkit in NWChem Valiev et al. [2011] was proposed in 1997 by Kobayashi and Rendell Kobayashi and Rendell [1997]. In order to avoid storing the whole ERI array in the molecular orbital (MO) basis, Rendell's algorithm Rendell et al. [1992] was adopted, i.e., the contributions involving ERIs with three and four virtual (unoccupied) MO indices are reformulated in the atomic-orbital (AO) basis and the relevant AO integrals are computed on-the-fly, namely the AO integral-direct algorithm. A strong-scaling test on a Cray T3D demonstrated excellent parallel efficiency (about 70%) upon increasing the number of processors from 16 to 256.
Thereafter, Anisimov and coworkers realized that, when using more processors, the performance-limiting factor becomes the intensive network communication of the doubles amplitudes in the tensor contraction evaluating the so-called particle-particle ladder (pp-Ladder) term, which is the time-determining step in CCSD energy calculations Anisimov et al. [2014]. This led to a recent improvement in NWChem which replicates the symmetrized doubles amplitudes to reduce the communication in the AO integral-direct algorithm Anisimov et al. [2014]. The improved CCSD code in NWChem provides a notable speedup compared to the original distributed-memory code on 1,100 nodes and exhibits good strong scaling from 1,100 to 20,000 nodes on a “Cray XE6” supercomputer with 64 GB RAM per node Anisimov et al. [2014]. A similar replication algorithm had been used by Harding et al. to implement the CC methods in the MAB variant of the ACES II package Harding et al. [2008]. However, the replication algorithm significantly increases the memory consumption and somewhat weakens the memory scalability of the code.
To alleviate the distributed-memory communication latency, CTF Solomonik et al. [2014] resorts to a hybrid MPI/OpenMP communication layer together with the 2.5D variant of the sophisticated (so-called communication-optimal) SUMMA algorithm Van De Geijn and Watts [1997] for tensor contractions. To take advantage of the CTF utility, the current CCSD implementations in Aquarius and Q-Chem employ an open-shell CC formulation even for closed-shell cases, while all intermediate arrays are fully stored and distributed in memory, including the most expensive one – the ERIs in the MO basis. The CTF-based CCSD implementations present excellent strong scaling on the Cray XC30 supercomputer Solomonik et al. [2014]. For systems with hundreds of electrons and MOs, the parallel efficiency remains about 50% on thousands of cores.
The closed-shell CCSD implementation in MPQC was proposed by Peng et al. in 2016 Peng et al. [2016]. It uses the TiledArray toolkit to distribute in memory all necessary intermediate arrays with more than one index. TiledArray is an open-source framework for the distributed-memory parallel implementation of dense and block-sparse tensor arithmetic, which features a SUMMA-style communication algorithm with a task-based formulation. Both the conventional MO-only and the AO integral-direct approaches are implemented. For the first approach, the resolution-of-identity (RI) approximation is employed to reduce the computational load of the ERIs in the MO basis. A distinctive feature of Peng's implementation is to completely turn off the use of permutation symmetry, particularly in the rate-limiting tensor contractions evaluating the pp-Ladder term in the CCSD equations. Despite the significantly increased computational cost in terms of floating point operations (FLOPs) – about three times more expensive than Rendell's algorithm – this choice allows for excellent parallel performance, demonstrated to be scalable from a standalone multicore workstation with 16 cores to 32 nodes on the BlueRidge supercomputer with 408 nodes hosted by Virginia Tech Advanced Research Computing (VT ARC).
Besides the aforementioned distributed-memory CC codes, there are many other CC algorithms implemented in the same or other packages. To name a few in brief: another CC implementation in NWChem is based on the Tensor Contraction Engine (TCE) Hirata [2003], Hu et al. []; the original CC codes in Q-Chem use a general tensor contraction library (the so-called libtensor) for shared-memory architectures Epifanovsky et al. [2013], Shao et al. [2015]; the parallel data strategies used in GAMESS include the Distributed Data Interface (DDI) for intra-node parallelism Olson et al. [2007] and the hybrid local disk+GA model Asadchev and Gordon [2013]; the ACES III package uses the super instruction assembly language (SIAL) for distributed-memory tensor contractions in CC theory Deumens et al. [2011]; and the closed-shell AO-driven CCSD and CCSD(T) methods in the PQS package use the Array Files (AF) scheme Ford et al. [2007]. These CC implementations in quantum chemistry all use Gaussian-type basis sets; but note that the CTF toolkit has very recently been used together with plane-wave basis sets to provide CC-quality results for solid-state systems Hummel et al. [2017].
In this paper, we describe a massively parallel implementation of RI-based CCSD(T) for molecules using numeric atom-centered orbital (NAO) basis sets in the Fritz Haber Institute ab initio molecular simulations package (FHI-aims) Blum et al. [2009]. A domain-based distributed-memory strategy built upon the hybrid MPI/OpenMP communication layer is proposed, which replicates the intermediate data across domains while distributing them among the compute nodes associated with the same domain. It allows for effectively utilizing the overall memory capacity to reduce the interdomain communication latency without losing the parallel scalability for larger systems. Motivated by Peng's algorithm Peng et al. [2016], we partially turn off the permutation symmetry in the rate-limiting tensor contraction steps, which reduces the difficulty of designing a load-balanced distributed-memory strategy and alleviates the intradomain communication latency. The subtensor contractions in each processor are carried out using a multithreaded BLAS library, and the data movements among processors are performed via a non-blocking MPI two-sided communication scheme. In this first CCSD(T) implementation in FHI-aims, we do not yet equip our code with sophisticated tensor contraction toolkits; this will be added soon. We expect a further improvement of the intradomain tensor operations, in particular by employing the CTF and/or TiledArray toolkits.
In Section II, we describe in detail the domain-based distributed-memory strategy in the context of evaluating the time-determining step in CCSD energy calculations, i.e., the pp-Ladder term; and we discuss the key advantage of our algorithm of partially turning off the permutation symmetry in the same context. Section III documents the performance of our code on a single compute node against three state-of-the-art CCSD implementations with excellent shared-memory parallelism in the MPQC, Psi4 Turney et al. [2012], and ORCA Neese [2012] packages. We also benchmark the strong-scaling performance of our implementation on up to 512 nodes with 10,240 cores in total and 128 GB RAM per node, and compare with the distributed-memory CCSD codes implemented in Aquarius, MPQC, and NWChem.
FHI-aims provides a series of NAO basis sets with valence-correlation consistency, termed NAO-VCC-$n$Z with $n$ = 2–5, which ensure that wave-function-based methods converge consistently to the complete basis-set (CBS) limit Zhang et al. [2013]. The RI implementation in FHI-aims features a prescription for producing an accurate, method-independent auxiliary basis set, which automatically adapts to a given basis set in a given system, therefore preventing a potential bias from auxiliary basis sets optimized for other methods Ren et al. [2012a], Ihrig et al. [2015a].
In Section IV, we demonstrate that (1) our RI-CCSD(T) reproduces the exact CCSD(T) results for absolute total energies using the same basis sets; and (2) with the aid of the extrapolation scheme and the composite approach, our RI-CCSD(T) with NAO-VCC-$n$Z provides accurate results for several widely used quantum-chemistry molecular test sets.
II Theory and Implementation
The closed-shell CCSD implementation in this work takes the spin-adapted formulation of Scuseria, Janssen, and Schaefer Scuseria et al. [1988], which is accelerated by the direct inversion of the iterative subspace (DIIS) method Scuseria et al. [1986]. Using the converged CCSD amplitudes, the non-iterative evaluation of the perturbative triples energy is then implemented based on the algorithm of Rendell and Lee Rendell et al. [1991, 1993]. Instead of repeating the description of the well-documented CC theory and CCSD(T) formalisms, we give details only for the key modifications employed in our approach.
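To illustrate the DIIS acceleration mentioned above, here is a minimal, generic Python sketch of the Pulay extrapolation step (not the FHI-aims implementation; the flattened-array layout is an assumption):

import numpy as np

def diis_extrapolate(amplitudes, residuals):
    """Given flattened amplitude vectors t_k and their residuals r_k from
    previous iterations, return sum_k c_k t_k, where the coefficients c_k
    minimize the norm of the combined residual subject to sum_k c_k = 1."""
    n = len(residuals)
    B = np.empty((n + 1, n + 1))
    B[:n, :n] = [[np.dot(ri, rj) for rj in residuals] for ri in residuals]
    B[:n, n] = B[n, :n] = -1.0
    B[n, n] = 0.0
    rhs = np.zeros(n + 1)
    rhs[n] = -1.0
    c = np.linalg.solve(B, rhs)[:n]  # drop the Lagrange multiplier
    return sum(ck * tk for ck, tk in zip(c, amplitudes))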
As in the usual convention, we will use the symbols $i, j, k, l$ and $n_o$ to denote the indices and the number of occupied MOs, and $a, b, c, d$ and $n_v$ for unoccupied MOs. Meanwhile, $p, q, r, s$ and $N$ will be used to label the indices and the number of unspecified MOs; letters of the Greek alphabet are used for the (auxiliary) AO basis sets, with $\alpha, \beta$ for AO basis functions and $\mu, \nu$ for auxiliary basis functions, whose number is $N_{\rm aux}$. The index values all start from 1. We focus on the massively parallel implementation of closed-shell CCSD(T); therefore, the indices are spin-free throughout the paper. Note that the open-shell version of CCSD has been coded in FHI-aims as well, but is not yet fully optimized, and the open-shell perturbative (T) implementation is not yet available.
II.1 Formalism
The CCSD wave function $|\Psi_{\rm CCSD}\rangle$ is obtained from the Hartree-Fock single Slater determinant ground state $|\Phi_0\rangle$:
$$|\Psi_{\rm CCSD}\rangle = e^{\hat{T}} |\Phi_0\rangle \qquad (1)$$
with the exponential cluster operator
$$\hat{T} = \hat{T}_1 + \hat{T}_2. \qquad (2)$$
$\hat{T}_1$ and $\hat{T}_2$ are spin-free single and double excitation operators for closed-shell systems, which can be described in the unitary group approach,
$$\hat{T}_1 = \sum_{ia} t_i^a \hat{E}_{ai}, \qquad (3)$$
$$\hat{T}_2 = \frac{1}{2} \sum_{ijab} t_{ij}^{ab} \hat{E}_{ai} \hat{E}_{bj}, \qquad (4)$$
with the definition of the unitary group generator as
$$\hat{E}_{ai} = \sum_{\sigma} \hat{a}_{a\sigma}^{\dagger} \hat{a}_{i\sigma}. \qquad (5)$$
Here $\hat{a}_{i\sigma}$ and $\hat{a}_{a\sigma}^{\dagger}$ are annihilation and creation operators, while $\sigma$ runs over the two spin states. The amplitudes of the single- and double-excitation configurations are $t_i^a$ (or $t_1$) and $t_{ij}^{ab}$ (or $t_2$). In our approach, all arrays with only one or two indices, including $t_1$, are replicated in each processor. $t_1$ in double precision consumes only $8 n_o n_v$ bytes, while $t_2$ is allocated in double precision and without symmetry, thus consuming $8 n_o^2 n_v^2$ bytes.
$t_1$ and $t_2$ are determined by solving the CCSD equations, which project the Schrödinger equation onto the Hartree-Fock ground state $\langle\Phi_0|$, the single-excitation states $\langle\Phi_i^a|$, and the double-excitation states $\langle\Phi_{ij}^{ab}|$ as follows,
$$\langle\Phi_0| e^{-\hat{T}} \hat{H} e^{\hat{T}} |\Phi_0\rangle = E_{\rm HF} + E_c^{\rm CCSD}, \qquad (6)$$
$$\langle\Phi_i^a| e^{-\hat{T}} \hat{H} e^{\hat{T}} |\Phi_0\rangle = 0, \qquad (7)$$
$$\langle\Phi_{ij}^{ab}| e^{-\hat{T}} \hat{H} e^{\hat{T}} |\Phi_0\rangle = 0. \qquad (8)$$
Here $\hat{H}$ is the Hamiltonian of the real system, $E_{\rm HF}$ the Hartree-Fock ground-state energy, and $E_c^{\rm CCSD}$ the CCSD correlation energy. The CCSD equations have to be solved iteratively when following the most widely used Jacobi solver Scuseria et al. [1986]. With the converged $t_1$ and $t_2$, the closed-shell CCSD and perturbative (T) correlation energies can be evaluated from the following equations:
$$E_c^{\rm CCSD} = \sum_{ijab} \left[ 2(ia|jb) - (ib|ja) \right] \left( t_{ij}^{ab} + t_i^a t_j^b \right), \qquad (9)$$
$$E^{\rm (T)} = \frac{1}{3} \sum_{ijk} \sum_{abc} \left( 4 W_{ijk}^{abc} + W_{ijk}^{bca} + W_{ijk}^{cab} \right) \left( V_{ijk}^{abc} - V_{ijk}^{cba} \right) / D_{ijk}^{abc}. \qquad (10)$$
In this (T) energy expression,
$$W_{ijk}^{abc} = P_{ijk}^{abc} \left[ \sum_d (ia|bd)\, t_{kj}^{cd} - \sum_l (ia|jl)\, t_{kl}^{cb} \right], \qquad (11)$$
$$V_{ijk}^{abc} = W_{ijk}^{abc} + t_i^a (jb|kc) + t_j^b (ia|kc) + t_k^c (ia|jb), \qquad (12)$$
and
$$D_{ijk}^{abc} = \epsilon_i + \epsilon_j + \epsilon_k - \epsilon_a - \epsilon_b - \epsilon_c, \qquad (13)$$
with $(ia|jb)$ being the electron repulsion integrals of molecular orbitals (see the next subsection for more details). $P_{ijk}^{abc}$ is the permutation operator generating the six simultaneous permutations of the index pairs $(i,a)$, $(j,b)$, $(k,c)$, and $\epsilon_p$ is the Hartree-Fock eigenvalue of molecular orbital $\phi_p$.
II.2 Intermediate data
As shown above, in addition to the single and double amplitudes $t_1$ and $t_2$, several large arrays with four indices, namely the intermediate data, are retrieved (and/or updated) heavily during the CCSD iteration. They mainly include:

$t_2^{(m)}$: the doubles amplitudes in the $m$th iteration step. In order to accelerate the CCSD iteration using the DIIS strategy, the doubles amplitudes of several previous steps have to be recorded. (Often 3 steps are enough in practical use.) In our approach, the symmetry $t_{ij}^{ab} = t_{ji}^{ba}$ is used to reduce the storage of each $t_2^{(m)}$ to $4 n_o^2 n_v^2$ bytes in double precision.

$(pq|rs)$: the electron repulsion integrals of molecular orbitals $\{\phi_p\}$, namely ERIs in the MO basis or MO-ERIs:
$$(pq|rs) = \iint \frac{\phi_p(\mathbf{r}_1)\phi_q(\mathbf{r}_1)\phi_r(\mathbf{r}_2)\phi_s(\mathbf{r}_2)}{|\mathbf{r}_1 - \mathbf{r}_2|}\, d\mathbf{r}_1 d\mathbf{r}_2, \qquad (14)$$
which consume $8N^4$ bytes in double precision without symmetry.
These intermediate arrays share the same $O(N^4)$ scaling in memory consumption, but it is obvious that the storage of the MO-ERIs is the most challenging one, because $n_v \gg n_o$ in practical use. Although the full $(pq|rs)$ tensor is used in CCSD(T) calculations, it is convenient to handle it separately in terms of the number of unoccupied indices, resulting in several subtensors as $(ij|kl)$, $(ij|ka)$, $(ia|jb)$, $(ia|bc)$, and $(ab|cd)$. Apparently, the MO-ERIs with four unoccupied indices, $(ab|cd)$, remain the most memory-demanding part. More importantly, as shown in the following discussion, $(ab|cd)$ is involved in the evaluation of the pp-Ladder contribution (see equation 20), making the manipulation of $(ab|cd)$ crucial for massive parallelism. Note that the MOs are expanded in terms of a set of AO basis functions $\{\chi_\alpha\}$,
$$\phi_p(\mathbf{r}) = \sum_\alpha c_{\alpha p}\, \chi_\alpha(\mathbf{r}), \qquad (15)$$
with the expansion coefficients $c_{\alpha p}$. The AO integral-direct algorithm proposed by Rendell avoids the direct storage of $(ab|cd)$: it reformulates the contributions involving $(ab|cd)$ in the AO representation and computes the relevant AO-ERIs $(\alpha\beta|\gamma\delta)$ on-the-fly Rendell et al. [1992], Kobayashi and Rendell [1997]. The AO integral-direct algorithm is mainly applied with Gaussian-type basis sets Peng et al. [2016], Scuseria et al. [1988], Sæbø and Pulay [1987], Kobayashi and Rendell [1997], Janowski et al. [2007].
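For orientation, the conventional (non-direct) route implied by equation 15 is the textbook four-index AO-to-MO transformation; the following numpy sketch performs it as four successive $O(N^5)$ quarter transformations. It is only a dense-array illustration, not the algorithm used in FHI-aims.

import numpy as np

def ao_to_mo_eri(eri_ao, C):
    """Transform AO ERIs (alpha beta | gamma delta), stored as a dense
    4-index array, to the MO basis (pq|rs) using the coefficients
    C[alpha, p] of eq. (15); four O(N^5) quarter transformations."""
    tmp = np.einsum('abgd,ds->abgs', eri_ao, C)   # fourth index
    tmp = np.einsum('abgs,gr->abrs', tmp, C)      # third index
    tmp = np.einsum('abrs,bq->aqrs', tmp, C)      # second index
    return np.einsum('aqrs,ap->pqrs', tmp, C)     # first index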
The resolution of identity (RI) approximation (also known as “density fitting”, DF) is now the most successful approach to alleviate the computational load of the ERIs Ren et al. [2012b], Ihrig et al. [2015b]. The RI approach allows for the decomposition of the fourth-rank ERI tensor in terms of third-rank tensors, which are therefore better suited to be pre-stored:
$$(pq|rs) \approx \sum_{\mu=1}^{N_{\rm aux}} M_{pq}^{\mu} M_{rs}^{\mu}, \qquad (16)$$
where the index $\mu$ runs over an auxiliary basis $\{P_\mu\}$ with the size of $N_{\rm aux}$.

$M_{pq}^{\mu}$: the decomposed third-rank tensor in the MO basis, which consumes $8 N^2 N_{\rm aux}$ bytes in double precision. In the RI-V approximation, $M_{pq}^{\mu}$ is determined by directly minimizing the errors in the AO-ERIs Ren et al. [2012b]:
$$M_{pq}^{\mu} = \sum_{\alpha\beta} c_{\alpha p} c_{\beta q} \sum_{\nu} (\alpha\beta|\nu)\, V_{\nu\mu}^{-1/2}. \qquad (17)$$
Here $(\alpha\beta|\nu)$ is the three-center integral between the AO basis pair and the auxiliary basis,
$$(\alpha\beta|\nu) = \iint \frac{\chi_\alpha(\mathbf{r}_1)\chi_\beta(\mathbf{r}_1) P_\nu(\mathbf{r}_2)}{|\mathbf{r}_1 - \mathbf{r}_2|}\, d\mathbf{r}_1 d\mathbf{r}_2, \qquad (18)$$
and $V_{\mu\nu}$ is the two-center Coulomb integral for the auxiliary basis functions,
$$V_{\mu\nu} = \iint \frac{P_\mu(\mathbf{r}_1) P_\nu(\mathbf{r}_2)}{|\mathbf{r}_1 - \mathbf{r}_2|}\, d\mathbf{r}_1 d\mathbf{r}_2. \qquad (19)$$
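The RI-V construction of equations 16–19 can be summarized in a few lines of numpy; this hedged sketch assumes the three-center integrals have already been transformed to the MO basis and that the Coulomb metric is positive definite.

import numpy as np

def ri_v_factor(three_center_mo, V):
    """M^mu_pq = sum_nu (pq|nu) [V^{-1/2}]_{nu mu}, eq. (17);
    three_center_mo has shape (N, N, N_aux) and V shape (N_aux, N_aux)."""
    w, U = np.linalg.eigh(V)
    V_inv_sqrt = (U / np.sqrt(w)) @ U.T          # V^{-1/2} via eigendecomposition
    return np.einsum('pqn,nm->pqm', three_center_mo, V_inv_sqrt)

def ri_eri(M):
    """Reassemble the approximate MO-ERIs of eq. (16): (pq|rs) ~ sum_mu M M."""
    return np.einsum('pqm,rsm->pqrs', M, M)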
RI methods and other tensor decomposition techniques, like the partial Cholesky decomposition (CD) Beebe and J. [1977], Roeggen and WisloffNilssen [1977] and the tensor hypercontraction scheme Hohenstein et al. [2012a], have been applied to CCSD and/or CCSD(T) implementations Rendell and Lee [1994], Bell et al. [2010], Pitonak et al. [2011], Hohenstein et al. [2012b], Benedikt et al. [2013], DePrince and Sherrill [2013], Parrish et al. [2014], Peng et al. [2016]. DePrince and coworkers demonstrated that auxiliary basis sets optimized for MP2 theory are able to provide accurate RI-CCSD(T) results for weak interactions and reaction energies DePrince and Sherrill [2013]. Furthermore, it has been proposed that tensor decomposition techniques can be used to reduce the scaling of either the most vexing term in the CCSD equations Parrish et al. [2014] or even the whole CCSD equations Bell et al. [2010], Benedikt et al. [2013], Hohenstein et al. [2012b]. In our approach with NAO basis sets, $(ab|cd)$ is calculated on-the-fly in the RI-V fashion during the CCSD(T) calculations, while the remaining MO-ERIs with fewer unoccupied indices are pre-stored and distributed in memory directly.
Table 1: Memory consumption (in GB) of the main pre-stored intermediate data.^a
Complex  $n_o$  $n_v$  $N_{\rm aux}$  $M_{pq}^{\mu}$  $t_2$  $(ia|bc)$
(H$_2$O)$_{10}$  50  190  920  0.42  0.72  2.74
(H$_2$O)$_{15}$  75  285  1380  1.43  3.66  13.89
(H$_2$O)$_{20}$  100  380  1840  3.39  11.55  43.90
carotene  108  884  4504  35.46  72.92  596.86
^a The basis sets used in these calculations are cc-pVDZ for the water clusters and a modified triple-zeta basis set (mTZ) for carotene Hu et al. [], Peng et al. [2016].

$(ij|kl)$, $(ij|ka)$, and $(ia|jb)$: the MO-ERIs with zero, one, and two unoccupied indices, which consume $8 n_o^4$ bytes, $8 n_o^3 n_v$ bytes, and $8 n_o^2 n_v^2$ bytes, respectively, in double precision.

$(ia|bc)$: the MO-ERIs with three unoccupied indices, which consume $8 n_o n_v^3$ bytes in memory. In order to obtain good performance for the perturbative triples (T) evaluation, our approach follows the idea of Rendell, Lee, and Komornicki Rendell et al. [1991] to pre-store and distribute $(ia|bc)$ in memory. But for the tensor contractions with those $(ia|bc)$ that are not locally pre-stored, the RI-V approximation is used to reduce the communication.
In consequence, our approach allocates and/or prepares a series of fourth-rank tensors together with one third-rank RI-V tensor before the CCSD iteration procedure. Table 1 provides the memory consumption of these intermediate data for a set of molecules. Not surprisingly, the memory requirement increases dramatically with the system size. Taking the cluster of 20 water molecules, (H$_2$O)$_{20}$, with the cc-pVDZ basis set as an example, it consumes 11.5 GB for the doubles amplitudes, 3.4 GB for the RI-V tensor $M_{pq}^{\mu}$, and 43.9 GB for $(ia|bc)$. Considering that several temporary buffers with a size similar to the doubles amplitudes are needed (see section II.4 for details), the total memory requirement largely exceeds the memory capacity of a single compute node in today's supercomputers. Thus the data have to be distributed over many compute nodes to fully utilize the global memory capacity.
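The figures of Table 1 follow directly from the byte counts quoted in this subsection; the small estimator below reproduces them, assuming the column assignment given in the table and 1 GB = 10^9 bytes.

def intermediate_memory_gb(n_occ, n_virt, n_aux):
    """Memory (GB, double precision) of the three leading pre-stored
    intermediates: the RI factor M over all MO pairs, one copy of the
    doubles amplitudes without symmetry, and the (ia|bc) MO-ERIs."""
    n_mo = n_occ + n_virt
    ri = 8.0 * n_mo**2 * n_aux / 1e9
    t2 = 8.0 * n_occ**2 * n_virt**2 / 1e9
    ia_bc = 8.0 * n_occ * n_virt**3 / 1e9
    return ri, t2, ia_bc

# (H2O)20 with cc-pVDZ, cf. Table 1: ~3.39, ~11.55, ~43.90 GB
print(intermediate_memory_gb(100, 380, 1840))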
II.3 The pp-Ladder evaluation
As widely discussed in the literature Rendell et al. [1992], Kobayashi and Rendell [1997], Anisimov et al. [2014], Harding et al. [2008], massively parallel CCSD(T) calculations are essentially communication bound, particularly in calculating the pp-Ladder diagram generated in equation 8. The resulting pp-Ladder array $\Delta_{ij}^{ab}$ shares the same size as the doubles amplitudes and is contracted by
$$\Delta_{ij}^{ab} = \sum_{cd} \tau_{ij}^{cd}\, w_{cd}^{ab}, \qquad (20)$$
with the definition of $\tau_{ij}^{cd}$ as
$$\tau_{ij}^{cd} = t_{ij}^{cd} + t_i^c t_j^d \qquad (21)$$
and of $w_{cd}^{ab}$ (as large as $(ab|cd)$) as
$$w_{cd}^{ab} = (ac|bd) - \sum_k t_k^a (kc|bd) - \sum_k t_k^b (ac|kd) + \sum_{kl} t_k^a t_l^b (kc|ld). \qquad (22)$$
In our approach, the RI-V approximation is used to evaluate $(ac|bd)$ and the non-locally-stored $(kc|bd)$-type integrals on-the-fly, leading to the construction of $w_{cd}^{ab}$ as
$$w_{cd}^{ab} = \sum_\mu \tilde{M}_{ac}^{\mu}\, \tilde{M}_{bd}^{\mu}, \qquad (23)$$
with
$$\tilde{M}_{ac}^{\mu} = M_{ac}^{\mu} - \sum_k t_k^a M_{kc}^{\mu}. \qquad (24)$$
The hybrid-RI algorithm for the MO-ERIs with three unoccupied indices is designed to balance the computing cost and communication in our domain-based distributed-memory strategy, which will be introduced in the following section. As a result, the total cost of evaluating the pp-Ladder array in terms of FLOPs is dominated by the assembly of $w_{cd}^{ab}$ via equations 23 and 24, with $O(N_{\rm aux} n_v^4)$ scaling, and by the tensor contraction of equation 20, whose $O(n_o^2 n_v^4)$ scaling constitutes the leading term.
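On a single process, equations 20–24 amount to the following numpy sketch; the dressed-factor form of $\tilde{M}$ is our assumed reading of equations 22–24, and the production code of course never stores the full $w$ array but builds it block by block.

import numpy as np

def pp_ladder(t1, t2, M_vv, M_ov):
    """t1: (o,v); t2: (o,o,v,v) as t2[i,j,a,b]; M_vv: (v,v,aux) = M^mu_ac;
    M_ov: (o,v,aux) = M^mu_kc. Returns Delta[i,j,a,b] of eq. (20)."""
    tau = t2 + np.einsum('ic,jd->ijcd', t1, t1)      # eq. (21)
    Mt = M_vv - np.einsum('ka,kcm->acm', t1, M_ov)   # eq. (24), assumed dressing
    w = np.einsum('acm,bdm->acbd', Mt, Mt)           # eqs. (22)-(23) via RI
    return np.einsum('ijcd,acbd->ijab', tau, w)      # eq. (20)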
In the seminal paper of the CCSD formulation Scuseria et al. [1988], Scuseria et al. innovatively divided the contraction into two auxiliary, but highly symmetric tensor contractions, which reduces the FLOP count of the leading term by another factor of two. Scuseria's formulation provides the optimal computational cost for CCSD and has been widely used in quantum-chemistry codes, such as Gaussian Frisch et al. [2016] and ORCA Neese [2012]. However, in line with the observations of previous works on massively parallel CC methods, we find that the FLOP count is not the only issue relevant to the computational cost. Other aspects such as I/O operations, communication, load balance, and vectorization are also of (often vital) importance.
Our implementation only takes advantage of the symmetry in equation 20,
$$\Delta_{ij}^{ab} = \Delta_{ji}^{ba}. \qquad (25)$$
This reduces the numerical workload to double loops with $a \ge b$. Accordingly, we introduce a combined index $(ab)$ with $a \ge b$, and denote the relevant tensors as $w_{cd}^{(ab)}$ and $\Delta_{ij}^{(ab)}$. In addition to the advantage of reducing the leading cost (equation 20) to $\frac{1}{2} n_o^2 n_v^4$, this approach does not introduce any additional manipulation of $w$ – the largest intermediate tensor in this work – making it easier to design a load-balanced distributed-memory strategy with reduced communication.
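The combined index can be illustrated with a short helper (the names are hypothetical):

def packed_pairs(n_virt):
    """Enumerate the combined index (ab) <-> (a, b) with a >= b, which halves
    the outer workload of eq. (20); the missing (b, a) block is recovered
    from the symmetry Delta[j,i,b,a] = Delta[i,j,a,b] of eq. (25)."""
    return [(a, b) for a in range(n_virt) for b in range(a + 1)]

assert len(packed_pairs(4)) == 4 * 5 // 2  # n_v (n_v + 1) / 2 pairs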
II.4 Domain-based distributed-memory strategy
It is well documented that, for massively parallel calculations, CCSD implementations based on the standard distributed-memory strategy encounter the so-called communication bottleneck. This is because the large intermediate data mentioned above are distributed globally and uniformly over the processors, so that the data communication becomes unaffordable very quickly with an increasing number of processors. For convenience, we use the symbols $p$ and $N_p$ to denote the index and the number of processors in the standard distributed-memory strategy.
To alleviate this problem, a domain-based distributed-memory strategy is proposed, which groups the processors into different domains. Specifically, we introduce an index $d$ to label the domain of a certain processor. This results in a 2D-grid distribution $(d, p')$ of the compute processors, with $p'$ labeling the processors in each domain. In the current version, we restrict the number of processors per domain to be the same,
$$N_{p'} = N_p / N_d, \qquad (26)$$
with the integer $N_d$ denoting the number of domains. The system-dependent $N_d$ is determined to ensure that the global memory of each domain is sufficiently large to accommodate all pre-stored intermediate data together with a certain amount of buffer space. Figure 1 visualizes the 2D hierarchical organization of the compute processors in the domain-based distributed-memory strategy. Since each domain now possesses a full copy of the pre-stored intermediate data, it is very easy to design the corresponding CC algorithm with a negligible interdomain communication load, thus by definition ensuring the scalability in massive parallelism. On the other hand, the doubles amplitudes and all intermediate data with more than two unoccupied or full MO indices are uniformly distributed across the compute nodes in each domain. Therefore, our domain-based distributed-memory strategy can be considered a straightforward generalization of the standard distributed-memory strategy with optimal flexibility to fully utilize the global memory capacity and alleviate the communication latency at scale. By setting $N_d = 1$, it reduces to the standard distributed-memory strategy.
In order to take advantage of the hybrid MPI/OpenMP model, it is better to create fewer MPI processes per node. Assuming that each node hosts one process, the 2D-grid hierarchy of processors shown in figure 1 is equivalent to that of the compute nodes. Note that, for supercomputers composed of dual-socket x86-based blades, the optimal choice is one process per socket (i.e., two processes per node).
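A minimal sketch of this 2D processor grid, written with mpi4py purely for illustration (FHI-aims realizes it in its own MPI layer; the function name is hypothetical):

from mpi4py import MPI

def build_domain_grid(n_domains):
    """Split COMM_WORLD into n_domains equally sized domains: ranks with the
    same color form one domain (index d) and obtain an intra-domain
    communicator; replicated intermediates live once per domain, distributed
    ones are split over the intra-domain ranks p'."""
    world = MPI.COMM_WORLD
    assert world.Get_size() % n_domains == 0  # equal domain sizes, eq. (26)
    d = world.Get_rank() % n_domains          # domain index of this rank
    intra = world.Split(color=d, key=world.Get_rank())
    return d, intra                           # intra.Get_rank() is p'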
Our domain-based distributed-memory strategy requires a specific parallel algorithm. Let us take the evaluation of the pp-Ladder array $\Delta_{ij}^{(ab)}$ as an example; the corresponding pseudocode is presented in figure 2, and a schematic sketch is given after the following list. The important points are as follows:

The tensor contraction for $\Delta_{ij}^{(ab)}$ is divided into subtensor contractions labeled by the unoccupied index $a$. Since each domain possesses a full copy of all pre-stored intermediate data, the resulting interdomain subtensors will be contracted in each domain without any interdomain communication. $a_d$ is the local unoccupied index in the $d$th domain with $\{a_d\} \subset \{a\}$:
$$\Delta_{ij}^{(a_d b)} = \sum_{cd} \tau_{ij}^{cd}\, w_{cd}^{(a_d b)}. \qquad (27)$$
In consequence, a series of subtensors $w_{cd}^{(a_d b)}$, $\tilde{M}_{a_d c}^{\mu}$, with a similar interdomain distribution is needed.

In each domain, $\tau_{ij}^{cd}$ is distributed across the compute processors in terms of the index $c$, resulting in the intradomain subtensors $\tau_{ij}^{c_{p'} d}$. Here we define a new intradomain local index $c_{p'}$ in the $p'$th processor, which obeys $\{c_{p'}\} \subset \{c\}$. Meanwhile, $w$ and $\Delta$ are uniformly distributed in each domain as $w_{cd}^{(a_d b)_{p'}}$ and $\Delta_{ij}^{(a_d b)_{p'}}$. The newly introduced local index $(a_d b)_{p'}$ indicates the intradomain distribution of the combined unoccupied index $(a_d b)$.

To minimize the intradomain communication, the pre-stored intermediate data are either distributed before the CCSD(T) calculation or redistributed before evaluating the pp-Ladder term: (A) the doubles amplitudes; (B) the MO-ERIs with three unoccupied indices; (C) the RI-V tensor.

For the evaluation of $w_{cd}^{(a_d b)}$, a hybrid-RI strategy is employed. As the MO-ERIs $(ia|bc)$ are distributed over three indices in the same way as $w$, no intradomain communication is needed in the last contraction of equation 22. However, for the second contraction of equation 22, this kind of distribution cannot avoid intensive communication. This motivates the hybrid-RI strategy, i.e., using the RI-V approximation for those $(ia|bc)$ that are not pre-stored locally (see equation 23).

Similar to $(ia|bc)$, the intradomain distribution of the RI-V tensor cannot avoid the communication needed to prepare $\tilde{M}_{ac}^{\mu}$, which is requested by all processors within a domain to calculate $w$ on-the-fly (see equations 22 and 23). Profiting from the domain-based concept, this intradomain communication load decreases with $N_d$, because the workload of the loop over $a$ is well parallelized across domains.

Once all intermediate data have been evaluated, the evaluation of $\Delta_{ij}^{(a_d b)}$ is accomplished in three steps: (1) divide the task into subtasks, each of which takes over a block of the combined index, i.e. $(a_d b)_{p'}$, and will finally be stored in processor $p'$; (2) for each subtask, calculate
$$\left.\Delta_{ij}^{(a_d b)_{p'}}\right|_{c_{p'}} = \sum_{c_{p'} d} \tau_{ij}^{c_{p'} d}\, w_{c_{p'} d}^{(a_d b)_{p'}}, \qquad (28)$$
which is an incomplete contraction looping over the local index $c_{p'}$; (3) perform intradomain communication to sum over the partial results and store $\Delta_{ij}^{(a_d b)_{p'}}$ in processor $p'$. A non-blocking MPI two-sided communication scheme is utilized to balance the contraction workload of step (2) against the communication latency of step (3).
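As a compact stand-in for the pseudocode of figure 2, the following single-process numpy sketch mirrors steps (2) and (3) under our reading of the algorithm: it loops over the packed $(ab)$ blocks, builds each $w$ block on the fly via RI, and performs the partial contraction; in the parallel code, fetching the $\tilde{M}$ slices and summing the partial results over $c_{p'}$ are intra-domain MPI operations.

import numpy as np

def pp_ladder_blocks(tau, Mt_vv, pairs):
    """tau: (o,o,v,v); Mt_vv: (v,v,aux) dressed RI factors; pairs: list of
    (a, b) blocks owned by this processor. Returns Delta for each block."""
    n_occ = tau.shape[0]
    delta = np.zeros((len(pairs), n_occ, n_occ))
    for n, (a, b) in enumerate(pairs):
        w_ab = np.einsum('cm,dm->cd', Mt_vv[a], Mt_vv[b])  # eq. (23) block
        delta[n] = np.einsum('ijcd,cd->ij', tau, w_ab)     # eq. (28) contraction
    return delta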
By partially turning off the use of the permutation symmetry in equation 20 and employing the hybrid-RI strategy to evaluate part of the MO-ERIs on-the-fly, our implementation avoids the heavy network communication of $(ab|cd)$ and of the symmetrized doubles amplitudes in the AO basis requested by the AO integral-direct algorithm Rendell et al. [1992], Anisimov et al. [2014]. With the domain-based concept, the total communication load of our approach for the pp-Ladder term still retains an intrinsic dependence on the problem size, but the leading cost in communication reduces significantly with respect to the number of domains $N_d$.
Another advantage of partially turning off the permutation symmetry is the possibility of distributing and retrieving $w_{cd}^{(ab)}$ as regular-shaped blocks, such that the tensor contraction of equation 20 can be realized block by block. At the cost of a moderately increased temporary buffer per processor – a few arrays on the order of the locally stored subtensors – this permits avoiding intensive intradomain exchange requests of small data packages, which often impose a big challenge on any interconnect.
III Performance
Our CCSD(T) algorithm was coded in the FHI-aims package using NAO basis sets Blum et al. [2009], Zhang et al. [2013]. In this section, we focus on the parallel performance of the CCSD code, which is a particularly difficult part of a massively parallel implementation of CCSD(T) and has been extensively investigated. The benchmark tests were carried out on the “HYDRA” supercomputer of the Max Planck Computing & Data Facility (MPCDF). Although HYDRA is not the most powerful HPC cluster available to us now, its hardware is on the same level as the supercomputers that were used in recent benchmark works on state-of-the-art massively parallel implementations of CCSD Solomonik et al. [2014], Peng et al. [2016], making it possible to compare the parallel performance of our code with others directly. In the appendix, we provide a detailed description of HYDRA as well as of the other two supercomputers, BlueRidge and Cray XC30, that were used to produce the CCSD results for comparison in this section.
We first benchmarked the multithreaded performance of the OpenMP application in our code using a single Sandy Bridge node of HYDRA. The frozen-core RI-CCSD calculations were carried out for a (H$_2$O)$_{10}$ cluster with the cc-pVDZ basis set. All these calculations use the domain-based distributed-memory setting with one domain and one processor ($N_d = N_{p'} = 1$) consistently. Figure 3 presents the time cost per CCSD iteration of the present work against the number of threads (labeled as FHI-aims). Compared with the calculation using one thread only (9.9 minutes per CCSD iteration), our RI-based CCSD code provides a speedup of 12.3 on 16 threads, resulting in a cost of 0.8 minutes per CCSD iteration. In the recent paper on the TiledArray-based massively parallel CCSD(T) implementation in MPQC Peng et al. [2016], the multithreaded performance of their code was investigated together with other state-of-the-art CCSD codes on the same system (frozen-core (H$_2$O)$_{10}$@cc-pVDZ) and the same hardware (2 Sandy Bridge CPUs with 8 physical cores per CPU, 64 GB of RAM per node). Figure 3 also shows the results of the three codes with the best multithreaded parallel performance reported in the literature, including the MO-only CCSD code in ORCA and two RI-based CCSD codes in MPQC and Psi4. Both the MO-only CCSD in ORCA and the RI-CCSD in MPQC present a superlinear multithreaded scaling, resulting in speedups of 18 and 16.9 on 16 threads against their own performance with one thread (it is not clear whether the hyperthreading mode was active in their benchmark). The RI-based CCSD code in Psi4 also possesses an excellent thread scaling, with a 14.5 speedup on 16 threads. By partially utilizing the permutation symmetry, our RI-CCSD code with one thread is about 1.9 times faster than MPQC, in which the symmetry has been completely turned off Peng et al. [2016]; and it remains about 1.4 times faster than MPQC on 16 threads according to the time cost reported in their paper.
In figure 4, we benchmark the domain-based distributed-memory parallel performance of our code versus the number of Sandy Bridge nodes for the same system and basis set, i.e. (H$_2$O)$_{10}$ with cc-pVDZ. As this system is small enough, we took each node (2 MPI processes with 8 OpenMP threads per process) as a domain ($N_{p'} = 2$). As discussed in the previous section, the domain-based strategy developed in this work prepares a full copy of the necessary intermediate data in each domain, thus significantly minimizing the interdomain communication. It is therefore not surprising that our code exhibits an excellent strong scaling against the number of nodes (or domains). The resulting parallel efficiency on 8 nodes (or 8 domains) is above 75% compared to the performance on 1 node. On the other hand, we also investigated the parallel performance of our code using all available nodes as a single domain ($N_d = 1$, the standard distributed-memory strategy). Our algorithm then still requires intradomain communication, but with a much smaller prefactor (see Section II.4). Together with the loop-blocking design in the MPI/OpenMP framework, this renders the parallel performance of our approach with $N_d = 1$ almost identical to the multi-domain calculations for this system; the corresponding data are therefore not plotted in figure 4. The same benchmark tests of two state-of-the-art CCSD implementations (the MPQC TiledArray-based CCSD code and the NWChem TCE CCSD code) were recently performed on the “BlueRidge” cluster, which is composed of the same Sandy Bridge nodes as HYDRA Peng et al. [2016]. A slightly different setting in their work is the use of 8 MPI processes with 2 OpenMP threads on each node. Taking the results reported in Peng's paper Peng et al. [2016], figure 4 also plots the strong-scaling behavior of MPQC and NWChem in this benchmark. Using 1 node, the time costs per CCSD iteration are 50 seconds and 245 seconds for MPQC and NWChem, respectively. As the intermediate data were distributed uniformly across all processes in both implementations, the reported parallel efficiency is around 35% to 50% from 1 node to 8 nodes.
Figure 5 presents the strong scaling of all-electron CCSD calculations for water clusters of different sizes ((H$_2$O)$_n$ with $n$ = 10, 15, 20) on HYDRA with up to 256 Ivy Bridge nodes, equipped with 64 GB of RAM per node and providing 5120 physical cores in total. The cc-pVDZ basis set was employed. In this benchmark, we set the number of nodes per domain to 2, 4, and 16 for the clusters with 10, 15, and 20 waters, respectively. In line with the above observations, figure 5 demonstrates again that our domain-based distributed-memory strategy enables CCSD calculations to scale to many thousands of cores, while achieving a high degree of efficiency in computation, communication, and storage. For comparison, figure 5 also plots the parallel performance of the CCSD code in Aquarius based on the results reported in Ref. Solomonik et al. [2014]. Due to the use of the CTF library to automatically manage tensor blocking, redistribution, and contractions in CC theory, their CTF-based CCSD implementation in the standard distributed-memory framework achieves high parallel scalability on the Cray XC30 supercomputer, which is composed of Intel Ivy Bridge Xeon E5-2695 v2 CPUs with 24 physical cores and 64 GB per node. Because the Aquarius CTF-based CCSD code stores all intermediate data in memory, including the most consuming $(ab|cd)$, their benchmark tests start from 4 nodes, 16 nodes, and 64 nodes for (H$_2$O)$_{10}$, (H$_2$O)$_{15}$, and (H$_2$O)$_{20}$, respectively. Note that the Ivy Bridge nodes used in Cray XC30 are more powerful than those in HYDRA (refer to the appendix for more details). However, the time per CCSD iteration for Aquarius is significantly longer, which is most likely due to the use of the open-shell CCSD formalism in the Aquarius code, so that more FLOPs are needed.
As one of the largest systems used in benchmarks of state-of-the-art CCSD implementations Peng et al. [2016], Hu et al. [], the carotene molecule with 96 atoms was also investigated in this work. The frozen-core CCSD calculation with a modified TZ basis set involves 108 valence orbitals, 884 unoccupied orbitals, and 4504 auxiliary basis functions. In accordance with the memory required to store a full copy of the intermediate data in our domain-based distributed-memory strategy, 32 Ivy Bridge fat nodes (128 GB of RAM per node; 640 physical cores in total) are grouped into one domain. Figure 6 presents the strong-scaling performance of our code for this case. Using 640 cores as one domain, one CCSD iteration takes 63 minutes, which is reduced to only 6 minutes if 10240 cores are employed ($N_d = 16$), corresponding to a parallel efficiency as high as 66%. For comparison, the time per CCSD iteration reported for the AO integral-direct CCSD code in MPQC is about 100 minutes on 32 Sandy Bridge nodes with 512 cores Peng et al. [2016]; and it is about 115 minutes for the NWChem TCE-based CCSD code on 48 Sandy Bridge nodes with 768 cores Hu et al. [].
Table 2 summarizes the time consumption of the CCSD and (T) calculations for water clusters with our domain-based distributed-memory implementation in FHI-aims. As the computational cost of (T) in terms of FLOPs is one order of magnitude higher than that of the CCSD procedure, the (T) part often takes longer. As shown in table 2, the cost ratio of (T) to CCSD is about 5 to 6 for the 10-water cluster, which, however, increases quickly to about 30 for 20 waters. Fortunately, the massively parallel implementation of the (T) part can be very efficient, as demonstrated in table 2.
Table 2: Computation times (in minutes) of the CCSD and perturbative (T) parts for water clusters.
Complex^a  Nodes per domain  $N_d$  CCSD^b  (T)
^a Calculations of (H$_2$O)$_{10}$ are all on Sandy Bridge nodes, while those of (H$_2$O)$_{15}$ and (H$_2$O)$_{20}$ are on Ivy Bridge nodes.
^b For simplicity, we assume that the CCSD calculations need 12 iterations to converge for all three molecules.
(H$_2$O)$_{10}$  2  1  8.7  49.6
2  2  4.6  25.1  
2  4  2.6  12.6  
2  8  1.4  6.5  
(H$_2$O)$_{15}$  4  2  16.4  220.1
4  4  9.1  109.1  
4  8  5.5  54.9  
(H$_2$O)$_{20}$  16  2  23.5  691.6
16  4  12.9  358.5  
16  8  7.5  183.3 
IV Results and Discussion
As the RI approximation is used to evaluate the MO-ERIs Ren et al. [2012a], Ihrig et al. [2015a], it is necessary to examine the numerical accuracy of our RI-CC approaches against CC results using analytic MO-ERIs and the same basis set. Taking the frozen-core CCSD total energies obtained from GAMESS with analytic MO-ERIs as reference, Figure 7 shows the absolute deviations of our RI-CCSD results for a test set of 18 small molecules and radicals, including CH$_4$, CO, CO$_2$, F$_2$, H$_2$, H$_2$O, MgO, N$_2$, NaF, NH$_3$, BeCl, BeH, CH, NO, NO$_2$, OH, O$_2$, and SO$_2$. For both Gaussian-type basis sets, cc-pVDZ and cc-pVTZ, the absolute deviations of our RI-CCSD results are smaller than 3 meV. The mean absolute deviations (MADs) are less than 0.6 meV for both basis sets, demonstrating that the auxiliary basis sets used in FHI-aims are accurate enough for calculations with CC methods.
The CCSD(T) method has been widely used to produce accurate theoretical reference data for popular molecular test sets, for example the S22 set with 22 bio-oriented noncovalent interactions Jurečka et al. [2006] and CYCONF with 10 relative energies of cysteine conformers Wilke et al. [2009]. These test sets have been widely used to benchmark newly developed electronic-structure approaches, mainly in density-functional theory. However, due to the demanding computational cost and the slow basis-set convergence, it is very difficult (often unfeasible with today's computer capacity) to provide CCSD(T) results in the complete basis-set (CBS) limit for the large molecules in these test sets. In this work, we utilize a composite methodology to approach the CBS CCSD(T) results,
$$E_{\rm CCSD(T)/CBS} = E_{\rm MP2/CBS} + \Delta E_{\rm CCSD(T)}, \qquad (29)$$
where the coupled-cluster correction at a finite basis set is defined as
$$\Delta E_{\rm CCSD(T)} = E_{\rm CCSD(T)}[n] - E_{\rm MP2}[n], \qquad (30)$$
and the converged MP2 total energy
$$E_{\rm MP2/CBS} = E_{\rm HF/CBS} + E_{c,\rm MP2/CBS} \qquad (31)$$
is achieved by two-point extrapolations of the converged HF total energy and the MP2 correlation energy using the correlation-consistent basis sets,
$$E_{\rm HF/CBS} = \frac{E_{\rm HF}[n_1]\, n_1^3 - E_{\rm HF}[n_2]\, n_2^3}{n_1^3 - n_2^3}, \qquad (32)$$
$$E_{c,\rm MP2/CBS} = \frac{E_{c,\rm MP2}[n_1]\, n_1^3 - E_{c,\rm MP2}[n_2]\, n_2^3}{n_1^3 - n_2^3}. \qquad (33)$$
Here, $n$ denotes the cardinal number of the correlation-consistent basis sets, for example the popular cc-pV$n$Z and the NAO-VCC-$n$Z with $n$ = 2–5 provided in FHI-aims Zhang et al. [2013]. This composite strategy rests on the assumption that the energy difference between MP2 and CCSD(T) converges much faster with the basis-set size than the MP2 and CCSD(T) energies themselves; a more accurate $\Delta E_{\rm CCSD(T)}$ can be obtained if the finite basis set used for it is larger. In this work, we chose the two-point extrapolation as recommended by Halkier et al. Halkier et al. [1999]. $E_{\rm MP2/CBS}$ was obtained by the two-point extrapolation between NAO-VCC-4Z and 5Z, while the calculations of $\Delta E_{\rm CCSD(T)}$ were carried out with NAO-VCC-3Z or 4Z according to the molecular size.
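The composite scheme of equations 29–33 is easily scripted; the sketch below assumes the inverse-cubic two-point form written above for both the HF and the MP2 correlation extrapolations.

def cbs_two_point(e1, n1, e2, n2):
    """Solve E(n) = E_CBS + A / n^3 for E_CBS from two cardinal numbers
    (eqs. 32-33, assumed inverse-cubic form)."""
    return (e2 * n2**3 - e1 * n1**3) / (n2**3 - n1**3)

def ccsd_t_cbs(e_hf_cbs, e_mp2c_cbs, e_ccsd_t_finite, e_mp2_finite):
    """Composite CCSD(T)/CBS estimate, eqs. (29)-(31): extrapolated HF plus
    MP2 correlation, corrected by the finite-basis CCSD(T) - MP2 difference."""
    return e_hf_cbs + e_mp2c_cbs + (e_ccsd_t_finite - e_mp2_finite)

# Example: MP2 correlation extrapolated from NAO-VCC-4Z (n=4) and 5Z (n=5):
# e_mp2c_cbs = cbs_two_point(e_mp2c_4z, 4, e_mp2c_5z, 5)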
The S22 test set was constructed by Jurečka et al. in 2006 Jurečka et al. [2006] to benchmark the accuracy of theoretical methods in the description of noncovalent interactions. The CCSD(T) reference data reported in the original paper, designated S22RefA, were calculated following a similar composite strategy (equation 29) with the CCSD(T) correction evaluated with smaller Gaussian-type basis sets (6-31G** and cc-pVDZ). In the past years, a number of studies have updated the CCSD(T) reference data for the S22 test set using larger basis sets for both the extrapolation and the correction terms in equation 29 Takatani et al. [2010], Podeszwa et al. [2010], Marshall et al. [2011]. The revised CCSD(T) reference data provided by Sherrill's group in 2012 are considered to be the most accurate results to date, designated S22RefB. However, these CCSD(T) reference data are all based on Gaussian-type basis sets, mainly (aug-)cc-pV$n$Z.
Table 3 lists the 22 interaction energies of CCSD(T) quality calculated by our RI-CCSD(T) code with NAO-VCC-$n$Z. Compared with the up-to-date reference data S22RefB, the present work yields deviations usually smaller than 0.1 kcal/mol, with an MAD of 0.059 kcal/mol and a maximum deviation of 0.19 kcal/mol for the parallel-displaced benzene dimer (No. 11); for comparison, the MAD between S22RefB and S22RefA is 0.136 kcal/mol. Our data demonstrate that accurate CCSD(T) results can be obtained using our RI-CCSD(T) code with the correlation-consistent NAO basis sets, NAO-VCC-$n$Z.
Table 3: Interaction energies (in kcal/mol) of the S22 test set.
Index  Present  S22RefA  S22RefB

1  3.13  3.17  3.13 
2  4.98  5.02  4.99 
3  18.78  18.61  18.75 
4  16.14  15.96  16.06 
5  20.73  20.65  20.64 
6  16.99  16.71  16.93 
7  16.70  16.37  16.66 
8  0.52  0.53  0.53 
9  1.53  1.51  1.47 
10  1.47  1.50  1.45 
11  2.84  2.73  2.65 
12  4.38  4.42  4.26 
13  9.74  10.12  9.81 
14  4.67  5.22  4.52 
15  11.82  12.23  11.73 
16  1.54  1.53  1.50 
17  3.29  3.28  3.28 
18  2.35  2.35  2.31 
19  4.55  4.46  4.54 
20  2.79  2.74  2.72 
21  5.65  5.73  5.63 
22  7.06  7.05  7.10 
We then applied our code to calculate the CCSD(T) reference data for the 10 relative energies of cysteine conformers in the CYCONF test set Wilke et al. [2009]. The geometries of all 11 stationary conformers were optimized at the MP2/cc-pVTZ level. The CCSD(T)/CBS reference data provided in the original literature, denoted CYCONFRefA, were obtained following a similar composite strategy (equation 29) with the CCSD(T) correction at the cc-pVTZ basis-set level. In the present work, we evaluated the CCSD(T) correction using an NAO basis set with more basis functions, NAO-VCC-4Z. Table 4 summarizes the relative energies with respect to the most stable conformer (No. 1). Our results predict the same energetic ordering of the conformers as CYCONFRefA Wilke et al. [2009]. However, inspecting table 4 reveals that the CCSD(T)/CBS relative energies of the present work are systematically smaller. In view of this systematic deviation, we suggest updating the original reference data with a better CCSD(T) correction at the (aug-)cc-pVQZ basis-set level.
Table 4: Relative energies (in kcal/mol) of the cysteine conformers (CYCONF) with respect to the most stable conformer.
Index  Present  CYCONFRefA

1  0.00  0.00 
2  1.46  1.52 
3  1.54  1.61 
4  1.87  1.95 
5  1.73  1.80 
6  2.01  2.10 
7  1.85  1.93 
8  2.12  2.18 
9  2.31  2.36 
10  2.60  2.56 
11  2.61  2.67 
Table 5: Isomerization energies (in kcal/mol) of the ISO34 test set: present CCSD(T)/CBS results versus experimental reference data.
Index  Present  Expt.  Index  Present  Expt.

1  1.58  1.62  18  11.69  11.16 
2  23.30  21.88  19  4.59  0.00 
3  7.44  7.20  20  17.95  20.23 
4  1.07  0.99  21  1.15  0.94 
5  1.22  0.93  22  3.27  3.23 
6  2.39  2.62  23  5.49  5.26 
7  11.10  11.15  24  11.98  12.52 
8  22.71  22.90  25  25.98  26.49 
9  6.42  6.94  26  16.68  18.16 
10  3.73  3.58  27  65.30  64.17 
11  1.50  1.91  28  30.89  31.22 
12  44.90  46.95  29  13.05  11.90 
13  36.46  36.04  30  9.89  13.60 
14  24.18  21.30  31  15.23  14.05 
15  8.21  7.26  32  6.56  2.40 
16  10.12  10.81  33  8.44  5.62 
17  28.40  26.98  34  6.59  7.26 
Isomerization is a well-defined reaction process in organic chemistry. The ISO34 test set is composed of 34 isomerization energies of small organic molecules Grimme et al. [2007]. The experimental reference data provided in the seminal paper are listed in table 5. In this work, we produced accurate theoretical reference data at the CCSD(T)/CBS level, in which the CCSD(T) correction was evaluated with the NAO-VCC-3Z basis set. We also present these CCSD(T)/CBS reference data in table 5 and visualize the deviations of the CCSD(T)/CBS data from the experimental data in figure 8. Although the MAD between the theoretical and experimental data is 1.1 kcal/mol, approaching the expected ‘chemical accuracy’, there are 13 reactions for which the deviation is larger than 1.0 kcal/mol, with the maximum deviation of 4.59 kcal/mol occurring for the 19th reaction. For the sake of benchmarking newly developed electronic-structure methods, we suggest using the CCSD(T)/CBS reference data, so that the comparison is based on exactly the same molecular geometries and is immune to experimental uncertainty, making the comparison well defined.
V Conclusions
In this work, we introduce a domain-based distributed-memory strategy to implement a massively parallel CCSD(T) code in the NAO framework for molecules. In this model, the compute processors are grouped into a number of domains. As each domain possesses a full copy of all intermediate data, the CCSD(T) calculations can be carried out in each domain with only slight interdomain communication. The permutation symmetry is partially turned off in our algorithm, and the RI approximation is used to evaluate $(ab|cd)$ and part of $(ia|bc)$ on-the-fly. These choices result in an efficient parallel algorithm with optimal intradomain communication. We demonstrate that our RI-CCSD(T) implementation in FHI-aims exhibits an outstanding parallel performance, which is scalable from multithreaded calculations on one compute node to 512 nodes with 10240 CPU cores.
As the first implementation of CCSD(T) in the NAO framework, we demonstrate that the numerical error due to the use of the RI approximation in our RI-CCSD(T) code is negligible. Together with the correlation-consistent NAO basis sets, NAO-VCC-$n$Z, we produce CCSD(T)/CBS reference data for three popular test sets in quantum chemistry: S22, CYCONF, and ISO34. Our CCSD(T)/CBS results are in good agreement with the theoretical reference data obtained using Gaussian-type basis sets for S22 and CYCONF. To replace the experimental reference data for ISO34, we suggest the use of our CCSD(T)/CBS results in future methodology development and benchmarking.
VI Acknowledgments
IYZ is grateful for the support from the 14th Recruitment Program of Young Professionals in China.
VII Appendix
The “HYDRA” supercomputer of the MPCDF was used to produce the CCSD(T) results of the FHI-aims code in this work. 610 nodes of HYDRA have 2 Intel Sandy Bridge Xeon E5-2670 central processing units (CPUs) with 8 physical cores per CPU, and 3500 nodes have 2 Ivy Bridge E5-2680 v2 CPUs with 10 physical cores per CPU. Most of the compute nodes in HYDRA have 64 GB RAM, while 200 of the Ivy Bridge nodes and 20 of the Sandy Bridge nodes are equipped with a larger RAM of 128 GB (fat nodes). The cross-node connection of HYDRA is a fast InfiniBand FDR14 network. FHI-aims was compiled with Intel Parallel Studio XE 16.0 and Intel MPI 5.1 along with parallel MKL 11.3 and Intel OpenMP.
The “BlueRidge” supercomputer of the Virginia Tech research community contains 408 nodes with 2 Sandy Bridge Xeon E5-2670 CPUs and 64 GB (or 128 GB) RAM per node Peng et al. [2016], and is thus equivalent to the Sandy Bridge nodes of HYDRA introduced above. BlueRidge is likewise connected via InfiniBand. The CCSD results of MPQC, Psi4, and ORCA in figure 3 and those of NWChem and MPQC in figure 4 were produced using the Sandy Bridge nodes of BlueRidge. For comparison, the FHI-aims results shown in these figures were produced using the Sandy Bridge nodes of HYDRA.
The Edison supercomputer “Cray XC30” contains 5586 nodes, each with two 12-core Ivy Bridge E5-2695 v2 CPUs and 64 GB of RAM. Cray XC30 is equipped with the Cray Aries high-speed interconnect with Dragonfly topology Solomonik et al. [2014]. The CCSD results of the CTF-based Aquarius code shown in figure 5 were obtained using the Ivy Bridge nodes of Cray XC30. For comparison, the FHI-aims results in the same figure were calculated on HYDRA using the less powerful Ivy Bridge nodes with two 10-core E5-2680 v2 CPUs per node.
References
 Coester [1958] F. Coester. Nuclear Physics, 7:421–424, 1958.
 Coester and Kümmel [1960] F. Coester and H. Kümmel. Nuclear Physics, 17:477–485, 1960.
 Čížek [1966] Jiří Čížek. J. Chem. Phys., 45:4256–4266, 1966.
 Paldus et al. [1972] J. Paldus, J. Čížek, and I. Shavitt. Phys. Rev. A, 5:50–67, 1972.
 Szabo and Ostlund [1996] Attila Szabo and Neil S. Ostlund. Modern Quantum Chemistry. McGrawHill, New York, 1996.
 Bartlett and Musiał [2007] Rodney J. Bartlett and Monika Musiał. Rev. Mod. Phys., 79:291–352, 2007.
 Hirata [2011] So Hirata. Theor. Chem. Acc., 129:727–746, 2011.
 Purvis and Bartlett [1982] George D. Purvis and Rodney J. Bartlett. J. Chem. Phys., 76:1910–1918, 1982.
 Raghavachari et al. [1989] Krishnan Raghavachari, Gary W. Trucks, John A. Pople, and Martin Head-Gordon. Chem. Phys. Lett., 157:479–483, 1989.
 Scuseria et al. [1988] Gustavo E. Scuseria, Curtis L. Janssen, and Henry F. Schaefer. J. Chem. Phys., 89:7382–7387, 1988.
 ed. [1961] R. Bellman ed. Adaptive Control Processes: A Guided Tour. Princeton University Press, New Jersey, 1961.
 Dunning [1989] T. H. Dunning. J. Chem. Phys., 90:1007–1023, 1989.
 Tew et al. [2007] David P. Tew, Wim Klopper, and Trygve Helgaker. J. Comput. Chem., 28:1307–1320, 2007.
 Sæbø and Pulay [1985] Svein Sæbø and Peter Pulay. Chem. Phys. Lett., 113:13–18, 1985.
 Scuseria and Ayala [1999] Gustavo E. Scuseria and Philippe Y. Ayala. J. Chem. Phys., 111:8330–8343, 1999.
 Schütz and Werner [2001] Martin Schütz and HansJoachim Werner. J. Chem. Phys., 114:661–681, 2001.
 Riplinger et al. [2016] Christoph Riplinger, Peter Pinski, Ute Becker, Edward F. Valeev, and Frank Neese. J. Chem. Phys., 144:024109, 2016.
 Peng et al. [2016] Chong Peng, Justus A. Calvin, Fabijan Pavosevic, Jinmei Zhang, and Edward F. Valeev. J. Phys. Chem. A, 120:10231–10244, 2016.
 Booth et al. [2013] George H. Booth, Andreas Grüneis, Georg Kresse, and Ali Alavi. Towards an exact description of electronic wavefunctions in real solids. Nature, 493:365–370, 2013.
 Michaelides et al. [2015] Angelos Michaelides, Todd J. Martinez, Ali Alavi, Georg Kresse, and Frederick R. Manby. J. Chem. Phys., 143:102601, 2015.
 Hummel et al. [2017] Felix Hummel, Theodoros Tsatsoulis, and Andreas Grüneis. J. Chem. Phys., 146(12):124105, 2017.
 Rendell et al. [1992] Alistair P. Rendell, Timothy J. Lee, and Roland Lindh. Chem. Phys. Lett., 194:84–94, 1992.
 Kobayashi and Rendell [1997] Rika Kobayashi and Alistair P. Rendell. Chem. Phys. Lett., 265:1–11, 1997.
 Anisimov et al. [2014] Victor M. Anisimov, Gregory H. Bauer, Kalyana Chadalavada, Ryan M. Olson, Joseph W. Glenski, William T. C. Kramer, Edoardo Aprà, and Karol Kowalski. J. Chem. Theory Comput., 10:4307–4316, 2014.
 Harding et al. [2008] Michael E. Harding, Thorsten Metzroth, Jürgen Gauss, and Alexander A. Auer. J. Chem. Theory Comput., 4:64–74, 2008.
 Valiev et al. [2011] M. Valiev, E. Bylaska, N. Govind, K. Kowalski, T. Straatsma, H. Van Dam, D. Wang, J. Nieplocha, E. Apra, T. Windus, and et al. Comput. Phys. Commun., 181:1477–1489, 2011.
 Shao et al. [2015] Yihan Shao et al. Mol. Phys., 113:184–215, 2015.
 Schmidt et al. [1993] Michael W. Schmidt, Kim K. Baldridge, Jerry A. Boatz, Steven T. Elbert, Mark S. Gordon, Jan H. Jensen, Shiro Koseki, Nikita Matsunaga, Kiet A. Nguyen, Shujun Su, Theresa L. Windus, Michel Dupuis, and John A. Montgomery. J. Comput. Chem., 14:1347–1363, 1993.
 Olson et al. [2007] Ryan M. Olson, Jonathan L. Bentz, Ricky A. Kendall, Michael W. Schmidt, and Mark S. Gordon. J. Chem. Theory Comput., 3:1312–1328, 2007.
 Solomonik et al. [2014] Edgar Solomonik, Devin Matthews, Jeff R. Hammond, John F. Stanton, and James Demmel. J. Parallel. Distrib. Comput., 74:3176–3190, 2014.
 Janssen et al. [2008] Curtis L. Janssen, Ida B. Nielsen, Matt L. Leininger, Edward F. Valeev, Joseph P. Kenny, and Edward T. Seidl. The massively parallel quantum chemistry program (MPQC). www.mpqc.org, 2008.
 Nieplocha et al. [1996] J. Nieplocha, R.J. Harrison, and R.J. Littlefield. J. Supercomput., 10:169–189, 1996.
 Asadchev and Gordon [2013] Andrey Asadchev and Mark S. Gordon. J. Chem. Theory Comput., 9:3385–3392, 2013.
 Van De Geijn and Watts [1997] Robert A. Van De Geijn and Jerrell Watts. Concurrency, Pract. Exp., 9:255–274, 1997.
 Hirata [2003] So Hirata. J. Phys. Chem. A, 107:9887–9897, 2003.
 [36] Han-Shi Hu, Kiran Bhaskaran-Nair, Edoardo Aprà, Niranjan Govind, and Karol Kowalski. J. Phys. Chem. A, 118(39):9087–9093, 2014.
 Epifanovsky et al. [2013] Evgeny Epifanovsky, Michael Wormit, Tomasz Kuś, Arie Landau, Dmitry Zuev, Kirill Khistyaev, Prashant Manohar, Ilya Kaliman, Andreas Dreuw, and Anna I. Krylov. New implementation of highlevel correlated methods using a general block tensor library for highperformance electronic structure calculations. J. Comput. Chem., 34(26):2293–2309, 2013.
 Deumens et al. [2011] Erik Deumens, Victor F. Lotrich, Ajith Perera, Mark J. Ponton, Beverly A. Sanders, and Rodney J. Bartlett. Software design of ACES III with the super instruction architecture. WIREs Comput Mol Sci, 1(6):895–901, 2011.
 Ford et al. [2007] Alan R. Ford, Tomasz Janowski, and Peter Pulay. J. Comput. Chem., 28:1215–1220, 2007.
 Blum et al. [2009] Volker Blum, Ralf Gehrke, Felix Hanke, Paula Havu, Ville Havu, Xinguo Ren, Karsten Reuter, and Matthias Scheffler. Ab initio molecular simulations with numeric atomcentered orbitals. Comput. Phys. Comm., 180:2175–2196, 2009.
 Turney et al. [2012] J. M. Turney, A. C. Simmonett, R. M. Parrish, E. G. Hohenstein, F. A. Evangelista, J. T. Fermann, B. J. Mintz, L. A. Burns, J. J. Wilke, M. L. Abrams, and et al. Psi4: an open-source ab initio electronic structure program. Wiley Interdiscip. Rev. Comput. Mol. Sci., 2:556–565, 2012.
 Neese [2012] F. Neese. The orca program system. Wiley interdiscip. Rev. Comput. Mol. Sci., 2:73–78, 2012.
 Zhang et al. [2013] Igor Ying Zhang, Xinguo Ren, Patrick Rinke, Volker Blum, and Matthias Scheffler. New J. Phys., 15:123033, 2013.
 Ren et al. [2012a] Xinguo Ren, Patrick Rinke, Volker Blum, Juergen Wieferink, Alexandre Tkatchenko, Andrea Sanfilippo, Karsten Reuter, and Matthias Scheffler. New J. Phys., 14:053020–60, 2012a.
 Ihrig et al. [2015a] Arvid Conrad Ihrig, Jürgen Wieferink, Igor Ying Zhang, Matti Ropo, Xinguo Ren, Patrick Rinke, Matthias Scheffler, and Volker Blum. New J. Phys., 17:093020, 2015a.
 Scuseria et al. [1986] Gustavo E. Scuseria, Timothy J. Lee, and Henry F. Schaefer. Accelerating the convergence of the coupledcluster approach. Chem. Phys. Lett., 130(3):236–239, 1986.
 Rendell et al. [1991] Alistair P. Rendell, Timothy J. Lee, and Andrew Komornicki. A parallel vectorized implementation of triple excitations in CCSD(T): application to the binding energies of the AlH3, AlH2F, AlHF2 and AlF3 dimers. Chem. Phys. Lett., 178:462–470, 1991.
 Rendell et al. [1993] Alistair P. Rendell, Timothy. J. Lee, Andrew Komornicki, and Stephen Wilson. Theor. Chim. Acta, 84:271–287, 1993.
 Sæbø and Pulay [1987] Svein Sæbø and Peter Pulay. J. Chem. Phys., 86:914–922, 1987.
 Janowski et al. [2007] Tomasz Janowski, Alan R. Ford, and Peter Pulay. J. Chem. Theory Comput., 3:1368–1377, 2007.
 Ren et al. [2012b] Xinguo Ren, Patrick Rinke, Christian Joas, and Matthias Scheffler. J. Mater Sci., 47:7447–7471, 2012b.
 Ihrig et al. [2015b] Arvid Conrad Ihrig, Jürgen Wieferink, Igor Ying Zhang, Matti Ropo, Xinguo Ren, Patrick Rinke, Matthias Scheffler, and Volker Blum. New J. Phys., 17:093020, 2015b.
 Beebe and J. [1977] N. H. F. Beebe and J. Linderberg. Int. J. Quantum Chem., 12:683–705, 1977.
 Roeggen and WisloffNilssen [1977] I. Roeggen and E. Wisloff-Nilssen. Chem. Phys. Lett., 132:683–705, 1977.
 Hohenstein et al. [2012a] E. G. Hohenstein, R. M. Parrish, and T. J. Martínez. J. Chem. Phys., 137:044103, 2012a.
 Rendell and Lee [1994] A. P. Rendell and T. J. Lee. J. Chem. Phys., 101:400–408, 1994.
 Bell et al. [2010] F. Bell, D. S. Lambrecht, and M. HeadGordon. Mol. Phys., 108:2759–2773, 2010.
 Pitonak et al. [2011] M. Pitoňák, F. Aquilante, P. Hobza, P. Neogrády, J. Noga, and M. Urban. Collect. Czech. Chem. Commun., 76:713–742, 2011.
 Hohenstein et al. [2012b] E. G. Hohenstein, R. M. Parrish, C. D. Sherrill, and T. J. Martínez. J. Chem. Phys., 137:221101, 2012b.
 Benedikt et al. [2013] Udo Benedikt, KarlHeinz Böhm, and Alexander A. Auer. J. Chem. Phys., 139:224101, 2013.
 DePrince and Sherrill [2013] A. Eugene DePrince and C. David Sherrill. J. Chem. Theory Comput., 9:2687–2696, 2013.
 Parrish et al. [2014] Robert M. Parrish, C. David Sherrill, Edward G. Hohenstein, Sara I. L. Kokkila, and Todd J. Martínez. J. Chem. Phys., 140:181102, 2014.
 Frisch et al. [2016] M. J. Frisch, G. W. Trucks, H. B. Schlegel, G. E. Scuseria, M. A. Robb, J. R. Cheeseman, G. Scalmani, V. Barone, G. A. Petersson, H. Nakatsuji, X. Li, M. Caricato, A. V. Marenich, J. Bloino, B. G. Janesko, R. Gomperts, B. Mennucci, H. P. Hratchian, J. V. Ortiz, A. F. Izmaylov, J. L. Sonnenberg, D. Williams-Young, F. Ding, F. Lipparini, F. Egidi, J. Goings, B. Peng, A. Petrone, T. Henderson, D. Ranasinghe, V. G. Zakrzewski, J. Gao, N. Rega, G. Zheng, W. Liang, M. Hada, M. Ehara, K. Toyota, R. Fukuda, J. Hasegawa, M. Ishida, T. Nakajima, Y. Honda, O. Kitao, H. Nakai, T. Vreven, K. Throssell, J. A. Montgomery, Jr., J. E. Peralta, F. Ogliaro, M. J. Bearpark, J. J. Heyd, E. N. Brothers, K. N. Kudin, V. N. Staroverov, T. A. Keith, R. Kobayashi, J. Normand, K. Raghavachari, A. P. Rendell, J. C. Burant, S. S. Iyengar, J. Tomasi, M. Cossi, J. M. Millam, M. Klene, C. Adamo, R. Cammi, J. W. Ochterski, R. L. Martin, K. Morokuma, O. Farkas, J. B. Foresman, and D. J. Fox. Gaussian 16 Revision A.03, 2016. Gaussian Inc. Wallingford CT.
 Peng and Kowalski [2017] Bo Peng and Karol Kowalski. J. Chem. Theory Comput., 13:4179–4192, 2017.
 Jurečka et al. [2006] Petr Jurečka, Jiří Šponer, Jiří Černý, and Pavel Hobza. Phys. Chem. Chem. Phys., 8:1985–1993, 2006.
 Wilke et al. [2009] Jeremiah J. Wilke, Maria C. Lind, Henry F. Schaefer III, Attila G. Császár, and Wesley D. Allen. J. Chem. Theory Comput., 5:1511–1523, 2009.
 Grimme et al. [2007] Stefan Grimme, Marc Steinmetz, and Martin Korth. J. Org. Chem., 72:2118–2126, 2007.
 Halkier et al. [1999] Asger Halkier, Trygve Helgaker, Poul Jørgensen, Wim Klopper, and Jeppe Olsen. Chem. Phys. Lett., 302:437–446, 1999.
 Takatani et al. [2010] Tait Takatani, Edward G. Hohenstein, Massimo Malagoli, Michael S. Marshall, and C. David Sherrill. J. Chem. Phys., 132:144104, 2010.
 Podeszwa et al. [2010] Rafal Podeszwa, Konrad Patkowski, and Krzysztof Szalewicz. Phys. Chem. Chem. Phys., 12:5974–5979, 2010.
 Marshall et al. [2011] Michael S. Marshall, Lori A. Burns, and C. David Sherrill. J. Chem. Phys., 135:194102, 2011.