Efficient model chemistries for peptides. I. Split-valence Gaussian basis sets and the heterolevel approximation in RHF and MP2
We present an exhaustive study of more than 250 ab initio potential energy surfaces (PESs) of the model dipeptide HCO-L-Ala-NH. The model chemistries (MCs) used are constructed as homo- and heterolevels involving possibly different RHF and MP2 calculations for the geometry and the energy. The basis sets used belong to a sample of 39 selected representants from Pople’s split-valence families, ranging from the small 3-21G to the large 6-311++G(2df,2pd). The reference PES to which the rest are compared is the MP2/6-311++G(2df,2pd) homolevel, which, as far as we are aware, is the more accurate PES of a dipeptide in the literature. The aim of the study presented is twofold: On the one hand, the evaluation of the influence of polarization and diffuse functions in the basis set, distinguishing between those placed at 1st-row atoms and those placed at hydrogens, as well as the effect of different contraction and valence splitting schemes. On the other hand, the investigation of the heterolevel assumption, which is defined here to be that which states that heterolevel MCs are more efficient than homolevel MCs. The heterolevel approximation is very commonly used in the literature, but it is seldom checked. As far as we know, the only tests for peptides or related systems, have been performed using a small number of conformers, and this is the first time that this potentially very economical approximation is tested in full PESs. In order to achieve these goals, all data sets have been compared and analyzed in a way which captures the nearness concept in the space of MCs.
The most important results of the study are the following: First, that
the convergence in method is not achieved in the RHF
MP2 step. Second that the transferability of basis set
accuracy from RHF to MP2 is imperfect. These two conclusions lead us
to discourage the use of RHF MCs for peptides. Regarding the relative
efficiency of the Pople’s basis sets, we recommend the inclusion of
polarization functions in 1st-row atoms and we discourage the use of
basis sets containing doubly-split polarization shells and no diffuse
functions. Also, we have found that 6-31G(d) is very efficient for
calculating the geometry, and that both the RHF and MP2 infinite basis
set limits are approximately reached at 6-311++G(2df,2pd). Finally,
related to the heterolevel approximation, we conclude that it is
essentially correct for the description of the conformational behaviour
of HCO-L-Ala-NH both at RHF and MP2. Nevertheless,
we place a cautionary remark on the use of RHF geometries with MP2
single-points: Whereas this practice could be accurate enough for
short peptides, the accumulation of errors may render it unreliable
for longer chains and require the use of MP2 geometries.
PACS: 07.05.Tp, 31.15.Ar, 31.50.Bc, 87.14.Ee, 87.15.Aa, 89.75.-k
In any bottom-up approach to the still unsolved protein folding problem [1, 2, 3, 4], the characterization of the conformational behaviour of short peptides [5, 6, 7, 8, 9, 10, 11, 12] constitutes an unavoidable first step. If high accuracy of the treatment is sought, numerically expensive methods must be used to calculate the physical properties of these protein subunits. This is why, the most frequently studied peptides are the shortest ones: the dipeptides [13, 14, 15, 16, 17], in which a single amino acid residue is capped at both the N- and C-termini with neutral peptide groups. Among them, the most popular choice has been the alanine dipeptide [18, 19, 20, 21, 22, 23, 24, 9, 25, 26, 27, 28, 29, 30, 31, 32, 33, 14], which, being the simplest chiral residue, shares many similarities with most of the rest of dipeptides for the minimum computational price.
Although classical force fields [34, 35, 36, 37, 38, 39, 40, 41, 42] are the only feasible choice for simulating large molecules at present, they have been reported to yield inaccurate potential energy surfaces (PESs) for dipeptides [14, 43, 44, 45, 46, 26] and short peptides [47, 9]. Therefore, it is not surprising that they are widely recognized as being unable to capture the fine details needed to correctly describe the intricacies of the whole protein folding process [48, 49, 50, 43, 51, 52, 53, 54]. On the other hand, albeit prohibitively demanding in terms of computational resources, ab initio quantum mechanical calculations [55, 56, 57] are not only regarded as the correct physical description that in the long run will be the preferred choice to directly tackle proteins (given the exponential growth of computer power), but they are also used in small peptides as the reference against which less accurate methods must be compared [14, 58, 59, 43, 44, 46, 26, 9] in order to parameterize improved generations of additive, classical force fields for polypeptides.
However, despite the sound theoretical basis, in practical quantum chemistry calculations a plethora of approximations must be typically made if one wants to obtain the final results in a reasonable human time. The exact ‘recipe’ that includes all the assumptions and steps needed to calculate the relevant observables for any molecular system has been termed model chemistry (MC) by John Pople. In his own words, a MC is an “approximate but well-defined general and continuous mathematical procedure of simulation” .
The two starting approximations to the exact Schrödinger equation that a MC must contain are, first, the truncation of the -electron space (in wavefunction-based methods) or the choice of the functional (in DFT) and, second, the truncation of the one-electron space, via the LCAO scheme (in both cases). The extent up to which the first truncation is carried (or the functional chosen in the case of DFT) is commonly called the method and it is denoted by acronyms such as RHF, MP2, B3LYP, CCSD(T), FCI, etc., whereas the second truncation is embodied in the definition of a finite set of atom-centered Gaussian functions termed basis set [61, 56, 57, 62], which is also designated by conventional short names, such as 6-31+G(d), TZP or cc-pVTZ(–f). If we denote the method by a capital and the basis set by a , the specification of both is conventionally denoted by and called a level of the theory. Typical examples of this are RHF/3-21G or MP2/cc-pVDZ [55, 56, 57].
Such levels of the theory are, by themselves, valid MCs, however, it is very common [63, 31, 64, 9] to use different levels to perform, first, a (possibly constrained) geometry optimization and, then, a single-point energy calculation on top of the resulting structures. If we denote by a given level of the theory, this ‘mixed’ calculation is indicated by , where the level at which the single-point energy calculation is performed is written first . Herein, if , we shall call an heterolevel MC; whereas, if it will be termed a homolevel one, and it will be typically abbreviated omitting the ‘double slash’ notation.
Apart from the approximations described above, which are the most commonly used and the only ones that are considered in this work, the MC concept may include a lot of additional features: protocols for extrapolating to the infinite-basis set limit [65, 66, 67, 68, 69], additivity assumptions [70, 71, 72, 73], extrapolations of the Møller-Plesset series to infinite order , removal of the so-called basis set superposition error (BSSE) [75, 76, 77, 78, 79, 80, 81], etc. The reason behind most of these techniques is the urging need to reduce the computational cost of the calculations. For example, in the case of the heterolevel approximation, this economy principle forces the level at which the single-point energy calculation is performed to be more accurate and more numerically demanding than ; the reason being simply that, while we must compute the energy only once at , we need to calculate several times the energy and its gradient with respect to the unconstrained internal nuclear coordinates at level (the actual number of times depending on the starting structure, the algorithms used and the size of the system). Therefore, it would be pointless to use an heterolevel MC in which is more expensive than , since, at the end of the geometry optimization, the energy at level is available.
Now, although general applicability is a requirement that all MCs must satisfy, general accuracy is not mandatory. Actually, the fact is that the different procedures that conform a given MC are typically parameterized and tested in very particular systems, which are often small molecules. Therefore, the validity of the approximations outside that native range of problems must be always questioned and checked, but, while the computational cost of a given MC is easy to evaluate, its expected accuracy on a particular problem could be difficult to predict a priori, specially if we are dealing with large molecules in which interactions in very different energy scales are playing a role. The description of the conformational behaviour of peptides (or, more generally, flexible organic species), via their PESs in terms of the soft internal coordinates, is one of such problems and the one that is treated in this work.
Our aim here is to provide an exhaustive study of the Restricted Hartree Fock (RHF) and Møller-Plesset 2 (MP2) methods, using the split-valence families of basis sets devised by Pople and collaborators [82, 83, 84, 85, 86, 87, 88, 89]. To this end, we compare the PES of the model dipeptide HCO-L-Ala-NH (see fig. 1) calculated with a large number of homo- and heterolevel MCs, and assess their efficiency by comparison with a reference PES. Special interest is devoted to the evaluation of the influence of polarization and diffuse functions in the basis sets, distinguishing between those placed at 1st-row atoms and those placed at hydrogens, as well as the effect of different contraction and valence-splitting schemes.
The second objective of this study, and probably the main one, is the investigation of the heterolevel assumption, which is defined here to be that which states that heterolevel MCs are more efficient than homolevel ones. The heterolevel assumption is very commonly assumed in the literature [14, 90, 59, 63, 32, 31, 29, 64, 46], but it is seldom checked. As far as we know, the only tests for peptides or related systems, have been performed using a small number of conformers [15, 17, 91, 23, 9], and this is the first time that this potentially very economical approximation is tested in full PESs.
In sec. 2.1, the methodological details regarding the quantum mechanical calculations performed in this work are provided. In sec. 2.2, a brief summary of the meaning and the properties of the distance introduced in ref. 92 and used for comparing the different MCs is given for reference. Next, in sec. 2.3, we discuss the rules and criteria that have been used in order to reasonably sample the enormous space of all Pople’s basis sets. In sec. 3, the main results of the investigation are presented. For convenience, they are organized into four different subsections: in sec. 3.1, a RHFRHF-intramethod study is performed, whereas the MP2 analogous is presented in sec. 3.2. In sec. 3.3, a small interlude is dedicated to reflect on the general ideas and the nearness concept in the space of MCs that underlie an investigation such as this one, and also to compare the RHF and MP2 results obtained in the previous two sections. In sec. 3.4, heterolevel MCs in which the geometry is calculated at RHF and the energy at MP2 are evaluated. Finally, sec. 4 is devoted to give a brief summary of the most important conclusions of the work.
2.1 Quantum mechanical calculations and internal coordinates
All ab initio quantum mechanical calculations have been performed using the Gaussian03 program  under Linux and in 3.20 GHz PIV machines with 2 GB RAM memory. The internal coordinates used for the Z-matrix of the HCO-L-Ala-NH dipeptide in the Gaussian03 input files (automatically generated with Perl scripts) are the Systematic Approximately Separable Modular Internal Coordinates (SASMIC) ones introduced in ref. 93. They are presented in table 1 (see also fig. 1 for reference). For the geometry optimizations, the SASMIC scheme has been used too (Opt=Z-matrix option) instead of the default redundant internal coordinates provided by Gaussian03, since we have seen that, when soft coordinates, such as the Ramachandran angles, are held fixed and mostly hard coordinates are let vary, the use of the SASMIC scheme slightly reduces the time to converge with respect to the redundant internals (for unconstrained optimizations, on the other hand, the redundant coordinates seem to slightly outperform the SASMIC ones).
|Atom name||Bond length||Bond angle||Dihedral angle|
All PESs in this study have been discretized into a regular 1212 grid in the bidimensional space spanned by the Ramachandran angles and , with both of them ranging from to in steps of .
To calculate the geometry at a particular level of the theory, we have run constrained energy optimizations at each point of the grid, freezing the two Ramachandran angles and at the corresponding values, and, in order to save computational resources, the starting structures were taken, when possible, from PESs previously optimized at a lower level of the theory. The convergence criterium for RHF optimizations has been set to Opt=Tight, while, in the case of MP2, an intermediate option of IOp(1/7=100) has been used (note that Opt=Tight corresponds to IOp(1/7=10), whereas the default criterium is IOp(1/7=300)). The resulting geometries have been automatically extracted by Perl scripts and used to construct the input files for the heterolevel calculations.
The self-consistent Hartree-Fock convergence criterium has been set in all cases to SCF=(Conver=10) (tighter than SCF=Tight) and the MP2 calculations have been performed in the (default) frozen-core approximation.
At the Hartree-Fock level, 142 PESs have been calculated, taking a total of years of computer time, whereas, at MP2, 35 PESs have been computed and the time invested amounts to years, from which, the highest level PES computed in this study, the MP2/6-311++G(2df,2pd) one depicted in fig 7, has taken years of computer time. Finally, 88 PESs have been calculated with MP2RHF-intermethod MCs, taking months. In total, 265 PESs of the model dipeptide HCO-L-Ala-NH have been computed for this study, taking years of computer time. All these PESs are available as supplementary material, see the Conclusions for access information.
Finally, let us note that the correcting terms to the PES coming from mass-metric tensors determinants have been recently shown to be relevant for the conformational behaviour of peptides . Although, in this study, we have not included them, the PES calculated here is the greatest part of the effective free energy , so that it may be considered as the first ingredient for a further refinement of the study in which the correcting terms are taken into account.
2.2 Physically meaningful distance
In order to compare the PESs produced by the different (homo- and heterolevel) MCs, a statistical criterium (distance) introduced in ref. 92 has been used. Let us recall here that this distance, denoted by , profits from the complex nature of the system studied to compare any two different potential energy functions, and . From a working set of conformations (in this case, the 144 points of each PES), it statistically measures the typical error that one makes in the energy differences if is used instead of the more accurate , admitting a linear rescaling and a shift in the energy reference.
This distance, which has energy units, presents better properties than other quantities customarily used to perform these comparisons, such as the energy RMSD, the average energy error, etc., and it may be related to the Pearson’s correlation coefficient by
where is the standard deviation of in the working set.
Also, due to its physical meaning, it has been argued in ref. 92 that, if the distance between two different approximations of the energy of the same system is less than , one may safely substitute one by the other without altering the relevant dynamical or thermodynamical behaviour. Consequently, we shall present the results in units of (at K, so that kcal/mol).
Finally, if one assumes that the effective energies compared will be used to construct a polypeptide potential and that it will be designed as simply the sum of mono-residue ones , then, the number of residues up to which one may go keeping the distance between the two approximations of the the -residue potential below is 
Now, according to the value taken by for a comparison between a fixed reference PES, denoted by , and a candidate approximation, denoted by , we divide all the efficiency plots in sec 3 in three regions depending on the accuracy: the protein region, corresponding to , or, equivalently, to ; the peptide region, corresponding to , or ; and, finally, the inaccurate region, where , and even for a dipeptide it is not advisable to use as an approximation to .
2.3 Basis set selection
In the whole study presented in this work, only Pople’s split-valence basis sets [82, 83, 84, 85, 86, 87, 88, 89] have been investigated. Among the many reasons behind this choice, we would like to mention the following ones:
They are very popular and they are implemented in almost every quantum chemistry package, in such a way that they are readily available for most researchers and the results here may be easily checked or extended.
There exist a lot of data calculated using these basis sets in the literature, so that the knowledge about their behaviour in different problems is constantly growing and may also be enriched by the study presented here.
Pople’s split-valence basis sets incorporate, and hence allow to investigate, most of the features and improvements that are commonly used in the literature, such as contraction, valence-splitting, diffuse functions and polarizations.
Unlike some other popular basis sets, such as Dunning’s correlation-consistent family , in which the diffuse or polarization functions are added in preset groups, Pople’s basis sets allow for the addition of individual shells rather independently, thus permitting a more in depth study.
The number of different basis sets available is very large (see, for example the EMSL database at http://www.emsl.pnl.gov/forms/basisform. html), so that, for obvious computational reasons, one cannot explore them all, and some choice must be made.
Now, even restricting oneself to this particular family of basis sets, the number of variants that can be formed by independently adding each type of diffuse or polarization function to each one of the basic 6-31G and 6-311G sets is huge (to get to the sought point, there is no need to consider the addition of functions to 3-21G, 4-31G, etc.). Using that the largest set of diffuse and polarization shells that we may add is the ‘++G(3df,3pd)’ one , we can express the different basis sets that may be constructed as a product of all the independent possibilities:
where the dot indicates here that no function is added from a particular group.
Therefore, there are different Pople’s split-valence basis sets just considering the 6-31G and 6-311G families. This number is prohibitively large to carry out a full study even at the RHF level, so that, here, the following strategy has been devised to render the investigation feasible:
To begin with, we impose several constraints on the basis sets that will be considered in a first stage:
|First-stage, rules-complying basis sets (24)|
|First violation of the rules (5)|
|Second violation of the rules (10)|
The maximum set of diffuse and polarization shells added is ‘++G(2df, 2pd)’, instead of ‘++G(3df,3pd)’. This is consistent with the thumb-rule concept of balance , according to which, the most efficient (balanced) basis sets are typically those that contain, for each angular momentum , one shell more than the ones included for ; so that 6-311++G(3df,3pd), for example, should be regarded as unbalanced.
There must be the same number and type of shells in hydrogens as in 1st-row atoms. This has to be interpreted in the proper way: for example, if we add two d-type polarization shells to 1st-row atoms, we must add two p-type ones to hydrogens. They are of the same type in the sense that they are one momentum angular unit larger than the largest one in the respective valence shell.
The higher angular momentum f-type (for 1st-row atoms) and d-type shells (for hydrogens), are not included unless the lower polarizations are doubly split, i.e., unless we have already included the (2d,2p)-shells. This is again consistent with the balance rule mentioned in point (i).
The investigation of smaller basis sets is restricted to the 3-21G and 4-31G families, and the largest set of extra shells that is added to them is ‘++G(d,p)’. For consistence, the diffuse and polarization functions used for 3-21G and 4-31G are the same as the ones for 6-31G and 6-311G [85, 89, 88, 84].
These rules, most of which are typically obeyed (often tacitly) in the literature [33, 15, 32, 8, 11, 31, 29, 28, 17, 25, 24, 23, 71], produce the list of 24 basis sets labeled as ‘First-stage, rules-complying basis sets’ in table 2 and depicted as black circles in fig. 2.
Even if their exhaustive study is already a demanding computational task and the space of all Pople’s split-valence basis sets may be thought to be reasonably sampled by this ‘first-stage’ group, we wanted to check the validity of some of the rules, since, in the same spirit of the arguments given in the introduction, what is good for a particular system or a particular purpose is not necessarily good for a different one, which may be far away from the native playground where the methods have been traditionally tested and parameterized. Therefore, to this end, we have chosen the medium-sized and reasonably RHF-efficient 6-31++G(d,p) basis set (see sec. 3.1), and we have modified it in order to break restrictions (ii) and (iii). On the one hand, as representants of breaking rule (ii), we have selected the basis sets 6-31+G(d,p), 6-31++G(d), 6-31+G(d,p) and 6-31++G(,p), where, in the first two cases, a diffuse and a polarization shell has been respectively removed from the hydrogens, while, in the last two ones, the removal has been carried out on the 1st-row atoms. This second modification is so unusual (in fact, we have not found any work where it is performed) that there is no notation for it in the literature; herein, a dot is used in the place where the unexisting 1st-row-atom shell would appear. On the other hand, as a representant of breaking rule (iii), we have selected 6-31G(f,d). This new group of 5 basis sets is labeled as ‘First violation of the rules’ in table 2 and depicted as grey-filled circles in fig. 2. We have decided to violate neither rule (i), mainly for computational reasons, nor rule (iv), due to the fact that the study of the smaller basis sets is intended to be only exploratory (and, in any case, the 3-21G and 4-31G families have proved to be rather inefficient for this problem; see sec. 3).
The conclusions extracted from the study of the ‘first violation of the rules’ group within RHF are discussed later, however, it suffices to say for the moment that we learn from them that breaking rule (iii) is not advantageous, and that one may benefit from breaking rule (ii) only if the functions are removed from the hydrogens. Therefore, in the final step of the selection of the basis sets that shall be investigated, we include a new group of 10 basis sets which come from removing hydrogen-atom diffuse and/or polarization shells from some of the most efficient ones in the other two groups. This new block is labeled as ‘Second violation of the rules’ in table 2 and depicted as white-filled circles in fig. 2.
The basis sets used in the RHF part of the study are those in table 2, whereas, in the MP2 part, we have considered the smaller subgroup that may be found in table 4 (see also fig. 2). All of them have been taken from the Gaussian03 internally stored library except for 6-31+G(d,p), 6-31++G(,p) and all the basis sets generated from the 3-21G and 4-31G ones by adding extra functions. The first two have no accepted notation and cannot be specified in the program, while the ones derived from 3-21G and 4-31G have been constructed, for consistence, using the diffuse and/or polarization shells of the 6-31G and 6-311G families. For these exceptions, the data has been taken from the EMSL repository at http://www.emsl.pnl.gov/forms/basisform.html222 Basis sets were obtained from the Extensible Computational Chemistry Environment Basis Set Database at http://www.emsl.pnl.gov/forms/basisform.html, Version 02/25/04, as developed and distributed by the Molecular Science Computing Facility, Environmental and Molecular Sciences Laboratory which is part of the Pacific Northwest Laboratory, P.O. Box 999, Richland, Washington 99352, USA, and funded by the U.S. Department of Energy. The Pacific Northwest Laboratory is a multi-program laboratory operated by Battelle Memorial Institute for the U.S. Department of Energy under contract DE-AC06-76RLO 1830. Contact Karen Schuchardt for further information. and the basis sets have been read using the Gen keyword. In all cases, spherical Gaussian-type orbitals (GTOs) have been preferred, thus having 5 d-type and 7 f-type functions per shell.
3.1 RhfRHF-intramethod model chemistries
The study in this work begins by performing an exhaustive comparison of all the homolevel MCs and most of the heterolevel ones that can be constructed using the 39 different basis sets described above and within the RHF method. The original aim was to identify, separately, the most efficient basis sets for doing geometry optimizations and those that perform best for single-point energy calculations, in order to extract the information needed to carry out, in successive stages, a (necessarily) more restrictive study of MP2MP2 and MP2RHF MCs. However, due to the considerations made in sec. 3.3, all mentions to the accuracy of any given MC in this section must be understood as relative to the RHFRHF reference, and not to the (surely better) MP2MP2 one or to the exact result. In this spirit, this part of the study should be regarded as an evaluation of the most efficient MCs for approximating the infinite basis set Hartree-Fock limit (for which the best RHFRHF homolevel here is probably a good approximation), and also as a way of introducing the relevant concepts and the systematic approach that shall be used in the rest of the computationally more useful sections.
Having this in mind, the efficiency of a particular MC is laxly defined as a balance between accuracy (in terms of the distance introduced in sec. 2.2) and computational cost (in terms of time). It is graphically extracted from the efficiency plots, where the distance between any given MC and a reference one is shown in units of in the -axis, while, in the -axis, one can find in logarithmic scale the average computational time taken for each MC, per point of the 1212 grid defined in the Ramachandran space of the model dipeptide HCO-L-Ala-NH (see the following pages for several examples). As a general thumb-rule, we shall consider a MC to be more efficient for approximating the reference when it is placed closer to the origin of coordinates in the efficiency plot. This approach is intentionally non-rigorous due to the fact that many factors exist that influence the computer time but may vary from one practical calculation to another; such as the algorithms used, the actual details of the computers (frequency of the processor, size of the RAM and cache memories, system bus and disk access velocity, operating system, mathematical libraries, etc.), the starting guesses for the SCF orbitals or the starting structures in geometry optimizations. Taking this into account, the only conclusions that shall be drawn in this work about the relative efficiency of the MCs studied are those deduced from strong signals in the plots and, therefore, those that can be extrapolated to future calculations; in other words, the small details shall be typically neglected.
The efficiency plots that we will discuss in this section are the ones used to compare RHFRHF-intramethod homo- and heterolevel MCs with the RHF reference, defined as the homolevel MC with the largest basis set, i.e., RHF/ 6-311++G(2df,2pd) (since, in this section, there is no possible ambiguity, the levels shall be denoted in what follows omitting the ‘RHF’ keyword and specifying only the basis set). The plots corresponding to this first intramethod part comprise figures from 3 to 6.
In fig. 3, the homolevel MCs corresponding to all the basis sets in table 2 are compared to the reference one. In fig. 3a, a general picture is presented, whereas, in fig. 3b, a detailed zoom of the most efficient region of the plot is shown. It takes an average of hours per grid point to calculate the PES of the model dipeptide HCO-L-Ala-NH at the reference homolevel 6-311++G(2df,2pd) (the time per point for homolevels is calculated assuming that all geometry optimizations take 20 iterations to converge, this is done in order to avoid the ambiguity due to the choice of the starting structures and it allows to place all MCs on a more equivalent footing); this time is denoted by and the most efficient region is defined as that in which and of . Additionally, we indicate in the plots the peptide region (), containing the MCs that may correctly approximate the reference one for chains of 1–100 residues, and the protein region (), including the MCs that are accurate for polypeptides over 100 residues (see sec. 2).
From these two plots, several conclusions may be drawn:
Regarding the check of rules (ii) and (iii) via the basis sets in the second group in table 2, we see that 6-31+G(d,p) is more efficient than 6-31++G(d,p) (it is cheaper and, despite being smaller, more accurate!333 Note that the Hartree-Fock method has a variational origin, in such a way that, if we were interested in the absolute value of the energy, and not in the energy differences, an enlargement of the basis set would always improve the results.), that 6-31++G(d) is as efficient as the most efficient basis sets of the rules-complying group (being outperformed only by some of the ones in the third group in table 2), that 6-31+G(d,p) has drifted a little towards the inefficiency region and that that 6-31++G(,p) is well deep in it. This suggests that it may be profitable to break rule (ii) but only in the direction of removing shells from the hydrogens, and not from the 1st-row atoms; in agreement with the common practice in the literature [93, 6, 32, 31, 16, 29, 27, 97, 22, 20] based on the intuition that ‘hydrogens are typically more passive atoms sitting at the end of bonds’ . On the other hand, 6-31G(f,d) turns out to be very inefficient, being about as accurate as the simple 6-31G basis set but far more expensive. This confirms that it is a good idea to follow restriction (iii), which is consistent with the already mentioned thumb-rule of basis set ‘balance’ .
The whole 3-21G family of basis sets is very inefficient. Only 3-21++G(d,p) lies in the accurate region and, anyway, it is less efficient than most of the other basis sets there.
Contrarily, the 4-31G family results are quite parallel to and only slightly worse than those of the 6-31G family, suggesting that, to account for conformational energy differences within the RHF method, the contraction of valence orbitals is more important than the contraction of core ones if homolevel MCs are used.
In fig. 3a, we can notice the existence of a gap in the values of the distance , which lies around and separates the MCs in two groups. Notably, all the basis sets in the most accurate group share a common characteristic: they contain 1st-row atoms polarization functions, whereas those in the inaccurate group do not, with the only exceptions of 3-21G(d,p) and 6-31G(f,d), whose bad quality has been explained in the previous points for other reasons.
All the basis sets with extra polarizations, (2d,2p) or (2df,2pd), and no diffuse functions are less efficient than their diffuse functions-containing counterparts.
Out of some of the specially inefficient cases discussed in the preceding points, the addition of diffuse functions to singly-polarized ((d) or (d,p)) basis sets always increases the accuracy.
The only basis set whose homolevel MC lies in the protein region is the expensive 6-31++G(2df,2pd).
If we look at the most efficient basis sets (those that lie at the lower-left envelope of the ‘cloud’ of points), we can see that no accumulation point is reached, i.e., that, although the distance between 6-311++G(2df,2pd) and 6-31++G(2df,2pd) is small enough to suggest that we are close to the Hartree-Fock limit for this particular problem, if the basis set is intelligently enlarged, we obtain increasingly better MCs.
For less than 10% the cost of the reference calculation, some particularly efficient basis sets for RHF-homolevel MCs that can be used without altering the relevant conformational behaviour of short peptides (i.e., whose distance with 6-311++G(2df,2pd) is less than ) are 6-31+G(d,p), 6-31+G(d), 4-31+G(d) and 6-31G(d).
Next, in fig. 4, the reference homolevel 6-311++G(2df,2pd) is compared to the RHFRHF-intramethod-heterolevel MCs obtained computing the geometries with the 38 remaining basis sets in table 2 and then performing a single-point energy calculation at the best level of the theory, 6-311++G(2df,2pd), on top of each one of the structures. The aim of this comparison is twofold: on the one hand, we want to measure the relative efficiency of the different basis sets for calculating the geometry (not the energy), on the other hand, we want to find out whether or not the heterolevel approximation described in the introduction is an efficient one within RHF.
Like in the previous case, in fig. 4a, a general picture is presented, whereas, in fig. 4b, a detailed zoom of the most efficient region of the plot is shown. The average time per point of the heterolevel MCs has been calculated adding the average cost of performing a single-point at 6-311++G(2df,2pd) ( hours) to the average time per point needed to calculate the geometry at each one of the levels (see page 3.1). This hours ‘offset’ in all the times, has rendered advisable to set the limit used to define the efficient region in this case to the 20% of (instead of the former 10%), so that most of the relevant basis sets are included in the second plot in fig. 4b.
In this second part of the present section, several interesting conclusions may be extracted from the plots:
Related to the test of rules (ii) and (iii), similar remarks to the ones before can be made, the only difference being that, in this case, for computing the geometry, 6-31+G(d,p) is not as bad as for the homolevel calculation. The signal, however, is rather weak and the main conclusions stated in the first point above should not be modified.
Regarding the 3-21G family of basis sets, we see here that, differently from what happened for the homolevels, they are not so bad to account for the geometry, and, in the case of 3-21G, 3-21++G and 3-21++G(d,p), their efficiency is close to that of the corresponding 4-31G and 6-31G counterparts.
Moreover, the 4-31G basis sets performance is still quite close to that of the 6-31G family. This point, together with the previous one, and differently from what happened in the case of homolevel MCs, suggests that, to account for the equilibrium geometry within RHF, the contraction scheme is only mildly important both for valence and core orbitals.
In fig. 4a, we see again, a gap in the values of the distance separating the MCs with the geometry calculated using basis sets that contain 1st-row atoms polarization functions from those that do not. The only differences are that, this time, the gap is even more evident, it lies around , and 3-21G(d,p) is placed below it.
The signal noticed in the homolevel case regarding the relative inefficiency of the the basis sets with extra polarizations, (2d,2p) or (2df,2pd), and no diffuse functions has become stronger here and a second gap can be seen separating them from their diffuse functions-containing counterparts and also from the basis sets with only one polarization shell. This gap separates, for example, the MCs whose geometries have been calculated with 6-31G(2df,2pd) and 6-31G(d,p), in such a way that the smaller one is not only more efficient, but also more accurate. This clearly illustrates one of the points raised in this study, namely, that MCs parameterized and tested in concrete systems may behave in an unexpected way when used in a different problem, and that the investigation of the quality of the most popular MCs, as well as the design of new ones, for the study of the conformational preferences of peptides, is a necessary (albeit enormous) task.
Also for geometry optimizations, the addition of diffuse functions to singly-polarized ((d) or (d,p)) basis sets increases the accuracy.
Contrarily to the situation for homolevels, where the only basis set that lied in the protein region was the 6-31++G(2df,2pd) one and some MCs presented distances of near with the reference one, here, all MCs lie well below , and those for which the geometry has been computed with a basis set that contains 1st-row atoms polarization functions (except for 6-31G(f,d)) are all in the protein region, so that, under the assumptions in sec. 2.2, they can correctly approximate the reference MC for chains of more than 100 residues. Remarkably, some of this heterolevel MCs, such as 6-311++G(2df,2pd)6-31+G(d) for example, are physically equivalent to the reference homolevel up to peptides of ten thousand residues at less of 10% the computational cost. Indeed, all these results confirm the heterolevel assumption, discussed in the introduction and so commonly used in the literature [14, 90, 59, 63, 32, 31, 29, 64, 46], for RHFRHF-intramethod MCs.
Differently from the homolevel case, an accumulation point is reached here in the basis sets, since, in fig. 4b, we can see that there is no noticeable increase in the accuracy, say, beyond 6-311+G(d).
Finally, let us mention 6-311+G(d), 6-31+G(d) and 6-31G(d) as some examples of particularly efficient basis sets for calculating the geometry in RHF-heterolevel MCs. Under the approximation in sec. 2.2, they can be used without altering the relevant conformational behaviour of polypeptides of more than a hundred residues (i.e., their distance with the homolevel 6-311++G(2df,2pd) is less than ), and their computational cost is less than 20% that of the reference calculation.
Now, after the geometry, we shall investigate the efficiency for performing energy calculations within RHF of the all the basis sets in table 2 but the largest one. To render the study meaningful, the geometry on top of which the single-points are computed must be the same, and we have chosen it to be the one calculated at the level 6-311++G(2df,2pd). Of course, since the reference to which the heterolevel MCs must be compared is the homolevel, and they take more computational time than this MC (the time plus the one required to perform the single-point at ), all of them are computationally inefficient a priori. Therefore, in the efficiency plots in fig. 5, the time shown in the -axis is not the one needed to calculate the actual PES with the MC, but just the one required for the single-point computation. In principle therefore, the study and the conclusions drawn should be regarded only as providing hints about how efficient a given basis set will be if it is used to calculate the energy on top of some less demanding geometry than the one (in order to have a MC that could have some possibility of being efficient). However, in the fourth part of the RHFRHF-intramethod investigation (see below), we show that the performance of the different basis sets for single-point calculations depends weakly on the underlying geometry, so that the range of validity of the present part of study must be thought to be wider.
In fig. 5a, a general picture of the comparison is presented, whereas, in fig. 5b, a detailed zoom of the most efficient region of the plot is shown. As we have already mentioned, the time shown is the average one per point required to perform the single-point energy calculation on the best geometry, and, consequently, the time used for defining the efficient region has been redefined as the one needed for a single-point at (i.e., hours).
We extract the following conclusions from the plots:
Regarding the check of rules (ii) and (iii), the situation is the same as in the two former cases, with the only difference that we can see that, for single-point calculations, 6-31G(f,d) is much more inefficient than for geometry optimizations, being of an accuracy close to that of the smallest basis set studied, the 3-21G, and taking considerably more time.
The 3-21G family of basis sets is very inefficient for energy calculations.
On the other hand, like it happened in the homolevels case, the 4-31G basis sets performance is quite close to that of the 6-31G family. This suggests that, for energy calculations in RHFRHF MCs, to use a considerable number of primitive Gaussian shells to form the contracted ones is more important in the valence orbitals than in the core ones.
The 1st-row atoms polarization gap in the distance also occurs for single-point calculations (see fig. 5a). This time, 3-21G(d,p) is placed above it.
The relative inefficiency of the the basis sets with extra polarizations, (2d,2p) or (2df,2pd), and no diffuse functions is also observed here for energy calculations. It is mild, like in the homolevels case, and no gap appears.
Like in the two studies above, the addition of diffuse functions to singly-polarized ((d) or (d,p)) basis sets increases the accuracy for single-point energy calculations as well.
About the accuracy of the investigated MCs, the situation observed in the homolevel case is even more severe here, since not even the 6-31++G(2df,2pd) single-point MC lies in the protein region and the worst basis sets (see 3-21G, for example) present distances over . This enriches and supports the ideas that underlie the heterolevel assumption, showing that, whereas the level of the theory may be lowered in the calculation of the (constrained) equilibrium geometries, it is necessary to perform high-level energy single-points if a good accuracy is sought.
Related to the basis set convergence issue, the situation here is analogous to the one seen in the case of homolevel MCs: No accumulation point is reached, and the accuracy can always be increased by intelligently enlarging the basis set.
Finally, let us mention 6-311+G(d,p), 6-31+G(d) and 6-31G(d) as some examples of particularly efficient basis sets also for calculating the energy in RHF-heterolevel MCs. Under the approximation in sec. 2.2, they can be used without altering the relevant conformational behaviour of short peptides, and their computational cost is less than 10% that of the reference single-point calculation.
Next, in order to close the RHFRHF-intramethod section, we evaluate a group of heterolevel MCs which are constructed by simultaneously decreasing the level of the theory used for the geometry and the one used for the energy single-point, relatively to the reference 6-311++G(2df,2pd).
Using the basis sets in table 2, there exist different MCs of the form , with and excluding 6-311++G(2df,2pd). This number is too large to perform an exhaustive study and, therefore, any investigation of the MCs in this particular group must be necessarily exploratory. Here, we are specially interested in the most efficient MCs, so that, using the lessons learned in the preceding paragraphs, we have considered only heterolevels with being 6-31G(d), 6-31+G(d) or 6-311+G(d), which we have proved to perform well at least when the single-point is calculated at 6-311++G(2df,2pd). For the choice of , different criteria have been followed. On the one hand, since the energy at level is readily available as an output of the geometry optimization step, it is clear that to perform a single-point calculation with a level of similar accuracy to will not pay. On the other hand, some hints may be extracted from the study in fig. 5 about which could be the most efficient basis sets for calculating the energy. Taking these two points into consideration, and also including, for checking purposes, some levels that are expected to be inefficient, the basis sets that are investigated for performing single-points within RHF are those shown in fig. 6a, where is again the time taken by the reference homolevel 6-311++G(2df,2pd).
|Efficient RHFRHF MCs|
There are no essentially new conclusions to extract from this part of the study, since it mainly confirms those drawn from the previous parts and shows that they can be combined rather independently. For example, the approximate verticality of the dotted lines joining the MCs with equal indicates, as we have already mentioned, that, in the RHFRHF case, the accuracy of a given MC depends much more strongly on the level used for calculating the energy than on the one used for the geometry. Also, the fact that the MCs with 6-31G(d) lie in the lower-left envelope of the plot shows that 6-31G(d) keeps its character of efficient basis set for computing the geometry even if the single-point is calculated with levels that are different from the reference one. Finally, note that, in this particular problem, within RHF, and under the assumptions in sec. 2.2, if one wants to correctly approximate the reference MC beyond 100-residue peptides, the energy must be calculated at 6-31++G(2df,2pd).
In fig. 6b, all the 110 MCs studied up to now are depicted as a summary (the 38 inefficient ones are not shown). Now, if we look at the lower-left envelope of the plot, we can see that, depending on the target accuracy sought, the most efficient MCs may belong to different groups among the ones investigated above. From to , for example, the most efficient MCs are the ones; from to , on the other hand, the MCs of the form , where the single-point level has also been lowered with respect to the reference one, clearly outperform those in the rest of groups; finally, for distances , it is recommendable to use homolevel MCs. In table 3, these efficient MCs are shown together with their distance to the reference homolevel 6-311++G(2df,2pd), the number of residues up to which they can be used as a good approximation of it, and the required computer time , expressed as a fraction of .
3.2 Mp2MP2-intramethod model chemistries
Now, using all the information gathered in the previous RHFRHF-intramethod section (see however sec. 3.3 and the first paragraph of sec. 3.1), we open the second part of the study, in which we shall perform an MP2MP2-intramethod investigation with some selected basis sets among those in table 2. The choice of MP2  as the method immediately ‘above’ RHF is justified by several reasons. In the first place, it is typically regarded as accurate and as the reasonable starting point to include correlation in the literature [7, 99, 64, 100, 101, 102], where it is also commonly used as a reference calculation to evaluate or parameterize less demanding methods [103, 63, 104, 105, 9]. Secondly, and contrarily to DFT, MP2 is a wavefunction-based method that allows to more or less systematically improve the calculations by going to higher orders of the Møller-Plesset perturbation expansion. The majority of the rest of methods devised to add correlation to the RHF wavefunction-based results, such as coupled cluster, configuration interaction, or MCSCF, are more computationally demanding than MP2 [106, 107, 56]. Finally, although, for some particular problems, DFT may rival MP2 [108, 109, 110, 102], the latter is known to account better for weak dispersion forces, which are present and may be important in peptides [14, 111, 64].
The basis sets investigated in this MP2 part are the 11 ones in table 4 and they have been originally chosen in order to adequately sample the larger set studied at RHF and check if the same effects are observed at MP2. Some kind of selection must be done due to the higher computational cost of MP2 calculations, so that, with the hope that the RHF results were relatively transferable to MP2, the basis sets that have proved to be relatively more efficient at RHF were included in table 4, together with the largest one, the smallest one and a small number of other basis sets (such as 6-31G(d,p) or 6-31G(2d,2p), for example) intended to analyze the tendencies observed in the previous section. In the following discussion and in sec. 3.3, however, the RHF MP2 transferability of the results is shown to be imperfect, so that, despite the valuable lessons learned in this work, in further studies, one of the research directions that will have to be followed is the addition of more basis sets to table 4.
We would also like to stress that the MP2-reference PES of HCO-L-Ala-NH, using 6-311++G(2df,2pd), that has been calculated to carry out the investigation presented here (see fig. 7) is, as far as we are aware, the one computed at the highest level of the theory at present. Although coupled cluster [14, 31] and MP4  methods have been used to perform single-points on top of the geometries optimized at lower levels for some selected conformers444 In this brief review of the literature, we include, apart from the calculations in HCO-L-Ala-NH, also those in alanine dipeptides with different protecting groups, since both efficiency and accuracy considerations are expected to be very similar., the highest homolevels used to calculate full PESs in the literature after the one used in this study seem to be MP2/6-311G(d,p) in ref. 28 and B3LYP/6-311++G(d,p) in ref. 31 (assuming that the accuracy of the B3LYP method lies somewhere between RHF and MP2). Regarding heterolevel MCs, in ref. 43, a PES is calculated at the LMP2/cc-pVQZ(-g)MP2/6-31G(d) level, and although this is certainly a remarkably accurate computation, the question whether it is better than the homolevel MP2/6-311++G(2df,2pd) will have to wait until an assessment of the LMP2 method and its relation to the heterolevel hypothesis is performed. Finally, something analogous may be said about the MP2/cc-pVTZMP2/6-31G(d) PES in ref. 32.
Now, the structure of the MP2MP2-intramethod study is the same as in the RHFRHF case: We begin by evaluating the MP2 homolevels, and, just as we did before, the ‘MP2’ keyword is omitted from the MCs specification, since, in this section, no possible ambiguity may appear.
In fig. 8, the homolevel MCs corresponding to all the basis sets in table 4 are compared to the reference one. It takes an average of hours 8 days of computer time per grid point (see page 3.1) to calculate the PES of the model dipeptide HCO-L-Ala-NH at the reference homolevel 6-311++G(2df,2pd); this time is denoted by .
Regarding the conclusions that can be extracted from this plot, let us focus (remarking the differences) on the issues parallel to the ones studied in the RHF case, although, since the number of basis sets in table 4 is smaller than that in table 2, some details will have to be left out:
The 3-21G basis set is again the worst one for homolevel calculations, with a distance close to 3 .
The 1st-row atoms polarization gap that we saw in fig. 3a, is absent here, and, for example, the 6-31++G basis set is more accurate than the larger and polarized 6-311+G(d).
The only basis set with extra polarizations and no diffuse functions that we have studied in the MP2 case, 6-31G(2d,2p), is less efficient than its diffuse functions-containing counterpart, 6-31++G(2d,2p).
Whereas, in the RHF case, the addition of diffuse functions to singly-polarized ((d) or (d,p)) basis sets always increased the accuracy, here, it is sometimes slightly advantageous (in the 6-31G(d,p) 6-31++G(d,p) case) and sometimes slightly disadvantageous (in the 6-31G(d) 6-31+G(d) case). So that no clear conclusion may be drawn to this respect.
There is no basis set whose homolevel MC lies in the protein region, although we remark that the second largest basis set studied with RHF, 6-31++G(2df,2pd), which lied in the protein region then, has not been included in this MP2 part of the work.
If we look at the most efficient basis sets (those that lie at the lower-left envelope of the ‘cloud’ of points), we can see that, like in RHF, no accumulation point is reached, i.e., that, although the distance between 6-311++G(2df,2pd) and 6-31++G(2d,2p) is small enough to suggest that we are close to the MP2 limit for this particular problem, if the basis set is intelligently enlarged, we obtain increasingly better MCs. Also note that, if we compare fig. 8 here to fig. 3 in the previous section, we do not observe a strong signal indicating the slower basis set convergence of the MP2 method that is sometimes reported in the literature [112, 113, 56]. Therefore, from these limited data, we must conclude that, for conformational energy differences in peptides, the homolevel MCs converge approximately at the same pace towards the infinite basis set limit for RHF and MP2.
For less than 10% the cost of the reference calculation, some particularly efficient basis sets for MP2-homolevel MCs that can be used without altering the relevant conformational behaviour of short peptides (i.e., whose distance with 6-311++G(2df,2pd) is less than ) are 6-31++G(d,p), 6-31G(d,p) and 6-31G(d).
Next, in fig. 9, the reference homolevel 6-311++G(2df,2pd) is compared to the MP2MP2-intramethod-heterolevel MCs obtained computing the geometries with the 10 remaining basis sets in table 4 and then performing a single-point energy calculation at the best level of the theory, 6-311++G(2df,2pd), on top of each one of the structures. Like in the RHF case, the aim of this comparison is twofold: on the one hand, we want to measure the relative efficiency of the different basis sets for calculating the geometry (not the energy), on the other hand, we want to find out whether or not the heterolevel assumption described in the introduction is a good approximation within MP2.
The average time per point of the heterolevel MCs has been calculated adding the average cost of performing a single-point at 6-311++G (2df,2pd) ( hours) to the average time per point needed to calculate the geometry at each one of the levels (see page 3.1).
The following remarks may be made about fig. 9:
Although the only representant of the 3-21G family of basis sets in this MP2MP2-intramethod study is one of the most inaccurate levels for calculating the geometry, the signal observed in the RHF case, indicating that the 3-21G basis sets are not so bad to account for the geometry, also occurs here, where we can see that 3-21G is more accurate (and hence more efficient) than the larger 6-31G and 6-31++G.
Contrarily to the homolevel case, here we can appreciate, like we did in RHF, a rather wide gap in the values of the distance separating the MCs with the geometry calculated using basis sets that contain 1st-row atoms polarization functions from those that do not.
The signal noticed in the homolevel case regarding the relative inefficiency of the the basis sets with extra polarizations and no diffuse functions has been inverted here, since the 6-31++G(2d,2p) is less accurate than the smaller 6-31G(2d,2p).
Again, and contrarily to the RHF case, the addition of diffuse functions to singly-polarized ((d) or (d,p)) basis sets it is sometimes slightly advantageous (in the 6-31G(d,p) 6-31++G(d,p) case) and sometimes slightly disadvantageous (in the 6-31G(d) 6-31+G(d) case). So that no clear conclusion may be drawn to this respect.
Like in the RHF case, and contrarily to the situation for MP2 homolevels, where no basis sets lied in the protein region and some MCs presented distances of near with the reference one, here, most MCs lie well below , and those for which the geometry has been computed with a basis set that contains 1st-row atoms polarization functions are all in the protein region, so that, under the assumptions in sec. 2.2, they can correctly approximate the reference MC for chains of more than 100 residues. Remarkably, some of these heterolevel MCs, such as 6-311++G(2df,2pd)6-31G(d) for example, are physically equivalent to the reference homolevel up to peptides of 400 residues at less of 10% the computational cost. Indeed, all these results confirm the heterolevel assumption, discussed in the introduction and so commonly used in the literature[14, 90, 59, 63, 32, 31, 29, 64, 46], for MP2MP2-intramethod MCs.
Differently from the homolevel case, an accumulation point is reached here in the basis sets, since we can see that there is no noticeable increase in accuracy beyond 6-31G(d). Regarding the convergence towards the infinite basis set limit, we observe again that, whereas it is slightly slower here than in fig. 4, the signal is too weak to conclude anything and we repeat what we said in the homolevel case: that, for conformational energy differences in peptides, the ability of accounting for the geometry in heterolevel MCs of the form converges approximately at the same pace towards the infinite basis set limit for RHF and MP2.
Finally, let us mention 6-31G(d) as the one clear example of a particularly efficient basis set for calculating the geometry in MP2-heterolevel MCs. Under the assumptions in sec. 2.2, it can be used without altering the relevant conformational behaviour of polypeptides of around 400 residues (i.e., its distance with the homolevel 6-311++G(2df,2pd) is ), and its computational cost is that of the reference calculation. The rest of the basis sets in fig. 9 are either less accurate and not significantly cheaper, or more expensive and not more accurate than 6-31G(d).
Next, after the geometry, we investigate the efficiency for performing energy calculations of all the basis sets in table 4 but the largest one. Like in the RHF case, the geometry on top of which the single-points are computed must be the same, and we have chosen it to be the one calculated at the level 6-311++G(2df,2pd). Again, since the reference to which the heterolevel MCs must be compared is the homolevel, and they take more computational time than this MC (the time plus the one required to perform the single-point at ), all of them are computationally inefficient a priori. Therefore, in the efficiency plot in fig. 10, the time shown in the -axis is not the one needed to calculate the actual PES with the MC, but just the one required for the single-point computation. In principle therefore, the study and the conclusions drawn should be regarded only as providing hints about how efficient a given basis set will be if it is used to calculate the energy on top of some less demanding geometry than the one (in order to have a MC that could have some possibility of being efficient). However, in the fourth part of the MP2MP2-intramethod investigation (see below), we show, like we did in the RHF case, that the performance of the different basis sets for single-point calculations depends weakly on the underlying geometry, so that the range of validity of the present part of study must be thought to be wider. Again, the time used for defining the efficient region in fig. 10 has been redefined as the one needed for a single-point at .
The conclusions of this part of the study are:
Like in RHF, 3-21G is very inefficient for energy calculations.
Although all basis sets containing 1st-row atoms polarization functions are more accurate than the ones that do not, differently from the geometry case, we do not observe a clear gap in the distance separating the two groups for MP2 single-point energy calculations.
Similarly to the MP2-homolevel case and to RHF, the respective positions in the plot of 6-31G(2d,2p) and 6-31++G(2d,2p) constitute a signal that indicates that the basis sets with extra polarizations and no diffuse functions are less efficient than their diffuse functions-containing counterparts for energy calculations.
As in the rest of the MP2 study, nothing conclusive can be said about the addition of diffuse functions to the singly polarized 6-31G(d) and 6-31G(d,p) basis sets.
Regarding the accuracy of the investigated MCs, the situation here is analogous to the one found for RHF, and, again this supports the ideas that underlie the heterolevel assumption, showing that, also at MP2, whereas the level of the theory may be lowered in the calculation of the (constrained) equilibrium geometries, it is necessary to perform high-level energy single-points if a good accuracy is sought.
Related to the basis set convergence issue, the situation here is analogous to the one seen in the case of homolevel MCs: No accumulation point is reached, and the accuracy can always be increased by intelligently enlarging the basis set. The convergence velocity towards the MP2 limit is again very similar to the one in RHF.
Finally, let us mention 6-31G(d,p) and 6-31G(d) as some examples of particularly efficient basis sets for calculating the energy in MP2-heterolevel MCs. They can be used without altering the relevant conformational behaviour of short peptides, and their computational cost is less than 10% that of the reference single-point calculation.
|Efficient MP2MP2 MCs|
To close the MP2MP2-intramethod section, we have calculated four PESs with MCs of the form 6-31G(d), since the geometry computed with 6-31G(d) has proved to be very accurate when a single-point at the highest level was performed on top of it. Due to the same computational arguments presented in the previous section, only those basis sets significantly larger than 6-31G(d) have been explored for calculating the energy. The results are presented in fig. 11 together with a summary of the rest of the MP2MP2 MCs studied in this section (except for the inefficient ones).
We have already advanced a conclusion that may be extracted from this last plot, namely, that if we compare the distance of the 6-31G(d) MCs in fig. 11 to the distance of the ones in fig. 10 for the same , we see that they are very close. Therefore, like in the RHF case, we conclude that the accuracy of a given MC depends much more strongly on the level used for calculating the energy than on the one used for the geometry.
Finally, in table 5, we present the most efficient MCs that lie at the lower-left envelope of the plot in fig. 11. Like in RHF, we can see that, depending on the target accuracy sought, these most efficient MCs may belong to different groups among the ones investigated above. From to , for example, the most efficient MCs are the ones; from to , on the other hand, the MCs of the form 6-31G(d) outperform those in the rest of groups; finally, for distances , it is recommendable to use homolevel MCs.
The general abstract framework behind the investigation presented in this study (and also behind most of the works found in the literature), may be described as follows:
The objects of study are the model chemistries defined by Pople  and discussed in the introduction. The space containing all possible MCs is a rather complex and multidimensional one and it is denoted by in fig. 12. The MCs under scrutiny are applied to a particular problem of interest, which may be thought to be formed by three ingredients: the physical system, the relevant observables and the target accuracy. The MCs are then selected according to their ability to yield numerical values of the relevant observables for the physical system studied within the target accuracy. The concrete numerical values that one wants to approach are those given by the exact model chemistry MC, which could be thought to be either the experimental data or the exact solution of the electronic Schrödinger equation. However, the computational effort needed to perform the calculations required by MC is literally infinite, so that, in practice, one is forced to work with a reference model chemistry MC, which, albeit different from MC, is thought to be close to it. Finally, the set of MCs that one wants to investigate are compared to MC and the nearness to it is seen as approximating the nearness to MC.
These comparisons are commonly performed using a numerical quantity that is a function of the relevant observables. In order for the intuitive ideas about relative proximity in the space to be captured and the above reasoning to be meaningful, this numerical quantity must have some of the properties of a mathematical distance. In particular, it is advisable that the triangle inequality is obeyed, so that, for any model chemistry MC, one has that
and, assuming that is small (and is a positive function), we obtain
which is the sought result in agreement with the ideas stated at the beginning of this section.
The distance introduced in ref. 92 and summarized in sec. 2.2, measured in this case on the conformational energy surfaces (the relevant observable) of the model dipeptide HCO-L-Ala-NH (the physical system), approximately fulfills the triangle inequality and thus captures the nearness concept in the space of model chemistries.
Now, as we have advanced and after having completed the intramethod parts of the study with both the RHF and MP2 methods, we shall use the ideas discussed above to tackle the natural question about the transferability of the RHF results to the more demanding and more accurate MP2-based MCs.
As a first step to answer this question, we point out that the distance between the reference RHF/6-311++G(2df,2pd) MC and the MP2 one depicted in fig. 7 is . This prevents us from using the former as an approximation of the latter even for dipeptides if we want that the conformational behaviour at room temperature be unaltered. It also indicates that, whereas basis set convergence has been reasonably achieved within the family of Pople’s Gaussian basis sets, both for homo- and heterolevel MCs and inside the two methods, the convergence in method has not been achieved in the RHF MP2 step, even with the largest basis set investigated 6-311++G(2df,2pd).
Complementarily to this, in fig. 13, we show the distance of all RHFRHF MCs studied in sec. 3.1 (except for the inefficient ones), with both the RHF reference (in the -axis) and the MP2 one (in the -axis). Some relevant remarks may be made about the situation encountered:
The distance of all RHF-intramethod MCs to the MP2 reference is larger than , therefore, none of the former may be used to approximate the latter, not even in dipeptides.
Although a general trend could be perceived and, for example, the RHF homolevels can be clearly divided in both axes by the 1st-row atoms polarization gap found in the previous sections, the correlation between the distance to the MP2 reference and the distance to the RHF one is as low as , being Pearson’s correlation coefficient. Therefore, almost all details are lost and the accuracy with respect to RHF/6-311++G(2df,2pd) cannot be translated into accuracy with respect to the MP2 reference.
Related to the previous point, some strange behaviours are present. For example, not only are there RHFRHF MCs that are closer to the MP2 reference than RHF/6-311++G(2df,2pd), but the one that is closest is the small RHF/4-31G(d,p) homolevel. This is probably caused by fortituous cancellations that shall not allow systematization and that may unpredictably vary from one problem to another. Similar compensations have already been observed in the literature [31, 17, 25].
If we denote by the MP2 reference MC and, by , the RHF one, we may use eqs. (4),
to notice that, since , for any model chemistry MC that is close to the RHF-intramethod reference, i.e., that present a small , we have that
This set of RHF-intramethod MCs that are close to the RHF reference, and that present the approximately constant value of above, can be associated to the lower group encircled in fig. 13.
All the points above illustrate what we have already advanced at the beginning of sec. 3.2: that the accuracy (or the efficiency, if computational time is included in the discussion) of any MC with respect to a good RHF reference, such as the RHF/6-311++G(2df,2pd) one, cannot be transferred to higher levels of the theory and, therefore, any such comparison must be seen as providing information only about the infinite basis set Hartree-Fock limit.
To close this section, let us approach the question of the RHF MP2 transferability of the results from a different angle.
We have proved in the preceding paragraphs that the study of RHF-intramethod MCs comparing them to a good RHF reference cannot be used for predicting the accuracy of those MCs with respect to a probably better MP2 reference. Now, in sec. 3.2, MP2-intramethod MCs have been compared to the MP2/6-311++G(2df,2pd) homolevel, which, in turn, has been shown to be close to the infinite basis set MP2 limit. However, this level of the theory is very demanding computationally: the whole 1212 grid of points in the PES of HCO-L-Ala-NH has taken years of computer time in 3.20 GHz PIV machines, while the one calculated at RHF/6-311++G(2df,2pd) has taken ‘only’ months (see sec. 2.1).
Therefore, we have decided to check whether or not the accuracy of a given RHF-intramethod MC with respect to the RHF reference is indicative of the accuracy of the MP2-intramethod MC that uses the same basis sets with respect to its own MP2 reference. The answer to this question is in fig. 14. There, each point corresponds to a given combination of basis sets and, in the -axis, the distance between the associated MP2 MC and the MP2/6-311++G(2df,2pd) reference is shown. In the -axis, on the other hand, we present the distance of the analogous RHF MC to the RHF/6-311++G(2df,2pd) homolevel.
Although, since we have had to restrict ourselves to that combinations that were present both in sec. 3.1 and in sec. 3.2, the set of MCs is smaller in this case, the conclusion extracted is that the correlation is more significant than before: if we use all the MCs, and if we remove the 3-21G homolevel, which is very inaccurate in both cases, from the set. This indicates that, although some details might be lost, the relative efficiency of Gaussian basis sets in RHF-intramethod studies provides hints about their performance at MP2, and it partially justifies the structure of the investigation presented here.
Finally, the overall situation described in this section and the relations among all the intramethod MCs studied are schematically depicted in fig. 12.
3.4 Mp2RHF-intermethod model chemistries
In the final part of the study presented here, we investigate the efficiency of heterolevel MCs in which the geometry is calculated at RHF and, then, a single-point energy calculation is performed on top of it at MP2. They shall be termed MP2RHF-intermethod MCs.
To this end, the RHF geometries that are used are those computed with the 8 basis sets in table 6. Like in sec. 3.2, they have been selected from those in table 2 looking for the most efficient ones, but also trying to reasonably sample the whole group of basis sets, in order to check whether or not the behaviours and signals observed in the remaining parts of the study are repeated here. The MP2 single-points, on the other hand, are computed with the whole set of possibilities in table 2.
In fig. 15, we present an efficiency plot, using the MP2/6-311++G(2df,2pd) homolevel as reference MC, and containing all the MP2MP2 MCs studied in sec. 3.2 together with the new 88 possible MP2RHF-intermethod combinations of the form .
Some conclusions can be drawn from this plot:
Due to the larger computational demands of the MP2 method, even the MCs whose geometry has been computed at the highest RHF level, the one with the 6-311++G(2df,2pd) basis set, are much cheaper than the MP2 reference. Their times are slightly larger than the 10% of , whereas all the rest of MP2RHF MCs take less than that bound.
For all the RHF geometries, the MCs whose MP2 single-point has been calculated with 3-21G, 6-31G, 6-31++G and most of the 6-311+G(d) ones lie above , so that they should not be used even on dipeptides. This is related to the 1st-row atoms polarization gap observed in previous sections, although the signal is not so strong here.
The rest of MP2RHF MCs not included in the two previous points lie at the efficient region, defined as that for which and 10% of . This confirms the heterolevel assumption also in the intermethod context.
However, no MP2RHF-intermethod MCs, not even the ones with the single-point calculated at the highest MP2/6-311++G(2df,2pd) level, lie in the protein region. Therefore, if we want to approximate the reference MP2 results for peptides longer than 100 residues, under the assumptions in sec. 2.2, the geometry calculation must be performed at MP2. In fact, this is the correct way of addressing the discussion between Csásár  and Schäfer [23, 114], in which the former defends the position that the geometry can be calculated at RHF (provided that a subsequent MP2 single-point is performed on top of it), while the latter disagrees and recommends to compute the geometry at MP2 too. The data they argue about is, of course, the same; the discrepancy simply arises from the fact that Csásár is thinking only in the small systems in which the calculations have been performed, while Schäfer wants to use the information obtained to gain understanding about the folding of long peptides. In this work, the distance introduced in ref. 92 and summarized in sec. 2.2 codifies whether or not two different MCs yield equivalent PESs for the HCO-L-Ala-NH dipeptide (i.e., if , they are physically indistinguishable at temperature ). On the other hand, if we assume a particular form in which the polypeptide potential is constructed from the single-residue PESs and use the additivity properties of , we can show that this error grows with the square root of the number of residues and use the quantity , defined in sec. 2.2 to estimate how the difference between the two MCs affects the behaviour of polypeptides. In a work in progress in our group , we are investigating how the distance scales with for different and more realistic hypotheses. It is in this sense, that the statements relying on should be regarded as an estimation.
Finally, let us point out that there is no accuracy region where the MP2-homolevel MCs are more efficient than the rest.
Now, in fig. 16, the efficient region of the previous plot is enlarged and, due to the large number of MP2RHF MCs studied, two subplots are produced for visual comfort: the one in fig. 16a, in which the MCs sharing the same RHF level for the geometry have been joined by dotted lines, and the one in fig. 16b, in which the MCs sharing the same MP2 level for the single-point calculations have been joined by broken lines.
Let us remark some interesting facts that can be seen in these two more detailed plots:
The leftmost group of five MP2RHF MCs that show the highest accuracy are those in which the geometry has been obtained with basis sets containing 1st-row atoms polarization functions and the single-point energy calculation has been performed at MP2/6-311++G(2df,2pd). In particular, the MP2/6-311++G(2df,2pd)RHF/6-31G(d) PES can correctly approximate the reference one up to peptides of residues, under the assumptions mentioned above, at around 1% its computational cost. This supports the heterolevel assumption for MP2RHF-intermethod MCs.
Efficient MP2MP2 and MP2RHF MCs MP2/6-311++G(2df,2pd)MP2/6-31G(d) 0.046 468.3 1.90% MP2/6-311++G(2df,2pd)RHF/6-31G(d) 0.194 26.7 1.48% MP2/6-31++G(2d,2p)RHF/6-31G(d) 0.367 7.4 0.37% MP2/6-31++G(2d,2p)RHF/3-21G 0.417 5.7 0.27% MP2/6-31+G(d)RHF/3-21G 0.645 2.4 0.06% MP2/6-31G(d)RHF/3-21G 0.831 1.4 0.05% MP2/6-31++GRHF/3-21G 1.033 0.9 0.05% MP2/6-31GRHF/3-21G 1.263 0.6 0.05% MP2/3-21GRHF/3-21G 3.043 0.1 0.05% Table 7: List of the most efficient MP2MP2 and MP2RHF MCs located at the lower-left envelope of the cloud of points in fig. 15. The first block contains the only MP2MP2 MC in the list, the second one the MP2RHF MCs with a distance below , and the third one those that are inaccurate even for dipeptides. Distance with the reference MC (the homolevel MP2/6-311++G(2df,2pd)), in units of at K. Maximum number of residues in a polypeptide potential up to which the corresponding MC may correctly approximate the reference, under the assumptions in sec. 2.2. Required computer time, expressed as a fraction of .
The RHF geometries calculated with the unpolarized basis sets 3-21G and 6-31G are, in general, less accurate than the rest, however, due to their low computational cost, they turn out to be the most efficient ones from on. Remarkably, 3-21G is more efficient than 6-31G.
In fig. 16b, we can observe that, for the medium-sized basis sets 6-31++G (d,p), 6-31+G(d), 6-31G(d,p) and 6-31G(d), the single-point accuracy is rather insensitive to their differences and they may be used interchangeably. There is, however, a weak signal, in the region of unpolarized RHF geometries, indicating that the addition of diffuse functions may increase the quality of the energy calculations at MP2.
The relative accuracy of the MCs whose MP2 single-point has been computed at 6-31++G(2d,2p) and at 6-31G(2d,2p) suggests that, like in previous parts of the study, it is a good idea to add diffuse functions to basis sets that contain doubly-split polarizations shells, also for the MP2 energy calculations in MP2RHF-intermethod MCs.
In this study, we have investigated more than 250 PESs of the model dipeptide HCO-L-Ala-NH calculated with homo- and heterolevel RHFRHF, MP2MP2 and MP2RHF MCs. All of the PESs are available as supplementary material. As far as we are aware, the highest-level PESs in the literature, the MP2/6-311++G(2df,2pd) homolevel in fig. 7, has been used as a reference and all the rest of calculations have been compared to it (except for sec. 3.1, where the RHFRHF MCs have been compared to RHF/6-311++G(2df,2pd)). The data and the results extracted are so extense that we have considered convenient to give here a brief summary of the most important ones.
The first conclusion that we want to point out is that, for the largest basis set evaluated here, the 6-311++G(2df,2pd) one, for which the RHF and MP2 limits appear to have been reached, the convergence in method has not been achieved. I.e., the distance between the MP2 and RHF references is , so that the latter cannot be used to approximate the former even for dipeptides. Therefore, we discourage the use of RHFRHF MCs for peptides, and, unless otherwise stated, most of the conclusions below should be understood as referring either to MP2MP2-intramethod or to MP2RHF-intermethod MCs, which have proved to be acceptably accurate with respect to the best MP2 calculation.
The second observation related to the comparison between RHF and MP2 is that the RHF MP2 transferability of the relative accuracies between MCs is imperfect, and the conclusions regarding the relative efficiency of the different basis sets arrived using the former cannot be directly extrapolated to the latter. This point has two distinct aspects: On the one hand, we have shown that to compare RHFRHF MCs to a good RHF reference gives little information about their accuracy with respect to a good MP2 MC. On the other hand, the comparison with the RHF reference of PESs calculated with MCs of the form RHF/RHF/ may provide useful information about the relative accuracy of the analogous MP2/MP2/ MCs with respect to their own MP2 reference.
Now, keeping these considerations in mind, let us summarize the most important conclusions pertaining the relative efficiency of the Pople split-valence basis sets investigated:
In the whole study, the polarization shells in 1st-row atoms have been shown to be essential to accurately account for both the conformational dependence of the geometry and of the energy of the system. Except for some particular MCs with 3-21G geometries, which may be used if we plan to describe short oligopeptides, our recommendation is that polarization functions in 1st-row atoms be included.
In most cases, we have also observed a strong signal indicating that no basis sets should be used containing doubly-split polarization shells and no diffuse functions.
Regarding the basis set convergence issue, we can conclude that, for the largest basis sets in the Pople split-valence family, both the RHF and MP2 infinite basis set limits are approximately reached.
Finally, some weaker signals have been observed suggesting that to add higher angular momentum polarization shells (f,d) before adding the lower ones may be inefficient, that it is not recommendable to put polarization or diffuse functions on hydrogens only, and that it may be efficient in some cases to add diffuse functions to singly-polarized basis sets.
Referring to the heterolevel assumption, which, as far as we are aware, has been tested in this work for the first time in full PESs:
As a general and very clear conclusion, since only some small-basis set homolevels lie in the lower-left envelope of the efficiency plots presented in the previous sections, and, in all cases, it happens for distances greater than , we can say that the heterolevel assumption is correct for the description of the conformational behaviour of the system studied here with MP2MP2 and MP2RHF MCs (also for RHFRHF-heterolevels but, as we remarked above, this has little computational interest).
Due to the much stronger dependence of the accuracy of MCs on the level used for the single-point than on the one used for the geometry optimization, together with the lower computational cost of the former, the general recommendation is that the greatest computational effort be dedicated to the energy calculation.
Despite this general thumb rule, and under the assumptions in sec. 2.2, if one wants to approximate the MP2 reference calculation for peptides of more than 100 residues, the geometry must be calculated using MP2. Nevertheless, with small and cheap basis sets, such as 6-31G(d), the MP2MP2 results can be good enough at a low computational cost.
Finally, let us remark that the investigation performed here has been done in one of the simplest dipeptides. The fact that we have treated it as an isolated system, the small size of its side chain and also its aliphatic character, all play a role in the results obtained. Hence, for bulkier residues included in polypeptides, and, specially for those that are charged or may participate in hydrogen-bonds, the conclusions drawn about the relative importance of the different type of functions in the basis set, as well as those regarding the comparison between RHF and MP2, should be approached with caution and much interesting work remains to be done.
All the PESs investigated are publicly available as supplementary material at http://www.pabloechenique.com/files/public/supp_materials/. Each one of them is a three-column text file containing, in this order, the values of the Ramachandrand angles and in the 1212 grid defined in sec. 2.1 and the energy in hartrees. They are organized in subfolders indicating whether they correspond to RHF- or MP2-homolevels, or to RHFRHF-, MP2MP2- or MP2RHF-heterolevels. The filenames are explicative (note that a letter ’o’ has been used to indicate that a particular Gaussian shell is missing in either the 1st-row atoms or the hydrogens).
We would like to thank F. Jensen, T. van Mourik and A. Perczel for illuminating discussions. The numerical calculations have been performed at the BIFI supercomputing facilities. We thank all the staff there, for the invaluable CPU time and the efficiency at solving the problems encountered.
This work has been supported by the research projects E24/3 and PM048 (Aragón Government), MEC (Spain) FIS2006-12781-C02-01 and MCyT (Spain) FIS2004-05073-C04-01. P. Echenique and is supported by a BIFI research contract.
-  C. B. Anfinsen, Principles that govern the folding of protein chains, Science 181 (1973) 223–230.
-  J. Skolnick, Putting the pathway back into protein folding, Proc. Natl. Acad. Sci. USA 102 (2005) 2265–2266.
-  V. Daggett and A. R. Fersht, Is there a unifying mechanism for protein folding?, Trends Biochem. Sci. 28 (2003) 18–25.
-  B. Honig, Protein folding: From the Levinthal paradox to structure prediction, J. Mol. Biol. 293 (1999) 283–293.
-  H. Zhong and H. A. Carlson, Conformational studies of polyprolines, J. Chem. Theory Comput. 2 (2006) 342–353.
-  D. Toroz and T. van Mourik, The structure of the gas-phase tyrosine-glycine dipeptide, Mol. Phys. 104 (2006) 559–570.
-  R. A. DiStasio Jr., Y. Jung, and M. Head-Gordon, A Resolution-of-The-Identity implementation of the local Triatomics-In-Molecules model for second-order Møller-Plesset perturbation theory with application to alanine tetrapeptide conformational energies, J. Chem. Theory Comput. 1 (2005) 862–876.
-  A. Perczel, P. Hudáky, A. K. Füzéry, and I. G. Csizmadia, Stability issues of covalently and noncovalently bonded peptide subunits, J. Comp. Chem. 25 (2004) 1084–1100.
-  M. Beachy, D. Chasman, R. Murphy, T. Halgren, and R. Friesner, Accurate ab initio quantum chemical determination of the relative energetics of peptide conformations and assessment of empirical force fields, J. Am. Chem. Soc. 119 (1997) 5908–5920.
-  R. Hegger, A. Altis, P. Nguyen, and G. Stock, How complex is the dynamics of peptide folding?, Phys. Rev. Lett. 98 (2007) 028102.
-  A. Perczel, I. Jákli, and I. G. Csizmadia, Intrinsically stable secondary structure elements of proteins: A comprehensive study of folding units of proteins by computation and by analysis of data determined by X-ray crystallography, Chem. Eur. J. 9 (2003) 5332–5342.
-  M. Elstner, K. J. Jalkanen, M. Knapp-Mohammady, T. Frauenheim, and S. Suhai, DFT studies on helix formation in -acetyl-(L-alanyl)--methylamide for =1–20, Chem. Phys. 256 (2001) 15–27.
-  A. G. Császár and A. Perczel, Ab initio characterization of building units in peptides and proteins, Prog. Biophys. Mol. Biol. 71 (1999) 243–309.
-  J. Kaminský and F. Jensen, Force field modelling of amino acid conformational energies, In press, 2007.
-  A. Láng, I. G. Csizmadia, and A. Perczel, Peptide models. XLV: Conformational properties of N-formyl-L-methioninamide ant its relevance to methionine in proteins, PROTEINS: Struct. Funct. Bioinf. 58 (2005) 571–588.
-  J. C. P. Koo, G. A. Chass, A. Perczel, Ö. Farkas, L. L. Torday, A. Varro, J. G. Papp, and I. G. Csizmadia, Exploration of the four-dimensional-conformational potential energy hypersurface of N-acetyl-L-aspartic acid-N’-methylamide with its internally hydrogen bonded side-chain orientation, J. Phys. Chem. A 106 (2002) 6999–7009.
-  P. Hudáky, I. Jákli, A. G. Császár, and A. Perczel, Peptide models. XXXI. Conformational properties of hydrophobic residues shaping the core of proteins. An ab initio study of N-formyl-L-valinamide and N-formyl-L-phenylalaninamide, J. Comp. Chem. 22 (2001) 732–751.
-  P. J. Rossky and M. Karplus, Solvation. A molecular dynamics study of a dipeptide in water, J. Am. Chem. Soc. 101 (1979) 1913.
-  M. Mezei, P. K. Mehrotra, and D. L. Beveridge, Monte Carlo determination of the free energy and internal energy of hydration for the Ala dipeptide at 25C, J. Am. Chem. Soc. 107 (1985) 2239–2245.
-  T. Head-Gordon, M. Head-Gordon, M. J. Frisch, C. Brooks III, and J. Pople, A theoretical study of alanine dipeptide and analogs, Intl. J. Quant. Chem. 16 (1989) 311-322.
-  A. Perczel, J. G. Angyán, M. Kajtar, W. Viviani, J.-L. Rivail, J.-F. Marcoccia, and I. G. Csizmadia, Peptide models. 1. Topology of selected peptide conformational potential energy surfaces (glycine and alanine derivatives), J. Am. Chem. Soc. 113 (1991) 6256-6265.
-  T. Head-Gordon, M. Head-Gordon, M. J. Frisch, C. L. Brooks III, and J. A. Pople, Theoretical study of blocked glycine and alanine peptide analogues, J. Am. Chem. Soc. 113 (1991) 5989–5997.
-  R. F. Frey, J. Coffin, S. Q. Newton, M. Ramek, V. K. W. Cheng, F. A. Momany, and L. Schäfer, Importance of correlation-gradient geometry optimization for molecular conformational analyses, J. Am. Chem. Soc. 114 (1992) 5369–5377.
-  I. R. Gould, W. D. Cornell, and I. H. Hillier, A quantum mechanical investigation of the conformational energetics of the alanine and glycine dipeptides in the gas phase and in aqueous solution, J. Am. Chem. Soc. 116 (1994) 9250–9256.
-  G. Endrédi, A. Perczel, O. Farkas, M. A. McAllister, G. I. Csonka, J. Ladik, and I. G. Csizmadia, Peptide models XV. The effect of basis set size increase an electron correlation on selected minima of the ab initio 2D-Ramachandran map of For-Gly-NH and For-L-Ala-NH, J. Mol. Struct. (Theochem) 391 (1997) 15-26.
-  A. M. Rodríguez, H. A. Baldoni, F. Suvire, R. Nieto Vázquez, G. Zamarbide, R. D. Enriz, Ö. Farkas, A. Perczel, M. A. McAllister, L. L. Torday, J. G. Papp, and I. G. Csizmadia, Characteristics of Ramachandran maps of L-alanine diamides as computed by various molecular mechanics, semiempirical and ab initio MO methods. A search for primary standard of peptide conformational stability, J. Mol. Struct. (Theochem) 455 (1998) 275–301.
-  M. Elstner, K. J. Jalkanen, M. Knapp-Mohammady, and S. Suhai, Energetics and structure of glycine and alanine based model peptides: Approximate SCC-DFTB, AM1 and PM3 methods in comparison with DFT, HF and MP2 calculations, Chem. Phys. 263 (2001) 203–219.
-  C.-H. Yu, M. A. Norman, L. Schäfer, M. Ramek, A. Peeters, and C. van Alsenoy, Ab initio conformational analysis of N-formyl L-alanine amide including electron correlation, J. Mol. Struct. 567–568 (2001) 361–374.
-  M. Iwaoka, M. Okada, and S. Tomoda, Solvent effects on the potential surfaces of glycine and alanine dipeptides studied by PCM and I-PCM methods, J. Mol. Struct. (Theochem) 586 (2002) 111–124.
-  R. Vargas, J. Garza, B. P. Hay, and D. A. Dixon, Conformational study of the alanine dipeptide at the MP2 and DFT levels, J. Phys. Chem. A 106 (2002) 3213–3218.
-  A. Perczel, Ö. Farkas, I. Jákli, I. A. Topol, and I. G. Csizmadia, Peptide models. XXXIII. Extrapolation of low-level Hartree-Fock data of peptide conformation to large basis set SCF, MP2, DFT and CCSD(T) results. The Ramachandran surface of alanine dipeptide computed at various levels of theory, J. Comp. Chem. 24 (2003) 1026–1042.
-  Z.-X. Wang and Y. Duan, Solvation effects on alanine dipeptide: A MP2/cc-pVTZ//MP2/6-31G** study of () energy maps and conformers in the gas phase, ether and water, J. Comp. Chem. 25 (2004) 1699–1716.
-  P. Echenique, I. Calvo, and J. L. Alonso, Quantum mechanical calculation of the effects of stiff and rigid constraints in the conformational equilibrium of the Alanine dipeptide, J. Comp. Chem. 27 (2006) 1748–1755.
-  J. W. Ponder and D. A. Case, Force fields for protein simulations, Adv. Prot. Chem. 66 (2003) 27–85.
-  A. D. MacKerell Jr., B. Brooks, C. L. Brooks III, L. Nilsson, B. Roux, Y. Won, and M. Karplus, CHARMM: The energy function and its parameterization with an overview of the program, in The Encyclopedia of Computational Chemistry, edited by P. v. R. Schleyer et al., pp. 217–277, John Wiley & Sons, Chichester, 1998.
-  B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus, CHARMM: A program for macromolecular energy, minimization, and dynamics calculations, J. Comp. Chem. 4 (1983) 187–217.
-  W. F. Van Gunsteren and M. Karplus, Effects of constraints on the dynamics of macromolecules, Macromolecules 15 (1982) 1528–1544.
-  W. D. Cornell, P. Cieplak, C. I. Bayly, I. R. Gould, J. Merz, K. M., D. M. Ferguson, D. C. Spellmeyer, T. Fox, J. W. Caldwell, and P. A. Kollman, A second generation force field for the simulation of proteins, nucleic acids, and organic molecules, J. Am. Chem. Soc. 117 (1995) 5179–5197.
-  D. A. Pearlman, D. A. Case, J. W. Caldwell, W. R. Ross, T. E. Cheatham III, S. DeBolt, D. Ferguson, G. Seibel, and P. Kollman, AMBER, a computer program for applying molecular mechanics, normal mode analysis, molecular dynamics and free energy calculations to elucidate the structures and energies of molecules, Comp. Phys. Commun. 91 (1995) 1–41.
-  W. L. Jorgensen and J. Tirado-Rives, The OPLS potential functions for proteins. Energy minimization for crystals of cyclic peptides and Crambin, J. Am. Chem. Soc. 110 (1988) 1657–1666.
-  W. L. Jorgensen, D. S. Maxwell, and J. Tirado-Rives, Development and testing of the OPLS all-atom force field on conformational energetics and properties of organic liquids, J. Am. Chem. Soc. 118 (1996) 11225–11236.
-  T. A. Halgren, Merck Molecular Force Field. I. Basis, form, scope, parametrization, and performance of MMFF94, J. Comp. Chem. 17 (1996) 490–519.
-  A. R. MacKerell Jr., M. Feig, and C. L. Brooks III, Extending the treatment of backbone energetics in protein force fields: Limitations of gas-phase quantum mechanics in reproducing protein conformational distributions in molecular dynamics simulations, J. Comp. Chem. 25 (2004) 1400–1415.
-  A. R. MacKerell Jr., M. Feig, and C. L. Brooks III, Improved treatment of the protein backbone in empirical force fields, J. Am. Chem. Soc. 126 (2004) 698–699.
-  Y. K. Kang and H. S. Park, Comparative conformational study of of N-acetyl-L-N’-methylprolineamide with different basis sets, J. Mol. Struct. (Theochem) 593 (2002) 55–64.
-  G. A. Kaminski, R. A. Friesner, J. Tirado-Rives, and W. L. Jorgensen, Evaluation and reparametrization of the OPLS-AA force field for proteins via comparison with accurate quantum chemical calculations on peptides, J. Phys. Chem. B 105 (2001) 6476–6487.
-  T. Wang and R. Wade, Force field effects on a -sheet protein domain structure in thermal unfolding simulations, J. Chem. Theory Comput. 2 (2006) 140–148.
-  C. D. Snow, E. J. Sorin, Y. M. Rhee, and V. S. Pande, How well can simulation predict protein folding kinetics and thermodynamics?, Annu. Rev. Biophys. Biomol. Struct. 34 (2005) 43–69.
-  O. Schueler-Furman, C. Wang, P. Bradley, K. Misura, and D. Baker, Progress in modeling of protein structures and interactions, Science 310 (2005) 638–642.
-  K. Ginalski, N. V. Grishin, A. Godzik, and L. Rychlewski, Practical lessons from protein structure prediction, Nucleic Acids Research 33 (2005) 1874–1891.
-  A. V. Morozov, T. Kortemme, K. Tsemekhman, and D. Baker, Close agreement between the orientation dependence of hydrogen bonds observed in protein structures and quantum mechanical calculations, Proc. Natl. Acad. Sci. USA 101 (2004) 6946–6951.
-  C. Gómez-Moreno Calera and J. Sancho Sanz, editors, Estructura de Proteínas, Ariel ciencia, Barcelona, 2003.
-  M. Karplus and J. A. McCammon, Molecular dynamics simulations of biomolecules, Nat. Struct. Biol. 9 (2002) 646–652.
-  R. Bonneau and D. Baker, Ab initio protein structure prediction: Progress and prospects, Annu. Rev. Biophys. Biomol. Struct. 30 (2001) 173–189.
-  C. J. Cramer, Essentials of Computational Chemistry: Theories and Models, John Wiley & Sons, Chichester, 2nd edition, 2002.
-  F. Jensen, Introduction to Computational Chemistry, John Wiley & Sons, Chichester, 1998.
-  A. Szabo and N. S. Ostlund, Modern Quantum Chemistry: Introduced to Advanced Electronic Structure Theory, Dover Publications, New York, 1996.
-  P. Maurer, A. Laio, H. W. Hugosson, M. C. Colombo, and U. Rothlisberger, Automated parametrization of biomolecular force fields from Quantum Mechanics/Molecular Mechanics (QM/MM) simulations through force matching, J. Chem. Theory Comput. 3 (2007) 628–639.
-  Y. A. Arnautova, A. Jagielska, and H. A. Scheraga, New force field (ECEPP-05) for peptides, proteins and organic molecules, J. Phys. Chem. B 110 (2006) 5025–5044.
-  J. A. Pople, Nobel lecture: Quantum chemical models, Rev. Mod. Phys. 71 (1999) 1267–1274.
-  J. M. García de la Vega and B. Miguel, Basis sets for computational chemistry, in Introduction to Advanced Topics of Computational Chemistry, edited by L. A. Montero, L. A. Díaz, and R. Bader, chapter 3, pp. 41–80, Editorial de la Universidad de la Habana, 2003.
-  T. Helgaker and P. R. Taylor, Gaussian basis sets and molecular integrals, in Modern Electronic Structure Theory. Part II, edited by D. R. Yarkony, pp. 725–856, World Scientific, Singapore, 1995.
-  J. C. Sancho-García and A. Karpfen, The torsional potential in 2,2 revisited: High-level ab initio and DFT results, Chem. Phys. Lett. 411 (2005) 321–326.
-  P. Hobza and J. Šponer, Toward true DNA base-stacking energies: MP2, CCSD(T) and Complete Basis Set calculations, J. Am. Chem. Soc. 124 (2002) 11802–11808.
-  P. Jurečka, J. Šponer, J. Černý, and P. Hobza, Benchmark database of accurate (MP2 and CCSD(T) complete basis set limit) interaction energies of small model complexes, DNA base pairs, and amino acid pairs, Phys. Chem. Chem. Phys. 8 (2006) 1985–1993.
-  G. A. Petersson, D. K. Malick, M. J. Frisch, and M. Braunstein, The convergence of complete active space self-consistent-field energies to the complete basis set limit, J. Chem. Phys. 123 (2005) 074111.
-  F. Jensen, Estimating the Hartree-Fock limit from finite basis set calculations, Theo. Chem. Acc. 113 (2005) 267–273.
-  Z.-H. Li and M. W. Wong, Scaling of correlation basis set extension energies, Chem. Phys. Lett. 337 (2001) 209–216.
-  M. R. Nyden and G. A. Petersson, Complete basis set correlation energies. I. The assymptotic convergence of pair natural orbital expansions, J. Chem. Phys. 75 (1981) 1843–1862.
-  P. Jurečka and P. Hobza, On the convergence of the term for complexes with multiple H-bonds, Chem. Phys. Lett. 365 (2002) 89–94.
-  E. W. Ignacio and H. B. Schlegel, On the additivity of basis set effects in some simple fluorine containing systems, J. Comp. Chem. 12 (1991) 751–760.
-  J. S. Dewar and A. J. Holder, On the validity of polarization and correlation additivity in ab initio molecular orbital calculations, J. Comp. Chem. 3 (1989) 311–313.
-  R. H. Nobes, W. J. Bouma, and L. Radom, The additivity of polarization function and electron correlation effects in ab initio molecular-orbital calculations, Chem. Phys. Lett. 89 (1982) 497–500.
-  J. A. Pople, M. J. Frisch, B. T. Luke, and J. S. Binkley, A Moller-Plesset study of the energies of AH molecules (A = Li to F), Intl. J. Quant. Chem. 17 (1983) 307–320.
-  R. Crespo-Otero, L. A. Montero, W.-D. Stohrer, and J. M. García de la Vega, Basis set superposition error in MP2 and density-functional theory: A case of methane-nitric oxide association, J. Chem. Phys. 123 (2005) 134107.
-  M. L. Senent and S. Wilson, Intramolecular basis set superposition errors, Intl. J. Quant. Chem. 82 (2001) 282–292.
-  I. Mayer and P. Valiron, Second order Møller-Plesset perturbation theory without basis set superposition error, J. Chem. Phys. 109 (1998) 3360–3373.
-  F. Jensen, The magnitude of intramolecular basis set superposition error, Chem. Phys. Lett. 261 (1996) 633–636.
-  I. Mayer, On the non-additivity of the basis set superposition error and how to prevent its appearance, Theo. Chem. Acc. 72 (1987) 207–210.
-  S. F. Boys and F. Bernardi, The calculation of small molecular interactions by the differences of separate total energies. Some procedures with reduced errors, Mol. Phys. 19 (1970) 553–566.
-  H. B. Jansen and P. Ros, Non-empirical molecular orbital calculations on the protonation of carbon monoxide, Chem. Phys. Lett. 3 (1969) 140–143.
-  R. Ditchfield, W. J. Hehre, and J. A. Pople, Self-consistent molecular-orbital methods. IX. An extended Gaussian-type basis for molecular-orbital studies of organic molecules, J. Chem. Phys. 54 (1971) 724–728.
-  W. J. Hehre, R. Ditchfield, and J. A. Pople, Self-consistent molecular-orbital methods. XII. Further extensions of Gaussian-type basis sets for use in molecular-orbital studies of organic molecules, J. Chem. Phys. 56 (1972) 2257–2261.
-  P. C. Hariharan and J. A. Pople, The influence of polarization functions on molecular orbital hydrogenation energies, Theor. Chim. Acta 28 (1973) 213–222.
-  M. J. Frisch, J. A. Pople, and J. S. Binkley, Self-consistent molecular-orbital methods. 25. Supplementary functions for Gaussian basis sets, J. Chem. Phys. 80 (1984) 3265–3269.
-  R. Krishnan, J. S. Binkley, R. Seeger, and J. A. Pople, Self-consistent molecular-orbital methods. XX. A basis set for correlated wave functions, J. Chem. Phys. 72 (1980) 650–654.
-  J. S. Binkley, J. A. Pople, and W. J. Hehre, Self-consistent molecular-orbital methods. 21. Small split-valence basis sets for first-row elements, J. Am. Chem. Soc. 102 (1980) 939–947.
-  G. W. Spitznagel, T. Clark, J. Chandrasekhar, and P. v. R. Schleyer, Stabilization of methyl anions by first row substituents. The superiority of diffuse function-augmented basis sets for anion calculations, J. Comp. Chem. 3 (1982) 363–371.
-  T. Clark, J. Chandrasekhar, G. W. Spitznagel, and P. v. R. Schleyer, Efficient diffuse function-augmented basis sets for anion calculations. III. The 3-21+G basis set for first-row elements Li–F, J. Comp. Chem. 4 (1983) 294–301.
-  L. A. Curtiss, P. C. Redfern, and K. Raghavachari, Gaussian-4 theory, J. Chem. Phys. 126 (2007) 084108.
-  A. G. Császár, On the structures of free glycine and -alanine, J. Mol. Struct. 346 (1995) 141–152.
-  J. L. Alonso and P. Echenique, A physically meaningful method for the comparison of potential energy functions, J. Comp. Chem. 27 (2006) 238–252.
-  P. Echenique and J. L. Alonso, Definition of Systematic, Approximately Separable and Modular Internal Coordinates (SASMIC) for macromolecular simulation, J. Comp. Chem. 27 (2006) 1076–1087.
-  M. J. Frisch, G. W. Trucks, H. B. Schlegel, G. E. Scuseria, M. A. Robb, J. R. Cheeseman, J. A. Montgomery, Jr., T. Vreven, K. N. Kudin, J. C. Burant, J. M. Millam, S. S. Iyengar, J. Tomasi, V. Barone, B. Mennucci, M. Cossi, G. Scalmani, N. Rega, G. A. Petersson, H. Nakatsuji, M. Hada, M. Ehara, K. Toyota, R. Fukuda, J. Hasegawa, M. Ishida, T. Nakajima, Y. Honda, O. Kitao, H. Nakai, M. Klene, X. Li, J. E. Knox, H. P. Hratchian, J. B. Cross, V. Bakken, C. Adamo, J. Jaramillo, R. Gomperts, R. E. Stratmann, O. Yazyev, A. J. Austin, R. Cammi, C. Pomelli, J. W. Ochterski, P. Y. Ayala, K. Morokuma, G. A. Voth, P. Salvador, J. J. Dannenberg, V. G. Zakrzewski, S. Dapprich, A. D. Daniels, M. C. Strain, O. Farkas, D. K. Malick, A. D. Rabuck, K. Raghavachari, J. B. Foresman, J. V. Ortiz, Q. Cui, A. G. Baboul, S. Clifford, J. Cioslowski, B. B. Stefanov, G. Liu, A. Liashenko, P. Piskorz, I. Komaromi, R. L. Martin, D. J. Fox, T. Keith, M. A. Al-Laham, C. Y. Peng, A. Nanayakkara, M. Challacombe, P. M. W. Gill, B. Johnson, W. Chen, M. W. Wong, C. Gonzalez, and J. A. Pople, Gaussian 03, Revision C.02, Gaussian, Inc., Wallingford, CT, 2004.
-  P. Echenique, A note on the accuracy of free energy functions in protein folding: Propagation of errors from dipeptides to polypeptides, In progress, 2007.
-  T. H. Dunning Jr., Gaussian basis sets for use in correlated molecular calculations. I. The atoms boron through neon and hydrogen, J. Chem. Phys. 90 (1989) 1007–1023.
-  I. A. Topol, S. K. Burt, E. Deretey, T.-H. Tang, A. Perczel, A. Rashin, and I. G. Csizmadia, - and -helix interconversion: A quantum-chemical sttudy on polyalanine systems in the gas phase and in aqueous solvent, J. Am. Chem. Soc. 123 (2001) 6054–6060.
-  C. Møller and M. S. Plesset, Note on an approximation treatment for many-electron systems, Phys. Rev. 46 (1934) 618–622.
-  F. Jensen, An introduction to the state of the art in quantum chemistry, Ann. Rep. Comp. Chem. 1 (2005) 1–17.
-  P. Hobza and J. Šponer, Structure, energetics, and dynamics of the nucleic acid base pairs: Nonempirical ab initio calculations, Chem. Rev. 99 (1999) 3247–3276.
-  M. D. Halls and H. B. Schlegel, Comparison study of the prediction of Raman intensities using electronic structure methods, J. Chem. Phys. 111 (1999) 8819–8824.
-  A. St.-Amant, W. D. Cornell, P. A. Kollman, and T. A. Halgren, Calculation of molecular geometries relative conformational energies, dipole moments, and molecular electrostatic potential fitted charges of small organic molecules of biochemical interest by Density Functional Theory, J. Comp. Chem. 12 (1995) 1483–1506.
-  Y. Zhao and D. G. Truhlar, Infinite-basis calculations of binding energies for the hydrogen bonded and stacked tetramers of formic acid and formamide and their use for validation of hybrid DFT and ab initio methods, J. Phys. Chem. A 109 (2005) 6624–6627.
-  W. Wang, Method-dependent relative stability of hydrogen bonded and – stacked structures of the formic acid tetramer, Chem. Phys. Lett. 402 (2005) 54–56.
-  A. J. Bordner, C. N. Cavasotto, and R. A. Abagyan, Direct derivation of van der Waals force fields parameters from quantum mechanical interaction energies, J. Phys. Chem. B 107 (2003) 9601–9609.
-  P. Knowles, M. Schütz, and H.-J. Werner, Ab initio nethods for electron correlation in molecules, in Modern Methods and Algorithms of Quantum Chemistry, edited by J. Grotendorst, volume 3, pp. 97–179, Jülich, 2000, John von Neumann Institute for Computing.
-  W. Kutzelnigg and P. Von Herigonte, Electron correlation at the dawn the century, Adv. Quantum Chem. 36 (1999) 185–229.
-  Y. Zhao and D. G. Truhlar, Density functionals for noncovalent interaction energies of biological importance, J. Chem. Theory Comput. 3 (2007) 289–300.
-  C. Tuma, A. D. Boese, and N. C. Handy, Predicting the binding energies of H-bonded complexes: A comparative DFT study, Phys. Chem. Chem. Phys. 1 (1999) 3939–3947.
-  L. González, O. Mó, and M. Yáñez, High-level ab initio versus DFT calculations on (HO) and HO-HO complexes as prototypes of multiple hydrogen bonds, J. Comp. Chem. 18 (1998) 1124–1135.
-  T. Van Mourik, P. G. Karamertzanis, and S. L. Price, Molecular conformations and relative stabilities can be as demanding of the electronic structure method as intermolecular calculations, J. Phys. Chem. 110 (2006) 8–12.
-  R. A. Bachorz, W. Klopper, and M. Gutowski, Coupled-cluster and explicitly correlated perturbation-theory calculations of the uracil anion, J. Chem. Phys. 126 (2007) 085101.
-  T. Müller, Basis sets, accuracy, and calibration in quantum chemistry, in Computational Nanoscience: Do It Yourself!, edited by J. Grotendorst, S. Blügel, and D. Marx, volume 31, pp. 19–43, John von Neumann Institute for Computing, Jülich, 2006.
-  M. Ramek, F. A. Momany, D. M. Miller, and L. Schäfer, On the importance of full geometry optimization in correlation-level ab initio molecular conformational analyses, J. Mol. Struct. 375 (1996) 189–191.
-  T. Beke, I. Csizmadia, and A. Perczel, Theoretical study on tertiary structural elements of -peptides: Nanotubes formed from parallel-sheet-derived assemblies of -peptides, J. Am. Chem. Soc. 128 (2006) 5158–5167.
-  H. A. Baldoni, G. Zamarbide, R. D. Enriz, E. A. Jauregui, Ö. Farkas, A. Perczel, S. J. Salpietro, and I. G. Csizmadia, Peptide models XXIX. Cis-trans iosmerism of peptide bonds: Ab initio study on small peptide model compound; the 3D-Ramachandran map of formylglycinamide, J. Mol. Struct. 500 (2000) 97–111.