Universal geometrical factor of protein conformations as a consequence of energy minimization

Universal geometrical factor of protein conformations as a consequence of energy minimization

Ming-Chya Wu mcwu@ncu.edu.tw Research Center for Adaptive Data Analysis, National Central University, Chungli 32001, Taiwan Institute of Physics, Academia Sinica, Nankang, Taipei 11529, Taiwan Department of Physics, National Central University, Chungli 32001, Taiwan    Mai Suan Li Institute of Physics, Polish Academy of Sciences, Al. Lotnikow 32/46, 02-668 Warsaw, Poland    Wen-Jong Ma Institute of Physics, Academia Sinica, Nankang, Taipei 11529, Taiwan Graduate Institute of Applied Physics, National Chengchi University, Taipei 11605, Taiwan    Maksim Kouza Institute of Physics, Polish Academy of Sciences, Al. Lotnikow 32/46, 02-668 Warsaw, Poland Department of Physics, Michigan Technological University, Houghton, MI 49931, USA    Chin-Kun Hu huck@phys.sinica.edu.tw Institute of Physics, Academia Sinica, Nankang, Taipei 11529, Taiwan Center for Nonlinear and Complex Systems and Department of Physics, Chung Yuan Christian University, Chungli 32023, Taiwan
February 12, 2012

The biological activity and functional specificity of proteins depend on their native three-dimensional structures determined by inter-and intra-molecular interactions. In this paper, we investigate the geometrical factor of protein conformation as a consequence of energy minimization in protein folding. Folding simulations of polypeptides with chain length ranging from to residues manifest that the dimensionless ratio () of the van der Waals volume to the surface area and average atomic radius of the folded structures, calculated with atomic radii setting used in SMMP [Eisenmenger F., et. al., Comput. Phys. Commun., 138 (2001) 192], approach quickly during the course of energy minimization. A large scale analysis of protein structures show that the ratio for real and well-designed proteins is universal and equal to . The fractional composition of hydrophobic and hydrophilic residues does not affect the ratio substantially. The ratio also holds for intrinsically disordered proteins, while it ceases to be universal for polypeptides with bad folding properties.

87.14.E-, 87.15.A-, 87.15.-v
preprint: Paper published in EPL — Europhys. Lett. 96, 68005 (2011).

I Introduction

In recent decades, physical methods have been widely used to study properties and structures of biopolymers glaser2001 (); huang2005 (); waigh2007 (), including DNA peng1992 (); 10EPL-DNA (), RNA 07pre-RNA (), and protein hyman1995 (); okamoto1998 (); dokholyan1998 (); 11EPL-protein (). Proteins assume specified conformations from their chemical compositions or sequences to develop biological activity and functional specificity. The corresponding three-dimensional (3D) structures are a consequence of inter- and intra-molecular interactions, in which energy minimization is the principle governing the folding tendency. In spite of various components involved in the interactions, there has been attempts to derive simple geometric factors from a variety of conformations, which can be either considered as a factor for structure validity or used as an effective constraint in folding simulation.

Geometric properties of protein molecules have been studied for more than three decades richards1974 (); chothia1975 (); chothia1984 (). Among others, the Ramachandran plot ramachandran1963 () is a practical criterion widely used for improving the quality of NMR or crystallographic protein structures. In a polypeptide, the main chain N-C and C-C bonds are relatively free to rotate, and can be respectively represented by two torsion angles. These angles can only appear in certain combinations due to steric hindrances, which define allowed regions of the torsion angles for secondary structures in the plot.

Furthermore, it has been found that the mean volume of an amino acid in the interior of proteins is very close to that of the amino acid in crystals chothia1975 (); chothia1984 (). With the help of the Delaunary triangulation method, Liang and Dill liang2001 () have reported that the protein packing is heterogeneous, and in terms of packing density, protein molecules may be either well-packed or loosely packed. Zhang et al. zhang2003 () showed that the packing density of single domain proteins decreases with chain length, which shares a generic feature of random polymers satisfying loose constraint in compactness.

Beside the Ramachandran plot and the packing density which are conclusions based on observations, there has been theoretical models introduced to simulate properties of protein geometric structures. For example, Banavar and Maritan banavar2003 () have introduced the effective backbone tube model to analyze the secondary structures of proteins under the constraint of minimum energy and showed that the tube has an effective radius of Å.

When a polypeptide folds, the hydrophobic effects cause nonpolar side chains to cluster together in the protein interior or interface, whereas polar side chains tend to maximize the contacts with outer solvent molecules. The stability of the system is partially due to the burial of the nonpolar residues, and can be measured by the loss of the solvent accessible surface area lee1971 (); richard1977 (); connolly1983 (); stephanie2006 (). An atom or group of atoms is defined as accessible if a solvent molecule of specified size, generally water, can be brought into van der Waals contact. The solvent accessible surface is then simply defined as the surface traced out by the center of a probe sphere, which represents the solvent molecule, as it rolls over the van der Waals surface of the protein connolly1983 (); 05jcc (). Hence, volume and surface area are suitable parameters to characterize the geometrical conformation of protein.

Ii Methods

ii.1 Folding simulation

We used the SMMP package eisenmenger2001 (); eisenmenger2006 () for protein folding simulation and simulated annealing, as well as canonical Monte Carlo method, to generate folded structures. Starting with a polypeptide in a solvent, the SMMP searches the lowest energy conformation by utilizing the energy function




Here is the distance in Å between atoms and . , , , and are parameters of the empirical potentials. and are the partial charges on the atoms and , respectively, is the dielectric constant of the protein interior space. The factor in Eq.(3) is used to express the energy in kcal/mol. is the energetic torsion barrier of rotation about the bond and is the multiplicity of the torsion angle banavar2003 (). The input file for SMMP is a sequence of amino acids and the output file is in the Protein Data Bank (PDB) format berman2000 (). The protein-solvent interactions were implemented with the implicit water solvation by selecting type 1 solvent in the SMMP main.f program. All parameters needed for the simulation have been self-contained in the SMMP package.

ii.2 Calculation of volume and surface area

To compute the volume and surface area of the polypeptide in the course of folding simulation, we used the ARVO package 05cpc () developed based on analytic equations 05jcc (). ARVO can calculate and of a system of atoms, which can overlap in any way. The main idea of the algorithms of ARVO is converting computation of volume and surface area of overlapping spheres as surface integrals of the second kind over closed regions. Using stereographic projection, one can transform the surface integrals to a sum of double integrals which are then reduced to curve integrals 05jcc (); 05cpc (). It has been shown that the Van der Waals surface areas labanowski1996 () computed by the GETAREA module in FANTOM package ven1993 (); fraczkiewicz1998 () and ARVO module are consistent 05cpc (). Comparing with programs implementing different algorithms and approximations to describe geometrical properties of atomic groups, the differences among the computed surface area by VOLBL liang1998 (), GEPOL silla1990 (); silla1991 () and ARVO 05cpc () are less than 1%, and the differences among the computed volumes are about 2% (see Refs.05cpc () and busa2009 () for detailed discussions). On the basis of analytical method, the accuracy of the computation of volume and surface area of protein molecules by using ARVO is superior to numerical integration which always contains numerical errors busa2009 ().

The input file for ARVO contains the coordinates of the center and radius of all atoms in the system, where . The atoms can overlap in any way. To calculate van der Waals surface area and volume of a PDB protein structure, we used the coordinates of carbon (C), nitrogen (N), oxygen (O), and sulfur (S) of the PDB data and van der Waals radii of C, N, O, and S as input data. According to the conventional parameter settings in protein folding simulations hansmann1993 (); eisenmenger2001 (); hayryan2001 (); lin2003 (); eisenmenger2006 (); ghulghazaryan2007 (), N atom has (van der Waals) radius Å, C atom has radius Å, S atom has radius Å, and O atom has radius Å. The relatively smaller radius of hydrogen (H) atom is neglected; a water (solvent) molecule is represented by an O atom with radius Å. The radii of these atoms at the atomic level are determined by the densities of the electron cloud, and they are self-consistent with other physical quantities used in the SMMP simulation eisenmenger2001 (); eisenmenger2006 (). Further, to calculate solvent accessible surface area and related volume of the protein structure, we added radius of the solvent Å  to van der Waals radii of C, N, O, and S, i.e. the effective radii of C, N, O, and S are 2.95Å, 2.95Å, 2.80Å, and 3.40Å 05jcc (); 05cpc (), respectively. The average atomic radius and average effective radius of folded structures are calculated using these radii.

Iii Results and Discussions

iii.1 ratio for the van der Waals volume and surface area and average atomic radius

The ratio and the total energy are computed in the time course of simulation. In all cases of our simulation, the final energy using canonical Monte Carlo method is lower than using simulated annealing. The results of small proteins (with ) (Table 1) reveal that approaches to as the energy decreases, while the resultant structures are not necessary close to native structures. It turns out that the energy minimization criterion is likely connected with the geometric conformation defined by the ratio .

1HP9 183
1KDL 193
1GCN 246
1VII 295
2PLH 330
2OVO 418
1PGB 436
1HPT 440
1UOY 452
1UTG 548
Table 1: The ratio for typical proteins structures. The is the ratio of the structure with a randomly chosen configuration by the SMMP package eisenmenger2001 (); eisenmenger2006 (), is for final structure after performing the folding simulation, subscript “” stands for simulated annealing and “” for canonical Monte Carlo, and is for the structure from PDB.
Figure 1: (a) Dependence of and on for PDB protein structures. The numbers indicate the slopes. (b) The distribution (blue, solid line) for the structures shown in (a). The Gaussian fit with the maximum located at , with , , and . The estimation of the fitting is adjusted . The histogram of for 932 artificial extended structured is shown in red (dotted line). The green (dash-dotted) line refers to the Gaussian fit with maximum at . (c) A typical compact PDB structure (PDB code: 1VII). (d) An extended structure obtained by Swiss-pdb viewer 3.7 (SP5). (e) The distribution of for 28236 proteins using the probe sphere with radius of 1.4Å. The maximum of the Gaussian fit is located at .

To confirm that the ratio is relevant, we have tested PDB structure data from the Protein Culling Server wang2003 (), in which only X-ray data with high resolution have been selected. The ratio is found to be . To determine a reasonable tolerance for the ratio, we have also tested a larger database from the Protein Data Bank. Totally PDB entries deposited at the Protein Data Bank in June 2005 have been downloaded for the test. After excluding non-proteins, such as DNA and RNA, and problematic structures in which only carbons are included, there are finally protein structures involved in statistics. In our analysis, both X-ray and NMR data have been used. For NMR data consisting of more than one model, we selected the first model which is considered as the most accurate one or is an average of the models. We plotted the dependence of van der Waals surface area and volume on the total number of (C, N, O, and S) atoms in Fig. 1(a) which shows that and increase linearly with . The linear correlation between the volume and the number of atoms or area has been found by Lorenz et al. lorenz1993 () by using the Monte Carlo studies with the model of clusters of random uncorrelated spheres. Similar results of the linear relations have also been discussed by Liang and and Dill liang2001 () with protein structures. The result in Fig. 1(a) provides a more solid demonstration from the basis of a larger database. Furthermore, we plotted the distribution of as histograms in blue (solid line) in Fig. 1(b), which locates in a very narrow interval centered at .

implies that one cannot imagine a protein as a chain of small spheres because in this case we would have . However, the result might be understood qualitatively by considering that a protein consists of tubes of radius . There is a tube to represent the backbone of the protein; there are also some tubes to represent side chains of the protein. The total length of tubes is . Using and we obtained which is consistent with our numerical result. The linear dependence of and on is supported by Fig. 1(a). It is worth noting the linear correlation is independent of the settings of the radii of atoms. If other radii are used, the linear relation remains but the ratio is different. The ratio derived from the average of an ensemble of PDB protein chains (selected by Protein Sequence Culling Server wang2003 ()), using the Richard’s parameters lee1971 (), is . Similarly, using the Protori radii tsai1999 (), the result is . The relation approximately holds in the two cases.

Figure 2: (a) Probability density function of statistics for whole protein molecules (28664 samples, Gaussian distribution centered at ), helix structures (extracted from 26040 samples, ), sheet structures (24537 samples, ) and other structures (25513 samples, ). (b) as a function of the fraction of atoms in helix structures. The slope of the linear fit (black dashed line) is , and the correlation level is . (c) as a function of the fraction of atoms in sheet structures. The slope of the linear fit (black dashed line) is , and the correlation level is .

To clarify the relation between the ratio and the compactness of native structures, we computed for artificial extended structures of protein molecules. The extended structures are obtained by setting all torsion angles of existing 3D structures from PDB equal to , using the Swiss-pdb viewer 3.7 (SP5) (http://www.expasy.org/spdbv/). A typical compact PDB structure and extended structure are shown in Fig. 1(c) and Fig. 1(d), respectively; the latter is similar to those obtained from mechanical unfolding of proteins studied in Refs.li2006 (); li2007 (). The histograms in red (dotted line) in Fig. 1(b) show the distribution of for 932 artificial structures. Its maximum locates at which is higher compared to real protein structures. This interesting result confirms that the value comes from the requirement for the formation of compact native conformations as a result of energy minimization.

Further, direct comparisons of volumes and surface areas for real and extended structures show that from an extended structure to a real structure, there is a small change (increase or reduction) in volume while there is usually a large increase in surface such that the ratio changes from to . All of these indicate that the larger ratio of extended structure is attributed to nonphysical geometrical properties, such as loosely connections of monomers, and unbalance of electrostatic interactions among monomers, and interactions between monomers and water molecules. It should be noted that both real and artificial structures satisfy the requirements imposed by the Ramachandran plot, but only the former has protein-like properties. Thus, can serve as an useful factor for selecting three-dimensional protein-like structures. In addition, we have also found that the beta structures extracted from PDB protein structures have smaller in comparison with the helixes (), as shown in Fig. 2(a). Whereas the ratios for individual secondary structures are different, a protein molecule as whole is a self-organized geometric unit, which blends various secondary structures to form a properly folded 3D structure. In contrast to secondary structures, tertiary and quaternary structures then have universal property of regardless of their details. Defining the relative beta/helix content of a protein as a number of amino acids belonging to beta strands/helix structures divided by its total number of residues, we found that, as shown in Figs. 2(b) and 2(c), there is no correlation between and beta- as well as helix-content as the correlation level for the linear fits is very low (0.005) for both cases.

iii.2 ratio for solvent accessible volume and surface area and average effective radius

In order to compare the distributions of (with zero radius of solvent) and (with radius of solvent 1.4Å), we calculated the distributions of and for protein structures from PDB. The histogram of is not shown because it is similar to the larger set of 28664 protein structures (Fig. 1(b)). The distribution of (Fig. 1(e)) has a maximum at .

iii.3 ratio and hydrophobicity of amino acids

Consider a polypeptide chain consisting of a sequence of amino acids with different hydrophobicities. The hydrophobic condensation drives the polypeptide chain toward a conformation with lower free energy. This is achieved by burying hydrophobic contents into interior and reside polar monomers on the surface contacting with water. This process involves not only the regulation of the connections between monomers but also compensations of volume and surface area. According to the statistics shown in Fig. 3 for protein structures (from Protein Sequence Culling Server wang2003 ()), the ratio of hydrophilic and hydrophobic amino acids () of proteins in the Kyte-Doolittle scale kyte1982 () is generally in a narrow range with respect to variable range of , suggesting that the universality of is probably a consequence of compositions of hydrophilic and hydrophobic amino acids in a protein. The linear fit for as a function of (lower part of Fig. 3) gives the correlation level of . Since this level is notably lower than 0.5 there is no correlation between these two quantities. This is not unexpected because varies in a very narrow interval. For this reason, one can show that does not correlate with individual values of and . Furthermore, folding simulations of ten polypeptides with fixed and randomized sequences show that the averages of the ratios are for initial structure and for final structure after energy minimization. This implies that the ratio is not only the property of disordered protein, but is also that of random copolymers.

Figure 3: Upper panel shows the distribution of the ratio of hydrophilic () and hydrophobic () amino acids in a molecule for protein structures (from Protein Sequence Culling Server wang2003 ()), based on the Kyte-Doolittle scale kyte1982 (). The Gaussian fit is centered at . Lower panel shows as a function of . The slope of the linear fit (blue dashed line) is , and the correlation level is . The inset shows the distribution of for the protein structures.

iii.4 ratio for intrinsically disordered proteins

It is also of interest to study the ratio for the intrinsically disordered proteins which usually lead to misfolding meszaros2007 (); mittag2007 (). We have calculated for protein structures meszaros2007 () and found that the ratio is (see Fig. 4), which is within the tolerance determined by the ensemble of 28664 PDB protein structures. Thus, the ratio holds once a polypeptide fold to a compact structure no matter of its species.

Figure 4: (a) Volume as a function of surface area for disordered proteins meszaros2007 (). (b) The histogram of the structures shown in (a). The Gaussian fit with the maximum located at .

Iv Summary and Conclusions

In summary, we have found a universal ratio of the van der Waals volume to the surface area and average atomic radius of folded structures for native protein structures, including intrinsically disordered proteins. We have studied the connection between the energy minimization and geometric conformation by monitoring the ratio during folding simulations using the SMMP package eisenmenger2001 (); eisenmenger2006 (). Our results reveal that should be somewhat related to the energy global minimum of protein molecules. This result can be imposed as a rule in searching for native conformations in folding simulations using protein sequences. can also serve as a necessary condition for checking the validity of PDB data and designing protein-like sequences.

It is well known that hydrophobic residues are buried in the core of proteins and the van der Waals volume should be, therefore, proportional to the number of such residues. The van der Waals area should linearly depend on the number of hydrophilic residues, which have tendency to reside on the protein surface. Thus, the universality of is probably a consequence of the fact that the ratio of the hydrophobic and hydrophilic amino acids of proteins is roughly a constant.

Here we should emphasize that does not correspond to a unique conformation, but it confines molecular conformations in a folding simulation from vast possibilities to a smaller space. It excludes improperly folded structures which are characterizable by such geometrical properties and is beneficial for the reduction of simulation time. One possible implementation of this property shall be in the calculation of surface energy associated with the solvent access area. A preliminary test of the , working as a filter, can be performed before next update step in simulations. Other independent factors can work together to define the conformation to have a native-like structure.

This work was supported by Taiwan Grants NSC 96-2112-M-008-021-MY3, 96-2911-M-001-003-MY3, 97-2627-B-008-004, 98-2627-B-008-004, 99-2627-B-008-002, 100-2112-M-008-003-MY3, Poland-Taiwan NSC Grant 100-2911-I-001-507, NCTS (North), AS-95-TP-A07, and the Polish Grant No. 202-204-234.


  • (1) Glaser R., Biophysics, (Springer-Berlag, Berlin) 2001.
  • (2) Huang K., Lectures on Statistical Physics and Protein Folding, (World Scientific, Singapore) 2005.
  • (3) Waigh T. A., Applied Biophysics, (John Wiley & Sons, England) 2007.
  • (4) Peng C.-K. et al., Nature, 356 (1992) 168.
  • (5) Lando D. Y. et al., EPL, 91 (2010) 38003.
  • (6) Mamasakhlisov Y. S. et al., Phys. Rev. E, 75 (2007) 061907.
  • (7) Hyman B. T. et al., Proc. Natl. Acad. Sci., 92 (1995) 3586.
  • (8) Okamoto Y. Physica A, 254 (1998) 7.
  • (9) Dokholyan N. et al., Folding & Design, 3 (1998) 577.
  • (10) Gevorkian S. G. et al., EPL, 95 (2011) 23001.
  • (11) Chothia C., Nature, 254 (1975) 304.
  • (12) Chothia C., Ann. Rev. Biochem., 53 (1984) 537.
  • (13) Richards F. M., J. Mol. Biol., 82 (1974) 1.
  • (14) Ramachandran G. N. et al., J. Mol. Biol., 7 (1963) 95.
  • (15) Liang J. and Dill K. A., Biophys. J., 81 (2001) 751.
  • (16) Zhang J. et al., J. Chem. Phys., 118 (2003) 6102.
  • (17) Banavar J. R. and Maritan A., Rev. Mod. Phys., 75 (2003) 23.
  • (18) Lee B. and Richards F. M. J. Mol. Biol., 55 (1971) 379.
  • (19) Richards F. M. Annu. Rev. Biophys. Bioeng., 6 (1977) 151.
  • (20) Connolly M. L. Science, 221 (1983) 709.
  • (21) Stephanie M. E. T. et al., J. Am. Soc. Mass. Spectrom., 17 (2006) 1490.
  • (22) Hayryan S. et al., J. Comp. Chem., 26 (2005) 334.
  • (23) Eisenmenger F. et al., Comput. Phys. Commu., 138 (2001) 192.
  • (24) Eisenmenger F. et al., Comput. Phys. Commu., 174 (2006) 422.
  • (25) Berman H. M.et al., Nucl. Acids Res., 28 (2000) 235.
  • (26) Buša J. et al., Comp. Phys. Commun., 165 (2005) 59.
  • (27) Labanowski J. K., Molecular Modeling (1996). Posted at http://www.ccl.net/cca/documents/molecular-modeling/node1.html.shtml
  • (28) Fraczkiewicz R. and Braun W., J. Comp. Chem., 19 (1998) 319.
  • (29) Freygerg B. V. and Braun W., J. Comp. Chem., 14 (1993) 510.
  • (30) Liang J. et al., Proteins: Struc. Func. Genet., 33 (1998) 1.
  • (31) Silla E. et al., J. Mol. Graphics, 8 (1990) 168.
  • (32) Silla E., Tun I., and Pascual-Ahuir J. L., J. Comp. Chem., 12 (1991) 1077.
  • (33) Buša J. et al., J. Comp. Chem., 30 (2009) 346.
  • (34) Ghulghazaryan R. G.et al., J. Comp. Chem., 28 (2007) 715.
  • (35) Hansmann U. H. E. and Okamoto Y., J. Comp. Chem., 14 (1993) 1333.
  • (36) Hayryan S. et al., J Comp Chem, 22 (2001) 1287.
  • (37) Lin C.-Y.et al., Proteins: Struct Funct Genet, 52 (2003) 436.
  • (38) Wang G. and Dunbrack Jr R. L., Bioinformatics, 19 (2003) 1589. The website of the protein sequence culling server is http://dunbrack.fccc.edu/pisces/. We used version cullpdb_pc20_res1.6_R0.25_d040724_chains743.
  • (39) Lorenz B. et al., J. Phys. A Math. Gen., 26 (1993) 4711.
  • (40) Tsai J. et al., J. Mol. Biol., 290 (1999) 253.
  • (41) Li M. S. et al., Proc. Natl. Acad. Sci. USA, 103 (2006) 93.
  • (42) Li M. S. et al., Biophys. J., 92 (2007) 547.
  • (43) Kyte J. and Doolittle R. F., J. Mol. Biol., 157 (1982) 105.
  • (44) Mészáros B. et al., J. Mol. Biol., 372 (2007) 549. There are PDB structures in Table 1, while the entry “1F83” has been obsoleted by the Protein Data Bank.
  • (45) Mittag T. and Forman-Kay J. D., Curr. Opinion Struc. Biol., 17 (2007) 3.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description