Order parameter prediction from molecular dynamics simulations in proteins
Abstract
A molecular understanding of how protein function is related to protein structure will require an ability to understand large conformational changes between multiple states. Unfortunately these states are often separated by high free energy barriers and within a complex energy landscape. This makes it very difficult to reliably connect, for example by all-atom molecular dynamics calculations, the states, their energies and the pathways between them. A major issue needed to improve sampling on the intermediate states is an order parameter – a reduced descriptor for the major subset of degrees of freedom – that can be used to aid sampling for the large conformational change. We present a novel way to combine information from molecular dynamics using non-linear time series and dimensionality reduction, in order to quantitatively determine an order parameter connecting two large-scale conformationally distinct protein states. This new method suggests an implementation for molecular dynamics calculations that may dramatically enhance sampling of intermediate states.
I Introduction
Proteins represent complex dynamical systems with many different stable states. Sampling on these systems has been termed the multiple time scale problem Berne (1985) with many different wells and barrier heights being found. Despite considerable effort, there is currently no rapid way to determine the range of stable states from a single structure or from sequence alone. Because biological function is intrinsically linked to the large scale conformational change of proteins, an improved understanding of how conformational change in the complex energy landscape of the protein is determined will have an immediate impact on biological and physical questions.
One method that has been used exhaustively for attempting to determine reduced descriptors for protein motions has been through the concepts of normal mode analysis Florence Tama and Sanejouand (2000); Zhenggo and Brooks (2005); van Aalten et al. (1995); Hayward et al. (1997). If the potential function driving the energies was fully harmonic, then the solution would be analytic and all the stable states and their motions would be immediately determined. But, with thermal motions driving non-harmonic interactions such as van der Waals and electrostatic forces, the motions are controlled by a very complex, but physically defined, energy surface. The current paradigm for interpreting these motions is to run the longest possible molecular dynamics calculations and then to use the fluctuations seen in the trajectory to define a set of effective normal modes. These normal modes have been called the principal normal modes when they are restricted to the lowest collective degrees of freedom. Several groups have suggested that following these normal modes could be used as an effective order parameter for controlling large conformational changevan Aalten et al. (1995); Skjaerven et al. (2011); Miloshevsky et al. (2010). Nevertheless, the approach has not led to great success, with the modes often leading to blind alleys in the surface that don’t reflect large conformational changeBalsera et al. (1996); Ma (2005); Petrone and Pande (2006); Ramanathan and Agarwal (2009).
To improve on the sampling of large-scale conformational change there have been many methods proposed. The most well known contemporary method has been called transition path sampling and uses a small number of conformations along a candidate transition pathway along with a Monte Carlo move set to try and anneal an optimal prediction of intermediates in a conformationally changing system Dellago et al. (1998); Bolhuis et al. (2000). Alternative methods have used the RMS differences between states as an order parameter to control change Maragakis and Karplus (2005), directly adding a new force that biases motion along the RMS gradient. We have developed a method, called dynamic importance sampling (DIMS)Woolf (1998); Zuckerman and Woolf (2000, 1999, 2002); Perilla et al. (2010) that uses concepts from stochastic differential equations Wagner (1987) to create a family of independent transitions that together define the likelihood of different pathways and the kinetics of the transition with sufficient sampling. However, similar to the RMS based methods, the DIMS method requires a progress variable to gauge progress of the transitions and to create the biasing and its correction for an unbiased estimate of pathways, kinetics and states.
In this contribution we describe the use of effective transfer entropy for the determination of a reduced set of degrees of freedom that can be used to define order parameters behind large scale conformational change. Our approach combines insights from the physics of non-linear time-series analysis, dimensionality reduction, and the chemical physics of protein motions on a complex energy surface to enable the dynamics of the complex system to define an order parameter candidate. This improves on other methods for the determination of order parameters where the candidate order parameter was inferred from empirical analysis of the static structure or simply assumed to correlate with the RMS between two different states. In the calculations to follow we mainly use the receiver domain of nitrogen regulatory protein C (NtrC) (Fig. 1), in addition, we have performed steered molecular dynamics and checks of the implementation on the glucose-galactose binding protein (GGBP).
Ii Principal component analysis
The method of principal component analysis has been used in the analysis of protein motions for many years. This approach depends on the determination of a set of effective harmonic modes that define the complex motions that have been seen in the dynamics. While the initial excitement over the method as a way to obtain longer time-scales seems to have faded, there remains much effort to use this approach as a tool for the analysis of conformational change. Recent work has suggested that this approach may lead to significant systematic error when there are multiple stable states separated by a large barrier.
To compute the normal modes a molecular dynamics (MD) trajectory is used along with the determination of the average fluctuations in the simulation. Then, from the MD trajectory of a protein with atoms, the correlation matrix is built as follows:
(1) |
Where the brackets () denote time averages. The orthonormal basis vectors (principal components/PC) are determined by the eigenvalue problem .
The lowest frequency modes from PCA are normally associated with slow, collective motions and have been used to try and predict intermediate states. Figure 1 depicts the lowest frequency mode obtained by applying equation (1) and, solving the eigenvalue problem for our 600 ns trajectory of NtrC. On this plot the porcupine spines are located at the C atoms and their magnitude and direction shows the type of motion involved in the mode.
For a given mode , the involvement coefficients (IC) are defined as:
(2) |
where indicates the set of normalized coordinates () that represent the active-state and inactive-state conformations, respectively. Therefore, the ICs measure the amount of overlap between a PC and the direction defined by the displacement vector between structures. As mentioned in the case of hinge-bending motions, PCA shows higher values for the ICs compared to those from more complex motions. For instance, in the case of Adenalyte Kinase (AdK), the ICs for the first two modes are 0.49 and 0.63, respectively. In the case of NtrC the ICs are much lower (Fig. 4), in consequence the direction of the first PCs of both stable states are not pointing directly towards the other end state. In contrast, in another study, by using an empirical order parameter (based on observations of both stable structures) yields higher ICs values Lei et al. (2009). These order parameters involve only localized regions of the system and are proposed in an orderly series of events. This result would imply that the PCs are not pointing directly towards the other end state but towards a different intermediate state.
One of the ideas behind an order parameter is that a few degrees of freedom dominate part of or the entire transition, while the rest of the system would follow. Therefore finding an order parameter is equivalent to locating such leading modes. In this paper we use an information theoretical approach to identify the leading modes by measuring the transfer entropy between pairs of residues. The more dominant residues are those that transfer the largest amount of entropy to the rest of the system.
Iii Information flow in proteins
The networks of interactions between atoms and residues define the web of dependencies and patterns of dynamic coupling between domains in a protein, characterized by the directed flow of information spanning multiple spatial and temporal scales. An initial application of transfer entropy to DNA binding proteins was the first to apply the asymmetry of information transfer to protein molecular motionsKamberaj and van der Vaart (2009). Let be the time series for the center of mass of the th residue and, its probability distribution. Therefore it is possible to measure the average number of bits needed to optimally encode independent draws by using the Shannon entropy Reza (1994); Shannon (1948), where the sum extends over all the states that can reach.
iii.1 Transfer entropies
For a residue with a center of mass and, probability distribution ; one could say that its trajectory is independent of that of residue if
(3) |
where is the conditional probability to find residue at state given its past and, is the conditional probability to find residue at state given the past of both and . In the case where there is not a flux of information from to then equation (3) is correct. On the other hand, in the event that there is flux of information in any direction, the divergence from correctness of equation (3) can be quantified by the Kullback-Leibler entropy Kullback and Leibler (1951) hence defining the transfer entropySchreiber (2000):
(4) |
The transfer entropy between and is minimum and equal to zero when the two residues are independent and there is a maximum and equal to the entropy rate:
(5) |
when the residues are completely coupled. In order to minimize artifacts within the time series we use the normalized effective transfer entropy given by Gourévitch and Eggermont (2007); Marschinski and Kantz (2002):
(6) |
where the second term is the average transfer entropy from surrogated samples of , to .
iii.2 The set of most dominant residues
The total flux between two residues and , can be calculated by the equation,
(7) |
Residues are selected according to the following rules: is selected if , residue is selected if and, if then no residue is selected. The set of most dominant residues is then defined as the set of residues that follow the rules above and also that are above a fixed cutoff .
Iv Experiments with GGBP
To verify that our implementation was correct, we performed analysis of coupled chaotic Ulam maps, for Henon maps and for autoregressive processes. In addition, as a more challenging test case, we used the Glucose-galactose binding protein (GGBP) Borrok et al. (2007). The two domains of GGBP exhibit a 0.5 hinge opening motion from one state to the other. The structure of the open state for an unbound glucose-galactose binding protein (GGBP) was crystallized by Borrock et al. (PDBID:2FW0)Borrok et al. (2007) at Å. For the purpose of testing we used both DIMS transitions and we applied a constant pulling force along the line determined by residues Phe:142 and Leu:144 to create a system with a known directional change (highlighted in green in Figure 2).
As a comparison point, we performed a PCA analysis over the trajectory generated by pulling along the residues Phe:142 and Leu:144. With the transient nature of the pulling, it can be seen how PCA is unable to detect the pulling direction(Figure 2). We now describe the data treatment and some results from our initial testing for the transfer entropy analysis that we propose.
iv.1 Time series treatment
The time series from MD describing the atomic motions of proteins are real valued. Estimation of the joint probability densities in equation (4), from real valued data is not only computationally expensive but unnecessary. It has been shown that the amplitude of collective excitations, representing correlated global motions in the protein, samples multicentered distributions Garcia (1992). Therefore, although single or double precision arithmetic is necessary for the stability and accuracy of the simulations, if done carefully, discretization or coarse-graining of the data does not affect these distributions while greatly reducing noise and increasing computational efficiency. We optimize our implementation by incorporating high performance computing techniques (massively parallel calculations extended over thousands of cores) and by applying dimensionality reduction and data mining techniques that we briefly describe in the following sections.
In other applications of transfer entropies Kamberaj and van der Vaart (2009); Marschinski and Kantz (2002); Gourévitch and Eggermont (2007); Lungarella et al. (2007); Staniek and Lehnertz (2008) discretization of the data is performed mainly by using symbolization techniques. In some cases the discretization is so severe that all data is mapped to single bit time series (spikes), as is the case to analyse data from neurophysiological data in epilepsy patients.
Piecewise Aggregate Approximation (PAA)
A time series of length can be represented by a second time series of length , where each element is computed according toLin et al. (2003):
(8) |
where . In other words, each vector of the time series is simply the average, over a time range , of the time series . When is constant, PAA can be seen as an attempt to approximate the original time series with a series of linear functions. Other approaches of PAA include using an adaptive mechanism to adjust according to certain rules, i.e. defining a threshold such that . For all calculations we set the time range ns.
iv.2 Transfer entropies from DIMS trajectories
In a previous work we generated a set of transitions for GGBP Perilla et al. (2010);the simulations were carried out using CHARMM27FF with CMAP Brooks et al. (2009) with our implementation of DIMS and using an implicit solvent (ACE2)Schaefer et al. (1998). The rotational and translational degrees of freedom were removed by rms fitting the target structure to the evolving system and, the alignment atoms were selected on the N- terminal domain (Residues 111 to 252 and, 293 to 305). By applying our transfer entropy analysis we were able to identify the key residues in the DIMS transition (Figure 3). The results show that the leading residues for the transition are located in the three-segment hinge that connects the N- and C- termini 3.
V Finding the leading modes on NtrC
The structures of the inactive-state and active-state conformations of NtrC have been solved by NMRKern et al. (1999); Volkman et al. (2001); Hastings et al. (2003). At room temperature NtrC samples both conformational states, however after phosphorylation the active states dominate the ensemble set of populations. Recent studies suggest that the transition pathway between the two conformations can be decomposed in a series of segmented progress variables (order parameters)Lei et al. (2009). For this study both states were solvated in box of dimensions with TIP3 waters, equilibrated for 15 ns; the total number of atoms, including solvent and ions, is 12168 and 13688 for the active and inactive sates respectively. Production runs were performed for 600 ns using NAMD2.7b2Phillips et al. (2005) at NICS-Kraken. Analysis of the trajectories were executed using our code at NCSA-Abe/Lincoln.
v.1 Computing the modes
A key insight is that the atoms with the strongest leading effective transfer entropy can be used as a subset of degrees of freedom to define collective modes that are new candidate order parameters. To accomplish this goal, once a cutoff and a time-length for the interrogation of the dynamics has been defined, is straightforward. The modes are determined by fluctuations of the leading effective transfer components and together describe a set of collective motions.
For the residues in the set we compute the correlation matrix as in equation (1) over the full trajectory and obtain a set of modes . The involvement coefficients (equation (2)) for different values of the cutoff are presented in Figure 4. As the cutoff increases fewer residues are selected as dominant, however the involvement coefficients are clearly increasing. This suggests that the most dominant modes are pointing towards the end structure. Since the modes are transferring entropy to the entire system biasing along these modes would result in a collective bias for the entire system.
Since is an orthonormal base we can define the cumulative involvement coefficient of the first PCs as:
(9) |
and measure how much of the overall difference is accounted by the first modes.
This last figure suggests that relatively short molecular dynamics simulations are converging onto the important degrees of freedom determined by the effective transfer entropy analysis. It suggests that an algorithm for the use of the effective transfer entropy modes can be readily defined in CHARMM or other computer code. In that algorithm the lowest frequency modes would be the direction of biasing that is applied through DIMS or another approach (e.g. transition path sampling or targeted MD). The modes would be defined by a relatively short unbiased simulation and then followed by biasing for a similar amount of time to the mode determination. For example, this figure would suggest that 5 ns of sampling for the effective modes followed by 5 ns of sampling along the modes could be used to improve the confidence that the most important intermediate states are being reached. This would then be repeated with unbiased sampling including light restraints on the backbone atoms to define a new set of effective transfer entropy modes. By continuing this process until the end state is reached, a transition pathway would be defined. If this process is then repeated for multiple starting points with various sampling windows and different random number seeds, along with a random selection of cutoffs and mode selections, then a good sampling of the intermediate space should be obtained. We expect to implement this type of algorithm in the future and the results will be presented at that time.
Vi Conclusions
A molecular understanding of how protein function is related to protein structure will require an ability to understand large conformational changes between multiple states. Unfortunately these states are often separated by high free energy barriers and within a complex energy landscape. This makes it very difficult to reliably connect, for example, by all-atom molecular dynamics calculations, the states, their energies and the pathways between them. A major issue needed to improve sampling on the intermediate states is an order parameter – a reduced descriptor for the major subset of degrees of freedom – that can be used to aid sampling for the large conformational change. In this paper we present a novel way to combine information from molecular dynamics using non-linear time series and dimensionality reduction, in order to quantitatively determine an order parameter connecting two large-scale conformationally distinct protein states. The results presented show that the leading modes can be computed from short simulations. This new method suggests an implementation for molecular dynamics calculations that may dramatically enhance sampling of intermediate states.
References
- B. Berne, Multiple Time Scales, edited by J. Brackbill and B. Cohen (Academic Press, 1985) p. 419.
- O. M. Florence Tama, Florent Xavier Gadea and Y.-H. Sanejouand, Proteins: Structure, Function, and Genetics 41, 1 (2000).
- W. Zhenggo and B. R. Brooks, Biophysical Journal 88, 3109 (2005).
- D. M. van Aalten, A. Amadei, A. B. Linssen, V. G. Eijsink, G. Vriend, and H. J. Berendsen, Proteins 22, 45 (1995).
- S. Hayward, A. Kitao, and H. J. Berendsen, Proteins 27, 425 (1997).
- L. Skjaerven, B. Grant, A. Muga, K. Teigen, J. A. McCammon, N. Reuter, and A. Martinez, PLoS Comput Biol 7, e1002004 (2011).
- G. V. Miloshevsky, A. Hassanein, and P. C. Jordan, Biophys J 98, 999 (2010).
- M. A. Balsera, W. Wriggers, Y. Oono, and K. Schulten, Journal of Physical Chemistry 100, 2567 (1996).
- J. Ma, Structure 13, 373 (2005).
- P. Petrone and V. S. Pande, Biophys J 90, 1583 (2006).
- A. Ramanathan and P. K. Agarwal, J Phys Chem B 113, 16669 (2009).
- C. Dellago, P. G. Bolhuis, F. S. Csajka, and D. Chandler, The Journal of Chemical Physics 108, 1964 (1998).
- P. G. Bolhuis, C. Dellago, and D. Chandler, Proc Natl Acad Sci U S A 97, 5877 (2000).
- P. Maragakis and M. Karplus, Journal of Molecular Biology 352, 807 (2005).
- T. Woolf, Chemical Physics Letters 289, 433 (1998).
- D. M. Zuckerman and T. B. Woolf, Physical Review E 63, 016702 (2000).
- D. M. Zuckerman and T. B. Woolf, The Journal of Chemical Physics 111, 9475 (1999).
- D. M. Zuckerman and T. B. Woolf, The Journal of Chemical Physics 116, 2586 (2002).
- J. R. Perilla, O. Beckstein, E. J. Denning, and T. B. Woolf, Journal of Computational Chemistry 32, 196 (2010).
- W. Wagner, Journal of Computational Physics 71, 21 (1987).
- M. Lei, J. Velos, A. Gardino, A. Kivenson, M. Karplus, and D. Kern, Journal of Molecular Biology 392, 823 (2009).
- H. Kamberaj and A. van der Vaart, Biophysical Journal 97, 1747 (2009).
- F. M. Reza, An introduction to information theory, Vol. 44 (Dover Publications, 1994) p. 671â679.
- C. E. Shannon, MD computing computers in medical practice 27, 306 (1948).
- S. Kullback and R. A. Leibler, The Annals of Mathematical Statistics 22, 79 (1951).
- T. Schreiber, Physical Review Letters 85, 461 (2000).
- B. Gourévitch and J. J. Eggermont, Journal of Neurophysiology 97, 2533 (2007).
- R. Marschinski and H. Kantz, European Physical Journal B 30, 275 (2002).
- J. J. Borrok, L. L. Kiessling, and K. T. Forest, Protein science : a publication of the Protein Society 16, 1032 (2007).
- A. E. Garcia, Physical Review Letters 68, 2696 (1992).
- M. Lungarella, A. Pitti, and Y. Kuniyoshi, Physical Review E 76, 056117 (2007).
- M. Staniek and K. Lehnertz, Physical Review Letters 100, 158101 (2008).
- J. Lin, E. Keogh, S. Lonardi, and B. Chiu, Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery DMKD 03 , 2 (2003).
- B. Brooks, C. Brooks, A. Mackerell, L. Nilsson, R. Petrella, B. Roux, Y. Won, G. Archontis, C. Bartels, S. Boresch, A. Caflisch, L. Caves, Q. Cui, A. Dinner, M. Feig, S. Fischer, J. Gao, M. Hodoscek, W. Im, K. Kuczera, T. Lazaridis, J. Ma, V. Ovchinnikov, E. Paci, R. Pastor, C. Post, J. Pu, M. Schaefer, B. Tidor, R. Venable, H. Woodcock, X. Wu, W. Yang, D. York, and M. Karplus, Journal of computational chemistry 30, 1545 (2009).
- M. Schaefer, C. Bartels, and M. Karplus, Journal of molecular biology 284, 835 (1998).
- D. Kern, B. F. Volkman, P. Luginbühl, M. J. Nohaile, S. Kustu, and D. E. Wemmer, Nature 402, 894 (1999).
- B. F. Volkman, D. Lipson, D. E. Wemmer, and D. Kern, Science 291, 2429 (2001).
- C. A. Hastings, S.-Y. Lee, H. S. Cho, D. Yan, S. Kustu, and D. E. Wemmer, Biochemistry 42, 9081 (2003).
- J. C. Phillips, R. Braun, W. Wang, J. Gumbart, E. Tajkhorshid, E. Villa, C. Chipot, R. D. Skeel, L. Kale, and K. Schulten, J Comp Chem 26, 1781 (2005).