Fast and Flexible Geometric Method For Enhancing MC Sampling of Compact Configurations For Protein Docking Problem

Fast and Flexible Geometric Method For Enhancing MC Sampling of Compact Configurations For Protein Docking Problem

Abstract

EASAL (efficient atlasing and sampling of assembly landscapes) is a recently reported geometric method for representing, visualizing, sampling and computing integrals over the potential energy landscape tailored for small molecular assemblies. EASAL’s efficiency arises from the fact that small assembly landscapes permit the use of so-called Cayley parameters (inter-atomic distances) for geometric representation and sampling of the assembly configuration space regions; this results in their isolation, convexification, customized sampling and systematic traversal using a comprehensive topological roadmap.

By sampling the assembly landscape of 2 TransMembrane Helices, with short-range pair-potentials, this paper demonstrates that EASAL provides reasonable coverage of crucial but narrow regions of low effective dimension with much fewer samples and computational resources than traditional MonteCarlo or Molecular Dynamics based sampling. Promising avenues are discussed, for combining the complementary advantages of the two methods.

Additionally, since accurate computation of configurational entropy and other integrals is required for estimation of both free energy and kinetics, it is essential to obtain uniform sampling in appropriate cartesian or moduli space parameterization. EASAL’s flexibility is demonstrated with a variety of sampling distributions, from Cayley sampling skewed towards lower energy regions, to uniform Cartesian sampling at the two ends of the spectrum.

American Chemical Society, Supramolecular Assembly, Molecular, Configuration Space, Free Energy Landscape, Configurational Entropy, Algorithmic Performance Guarantees, Atlas, Geometric Constraints, Optimal Parameterization, Cayley Parameters, Convex Programming, Distance Geometry, Structural Rigidity, Stratification, Semi-Algebraic sets

University of Florida] CISE department, University of Florida, Gainesville, FL 32611-6120 Carnegie Mellon University] Chemistry Department, Carnegie Mellon University, Pittsburgh, PA 15213 University of Florida] CISE department, University of Florida, Gainesville, FL 32611-6120 \phone+1 352 392 1200 \fax+1 352 392 1220 Carnegie Mellon University] Chemistry Department, Carnegie Mellon University, Pittsburgh, PA 15213

1 Introduction

The problem of protein-protein assembly is an area of active research and development. [Janin(2010), Ritchie(2008), Vajda and Kozakov(2009), Fernández-Recio and Sternberg(2010)]. Currently the most successful approach to docking two proteins together uses a direct exhaustive search of the whole configurational space. It is usually performed in the inverse space for translational moves on a cubic grid (using the fast Fourier Transform (FFT)). The FFT algorithm makes the translation search very efficient, but it has to be repeated for all orientations of a molecule being docked resulting in thousands of FFT operations, each comprising millions of translations. The majority of the available software for molecular docking can only deal with two proteins (i.e. a dimer). A set of docking procedures that is based on the shape recognition and image segmentation techniques of Computer Vision. PatchDock algorithm starts with a smooth representation of the molecular surface as a set of discrete points, but the set is restricted to critical points (convex caps, toroidal belts and concave pits), and the normal vectors at these points. A geometric hashing algorithm performs a very fast matching of the caps and pits with opposing normal directions on two surfaces, and collects all the rigid-body solutions that are geometrically acceptable after rejecting volume overlaps. Other geometric criteria can easily be incorporated in the procedure, for instance molecular symmetry in SymmDock, which allows building models of oligomeric proteins with up to twenty subunits.

It is more common that problem of computational protein assembly and folding is typically approached by the third class of methods: molecular dynamics (MD) or Monte-Carlo (MC) simulations. Both these techniques sample the system’s configurational space with probabilities corresponding to the Boltzmann distribution. Ideally, such simulation can produce a probability density function for the whole phase space of the system. An absolute free energy and relative probabilities of various states in the phase space can then be estimated. However, in practice, systems of interest are rarely ergodic, in a sense that their energy landscapes consist of an unknown number of energy minima separated by large energy barriers, moreover, in tightly packed molecular systems, the majority of phase space has high energy and low probability. In such conditions, most sampling procedures have tendency to oversample local basins of the energy function, and have difficulty crossover between low energy basins. This results in uncertainty in both, i) relative probabilities of visited states, as well as ii) in whether the range of low energy configurations visited during simulations is ever complete. Despite recent progress all currently existing methods of protein assembly are extremely computationally expensive.

To overcome the problem of incomplete sampling of relevant phase space when modeling protein assembly we apply a recently introduced approach called EASAL (Efficient Atlasing of Assembly Landscapes), for representing, visualizing, sampling and computing integrals over the potential energy landscape tailored for small molecular assemblies. EASAL’s efficiency arises from the fact that small assembly landscapes permit the use of so-called Cayley parameters (inter-atomic distances) for geometric representation and sampling of constant potential energy regions of the assembly configuration space. This results in the isolation, convexification, customized sampling and systematic traversal of regions using a comprehensive topological roadmap, providing reasonable coverage of low potential energy, but narrow regions of low effective dimension, with surprisingly few samples.

Additionally, since accurate computation of configurational entropy and other integrals is required for estimation of both free energy and kinetics, it is essential to obtain uniform sampling in appropriate cartesian or moduli space parameterization. EASAL’s flexibility permits a variety of sampling distributions, from Cayley sampling skewed towards lower energy regions, to uniform Cartesian sampling at the two ends of the spectrum. The theory and algorithms behind EASAL appears in [Ozkan and Sitharam(2011), Ozkan et al.(2014)Ozkan, Wu, Peters, and Sitharam, Ozkan and Sitharam(2014)] and is sketched in the next section. Preliminary extensions of EASAL beyond dimer assemblies and viability of using EASAL for atlasing wide variety of assembly systems including clusters of up to 12 assembling spherical particles is demonstrated in [Ozkan et al.(2014)Ozkan, Wu, Peters, and Sitharam]. The software implementation of EASAL (architecture and functionalities) is reported in [Ozkan et al.(2014)Ozkan, Pence, Wu, Baker, Willoughbyand, Peters, and Sitharam]; the software has recently been employed to predict crucial intermonomeric interface interactions for viral capsid assembly [Wu et al.(2012)Wu, Ozkan, Bennett, Agbandje-Mckenna, and Sitharam, Wu et al.(2014)Wu, Ozkan, Bennett, Agbandje-McKenna, and Sitharam].

By sampling the assembly landscape of 2 TransMembrane Helices, with short-range pair-potentials, our result demonstrates that EASAL provides reasonable coverage of crucial but narrow regions of low effective dimension with much fewer samples and computational resources than traditional MonteCarlo or Molecular Dynamics based sampling. Promising avenues are discussed, for combining the complementary advantages of the two methods.

2 Materials and Methods

2.1 Metropolis Monte Carlo method.

Metropolis Monte Carlo (MC) is an importance sampling method that generates an ensemble according to the Boltzmann factor. MC simulations were performed in order to explore the conformational space that is accessible by translational and rotational random steps of rigid helixes. In all simulations reported protein transmembrane alpha-helixes were held rigid. The rigid body MC simulator was implemented in HARLEM program.

2.2 Move Sets.

Trial conformations of rigid bodies are generated by a basic move set of a small translational and rotational displacement. The maximum step size for both type of displacements are to follow a well known criterion of the MC acceptance ratio. This criterion establishes that for an optimum sampling the acceptance ratio should fluctuate around 50 % of trial move should be accepted. However, from our experience this high ratio of acceptance will generate random structures with small conformational fluctuations. In order to overcome high energy barriers we implemented a move set based on the exponential distribution of the maximum step size of translation and rotation. The purpose of this move set is to randomly generate trial conformations with higher probability to jump over the energy barrier.

Moreover, analyzing MC trajectories we found that low energy conformations of different energy basins are conformational different by the rotation around the principal axis of a rigid helix. Using this information we included in our MC implementation a move set in which a helix is randomly rotated around its principal axis.

Description of the level of representation, energy terms.

2.3 Scoring Energy

The intermolecular energy of a structure model in this work is calculated as

(1)

Where is a pairwise distance-dependent potential of mean force of interaction between residues and , is the steric overlap energy, is the energy term that constrains the TM helixes in the membrane plane and prevents the helix mass center to sample farther than a radius of 15 Å. accounts for the interaction of amino acids in different regions in the lipid bilayer. A description of each of these terms follows:

2.4 Background: Theory underlying EASAL

This subsection gives background from [Ozkan and Sitharam(2011), Ozkan et al.(2014)Ozkan, Wu, Peters, and Sitharam] for the theoretical underpinnings of EASAL’s key features - geometrization, stratification and convexification using Cayley parameters - culminating in the concept of an atlas of an assembly configuration space. We begin with a description of the input to EASAL. An assembly system consisting of the following.

  • A collection of rigid molecular components, drawn from a small set of rigid component types (often just a single type). Each type is a is specified as the set of positions of atom-centers, in a local coordinate system. In many cases, an atom-center could be the representation for the average position of a collection of atoms in a residue. Note that an assembly configuration is given by the positions and orientations of the entire set of rigid molecular components in an assembly system, relative to one fixed component. Since each rigid molecular component has 6 degrees of freedom, a configuration is a point in dimensional Euclidean space. The maximum number of atom-centers in any rigid molecular component is denoted .

  • The potential energy is specified using Lennard-Jones (which includes Hard-Sphere) pairwise potential energy functions. The pairwise Lennard-Jones term for a pair of atoms, and , one from each component, is given as a function of the distance between and ; The function is typically discretized to take different constant values on 3 intervals of the distance value : Typically, is the so-called Van der Waal or steric distance given by ”forbidden” regions around atoms and And is a distance where the interaction between the two atoms is no longer relevant. Over these 3 intervals respectively, the Lennard-Jones potential assumes a very high value , a small value , and a medium value All of these bounds for the intervals for , as well as the values for the Lennard-Jones potential on these intervals are specified constants as part of the input to the assembly model. These constants are specified for each pair of atoms and , i.e., the subscripts are necessary. The middle interval is called the well. In the special case of Hard Spheres, .

  • A non-pairwise component of the potential energy function in the form of global potential energy terms that capture other factors including the implicit solvent (water or lipid bilayer membrane) effect [Lazaridis and Karplus(1999), Lazaridis(2003), Im et al.(2003)Im, Feig, and Brooks]. These are specified as a function of the entire assembly configuration.

It is important to note that all the above potential energy terms are functions of the assembly configuration.

Note that the input to the assembly usually specifies the configurations of interest i.e., a region of the configuration space, often specified as a collection of atom pairs ”of interest” with the understanding that the only configurations of interest are those in which at least one of these pairs in occupy their corresponding Lennard-Jones well. Clearly . In addition, we assume the desired level of refinement of sampling is specified as a desired number of sample configurations .

Geometrization

Observe that for the purposes of this paper stated in Section 1, it is sufficient to view the assembly landscape as a union of constant potential energy regions. Thus an assembly system can alternatively be represented as a set of rigid molecular components drawn from a small set of types, together with assembly constraints, in the form of distance intervals. These constraints define feasible configurations (where the pairwise inter-atoms distances are larger than , and any relevant tether and implicit solvent constraints are satisfied). The set of feasible configurations is called the assembly configuration space. The active constraints of a configuration are those atom-pairs in the configuration that lie in the Lennard-Jones well. An active constraint region of the configuration space is a region consisting of all configurations where a specified (nonempty) set of constraints is active, i.e, those Lennard-Jones inter-atom distances between atoms and lie in their corresponding wells, i.e, the interval .

Stratification, Active Constraint Graphs

Consider an assembly configuration space of rigid components, defined by a system of assembly constraints. The configuration space has dimension , the number of internal degrees of freedom of the configurations since a rigid object in Euclidean 3-space has rotational and translational degrees of freedom. For , this dimension is at most and in the presence of two active constraints, it is at most .

A Thom-Whitney stratification of the configuration space (see Fig. 3) is a partition of the space into regions grouped into strata of that form a filtration , . Each is a union of nonempty closed active constraint regions where the set of pairwise constraints are active, meaning each pair in lies in its corresponding Lennard-Jones well, and the constraints are independent (i.e., no proper subset of these constraints generically implies any other constraint in the set). Each active constraint set is itself part of at least one, and possibly many, hence -indexed, nested chains of the form . See Figures 4 and 3(left). These induce corresponding reverse nested chains of active constraint regions : Note that here for all , is closed and effectively dimensional; by which we mean that if all the Lennard-Jones wells that define the active constraint set narrowed to zero width (i.e, if they degenerated to a Hard-Sphere potentials), then the active constraint region would be dimensional.

{subfigure}

[b]0.40 {subfigure}[b]0.57

Figure 1: stratification of assembly
Figure 2: top: Cayley points, bottom: Cartesian realizations
Figure 3: (a) Stratification: of assembly constraint system with parameters 4 (red), 3 (yellow), 2 (green), 1 (white), 0 (purple). Strata of each dimension for the assembly constraint system visualized in the lower right inset are shown as nodes of one color and shape in a directed acyclic graph. Each node represents an active constraint region. Edges indicate containment in a parent region one dimension higher. (b) top: Realizable Cayley points (distance values) corresponding to one node in (a). Note a different use of color in the display of sample boxes in Cayley configuration space than in the stratification diagram. One Cayley point in the green group is highlighted. bottom: Three Cartesian realizations of the highlighted Cayley point. Each edge on a realization represents an active constraint graph and its chosen parameters.
Figure 4: Adding constraints, removing parameters until . top: Cartesian realizations with non-white segments: parameters and white segements constraints and bottom: activeConstraintGraph G yielding configurations with ever fewer free parameters as constraints are added one by one.

We represent the active constraint system for a region, by an active constraint graph (sometimes called contact graph) whose vertices represent the participating atoms (at least in each rigid component) and edges representing the active constraints between them. Between a pair of rigid components, there are only a small number of possible active constraint graph isomorphism types since there are at most contact vertices. For the case of these are listed in Figure 5, and for higher a partial list appears in Figure 6.

There could be regions of the stratification of dimension whose number of active constraints exceeds , i.e. the active constraint system is overconstrained, or whose active constraints are not all independent. Dependent constraints diminish the set of realizations. For entropy calculations, these regions should be tracked explicitly, but in the present paper, we do not consider these overconstrained regions in the stratification. Our regions are obtained by choosing any independent active constraints.

Figure 5: All active constraint graphs
Figure 6: All non-isomorphic active constraint graphs with 6 vertices and 12 edges.

Convex Representation of Active Constraint Region and Atlas

A new theory of Convex Cayley Configuration Spaces (CCCS) recently developed by the author [Sitharam and H.Gao(2010)] gives a clean characterization of active constraint graphs whose configuration spaces are convex when represented by a specific choice of so-called Cayley parameters i.e., distance parameters between pairs of atoms (vertices in the active constraint graph) that are inactive in the given active constraint region (non-edges in the active constraint graph). See Figure 14. Such active constraint regions are said to be convexifiable, and the corresponding Cayley parameters are said to be its convexifying parameters. See Figures 9 9

In general, the active constraint regions for an active constraint graph , can be entirely convexified after ignoring the remainder of the assembly constraint system, namely the atom markers not in and their constraints. Fig. 14 The true active constraint region is subset of , however the cut out regions are also defined by active constraints, hence they, too, could be convexified. See Figures 9, 9.

{subfigure}

0.45   {subfigure}0.45

Figure 7: Cayley charts of dimensions 1,2,3 attached to nodes.
Figure 8: Cartesian realizations of dimensions 1,2,3 attached to nodes.
Figure 9: Nested chains for one region of the atlas, i.e. nodes and paths in the directed acyclic graph of the stratification containing a 2d contraint region. center, green: a d active constraint region. left, red and yellow: 4d and 3d parent regions containing the 2d region. right: 1d and 0d child regions. The G and chart are displayed next to each region. (a) The -dimensional (exact, convex) chart in the center has a hole due to infeasible configurations also defined by Cayley parameter ranges, hence convex. Also, due to choice of different Cayley parameters, the same 2-dimensional region appears, without hole, in the -dimensional parent charts as orange boxes top left, pink boxes middle left and red-orange boxes lower left; green boxes on right: 1-dimensional subregions. (b) Three grey fans attach the Cartesian realizations to their nodes as separate sweeps for different chirality of a region (the blue molecular unit is fixed without loss of generality).

When a constraint (edge ) not in becomes active (at a configuration in ), defines a child active constraint region containing . This new region belongs to the stratum of the assembly configuration space that is of one lower dimension (Definition 2.4.2) and defines within a boundary of the smaller, true active constraint region . We can still choose the chart of as tight convex chart for , but now region has an exact or tight convex chart of its own. Then the configurations in the region have lower potential energy since the configurations in that region lie in one more Lennard-Jones well. Hence they should be carefully sampled in free energy and entropy computations although the region has one lower effective dimension (e.g, represents a much narrower boundary channel). However, sampling in the larger parent chart of (of one higher effective dimension) often does not provide adequate coverage of the narrow boundary region . For example, Fig. 14 shows that providing a separate chart for each active constraint region can reveal additional realizations at the same level of sampling.

{subfigure}

3.3in   {subfigure}1.4in
{subfigure}1.5in {subfigure}2.25in

Figure 10:
Figure 11:
Figure 12:
Figure 13:
Figure 14: Top Left: atlas region showing interiors and boundaries sampled in its convexifying Cayley parameters; boundary/child regions sampled in their own Cayley parameters and mapped back to the parent region’s Cayley parameters (note increase in samples). Top Right: boundary/child regions sampled in their own Cayley parameters shown as sweeps around grey reference (toy) helix. Bottom Left: union of boundary regions sampled in parent’s Cayley parameters, shown as sweep around blue reference helix (notice (b) is bigger) Bottom Right: sweep of one of the boundary regions sampled in parent’s Cayley parameters is shown in red around gray reference helix; the sampling misses the other colored configurations in the same boundary region, obtained by sampling in its own Cayley parameters.

The Atlas of an assembly configuration space is a stratification of the configuration space into convexifiable regions. In [Ozkan and Sitharam(2011)], we have shown that molecular assembly configuration spaces with 2 rigid molecular components have an atlas. The software EASAL (Efficient Atlasing and Search of Assembly Landscapes) efficiently finds the stratification, incorporates provably efficient algorithms to choose the Cayley parameters [Sitharam and H.Gao(2010)] that convexify an active constraint region, efficiently computes bounds for the parametrized convex regions [Chittamuru(2010)], and converts the parametrized configurations into standard cartesian configurations [Peters et al.(2004)Peters, Fan, Sitharam, and Zhou].

EASAL variants and Sampling Distributions

EASAL’s flexibility is demonstrated with a variety of sampling distributions in Cayley space, which translates to sampling variants in Cartesian space: skewed towards lower energy regions at one end of the spectrum, towards higher energy regions at the other end of the spectrum and explicit adjustment to uniform Cartesian sampling at the middle. Since the basic EASAL (uniform sampling in Cayley) is already skewed towards lower energy regions in Cartesian, we would expect the second option to approximate the uniform Cartesian without explicit adjustment.

Accordingly, we have tested the following variants of the basic EASAL. EASAL3 is designed to have more samples closer to the boundaries of all active constraint regions; i.e, it uses step size linearly proportional with the Cayley parameter value. EASAL2 is designed to have more samples in the interiors; i.e, it uses step size inversely proportional with the Cayley parameter value. EASAL-Jacobian uses a sophisticated Cayley sampling method [Ozkan and Sitharam(2014)] to force uniform sampling in Cartesian. It recursively and adaptively computes the next Cayley step size and direction using an interative computation of the Jacobian of the Cartesian-Cayley map to achieve this goal. See Figure 23.

{subfigure}

[b]0.24 {subfigure}[b]0.24 {subfigure}[b]0.24 {subfigure}[b]0.24 {subfigure}[b]0.24 {subfigure}[b]0.24 {subfigure}[b]0.24 {subfigure}[b]0.24

Figure 15: EASAL sampling in Cayley
Figure 16: EASAL2 sampling in Cayley
Figure 17: EASAL3 sampling in Cayley
Figure 18: EASAL-Jacobian sampling in Cayley
Figure 19: EASAL sampling in Cartesian
Figure 20: EASAL2 sampling in Cartesian
Figure 21: EASAL3 sampling in Cartesian
Figure 22: EASAL-Jacobian sampling in Cartesian
Figure 23:

3 Experimental Results

We first describe the experimental setup.

We use reduced system with only 11 residues per helix represented by 20 atoms to reduce the size of the sampling space. See  24 for reduced TM helix.

Figure 24: Easal screenshot: displays the reduced TM helices

3.1 Grid Generation

  • The Grid is uniform along the Cartesian configuration space.

  • The bounds of the Cartesian configuration space for both Grid and EASAL are:
    -20 to 20 Angstroms
    -3.5 to 3.5 Angstroms

  • The angle parameters are described in Euler angles representation (Cardan angle ZXZ).
    to

  • Inter principal-axis angle degrees where where and are the principal axis of each rigid body. I.e. and are eigenvectors of the inertia matrix.

  • Additionally, there is the pairwise distance lower bound criterion:
    For all atom pairs belonging to different rigid molecular components, where and are residues, is the distance for residues and , and are the radius of residue atoms and .

  • Million Grid configurations are generated in this manner.

  • Over of them are discarded to ensure at least one pair , i.e, an active constraint and to eliminate collisions. About Million Grid configurations remain.

3.2 Epsilon Coverage

Ideally, we would expect each Grid point to be covered by at least one EASAL sample point that is situated in an -cube centered around a Grid point with a range of in each of the 6 dimensions.

  • The value of is computed as follows: = ( of Grid points / # of Easal points)

  • We set to be since Grid points are by definition a discrete number of steps from each other.

  • In order to compute the coverage, we assign each EASAL/MC sample to its closest Grid point. Call those Grid points EASAL/MC-mapped Grid points. We say that a Grid point is covered by EASAL/MC if there is at least one EASAL/MC -mapped Grid point within the -cube centered around

3.3 MultiGrid

All variants of EASAL are designed to isolate and sample each active constraint region with a variety of sampling distributions in Cayley space. In addition, EASAL-Jacobian samples each such region uniformly in Cartesian. Yet, when we combine all such regions, those regions where more pairs of atoms are in their Lennard-Jones wells (regions with more active constraints) will have denser sampling. i.e. EASAL tends to oversample the lower energy regions. This is a positive feature of EASAL that we preserve in EASAL-Jacobian.

Since the 5D strata of the atlas generated by both versions of EASAL would sample a configuration that has active constraints times (once for each of the 5D active constraint regions in which the configuration lies), the meaningful comparison would require similarly replicating such configurations in the Grid, which we call the MultiGrid.

3.4 Computational Time/Resources

MC has around 100 million configurations. MultiGrid/Grid have around 12/6 million configurations. EASAL1/2/3 have around 100K/40K/30K configurations respectively. EASAL_Jacobian has around 1 million configurations. Since EASAL is light in terms of resources and can generate reasonable atlases with very few samples, any undersampling caused by that should be taken into account during comparisons.

The specification of the processor for MC is I5-2540 and for EASAL is Intel Core 2 Quad CPU Q9450 @ 2.66GHz x 4 with Memory:3.9 GiB.

EASAL1 took 3 hours 8 minutes(188 minutes). EASAL2 took 4 hours 24 minutes(264 minutes). EASAL3 took 10 hours 20 minutes(620 minutes). EASAL-Jacobian took 14 hours 22 minutes(862 minutes).

Now we are ready to show the results.

3.5 Coverage Results

sampling method EASAL1 EASAL2 EASAL3 EASAL_Jacobian MC
coverage percentange
Table 1: Coverage Results
{subfigure}

[b].4 {subfigure}[b].4 {subfigure}[b].4 {subfigure}[b].4 {subfigure}[b].4

Figure 25: MC
Figure 26: EASAL_Jacobian
Figure 27: EASAL1
Figure 28: EASAL2
Figure 29: EASAL3
Figure 30: horizontal axis: = the # of Easal points that lay in an -cube. vertical axis: the # of regions (having Easal-mapped points inside of it )
{subfigure}

[b].4 {subfigure}[b].4 {subfigure}[b].4 {subfigure}[b].4

Figure 31: EASAL_Jacobian
Figure 32: EASAL1
Figure 33: EASAL2
Figure 34: EASAL3
Figure 35: horizontal axis: = the # of Easal points that lay in an -cube. vertical axis: the # of regions (having Easal-mapped points inside of it )

EASAL1 and EASAL2 have more than coverage (See  1). EASAL has a dense coverage after we expand cube range to be as seen by fig. 35. Hence EASAL is able to have almost full coverage for where is in the range (0, 1).

We have located those uncovered regions on projection for -cube by Fig. 43. They are populated around the outer circle where two molecules are far from each other that cause increase in in Equation 1. Hence we are assuming them not to be low energy regions. Hence they are most probably the regions out of interest.

{subfigure}

[b].35 {subfigure}[b].35 {subfigure}[b].35 {subfigure}[b].35 {subfigure}[b].35 {subfigure}[b].35 {subfigure}[b].35

Figure 36: MC
Figure 37: Grid
Figure 38: MultiGrid
Figure 39: EASAL_Jacobian
Figure 40: EASAL1
Figure 41: EASAL2
Figure 42: EASAL3
Figure 43: horizontal axis: var discr, vertical axis: var discr
color code: =the # of points that lay in an -cube centered around Grid point var discr, var discr

Yet, EASAL2 has very similar coverage as MC, even though it has very less number of configurations. In a short time with few samples, EASAL2 covers most of the Grid.

3.6 Comparison of MC - Grid - EASAL

Next we look at projection on and where low energy regions are located. For the following plots, we have divided Grid into larger regions i.e., ( by ) instead of original ( by ) subregions. I.e. a region with value of in has index , in has index and so on. The and axes of following plots denote the ’indices’ for these and ranges.

Note. During any ratio computations, to avoid division by zero, we compute ratios only in the ’accepted’ region where there is guaranteed to be at least one Grid point.

For the following plots, there are no samples on the outer rim since Evol prevents the helix mass center to sample further than a radius of 15 Å. There are also no samples in the center of the plots because in these regions the helix mass center of the molecules are so close to each other hence forces pairwise distances between residues that causes elimination by steric constraints. The feasible regions that are close to center are boundary regions that possibly satisfy more contacts. Each additional active constraint (contact) essentially does drop the energy hence those boundary regions can be considered as lower energy regions. Hence those lower dimensional regions do have to be accurately sampled.

MultiGrid, which is effectively a weighted version of Grid according to number of contacts, shed light on those regions that satisfy more contacts. As seen by Fig. 51 a) and c) MultiGrid has denser sampling insider regions where contacts happens to be more (low energy regions) than Grid. EASAL is similar to MultiGrid in that sense. Whereas, MC is very similar to Grid as seen by Fig. 51 a) and b). MC has denser sampling at outer surrounding regions compared to MultiGrid see Fig. 57 a).

Hence, EASAL2 represents low energy regions better than MC since the regions with low energy basins have more probability to be sampled by EASAL thanks to higher refinement at the lower dimensions.

{subfigure}

[b].38 {subfigure}[b].38 {subfigure}[b].38 {subfigure}[b].38 {subfigure}[b].38 {subfigure}[b].38 {subfigure}[b].38

Figure 44: Grid %
Figure 45: MC %
Figure 46: MultiGrid %
Figure 47: EASAL_Jacobian %
Figure 48: EASAL1 %
Figure 49: EASAL2 %
Figure 50: EASAL3 %
Figure 51: -axis : the percentage of samples (per region of XY)
The way ’number of samples’ computed is:
- sum all samples of subregions with a specific - divide it to the total number of samples
{subfigure}

[b].4 {subfigure}[b].4 {subfigure}[b].4 {subfigure}[b].4 {subfigure}[b].4

Figure 52: MC/MultiGrid
Figure 53: EASAL_Jacobian/MultiGrid
Figure 54: EASAL1/MultiGrid
Figure 55: EASAL2/MultiGrid
Figure 56: EASAL3/MultiGrid
Figure 57: -axis : ratio (per specific value)
The way ratio is computed is:
- sum all (*) samples of subregions per specific value
- then do the same thing for MultiGrid
- then divide those 2 summation.

3.7 Distribution over Atlas Nodes

{subfigure}

[b].38 {subfigure}[b].38 {subfigure}[b].38 {subfigure}[b].38 {subfigure}[b].38 {subfigure}[b].38

Figure 58: Atlas MC
Figure 59: Atlas MultiGrid
Figure 60: EASAL_Jacobian
Figure 61: EASAL1
Figure 62: EASAL2
Figure 63: EASAL3
Figure 64: vertical-axis : index of contact atom in helixB, horizontal-axis : index of contact atom in helixA
Color code : The distribution ratio per node is computed as: ’number of samples withinGlobalLimits withSterics in one node’ / ’number of samples withinGlobalLimits withSterics in all nodes’

3.8 Coverage Results for Lower Dimensional Regions

We would like to see the lower dimesional sampling density through all these methods in order to see how they are good at covering the lower energy regions. Assume we are experimenting on a low dimensional region with i.e. contacts. This low dimensional region is populated through d regions. Then the ratio of number of samples of dimensional region over total number of samples in all d regions that lead to dimensional region of interest would give us how often this dimensional region is sampled.

For the experimental purpose, we have picked a random dimensional region with contacts , , where a contact represented by atom indices from helices.

The following  2 shows the results of the frequency of sampling for the chosen lower dimensional region.

sampling method EASAL1 EASAL2 EASAL3 EASAL_Jacobian MC Grid MultiGrid
ratio percetange
Table 2: The ratio is computed as the # of d samples that satisfies ’all’ contacts / the # of samples that satisfy ’any’ contacts

For all EASAL versions, the denominator is the total number of samples of three d nodes each labeled by one of contacts. The numerator is the sum of d samples per each d node.

For Grid and MC, the denominator is the total number of samples that satisfies any contacts, the numerator is the number of samples that satisfies all contacts. To compute that number, we have read all samples from provided trajectories and then identify the samples that satisfies the specific set of contacts.

For MultiGrid, the denominator and numerator are same as Grid however each sample that satisfy any contacts replicated that many times.

Among all those methods, EASAL2 covers more lower dimensional regions even though it has very low number of samples compared to MC/Grid/MultiGrid.

The MultiGrid also have a good coverage ratio, however to compute MultiGrid ratio, the number of samples in numerator is multiplied exactly times since they all satisfy all contacts i.e. each sample that contributes to numerator replicated times. Since it is not producing different samples hence not finding different lower dimensional regions, it is not considered exactly as sampling method. We keep MultiGrid for meaningful comparison purpose.

Besides, when the full EASAL is run, then a dimensional ( contact) sample will potentially appear in -dimensional regions and should be appropriately weighted.

For the case of , d samples will be generated by d nodes, d nodes and d node, regions in total. The current ratios for EASAL in  2 considers only d regions. Hence when the full EASAL is run the ratio will increase and be multiplied times.

4 Discussion

By using a large number of samples, we can sample the grid by brute force. Over 90% are typically discarded since they are high energy configurations (no single active constraint). Also, low energy (more than 1 active constraint) has almost no chance of being sampled.

In the case of MC, the simulation has tendency to sample around a localized sub-set of low energy configurations that are located close by. To ensure coverage, one needs to compensate for the lack of ergodicity by computationally intensive use of a large number of samples.

EASAL has flexibility in sampling distributions, but in all cases guarantees reasonable coverage of low energy regions even if they have low effective dimension. It tends to oversample these lower energy regions (i.e, regions where more pairs of atoms are in their Leonard-Jones well). Hence EASAL can help MMCS to find high energy barriers and force the MC simulations to move to the locations of low energy basins. Besides, Atlas regions with lower energy values where MC did not sample can be used as a seed for MC in order to generate the free energy profile for the whole configuration space.

More precisely, we expect EASAL to be used to help evaluate/improve MC with the following path.

  1. Run a coarse-grained version of MC with the usual energy function and a relatively small number of samples

  2. Run EASAL with constraints extracted from the MC energy function, verify that EASAL space encompasses the relevant space

  3. List description of Atlas regions where MC did not sample, and give an estimate of the volume of the regions.

  4. Compute energy on these regions, and if low, then run MC seeded on these (lower volume) regions (from which larger volume regions can be reached), in order to generate the free energy profile for the whole configuration space.

  5. Compare new trajectories with old MC trajectories.

5 Conclusion

Our results in sampling the assembly landscape of 2 TransMembrane Helices, with short-range pair-potentials, demonstrate that variants of EASAL provide good coverage of narrow regions of low potential energy and low effective dimension with much fewer samples and computational resources than traditional MonteCarlo or Molecular Dynamics based sampling. Combining the complementary advantages of the two methods can significantly improve the current state of the art of free energy and other integral computations specifically for small assemblies.

EASAL is tailored for such assemblies and can be used to improve accuracy and ergodicity guarantees for Monte Carlo trajectories. It can also help explain the behavior of MC trajectories by using EASAL to infer geometric and topological features of the configuration space.

EASAL can additionally be used to evaluate and complement MC’s performance. EASAL could help MMCS to find high energy barriers and force the simulations to move to the locations of low energy basins. Additionally, lower energy regions located by EASAL can be used as seeds for MC trajectories, to speed up traversal of the entire configuration space. To validate these hypotheses, an MC simulation enhanced with EASAL will have to be compared with a standard MC simulation in terms of accuracy as well as resource-intensiveness.

References

  1. Janin, J. Mol. BioSyst. 2010, 6, 2351–2362.
  2. Ritchie, D. W. Current Protein and Peptide Science 2008, 9, 1–15, doi:10.2174/138920308783565741.
  3. Vajda, S.; Kozakov, D. Current Opinion in Structural Biology 2009, 19, 164 –170.
  4. Fernández-Recio, J.; Sternberg, M. J. E. Prot.Struct.Func.Bioinfo. 2010, 78, 3066, 3065.
  5. Ozkan, A.; Sitharam, M. EASAL: Efficient Atlasing, Analysis and Search of Molecular Assembly Landscapes. Proceedings of the ISCA 3rd International Conference on Bioinformatics and Computational Biology, 2011.
  6. Ozkan, A.; Wu, R.; Peters, J.; Sitharam, M. (on arxiv).
  7. Ozkan, A.; Sitharam, M. (on arxiv).
  8. Ozkan, A.; Pence, J.; Wu, R.; Baker, T.; Willoughbyand, J.; Peters, J.; Sitharam, M. (on arxiv).
  9. Wu, R.; Ozkan, A.; Bennett, A.; Agbandje-Mckenna, M.; Sitharam, M. Robustness Measure for an Adeno-associated Viral Shell Self-assembly is Accurately Predicted by Configuration Space Atlasing Using EASAL. Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine, New York, NY, USA, 2012; pp 690–695.
  10. Wu, R.; Ozkan, A.; Bennett, A.; Agbandje-McKenna, M.; Sitharam, M. (on arxiv).
  11. Lazaridis, T.; Karplus, M. Proteins 1999, 35, 133–152.
  12. Lazaridis, T. Proteins 2003, 52, 176–192.
  13. Im, W.; Feig, M.; Brooks, C. L. Biophysical Journal 2003, 85, 2900–2918.
  14. Sitharam, M.; H.Gao, Discrete and Computational Geometry 2010.
  15. Chittamuru, U. Efficient Iterative algorithm for bounding and sampling the Cayley configuration space of partial 2-trees in 3D; M.S. Thesis University Of Florida, 2010.
  16. Peters, J.; Fan, J.; Sitharam, M.; Zhou, Y. Elimination in generically rigid 3D geometric constraint systems. Proceedings of Algebraic Geometry and Geometric Modeling, Nice, 2004; pp 27–29.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
""
The feedback must be of minumum 40 characters
Add comment
Cancel
Loading ...
108125
This is a comment super asjknd jkasnjk adsnkj
Upvote
Downvote
""
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters
Submit
Cancel

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test
Test description