Physical model of the genotypetophenotype map of proteins
Abstract
How DNA is mapped to functional proteins is a basic question of living matter. We introduce and study a physical model of protein evolution which suggests a mechanical basis for this map. Many proteins rely on largescale motion to function. We therefore treat protein as learning amorphous matter that evolves towards such a mechanical function: Genes are binary sequences that encode the connectivity of the amino acid network that makes a protein. The gene is evolved until the network forms a shear band across the protein, which allows for longrange, soft modes required for protein function. The evolution reduces the highdimensional sequence space to a lowdimensional space of mechanical modes, in accord with the observed dimensional reduction between genotype and phenotype of proteins. Spectral analysis of the space of solutions shows a strong correspondence between localization around the shear band of both mechanical modes and the sequence structure. Specifically, our model shows how mutations are correlated among amino acids whose interactions determine the functional mode.
pacs:
87.14.E, 87.15.v, 87.10.eI Introduction: proteins and the question of the genotypetophenotype map
DNA genes code for the threedimensional configurations of amino acids that make functional proteins. This sequencetofunction map is hard to decrypt since it links the collective physical interactions inside the protein to the corresponding evolutionary forces acting on the gene Koonin et al. (2002); Xia and Levitt (2004); Dill and MacCallum (2012); Zeldovich and Shakhnovich (2008); Liberles et al. (2012). Furthermore, evolution has to select the tiny fraction of functional sequences in an enormous, highdimensional space Povolotskaya and Kondrashov (2010); Keefe and Szostak (2001); Koehl and Levitt (2002), which implies that protein is a nongeneric, informationrich matter, outside the scope of standard statistical methods. Therefore, although the structure and physical forces within a protein have been extensively studied, the fundamental question as to how a functional protein originates from a linear DNA sequence is still open, in particular, how the functionality constrains the accessible DNA sequences.
To examine the geometry of the sequencetofunction map, we devise a mechanical model of proteins as amorphous learning matter. Rather than simulating concrete proteins, we construct a model which captures the hallmarks of the genotypetophenotype map. The model is simple enough to be efficiently simulated to gain statistics and insight into the geometry of the map. We base our model on the growing evidence that largescale conformational changes – where big chunks of the protein move with respect to each other – are central to function Koshland (1958); HenzlerWildman et al. (2007); Savir and Tlusty (2007); Schmeing et al. (2009); Savir and Tlusty (2010); Huse and Kuriyan (2002); Savir and Tlusty (2013). In particular, allosteric proteins can be viewed as ‘mechanical transducers’ that transmit regulatory signals between distant sites Perutz (1970); Goodey and Benkovic (2008); Lockless and Ranganathan (1999); Ferreon et al. (2013).
Dynamics is essential to protein function, but it is hard to measure and simulate due to the challenging spatial and temporal scales. Nevertheless, recent studies suggest a physical picture of the functionallyrelevant conformational changes within the protein: Nanorheological measurements showed lowfrequency viscoelastic flow within enzymes Qu and Zocchi (2013), with mechanical stress affecting catalysis Joseph et al. (2014). Computation of amino acid displacement, by analysis of structural data, demonstrated that the strain is localized in 2D bands across allosteric enzymes Mitchell et al. (2016). We therefore take as a target function to be evolved in our protein such a largescale dynamical mode. Other important functional constraints, such as specific chemical interactions at binding sites, are disregarded here because they are confined to a small fraction of the protein. We focus on this mechanical function whose large scale, collective nature leads to longrange correlation patterns in the gene.
Our model includes essential elements of the genotypetophenotype map: the target mechanical mode is evolved by mutating the ‘gene’ that determines the connectivity in the amino acid network. During the simulated ‘evolution’, mutations eventually divide the protein into rigid and ‘floppy’ domains, and this division enables largescale motion in the protein Gerstein et al. (1994). This provides a concrete map between sequence, configuration, and function of the protein. The computational simplicity allows for a massive survey of the sequence universe, which reveals a strong signature of the protein’s structure and function within correlation ‘ripples’ that appear in the space of DNA sequences.
Ii Model and results
We give here a summary and interpretation of our results, The appendix contains further details and explains choices we made in designing the model as close as possible to real proteins.
ii.1 Mechanical model of protein evolution
Our model is based on two structures: a gene, and a protein, which are coupled by the genotypetophenotype map.
The coarsegrained protein is an aggregate of amino acids (AAs), modeled as beads, with shortrange interactions given as bonds (Fig. 1). A typical protein is made of several hundred AAs, and we take . We layer the AAs on a cylinder, high wide, similar to dimensions of globular proteins. The cylindrical configuration allows for fast calculation of the low energy modes, and thereby fast evolution of the protein.
Each AA may connect to the nearest five AAs in the layer below, so that we get effective AA species, which are encoded as 5letter binary codons
To become functional, we want the protein to evolve to a configuration of AAs and bonds that can
transduce a mechanical signal from a prescribed input at the bottom of the cylinder to a prescribed output at its top
These largescale deformations are governed by the rigidity pattern of the configuration, which is determined by the connectivity of the AA network via a simple majority rule (Fig. 1) which we detail in Sect. A.3. The basic idea is that each AA can be either rigid or fluidized and that this rigidity state propagates upwards: Depending on the number of bonds and the state of other AAs in its immediate neighborhood, an AA will be rigidly connected, ‘shearable’, i.e., loosely connected, or in a pocket of less connected AAs within a rigid neighborhood
The model is easy to simulate: We start from a random gene of bits, and at each time step we flip a randomly drawn bit, thus adding or deleting a bond. In a zerotemperature Metropolis fashion, we keep only mutations which do not increase the distance from the target function, i.e., the number of errors between the state in the top row and the prescribed outcome. Note that, following the logics of biological evolution, the ‘fitness’ of the protein is only measured at its functional surface (e.g., where a substrate binds to an enzyme) but not in its interior.
Typically, after  mutations this inputoutput problem is solved (Fig. 2). Although the functional sequences are extremely sparse among the possible sequences, the small bias for getting closer to the target in configuration space directs the search rather quickly. Therefore, we could calculate as much as runs of the simulation which gave independent solutions of the evolutionary task.
Dimension measurement for the straight (S, top) and tilted (T, bottom) cases. independent functional configurations were found for the inputoutput problem. An estimate for the dimension of the solutions is the correlation length, the slope of the cumulative fraction of solution pairs as a function of distance. In configuration space (red), the distance is the number of AAs (out of 540) with a different rigidity state. The estimated dimension from distances is about (black line) for problem S and in problem T.
The sequence space is a 2550dimensional hypercube with sequences. Most distances are close to the typical distance between two random sequences (2550/2 = 1275), indicating a highdimensional solution space. An estimate for the dimension is (black line) for both S and T problems. The similarity of the dimensions in both cases suggests that these numbers are not specific to the problem.
ii.2 Dimensional reduction in the phenotypetogenotype map
Thanks to the large number of simulations, we can
explore vast regions of the genetic universe. That the sampling is welldistributed can be seen from
the typical intersequences distance, which is comparable with the universe diameter (Fig. 4).
This also indicates that the dimension of the solution set is high. Indeed, the observed dimension
of sequence space, as estimated following Procaccia (1988); Eckmann and Ruelle (1992), is practically infinite ()
On the other hand, very few among the configurations are solutions, owing to the physical constraints of contiguous rigid and shearable domains. As a result, when mapped to the configuration space, the solutions exhibit a dramatic reduction to a dimension of about 810 Grassberger and Procaccia (1983). This reduction between ‘genotype’ (sequence) and ‘phenotype’ (configuration, function) Savir et al. (2010); Shoval et al. (2012) is the outcome of physical constraints on the mechanical transduction problem. In the nearly random background of sequence space, these constraints are also manifested in longrange correlations among AAs on the boundary of the shearable region (Fig. 5 and Sect. B.4).
ii.3 Spectral analysis reveals correspondence of genotype and phenotype spaces
Spectral analysis of the solution set in both sequence and configuration spaces provides further information on the sequencetofunction map (Fig. 6). The sequence spectrum is obtained by singular value decomposition (SVD) of a matrix, whose rows are the binary genes of the solution set. The first few eigenvectors (EVs) with the larger eigenvalues capture most of the genetic variation among the solutions, and are therefore the collective degreesoffreedom of protein evolution (Fig. 6B). The EV is the average sequence, and the next EVs highlight positions in the gene that tend to mutate together to create the fluid channel.
The spectrum of the configuration space is calculated in a similar fashion by the SVD of a matrix, whose rows are the configurations of the solutions set (Fig. 6A). In the configuration spectrum, there are 810 EVs which stand out from the continuous spectrum, corresponding to the dimension shown in Fig. 3. Although the dimension of the sequence space is high (), there are again only 89 eigenvalues outside the continuous random spectrum.
These isolated EVs distill beautifully the nonrandom components within the mostlyrandom functional sequences. The EVs of both sequence and configuration are localized around the interface between the shearable and rigid domains. The similarity in number and in spatial localization of the EVs reveals the tight correspondence between the configuration and sequence spaces.
This duality is the outcome of the sequencetofunction map defined by our simple model: The geometric constraints of forming a shearable band within a rigid shell, required for inducing longrange modes, are mirrored in longrange correlations among the codons (bits) in sequence space. The corresponding sequence EVs may be viewed as weak ’ripples’ of information over a sea of random sequences, as only about 8 out of 2550 modes are nonrandom (0.3%). These information ripples also reflect the selfreference of proteins and DNA via the feedback loops of the cell circuitry Tlusty (2016).
It is instructive to note similarities and differences between the spectra. While the spectra of the configuration space and of the sequence space have a similar form — with a continuous, more or less random, part and a few isolated eigenvalues above it — the location of the random part is different: In the configuration case it is close to zero while in the sequence case it is concentrated at large values around .
The geometric interpretation is that the cloud of solution points looks like an 89 dimensional flat disk in the configuration case, while in the sequence space, it looks like a highdimensional almostspherical ellipsoid. The few directions slightly more pronounced of this ellipsoid correspond to the nonrandom components of the sequence. The slight eccentricity of the ellipsoid corresponds to the weak nonrandom signal above the random background. This also illustrates that the dimension of the sequence space is practically infinite, while in the configuration space it is comparable to the number of isolated eigenvalues.
We verified that the dimensional reduction and the spectral
correspondence depend very little on the details of the models. For
example, we examined a model with AA species instead of .
ii.4 Stability of the mechanical phenotype under mutations
First, we determine how many mutations lead to a destruction
of the solution (Fig. 8A).
About of all solutions are destroyed by just one random mutation. The exponentially decaying probability of surviving mutations signals that these mutations act quite independently. Fig. 8B which shows the location of these destructive mutations around the shearable channel
We have also studied the loci where two interacting mutations will destroy a solution (i.e., none of the two is by itself destructive). In most cases, the two mutations are close to each other, acting on the same site. The channel is less vulnerable to such mutations, but the twin mutations are evenly distributed over the whole rigid network (Fig. 8C).
ii.5 Fluid channel supports lowenergy shear modes
The evolved rigidity pattern supports lowenergy modes with strain localized in the floppy, fluid channel. We tested whether the evolved AA network indeed induces such modes (Fig. 9), by calculating the mechanical spectrum of a spring network in which bonds are substituted by harmonic springs. The shear motion of the network is characterized by the modes of , its elastic tensor. is the x curvature matrix in the harmonic expansion of the elastic energy , where is the vector of the 2D displacements of the AAs. has the structure of the network Laplacian multiplied by the x tensors of directional derivatives (see Sect. B.3, which is derived from (Chung and Sternberg, 1992, pp. 618–9)).
We traced the mechanical spectrum of the protein during the evolution of the fluidized channel (a shear band). We found that the formation of a continuous channel of less connected amino acids indeed facilitates the emergence of lowenergy modes of shear or hinge deformations (Fig. 9). The energy of such low modes nearly vanishes as the channel is close to completion. Similar deformations, where the strain is localized in a rather narrow channel, occur in real proteins, as shown in recent analysis of structural data Mitchell et al. (2016).
ii.6 Proteins can adapt simultaneously to multiple tasks
Our models were designed to trace the evolution of a mechanical function and show how it constrains the genotypetophenotype map, as shown above. Real proteins also evolve towards other essential functions, such as binding affinity and biochemical catalysis at specific binding sites. Here, we examine another important molecular trait, stability.
Many studies examine the energetic stability of the protein, as measured by its overall free energy () Liberles et al. (2012); Zeldovich and Shakhnovich (2008); Koehl and Levitt (2002). In the present model, this free energy is given by the number of bonds, which represent chemical and physical interactions among the amino acids. The higher the number of bonds the more stable and less flexible is the protein. By tuning stability, organisms adapt to their environment. Thermophiles that live in hotter places, such as hydrothermal vents, evolve more stable proteins to withstand the heat. Cryophiles that reside in colder niches have more flexible proteins Jaenicke and Böhm (1998).
We simulated the evolution of the two phenotypes, our specific dynamical mode together with an energetic state (i.e., a given bond density). We find that the large solution set of the mechanical problem allows the protein to select a subset with a specific energetic state. Thus, the evolutionary dynamics could find solutions to the same mechanical function when we imposed extreme values of bond density (Fig. 10). This demonstrates the capacity of the protein to search in parallel for the solutions of several biological tasks. Evolving a specific binding site is expected to be an easier task, since such sites are confined to a small fraction of the protein.
ii.7 Amino acid interactions
In the model described so far, the bonds were determined by the AA
species alone, while in real proteins, it is the interaction between pairs of AA which determines the formation of bonds
To model twobody AA interactions we consider a set of three AA species, which we call , and . Whether a bond is formed or not is determined by a symmetric binary relation , which we write as a interaction matrix,
This variant of the model is reminiscent of the HP model with its two species of AAs Dill (1985). The interaction range is kept identical to that our standard model, namely an AA can form a bond the 5 nearest neighbors in the adjacent rows.
The ‘gene’ in this variant of the model is a sequence of twoletter binary codons, , each representing an AA, such that the overall length of the gene is bits. The genetic code is a map from codons to AAs, . Since there are four codons and only three AAs, there is a redundancy in the ‘genetic code’. This is reminiscent of the (higher) redundancy of the natural genetic code in which AAs are encoded by codons Tlusty (2007); Eckmann (2008); Tlusty (2010) (out of the codons are ‘stop’ codons). In our 4codon genetic code, the redundant AA is chosen to be , , and the two other AA are encoded as , . For a given gene, the bond pattern is determined by looking at all AA pairs within the interaction range and calculating their coupling according to the interaction matrix (Table 1), . Once the bond network is determined from the gene, the rigidity pattern, rigid, fluid or ‘trapped’, is calculated as in the standard model (Sect. II.1).
In the simulations, at each step we flip one letter in a randomly selected codon. A quarter of the mutations are synonymous, since they exchange ’’ and ’’. The other three quarters add or cut bonds, and we check, as before, whether the connectivity change moves the rigidity pattern closer to a pattern that allows for a lowenergy floppy mode. A small number of beneficial mutations eventually resolve the mechanical transduction problem, typically after mutations.
In Fig. 11 we present some data (obtained from solutions) to illustrate the robustness of the results relative to model changes. We find that, despite having changed the connectivity model, our main conclusions regarding the geometry of the phenotypetogenotype map remain intact: A huge reduction from a highdimensional genotype space () to a lowdimensional phenotype space (), similar to the dimensions in Fig. 3. It is noteworthy that the configuration eigenvectors are very similar to those of simpler model (as in Fig. 6), although they are determined by very different bonding interactions. This is evident in the (nonrandom) bond eigenvectors which are similar in number to those of the pervious model but differ in pattern owing to the different bonding rules of Table 1. The robustness of the results manifests the universality of the dimensional reduction which originates from the continuity of the mechanical transduction.
Iii Conclusions
Our models of the genotypetophenotype map put forward a new physical picture of protein evolution. Our thesis is that rather than structure itself, it is the dynamics that governs protein fitness. Our method considers proteins as evolving amorphous matter with a mechanical function, a specific lowenergy conformational change. The rigidity/shearability pattern of the protein, and hence its dynamical modes, are determined by the connectivity of the amino acid interaction network. The model explains how the spatiallyextended modes appear as the gene mutates and changes the amino acid network. These modes are shear and hinge motions where the strain is localized in the shearable channel and where the surrounding domains translate or rotate as rigid bodies (Fig. 9).
A main insight from our model is that requiring the protein to have ‘floppy’ modes puts strong constraints on the space of mechanical phenotypes. As a consequence there is a huge dimensional reduction when mapping genotypes to phenotypes. We find that the collective mechanical interactions among the amino acids are mirrored in corresponding modes of sequence correlation in the genes. These main results do not depend on details of the model and have been reproduced in versions with (i) a different number of AA species (16 instead of 32), (ii) bonds that depend on pairwise interactions, and (iii) harmonic spring network Dutta et al. (). All these suggest that the results are generic and apply to a wide range of realizations.
Our models are distilled to their simplest physicalmathematical schemes, but have concrete, experimentally testable predictions. In the functional protein, the least random, strongly correlated sites are concentrated in a rigid shell that envelops the shearable channel Mitchell et al. (2016). Our model therefore predicts that these sites are also the most vulnerable to mutations (Fig. 8B), which distort the lowfrequency modes and thus hamper the biological function. These effects can be examined by combining mutation surveys, biochemical assays of the function, and physical measurements of the lowfrequency spectrum, especially in allosteric proteins.
To that end, one may take an enzyme with a known shear band (via analysis similar to Mitchell et al. (2016)) and mutate amino acids within and around the band. We expect the mutation of these amino acids to have a significant impact on the dynamics and biochemical function of the protein, as compared to other mutations in the rigid subdomains. By sequence alignment methods Casari et al. (1995); Gogos et al. (2000); Teşileanu et al. (2015); Rivoire et al. (2016), it is possible to test whether these sensitive positions in the protein exhibit strong correlations in the gene, as predicted by the model. One may also search for the dimensional reduction predicted by the model in high resolution maps of molecular fitness landscapes Jacquier et al. (2013); Roscoe et al. (2013); Firnberg et al. (2014); Kondrashov and Kondrashov (2015).
Past studies have shown that the motion of proteins Levitt et al. (1985); Tirion and Benavraham (1993); Bahar and Rader (2005); Zheng et al. (2006) and their hydrophobicity patterns England (2011) may often be approximated by a few normal modes, while others have demonstrated that the variation in aligned sequences may be characterized by a few correlation modes Casari et al. (1995); Gogos et al. (2000); Teşileanu et al. (2015); Rivoire et al. (2016). The present study links the genotype and phenotype spaces, and explains the dimensional reduction as the outcome of a nonlinear mapping between genes and patterns of mechanical forces: We characterize the emergent functional mode to be a soft, ‘floppy’ mode, localized around a fluidized channel (a shear band), a region of lower connectivity which is therefore easier to deform. The contiguity of this rigidity pattern implies that it can be described by a few collective degrees of freedom, implying a vast dimensional reduction of configuration space.
The concrete genotypetophenotype map in our simple models demonstrates that most of the gene records random evolution, while only a small nonrandom fraction is constrained by the biophysical function. This drastic dimensional reduction is the origin of the flexibility and evolvability in the functional solution set.
Appendix A The protein evolution model
a.1 The cylindrical amino acid network
We model the protein as an aggregate of amino acids (AAs) with short range interactions. In our coarse grained model, beads represent the AAs and bonds their interactions with neighboring AAs (Fig. 1). We consider a simplified cylindrical geometry, where the AAs are layered on the surface of a cylinder at randomized positions, to represent the noncrystalline packing of this amorphous matter. Throughout this study, we examine a geometry with height , i.e., the number of layers in the direction, and width , i.e., the circumference of the cylinder. When the cylinder is shown as a flat 2D surface (such as in Fig. 2), there are still periodic boundary conditions in the horizontal direction. The row and column coordinates of an AA are , with for the row and for the column . The cylindrical periodicity is accounted for by taking the horizontal coordinate modulo , .
Each AA in row can connect to any of its five nearest neighbors in the next row below, . This defines effective species of amino acids that differ by their ‘chemistry’, i.e., by the pattern of their bonds. Therefore, in the gene, each AA at is encoded as a letter binary codon, , where the kth letter denotes the existence () or absence () of the kth bond. The gene is the sequence of codons which represent the AAs of the protein. This means that each codon just specifies which of the 5 bonds are present or absent. Therefore, the codons are a genetic sequence of digits 0 or 1. Each of these numbers determines whether or not a bond connects two positions of the grid. Since the bonds from the bottom row do not affect the configuration of the protein and the resulting dynamical modes, the relevant length of the gene is somewhat smaller, .
a.2 Evolution searches for a mechanical function
We now define the target of evolution as finding a functional protein, in the following specific sense: To become functional, the protein has to evolve a configuration of AAs and bonds that can transduce a mechanical signal from a prescribed input at the bottom of the cylinder to a prescribed output at its top. This signal is a largescale, lowenergy deformation where one domain moves rigidly with respect to another in a shear or hinge motion, which is facilitated by the presence of a fluidized, ‘floppy’ channel separating the rigid domains Alexander et al. (1983); Phillips and Thorpe (1985); Alexander (1998).
a.3 Rigidity propagation algorithm
The largescale deformations are governed by the rigidity pattern of the configuration, which is determined by the connectivity of the AA network via a simple majority rule (Fig. 1). The details of this majority rule are as follows (Fig. 12): Each AA position will have two binary properties, which define its state:

The rigidity : This property can be rigid () or fluid ().

The shearability : This property can be shearable () or nonshearable (). As shown below, a nonshearable AA can be either rigid or fluid within a rigid domain of the protein. Nonshearable domains tend to move as a rigid body (i.e., via translation or rotation), whereas shearable regions are easy to deform.
Only 3 of the 4 possible combinations are allowed :

Nonshearable and solid AA (yellow): ().

Nonshearable and fluid AA (red): ().

Shearable and fluid AA (blue): ().

Shearable solid is forbidden.
Given a fixed sequence, and an input state in the bottom row of the cylinder, , the state of the cylinder is completely determined as follows: The three states percolate through the network, from row to row (see Fig. 12). This propagation is directed by the presence of bonds, with a maximum of bonds ending in each AA (of rows to ; the state of the first row is given as input). These bonds can be present(=1) or absent(=0). according to the codon , when they point to the AA with coordinate coming from the AA .
In a first sweep through the rows, we deal with the rigidity property . In row each of the AAs is in a rigidity state rigid () or fluid (). In all other rows, to , the 5 bonds determine the value of the rigidity of through a majority rule:
(1) 
where is the step function (, )). The parameter is the minimum number of rigid AAs from the row that are required to rigidly support AA: In 2D each AA has two coordinates which are constrained if it is connected to two or more static AAs. In this way, the rigidity property of being pinned in place propagates through the lattice, as a function of the initial row and the choice of the bonds which are present as encoded in the gene.
We next address the shearability property. It is determined by the rigidity of AAs as follows: We assume that all fluid AAs in row are also shearable (blue: ()). A fluid node in row will become shearable exactly if at least one of its neighbors or is shearable:
(2) 
where . The first term on the lhs ensures that a solid AA can never become shearable. This completes the definition of the map from the sequence to the state.
a.4 Fitness and mutations
As we explained before, the aim is to find a functional protein which can transfer forces. To find such a protein, we start from a random sequences (of 2550 codons), and from an initial state (input) in the bottom row of the cylinder. This initial state is just made from rigid and fluid beads, as shown e.g., in Fig. 2. For most simulations, we just took 5 consecutive fluid beads among the remaining solid beads.
We next define the target. It is a chain of values, fluid and shearable () or solid (), in the top row, which the protein should yield as an output: , . Given (i) a gene sequence, which determines the connectivity and (ii) the input state, , , the algorithm described above uniquely defines the output state in the top row, , . At each step of evolution, the output state is compared to the fixed, given target, by measuring the Hamming distance, the number of positions where the output differs from the target:
(3) 
In the biological convention is the fitness that should increase towards a maximum value of , when the inputoutput problem is solved.
Solutions are found by mutations. At each iteration, a randomly drawn digit in the gene is flipped, that is, the values of 0 and 1 are exchanged. This corresponds to erasing or creating a randomly chosen link of a randomly chosen AA. After each flip, a sweep is performed, and the new output at the top row is again compared to the target. A mutation is kept only if the Hamming distance is not increased as compared to the value before the mutation (equivalently the fitness is not allowed to decrease). This procedure is repeated until a solution is found. This will happen with probability 1, perhaps after very many flips, if the problem has a solution at all. This is really the Metropolis algorithm Metropolis et al. (1953) algorithm (at 0 temperature).
Remark: It is an important feature of our model that the quality of a network is only measured at the target line. This corresponds to the biological fact that the protein can only interact with the outside world through is surface (in our case, the ends of the cylinder). One of the surprising outcomes of our study is that this requirement has a strong influence on what happens in the interior of the protein. Also, the propagation of fluidity should not be confused with learning in neural networks, but is rather of the percolation type.
a.5 Simulation of evolutionary dynamics
All simulations are done on the playground, as described above. We have done simulations for many variants of the model, and many targets, but we present only two specific problems, for which the most extensive study was done: In the first, the fluid regions of the input and the target are opposite and of length 6 at the bottom and length 5 at the top. In the second run, top and bottom are the same, but the top is shifted sideways by 5 units. We will call these two examples straight and tilted, denoted as S and T. We have also studied examples in which the position of the target (relative to the input) is left free, but here we only discuss the results for the ’S’ and ’T’ case. This serves to illustrate that the results are largely independent of the details of the model. We have studied many other variants, and in all cases, the main results are qualitatively unchanged.
Remark: We view this as an important outcome of our theory, namely that it illustrates a close connection between gene and protein which goes way beyond the simple model we consider here.
For both, S and T, we study 200 independent branches, starting from a random sequence with about of the bonds present at the start. Given any fixed sequence, we sweep according to the rules of Eq(1)(2) through the net, and measure the Hamming distance (Eq(3)) between the last row and the desired target. When this Hamming distance is 0, we consider the problem as solved. If not, we flip randomly a bond (exchanging 0 with 1) and recalculate the Hamming distance. We view this flip as a mutation of the sequence, equivalent to mutating one nucleic base in a gene. If the Hamming distance decreases or remains unchanged, we keep the flip, otherwise we backtrack and flip another randomly chosen bond. This is repeated until a solution is found. (This is really a Metropolis algorithm Metropolis et al. (1953) at zero temperature.) Typically, after  mutations this inputoutput problem is solved. Although the functional sequences are extremely sparse among the possible sequences, the small bias for getting closer to the target in configuration space directs the search rather quickly.
Once a solution is found, we destroy it by further mutations and then look for a new solution, as before, starting from the destroyed state. This we call a generation. For each of the 200 branches, we followed 5000 generations, leading to a total of solutions. The time to recover from a destroyed state is about 1500 flips per error in that state, which is similar to time it takes to find a solution starting from a random gene. A destruction takes around 11.2 mutations on average.
We also did another simulations starting each time from another random configuration. The statistics in both cases are very similar, but the destructionreconstruction simulations obviously show some correlations between a generation and the next. This effect disappears after about 4 generations.
Appendix B Results, analysis and interpretation
b.1 Dimension of solution set
Dimension of a space measures the number of directions in which one can move from a point. In the case of our model, since from any sequence in sequence space one can move along axes by flipping just one bit, we see that the sequence space has dimension 2550, and the number of different elements in this space is a hypercube with elements.
The set of solutions which we find, has however much smaller dimension, as we show in Fig. 3 for the straight and tilted example. In the case of experimental data, as ours, the dimension is most conveniently determined by the boxcounting (GrassbergerProcaccia Grassberger and Procaccia (1983)) algorithm. This is obtained by just counting the number of pairs at distances , and then finding the slope in a loglog plot. This is indicated by the black lines in Fig. 3 we see that, clearly, the dimension in the space of configurations is about 89, while, in the space of sequences, the dimension is basically ‘infinite’, namely just limited by the maximal slope one can obtain Procaccia (1988).
b.2 Spectrum in phenotype and genotype spaces
We compute spectra for both the sequences and the configurations, for the solutions. Let us detail this for the case of sequences: We have binary vectors with components each, and we want to know the ‘typical’ spectrum of such vectors. This is conveniently found with the Singular Value Decomposition (SVD), in which one forms a matrix of size . This matrix can be written as , where is , is and is an matrix which is diagonal in the sense that only the elements with are nonzero. (We assume here that we are in the case .) The are in general and in this case the singular value decomposition is unique. We call the set of the the spectrum of the sequences, and the vectors in the eigenvectors of the SVD. It is the first few of those which are shown in Fig. 6.
Note that the SVD eigenvalues are the square roots of the spectrum of the covariance matrix which has the same eigenvectors as . Therefore the high SVD eigenvalues correspond to the principal components, the directions with maximal variation in the solution set.
Mutatis mutandis, we perform the same SVD for the case of the configurations, using the values (that is, of the shearability) of vectors of the configurations. (This is reasonable, because, in general, there are very few nonshearable and fluid AAs.)
Apart from the numerical findings, which are shown in Fig. 6 for the straight (S) example and in Fig. 7 for the tilted (T) one, some comments are in order:
Configuration space (The eight figures on the bottom left): The first mode is proportional to the average configuration. The next modes reflect the basic deviations of the solution around this average. For example, the second modes is lefttoright shift, the third mode is expansioncontraction etc. Since, the shearable/nonshearable interface can move at most one AA sideways between consecutive rows, the modes are constrained to diamondshaped areas in the center of the protein. This is the joint effect of the ‘influence zones’ of the input and output rows.
Sequence space (The eight figures on the bottom right): The first eigenvector is the average bond occupancy in the solutions. The higher eigenvalues reflect the structure in the manybody correlations among the bonds. The typical pattern is that of ‘diffraction’ or ‘oscillations’ around the fluid channel. This pattern mirrors the biophysical constraint of constructing a rigid shell around the shearable region. Higher modes exhibit more stripes, until they become noisy, after about the tenth eigenvalue.
The bondspectrum, top right in Figs. 6 and 7, has some outliers, which correspond to the localized modes shown in the eight panels below. Apart from that, the majority of the eigenvalues seem to obey the MarčenkoPastur formula, see Marčenko and Pastur (1967). If the matrix is , , then the support of the spectrum is . In our case, since we have a matrix, one expects (if they were really random) to find the spectrum at , which is close to the experiment, and confirms that most of the bonds are just randomly present or absent. We attribute the slight enlargement of the spectrum to memory effects between generation in the same branch. This corresponds to the wellknown phylogenetic correlations among descendants in the same tree.
It is tempting to also study the continuous part of this spectrum, which is not quite of the standard form. While in principle, this could be done by taking into account the known correlations, even the techniques of Guhr et al. (1998) seem difficult to implement. We thank T. Guhr for helpful discussions on his issue.
b.3 Shear modes in the amino acid network
Consider now either of the two examples, straight or tilted (S and T). A solution of such an example is given by a set of bonds, and this set of bonds defines a graph on the AAs. This graph is embedded in 2D where are the positions of the AAs, which are connected by straight bonds. We now extend the scope of our study somewhat, by assuming that the bonds are not totally rigid, but given by harmonic springs (see also Dutta et al. ()). This allows us to study mechanical properties which would be too stiff if we only worked with bonds which are rigid sticks.
In this case, the calculations are straightforward, if somewhat complex, and they are, e.g., well explained in (Chung and Sternberg, 1992, pp. 618–619). We thus consider the elastic tensor, , which is the tensor product of the network Laplacian with the 2 by 2 tensor of directional derivatives.
For the reader who is unfamiliar with Chung and Sternberg (1992), we describe what this means componentwise. The playground has size in the direction and size in the direction, with periodic boundary condition in the direction. All bonds go from some to , , , again with periodic boundary conditions in the direction. Each such bond defines a direction vector in which we normalize to . Note that this vector depends on both the origin and the target of the bond.
If we imagine harmonic springs between the nodes connected by bonds (all with the same spring constant), then we can define the (symmetric) tensor matrix of deformation energies in the and direction by
and where each element of is—when and are connected
by a bond—the 2 by 2 matrix (indexed by )
{equa}
M(k,m )&=(d_x(k,m),d_z(k,m))^T⊗(d_x(k,m),d_z(k,m))
&=
(dx2dxdzdxdzdz2) .
If and are not connected, then is the 0
matrix. The elements of are denoted .
Finally we complete the matrix to a ‘Laplacian’ by adding diagonal elements to it, so that the row (and column) sums are 0. In components, this means that we require, for each and each , the sums
to vanish. Other properties of are described in Chung and Sternberg (1992).
Since we take periodic boundary conditions in the direction, there will always be a (simple) 0 eigenvalue of in this direction. Other 0 eigenvalues correspond to translation in the direction or rotation in the plane. Another type of (double) 0 eigenvalues are associated with any patch of nodes which is totally disconnected from the rest of the lattice. Since the density of bonds is about and otherwise quite random, and there are twice 5 bonds at each interior node we expect (assuming random distribution of bonds) there to be about isolated nodes, i.e., isolated singletons, and even fewer patches of greater size.
Further zero modes come from nodes which can oscillate sideways without first order effects. This will happen if a node is only connected by one bond. Since , the probability of finding such a node is about
Thus, we show in Figures 6 and 7 the eigenfunctions only for the first eigenvalues after the trivial ones. Due to the tensorial nature of the problem, the eigenvectors have two components, which we show as 2D shearflow.
b.4 Genetic correlation matrix
In Fig. 5, we study the correlations among the solutions in sequence space. Given the matrix , of all sequences, with , (of binary digits), we compute the means and the standard deviations . Then, in the usual way, we form and
Fig. 5 then shows , with the autocorrelation omitted.
Note that both, the means and the variances depend very weakly on . Fig. 5 reveals and reinforces several observations also made in other calculations of this paper. First, looking onto the axis in the figure one sees a periodicity of the patterns corresponding to the 17 gaps between the 18 rows of the configuration space. This reflects the necessity to maintain a connected liquid channel. Also, as seen in Fig. 5, the correlations grow somewhat towards the ends, especially toward the upper () end. This is because of the mechanical constraint which forces the channel to become more precise towards the ends, in analogy with Fig. 8B.
The periodic patterns all over the square reflect not only the natural periodicity of 150 () elements in the sequence, but also show that the boundaries of the channel form a special shell (with two peaks per row).
b.5 Survival under mutations
Here, we ask how robust the solutions are as further mutations take place. First, we determine how many mutations lead to a destruction of the solution. The statistics of this is shown in Fig. 8. We note that about of all solutions are destroyed by just one mutation, while there is an exponential decay of survival of mutations. This signals that the mutations act independently.
One can also ask where the critical mutations take place. This is illustrated in Fig. 8B, and was discussed in the main text. We have also studied the places where exactly two mutations will kill a solution (and none of the 2 is a single site ‘killer’) (Fig. 8C) and in these cases, one finds that the two mutations are generally close to each other, acting on the same site. Again, the channel is less vulnerable to mutations but now the mutations are evenly distributed over the rest of the network.
b.6 Expansion of the protein universe
Let us explain in further detail how Fig. 4 was obtained. Here, we test our model against the ideas of Povolotskaya and Kondrashov (2010). Our results will give some insight about the nature of the graph of solutions. First, we describe the question as it is found in Povolotskaya and Kondrashov (2010). Take any two solutions and consider their gene sequences and . They will have a Hamming distance , which we normalize by dividing by 2550 (the number of elements in ), which we call the protein universe diameter. The question is how much the solution following one generation after differs from . If we call that solution , then the observed quantity is defined as follows: Let if and if , for . Then for each let . Note that if the change between and is towards and if it is away from . Finally, and , and we plot in Fig. 4 as a function of .
In Fig. 4 we show the results for data set S, (the plot for set T looks similar). The black curve is nothing but , where is the normalized Hamming distance, i.e., the proportion of sites which are different between and . The fit to this curve tells us an important aspect about the set of possible solutions. Note that the set of all possible forms a hypercube of dimension with corners. The set of solutions is a very small subset of this hypercube, where all corners which are not solutions have been taken away, including the bonds leading to these corners. This leads to a very complicated subgraph of the hypercube. While we do not have a good mathematical description of how it looks, the good fit shows that the comparisons between , , and are as if one performed a random walk on the full cube. (Note that such a result must be intimately connected to the high dimension of the problem, since for low dimensional hypercubes it does not hold.) Almost all solutions are at the edge of the universe, where the typical Hamming distances among the sequences are close to the typical distance between random sequences,
b.7 Flexibility of solutions: thermal stability
The histogram of the density of links for the solutions is
shown in Fig. 13. These distributions are obtained for simulations
in which links are flipped randomly in a symmetric fashion. One can
easily push these densities somewhat up or down, by
favoring/restricting the flips of links towards 1. However, much more
extreme solutions can be found by deterministic procedures which turn
as many links to 1 resp. 0. In these cases, we have obtained densities
of as high as 0.96 and as low as 0.14, that is, 2452/2550 links,
resp. 372/2550 links. Two such extreme cases are illustrated in
Fig. 10.
This shows that the model, if needed, can be adapted to questions of
temperature dependence of the protein, for example, by giving more or
less weight to the number of bonds, something like a chemical
potential in statistical mechanics.
Acknowledgements.
We thank Stanislas Leibler, Michael R. Mitchell, Elisha Moses and Giovanni Zocchi for essential discussions and encouragement. JPE is supported by an ERC advanced grant ‘Bridges’, and TT by the Institute for Basic Science IBSR020 and the Simons Center for Systems Biology of the Institute for Advanced Study, Princeton.Footnotes
 In our model, the AA species is determined by the bonds, while in real proteins the bonds are determined by the chemical nature and position of the AA (see also Sect. II.7).
 Note that in this simulation, we do not take as evolutionary criterion the mechanical signal itself, but require that the protein forms a fluid channel with a prescribed configuration. We show that this configuration facilitates the soughtafter mechanical shear motion in Sections II.5 and B.3. (In Dutta et al. () we take the mechanical modes themselves as the target function.)
 The propagation of rigidity is effectively a “double” percolation problem in which both fluid (blue) and rigid (gray) regions are continuous (see Sect. A.3).
 We lack sufficient data to determine such high dimensions precisely, and is a lower bound.
 The natural genetic code with its 20 AAs is therefore an intermediate case.
 The natural genetic code is redundant, i.e. several codons encode the same AA and are therefore synonymous. Such redundancy reduces the fraction of destructive mutations, since mutations that exchange synonymous codons do not change the encoded AA and are theretofore bound to be neutral. A case of redundant code is examined in Sect. II.7.
 At least two AAs. There may be also higher order terms of threebody interactions etc.
References
 Eugene V. Koonin, Yuri I. Wolf, and Georgy P. Karev, “The structure of the protein universe and genome evolution,” Nature 420, 218–223 (2002).
 Y. Xia and M. Levitt, “Simulating protein evolution in sequence and structure space,” Curr Opin Struct Biol 14, 202–207 (2004).
 Ken A. Dill and Justin L. MacCallum, “The proteinfolding problem, 50 years on,” Science 338, 1042–1046 (2012).
 Konstantin B Zeldovich and Eugene I Shakhnovich, “Understanding protein evolution: from protein physics to darwinian selection,” Annu Rev Phys Chem 59, 105–127 (2008).
 David A. Liberles, Sarah A. Teichmann, Ivet Bahar, Ugo Bastolla, Jesse Bloom, Erich BornbergBauer, Lucy J. Colwell, A. P. Jason de Koning, Nikolay V. Dokholyan, Julian Echave, Arne Elofsson, Dietlind L. Gerloff, Richard A. Goldstein, Johan A. Grahnen, Mark T. Holder, Clemens Lakner, Nicholas Lartillot, Simon C. Lovell, Gavin Naylor, Tina Perica, David D. Pollock, Tal Pupko, Lynne Regan, Andrew Roger, Nimrod Rubinstein, Eugene Shakhnovich, Kimmen Sjölander, Shamil Sunyaev, Ashley I. Teufel, Jeffrey L. Thorne, Joseph W. Thornton, Daniel M. Weinreich, and Simon Whelan, “The interface of protein structure, protein biophysics, and molecular evolution,” Protein Sci 21, 769–785 (2012).
 Inna S. Povolotskaya and Fyodor A. Kondrashov, “Sequence space and the ongoing expansion of the protein universe,” Nature 465, 922–926 (2010).
 A. D. Keefe and J. W. Szostak, “Functional proteins from a randomsequence library,” Nature 410, 715–718 (2001).
 Patrice Koehl and Michael Levitt, “Protein topology and stability define the space of allowed sequences,” Proceedings of the National Academy of Sciences 99, 1280–1285 (2002).
 DE Koshland, “Application of a theory of enzyme specificity to protein synthesis,” Proc Natl Acad Sci U S A 44, 98–104 (1958).
 K. A. HenzlerWildman, V. Thai, M. Lei, M. Ott, M. WolfWatz, T. Fenn, E. Pozharski, M. A. Wilson, G. A. Petsko, M. Karplus, C. G. Hubner, and D. Kern, “Intrinsic motions along an enzymatic reaction trajectory,” Nature 450, 838–U13 (2007).
 Y. Savir and T. Tlusty, “Conformational proofreading: the impact of conformational changes on the specificity of molecular recognition,” PLoS One 2, e468 (2007).
 T. M. Schmeing, R. M. Voorhees, A. C. Kelley, Y. G. Gao, F. V. th Murphy, J. R. Weir, and V. Ramakrishnan, “The crystal structure of the ribosome bound to eftu and aminoacyltrna,” Science 326, 688–94 (2009).
 Y. Savir and T. Tlusty, “Recamediated homology search as a nearly optimal signal detection system,” Mol Cell 40, 388–396 (2010).
 Morgan Huse and John Kuriyan, “The conformational plasticity of protein kinases,” Cell 109, 275–282 (2002).
 Y. Savir and T. Tlusty, “The ribosome as an optimal decoder: A lesson in molecular recognition,” Cell 153, 471–479 (2013).
 M. F. Perutz, “Stereochemistry of cooperative effects in haemoglobin: Haemhaem interaction and the problem of allostery,” Nature 228, 726–734 (1970).
 N. M. Goodey and S. J. Benkovic, “Allosteric regulation and catalysis emerge via a common route,” Nat Chem Biol 4, 474–482 (2008).
 Steve W. Lockless and Rama Ranganathan, “Evolutionarily conserved pathways of energetic connectivity in protein families,” Science 286, 295–299 (1999).
 Allan Chris M. Ferreon, Josephine C. Ferreon, Peter E. Wright, and Ashok A. Deniz, “Modulation of allostery by protein intrinsic disorder,” Nature 498, 390–394 (2013).
 H. Qu and G. Zocchi, “How enzymes work: A look through the perspective of molecular viscoelastic properties,” Phys Rev X 3 (2013).
 C. Joseph, C. Y. Tseng, G. Zocchi, and T. Tlusty, “Asymmetric effect of mechanical stress on the forward and reverse reaction catalyzed by an enzyme,” PLoS One 9 (2014), e101442.
 Michael R. Mitchell, Tsvi Tlusty, and Stanislas Leibler, “Strain analysis of protein structures and low dimensionality of mechanical allosteric couplings,” Proc Natl Acad Sci U S A 113, E5847–E5855 (2016).
 M. Gerstein, A. M. Lesk, and C. Chothia, “Structural mechanisms for domain movements in proteins,” Biochemistry 33, 6739–6749 (1994).
 S. Dutta, JP. Eckmann, and T. Tlusty, To be published.
 S. Alexander, C. Laermans, R. Orbach, and H. M. Rosenberg, “Fraction interpretation of vibrational properties of crosslinked polymers, glasses, and irradiated quartz,” Phys Rev B 28, 4615–4619 (1983).
 J. C. Phillips and M. F. Thorpe, “Constraint theory, vector percolation and glassformation,” Solid State Commun 53, 699–702 (1985).
 S. Alexander, “Amorphous solids: Their structure, lattice dynamics and elasticity,” Phys Rep 296, 65–236 (1998).
 Itamar Procaccia, “Complex or just complicated?” Nature 333, 498–499 (1988).
 J.P. Eckmann and D. Ruelle, “Fundamental limitations for estimating dimensions and lyapunov exponents in dynamical systems,” Physica D 56, 185–187 (1992).
 Peter Grassberger and Itamar Procaccia, “Characterization of strange attractors,” Phys Rev Lett 50, 346–349 (1983).
 Y. Savir, E. Noor, R. Milo, and T. Tlusty, “Crossspecies analysis traces adaptation of rubisco toward optimality in a lowdimensional landscape,” Proc Natl Acad Sci U S A 107, 3475–3480 (2010).
 O. Shoval, H. Sheftel, G. Shinar, Y. Hart, O. Ramote, A. Mayo, E. Dekel, K. Kavanagh, and U. Alon, “Evolutionary tradeoffs, pareto optimality, and the geometry of phenotype space,” Science 336, 1157–1160 (2012).
 Tiberiu Teşileanu, Lucy J. Colwell, and Stanislas Leibler, “Protein sectors: Statistical coupling analysis versus conservation,” PLoS Comput Biol 11, e1004091 (2015).
 Tsvi Tlusty, “Selfreferring dna and protein: a remark on physical and geometrical aspects,” Phil. Trans. Roy. Soc. A 374 (2016).
 F. R. K. Chung and S. Sternberg, ‘‘Laplacian and vibrationalspectra for homogeneous graphs,” Journal of Graph Theory 16, 605–627 (1992).
 Rainer Jaenicke and Gerald Böhm, “The stability of proteins in extreme environments,” Curr Opin Struct Biol 8, 738–748 (1998).
 Ken A. Dill, “Theory for the folding and stability of globular proteins,” Biochemistry (Mosc ) 24, 1501–1509 (1985).
 Tsvi Tlusty, “A model for the emergence of the genetic code as a transition in a noisy information channel,” J Theor Biol 249, 331–342 (2007).
 JeanPierre Eckmann, “Trading codes for errors,” Proceedings of the National Academy of Sciences 105, 8165–8166 (2008).
 Tsvi Tlusty, “A colorful origin for the genetic code: Information theory, statistical mechanics and the emergence of molecular codes,” Phys Life Rev 7, 362–376 (2010).
 G. Casari, C. Sander, and A. Valencia, “A method to predict functional residues in proteins,” Nat Struct Biol 2, 171–8 (1995).
 Arhonda Gogos, Derek Jantz, Sema Sentürker, Delwood Richardson, Miral Dizdaroglu, and Neil D. Clarke, “Assignment of enzyme substrate specificity by principal component analysis of aligned protein sequences: An experimental test using dna glycosylase homologs,” Proteins: Struct , Funct , Bioinf 40, 98–105 (2000).
 O. Rivoire, K. A. Reynolds, and R. Ranganathan, “Evolutionbased functional decomposition of proteins,” PLoS Comput Biol 12 (2016), ARTN e1004817 10.1371/journal.pcbi.1004817.
 Hervé Jacquier, André Birgy, Hervé Le Nagard, Yves Mechulam, Emmanuelle Schmitt, Jérémy Glodt, Beatrice Bercot, Emmanuelle Petit, Julie Poulain, GuilÃ¨ne Barnaud, PierreAlexis Gros, and Olivier Tenaillon, “Capturing the mutational landscape of the betalactamase tem1,” Proc. Nat. Acad. Sci. USA 110, 13067–13072 (2013).
 Benjamin P. Roscoe, Kelly M. Thayer, Konstantin B. Zeldovich, David Fushman, and Daniel N. A. Bolon, ‘‘Analyses of the effects of all ubiquitin point mutants on yeast growth rate,” J Mol Biol 425, 1363–1377 (2013).
 Elad Firnberg, Jason W. Labonte, Jeffrey J. Gray, and Marc Ostermeier, “A comprehensive, highresolution map of a geneâs fitness landscape,” Mol Biol Evol 31, 1581–1592 (2014).
 Dmitry A. Kondrashov and Fyodor A. Kondrashov, ‘‘Topological features of rugged fitness landscapes in sequence space,” Trends Genet 31, 24–33 (2015).
 Michael Levitt, Christian Sander, and Peter S. Stern, “Protein normalmode dynamics: Trypsin inhibitor, crambin, ribonuclease and lysozyme,” J Mol Biol 181, 423–447 (1985).
 M. M. Tirion and D. Benavraham, “Normal mode analysis of gactin,” J Mol Biol 230, 186–195 (1993).
 I. Bahar and A. J. Rader, “Coarsegrained normal mode analysis in structural biology,” Curr Opin Struct Biol 15, 586–592 (2005).
 W. J. Zheng, B. R. Brooks, and D. Thirumalai, “Lowfrequency normal modes that describe allosteric transitions in biological nanomachines are robust to sequence variations,” Proc Natl Acad Sci U S A 103, 7664–7669 (2006).
 JeremyÂ L England, “Allostery in protein domains reflects a balance of steric and hydrophobic effects,” Structure 19, 967–975 (2011).
 N Metropolis, AW Rosenbluth, MN Rosenbluth, AH Teller, and E Teller, “Equation of state calculations by fast computing machines,” J Chem Phys 21, 1087–1092 (1953).
 V. A. Marčenko and L. A. Pastur, “The spectrum of random matrices,” Teor. Funkciĭ Funkcional. Anal. i Priložen. Vyp. 4, 122–145 (1967).
 Thomas Guhr, Axel MüllerGroeling, and Hans A. Weidenmüller, “Randommatrix theories in quantum physics: common concepts,” Phys Rep 299, 189–425 (1998).