Formation of regulatory modules by local sequence duplication

Armita Nourmohammad, Michael Lässig

1 Institute for Theoretical Physics, University of Cologne, Zülpicherstr. 77, 50937 Köln, Germany


Turnover of regulatory sequence and function is an important part of molecular evolution. But what are the modes of sequence evolution leading to rapid formation and loss of regulatory sites? Here, we show that a large fraction of neighboring transcription factor binding sites in the fly genome have formed from a common sequence origin by local duplications. This mode of evolution is found to produce regulatory information: duplications can seed new sites in the neighborhood of existing sites. Duplicate seeds evolve subsequently by point mutations, often towards binding a different factor than their ancestral neighbor sites. These results are based on a statistical analysis of 346 cis-regulatory modules in the Drosophila melanogaster genome, and a comparison set of intergenic regulatory sequence in Saccharomyces cerevisiae. In fly regulatory modules, pairs of binding sites show significantly enhanced sequence similarity up to distances of about bp. We analyze these data in terms of an evolutionary model with two distinct modes of site formation: (i) evolution from independent sequence origin and (ii) divergent evolution following duplication of a common ancestor sequence. Our results suggest that pervasive formation of binding sites by local sequence duplications distinguishes the complex regulatory architecture of higher eukaryotes from the simpler architecture of unicellular organisms.


The importance of regulatory variations as a driving force for phenotypic evolution has been suggested some time ago [1, 2]. However, a quantitative understanding of gene regulation has become possible only after the advent of large-scale genomic sequence and regulatory interaction data. Important building blocks are genome-wide maps of protein-DNA binding, statistical inference methods  [3, 4], high-throughput measurements of sequence-specific binding affinities of transcription factors  [5, 6, 7, 8], and cross-species comparisons of regulatory sequences and regulatory functions [9].

The resulting picture is quite diverse: Core parts of developmental regulatory networks can be conserved over large evolutionary ranges [10], and individual promoters in flies are found to be conserved in function over large evolutionary distances  [11, 12]. Functional changes in promoters have been identified as well, but the relative roles of adaptive and near-neutral evolution remain to be clarified. The sequences in regulatory DNA regions evolve under less constraint than their functional output. This feature can be explained by wide-spread compensatory changes, which have been identified between different nucleotides within individual binding sites as well as between different sites within a promoter [13, 14, 15, 11, 16, 17, 12]. At the promoter level, this dynamics includes loss and gain of binding sites, the rates of which have been estimated in flies and yeast [18, 19, 13]. The observed site turnover is consistent with moderate negative selection acting on individual sites [20, 13], whereas the function of entire promoters is under stronger stabilizing selection [11].

The evolutionary constraint of regulatory sequence and function depends on the level of complexity in promoter architecture. Prokaryotes and unicellular eukaryotes have short intergenic regions, and regulatory functions are often encoded by few binding sites. The more complex cis-regulatory information in higher eukaryotes is organized into regulatory modules, which are typically a few hundred base pairs long and are spatially separated by larger segments of intergenic DNA [21, 22]. Within modules, regulatory functions often depend on clusters of neighboring binding sites for multiple transcription factors, which are coupled by cooperative interaction [23, 24, 25, 26, 27]. Bioinformatic algorithms trace such site clusters to identify regulatory DNA regions [28, 29, 30, 31, 32, 33, 34]. The relative order and spacing of sites within clusters follows a regulatory “grammar”, which distinguishes functionally neutral site changes from rearrangements affecting promoter function [35, 36, 37, 38, 17, 39]. The combinatorial complexity of this grammar ensures the specificity of regulation in the larger genomes of multicellular eukaryotes [40, 41]. At the same time, the grammar is flexible enough to allow substantial sequence evolution in a regulatory module while maintaining its overall functional output.

In addition to point mutations, sequence insertions and deletions (indels) play a significant role in this dynamics. Several studies have noted the prevalence of repetitive sequence elements in promoter regions and their potential influence on regulatory function [42, 43, 44, 45, 46, 47, 48, 49]. In particular, a recent detailed analysis of the evolutionary rates of short tandem repeats in Drosophila has shown a net surplus of insertions, suggesting that these repeats may produce new regulatory sequence [48]. But to what extent is this actually the case? A priori, the link between repeat evolution and regulation is far from obvious: Duplications in repeats can either be part of the neutral background evolution in regulatory sequences, or increase the spacing between existing binding sites of a regulatory module, or contribute to the formation of new sites. Disentangling these roles is subtle, because detected tandem repeats in contemporary sequence overlap with only a small fraction of binding sites, motif size and total length of most repeats are shorter than length and spacing of typical binding sites in a cluster, and repeat lifetimes are much shorter than conservation times of regulatory elements [49]. Hence, the role of repeat dynamics for regulation is an open problem: Do local duplications actually transport and produce regulatory information?

This is the topic of the present paper. We show that local duplications have left a striking signature in the fly genome: the majority of transcription factor binding sites in regulatory modules show evidence of a duplication event in their evolutionary history. We conclude that over long evolutionary times, local duplications are pervasive and crucial for the formation of complex regulatory modules in the fly genome. This mode of evolution sets the speed of regulatory evolution and facilitates adaptive changes of promoter function. We infer site duplications from their traces in the sequence of neighboring binding sites, but most duplication events predate the tandem repeats present in contemporary sequence. This distinguishes our study from comparative analysis of regulatory sequence between closely related species [48, 46, 49, 47, 45], which can detect the insertion-deletion dynamics of contemporary repeats, but cover only a small window in the evolution of regulatory sites.

The importance of binding site evolution by duplication is grounded in the biophysics of transcription factor-DNA interactions: the sequence-dependent probability of binding between factor and site depends in a strongly nonlinear way on the binding energy [3]: it takes values close to 1 in an energy range below the maximum binding energy, then drops rapidly as the energy decreases further, and is close to 0 in the energy range of non-binding sites. This nonlinearity generates strong epistatic effects for point mutations within binding sites [50, 13] and, in turn, an asymmetry in the turnover of binding sites. Functional sites can rapidly lose their binding affinity to a factor by one or two point mutations. Rapid adaptive formation of a site, however, requires a seed sequence with marginal binding, to which positive selection for point mutations towards stronger binding can latch on. Such seeds are contained in random sequence, but at unspecific positions. Estimates of the rate of site formation based on biophysically grounded fitness models suggest that point mutations alone can explain the rapid formation of an individual site in a sufficiently large sequence interval, but not the formation of spatially confined agglomerations of sites characteristic of regulatory modules [51, 50, 52]. As we show in this paper, local sequence duplications generate seeds for new sites specifically in the neighborhood of functional sites.

Our analysis proceeds in three steps. First, we analyze local sequence similarities in regulatory regions of the Drosophila melanogaster genome in a model-independent way. In regulatory modules, we find a significant autocorrelation in nucleotide content for distances up to about 70 bp. This autocorrelation includes the known contributions of tandem repeat sequences, but it extends to a much larger distance range. The signal turns out to be generated by local sequence clusters, a substantial fraction of which are functional transcription factor binding sites with similar sequence motifs. In the second part of the paper, we infer the evolutionary origin of these correlated pairs of binding sites, using a probabilistic model with mutations, genetic drift, and selection. The model compares the likelihood of two alternative histories: a pair of sites evolves either independently or by duplication from a common ancestor sequence. The duplication is followed by diversification under selection for binding of two (in general different) factors. We show that the duplication pathway is the most likely history for pairs of sites with a mutual distance up to about 50 bp, and we find evidence that this pathway is specific to regulatory modules of multicellular eukaryotes. Finally, we show that the duplication mode has adaptive potential: duplicated ancestor sites can act as seeds for the subsequent formation of a novel binding site for the same factor and, notably, even for a different factor.


Sequence autocorrelation in regulatory DNA

The most straightforward measure of local similarity in a sequence segment is the autocorrelation function, which is defined as the difference between the likelihood that two nucleotides at a distance of base pairs are identical and mean identity of two random nucleotides, . This function is straightforward to evaluate from sequence data as given by eq. (2) in Materials and Methods. We have obtained the autocorrelation function in 346 regulatory modules of the D. melanogaster genome with length of more than bp identified by REDfly database [53, 54, 55]. The results are shown in Fig. 1 (a). In the distance range up to about 70 bp, the function takes positive values that decay with in a roughly exponential way; this signal is clearly above the noise level. The mean identity is evaluated in a local window of 500 bp (changing the window length affects the baseline of this function, but not its short-distance behavior). The autocorrelation signal is small and has several potential sources, such as multiple binding sites for similar motifs, microsatellite and minisatellite repeats at short length scales [48, 47, 49, 46], homopolymeric stretches of nucleotides characteristic of nucleosome-depleted regions [56], or other local inhomogeneities in sequence composition. As a next step, we will characterize local sequence similarity in a more specific way: we will show that mutually correlated nucleotide pairs are not evenly distributed over regulatory modules, but occur in local clusters with a characteristic length scale of around 7 bp. This signal will be analyzed from an evolutionary point of view and be linked to cis-regulatory function.

Sequence motifs and information

To motivate the following analysis, assume that a given sequence segment is covered by families of sites belonging to different motifs. By definition, a motif is a probability distribution of genotypes , which describes a specific set of sequence sites with consecutive base pairs and is different from the background distribution . The statistical deviation of a motif from background is measured by the relative entropy between these distributions, , which is given by eq. (4) in Materials and Methods. This quantity determines the average sequence information per site, which is often quoted in units of bits [4]. Multiplying with the number of sites for each motif and summing over all motifs produces a measure of the total sequence information contained in a genomic region.

Well-known motifs in regulatory DNA are the families of binding sites for a given transcription factor. In eukaryotic systems, these sites have a typical length of about 5-10 bp and frequency distributions (called position weight matrices) with a typical information content bits per site; see the recent discussion by  [57]. Other motifs can be defined, for example, in nucleosome-depleted sequences in eukaryotes and for repeat units in tandem repeats. If all motifs occurring in a given sequence segment were known, we could try to predict their sites and evaluate the information content directly. In the present part of the analysis, we proceed differently. We only assume that sequence motifs carry a certain information content over sites of a given length , but we make no further assumptions on position weight matrices, sequence coverage, or evolutionary origin. We can still recover part of this sequence information from those motifs that occur more than once in the sequence segment. A pair of sites of length belonging to the same motif has an average similarity information given by the mutual entropy , which measures the enhanced similarity of aligned nucleotides of the site sequences compared to the background similarity and is given by eq. (5) in Materials and Methods. Clearly, the similarity information between pairs of sites is a somewhat diluted measure of the full information content due to motifs. As a rule of thumb, the mutual entropy per site pair, , recovers about half of the sequence information per site, . For example, binding sites for the same transcription factor are strongly correlated, with a typical similarity and a similarity information bits per site pair.

Here, we want to identify pairs of similar sites at a given distance and relate them to the sequence autocorrelation function discussed above. Thus, we estimate the total similarity information per unit sequence length of all strongly correlated pairs of sites with distance and length in regulatory modules. This quantity can be defined by constructing a set of site pairs for given and with the following properties: (i) Any pair of sites has an average mutual similarity between aligned nucleotides above a certain threshold, . (ii) The left sites (and, hence, also the right sites) of all pairs have no mutual overlaps. This condition is necessary in order to avoid overcounting of mutual similarity in overlapping site pairs. (iii) The sum of the mutual similarities of all pairs in the set is maximal (see Fig. S1 for illustration). This condition is also used to set the similarity threshold . To identify a set of site pairs with properties (i) to (iii), we use a dynamic programming algorithm as explained in Materials and Methods. This method allows for optimization of sequence length similar to the procedure in local sequence alignment algorithms [58]. In the maximum-similarity set, we record the average mutual similarity of aligned nucleotides in site pairs, which determines the mean information content per site pair, (see eq. (5) in Materials and Methods). We also record the number of site pairs and determine the excess over the number expected by chance in background sequence, (see Materials and Methods). The distance-dependent total similarity information per unit length in a sequence segment of size can then be estimated as .

Our inference of is related to recent methods for prediction of unknown regulatory modules based on their enhanced sequence similarity contained in words of length  [32, 33, 34]. But the evaluation of sequence similarity and the goals of the analysis differ: module prediction uses the total similarity in a genomic region, which in our setup is given by summation of over all distances and over different word lengths . Our analysis is limited to known regulatory modules and focuses on the dependence of on and . A specific part of this signal, obtained from sites with distance below 50 bp, will be associated below with local duplications as prevalent evolutionary mode.

Similarity information in regulatory modules of Drosophila

We evaluate the similarity information in the set of 346 regulatory modules of Drosophila melanogaster and in surrounding background sequence. The following features of local sequence similarity can be extracted:

The total information of local sequence similarity is maximal for motifs of length . Fig. 1(b) shows the total similarity information of all detected site pairs in the range of up to 100 bp, , as a function of the site length . The function takes its maximum, that is, the similarity information is most significant, for . The signal falls off at shorter length scales, because typical motif sequences are only partially covered, and at larger length scales, because uncorrelated flanking nucleotides contribute negatively to the similarity information. In this sense, detected motifs cover a characteristic length of about 7 bp. A similar length scale has been observed in tandem repeats [45, 47, 46].

The function takes distance-dependent positive values in the range of up to 50 bp and saturates to a positive asymptotic value for larger distances. Thus, its distance dependence is compatible to that of the sequence autocorrelation function shown in Fig. 1(a). This pattern is due to site pairs with high mutual similarity, .

Correlated binding sites explain a substantial part of the similarity information. We estimate this contribution by masking all functional sites [55, 54, 53] and re-evaluating the function in their sequence complement; see Fig. 1(c). Known binding sites cover about of the regulatory modules, but the signal is reduced by about 50%, indicating that these sites are an important source of similarity information. The binding site-masked signal is comparable to its counterpart in non-regulatory intergenic sequence.

Microsatellite repeats explain only a small part of the similarity information. We identify such repeats using the Tandem Repeat Finder [59]. If we remove about 5 % of the sequence in regulatory modules as repeats, the similarity information is reduced by less than 10%; Fig. 1(c). This is not surprising, because our sequence similarity measure differs from that of repeat analysis. In particular, our measure is sensitive to correlated segments on larger distance scales than typical tandem repeats, because it does not require a contiguous interval of self-similar sequence in between.

Homologous regions in other fly genomes show a consistent form of . We analyze homologous regions of two other Drosophila species, D. yakuba and D. pseudoobscura (see Materials and Methods). As shown in Fig. S2, these putative regulatory modules have patterns of very similar overall amplitude and distance-dependence, with enhanced values in the range of up to 50 bp.

In summary, our model-independent analysis shows that motifs with a characteristic length of about 7 bp play an important part in the distance-dependent sequence autocorrelation of Drosophila regulatory modules. The characteristic length coincides with the typical length of binding sites, and a substantial fraction of the signal can be explained by sequence correlations involving known binding sites. Therefore, we now focus the analysis on this smaller, but experimentally validated set of sites [53, 54, 55] and analyze in detail the evolutionary mechanism generating the sequences of neighboring pairs of sites.

Evolutionary modes of binding sites

Binding sites are ideal objects to study the production of information by sequence evolution. The sequence motif is approximately known for about transcription factors in Drosophila, that is, we can analyze the full position-dependent sequence information of these motifs, not just the similarity information of motif pairs. Furthermore, there is a simple link between sequence statistics and evolution of binding sites: assuming the sequence distribution defines a motif at evolutionary equilibrium, its sequence information is proportional to the average fitness effect of its binding sites, , with a proportionality constant equal to the effective population size [60, 51, 20, 50]. The fitness contribution of a particular binding sequence, , is proportional to its log-likelihood ratio in the distributions and . The ensemble of these fitness values defines an information-based fitness landscape for binding of a specific transcription factor. These relations between sequence statistics and fitness of binding sites quantify our intuition that specific sequences are overrepresented in a motif, because they confer a selective advantage over random sequences [4]. If we write the motif distribution in the product form of a position weight matrix, we obtain an approximate expression for the fitness in terms of the position-specific single-nucleotide frequencies in the motif sequence and their counterparts in background sequence: . This expression, which is in its simplest form already contained in Kimura’s U-shaped equilibrium distribution for a two-allele locus [61], is known as Bruno-Halpern model in the context of protein evolution [62] and has been used to infer fitness effects of mutations in binding sites [60, 51, 20, 50, 63, 52]. Although this additive fitness model neglects fitness interactions between nucleotides within binding sites as well as between sites within a regulatory module, it is justified for the purpose of this study (see below).

The fitness landscape defines the selection coefficient of any change from a state to a state of a binding site, . Together with the effective population size and the mutation rates, these selection coefficients determine the evolutionary dynamics of binding sites. In particular, the probability of evolving from an ancestor site to a descendent site through a series of point substitutions within an evolutionary distance can be evaluated in an analytical way from the underlying substitution matrix [58, 64] (see Materials and Methods).

We can now compare the two modes of binding site evolution introduced above. For any given pair of adjacent sites and that bind transcription factors and , respectively, we want to evaluate the likelihood of two different histories of site formation. In the first mode of evolution, the sites are assumed to evolve to their present sequence states by point substitutions from independent ancestor sequences and under independent selection given by the fitness landscapes and , as illustrated in Fig. 2(a). If the selection for binding is assumed to act over a sufficiently long evolutionary time, the probability of observing the present sequence states and in this independent mode of evolution is simply . This mode of evolution can only result in distance-dependent sequence similarity arising from an increased coverage with pairs of adjacent sites with correlated motifs and (evidence for this effect will be discussed below). However, it does not generate increased similarity of individual pairs of adjacent sites beyond that of their motifs.

In the second mode of evolution, the sites are assumed to evolve from a common ancestor sequence by a local duplication event at a distance from the present, followed by diversification under selection given by separate fitness landscapes and : either the original site is under stationary selection for binding factor and the duplicated site has evolved the new function of binding the factor or vice versa, as illustrated in Fig. 2(b). In this mode, the present sequences and have evolved from their last common ancestor by independent substitution processes with transition probabilities and . The dynamics results in a joint probability of the form where the distribution of the ancestor sequence is given by (see Materials and Methods). In this mode, distance-dependent sequence similarity arises due to common descent, causing the sequences of adjacent sites to be more similar than their motifs and . Importantly, this effect is generic and not tied to any functional properties of the transcription factors and . Fig. 2(c) shows a few examples of enhanced sequence similarity in pairs of adjacent binding sites in regulatory modules of D. melanogaster.

The relative likelihood of common versus independent descent for a specific pair of sites is given by the duplication score . This score measures the similarity in a gapless alignment between the sequences and in a specific way: it gauges matches and mismatches depending on the weights of aligned nucleotides in their respective binding motifs and . The duplication score depends on the parameter , which we choose by a maximum-likelihood procedure (see Materials and Methods). This parameter describes the expected excess similarity of site pairs related by common descent, but it is not a linear clock of divergence time. Simulated evolution of binding site histories shows that the maximum-likelihood duplication score reliably distinguishes between site pairs with common and with independent descent (see Materials and Methods). Below, we use the distribution of duplication scores to infer the mode of evolution prevalent in a given class of site pairs.

This likelihood analysis goes beyond the inference of the sequence similarity introduced above. It can be seen as a decomposition of the distance-dependent similarity between sites into two parts: the similarity between their motifs, and the excess similarity of the actual site pairs beyond that of their motifs. The first part reflects functional correlations within regulatory modules and is assigned to the background model . Only the second part provides evidence for common descent, which is gauged by the scoring function .

Our model scores only the sequence similarity within site pairs and does not incorporate the insertions and deletions between the sites after duplication, which determine their relative distance. This is justified, because the likely divergence times of most duplicated site pairs are much longer than repeat lifetimes. If a site duplicates within a repeat, the relative distance between copies may subsequently undergo rapid evolution due to the high indel rates in these regions [46–49]. Given a surplus of insertions over deletions in regulatory modules, we expect the relative distance to increase on average [48]. The spacing of contemporary sites is then the result of a long-term diffusive insertion/deletion dynamics within the repeats active since duplication, most of which have decayed in today’s sequence. This leaves the similarity of conserved functional sites as the most prominent long-term marker of these dynamics.

Local sequence duplications in Drosophila

Using the duplication score , we have evaluated the sequence similarity of pairs of neighboring binding sites in regulatory modules of the Drosophila melanogaster genome. These sites are experimentally validated and recorded in the REDfly database [55, 54, 53] (see Materials and Methods). We infer the prevalent mode of evolution as a function of the distance between sites and obtain the main result of this paper:

In fly, binding sites with a distance of up to about bp are more likely to share a common ancestor than to have evolved from independent origins. Fig. 3(a) shows the histogram of duplication scores for the set of binding site pairs with bp. The score distribution of these pairs is clearly distinguished from the background distribution , which is obtained from pairs of sites located in the same module at a distance bp and is associated with independent descent. We decompose the score distribution of adjacent sites in the form , attributing the excess of large scores to pairs of sites of common descent with a score distribution . Our best fit of this mixed-descent model to the data distribution has a fraction of adjacent site pairs formed by duplication; see Fig. 3(a). The total log-likelihood of the mixed-descent model relative to the background model is given by multiplying the relative entropy of the distributions and with the number of site pairs, . We estimate , providing significant statistical evidence that the prevalent mode in adjacent sites is evolution from common descent (for details, see Materials and Methods). We note that this significance emerges for the ensemble of the adjacent site pairs, whereas the relative log-likelihood for duplication per site pair, , is of order one: individual site sequences are inevitably too short to reliably discriminate between the two evolutionary modes. Our conclusion that local sequence duplications generate the observed excess similarity of adjacent sites is supported by a number of further controls and a comparison with the yeast intergenic regulatory sequences:

The relative log-likelihood for duplication per site pair decreases with increasing distance between sites. In Fig. 3(b), we evaluate the relative entropy for the score distributions of site pairs with different values of mutual distance . We find a rapid decay up to about bp, that is, the score distribution becomes successively more similar to the background distribution with increasing site distance. This pronounced distance-dependance is comparable to that of the total sequence similarity shown in Fig. 1(c) and is consistent with local duplications as underlying mechanism.

Similarity of neighboring sites is not limited to specific pairs of transcription factors. We partition the 306 site pairs with a mutual distance of less than 50 bp by factor pairs and evaluate the partial score averages . We compare the distribution of these averages with the corresponding distribution of averages evaluated after scrambling the score values of the site pairs, as shown in Fig. 3(c). The two distributions are statistically indistinguishable, which shows that excess sequence similarity is a broad feature of adjacent binding sites and is not limited to a subset of sites for factor pairs with specific functional relationships. This supports our conclusion that the excess sequence similarity reflects common descent and not fitness interactions (epistasis) between sites. Of course, epistasis is common for binding sites in the same regulatory module, because these sites perform a common regulatory function. However, generic interactions couple the binding energies of adjacent sites, not directly their sequences. Epistatic effects generating excess sequence similarity are conceivable for specific factor pairs, but do not appear to be a parsimonious explanation for the broad similarities of adjacent binding sites we observe.

In yeast, binding site duplications are not frequent. For comparison, we have also evaluated a set of pairs of binding sites in the Saccharomyces cerevisiae genome. Fig. 3(d) shows distribution of duplication scores for the set of binding sites with bp. This distribution is strongly peaked around zero (because the maximum-likelihood value of is large, see Materials and Methods) and indistinguishable from the distribution of the control set of random site pairs; both distributions have a negative average. As in Drosophila, most binding sites in the same intergenic region of S. cerevisiae are located within bp from each other. However, we do not observe evidence for local duplications as a mode of binding site formation in yeast. Clearly, this result does not exclude that such duplications take place, but they do not appear to be frequent enough to generate a statistically significant excess similarity of neighboring sites. This is not surprising given the differences in regulatory architecture between yeast and fly: individual sites in S. cerevisiae are more specific than in Drosophila; the average sequence information of a binding motif is bits, compared to bits [57]. Accordingly, a larger part of the regulatory functions in yeast relies on single sites, and there are no regulatory modules which would require frequent duplications for their formation.

Adaptive potential of duplications

Do the inferred site duplications have adaptive potential for the formation of novel binding sites? Here, we use the term adaptive potential to indicate that the duplication itself may be a neutral process, and selection for factor binding may latch on later to duplicated sites. The duplication of a site for a given transcription factor has obvious adaptive potential towards formation of an adjacent site for the same factor. But local duplications also have adaptive potential if the duplicated site is to evolve the new function of binding a different factor, because the binding motifs of transcription factors with adjacent sites are correlated. This correlation quantifies the ability of one factor to recognize the binding sites of another factor, including seed sites generated by sequence duplications. Specifically, we define the binding correlation of a transcription factor with another factor as the average information-based fitness to bind factor in the ensemble of -sites. In Fig. 4, this quantity is evaluated for all factor pairs with adjacent binding sites, together with the range of fitness of known target sites for factor and the average in background sequence (see Materials and Methods). For most such factor pairs, the fitness of a typical -site is seen to be similar to that of weak -sites and significantly larger than the average fitness of background sequence. This binding correlation between motifs is sufficient so that an -site duplicate can act as a seed for a -site, which can subsequently adapt its strength by point mutations. The binding correlation is specific to factors which have adjacent binding sites; we have found no such effect in the control ensemble of all factor pairs (most of which do not have adjacent sites). Furthermore, some highly specific motifs, such as hunchback, twi and z do not show binding correlations with other factors.


Local sequence duplication as a mechanism of regulatory evolution

Local sequence duplications (and deletions) are a generic evolutionary characteristic of intergenic DNA and, in particular, of regulatory sequence [48, 45, 46, 44, 47, 49]. In this study, we have established evidence for local sequence duplications as a mechanism that transports and produces cis-regulatory information. These duplications generate specific, distance-dependent sequence similarity in strongly correlated pairs of sites with a relative distance of up to about 50 bp, which account for a substantial part of the sequence autocorrelation in fly regulatory modules. In particular, they provide a parsimonious explanation for the observed excess sequence similarity of transcription factor binding sites in this range of relative distance. We conclude that the majority of these adjacent site pairs have a common ancestor sequence. The large amplitude of the duplication signal may be the most surprising result of this study. It far exceeds the level expected from the repeats in contemporary sequence, which cover only about 5 percent of binding sites and are typically shorter than the distance between correlated sites. Common-descent site pairs are the cumulative effect of past duplications over macro-evolutionary intervals, whose trace is conserved by selection on site functionality.

This result establishes local duplication as a pervasive formation mode of regulatory sequence, which generates, for example, the known local variations in site numbers between Drosophila species. Of course, our evidence for this mode is statistical and, at this point, is confined to a limited dataset of binding sites with confirmed functionality [55, 54, 53]. The duplication mode appears to be specific to multicellular eukaryotes; we have not found comparable evidence in the yeast genome. Our findings are relevant for genome analysis in two ways: including local duplications should inform inference methods for binding sites as well as alignments of regulatory sequence with improved scoring of indels [46, 47, 48, 49]. With such methods, it may become possible to follow the evolutionary history of binding sites in more detail.

Life cycle of a binding site

We have found evidence that local duplications can confer adaptive potential for the formation of novel binding sites, because they generate seed sequences with marginal binding specifically in the vicinity of existing sites. This mechanism is necessary, because point mutations alone can only lead to rapid loss but not to gain of new sites with positional specificity. Thus, duplications and point mutations complement each other, suggesting that typical binding sites within multicellular eukaryotes have an asymmetric life cycle: formation within a functional cluster by local duplication, adaptation of binding energy by point mutations, evolution of relative distance to neighboring sites by insertions and deletions in flanking sequence, conservation by stabilizing selection on binding energy, and loss by point mutations.

The life cycle of individual binding sites interacts with other levels of genome evolution. Gene duplications with subsequent sub-functionalization have been identified as an important evolutionary mode specifically in higher eukaryotes [65]. If subfunctionalization is initialized at the level of gene regulation, it amounts to a loss of regulatory input for both gene duplicates and provides a mechanism for adaptive loss of binding sites. This process alone would lead to genomes with many genes, but few functions per gene. Maintaining regulatory complexity with multi-functional genes as observed in eukaryotic genomes  [23, 26] requires a converse evolutionary mode: gain of new functions by existing genes. At the regulatory level, this amounts to gain of regulatory input, i.e., adaptive formation of new binding sites.

Sequence evolution and regulatory grammar

Previous studies have identified regulatory modules as important units of transcriptional control, in which clusters of binding sites bind multiple transcription factors with cooperative interactions. The sites in a cluster follow a regulatory grammar resulting from natural selection acting on site order, strength, and relative distances [36, 37, 38]. If sequence duplications play a major role in the formation of such clusters, we may ask how much of their observed structure reflects this mode of sequence evolution, rather than optimization of regulatory function by natural selection. Local duplications generically produce descendant sites, which are weak binding sites for another factor at best, as shown in Fig. 4. (Significant heterogeneity in binding strength between adjacent sites is indeed observed in our sample.) The resulting binding sequences are hardly optimal in terms of specificity and discrimination between different factors. Cooperative binding between transcription factors may have evolved as a secondary mechanism to confer regulatory function to these sequence structures.

In this paper, we have argued that local sequence duplications facilitate the adaptive evolution of gene regulatory interactions. However, the adaptive potential of duplications does not imply that the duplication process itself has to be adaptive or even confined to regulatory sites. Similar to gene duplications [65], many site duplications may be neutral and provide a repertoire of marginal regulatory links. Adaptive diversification can build subsequently on this repertoire, conserving and tuning those links that confer a fitness advantage and discarding others.

Materials and Methods

Regulatory sequences and position weight matrices

The sequence analysis of D. melanogaster is based on the cis-regulatory modules and experimentally validated binding sites collected in the REDfly v database [55, 54, 53], and on the position weight matrices of Dan Pollard’s dataset (http://www.danielpollard.com/matrices.html). To measure the distance-dependent sequence similarity , we use the 346 known regulatory modules with length of more than 1000 bp in D. melanogaster. The analysis in D. yakuba and D. pseudoobscura is based on the 249 well-aligned homologous regions obtained from multiple alignments of 12 Drosophila species (dm3, BDGP release5); see Fig. S2. For the evolutionary inference in the second part of the paper, we use only the experimentally validated binding sites contained in these modules which are not necessarily selected for high similarity to motifs or for high mutual similarity. To avoid biases in our analysis, the set of sites is truncated in three ways: (i) We only use binding sites for transcription factors that occur in at least two different regulatory modules, so that the position weight matrix is not biased by the sequence context of a single module. (ii) We use only sites that have no sequence overlap with other sites in the dataset, because our inferred fitness landscapes describe the selection for a single regulatory function [13]. (iii) We exclude sites in the X chromosome, which could bias the results by its high rate of recent gene duplications and the abundance of repeat sequences [66, 67]. These conditions produce a cleaned set of 506 transcription factor binding site pairs located in 74 cis-regulatory modules. For the analysis in S. cerevisiae, we use sites and position weight matrices from the SwissRegulon database [68]. These footprints do not always match the length of their position weight matrices. To produce a set of site sequences of common length , longer footprints are cut and shorter ones joined with flanking nucleotides, such that the binding affinity is maximized.

Sequence information measures

Sequence autocorrelation is a measure of enhanced mean similarity between the nucleotides of a sequence segment. The distance dependence of the autocorrelation signal provides information about the range, within which the nucleotides appearing in the sequence are correlated. In a given sequence segment , the nucleotide frequencies are given by


where if and otherwise. These determine the mean similarity between two random nucleotides of the segment, . The sequence autocorrelation function is then defined by


We evaluate this function in the 346 regulatory modules of Drosophila melanogaster genome with length of more than 1000 bp identified by REDfly v database [55, 54, 53]. As shown in Fig. 1(a), we find an approximate exponential decay with a characteristic length of about 100 bp as the range of sequence correlation. The mean identity is evaluated in a local window of 500 bp (changing the window length affects the baseline of this function, but not its dependence on distance up to 100 bp). Information about the spatial distribution of these correlated nucleotides along the genome is contained in higher orders of sequence autocorrelation (i.e., reoccurrence of doublets, triplets, etc.). Here, we use information theory to identify such clusters of correlated nucleotides in a sequence region.
We want to detect reoccurring nucleotide patterns or motifs. A motif of length is a probability distribution for sites which differs significantly from the background distribution . If we neglect correlations between nucleotides, we can write these distributions as the product of single-nucleotide frequencies,


and . The matrix of single-nucleotide frequencies (3) is called the position weight matrix of the motif. The sequence information of the motif is defined as the relative entropy (Kullback-Leibler distance) of these distributions [69],


To study the sequence coverage by informative motifs, we use a reduced form of the full frequency distribution by mapping it to the mean similarity of its motif sites. Hence, even without any prior knowledge on frequency distributions, we can recover part of the sequence information for those motifs that occur more than once in the sequence segment. Two sites drawn from the motif have a mean similarity between aligned nucleotides, which is higher than the background mean similarity . The similarity information of the motif is given by the relative entropy


Similarity information between pairs of sites is a somewhat diluted measure of sequence information. As a rule of thumb, the mutual similarity entropy per site pair, , recovers about half of the motif information per site, .

Inference of similarity information by dynamic programming

To estimate the total similarity information of all strongly correlated pairs of sites with distance and length in a sequence segment of length , we construct a set


of site pairs with the following properties:
(i) The left (and also the right) sites of all pairs have no sequence overlap,


(ii) The mean similarity of each pair is greater than a threshold ,


(iii) The sum of mutual similarities is maximal (see Fig. S1)

By the dynamic programming recursion


we obtain the sequence of partial scores with the initial condition . We then use a backtracking procedure (see, e.g., [58]) to determine the set of positions and, hence, the number and the average similarity of the high-similarity pairs (6). To estimate the expected number of pairs in background sequence, , we apply the same procedure to sequences of length , which are generated by a first-order Markov model


with the same single-nucleotide frequencies and conditional frequencies as in the actual sequence. We then evaluate the excess and obtain an estimate of the total information contained in the enhanced autocorrelation of motifs as given by eq. (5),


We infer by maximum likelihood analysis of the total similarity information in the sequence. This method also allows for optimization of the motif length , similar to the procedure in the local sequence alignment algorithms [58].

Evolutionary model for binding sites

Our evolutionary dynamics of binding site sequences for a given transcription factor is determined by the Bruno-Halpern fitness model [62] derived from the position weight matrix () and the background frequencies ,


This relationship between fitness and nucleotide frequencies is valid if binding sites are at evolutionary equilibrium under mutations, genetic drift, and selection, and background sequence is at neutral equilibrium (accordingly, all inferred fitness values are scaled in units of the effective population size ). The relationship of the evolutionary ensembles with the underlying thermodynamics of site-factor interactions is discussed, for example, in ref. [52]. Eq. 12 defines an information-based fitness model: the average fitness of functional binding sites equals the sequence information of the motif,


with and ; see eqs. (1), (3) and (4). We infer from the local background frequency of the region base pairs around each binding site. The rates of point substitutions within binding sites are determined by the scaled selection coefficients derived from this fitness model and the point mutation rates (which are assigned a uniform value for simplicity). Here, we use the standard Kimura-Ohta substitution rates


which are valid in the regime (in which subsequent substitution processes are unlikely to overlap in time) and  [61, 70]. The matrix of these substitution rates then determines the transition probabilities (propagators) from an arbitrary initial sequence to an arbitrary final sequence within an evolutionary distance  [58, 64]. Given the set of transition probabilities, we obtain the joint probability for a pair of sites that bind transcription factors and , respectively, and have evolved from a common ancestor as described in the main text and in Fig. 2(b). First, we assume that the ancestor site is at evolutionary equilibrium under selection to bind factor , that is, the contemporary site has the ancestral function and has evolved a new function after duplication. This gives the contribution


where we have used the detailed balance condition of the substitution dynamics [64]. There is a second contribution describing the case of the ancestor under stationary selection to bind factor . Weighing these cases with equal prior probabilities, we obtain


In our analysis of pairs of adjacent binding sites in Drosophila, there is usually a dominant contribution from one of the terms, from which we can infer the likely function of the ancestor site. In the limit of large , the evolution from a common ancestor becomes indistinguishable from evolution by independent descent, .

Inference of common descent

The duplication score


is a measure of sequence similarity between binding sites. This score depends on the evolutionary distance parameter . We infer the optimal value of by maximizing the likelihood ratio between the score distribution of pairs with mutual distance and the score distribution of pairs with independent origin. In D. melanogaster, we find a finite maximum-likelihood evolutionary distance and significantly positive values of the duplication score for adjacent binding sites. In S. cerevisiae, we find large values , i.e., there is no statistical evidence for evolution by common descent. Our conclusions are largely independent of the values of used in (16) and (17). These values should be regarded as model fit parameters for the observed sequence similarities. Energy-based fitness models [64, 13], which take into account the epistasis between mutations within binding sites, are required to obtain more accurate estimates of , which can be tested against phylogenetic data. Epistasis will increase the inferred values of compared to the additive (Bruno-Halpern) model [64, 13].

We evaluate the score distribution of a given class of site pairs in terms of a mixture model of common and independent descent,


The distribution of scores for independent descent, , is obtained from pairs of sites in a common module with a relative distance bp (Fig.  3(a), dashed line). This distribution is approximately Gaussian and has a width of order one, which is consistent with the simulations reported below. Because we build from sites in a common module, its score average is above that for pairs of sites located in different modules. In this way, the overall sequence similarity within modules, which depends on the local GC-content, is assigned to the background model and does not confound the evidence for common descent. The distribution is the best fit to the the large-score excess of the distribution for adjacent sites with a relative distance bp (Fig. 3(a), violet-shaded). This distribution has larger mean and is broader than the background distribution , which is also consistent with the simulations reported below.

Given a set of site pairs with scores described by the distribution , the log-likelihood of the mixed-descent model (18) relative to the independent-descent background model is given by


it equals the product of the number of sites and the relative entropy . The extensive quantity measures the statistical evidence for the mixture model based on the number and the score distribution of site pairs, whereas quantifies only the shape differences between the distributions and . We evaluate eq. (19) using the conservative estimate with ; see Fig. 3(a).

We have tested our inference procedure by simulations of the sequence evolution for pairs of binding sites with common and with independent descent. For these simulations, we use four pairs of different factors , and two pairs of equal factors . For each factor pair, we obtain an ensemble of 25000 pairs of binding sites with a duplication in their evolutionary histories, as described by Eqs. (15, 16) and Fig. 2(b). We first obtain 500 duplication events : the last common ancestor sequence is drawn with equal likelihood from the ensemble or , and the divergence time is drawn from an exponential distribution with mean . For each duplication event, we draw 50 site pairs from the distribution describing evolution under selection for binding of factors and , respectively. We then apply our scoring procedure to this set of site pairs. As for the real sequence data, we infer a single maximum-likelihood parameter by maximization of the total duplication score . As shown in Fig. S3(a), has a pronounced maximum at a value , which is close to the mean divergence time of the input data. We conclude that the constraint of a fixed does not confound the inference of common descent. We also obtain separate score distributions for sites binding the pairs of equal factors and of different factors listed above; see Fig. S3(b) and Fig. S3(c). These distributions are similar and clearly distinguish duplicated site pairs from pairs with independent ancestries for both factor groups. We conclude that our method can infer common descent of binding sites, independently of their functional characteristics.

Binding correlation of transcription factors

We define the binding correlation for each ordered pair of factors as the average information-based fitness of -sites for the -factor,


This value is an estimate for the compatibility of the -sites with the transcription factor and equals, up to a constant, the information-theoretic cross entropy between the distributions and . In Fig. 4, this quantity is compared to (i) the sequence information of the motif , which equals the average fitness of -sites for the -factor by eq. (4),


and (ii) to the average fitness of background sequence for the -factor,



  •  1. Monod J, Jacob F (1961) Teleonomic mechanisms in cellular metabolism, growth, and differentiation. Cold Spring Harb Symp Quant Biol 26: 389-401.
  •  2. King MC, Wilson AC (1975) Evolution at two levels in humans and chimpanzees. Science 188: 107-116.
  •  3. Berg O, von Hippel P (1987) Selection on DNA-binding sites by regulatory proteins. statistical mechanical theory and application to operators and promoters. Journal of Molecular Biology 193: 723-50.
  •  4. Stormo GD, Fields DS (1998) Specificity, free energy and information content in protein-DNA interactions. Trends Biochem Sci 23: 109-13.
  •  5. Fields DS, He Y, Al-Uzri AY, Stormo GD (1997) Quantitative specificity of the Mnt repressor. Journal of Molecular Biology 271: 178-94.
  •  6. Mukherjee S, Berger MF, Jona G, Wang XS, Muzzey D, et al. (2004) Rapid analysis of the DNA-binding specificities of transcription factors with DNA microarrays. Nat Genet 36: 1331-9.
  •  7. Maerkl SJ, Quake SR (2007) A systems approach to measuring the binding energy landscapes of transcription factors. Science 315: 233-7.
  •  8. Badis G, Berger MF, Philippakis AA, Talukder S, Gehrke AR, et al. (2009) Diversity and complexity in DNA recognition by transcription factors. Science 324: 1720-3.
  •  9. Stark A, Lin MF, Kheradpour P, Pedersen JS, Parts L, et al. (2007) Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures. Nature 450: 219–232.
  •  10. Davidson EH, Erwin DH (2006) Gene regulatory networks and the evolution of animal body plans. Science 311: 796-800.
  •  11. Ludwig MZ, Bergman C, Patel NH, Kreitman M (2000) Evidence for stabilizing selection in a eukaryotic enhancer element. Nature 403: 564-7.
  •  12. Hare EE, Peterson BK, Iyer VN, Meier R, Eisen MB (2008) Sepsid even-skipped enhancers are functionally conserved in Drosophila despite lack of sequence conservation. PLoS Genet 4: e1000106.
  •  13. Mustonen V, Kinney J, Callan CG Jr, Lässig M (2008) Energy-dependent fitness: a quantitative model for the evolution of yeast transcription factor binding sites. Proc Natl Acad Sci U S A 105: 12376-81.
  •  14. Ludwig MZ, Kreitman M (1995) Evolutionary dynamics of the enhancer region of even-skipped in Drosophila. Mol Biol Evol 12: 1002-11.
  •  15. Ludwig MZ, Patel NH, Kreitman M (1998) Functional analysis of eve stripe 2 enhancer evolution in Drosophila: rules governing conservation and change. Development 125: 949-58.
  •  16. Ludwig MZ, Palsson A, Alekseeva E, Bergman CM, Nathan J, et al. (2005) Functional evolution of a cis-regulatory module. PLoS Biol 3: e93.
  •  17. Arnosti DN, Kulkarni MM (2005) Transcriptional enhancers: Intelligent enhanceosomes or flexible billboards? J Cell Biochem 94: 890-8.
  •  18. Moses AM, Pollard DA, Nix DA, Iyer VN, Li XY, et al. (2006) Large-scale turnover of functional transcription factor binding sites in Drosophila. PLoS Comput Biol 2: e130.
  •  19. Doniger SW, Fay JC (2007) Frequent gain and loss of functional transcription factor binding sites. PLoS Comput Biol 3: e99.
  •  20. Moses AM, Chiang DY, Pollard DA, Iyer VN, Eisen MB (2004) MONKEY: identifying conserved transcription-factor binding sites in multiple alignments using a binding site-specific evolutionary model. Genome Biol 5: R98.
  •  21. Ondek B, Gloss L, Herr W (1988) The SV40 enhancer contains two distinct levels of organization. Nature 333: 40-5.
  •  22. Bergman CM, Pfeiffer BD, Rincón-Limas DE, Hoskins RA, Gnirke A, et al. (2002) Assessing the impact of comparative genomic sequence data on the functional annotation of the Drosophila genome. Genome Biol 3: RESEARCH0086.
  •  23. Ptashne M, Gann A (2002) Genes and signals. Cold Spring Harbor, New York: Cold Spring Harbor Laboratory Press.
  •  24. Sinha S, Schroeder MD, Unnerstall U, Gaul U, Siggia ED (2004) Cross-species comparison significantly improves genome-wide prediction of cis-regulatory modules in Drosophila. BMC Bioinformatics 5: 129.
  •  25. Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, et al. (2004) Transcriptional regulatory code of a eukaryotic genome. Nature 431: 99-104.
  •  26. Davidson EH (2006) The regulatory genome: gene regulatory networks in development and evolution. Burlington, MA: Academic.
  •  27. Lynch M (2006) The origins of eukaryotic gene structure. Mol Biol Evol 23: 450-68.
  •  28. Rajewsky N, Vergassola M, Gaul U, Siggia ED (2002) Computational detection of genomic cis-regulatory modules applied to body patterning in the early Drosophila embryo. BMC Bioinformatics 3: 30.
  •  29. Halfon MS, Grad Y, Church GM, Michelson AM (2002) Computation-based discovery of related transcriptional regulatory modules and motifs using an experimentally validated combinatorial model. Genome Res 12: 1019-28.
  •  30. Sinha S, van Nimwegen E, Siggia ED (2003) A probabilistic method to detect regulatory modules. Bioinformatics 19 Suppl 1: i292-301.
  •  31. Berman BP, Pfeiffer BD, Laverty TR, Salzberg SL, Rubin GM, et al. (2004) Computational identification of developmental enhancers: conservation and function of transcription factor binding-site clusters in Drosophila melanogaster and Drosophila pseudoobscura. Genome Biol 5: R61.
  •  32. Abnizova I, te Boekhorst R, Walter K, Gilks WR (2005) Some statistical properties of regulatory DNA sequences, and their use in predicting regulatory regions in the Drosophila genome: the fluffy-tail test. BMC Bioinformatics 6: 109.
  •  33. Li L, Zhu Q, He X, Sinha S, Halfon MS (2007) Large-scale analysis of transcriptional cis-regulatory modules reveals both common features and distinct subclasses. Genome Biol 8: R101.
  •  34. Leung G, Eisen MB (2009) Identifying cis-regulatory sequences by word profile similarity. PLoS One 4: e6901.
  •  35. Stanojevic D, Small S, Levine M (1991) Regulation of a segmentation stripe by overlapping activators and repressors in the Drosophila embryo. Science 254: 1385-7.
  •  36. Small S, Arnosti DN, Levine M (1993) Spacing ensures autonomous expression of different stripe enhancers in the even-skipped promoter. Development 119: 762-72.
  •  37. Markstein M, Markstein P, Markstein V, Levine MS (2002) Genome-wide analysis of clustered Dorsal binding sites identifies putative target genes in the Drosophila embryo. Proc Natl Acad Sci U S A 99: 763-8.
  •  38. Kulkarni MM, Arnosti DN (2005) cis-regulatory logic of short-range transcriptional repression in Drosophila melanogaster. Mol Cell Biol 25: 3411-20.
  •  39. Lusk RW, Eisen MB (2010) Evolutionary mirages: selection on binding site composition creates the illusion of conserved grammars in Drosophila enhancers. PLoS Genet 6: e1000829.
  •  40. Buchler NE, Gerland U, Hwa T (2003) On schemes of combinatorial transcription logic. Proc Natl Acad Sci U S A 100: 5136-41.
  •  41. Levine M, Tjian R (2003) Transcription regulation and animal diversity. Nature 6945: 147-51.
  •  42. Britten RJ (1996) DNA sequence insertion and evolutionary variation in gene regulation. Proc Natl Acad Sci U S A 93: 9374-7.
  •  43. Hancock JM, Shaw PJ, Bonneton F, Dover GA (1999) High sequence turnover in the regulatory regions of the developmental gene hunchback in insects. Mol Biol Evol 16: 253-65.
  •  44. Vinces MD, Legendre M, Caldara M, Hagihara M, Verstrepen KJ (2009) Unstable tandem repeats in promoters confer transcriptional evolvability. Science 324: 1213-6.
  •  45. Boeva V, Regnier M, Papatsenko D, Makeev V (2006) Short fuzzy tandem repeats in genomic sequences, identification, and possible role in regulation of gene expression. Bioinformatics 22: 676-84.
  •  46. Tanay A, Siggia ED (2008) Sequence context affects the rate of short insertions and deletions in flies and primates. Genome Biol 9: R37.
  •  47. Messer PW, Arndt PF (2007) The majority of recent short DNA insertions in the human genome are tandem duplications. Mol Biol Evol 24: 1190-7.
  •  48. Sinha S, Siggia ED (2005) Sequence turnover and tandem repeats in cis-regulatory modules in Drosophila. Mol Biol Evol 22: 874-85.
  •  49. Gruen D (2006) Statistics and evolution of functional genomic sequence. Ph.D. thesis, Univeristät zu Köln.
  •  50. Berg J, Willmann S, Lässig M (2004) Adaptive evolution of transcription factor binding sites. BMC Evol Biol 4: 42.
  •  51. Berg J, Lässig M (2003) Stochastic evolution of transcription factor binding sites. Biophysics Moscow 48: S36-S44.
  •  52. Lässig M (2007) From biophysics to evolutionary genetics: statistical aspects of gene regulation. BMC Bioinformatics 8 Suppl 6: S7.
  •  53. Bergman CM, Carlson JW, Celniker SE (2005) Drosophila DNase I footprint database: a systematic genome annotation of transcription factor binding sites in the fruitfly, Drosophila melanogaster. Bioinformatics 21: 1747-9.
  •  54. Gallo SM, Li L, Hu Z, Halfon MS (2006) REDfly: a regulatory element database for Drosophila. Bioinformatics 22: 381-3.
  •  55. Halfon MS, Gallo SM, Bergman CM (2008) REDfly 2.0: an integrated database of cis-regulatory modules and transcription factor binding sites in Drosophila. Nucleic Acids Res 36: D594-8.
  •  56. Segal E, Widom J (2009) Poly(dA:dT) tracts: major determinants of nucleosome organization. Curr Opin Struct Biol 19: 65-71.
  •  57. Wunderlich Z, Mirny LA (2009) Different gene regulation strategies revealed by analysis of binding motifs. Trends Genet 25: 434-40.
  •  58. Durbin R, Eddy S, Krogh A, Mitchison G (1998) Biological sequence analysis: probabalistic models of proteins and nucleic acids. Cambridge, UK: Cambridge University Press.
  •  59. Benson G (1999) Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res 27: 573-80.
  •  60. Moses AM, Chiang DY, Kellis M, Lander ES, Eisen MB (2003) Position specific variation in the rate of evolution in transcription factor binding sites. BMC Evol Biol 3: 19.
  •  61. Kimura M (1962) On the probability of fixation of mutant genes in a population. Genetics 47: 713-9.
  •  62. Halpern AL, Bruno WJ (1998) Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol Biol Evol 15: 910-7.
  •  63. Mustonen V, Lässig M (2009) From fitness landscapes to seascapes: non-equilibrium dynamics of selection and adaptation. Trends Genet 25: 111-9.
  •  64. Mustonen V, Lässig M (2005) Evolutionary population genetics of promoters: predicting binding sites and functional phylogenies. Proc Natl Acad Sci U S A 102: 15936-41.
  •  65. Lynch M, Conery J (2003) The origins of genome complexity. Science 5649: 1401-4.
  •  66. Katti MV, Ranjekar PK, Gupta VS (2001) Differential distribution of simple sequence repeats in eukaryotic genome sequences. Mol Biol Evol 18: 1161-7.
  •  67. Thornton K, Long M (2002) Rapid divergence of gene duplicates on the Drosophila melanogaster X chromosome. Mol Biol Evol 19: 918-25.
  •  68. Pachkov M, Erb I, Molina N, van Nimwegen E (2007) SwissRegulon: a database of genome-wide annotations of regulatory sites. Nucleic Acids Res 35: D127-31.
  •  69. Kullback S, Leibler RA (1951) On information and sufficiency. Annals of mathematical statistics 22: 79-86.
  •  70. Kimura M, Ohta T (1969) The average number of generations until fixation of a mutant gene in a finite population. Genetics 61: 763-71.
Figure 1: Sequence similarity in regulatory modules of the fly genome. (a) Sequence autocorrelation as a function of distance , obtained from regulatory modules in D. melanogaster (gray: unbinned data, blue: binned in intervals of variable length). The autocorrelation values are positive and depend on in a roughly exponential way up to about bp. (b) Total similarity information as a function of motif length for all pairs of strongly correlated sites with mutual distance bp in the same set of regulatory modules. This function takes its maximum at a characteristic motif length of bp. (c) Distance-dependent similarity information for motif length evaluated in all sequence (red), binding site-masked sequence (green), repeat-masked sequence (blue) in regulatory modules, and in generic intergenic sequence (black). Repeat-masked sequence is generated using the Tandem Repeat Finder [59] with match-mismatch-indel penalty parameters (2,3,5). Insert: Total similarity information for the same sequence categories. Binding sites, but not tandem repeats, account for a substantial fraction of the similarity information.
Figure 2: Evolutionary modes of transcription factor binding sites. The figure shows alternative formation histories for two adjacent binding sites, whose present sequences bind transcription factors and , respectively. The color coding indicates the evolution of binding function for factor (red) and (blue) with evolutionary time . (a) Evolution from independent ancestor sequences. The sites evolve to their present states by independent evolutionary processes under stationary selection given by different fitness landscapes and (see text). In this mode, adjacent sites will show no enhanced average sequence similarity compared to the similarity of their motifs. (b) Evolution by duplication of a common ancestor sequence. Left panel: The original site evolves in the stationary fitness landscape . At a distance from the present, this site undergoes a duplication. The duplicated site evolves its new function of binding in the fitness landscape . Right panel: The same process with the roles of and interchanged. In the duplication mode, the sites retain an enhanced sequence similarity, which reflects their common descent. (c) Examples of adjacent functional binding sites with enhanced sequence similarity in the D. melanogaster genome. The sites of each pair are aligned. The color background of nucleotide at position indicates its contributions to fitness (binding affinity) for factor and , i.e., (level of red) and (level of blue). The sequence similarity leads to hybrid binding characteristics: some nucleotides of the -site (top row) have binding characteristics of the -motif, and vice versa. Examples from top to bottom (factor / factor , genomic positions, duplication score): (i) Kruppel / hunchback, chr3L: 8639822 / 8639878, , (ii) zeste / Trithorax-like, chr3R: 12560236 / 12560218, , (iii) Kruppel / tailless, chr3L: 8639586 / 8639596, , (iv) pangolin / apterous, chr3R: 22997722 / 22997752, .
Figure 3: Common vs. independent descent of binding sites in fly and yeast. (a) Histogram of the duplication score for pairs of binding sites with a mutual distance of up to bp in the genome of D. melanogaster (sum of grey-shaded and violet-shaded part). Decomposition of counts according to the mixed-descent model (see Materials and Methods): 43% of the site pairs are of independent descent and have the score distribution (obtained from pairs with relative distance bp, dashed line), 57% of the site pairs of are of common descent and have the score distribution (violet-shaded). (b) Relative log-likelihood for duplication per site pair, i.e., relative entropy obtained from the score distribution of site pairs in the relative distance range bp (evaluated from a total of sites). The rapid decay of this function suggests a local mechanism generating excess similarity between adjacent sites. (c) Histogram of partial score averages for all factor pairs binding the site pairs of (a) (grey-shaded) and corresponding distribution of averages obtained after scrambling the score values of site pairs (normalized to the same number of total counts, dashed line). The two distributions are statistically indistinguishable (KS-test p-value = 0.8378), which shows that positive duplication scores are not limited to a subset of factor pairs. (d) Histogram of the duplication score for pairs of binding sites with a mutual distance of up to 50 bp in the genome of S. cerevisiae (grey-shaded). The distribution is not significantly different from the null distribution obtained from random site pairs (normalized to the same number of total counts, dashed line), i.e., there is no evidence for common descent as prevalent evolutionary mode.
Figure 4: Adaptive potential of binding site duplications. The binding correlation of all pairs of Drosophila transcription factors which have adjacent binding sites in a common regulatory module is evaluated as the average information-based fitness of -sites for factor and plotted against the sequence information of the binding motif of factor (blue crosses); see eqs. (20) and (21) in Materials and Methods. The binding correlation is compared to the distribution of fitness values of the -sites (red dots, the average fitness for each factor is shown as diamond and equals the abscissa ) and to the average fitness in background sequence (green dots); see eq. (22) in Materials and Methods. The binding correlation is significantly larger than the background average of and is comparable to the fitness of weak -sites in a substantial fraction of cases. Some highly specific motifs, such as hunchback, twi and z do not show binding correlations with other factors.
Figure S1: Motif detection in sequence segments (schematic). The figure shows a configuration of correlated sequence sites of length bp and distance bp from each other. Pairs of correlated sites have the following properties: (i) The average mutual similarity between aligned nucleotides is larger than a given threshold, . (ii) The left sites (and, hence, also the right sites) of all pairs have no common nucleotides. This condition is necessary in order to avoid overcounting of mutual similarity in overlapping site pairs. (iii) The sum of the mutual similarities of all pairs in the set is maximal. In the example shown, there are three different motifs with reoccurring sequence patterns marked by different colors (red, blue, green). To illustrate the alignment of the site pairs, we shift the whole sequence by bp in the second row. The left and right site of each motif are shown in boldface in the first and the second row, respectively. Mismatches between aligned sites of the same motif are shown in boldface gray letters. The flanking regions separating the correlated sequence pairs are shown in smaller font.
Figure S2: Sequence similarity in regulatory modules of 3 Drosophila species. Distance-dependent similarity information for motif length in regulatory modules (red) and in generic intergenic sequence (black), evaluated in D. melangaster and in the homologous regions of D.  yakuba and D. pseudoobscura (see Materials and Methods). These data show a consistent pattern of overall amplitudes and of decay lengths.
Figure S3: Tests of the duplication inference method. We simulate binding site pairs evolving by common descent or by independent descent, as described in Materials and Methods. (a) Dependence of the total duplication score on the time parameter for an ensemble of 150000 site pairs of common descent. This function has a pronounced maximum at a value , which is close to the mean divergence time since duplication. (b) Distributions of the score (with ) for pairs of sites binding different factors. The distribution for sites of common descent (filled curve) is distinguished from the distribution for sites with independent descent (solid curve) by its increased score average, , and by its increased width. (c) Same as (b) for pairs of sites binding the same factor. The distribution for sites of common descent (filled curve) has again an increased average, , and an increased width.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description