Exact scaling in the expansion-modification system
This work is devoted to the study of the scaling, and the consequent power-law behavior, of the correlation function in a mutation-replication model known as the expansion-modification system. The latter is a biology inspired random substitution model for the genome evolution, which is defined on a binary alphabet and depends on a parameter interpreted as a mutation probability. We prove that the time-evolution of this system is such that any initial measure converges towards a unique stationary one exhibiting decay of correlations not slower than a power-law. We then prove, for a significant range of mutation probabilities, that the decay of correlations indeed follows a power-law with scaling exponent smoothly depending on the mutation probability. Finally we put forward an argument which allows us to give a closed expression for the corresponding scaling exponent for all the values of the mutation probability. Such a scaling exponent turns out to be a piecewise smooth function of the parameter.
In recent years, several models have been introduced (such as -step Markov chains or hidden Markov chains, among others [8, 9, 16, 17]) to describe the evolution of nucleotide sequences as well as the patterns and correlations occurring in the genome. In this paper we are concerned with one of those models, proposed by W. Li , which consists of a sequence (or chain) of symbols that evolve according to a given discrete-time stochastic dynamics. Such a dynamics captures the essential processes which are assumed to be responsible of the genome evolution: the random expansion and modification of symbols (hence the name of expansion-modification system). Originally introduced as a simple model exhibiting some spatial scaling properties, a behavior ubiquitous in natural phenomena , it was subsequently used to understand the scaling properties and the long-range correlations found in real DNA sequences [2, 6, 7, 8, 13]. Recently, the expansion-modification system has also been used to investigate the universality of the rank-ordering distributions [1, 11].
From the mathematical point of view, the expansion-modification system belongs to the class of random substitution dynamical systems, which attracted some attention in recent years for their possible applications in genome evolution studies (see  and references therein). A related class of stochastic processes, inspired by randomly generated grammars, were formalized and studied by Toom and coworkers (see ). Previously, Godrèche and Luck used random substitutions to study the robustness of quasiperiodic structures , in particular structures associated to random perturbations of Fibonacci sequences and Penrose tilings. They observed that the Fourier spectrum of the structures thus obtained are of mixed type: they contain both singular and continuous parts. Fourier spectra of mixed type appear in structures corresponding to random perturbation of quasicrystals. For instance, Zaks  observed mixed spectrum in structures generated by randomized Thue-Morse sequences. In our case, the structure generated by the expansion-modification system cannot be seen as a random perturbation of a quasicrystal, as the corresponding Fourier spectrum turns out to be continuous. What the expansion-modification system shares with those randomly perturbed quasicrystals is the scaling property and the consequent power law behavior of the correlations and the Fourier spectrum. For those randomly perturbed quasicrystals, the scaling of the perturbed structure can be very easily deduced from the scaling already present in the underlying quasicrystal, by means of obvious recurrence relations derived from the inflation rules. As we will show, the scaling in the expansion-modification system derives from recurrence relations implied by the underlying dynamics, and contrary to the quasicrystal case, these recurrence relations grow in complexity in such a way that its treatment demands the implementation of nontrivial techniques. The rigorous study of this scaling behavior and the consequent power laws is the main contribution of this work.
The expansion-modification system can be described as follows. Consider the random substitution
in the binary set , and extend it coordinate-wise to the set of finite binary strings. Starting at time zero with a seed in , and iterating the above substitution, we obtain a sequence
of finite strings of non-decreasing length. Since the applied substitution is a random map, the sequence we obtain by successive iterations is a random sequence which is nevertheless supposed to converge, in a certain statistical sense, to a random string . It is easy to see that the probability of having a finite string after infinitely many iterations is zero, therefore it is more convenient to study the evolution of infinite strings under the infinite extension of the above substitution. This is precisely the point of view we will follow throughout this work.
The paper is organized as follows. In Section 2 we set up the mathematical framework where the expansion-modification system is defined. In Section 3 we state our main results, which we proof in Section 5. In Section 4 we compute a closed expression for the scaling exponent which presumably holds in the whole range of mutation probabilities. We finish the paper with some final remarks and comments.
2. The Expansion-Modification Dynamics.
In order to review the expansion-modification system, let us start fixing the relevant notation and terminology. Let , which we endow with the -algebra generated by the cylinder sets. Elements of are called configurations, and will be denoted by boldface characters like , with . Finite sequences of symbols, also called words, will be also denoted by boldfaced letters while their size will be denoted by , i.e., for we have . A word occurs as prefix of , which we denote by , if . We will also use this notation when is replaced by a finite word. Given a configuration , and integers , we denote with the word . Product of words will be understood as concatenation: given two words and , we let denote the word of size satisfying and . Consider , where the symbols e and m stand for expansion and modification respectively. The space , which we will refer to as the space of substitutions, is endowed with the -algebra generated by the cylinder sets as well. We will use the same convention to denote the elements of , words and concatenation of words, as for the symbolic space .
Let us now define the local substitutions , which are given by
A configuration of local substitutions defines the global substitution given by
Here stands for concatenation of words. Notice that replaces the -th symbol of according to the -th local substitution, i.e, if then is expanded, otherwise is modified.
The expansion-modification dynamics is a random dynamical system whose orbits depend on an initial condition and a choice of global substitutions to be applied to that initial condition. To be more precise, an initial condition and a sequence of configurations in , define the orbit in where and for each . At each time step , the global substitution is randomly chosen according to the Bernoulli measure such that and . The parameter is the mutation probability.
In terms of distributions, the expansion-modification system can be defined as follows. If is the measure according to which the time- configurations are distributed, then the distribution of time- configurations is completely determined by and according to the following expression:
for each and . As mentioned before, means that the word occurs as a suffix of the word . Hence, the evolution of the -marginal is nothing but a Markov chain. Indeed, considering the -marginal of a measure as a probability vector of dimension , the -marginal , of the time- distribution is given by matrix product , where is the -stochastic matrix given by
Our first result states the existence and uniqueness of the stationary distribution, which turns out be a global attractor for the expansion-modification dynamics.
Theorem 1 (Existence and Uniqueness).
For each there exists a unique measure on which is invariant under the expansion-modification dynamics. Furthermore, starting from any measure determining the distribution of the initial conditions, the measure , corresponding to the distribution at time , converges in the *-weak sense to .
It is not hard to see that uniqueness does not hold for . For , when only expansion is possible, each convex combinations of the Dirac measures at all-zeros and all-ones, is an admissible invariant distribution. On the other hand, in the case , the dynamic of each initial distributions enters a period-two cycle, excepting for the measures that are flip-invariant (). In both cases, the asymptotic regime depends on the initial distribution.
In  Toom and coworkers prove the existence of invariant measures for substitution operators similar to, but not including, the expansion-modification dynamics. In their case, due to random deletion of words, the dynamics cannot be reduced to the action of stochastic matrices over finite length marginals.
The main result of this paper establishes the power-law decay of correlations exhibited by the unique stationary measure . Before stating this theorem, let us remind the main definitions.
The two-sites correlation function, , is given by
where denotes, as usual, the projection of on the -th coordinate. Following the usual practice, we say that has decay of correlations if .
Let , and for each , let
Theorem 2 (Power-law Decay of Correlations).
For each there exist and constants such that
for all .
The invariance of the expansion-modification dynamics under coordinatewise negation, or flip invariance, , implies that for all . Therefore
Now, flip invariance also implies and , for each . With this, we obtain a very simple expression for the two-sites correlation
This expression and the invariance under the expansion-modification dynamics will allow to deduce a recurrence formula for the two-sites correlation function.
Let us denote by the length of the words obtained by applying the substitution , and for each let
Since is left invariant under the expansion-modification dynamics, and since is a Bernoulli measure, then
and similarly for . From the previous equation and its analogous for , it readily follows that
with , , and . This relation can be rewritten as
with defined as
We will make extensive use of the previous relations in the proof of Theorem 2.
4. The scaling exponent
According to Theorem 2, the correlation function follows an asymptotic scaling law for mutation probabilities in the range . It also establishes an expression for the scaling exponent . The following argument leads us to conjecture that the scaling property, and the expression for the corresponding scaling exponent, extend to the whole interval .
Let us consider the recursive relation
deduce above. The distribution , which is unimodal with maximum at , steepens around this maximum as goes to infinity in such a way that
where , which we define below in Subsection 5.2, are such that both and tend to 1 as goes to infinity. Hence, assuming a slow variation in , we have
where . From this we finally obtain the approximate scaling relation
which traduces into the scaling law , with
We have used the recurrence relation in Equation (4) to numerically compute the two-sites correlation function for different values of the mutation probability. As shown in Figure 1, the numerical computations confirm that the two-point correlation function approximatively follows a power-law behavior. Furthermore, according to Figure 2, the theoretically predicted exponents, , fit very well the ones obtained by linear regression from the numerically computed correlation functions.
The argument developed above suggests that the stationary measure varies in a piecewise smooth way with . This variation is reflected on the behavior of the two-sites correlation function , which appears to follow a power law decay which prevails in the whole interval , except for the two singularities located at and . At precisely those values of , the two-sites correlation function appears to decay faster than any power law.
Our proof of Theorem 1 requires the following.
For each there exists , such that .
For and , let whenever , otherwise let . Clearly, for each we have
where is such that for each . We claim that for each there exists such that
which readily implies the result.
Since and , the claim holds for . Assuming the claim for we have, for all , a sequence such that . If even and or odd and , by taking for , we have
On the other hand, if even and or odd and , by taking for , and , we have
and the lemma follows. ∎
Proof of Theorem 1. Let us assume that for each , the stochastic matrix is primitive. This implies, by the Perron-Frobenius Theorem, that there is a unique probability vector such that
for every initial probability vector . Hence, for any measure specifying the distribution of the initial conditions, for each , and for all we have
If in addition the probability vectors satisfy the compatibility condition
for each and , then Kolmogorov’s representation theorem implies the existence of a measure on such that for each and . Finally, Equation (6) ensures the convergence of towards in the *-weak sense.
The primitivity of follows straightforwardly from the following argument. As proved in Lemma 1, for each pair of words , there exist a sequence of substitutions such that applied to produces a word having as prefix. Now, since all words in have positive probability, then the previous claim implies that for some , which proves that is irreducible. Now, since the word occurs as the prefix of , then , which implies that is aperiodic, and so is primitive.
To prove the compatibility condition (7), we should notice that it is inherited from the analogous compatibility condition satisfied by all the marginals at each time . Indeed, for we obviously have
for each and . Here stands for the disjoint union. Now, from Equation (1) it follows that
Since , the statement is equivalent to , and we have
for each and . The compatibility condition (7) follows by taking the limit on both sides of the equation, which completes the proof of the theorem.
This subsection is devoted to the proof of Theorem 2, which relies in Lemmas 2 and 3 and other auxiliary propositions which we state without proof. The proofs of these auxiliary results is referred to Subsections 5.3, 5.4 and 5.5.
Lemma 2 (Upper bound).
For each , the stationary measure has decay of correlations bounded above by a power law. Indeed, for each there exist positive constants and such that
for all .
Lemma 3 (Lower bound).
There are constants and , such that for all and all .
This lemma has the following straightforward consequences that we will use below.
For each and , there exists such that
for each .
Let be as in Theorem 3. Then for each define the function as follows
With this, define the functions given by
Finally, to simplify the expressions that will appear, define
Let us remind that, for and each ,
with , , and and
We have the following.
If and , then .
A simple computation shows that
Since , then for . The result follows from the fact that and have the same sign. ∎
For each there exists such that
for all .
For each and we have
There exists such that, for each there are constants such that for every we have
Furthermore, as .
Proof of Theorem 2. Fix and , and let , and be as above. Let be as in Equation (5). Note that implies . Note also that for all , and that for all . From this, using Equation (3) and Proposition 2, we obtain
for each .
A simple computation shows that for , there exists such that for all . Hence, according to Proposition 1, for all and . In this case we can define a probability distribution proportional to , in the interval .