# From the divergence between two measures to the shortest path between two observables

###### Abstract

We consider two independent and stationary measures over $\chi^{\mathbb{N}}$, where $\chi$ is a finite or countable alphabet. For each pair of $n$-strings in the product space we define $T_n^{(2)}$ as the length of the shortest path connecting one of them to the other. Here the paths are generated by the underlying dynamic of the measures. If they are ergodic and have positive entropy we prove that, for almost every pair of realizations $(x,y)$, $T_n^{(2)}/n$ concentrates in one, as $n$ diverges. Under mild extra conditions we prove a large deviation principle. We also show that the fluctuations of $T_n^{(2)}$ converge (only) in distribution to a non-degenerate distribution. These results are all linked to a quantity that computes the similarity between those two measures. It is the so-called divergence between two measures, which is also introduced. Throughout this paper, several examples are provided.

Running head: The shortest path between two strings.

Subject class: 37xx, 41A25, 60Axx , 60C05, 60Fxx.

Keywords: Poincaré recurrence, shortest path, large deviations.

## 1 Introduction: the shortest-path function

Suppose one has to build a communication net consisting of nodes and links between nodes. A question of major interest is how to design the net so that it is easy to communicate from each node to any other without paying the cost of constructing a large number of links.

In this paper we study a quantity which describes the structural complexity of the net. Given two nodes, it gives the length of the shortest path from one node to the other. We consider the case where the nodes are given by the partition in $n$-cylinders or $n$-strings of the phase space. Specifically, we consider a finite or countable set $\chi$. For each $n$, the nodes correspond to the partition of $\chi^{\mathbb{N}}$ into $n$-cylinders or $n$-strings. We also consider two independent probability measures $\mu$ and $\nu$ over $\chi^{\mathbb{N}}$. The address node is chosen according to one of them and the source node according to the other. We assume that both measures are ergodic and that one of them is absolutely continuous with respect to the other, since otherwise the communication could be impossible. We denote by $T_n^{(2)}$ the function that gives the length of the shortest path that communicates two $n$-strings. The link between these two strings is driven by the shift operator $\sigma$ over $\chi^{\mathbb{N}}$. That is, for $x=(x_0,x_1,x_2,\dots)$ one gets $\sigma(x)=(x_1,x_2,\dots)$.

Let us introduce the cornerstone of this paper. It is the quantity that gives the minimum number of steps needed to get from one string to another, and it is defined next.

###### Definition 1.1.

The shortest-path function $T_n^{(2)}$ is defined by

$$T_n^{(2)}(x,y)=\inf\{k\ge 1 \,:\, y_0^{n-1}\cap\sigma^{-k}(x_0^{n-1})\neq\emptyset\}.$$

Here and ever after we write as shorthand of for any . To illustrate this definition, let us look at the word ABRACADABRA in three different languages: If are such that ABRACADABRA (English), AVRAKEHDABRA (Aramaic) and ABBADAKEDABRA (Chaldean). Then since, considering the first 11 letters of and , we have to shift 8 times to be able to connect it with . Similarly . Further we have and .
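The definition can be made concrete with a minimal Python sketch. It assumes complete grammar (every finite string is admissible), so that the cylinder intersection in Definition 1.1 is non-empty exactly when a suffix of the first $n$ symbols of $y$ equals a prefix of the first $n$ symbols of $x$. The function name `shortest_path` and the assignment of $x$ to the English spelling and $y$ to the Aramaic one are our assumptions, made only for illustration.

```python
def shortest_path(x, y, n):
    """Sketch of T_n^(2)(x, y), ASSUMING complete grammar.

    x, y: sequences (strings, lists or tuples) with at least n symbols.
    Returns the smallest k >= 1 such that some realization starting with
    y[:n] shows x[:n] at position k.
    """
    xs, ys = x[:n], y[:n]
    if len(xs) < n or len(ys) < n:
        raise ValueError("both sequences need at least n symbols")
    for k in range(1, n):
        # Shifting k times works iff the last n-k symbols of ys
        # coincide with the first n-k symbols of xs.
        if ys[k:] == xs[:n - k]:
            return k
    # No overlap at all: write ys and then xs right after it.
    return n


x = "ABRACADABRA"   # assumed to play the role of x (English)
y = "AVRAKEHDABRA"  # assumed to play the role of y (Aramaic)
print(shortest_path(x, y, 11))  # 8, matching the eight shifts in the text
```

With these assumptions the last three letters ABR of the first 11 letters of `y` overlap the first three letters of `x`, which is why eight shifts suffice.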

The random variable $T_n^{(2)}$ is a two-dimensional version of the shortest return function, which gives the length of the shortest path starting from a node and arriving at the same node. Its concentration phenomenon has already been studied in [19, 5]. A large deviation principle was related to the Rényi entropy in [4, 10, 1]. Limit theorems for its fluctuations were presented in [2, 3]. Since the shortest return function takes the starting and target sets to be the same, while $T_n^{(2)}$ allows them to come from different measures, the two quantities have a different nature. In topological terms, the former describes a local characteristic of the connection net, while the latter describes a global one.

In this paper we prove three fundamental theorems which describe the net through the statistical properties of $T_n^{(2)}$: concentration, large deviations and fluctuations.

Firstly we prove that $T_n^{(2)}/n$ converges almost surely to one, as $n$ diverges. Our result holds when $\mu$ has positive entropy and a specification property is verified which prevents the net from being extremely sparse.

The concentration of $T_n^{(2)}/n$ leads us to study its large deviation properties, namely, the rate at which the probability that this ratio deviates from one decays to zero. We compute this rate under the additional condition that the measures verify a certain regularity condition. A similar condition was introduced and already related to the existence of a large deviation principle for the shortest return function in [1].

The limiting rate of the large deviation function of $T_n^{(2)}/n$ is determined by a quantity that deserves attention on its own. It gives a measure of similarity (or difference) between two measures (see Definition 2.1). In words, it is the expectation of the marginal distribution of order $k$ of one of them with respect to the other. Since it is symmetric, their roles are interchangeable in this definition. We call it the divergence of order $k$. We also study some of its properties that are used later on in the large deviation principle for $T_n^{(2)}$ mentioned above. We provide several examples. In many cases the divergence is a sequence that decreases exponentially in $k$, and this leads us to consider its limiting rate. One of our main results establishes the existence of the limiting rate function, which is far from evident. We use a kind of sub-additivity property, but with a telescopic technique rather than the classical linear one. We show in particular that, when the two measures coincide, this limit corresponds to the Rényi entropy of the measure at argument $2$ (see Examples 2.1).

To describe the complexity of the net, we study the distribution of the shortest path function. We compute the distribution of a re-scaled version of $T_n^{(2)}$ (namely, ) and prove that it converges to a non-degenerate distribution which depends on the stationary measures $\mu$ and $\nu$. The limiting distribution may depend on an infinite number of parameters if the measures do. This limiting distribution also depends on the divergence between the measures. As an application of this theorem we compute the proportion of pairs of $n$-strings which do not overlap (which we call the avoiding pairs set).

Regarding the distribution of $T_n^{(2)}$, we are not aware of any work which considers its behaviour in the context of stationary measures. There are some works which consider models of random graphs and present empirical data which fit the distribution of the shortest path to Weibull or Gamma distributions [6, 24]. But even for classical models, for instance the Erdős-Rényi graph, its full distribution, in a theoretical sense, has never been considered in the literature [7, 12, 22].

Since the random variables $T_n^{(2)}$ are defined on the same probability space, we further ask about a stronger convergence. Our last result shows that the re-scaled sequence does not even converge in probability, and a lower bound for the distance between two consecutive terms of the sequence is given.

Finally, we think it is also important to highlight the connection of the shortest path function with the study of Poincaré recurrence statistics. The waiting time function introduced by Wyner and Ziv in [27] is a well-studied quantity in the literature. Given two realizations $x,y$, it is the time elapsed until $x_0^{n-1}$ appears in the realization $y$ of another process. That is

$$W_n(x,y)=\inf\{k\ge 1 \,:\, y_k^{k+n-1}=x_0^{n-1}\}.$$

Now, the shortest path function is the minimum of the waiting times of $x_0^{n-1}$, taken over all the realizations $z$ that begin with $y_0^{n-1}$. That is

$$T_n^{(2)}(x,y)=\inf_{z\,:\,z_0^{n-1}=y_0^{n-1}}W_n(x,z).$$
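As a hedged illustration of the waiting time, the following Python sketch evaluates $W_n$ on a finite sample. The function name `waiting_time`, the convention of returning `None` when the pattern is not seen in the sample, and the particular continuation appended to the Aramaic prefix are our assumptions.

```python
def waiting_time(x, z, n):
    """Sketch of W_n(x, z) on a FINITE sample z.

    Returns the first k >= 1 with z[k:k+n] == x[:n], or None when the
    pattern does not occur in the available sample (our convention).
    """
    target = x[:n]
    for k in range(1, len(z) - n + 1):
        if z[k:k + n] == target:
            return k
    return None


x = "ABRACADABRA"
# A realization z that begins with the first 11 letters of AVRAKEHDABRA
# and is then continued (our choice) so that x appears as early as possible.
z = "AVRAKEHDABRA"[:11] + "ACADABRA"
print(waiting_time(x, z, 11))  # 8
```

Here the continuation of `z` is chosen by hand to realize the infimum, so the waiting time coincides with the eight shifts mentioned after Definition 1.1.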

A number of classical results are known for $W_n$. When both strings are chosen with the same measure, Shields showed that for stationary ergodic Markov chains $\frac{1}{n}\log W_n\to h$ for almost every pair of realizations, as $n$ diverges, where $h$ is the Shannon entropy of the measure [18]. Nobel and Wyner [15] had proven convergence in probability to the same limit. This result holds for mixing processes with a certain rate function . Marton and Shields extended it to weak Bernoulli processes [14]. Yet, Shields [18] constructed an example of a very weak Bernoulli process in which the limit does not hold. Finally Wyner [26] proved that the distribution of converges to the exponential law for $\psi$-mixing measures. When both strings are chosen with possibly different measures, and the second one is a Markov chain, Kontoyiannis [13] showed that for the relative entropy of with respect to .

This paper is organized as follows. In section 2 we introduce the divergence concept. Properties, examples and the proof of its existence are also included in this section. In section 3 we prove the concentration phenomenon of the shortest path function. A large deviation principle is proved in section 4. The convergence of the shortest path distribution is presented in section 5. An application to calculate the self-avoiding pairs of strings also appears there. The non-convergence in probability is shown in section 6.

## 2 The divergence between two measures

Let $\mu$ and $\nu$ be two probability measures over the same measurable space. It is natural to ask whether these two measures are related in any sense, and whether there is some function to quantify this relation. There are several quantities in the literature devoted to answering these questions. We highlight here the mutual information and the relative entropy (or Kullback-Leibler divergence), which have been extensively discussed in the literature (see for instance [9]).

The present section is dedicated to a quantity which describes the degree of similarity between two given measures. As far as we know, it has never been considered in the literature. Its definition is as follows.

###### Definition 2.1.

The $k$-divergence between $\mu$ and $\nu$ is defined by:

$$E_{\mu,\nu}(k)=\sum_{\omega\in\chi^k}\mu\nu(\omega),$$

(here and ever after, by $\mu\nu(\omega)$ we mean $\mu(\omega)\nu(\omega)$).

Consider the projections of $\mu$ and $\nu$ over the first $k$ coordinates of the space. The $k$-divergence is the mean of one of these projections with respect to the other. It is also the inner product of the two vectors with entries given by the probabilities $\mu(\omega)$ and $\nu(\omega)$ (in any arbitrary ordering of the strings $\omega\in\chi^k$). Notice that $E_{\mu,\nu}$ is symmetric ($E_{\mu,\nu}=E_{\nu,\mu}$), and that it is not null if, and only if, the supports of the two measures have non-empty intersection. The previous sentence can be interpreted as follows: if the two measures do not communicate, the similarity between them is zero.
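The inner-product description translates directly into a brute-force computation. The sketch below is only an illustration for small alphabets and small $k$: the helper names `iid_cylinder` and `divergence` are ours, and the measures are taken to be products of Bernoulli measures with parameters $p$ and $1-p$, the case treated in Examples 2.1, so that the output can be compared with $[2p(1-p)]^k$.

```python
from itertools import product
from math import prod


def iid_cylinder(p):
    """Cylinder probability of a product measure with one-symbol law p (a dict)."""
    return lambda w: prod(p[a] for a in w)


def divergence(mu, nu, alphabet, k):
    """E_{mu,nu}(k): sum of mu(w) * nu(w) over all k-strings w."""
    return sum(mu(w) * nu(w) for w in product(alphabet, repeat=k))


p = 0.3  # illustrative parameter, not taken from the paper
mu = iid_cylinder({0: 1 - p, 1: p})
nu = iid_cylinder({0: p, 1: 1 - p})

for k in range(1, 5):
    # Brute-force value next to the closed form [2p(1-p)]^k.
    print(k, divergence(mu, nu, (0, 1), k), (2 * p * (1 - p)) ** k)
```

The enumeration costs $|\chi|^k$ evaluations, so it is meant as a sanity check rather than as a practical algorithm.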

The next result says that the operation of opening a gap does not produce a smaller result. As a corollary, we conclude that the divergence is non-increasing in $k$. For simplicity, hereafter for $\omega\in\chi^i$ and $\xi\in\chi^j$ we denote by $\omega\xi$ the $(i+j)$-string constructed by concatenation of $\omega$ and $\xi$. Formally .

###### Lemma 2.1.

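Let $i$, $g$, $j$ be non-negative integers with $i+g+j=k$. Then

$$E_{\mu,\nu}(k)\le\sum_{\omega\in\chi^i;\ \zeta\in\chi^j}\mu\nu\big(\omega\cap\sigma^{-(i+g)}\zeta\big). \tag{1}$$

###### Proof.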

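Consider $\omega\in\chi^i$, $\xi\in\chi^g$ and $\zeta\in\chi^j$. Let us write the cylinder $\omega\xi\zeta$ by concatenating the three cylinders above. By removing the string $\xi$ in $\mu(\omega\xi\zeta)$ we get

$$\sum_{\xi\in\chi^g}\mu\nu(\omega\xi\zeta)\ \le\ \sum_{\xi\in\chi^g}\mu\big(\omega\cap\sigma^{-(i+g)}\zeta\big)\,\nu(\omega\xi\zeta)\ =\ \mu\nu\big(\omega\cap\sigma^{-(i+g)}\zeta\big).$$

Summing over $\omega$ and $\zeta$ in the last display we get (1). ∎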

As a direct consequence of the above lemma, we get that the divergence is monotonic in $k$.

###### Corollary 2.1.

If $k\le k'$, then $E_{\mu,\nu}(k')\le E_{\mu,\nu}(k)$.

###### Proof.

This follows by taking $j=0$ in Lemma 2.1. ∎

The above corollary proves that the divergence is a non-increasing function of $k$. In many cases, it decreases at an exponential rate. It is natural to ask about the existence of the limiting rate function, that is

$$\underline{R}=\liminf_{k\to\infty}\left(-\frac{1}{k}\log E_{\mu,\nu}(k)\right);\qquad \overline{R}=\limsup_{k\to\infty}\left(-\frac{1}{k}\log E_{\mu,\nu}(k)\right).$$

If both limits are equal, we denote it by $R$. (Throughout this paper logarithms can be taken in any base.)
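To get a feeling for the rate $-\frac{1}{k}\log E_{\mu,\nu}(k)$ beyond product measures, here is a hedged sketch for two stationary two-state Markov measures. The forward recursion is our own illustration and not a construction from the paper: it uses only that a Markov cylinder probability is the stationary weight of the first symbol times the product of transition probabilities, so that the sum defining the divergence factorizes along transitions. The transition matrices `P` and `Q` are arbitrary illustrative choices.

```python
from math import log


def stationary_two_state(P):
    """Stationary law of a two-state chain with transition matrix P."""
    a, b = P[0][1], P[1][0]
    return [b / (a + b), a / (a + b)]


def markov_divergences(P, Q, kmax):
    """List [E(1), ..., E(kmax)] for the stationary Markov measures of P and Q."""
    pi, rho = stationary_two_state(P), stationary_two_state(Q)
    # v[a] = sum of mu(w) * nu(w) over all strings w of current length ending in a.
    v = [pi[a] * rho[a] for a in (0, 1)]
    values = [sum(v)]
    for _ in range(kmax - 1):
        v = [sum(v[a] * P[a][b] * Q[a][b] for a in (0, 1)) for b in (0, 1)]
        values.append(sum(v))
    return values


P = [[0.9, 0.1], [0.4, 0.6]]  # illustrative transition matrices
Q = [[0.5, 0.5], [0.2, 0.8]]
E = markov_divergences(P, Q, 30)

for k in (1, 5, 10, 20, 30):
    print(k, -log(E[k - 1]) / k)  # empirical rate at order k
```

The printed values give a numerical impression of how the rate stabilizes as $k$ grows in this particular example.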

In what follows, we provide some examples. They illustrate cases for the existence (or not) of $R$. Finally we state the main result of this section: a general condition under which the limiting rate exists. Further, in section 4.1, we will relate this limiting rate to a large deviation principle for $T_n^{(2)}$.

###### Examples 2.1.

Let $\mu$ and $\nu$ be two independent measures with disjoint supports. Then $E_{\mu,\nu}(k)=0$ for all $k$, and therefore $R=\infty$.

Suppose that $\mu$ and $\nu$ concentrate their mass in a unique realization $x$ of the process. Then, for any $k$, $\mu\nu(\omega)=1$ if, and only if, $\omega=x_0^{k-1}$ (and zero otherwise). Thus we get $R=0$.

If both measures $\mu$ and $\nu$ have independent and identically distributed marginals we get

$$E_{\mu,\nu}(k)=\sum_{\omega\in\chi^k}\mu\nu(\omega)=\Big[\sum_{x_0\in\chi}\mu\nu(x_0)\Big]^k=E_{\mu,\nu}(1)^k.$$

Therefore, the limit exists and is given by

$$R=-\log E_{\mu,\nu}(1).$$

Let $\mu$ be a product of Bernoulli measures with parameter $p$ and $\nu$ a product of Bernoulli measures with parameter $1-p$. Then

$$E_{\mu,\nu}(k)=\sum_{x_0^{k-1}\in\chi^k}\ \prod_{i=0}^{k-1}\mu\nu(x_i)=2^kp^k(1-p)^k.$$

Then we get

$$R=-\log[2p(1-p)].$$

If $\mu=\nu$, we get that $R=H_\mu(2)$, where

$$H_\mu(\beta)=-\lim_{k\to\infty}\frac{1}{k(\beta-1)}\log\sum_{\omega\in\chi^k}\mu^\beta(\omega),$$

is the Rényi entropy of the measure $\mu$ (provided that it exists).

A case where the limiting rate does not exist: a sequence that does not satisfy the law of large numbers. Let $\chi=\{0,1\}$, and let $\mu$ be a measure concentrated on the realization:

$$x=0^{2^1}1^{2^2}0^{2^3}1^{2^4}\cdots0^{2^k}1^{2^{k+1}}\cdots$$

where means the -string . On the other hand, let $\nu$ be a product of Bernoulli measures with . By a direct computation, we get that . Since the proportion of $0$'s and $1$'s in does not converge as goes to infinity, we get

Let $\mu$ be an ergodic, positive entropy measure. By the Shannon-McMillan-Breiman Theorem, $-\frac{1}{k}\log\mu(x_0^{k-1})$ converges to $h$ for almost every $x$, where $h$ is the entropy of $\mu$. Let $x$ be one such sequence. Let $\nu$ be a measure concentrated on $x$. Then $E_{\mu,\nu}(k)=\mu(x_0^{k-1})$ and $R=h$.

The following theorem gives sufficient conditions for the existence of the limiting rate $R$. Its proof uses a kind of sub-additive property. But here, instead of the classical linear iteration of the sub-additivity property, we use a geometric iteration. To prove the existence of the limiting rate function we use a kind of regularity condition, which is a version of the condition introduced in [1]. This condition was used to prove a large deviation principle for the shortest return function of a string to itself. That principle related the deviations of the shortest return function to the Rényi entropies of the measure. Examples which show its generality, as well as other properties, can also be found there.

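In what follows, we present two quantities that will be very useful throughout this section. Let $g$ be a fixed non-negative integer. For the measure $\mu$ (resp. $\nu$) set

$$\psi^+_{\mu,g}(i,j)=\sup_{\omega\in\chi^i,\ \xi\in\chi^j}\frac{\mu\big(\sigma^{-(i+g)}\xi \mid \omega\big)}{\mu(\xi)}, \tag{2}$$

and then

$$\psi^+_g=\max\{\psi^+_{\mu,g},\psi^+_{\nu,g}\}.$$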

Now we are ready to state the main result of the present section. It provides a general condition for the existence of the limiting rate $R$.

###### Theorem 2.1.

Suppose there exist positive constants $K$ and $\epsilon$ such that

$$\log\psi^+_g(i,j)\le K\,\frac{i+j}{[\log(i+j)]^{1+\epsilon}}. \tag{3}$$

Then $R$ does exist.

For instance, if $\mu$ and $\nu$ have independent marginals, we take $g=0$. Immediate calculations give $\psi^+_0(i,j)=1$ for all $i$ and $j$, and condition (3) is satisfied. Moreover, if $\mu$ and $\nu$ are stationary measures of irreducible, aperiodic, positive recurrent Markov chains over a finite alphabet then the Markov property gives

$$\psi^+_{\mu,0}(i,j)=\sup_{\omega\in\chi,\ \xi\in\chi}\frac{\mu\big(\sigma^{-1}\xi \mid \omega\big)}{\mu(\xi)},$$

(resp. $\psi^+_{\nu,0}$) which is finite since $\chi$ is finite, and (3) is verified. Abadi and Cardeño [1] constructed several examples of processes of renewal type, with $g$ equal to zero and one, with exponential or sub-exponential measure of cylinders, which verify (3). Measures which verify the classical $\psi$-mixing condition, for each $g$ fixed, have $\psi^+_g$ constant and thus (3) holds.

Now we present the proof for the Theorem.

###### Proof.

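Let us take $\omega\in\chi^i$ and $\zeta\in\chi^j$. As in the proof of Lemma 2.1, $\sum_{\xi\in\chi^g}\mu\nu(\omega\xi\zeta)\le\mu\nu\big(\omega\cap\sigma^{-(i+g)}\zeta\big)$. Further, by (2)

$$\mu\big(\omega\cap\sigma^{-(i+g)}\zeta\big)\le\psi^+_g(i,j)\,\mu(\omega)\mu(\zeta).$$

And the same holds for $\nu$. Call $f(k)=\log E_{\mu,\nu}(k)$. Also, call $c_g(i,j)=2\log\psi^+_g(i,j)$. Summing up in $\omega$ and $\zeta$ and taking logarithm, by the inequalities above, we conclude that for all $i,j$

$$\begin{aligned} f(i+g+j) &\le \log\big(\psi^+_g(i,j)\,\psi^+_g(i,j)\big)+\log\sum_{\omega\in\chi^i}\mu\nu(\omega)+\log\sum_{\zeta\in\chi^j}\mu\nu(\zeta)\\ &= c_g(i,j)+f(i)+f(j). \end{aligned} \tag{4}$$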

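Now we use a kind of sub-additivity argument. Let $(n_t)_{t\in\mathbb{N}}$ be an increasing sequence of non-negative integers such that

$$\liminf_{n\to\infty}\frac{f(n)}{n}=\lim_{t\to\infty}\frac{f(n_t)}{n_t}. \tag{5}$$

Consider the sequence $(\tilde n_t)_{t\in\mathbb{N}}$ with $\tilde n_t=n_t+g$. Fix $t$. Firstly, for any positive integer $n$ write $n=\tilde n_tm+r$ with positive integers $m$ and $r$ such that . Apply (4) with $i=\tilde n_tm-g$, $j=r$ and gap $g$ to get

$$f(n)\le c_g(\tilde n_tm-g,r)+f(\tilde n_tm-g)+f(r). \tag{6}$$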

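Now, we write $m$ in base 2. For this, there exist a positive integer $\ell(m)$ and non-negative integers $\ell_1,\dots,\ell_{\ell(m)}$ such that $m=\sum_{s=1}^{\ell(m)}2^{\ell_s}$. Iterating (4) with and , for , we have that for the middle term in the right-hand side of (6)

$$f(\tilde n_tm-g)\le\sum_{u=2}^{\ell(m)}c_g\Big(\tilde n_t\sum_{s=1}^{u-1}2^{\ell_s}-g,\ \tilde n_t2^{\ell_u}-g\Big)+\sum_{u=1}^{\ell(m)}f\big(\tilde n_t2^{\ell_u}-g\big). \tag{7}$$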

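The first sum in the right-hand side is zero in case $\ell(m)=1$. Finally, we decompose the argument in the last summation. For any argument of the form $\tilde n_t2^{\ell}-g$ we apply (4) with $i=j=\tilde n_t2^{\ell-1}-g$, to get

$$f(\tilde n_t2^{\ell}-g)\le c_g\big(\tilde n_t2^{\ell-1}-g,\ \tilde n_t2^{\ell-1}-g\big)+2f\big(\tilde n_t2^{\ell-1}-g\big).$$

An iteration of the above inequality leads to

$$f(\tilde n_t2^{\ell}-g)\le\sum_{s=0}^{\ell-1}2^{\ell-s-1}c_g\big(\tilde n_t2^{s}-g,\ \tilde n_t2^{s}-g\big)+2^{\ell}f(\tilde n_t-g). \tag{8}$$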

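Observe that . Collecting (6), (7), (8) we conclude that the limit superior of $f(n)/n$ is upper bounded by the limit superior of $I+II+III+IV$ where

$$\begin{aligned} I &= \frac{c_g(\tilde n_tm-g,r)}{\tilde n_tm-g+r},\\ II &= \frac{1}{\tilde n_tm}\sum_{u=2}^{\ell(m)}c_g\Big(\tilde n_t\sum_{s=1}^{u-1}2^{\ell_s}-g,\ \tilde n_t2^{\ell_u}-g\Big),\\ III &= \frac{1}{\tilde n_tm}\sum_{u=1}^{\ell(m)}\sum_{s=0}^{\ell_u-1}2^{\ell_u-s-1}c_g\big(\tilde n_t2^{s}-g,\ \tilde n_t2^{s}-g\big),\\ IV &= \frac{f(r)}{\tilde n_tm}+\frac{\sum_{u=1}^{\ell(m)}2^{\ell_u}f(n_t)}{\tilde n_tm}. \end{aligned}$$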

As diverges, the first term in vanishes since . The second one is bounded by . goes to zero by (3). We recall that in case . Otherwise we also use condition (3) to get the following upper bound

$$\frac{K}{m}\sum_{u=2}^{\ell(m)}\frac{\sum_{s=1}^{u}2^{\ell_s}}{\big[\log\big(\tilde n_t\sum_{s=1}^{u}2^{\ell_s}-2g\big)\big]^{1+\epsilon}}.$$

The inner summation is trivially bounded by . Since , it follows that . Lastly using also (3) we get that is upper bounded by

$$\frac{K}{m}\sum_{u=1}^{\ell(m)}2^{\ell_u}\sum_{s=0}^{\ell_u-1}\frac{1}{\big[\log\big(\tilde n_t2^{s+1}-2g\big)\big]^{1+\epsilon}}.$$

Changing the constant we can take here the logarithm in base 2. The argument in the logarithm is lower bounded by . Now we use that the sum of a decreasing sequence is bounded above by its first term plus the definite integral between the first and last terms of the sum. Thus, the rightmost sum in the above display is bounded by

$$\sum_{s=1}^{\infty}\frac{1}{[s+\log n_t]^{1+\epsilon}}\le\frac{1}{[1+\log n_t]^{1+\epsilon}}+\frac{1}{\epsilon[1+\log n_t]^{\epsilon}},$$

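which goes to zero as $t$ diverges. Summarizing we conclude that

$$\limsup_{n\to\infty}\frac{f(n)}{n}\le\frac{f(n_t)}{n_t}+\frac{K}{[\log n_t]^{1+\epsilon}}+\frac{1}{\epsilon[1+\log n_t]^{\epsilon}}.$$

The inequality holds for every $t$. If we let $t$ diverge then we finish the proof, since $f(n_t)/n_t$ converges by hypothesis. ∎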

## 3 Concentration

The present section is dedicated to the study of the asymptotics of $T_n^{(2)}$. Intuition says that the bigger $n$ is, the more difficult it is to connect two $n$-strings. Thus we expect $T_n^{(2)}$ to be increasing in $n$. The question is at which rate. The main result of this section says that $T_n^{(2)}/n$ converges almost surely to one. The proof is divided in two parts. The first part proves that the limit inferior is lower-bounded by one and the second one states that the limit superior is upper-bounded by one. For the latter we assume that the process verifies the very weak specification property (see Definition 3.1 below).

There are several definitions of specification in the literature of dynamical systems. The first one was introduced by Bowen [8]. Many others appeared, mainly following him with some divergence in the nomenclature [20, 11], or in weaker forms (see for instance [21, 23, 25]). Basically they mean that, for any given set of strings, they can be observed (at least in one single realization of the process) with bounded gaps between them. Sometimes the realization is required to be periodic. For the reader's convenience, we present our condition here. It is easy to see that it is verified for a large class of stochastic processes and is less restrictive than the previous ones. Examples are provided below.

###### Definition 3.1.

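A measure $\mu$ is said to have the very weak specification property (VWSP) if there exists a function $g:\mathbb{N}\to\mathbb{N}$ with $g(n)/n\to 0$ that verifies the following: for any pair of strings $\omega,\xi\in\chi^n$, there exists an $x$ such that

$$x_0^{n-1}=\omega\quad\text{and}\quad x_{n+g(n)}^{2n+g(n)-1}=\xi.$$

###### Examples 3.1.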

Any process with complete grammar verifies Definition 3.1 with $g\equiv 0$. We recall that a probability measure $\mu$ defined over $\chi^{\mathbb{N}}$ is said to have complete grammar if, for all $n$, we get $\mu(\omega)>0$ for all $\omega\in\chi^n$.

An irreducible and aperiodic Markov chain over a finite alphabet and stationary measure verifies the VWSP with .

We first construct a renewal process as an image of the House of Cards Markov chain with irreducible and aperiodic transition matrix given by

$$Q(y,0)=1-q_y,\qquad Q(y,y+1)=q_y,$$

. Figure 1 represents the transitions of this process.

Let if , and if , indicating the "renewal" of . Take , for all for some , and any other for the remaining coefficients to guarantee that the Markov chain is positive recurrent. Obviously does not have complete grammar. It is easy to see that and that this bound is actually sharp, taking . Thus verifies the VWSP. On the other hand, the stationary measure of the House of Cards Markov chain itself is an example that does not verify the VWSP.

Now we can state the concentration theorem, which is the main result of this section.

###### Theorem 3.1.

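If $\mu$ has positive entropy, then

• $\displaystyle\liminf_{n\to\infty}\frac{T_n^{(2)}}{n}\ge 1,\quad \mu\times\nu\text{-a.e.}$

• In addition, if verifies the VWSP, then

$$\lim_{n\to\infty}\frac{T_n^{(2)}}{n}=1,\quad \mu\times\nu\text{-a.e.}$$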

Before proving the above result, let us introduce a family of sets and a result that will be useful for the proof.

###### Definition 3.2.

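For each $k\le n$ define the set $R_n^{(2)}(k)$ of pairs $(x,y)$ such that the first $k$ symbols of $x$ coincide exactly with the last $k$ symbols of $y_0^{n-1}$. Namely

$$R_n^{(2)}(k)=\{(x,y)\in\chi^{\mathbb{N}}\times\chi^{\mathbb{N}} \,:\, y_{n-k}^{n-1}=x_0^{k-1}\}.$$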

For instance, if are such that and , then .

The following result establishes a connection between the shortest-path function and the $R_n^{(2)}$ sets.

###### Lemma 3.1.

For , it holds

$$\{T_n^{(2)}\le k\}\subseteq\bigcup_{i=n-k}^{n-1}R_n^{(2)}(i).$$

In addition, if satisfies the complete grammar condition, then the equality holds.

###### Proof.

By definition, belongs to , if and only if, there is and such that and . In particular, since , we have , which in turn says that . For the equality, notice that for any pair , we get that . The complete grammar condition ensures that there exists such that and . This concludes the proof. ∎

The next lemma gives the key connection with the divergence of and .

From now on we mean by $P$ the product measure $\mu\times\nu$.

###### Lemma 3.2.

For , and a stationary measure, it holds

$$P\big(R_n^{(2)}(k)\big)=E_{\mu,\nu}(k).$$

###### Proof.

If then

$$P\big(R_n^{(2)}(k)\big)=P\big(x_0^{k-1}=y_{n-k}^{n-1}\big)=\sum_{\omega\in\chi^k}\mu\nu(\omega).$$

Since the last term above is equal to $E_{\mu,\nu}(k)$, we finish the proof. ∎
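Lemma 3.2 can be verified by exact enumeration in a small case. The sketch below takes two product measures on a binary alphabet (an illustrative choice; the one-symbol laws are not from the paper), sums $\mu(x_0^{n-1})\nu(y_0^{n-1})$ over all pairs of $n$-strings in $R_n^{(2)}(k)$ and compares the result with the $k$-divergence. All function names are ours.

```python
from itertools import product
from math import prod


def iid_cylinder(p):
    """Cylinder probability of a product measure with one-symbol law p (a dict)."""
    return lambda w: prod(p[a] for a in w)


def divergence(mu, nu, alphabet, k):
    """E_{mu,nu}(k): sum of mu(w) * nu(w) over all k-strings w."""
    return sum(mu(w) * nu(w) for w in product(alphabet, repeat=k))


def prob_overlap(mu, nu, alphabet, n, k):
    """P(R_n^(2)(k)) under mu x nu, summing over all pairs of n-strings."""
    total = 0.0
    for xs in product(alphabet, repeat=n):
        for ys in product(alphabet, repeat=n):
            if ys[n - k:] == xs[:k]:
                total += mu(xs) * nu(ys)
    return total


mu = iid_cylinder({0: 0.7, 1: 0.3})  # illustrative one-symbol laws
nu = iid_cylinder({0: 0.4, 1: 0.6})
n = 5
for k in range(1, n + 1):
    print(k, prob_overlap(mu, nu, (0, 1), n, k), divergence(mu, nu, (0, 1), k))
```

Stationarity enters through the fact that the law of the last $k$ of the first $n$ symbols of $y$ is the same as the law of its first $k$ symbols, which holds trivially for product measures.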

Now we are able to prove theorem 3.1.

###### Proof of theorem 3.1.

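For the first item, let $h$ be the entropy of $\mu$. Since $\mu$ is ergodic, the Shannon-McMillan-Breiman Theorem says that

$$-\lim_{n\to\infty}\frac{1}{n}\log\mu(x_0^{n-1})=h,$$

along $\mu$-almost every $x$. By Egorov's Theorem, for every $\epsilon>0$, there exists a subset $\Omega_\epsilon$ of $\Omega$, where this convergence is uniform and $\mu(\Omega_\epsilon)>1-\epsilon$. That is, there exists a $k_0(\epsilon)$ such that for all $k\ge k_0(\epsilon)$

$$e^{-k(h+\epsilon)}<\mu(x_0^{k-1})<e^{-k(h-\epsilon)} \tag{9}$$

for all $x\in\Omega_\epsilon$. Making the product with $\Omega$ and using Lemmas 3.1 and 3.2 we get

$$\begin{aligned} P\big(\{T_n^{(2)}\le(1-\epsilon)n\}\cap\Omega_\epsilon\times\Omega\big) &\le \sum_{j=\lfloor\epsilon n\rfloor}^{n-1}P\big(R_n^{(2)}(j)\cap\Omega_\epsilon\times\Omega\big)\\ &= \sum_{j=\lfloor\epsilon n\rfloor}^{n-1}\ \sum_{\omega\in\chi^j\cap\Omega_\epsilon}\mu\nu(\omega)\\ &\le \sum_{j=\lfloor\epsilon n\rfloor}^{n-1}e^{-j(h-\epsilon)}, \end{aligned}$$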

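where the last inequality was obtained using (9). A direct computation gives

$$\sum_{n=1}^{\infty}P\big(T_n^{(2)}\le(1-\epsilon)n\big)\le\frac{1}{1-e^{-(h-\epsilon)}}.$$

By the Borel-Cantelli Lemma, the event $\{T_n^{(2)}\le(1-\epsilon)n\}$ occurs only finitely many times. We conclude

$$\liminf_{n\to\infty}\frac{T_n^{(2)}}{n}\ge 1-\epsilon,\qquad \mu\times\nu\text{-a.s. in }\Omega_\epsilon\times\Omega. \tag{10}$$

Since $\epsilon$ is arbitrary, this finishes the proof of the first item.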

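Now we prove the second item. Since Definition 3.1 implies that $T_n^{(2)}\le n+g(n)$, we divide both sides by $n$ and get

$$\limsup_{n\to\infty}\frac{T_n^{(2)}}{n}\le 1,\quad \mu\times\nu\text{-a.e.}$$

Combining this with the first item, we finish the proof of the second item. ∎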

## 4 Large deviations

In the previous section we showed that $T_n^{(2)}/n$ concentrates its mass in $1$, as $n$ diverges. Here, we present the deviation rate for this limit. Since the VWSP implies that

$$P\left(\frac{T_n^{(2)}}{n}>1+\epsilon\right)=0,\qquad \forall n>n_0(\epsilon),$$

it is only meaningful to consider the lower deviation.

###### Definition 4.1.

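We define the lower and upper limits for the lower deviation rate, respectively, as

$$\underline{\Delta}(\epsilon)=\liminf_{n\to\infty}\frac{1}{n}\left|\log P\left(\frac{T_n^{(2)}}{n}<1-\epsilon\right)\right|,$$

and

$$\overline{\Delta}(\epsilon)=\limsup_{n\to\infty}\frac{1}{n}\left|\log P\left(\frac{T_n^{(2)}}{n}<1-\epsilon\right)\right|.$$

If $\underline{\Delta}(\epsilon)=\overline{\Delta}(\epsilon)$ we write simply $\Delta(\epsilon)$.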

We recall that the complete grammar condition assures that .

###### Theorem 4.1.

Let and be two stationary probability measures defined over . Then

• and

• Suppose that has complete grammar. Then the equalities hold in .

The regularity of the measure assures the existence of .

###### Corollary 4.1.

Under the conditions of Theorem 4.1, suppose further that has complete grammar. Then .

###### Proof of theorem 4.1.

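By Lemma 3.1, we have

$$\left\{\frac{T_n^{(2)}}{n}<1-\epsilon\right\}\subseteq\bigcup_{j=\lceil n\epsilon\rceil}^{n-1}R_n^{(2)}(j),$$

with equality if the process has complete grammar. In this case, considering just the first set in the union, we also have

$$R_n^{(2)}(\lceil n\epsilon\rceil)\subseteq\left\{\frac{T_n^{(2)}}{n}<1-\epsilon\right\}.$$

Thus, by Lemma 3.2

$$E_{\mu,\nu}(\lceil n\epsilon\rceil)\le P\left(\frac{T_n^{(2)}}{n}<1-\epsilon\right)\le\sum_{j=\lceil n\epsilon\rceil}^{n-1}E_{\mu,\nu}(j).$$

Now we take logarithm, divide by $n$, take the limit and use that the divergence is non-increasing. A change of variables ends the proof. ∎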

## 5 Convergence in law

In this section we prove the convergence of the normalized distribution of $T_n^{(2)}$ to a non-degenerate distribution. We also present several examples and provide an application for the main result of the section.

To state the result we first need to introduce the coefficients that appear in the theorem.

###### Definition 5.1.

Set and for every , define:

•  .

Here, we denote by $R_n(k)$ a set which is a one-dimensional version of $R_n^{(2)}(k)$. Namely

$$R_n(k)=\{x_0^{n-1}\in\chi^n \,:\, x_{n-k}^{n-1}=x_0^{k-1}\}.$$

Now we can state the main result of this section.

###### Theorem 5.1.

Suppose has complete grammar. Then, for all , it holds:

•  .

•  .

In the next examples we discuss several cases of applications of the above theorem.

###### Examples 5.1.

The theorem does not guarantee that the limiting object actually defines a distribution law. For instance, take concentrated on the unique sequence . Let be a product of Bernoulli measures with success probability . Let and define . Clearly, is absolutely continuous with respect to , which has complete grammar. It is easy to compute . Thus and does not converge to a limiting distribution.

Under mild conditions one gets that the limiting object is actually a distribution. For that, it is enough to give conditions in which . Directly from its definition . Thus, the limiting function defines a distribution if goes to zero as diverges. It holds if is summable. Notice that this is not the case in the example above.

When $\mu=\nu$, we recover in the limit the same limit distribution of the re-scaled shortest return function . In particular, if $\mu$ is a product measure, we recover the limit distribution obtained in [2].

(d) The following example shows a process that has complete grammar, and then converges. In contrast, the re-scaled shortest return function does not converge, as shown in [3]. This is due to the fact that the process is not mixing. The process is defined over in the following way. Let be uniformly chosen over and independent of everything. The remaining variables are conditionally independent given and defined by