
# Exponential inequalities for empirical unbounded context trees

## Abstract

In this paper we obtain non-uniform exponential upper bounds for the rate of convergence of a version of the algorithm Context, when the underlying tree is not necessarily bounded. The algorithm Context is a well-known tool to estimate the context tree of a Variable Length Markov Chain. As a consequence of the exponential bounds we obtain a strong consistency result. We generalize in this way several previous results in the field.

Keywords: variable memory processes, unbounded context trees, algorithm Context

Authors: Antonio Galves, Florencia Leonardi (footnote 1)

Mathematics Subject Classification: 62M09, 60G99

## 1 Introduction

In this paper we present an exponential bound for the rate of convergence of the algorithm Context for a class of unbounded variable memory models taking values in a finite alphabet $A$. A strong consistency result for the algorithm Context in this setting follows from this bound. Variable memory models were first introduced in the information theory literature by Rissanen (11) as a universal system for data compression. Originally called finite memory sources or probabilistic trees by Rissanen, this class of models recently became popular in the statistics literature under the name of Variable Length Markov Chains (VLMC) (1).

The idea behind the notion of variable memory models is that the probabilistic definition of each symbol only depends on a finite part of the past and the length of this relevant portion is a function of the past itself. Following Rissanen we call context the minimal relevant part of each past. The set of all contexts satisfies the suffix property, which means that no context is a proper suffix of another context. This property allows us to represent the set of all contexts as a rooted labeled tree. With this representation the process is described by the tree of all contexts and an associated family of probability measures on $A$, indexed by the tree of contexts. Given a context, its associated probability measure gives the probability of the next symbol for any past having this context as a suffix. From now on the pair composed of the context tree and the associated family of probability measures will be called a probabilistic context tree.

Rissanen not only introduced the notion of variable memory models but also introduced the algorithm Context to estimate the probabilistic context tree. The way the algorithm Context works can be summarized as follows. Given a sample produced by a chain with variable memory, we start with a maximal tree of candidate contexts for the sample. The branches of this first tree are then pruned until we obtain a minimal tree of contexts well adapted to the sample. We associate to each context an estimated transition probability defined as the proportion of times the context appears in the sample followed by each one of the symbols in the alphabet. From Rissanen (11) to Galves et al. (10), by way of Ron et al. (12) and Bühlmann and Wyner (1), several variants of the algorithm Context have been presented in the literature. In all the variants the decision to prune a branch is taken by considering a cost function. A branch is pruned if the cost function assumes a value smaller than a given threshold. The estimated context tree is the smallest tree satisfying this condition. The estimated family of transition probabilities is the one associated to the minimal tree of contexts.
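The pruning scheme summarized above can be sketched in a few lines of Python. This is a simplified illustration and not the exact variant analysed in this paper: the function names, the threshold `delta`, the maximal depth `d`, the add-one smoothing of the proportions, and the rule "keep a branch whenever some string along it changes the estimated next-symbol distribution by more than `delta`" are all assumptions made for the example. Strings are written with the most recent symbol on the right.

```python
from collections import Counter


def substring_counts(sample, max_len):
    """Count every substring of the sample with length 1..max_len."""
    counts = Counter()
    for length in range(1, max_len + 1):
        for t in range(len(sample) - length + 1):
            counts[sample[t:t + length]] += 1
    return counts


def estimate_contexts(sample, alphabet, delta, d):
    """Simplified pruning rule (an assumption of this sketch, see the text)."""
    counts = substring_counts(sample, d + 1)

    def followers(w):
        # Number of times w occurs in the sample followed by some symbol.
        return sum(counts[w + b] for b in alphabet)

    def p_hat(a, w):
        # Add-one smoothed proportion of occurrences of w followed by a.
        return (counts[w + a] + 1) / (followers(w) + len(alphabet))

    def gap(w):
        # Largest change in the estimate when the oldest symbol of w is dropped.
        return max(abs(p_hat(a, w) - p_hat(a, w[1:])) for a in alphabet)

    seen = [w for w in counts if len(w) <= d and followers(w) > 0]
    informative = {w for w in seen if gap(w) > delta}
    # Every proper suffix of an informative string stays an internal node.
    internal = {w[i:] for w in informative for i in range(1, len(w) + 1)}
    # Contexts are the observed children of internal nodes that are not internal.
    leaves = {b + w for w in internal for b in alphabet
              if b + w not in internal and followers(b + w) > 0}
    return leaves or {""}  # the empty string alone means "no memory"


periodic = "001" * 50
print(sorted(estimate_contexts(periodic, "01", delta=0.1, d=3)))
```

On the periodic sample the next symbol is determined by the last symbol when it is `1` and by the last two symbols otherwise, so the sketch keeps the strings `1`, `00` and `10` and prunes everything longer.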

In his seminal paper Rissanen proved the weak consistency of the algorithm Context in the case where the contexts have a bounded length, i.e. where the tree of contexts is finite. Bühlmann and Wyner (1) proved the weak consistency of the algorithm also in the finite case without assuming an a priori known bound on the maximal length of the memory, but using a bound allowed to grow with the size of the sample. In both papers the cost function is defined using the log likelihood ratio test to compare two candidate trees, and the main ingredient of the consistency proofs was the chi-square approximation to the log likelihood ratio test for Markov chains of fixed order. A different way to prove the consistency in the finite case was introduced in (10), using exponential inequalities for the estimated transition probabilities associated to the candidate contexts. As a consequence the authors obtain an exponential upper bound for the rate of convergence of their variant of the algorithm Context.

The unbounded case, as far as we know, was first considered by Ferrari and Wyner (8), who also proved a weak consistency result for the algorithm Context in this more general setting. The unbounded case was also considered by Csiszár and Talata (3), who introduced a different approach for the estimation of the probabilistic context tree using the Bayesian Information Criterion (BIC) as well as the Minimum Description Length Principle (MDL). We refer the reader to this last paper for a nice description of other approaches and results in this field, including the context tree maximizing algorithm by Willems et al. (14). With the exception of Weinberger et al. (13), the issue of the rate of convergence of the algorithm estimating the probabilistic context tree was not addressed in the literature until recently. Weinberger et al. proved in the bounded case that the probability that the estimated tree differs from the finite context tree generating the sample is summable as a function of the sample size. Duarte et al. (6) extend the original weak consistency result by Rissanen (11) to the unbounded case. Assuming weaker hypotheses than (8), they showed that the on-line estimation of the context function decreases as the inverse of the sample size.

In the present paper we generalize the exponential inequality approach presented in (10) to obtain an exponential upper bound for the algorithm Context in the case of unbounded probabilistic context trees. Under suitable conditions, we prove that the truncated estimated context tree converges exponentially fast to the tree generating the sample, truncated at the same level. This improves all results known until now.

The paper is organized as follows. In Section 2 we give the definitions and state the main results. Section 3 is devoted to the proof of an exponential bound for conditional probabilities, for unbounded probabilistic context trees. In Section 4 we apply this exponential bound to estimate the rate of convergence of our version of the algorithm Context and to prove its consistency.

## 2 Definitions and results

In what follows $A$ will represent a finite alphabet of size $|A|$. Given two integers $m \le n$, we will denote by $w_m^n$ the sequence $(w_m, \ldots, w_n)$ of symbols in $A$. The length of the sequence $w_m^n$ is denoted by $\ell(w_m^n)$ and is defined by $\ell(w_m^n) = n - m + 1$. Any sequence $w_m^n$ with $m > n$ represents the empty string and is denoted by $\lambda$. The length of the empty string is $\ell(\lambda) = 0$.

Given two finite sequences $w$ and $v$, we will denote by $vw$ the sequence of length $\ell(v) + \ell(w)$ obtained by concatenating the two strings. In particular, $\lambda w = w \lambda = w$. The concatenation of sequences is also extended to the case in which $v$ denotes a semi-infinite sequence, that is $v = v_{-\infty}^{-1}$.

We say that the sequence $s$ is a suffix of the sequence $w$ if there exists a sequence $u$, with $\ell(u) \ge 1$, such that $w = us$. In this case we write $s \prec w$. When $s \prec w$ or $s = w$ we write $s \preceq w$. Given a sequence $w$ we denote by $\mathrm{suf}(w)$ the largest suffix of $w$.

In the sequel $A^j$ will denote the set of all sequences of length $j$ over $A$ and $A^*$ represents the set of all finite sequences, that is

$$A^* = \bigcup_{j=1}^{\infty} A^j.$$

###### Definition 2.1

A countable subset $T$ of $A^*$ is a tree if no sequence $s \in T$ is a suffix of another sequence $w \in T$. This property is called the suffix property.

We define the height of the tree $T$ as

$$h(T) = \sup\{\ell(w) : w \in T\}.$$

In the case $h(T) < +\infty$ it follows that $T$ has a finite number of sequences. In this case we say that $T$ is bounded and we will denote by $|T|$ the number of sequences in $T$. On the other hand, if $h(T) = +\infty$ then $T$ has a countable number of sequences. In this case we say that the tree $T$ is unbounded.

Given a tree $T$ and an integer $K$ we will denote by $T|_K$ the tree $T$ truncated to level $K$, that is

$$T|_K = \{w \in T : \ell(w) \le K\} \cup \{w : \ell(w) = K \text{ and } w \prec u \text{ for some } u \in T\}.$$
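The suffix property, the height and the truncation can be made concrete for a bounded tree. The following is a minimal sketch, assuming that sequences are stored as Python strings with the most recent symbol on the right; the function names and the example tree are illustrative choices and do not come from the paper.

```python
def is_proper_suffix(s, w):
    """True when s is a suffix of w and s differs from w."""
    return len(s) < len(w) and w.endswith(s)


def has_suffix_property(tree):
    """No sequence of the tree is a proper suffix of another one."""
    return not any(is_proper_suffix(s, w) for s in tree for w in tree)


def height(tree):
    """h(T) for a finite, non-empty tree."""
    return max(len(w) for w in tree)


def truncate(tree, level):
    """T|_K: keep the short sequences, cut the long ones to their last K symbols."""
    short = {w for w in tree if len(w) <= level}
    cut = {u[len(u) - level:] for u in tree if len(u) > level}
    return short | cut


tree = {"1", "10", "100", "000"}   # hypothetical binary example
print(has_suffix_property(tree), height(tree), sorted(truncate(tree, 2)))
```

In the example, truncating at level 2 replaces the two sequences of length 3 by their common suffix `00`, and the result still satisfies the suffix property.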

We will say that a tree is irreducible if no sequence can be replaced by a suffix without violating the suffix property. This notion was introduced in (3) and generalizes the concept of complete tree.

###### Definition 2.2

A probabilistic context tree over $A$ is an ordered pair $(T, \bar p)$ such that

1. $T$ is an irreducible tree;

2. $\bar p = \{p(\cdot|w) : w \in T\}$ is a family of transition probabilities over $A$.

Consider a stationary stochastic chain $(X_t)_{t \in \mathbb{Z}}$ over $A$. Given a sequence $w \in A^j$ we denote by

$$p(w) = P(X_1^j = w)$$

the stationary probability of the cylinder defined by the sequence $w$. If $p(w) > 0$ we write

$$p(a|w) = P(X_0 = a \mid X_{-j}^{-1} = w).$$

###### Definition 2.3

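A sequence $w \in A^j$ is a context for the process $(X_t)$ if $p(w) > 0$ and for any semi-infinite sequence $x_{-\infty}^{-1}$ such that $w$ is a suffix of $x_{-\infty}^{-1}$ we have that

$$P(X_0 = a \mid X_{-\infty}^{-1} = x_{-\infty}^{-1}) = p(a|w), \quad \text{for all } a \in A,$$

and no suffix of $w$ satisfies this equation.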

###### Definition 2.4

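We say that the process $(X_t)$ is compatible with the probabilistic context tree $(T, \bar p)$ if the following conditions are satisfied:

1. $w \in T$ if and only if $w$ is a context for the process $(X_t)$.

2. For any $w \in T$ and any $a \in A$, $p(a|w) = P(X_0 = a \mid X_{-\ell(w)}^{-1} = w)$.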

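Define the sequence $(\alpha_k)_{k \in \mathbb{N}}$ as

$$\alpha_0 := \sum_{a \in A} \inf_{w \in T} \{p(a|w)\}, \qquad \alpha_k := \inf_{u \in A^k} \sum_{a \in A} \inf_{w \in T,\ w \succ u} \{p(a|w)\}.$$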

From now on we will assume that the probabilistic context tree $(T, \bar p)$ satisfies the following assumptions.

###### Assumption 2.5

Non-nullness, that is $\inf_{w \in T} \{p(a|w)\} > 0$ for any $a \in A$.

###### Assumption 2.6

Summability of the sequence $(1 - \alpha_k)_{k \in \mathbb{N}}$. In this case denote by

$$\alpha := \sum_{k \in \mathbb{N}} (1 - \alpha_k) < +\infty.$$

For a probabilistic context tree satisfying Assumptions 2.5 and 2.6, the maximal coupling argument used in (7), or alternatively the perfect simulation scheme presented in (2), imply the uniqueness of the law of the chain compatible with it.

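Given an integer $k \ge 1$, we define

$$C_k = \{u \in T|_k : p(a|u) \ne p(a|\mathrm{suf}(u)) \text{ for some } a \in A\}$$

and

$$D_k = \min_{u \in C_k} \max_{a \in A} \{|p(a|u) - p(a|\mathrm{suf}(u))|\}.$$

We denote by

$$\epsilon_k = \min\{p(w) : \ell(w) \le k \text{ and } p(w) > 0\}.$$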

In what follows we will assume that $x_0, x_1, \ldots, x_{n-1}$ is a sample of the stationary stochastic chain $(X_t)$ compatible with the probabilistic context tree $(T, \bar p)$.

For any finite string $w$ with $\ell(w) \le n$, we denote by $N_n(w)$ the number of occurrences of the string in the sample; that is

$$N_n(w) = \sum_{t=0}^{n-\ell(w)} \mathbf{1}\{X_t^{t+\ell(w)-1} = w\}.$$

For any element $a \in A$, the empirical transition probability $\hat p_n(a|w)$ is defined by

$$\hat p_n(a|w) = \frac{N_n(wa) + 1}{N_n(w\cdot) + |A|}, \tag{2.7}$$

where

$$N_n(w\cdot) = \sum_{b \in A} N_n(wb).$$

This definition of $\hat p_n(a|w)$ is convenient because it is asymptotically equivalent to $N_n(wa)/N_n(w\cdot)$ and it avoids an extra definition in the case $N_n(w\cdot) = 0$.

A variant of Rissanen’s algorithm Context is defined as follows. First of all, let us define for any finite string $w \in A^*$:

$$\Delta_n(w) = \max_{a \in A} |\hat p_n(a|w) - \hat p_n(a|\mathrm{suf}(w))|.$$

The operator $\Delta_n(w)$ computes a distance between the empirical transition probabilities associated to the sequence $w$ and those associated to the sequence $\mathrm{suf}(w)$.
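The counts, the empirical transition probabilities of (2.7) and the operator $\Delta_n$ translate directly into code. The sketch below assumes that the sample is a Python string with the most recent symbol on the right, so that the largest suffix of `w` is `w[1:]`; the function names and the toy sample are illustrative. Exact fractions are used so that the smoothing in (2.7) is visible.

```python
from fractions import Fraction


def count(sample, w):
    """N_n(w): number of (possibly overlapping) occurrences of w in the sample."""
    return sum(1 for t in range(len(sample) - len(w) + 1)
               if sample[t:t + len(w)] == w)


def p_hat(sample, alphabet, a, w):
    """Empirical transition probability (2.7)."""
    followers = sum(count(sample, w + b) for b in alphabet)  # N_n(w.)
    return Fraction(count(sample, w + a) + 1, followers + len(alphabet))


def delta_n(sample, alphabet, w):
    """Delta_n(w): max over symbols of |p_hat(a|w) - p_hat(a|suf(w))|."""
    suf = w[1:]  # drop the oldest symbol, written leftmost here
    return max(abs(p_hat(sample, alphabet, a, w) - p_hat(sample, alphabet, a, suf))
               for a in alphabet)


sample = "0010010"
print(p_hat(sample, "01", "1", "0"))    # estimate of p(1|0)
print(p_hat(sample, "01", "0", "11"))   # string never observed in the sample
print(delta_n(sample, "01", "10"))
```

For a string that never occurs in the sample, (2.7) returns the uniform distribution on the alphabet, which is the extra case the definition is designed to avoid treating separately.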

###### Definition 2.8

Given and , the tree estimated with the algorithm Context is

where denotes the set of all sequences of length at most . In the case we have .

It is easy to see that $\hat T_n^{\delta,d}$ is an irreducible tree. Moreover, the way we defined $\hat p_n(\cdot|\cdot)$ in (2.7) associates a probability distribution to each sequence in $\hat T_n^{\delta,d}$.

###### Theorem 2.9

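Let $(T, \bar p)$ be a probabilistic context tree satisfying Assumptions 2.5 and 2.6 and let $(X_t)$ be a stationary stochastic chain compatible with $(T, \bar p)$. Then for any integer $K$, any $d$ satisfying

$$d > \max_{u \notin T,\ \ell(u) \le K} \min\{k : \exists\, w \in C_k,\ w \succ u\}, \tag{2.10}$$

any $\delta < D_d$ and any

$$n > \frac{2(|A|+1)}{\min(\delta, D_d - \delta)\,\epsilon_d} + d$$

we have that

$$P\big(\hat T_n^{\delta,d}|_K \ne T|_K\big) \le 4\, e^{\frac{1}{e}}\, |A|^{d+2} \exp\left[-(n-d)\, \frac{\big[\min\big(\frac{\delta}{2}, \frac{D_d - \delta}{2}\big) - \frac{|A|+1}{(n-d)\epsilon_d}\big]^2 \epsilon_d^2\, C}{4|A|^2 (d+1)}\right],$$

where

$$C = \frac{\alpha_0}{8e(\alpha + \alpha_0)}.$$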

As a consequence we obtain the following strong consistency result.

###### Corollary 2.11

Under the conditions of Theorem 2.9 we have

$$\hat T_n^{\delta,d}|_K = T|_K,$$

eventually almost surely as $n \to +\infty$.
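To get a feeling for the orders of magnitude involved, the right-hand side of Theorem 2.9 can be evaluated numerically. The sketch below is only an illustration: the function and argument names are ours, and the numerical values of $|A|$, $d$, $\delta$, $D_d$, $\epsilon_d$, $\alpha$ and $\alpha_0$ in the usage lines are hypothetical and do not correspond to any process studied in the paper.

```python
import math


def constant_c(alpha, alpha0):
    """C = alpha0 / (8 e (alpha + alpha0))."""
    return alpha0 / (8 * math.e * (alpha + alpha0))


def min_sample_size(size_a, d, delta, big_d, eps):
    """Right-hand side of the condition on n in Theorem 2.9."""
    return 2 * (size_a + 1) / (min(delta, big_d - delta) * eps) + d


def context_bound(n, size_a, d, delta, big_d, eps, alpha, alpha0):
    """Upper bound of Theorem 2.9 on the probability of a wrong truncated tree."""
    if not 0 < delta < big_d:
        raise ValueError("delta must lie strictly between 0 and D_d")
    if n <= min_sample_size(size_a, d, delta, big_d, eps):
        raise ValueError("n is below the threshold required by the theorem")
    margin = min(delta / 2, (big_d - delta) / 2)
    bracket = margin - (size_a + 1) / ((n - d) * eps)
    exponent = ((n - d) * bracket ** 2 * eps ** 2 * constant_c(alpha, alpha0)
                / (4 * size_a ** 2 * (d + 1)))
    return 4 * math.exp(1 / math.e) * size_a ** (d + 2) * math.exp(-exponent)


# Hypothetical parameter values, chosen only to exercise the formula.
params = dict(size_a=2, d=3, delta=0.1, big_d=0.2, eps=0.05, alpha=0.5, alpha0=0.5)
for n in (10 ** 8, 10 ** 9, 10 ** 10):
    print(n, context_bound(n, **params))
```

The bound is exponential in $n$ but the constant in the exponent is small, so with values like these it only drops below 1 for very large samples.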

## 3 Exponential inequalities for empirical probabilities

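The main ingredient in the proof of Theorem 2.9 is the following exponential upper bound.

###### Theorem 3.1

For any finite sequence $w$, any symbol $a \in A$ and any $t > 0$ the following inequality holds

$$P\big(|N_n(wa) - (n - \ell(w))\,p(wa)| > t\big) \le e^{\frac{1}{e}} \exp\left[\frac{-t^2 C}{(n - \ell(w))\,\ell(wa)}\right],$$

where

$$C = \frac{\alpha_0}{8e(\alpha + \alpha_0)}. \tag{3.2}$$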

As a direct consequence of Theorem 3.1 we obtain the following corollary.

###### Corollary 3.3

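For any finite sequence $w$ with $p(w) > 0$, any symbol $a \in A$, any $t > 0$ and any $n > \frac{|A|+1}{t\,p(w)} + \ell(w)$ the following inequality holds

$$P\big(|\hat p_n(a|w) - p(a|w)| > t\big) \le 2|A|\, e^{\frac{1}{e}} \exp\left[-(n - \ell(w))\, \frac{\big[t - \frac{|A|+1}{(n-\ell(w))\,p(w)}\big]^2 p(w)^2\, C}{4|A|^2\, \ell(wa)}\right],$$

where $C$ is given by (3.2).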
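The inequality of Theorem 3.1 can be inverted to obtain a deviation radius: setting its right-hand side equal to a level $\eta$ and solving for $t$ gives $t^2 = (n - \ell(w))\,\ell(wa)\,(1/e + \log(1/\eta))/C$. The sketch below implements this inversion; the function names and the numerical values of $\alpha$, $\alpha_0$, $n$ and $\eta$ are assumptions made for the illustration.

```python
import math


def constant_c(alpha, alpha0):
    """C of (3.2)."""
    return alpha0 / (8 * math.e * (alpha + alpha0))


def tail_bound(t, n, len_w, alpha, alpha0):
    """Right-hand side of Theorem 3.1 for a string w of length len_w."""
    c = constant_c(alpha, alpha0)
    return math.exp(1 / math.e) * math.exp(-t ** 2 * c / ((n - len_w) * (len_w + 1)))


def deviation_radius(eta, n, len_w, alpha, alpha0):
    """Value of t at which the bound of Theorem 3.1 equals eta (0 < eta <= 1)."""
    c = constant_c(alpha, alpha0)
    return math.sqrt((n - len_w) * (len_w + 1) * (1 / math.e + math.log(1 / eta)) / c)


# Hypothetical values: alpha = alpha0 = 0.5, sample size 10**6, string of length 2.
t = deviation_radius(0.05, 10 ** 6, 2, 0.5, 0.5)
print(t, t / 10 ** 6, tail_bound(t, 10 ** 6, 2, 0.5, 0.5))
```

The radius grows like $\sqrt{n}$ for the count $N_n(wa)$, so the relative deviation $t/n$ shrinks like $1/\sqrt{n}$, as expected for a bound of sub-Gaussian type.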

To prove Theorem 3.1 we need a mixture property for processes compatible with a probabilistic context tree satisfying Assumptions 2.5 and 2.6. This is the content of the following lemma.

###### Lemma 3.4

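Let $(X_t)$ be a stationary stochastic chain compatible with the probabilistic context tree $(T, \bar p)$ satisfying Assumptions 2.5 and 2.6. Then, there exists a summable sequence $(\rho_l)_{l \in \mathbb{N}}$, satisfying

$$\sum_{l \in \mathbb{N}} \rho_l \le 1 + \frac{2\alpha}{\alpha_0}, \tag{3.5}$$

such that for any $i \ge 1$, any $k > i$, any $j \ge 1$ and any finite sequence $w_1^j$, the following inequality holds

$$\sup_{x_1^i \in A^i} \big|P(X_k^{k+j-1} = w_1^j \mid X_1^i = x_1^i) - p(w_1^j)\big| \le \sum_{l=0}^{j-1} \rho_{k-i-1+l}. \tag{3.6}$$

Proof.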

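First note that

$$\inf_{u \in A^\infty} P(X_k^{k+j-1} = w_1^j \mid X_{-\infty}^{i} = u_{-\infty}^{0} x_1^i) \le P(X_k^{k+j-1} = w_1^j \mid X_1^i = x_1^i) \le \sup_{u \in A^\infty} P(X_k^{k+j-1} = w_1^j \mid X_{-\infty}^{i} = u_{-\infty}^{0} x_1^i),$$

where $A^\infty$ denotes the set of all semi-infinite sequences $u_{-\infty}^{0}$. The reader can find a proof of the inequalities above in ((7), Proposition 3). Using this fact and the condition of stationarity it is sufficient to prove that for any $k \ge 0$,

$$\sup_{x \in A^\infty} \big|P(X_k^{k+j-1} = w_1^j \mid X_{-\infty}^{-1} = x_{-\infty}^{-1}) - p(w_1^j)\big| \le \sum_{l=0}^{j-1} \rho_{k+l}.$$

Note that for all pasts $x_{-\infty}^{-1}$ we have

$$\begin{aligned} \big|P(X_k^{k+j-1} = w_1^j \mid X_{-\infty}^{-1} = x_{-\infty}^{-1}) - p(w_1^j)\big| &= \Big|\int_{u \in A^\infty} \big[P(X_k^{k+j-1} = w_1^j \mid X_{-\infty}^{-1} = x_{-\infty}^{-1}) - P(X_k^{k+j-1} = w_1^j \mid X_{-\infty}^{-1} = u_{-\infty}^{-1})\big]\, dp(u)\Big| \\ &\le \int_{u \in A^\infty} \big|P(X_k^{k+j-1} = w_1^j \mid X_{-\infty}^{-1} = x_{-\infty}^{-1}) - P(X_k^{k+j-1} = w_1^j \mid X_{-\infty}^{-1} = u_{-\infty}^{-1})\big|\, dp(u). \end{aligned}$$

Therefore, applying the loss of memory property proved in ((2), Corollary 4.1) we have that

$$\big|P(X_k^{k+j-1} = w_1^j \mid X_{-\infty}^{-1} = x_{-\infty}^{-1}) - P(X_k^{k+j-1} = w_1^j \mid X_{-\infty}^{-1} = u_{-\infty}^{-1})\big| \le \sum_{l=0}^{j-1} \rho_{k+l},$$

where $\rho_m$ is defined as the probability of return to the origin at time $m$ of the Markov chain on $\mathbb{N}$ starting at time zero at the origin and having transition probabilities

$$p(x, y) = \begin{cases} \alpha_x, & \text{if } y = x + 1, \\ 1 - \alpha_x, & \text{if } y = 0, \\ 0, & \text{otherwise.} \end{cases} \tag{3.7}$$

This concludes the proof of (3.6). To prove (3.5), let $(Z_l)_{l \in \mathbb{N}}$ be the Markov chain with transition probabilities given by (3.7). By definition we have

$$\begin{aligned} \prod_{l \ge 1} (1 - \rho_l) &= \prod_{l \ge 1} \sum_{j=1}^{l} P(Z_l = j \mid Z_{l-1} = j-1)\, P(Z_{l-1} = j-1) \\ &\ge \prod_{l \ge 1} \alpha_{l-1} \prod_{i=0}^{l-2} \alpha_i \ge \prod_{l \ge 0} \alpha_l^2. \end{aligned}$$

From this, using the inequality $x \le -\log(1 - x)$ which holds for any $x \in [0, 1)$, it follows that

$$\sum_{l \ge 1} \rho_l \le -2 \sum_{l \ge 0} \log \alpha_l \le 2 \sum_{l \ge 0} \frac{1 - \alpha_l}{\alpha_0}.$$

This concludes the proof of the lemma.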
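The return probabilities of the chain (3.7) are easy to compute by propagating the law of the chain step by step, which gives a quick numerical check of (3.5). In the sketch below the sequence $\alpha_k = 1 - 0.2 \cdot 2^{-k}$ is a hypothetical choice made for the illustration (it gives $\alpha_0 = 0.8$ and $\alpha = 0.4$); the function names are ours.

```python
def return_probabilities(alpha, horizon):
    """rho_m = P(Z_m = 0 | Z_0 = 0) for m = 0..horizon, for the chain (3.7)."""
    dist = [1.0]   # dist[x] = P(Z_m = x); the chain starts at the origin
    rho = [1.0]    # rho_0 = 1
    for _ in range(horizon):
        # From state x the chain falls back to 0 with probability 1 - alpha(x)
        # and climbs to x + 1 with probability alpha(x).
        back = sum(p * (1 - alpha(x)) for x, p in enumerate(dist))
        dist = [back] + [p * alpha(x) for x, p in enumerate(dist)]
        rho.append(back)
    return rho


def alpha_k(k):
    """Hypothetical sequence with sum of (1 - alpha_k) equal to 0.4."""
    return 1 - 0.2 * 2 ** (-k)


rho = return_probabilities(alpha_k, 200)
bound = 1 + 2 * 0.4 / 0.8   # right-hand side of (3.5)
print(sum(rho), bound)
```

The truncated sum stays below the bound of (3.5); the chain is transient here, since the product of the $\alpha_k$ is positive, which is what makes the return probabilities summable.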

We are now ready to prove Theorem 3.1.

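Proof of Theorem 3.1. Let $w$ be a finite sequence and $a$ any symbol in $A$. Define the random variables

$$U_j = \mathbf{1}\{X_j^{j+\ell(w)} = wa\} - p(wa),$$

for $j = 0, \ldots, n - \ell(wa)$. Then, using ((4), Proposition 4) we have that, for any $p \ge 2$,

$$\begin{aligned} \|N_n(wa) - (n - \ell(w))\,p(wa)\|_p &\le \Big(2p \sum_{i=0}^{n-\ell(wa)} \sum_{k=i}^{n-\ell(wa)} \|E(U_k \mid U_0, \ldots, U_i)\|_\infty\Big)^{\frac{1}{2}} \\ &\le \Big(2p \sum_{i=0}^{n-\ell(wa)} \sum_{k=i}^{n-\ell(wa)} \sup_{u \in A^{i+\ell(wa)}} \big|P(X_k^{k+\ell(w)} = wa \mid X_0^{i+\ell(w)} = u) - p(wa)\big|\Big)^{\frac{1}{2}} \\ &\le \Big(2p\, \ell(wa)\,(n - \ell(w))\, \frac{2(\alpha + \alpha_0)}{\alpha_0}\Big)^{\frac{1}{2}}. \end{aligned}$$

Then, as in ((5), Proposition 5) we also obtain that, for any $t > 0$,

$$P\big(|N_n(wa) - (n - \ell(w))\,p(wa)| > t\big) \le e^{\frac{1}{e}} \exp\left[\frac{-t^2 C}{(n - \ell(w))\,\ell(wa)}\right],$$

where

$$C = \frac{\alpha_0}{8e(\alpha + \alpha_0)}.$$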
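
Proof of Corollary 3.3. First observe that

$$\Big|p(a|w) - \frac{(n - \ell(w))\,p(wa) + 1}{(n - \ell(w))\,p(w) + |A|}\Big| \le \frac{|A| + 1}{(n - \ell(w))\,p(w)}.$$

Then, for all $n > \frac{|A|+1}{t\,p(w)} + \ell(w)$ we have that

$$P\big(|\hat p_n(a|w) - p(a|w)| > t\big) \le P\Big(\Big|\frac{N_n(wa) + 1}{N_n(w\cdot) + |A|} - \frac{(n - \ell(w))\,p(wa) + 1}{(n - \ell(w))\,p(w) + |A|}\Big| > t - \frac{|A| + 1}{(n - \ell(w))\,p(w)}\Big).$$

Denote by $t' = t - \frac{|A|+1}{(n - \ell(w))\,p(w)}$. Then

$$\begin{aligned} P\Big(\Big|\frac{N_n(wa) + 1}{N_n(w\cdot) + |A|} - \frac{(n - \ell(w))\,p(wa) + 1}{(n - \ell(w))\,p(w) + |A|}\Big| > t'\Big) &\le P\Big(\big|N_n(wa) - (n - \ell(w))\,p(wa)\big| > \frac{t'}{2}\big[(n - \ell(w))\,p(w) + |A|\big]\Big) \\ &\quad + \sum_{b \in A} P\Big(\big|N_n(wb) - (n - \ell(w))\,p(wb)\big| > \frac{t'}{2|A|}\big[(n - \ell(w))\,p(w) + |A|\big]\Big). \end{aligned}$$

Now, we can apply Theorem 3.1 to bound above the last sum by

$$2|A|\, e^{\frac{1}{e}} \exp\left[-(n - \ell(w))\, \frac{\big[t - \frac{|A|+1}{(n-\ell(w))\,p(w)}\big]^2 p(w)^2\, C}{4|A|^2\, \ell(wa)}\right],$$

where

$$C = \frac{\alpha_0}{8e(\alpha + \alpha_0)}.$$

This finishes the proof of the corollary.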

## 4 Proof of the main results

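Proof of Theorem 2.9. Define

$$O_n^{\delta,d} = \bigcup_{w \in T,\ \ell(w) < K}\ \bigcup_{uw \in \hat T_n^{\delta,d}} \{\Delta_n(uw) > \delta\},$$

and

$$U_n^{\delta,d} = \bigcup_{w \in \hat T_n^{\delta,d},\ \ell(w) < K}\ \bigcap_{uw \in T|_d} \{\Delta_n(uw) \le \delta\}.$$

Then, if $d < n$ we have that

$$\{\hat T_n^{\delta,d}|_K \ne T|_K\} = O_n^{\delta,d} \cup U_n^{\delta,d}.$$

The result follows from a succession of lemmas.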

###### Lemma 4.1

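For any $n > \frac{2(|A|+1)}{\delta\,\epsilon_d} + d$, for any $w \in T$ with $\ell(w) < K$ and for any $uw \in \hat T_n^{\delta,d}$ we have that

$$P(\Delta_n(uw) > \delta) \le 4|A|^2\, e^{\frac{1}{e}} \exp\left[-(n-d)\, \frac{\big[\frac{\delta}{2} - \frac{|A|+1}{(n-d)\epsilon_d}\big]^2 \epsilon_d^2\, C}{4|A|^2 (d+1)}\right],$$

where $C$ is given by (3.2).

Proof.

Recall that

$$\Delta_n(uw) = \max_{a \in A} |\hat p_n(a|uw) - \hat p_n(a|\mathrm{suf}(uw))|.$$

Note that the fact $w \in T$ implies that for any finite sequence $u$ with $p(uw) > 0$ and any symbol $a \in A$ we have $p(a|w) = p(a|uw)$. Hence,

$$P(\Delta_n(uw) > \delta) \le \sum_{a \in A} \Big[P\big(|\hat p_n(a|w) - p(a|w)| > \tfrac{\delta}{2}\big) + P\big(|\hat p_n(a|uw) - p(a|uw)| > \tfrac{\delta}{2}\big)\Big].$$

Using Corollary 3.3 we can bound above the right hand side of the last inequality by

$$4|A|^2\, e^{\frac{1}{e}} \exp\left[-(n-d)\, \frac{\big[\frac{\delta}{2} - \frac{|A|+1}{(n-d)\epsilon_d}\big]^2 \epsilon_d^2\, C}{4|A|^2 (d+1)}\right],$$

where $C$ is given by (3.2).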

###### Lemma 4.2

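For any $n > \frac{2(|A|+1)}{(D_d - \delta)\,\epsilon_d} + d$ and for any $w \in \hat T_n^{\delta,d}$ with $\ell(w) < K$ we have that

$$P\Big(\bigcap_{uw \in T|_d} \{\Delta_n(uw) \le \delta\}\Big) \le 4|A|\, e^{\frac{1}{e}} \exp\left[-(n-d)\, \frac{\big[\frac{D_d - \delta}{2} - \frac{|A|+1}{(n-d)\epsilon_d}\big]^2 \epsilon_d^2\, C}{4|A|^2 (d+1)}\right],$$

where $C$ is given by (3.2).

Proof.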

As satisfies (2.10) there exists such that for some . Then

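$$P\Big(\bigcap_{uw \in T|_d} \{\Delta_n(uw) \le \delta\}\Big) \le P(\Delta_n(\bar u w) \le \delta).$$

Observe that for any $a \in A$,

$$|\hat p_n(a|\mathrm{suf}(\bar u w)) - \hat p_n(a|\bar u w)| \ge |p(a|\mathrm{suf}(\bar u w)) - p(a|\bar u w)| - |\hat p_n(a|\mathrm{suf}(\bar u w)) - p(a|\mathrm{suf}(\bar u w))| - |\hat p_n(a|\bar u w) - p(a|\bar u w)|.$$

Hence, we have that for any $a \in A$

$$\Delta_n(\bar u w) \ge D_d - |\hat p_n(a|\mathrm{suf}(\bar u w)) - p(a|\mathrm{suf}(\bar u w))| - |\hat p_n(a|\bar u w) - p(a|\bar u w)|.$$

Therefore,

$$P(\Delta_n(\bar u w) \le \delta) \le P\Big(\bigcap_{a \in A} \Big\{|\hat p_n(a|\mathrm{suf}(\bar u w)) - p(a|\mathrm{suf}(\bar u w))| \ge \frac{D_d - \delta}{2}\Big\}\Big) + P\Big(\bigcap_{a \in A} \Big\{|\hat p_n(a|\bar u w) - p(a|\bar u w)| \ge \frac{D_d - \delta}{2}\Big\}\Big).$$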

As and we can use Corollary 3.3 to bound above the right hand side of this inequality by

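$$4|A|\, e^{\frac{1}{e}} \exp\left[-(n-d)\, \frac{\big[\frac{D_d - \delta}{2} - \frac{|A|+1}{(n-d)\epsilon_d}\big]^2 \epsilon_d^2\, C}{4|A|^2 (d+1)}\right],$$

where $C$ is given by (3.2). This concludes the proof of the lemma.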

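Now we can finish the proof of Theorem 2.9. We have that

$$P\big(\hat T_n^{\delta,d}|_K \ne T|_K\big) = P(O_n^{\delta,d}) + P(U_n^{\delta,d}).$$

Using the definition of $O_n^{\delta,d}$ and $U_n^{\delta,d}$ we have that

$$P\big(\hat T_n^{\delta,d}|_K \ne T|_K\big) \le \sum_{w \in T,\ \ell(w) < K}\ \sum_{uw \in \hat T_n^{\delta,d}} P(\Delta_n(uw) > \delta) + \sum_{w \in \hat T_n^{\delta,d},\ \ell(w) < K} P\Big(\bigcap_{uw \in T|_d} \{\Delta_n(uw) \le \delta\}\Big).$$

Applying Lemma 4.1 and Lemma 4.2 we can bound above the last expression by

$$P\big(\hat T_n^{\delta,d}|_K \ne T|_K\big) \le 4\, e^{\frac{1}{e}}\, |A|^{d+2} \exp\left[-(n-d)\, \frac{\big[\min\big(\frac{\delta}{2}, \frac{D_d - \delta}{2}\big) - \frac{|A|+1}{(n-d)\epsilon_d}\big]^2 \epsilon_d^2\, C}{4|A|^2 (d+1)}\right],$$

where $C$ is given by (3.2). This concludes the proof of Theorem 2.9.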

Proof of Corollary 2.11. It follows from Theorem 2.9, using the first Borel-Cantelli Lemma and the fact that the bounds for the error estimation of the context tree are summable in $n$ for a fixed $d$ satisfying (2.10) and $\delta < D_d$.

## 5 Final remarks

The present paper gives an upper bound for the rate of convergence of a version of the algorithm Context, for unbounded context trees. This generalizes previous results obtained in (10) for the case of bounded variable memory processes. We obtain an exponential bound for the probability of incorrect estimation of the truncated context tree, when the estimator is given by Definition 2.8. Note that the definition of the context tree estimator depends on the parameter $\delta$, and this parameter appears in the exponent of the upper bound. To assure the consistency of the estimator we need to choose a $\delta$ sufficiently small, depending on the transition probabilities of the process. Therefore, our estimator is not universal, in the sense that for any fixed $\delta$ it fails to be consistent for any process having $D_d < \delta$. The same happens with the parameter $d$. In order to choose $\delta$ and $d$ not depending on the process, we can allow these parameters to be a function of $n$, in such a way that $\delta_n$ goes to zero and $d_n$ goes to $+\infty$ as $n$ diverges. When we do this, we lose the exponential property of the upper bound.

As an anonymous referee has pointed out, Finesso et al. (9) proved that in the simpler case of estimating the order of a Markov chain, it is not possible to obtain pure exponential bounds for the overestimation event with a universal estimator. The above discussion illustrates this fact.

## 6 Acknowledgments

We thank Pierre Collet, Imre Csiszár, Nancy Garcia, Aurélien Garivier, Bezza Hafidi, Véronique Maume-Deschamps, Eric Moulines, Jorma Rissanen and Bernard Schmitt for many discussions on the subject. We also thank an anonymous referee who drew our attention to the interesting paper (9).

### Footnotes

1. This work is part of PRONEX/FAPESP’s project Stochastic behavior, critical phenomena and rhythmic pattern identification in natural languages (grant number 03/09930-9) and CNPq’s projects Stochastic modeling of speech (grant number 475177/2004-5) and Rhythmic patterns, prosodic domains and probabilistic modeling in Portuguese Corpora (grant number 485999/2007-2). AG is partially supported by a CNPq fellowship (grant 308656/2005-9) and FL is supported by a FAPESP fellowship (grant 06/56980-0).

### References

1. P. Bühlmann and A. J. Wyner. Variable length Markov chains. Ann. Statist., 27:480–513, 1999.
2. F. Comets, R. Fernández, and P. Ferrari. Processes with long memory: Regenerative construction and perfect simulation. Ann. Appl. Probab., 12(3):921–943, 2002.
3. I. Csiszár and Z. Talata. Context tree estimation for not necessarily finite memory processes, via BIC and MDL. IEEE Trans. Inform. Theory, 52(3):1007–1016, 2006.
4. J. Dedecker and P. Doukhan. A new covariance inequality and applications. Stochastic Process. Appl., 106(1):63–80, 2003.
5. J. Dedecker and C. Prieur. New dependence coefficients. Examples and applications to statistics. Probab. Theory Related Fields, 132:203–236, 2005.
6. D. Duarte, A. Galves, and N.L. Garcia. Markov approximation and consistent estimation of unbounded probabilistic suffix trees. Bull. Braz. Math. Soc., 37(4):581–592, 2006.
7. R. Fernández and A. Galves. Markov approximations of chains of infinite order. Bull. Braz. Math. Soc., 33(3):295–306, 2002.
8. F. Ferrari and A. Wyner. Estimation of general stationary processes by variable length Markov chains. Scand. J. Statist., 30(3):459–480, 2003.
9. L. Finesso, C-C. Liu, and P. Narayan. The optimal error exponent for Markov order estimation. IEEE Trans. Inform. Theory, 42(5):1488–1497, 1996.
10. A. Galves, V. Maume-Deschamps, and B. Schmitt. Exponential inequalities for VLMC empirical trees. ESAIM Prob. Stat. (accepted), 2006.
11. J. Rissanen. A universal data compression system. IEEE Trans. Inform. Theory, 29(5):656–664, 1983.
12. D. Ron, Y. Singer, and N. Tishby. The power of amnesia: Learning probabilistic automata with variable memory length. Machine Learning, 25(2-3):117–149, 1996.
13. M. J. Weinberger, J. Rissanen, and M. Feder. A universal finite memory source. IEEE Trans. Inform. Theory, 41(3):643–652, 1995.
14. F. M. Willems, Y. M. Shtarkov, and T. J. Tjalkens. The context-tree weighting method: basic properties. IEEE Trans. Inform. Theory, 41(3):653–664, 1995.