Exponential inequalities for empirical unbounded context trees
Abstract
In this paper we obtain nonuniform exponential upper bounds for the rate of convergence of a version of the algorithm Context, when the underlying tree is not necessarily bounded. The algorithm Context is a well-known tool to estimate the context tree of a Variable Length Markov Chain. As a consequence of the exponential bounds we obtain a strong consistency result. In this way we generalize several previous results in the field.
Antonio Galves
Florencia Leonardi
Mathematics Subject Classification: 62M09, 60G99
1 Introduction
In this paper we present an exponential bound for the rate of convergence of the algorithm Context for a class of unbounded variable memory models taking values in a finite alphabet. A strong consistency result for the algorithm Context in this setting follows from this bound. Variable memory models were first introduced in the information theory literature by Rissanen (11) as a universal system for data compression. Originally called finite memory sources or probabilistic trees by Rissanen, this class of models recently became popular in the statistics literature under the name of Variable Length Markov Chains (VLMC) (1).
The idea behind the notion of variable memory models is that the probabilistic definition of each symbol depends only on a finite part of the past, and the length of this relevant portion is a function of the past itself. Following Rissanen, we call the minimal relevant part of each past a context. The set of all contexts satisfies the suffix property, which means that no context is a proper suffix of another context. This property allows us to represent the set of all contexts as a rooted labeled tree. With this representation the process is described by the tree of all contexts and an associated family of probability measures on the alphabet, indexed by the tree of contexts. Given a context, its associated probability measure gives the probability of the next symbol for any past having this context as a suffix. From now on the pair composed of the context tree and the associated family of probability measures will be called a probabilistic context tree.
Rissanen not only introduced the notion of variable memory models but also introduced the algorithm Context to estimate the probabilistic context tree. The way the algorithm Context works can be summarized as follows. Given a sample produced by a chain with variable memory, we start with a maximal tree of candidate contexts for the sample. The branches of this first tree are then pruned until we obtain a minimal tree of contexts well adapted to the sample. We associate to each context an estimated transition probability defined as the proportion of times the context appears in the sample followed by each one of the symbols in the alphabet. From Rissanen (11) to Galves et al. (10), through Ron et al. (12) and Bühlmann and Wyner (1), several variants of the algorithm Context have been presented in the literature. In all the variants the decision to prune a branch is taken by considering a cost function. A branch is pruned if the cost function assumes a value smaller than a given threshold. The estimated context tree is the smallest tree satisfying this condition. The estimated family of transition probabilities is the one associated to the minimal tree of contexts.
In his seminal paper Rissanen proved the weak consistency of the algorithm Context in the case where the contexts have a bounded length, i.e. where the tree of contexts is finite. Bühlmann and Wyner (1) proved the weak consistency of the algorithm also in the finite case, without assuming an a priori known bound on the maximal length of the memory, but using a bound allowed to grow with the size of the sample. In both papers the cost function is defined using the log likelihood ratio test to compare two candidate trees, and the main ingredient of the consistency proofs is the chi-square approximation to the log likelihood ratio test for Markov chains of fixed order. A different way to prove consistency in the finite case was introduced in (10), using exponential inequalities for the estimated transition probabilities associated to the candidate contexts. As a consequence, the authors obtained an exponential upper bound for the rate of convergence of their variant of the algorithm Context.
The unbounded case, as far as we know, was first considered by Ferrari and Wyner (8), who also proved a weak consistency result for the algorithm Context in this more general setting. The unbounded case was also considered by Csiszár and Talata (3), who introduced a different approach for the estimation of the probabilistic context tree using the Bayesian Information Criterion (BIC) as well as the Minimum Description Length Principle (MDL). We refer the reader to this last paper for a nice description of other approaches and results in this field, including the context tree maximizing algorithm by Willems et al. (14). With the exception of Weinberger et al. (13), the issue of the rate of convergence of the algorithm estimating the probabilistic context tree was not addressed in the literature until recently. Weinberger et al. proved in the bounded case that the probability that the estimated tree differs from the finite context tree generating the sample is summable as a function of the sample size. Duarte et al. (6) extended the original weak consistency result by Rissanen (11) to the unbounded case. Assuming weaker hypotheses than (8), they showed that the error of the online estimation of the context function decreases as the inverse of the sample size.
In the present paper we generalize the exponential inequality approach presented in (10) to obtain an exponential upper bound for the algorithm Context in the case of unbounded probabilistic context trees. Under suitable conditions, we prove that the truncated estimated context tree converges exponentially fast to the tree generating the sample, truncated at the same level. This improves all results known until now.
The paper is organized as follows. In Section 2 we give the definitions and state the main results. Section 3 is devoted to the proof of an exponential bound for conditional probabilities, for unbounded probabilistic context trees. In Section 4 we apply this exponential bound to estimate the rate of convergence of our version of the algorithm Context and to prove its consistency.
2 Definitions and results
In what follows $A$ will represent a finite alphabet of size $|A|$. Given two integers $m \le n$, we will denote by $w_m^n$ the sequence $(w_m, \ldots, w_n)$ of symbols in $A$. The length of the sequence $w_m^n$ is denoted by $\ell(w_m^n)$ and is defined by $\ell(w_m^n) = n - m + 1$. Any sequence $w_m^n$ with $m > n$ represents the empty string and is denoted by $\lambda$. The length of the empty string is $\ell(\lambda) = 0$.
Given two finite sequences $w$ and $v$, we will denote by $vw$ the sequence of length $\ell(v) + \ell(w)$ obtained by concatenating the two strings. In particular, $\lambda w = w \lambda = w$. The concatenation of sequences is also extended to the case in which $v$ denotes a semi-infinite sequence, that is $v = v_{-\infty}^{-1}$.
We say that the sequence $s$ is a suffix of the sequence $w$ if there exists a sequence $u$, with $\ell(u) \ge 1$, such that $w = us$. In this case we write $s \prec w$. When $s \prec w$ or $s = w$ we write $s \preceq w$. Given a sequence $w$ we denote by $\mathrm{suf}(w)$ the largest suffix of $w$.
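As an illustration of the suffix relation, the following minimal Python sketch encodes sequences as strings whose rightmost character is the most recent symbol. The function names `is_proper_suffix` and `suf` are ours and purely illustrative, and we assume that the suffix relation is strict, that is, the part removed on the left is nonempty.

```python
def is_proper_suffix(s: str, w: str) -> bool:
    """True if w = u + s for some nonempty string u (assumed strict relation)."""
    return len(s) < len(w) and w.endswith(s)


def suf(w: str) -> str:
    """Largest proper suffix of w: drop the oldest (leftmost) symbol."""
    if not w:
        raise ValueError("the empty string has no proper suffix")
    return w[1:]


if __name__ == "__main__":
    print(is_proper_suffix("01", "101"))   # "01" ends "101" and is shorter
    print(is_proper_suffix("101", "101"))  # equal strings are not proper suffixes
    print(suf("101"))
```

With this encoding the empty string is a proper suffix of every nonempty sequence, which matches its role as the root of the tree representation.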
In the sequel $A^j$ will denote the set of all sequences of length $j$ over $A$ and $A^*$ represents the set of all finite sequences, that is
$$A^* = \bigcup_{j=1}^{\infty} A^j.$$
Definition 2.1
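A countable subset $\mathcal{T}$ of $A^*$ is a tree if no sequence $s \in \mathcal{T}$ is a suffix of another sequence $w \in \mathcal{T}$. This property is called the suffix property.
We define the height of the tree $\mathcal{T}$ as
$$h(\mathcal{T}) = \sup\{\ell(w) : w \in \mathcal{T}\}.$$
In the case $h(\mathcal{T}) < +\infty$ it follows that $\mathcal{T}$ has a finite number of sequences. In this case we say that $\mathcal{T}$ is bounded and we will denote by $|\mathcal{T}|$ the number of sequences in $\mathcal{T}$. On the other hand, if $h(\mathcal{T}) = +\infty$ then $\mathcal{T}$ has a countable number of sequences. In this case we say that the tree $\mathcal{T}$ is unbounded.
Given a tree $\mathcal{T}$ and an integer $K$ we will denote by $\mathcal{T}|_K$ the tree $\mathcal{T}$ truncated to level $K$, that is
$$\mathcal{T}|_K = \{w \in \mathcal{T} : \ell(w) \le K\} \cup \{w : \ell(w) = K \text{ and } w \prec u \text{ for some } u \in \mathcal{T}\}.$$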
We will say that a tree is irreducible if no sequence can be replaced by a suffix without violating the suffix property. This notion was introduced in (3) and generalizes the concept of complete tree.
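The suffix property and the truncation of a tree to a given level can be sketched in Python as follows. This is only an illustration under our own assumptions: trees are finite sets of strings written with the oldest symbol first, the helper names `has_suffix_property` and `truncate` are hypothetical, and truncation is taken to mean that every sequence longer than the level is replaced by its suffix of that length.

```python
def has_suffix_property(tree: set) -> bool:
    """True if no sequence of the tree is a suffix of another sequence of the tree."""
    return not any(s != w and w.endswith(s) for s in tree for w in tree)


def truncate(tree: set, level: int) -> set:
    """Keep sequences of length <= level; cut longer ones to their last `level` symbols."""
    return {w if len(w) <= level else w[len(w) - level:] for w in tree}


if __name__ == "__main__":
    # Hypothetical bounded tree over the alphabet {0, 1}.
    tree = {"1", "10", "100", "000"}
    print(has_suffix_property(tree))
    print(sorted(truncate(tree, 2)))
```

In this toy example the two sequences of length three share the same suffix of length two, so the truncated tree has three elements.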
Definition 2.2
A probabilistic context tree over $A$ is an ordered pair $(\mathcal{T}, \bar{p})$ such that
1. $\mathcal{T}$ is an irreducible tree;
2. $\bar{p} = \{p(\cdot|w);\ w \in \mathcal{T}\}$ is a family of transition probabilities over $A$.
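Consider a stationary stochastic chain $(X_t)_{t \in \mathbb{Z}}$ over $A$. Given a sequence $w \in A^j$ we denote by
$$p(w) = \mathbb{P}(X_1^j = w)$$
the stationary probability of the cylinder defined by the sequence $w$. If $p(w) > 0$ we write
$$p(a|w) = \mathbb{P}(X_0 = a \mid X_{-j}^{-1} = w).$$
Definition 2.3
A sequence $w \in A^j$ is a context for the process $(X_t)$ if $p(w) > 0$ and for any semi-infinite sequence $x_{-\infty}^{-1}$ such that $w$ is a suffix of $x_{-\infty}^{-1}$ we have that
$$\mathbb{P}(X_0 = a \mid X_{-\infty}^{-1} = x_{-\infty}^{-1}) = p(a|w) \quad \text{for all } a \in A,$$
and no suffix of $w$ satisfies this equation.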
Definition 2.4
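We say that the process $(X_t)$ is compatible with the probabilistic context tree $(\mathcal{T}, \bar{p})$ if the following conditions are satisfied
1. $w \in \mathcal{T}$ if and only if $w$ is a context for the process $(X_t)$.
2. For any $w \in \mathcal{T}$ and any $a \in A$, $p(a|w) = \mathbb{P}(X_0 = a \mid X_{-\ell(w)}^{-1} = w)$.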
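To make the notion of compatibility concrete, here is a minimal Python sketch that generates a sample from a small bounded probabilistic context tree. Everything in it is an assumption made for illustration: the tree `TREE`, its transition probabilities, the binary alphabet and the helper names are hypothetical, and the chain is started from a fixed past, so the output is only approximately stationary. Unbounded trees, which are the object of this paper, cannot be stored in a finite dictionary like this one.

```python
import random

# Hypothetical probabilistic context tree over {0, 1}.
# Keys are contexts (oldest symbol first); values are P(next symbol = "1").
TREE = {"1": 0.3, "10": 0.6, "100": 0.8, "000": 0.5}


def find_context(past: str, tree: dict) -> str:
    """Return the context of the tree that is a suffix of `past`."""
    for k in range(1, len(past) + 1):
        if past[-k:] in tree:
            return past[-k:]
    raise ValueError("past is too short to determine a context")


def simulate(n: int, tree: dict, seed: int = 0, start: str = "000") -> str:
    """Generate n symbols; each symbol depends on the past only through its context."""
    rng = random.Random(seed)
    x = start
    for _ in range(n):
        p_one = tree[find_context(x, tree)]
        x += "1" if rng.random() < p_one else "0"
    return x[len(start):]


if __name__ == "__main__":
    print(find_context("0110", TREE))
    print(simulate(30, TREE))
```

By the suffix property at most one context can match a given past, so the order in which `find_context` scans the candidate lengths does not matter.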
Define the sequence as
From now on we will assume that the probabilistic context tree satisfies the following assumptions.
Assumption 2.5
Non-nullness, that is $\inf_{w \in \mathcal{T}} p(a|w) > 0$ for any $a \in A$.
Assumption 2.6
Summability of the sequence . In this case denote by
For a probabilistic context tree satisfying Assumptions 2.5 and 2.6, the maximal coupling argument used in (7), or alternatively the perfect simulation scheme presented in (2), imply the uniqueness of the law of the chain compatible with it.
Given an integer , we define
and
We denote by
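In what follows we will assume that $x_1, x_2, \ldots, x_n$ is a sample of the stationary stochastic chain $(X_t)$ compatible with the probabilistic context tree $(\mathcal{T}, \bar{p})$.
For any finite string $w$ with $\ell(w) \le n$, we denote by $N_n(w)$ the number of occurrences of the string in the sample; that is
$$N_n(w) = \sum_{t=0}^{n-\ell(w)} \mathbf{1}\{x_{t+1}^{t+\ell(w)} = w\}.$$
For any element $a \in A$, the empirical transition probability $\hat{p}_n(a|w)$ is defined by
$$\hat{p}_n(a|w) = \frac{N_n(wa) + 1}{N_n(w\cdot) + |A|}, \tag{2.7}$$
where
$$N_n(w\cdot) = \sum_{b \in A} N_n(wb).$$
This definition of $\hat{p}_n(a|w)$ is convenient because it is asymptotically equivalent to $N_n(wa)/N_n(w\cdot)$ and it avoids an extra definition in the case $N_n(w\cdot) = 0$.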
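The counts and the empirical transition probabilities can be sketched in Python as follows. We assume here an add-one smoothed estimator (count plus one in the numerator, total count plus alphabet size in the denominator), which is consistent with the remark that no separate definition is needed when the string is never observed. The function names `count` and `p_hat` and the binary alphabet are illustrative assumptions.

```python
def count(sample: str, w: str) -> int:
    """Number of (possibly overlapping) occurrences of w in the sample."""
    m = len(w)
    return sum(sample[t:t + m] == w for t in range(len(sample) - m + 1))


def p_hat(sample: str, w: str, a: str, alphabet: str = "01") -> float:
    """Add-one smoothed estimate of the probability that symbol a follows w."""
    total = sum(count(sample, w + b) for b in alphabet)
    return (count(sample, w + a) + 1) / (total + len(alphabet))


if __name__ == "__main__":
    sample = "0101101"
    print(count(sample, "01"))        # occurrences starting at positions 0, 2 and 5
    print(p_hat(sample, "1", "0"))    # (2 + 1) / (3 + 2)
    print(p_hat(sample, "00", "0"))   # unseen string: falls back to 1 / |alphabet|
```

For a string that never occurs in the sample the estimate reduces to the uniform distribution on the alphabet, so the estimator is always a probability distribution.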
A variant of Rissanen’s algorithm Context is defined as follows. First of all, let us define for any finite string $w \in A^*$:
$$\Delta_n(w) = \max_{a \in A} |\hat{p}_n(a|w) - \hat{p}_n(a|\mathrm{suf}(w))|.$$
The operator $\Delta_n(w)$ computes a distance between the empirical transition probabilities associated to the sequence $w$ and the one associated to the sequence $\mathrm{suf}(w)$.
Definition 2.8
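Given $\delta > 0$ and $d < n$, the tree estimated with the algorithm Context is
$$\hat{\mathcal{T}}_n^{\delta,d} = \{ w \in A_1^d : \Delta_n(a\,\mathrm{suf}(w)) > \delta \text{ for some } a \in A \text{ and } \Delta_n(uw) \le \delta \text{ for all } u \in A_1^{d-\ell(w)} \},$$
where $A_1^r$ denotes the set of all sequences of length at most $r$. In the case $\ell(w) = d$ we have $A_1^{d-\ell(w)} = \emptyset$.
It is easy to see that $\hat{\mathcal{T}}_n^{\delta,d}$ is an irreducible tree. Moreover, the way we defined $\hat{p}_n(\cdot|\cdot)$ in (2.7) associates a probability distribution to each sequence in $\hat{\mathcal{T}}_n^{\delta,d}$.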
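A possible reading of this pruning rule is sketched below in Python. It should be taken as an illustration under stated assumptions, not as a reference implementation: we assume add-one smoothed empirical probabilities, a distance given by the maximum over symbols of the absolute difference between the estimates at a sequence and at its largest suffix, and a rule that keeps a sequence of length at most `d` when some sibling exceeds the threshold `thr` while no longer sequence ending with it does. All function and parameter names are hypothetical.

```python
from itertools import product


def count(sample, w):
    """Number of (possibly overlapping) occurrences of w in the sample."""
    m = len(w)
    return sum(sample[t:t + m] == w for t in range(len(sample) - m + 1))


def p_hat(sample, w, a, alphabet):
    """Add-one smoothed estimate of the probability that a follows w (assumed form)."""
    total = sum(count(sample, w + b) for b in alphabet)
    return (count(sample, w + a) + 1) / (total + len(alphabet))


def delta(sample, w, alphabet):
    """Largest gap over symbols between the estimates at w and at its largest suffix."""
    return max(abs(p_hat(sample, w, a, alphabet) - p_hat(sample, w[1:], a, alphabet))
               for a in alphabet)


def words(max_len, alphabet):
    """All nonempty strings of length at most max_len (none if max_len < 1)."""
    for k in range(1, max_len + 1):
        for letters in product(alphabet, repeat=k):
            yield "".join(letters)


def context_estimate(sample, thr, d, alphabet="01"):
    """Sequences w with a sibling above thr and no extension u + w above thr."""
    tree = set()
    for w in words(d, alphabet):
        sibling_above = any(delta(sample, a + w[1:], alphabet) > thr for a in alphabet)
        extension_above = any(delta(sample, u + w, alphabet) > thr
                              for u in words(d - len(w), alphabet))
        if sibling_above and not extension_above:
            tree.add(w)
    return tree


if __name__ == "__main__":
    sample = "0110100110010110" * 30
    print(sorted(context_estimate(sample, thr=0.1, d=3)))
```

The sketch recomputes every count from scratch, which keeps it short but is quadratic in practice; a real implementation would tabulate the counts once. Under the rule assumed here the output always satisfies the suffix property, because a sequence is discarded as soon as one of its extensions exceeds the threshold.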
The main result in this article is the following
Theorem 2.9
As a consequence we obtain the following strong consistency result.
Corollary 2.11
3 Exponential inequalities for empirical probabilities
The main ingredient in the proof of Theorem 2.9 is the following exponential upper bound
Theorem 3.1
For any finite sequence , any symbol and any the following inequality holds
where
(3.2) 
As a direct consequence of Theorem 3.1 we obtain the following corollary.
Corollary 3.3
For any finite sequence with , any symbol , any and any the following inequality holds
where the constant is given by (3.2).
To prove Theorem 3.1 we need a mixing property for processes compatible with a probabilistic context tree satisfying Assumptions 2.5 and 2.6. This is the content of the following lemma.
Lemma 3.4
First note that
where denotes the set of all semi-infinite sequences . The reader can find a proof of the inequalities above in ((7), Proposition 3). Using this fact and the condition of stationarity it is sufficient to prove that for any ,
Note that for all pasts we have
Therefore, applying the loss of memory property proved in ((2), Corollary 4.1) we have that
where is defined as the probability of return to the origin at time of the Markov chain on starting at time zero at the origin and having transition probabilities
(3.7) 
This concludes the proof of (3.6). To prove (3.5), let be the Markov chain with transition probabilities given by (3.7). By definition we have
From this, using the inequality which holds for any , it follows that
This concludes the proof of the lemma.
We are now ready to prove Theorem 3.1.
Proof of Theorem 3.1. Let $w$ be a finite sequence and $a$ any symbol in $A$. Define the random variables
for . Then, using ((4), Proposition 4) we have that, for any
Then, as in ((5), Proposition 5) we also obtain that, for any ,
where
4 Proof of the main results
Proof of Theorem 2.9.
Define
and
Then, if we have that
The result follows from a succession of lemmas.
Lemma 4.1
Recall that
Note that the fact implies that for any finite sequence with and any symbol we have . Hence,
Using Corollary 3.3 we can bound from above the right hand side of the last inequality by
where the constant is given by (3.2).
Lemma 4.2
For any and for any with we have that
where the constant is given by (3.2).
As satisfies (2.10) there exists such that for some . Then
Observe that for any ,
Hence, we have that for any
Therefore,
As and we can use Corollary 3.3 to bound from above the right hand side of this inequality by
where the constant is given by (3.2). This concludes the proof of the lemma.
Now we can finish the proof of Theorem 2.9. We have that
Using the definition of and we have that
Applying Lemma 4.1 and Lemma 4.2 we can bound from above the last expression by
where the constant is given by (3.2). This concludes the proof of Theorem 2.9.
5 Final remarks
The present paper presents an upper bound for the rate of convergence of a version of the algorithm Context, for unbounded context trees. This generalizes previous results obtained in (10) for the case of bounded variable memory processes. We obtain an exponential bound for the probability of incorrect estimation of the truncated context tree, when the estimator is given by Definition 2.8. Note that the definition of the context tree estimator depends on the parameter $\delta$, and this parameter appears in the exponent of the upper bound. To ensure the consistency of the estimator we need to choose $\delta$ sufficiently small, depending on the transition probabilities of the process. Therefore, our estimator is not universal, in the sense that for any fixed $\delta$ it fails to be consistent for any process having . The same happens with the parameter $d$. In order to choose $\delta$ and $d$ independently of the process, we can allow these parameters to be functions of $n$, in such a way that $\delta_n$ goes to zero and $d_n$ goes to $+\infty$ as $n$ diverges. When we do this, we lose the exponential property of the upper bound.
As an anonymous referee has pointed out, Finesso et al. (9) proved that in the simpler case of estimating the order of a Markov chain, it is not possible to obtain pure exponential bounds for the overestimation event with a universal estimator. The above discussion illustrates this fact.
6 Acknowledgments
We thank Pierre Collet, Imre Csiszár, Nancy Garcia, Aurélien Garivier, Bezza Hafidi, Véronique Maume-Deschamps, Eric Moulines, Jorma Rissanen and Bernard Schmitt for many discussions on the subject. We also thank an anonymous referee who drew our attention to the interesting paper (9).
Footnotes
This work is part of PRONEX/FAPESP’s project Stochastic behavior, critical phenomena and rhythmic pattern identification in natural languages (grant number 03/099309) and CNPq’s projects Stochastic modeling of speech (grant number 475177/20045) and Rhythmic patterns, prosodic domains and probabilistic modeling in Portuguese Corpora (grant number 485999/20072). AG is partially supported by a CNPq fellowship (grant 308656/20059) and FL is supported by a FAPESP fellowship (grant 06/569800).
References
(1) P. Bühlmann and A. J. Wyner. Variable length Markov chains. Ann. Statist., 27:480–513, 1999.
(2) F. Comets, R. Fernández, and P. Ferrari. Processes with long memory: Regenerative construction and perfect simulation. Ann. Appl. Probab., 12(3):921–943, 2002.
(3) I. Csiszár and Z. Talata. Context tree estimation for not necessarily finite memory processes, via BIC and MDL. IEEE Trans. Inform. Theory, 52(3):1007–1016, 2006.
(4) J. Dedecker and P. Doukhan. A new covariance inequality and applications. Stochastic Process. Appl., 106(1):63–80, 2003.
(5) J. Dedecker and C. Prieur. New dependence coefficients. Examples and applications to statistics. Probab. Theory Related Fields, 132:203–236, 2005.
(6) D. Duarte, A. Galves, and N. L. Garcia. Markov approximation and consistent estimation of unbounded probabilistic suffix trees. Bull. Braz. Math. Soc., 37(4):581–592, 2006.
(7) R. Fernández and A. Galves. Markov approximations of chains of infinite order. Bull. Braz. Math. Soc., 33(3):295–306, 2002.
(8) F. Ferrari and A. Wyner. Estimation of general stationary processes by variable length Markov chains. Scand. J. Statist., 30(3):459–480, 2003.
(9) L. Finesso, C.-C. Liu, and P. Narayan. The optimal error exponent for Markov order estimation. IEEE Trans. Inform. Theory, 42(5):1488–1497, 1996.
(10) A. Galves, V. Maume-Deschamps, and B. Schmitt. Exponential inequalities for VLMC empirical trees. ESAIM Prob. Stat. (accepted), 2006.
(11) J. Rissanen. A universal data compression system. IEEE Trans. Inform. Theory, 29(5):656–664, 1983.
(12) D. Ron, Y. Singer, and N. Tishby. The power of amnesia: Learning probabilistic automata with variable memory length. Machine Learning, 25(2–3):117–149, 1996.
(13) M. J. Weinberger, J. Rissanen, and M. Feder. A universal finite memory source. IEEE Trans. Inform. Theory, 41(3):643–652, 1995.
(14) F. M. Willems, Y. M. Shtarkov, and T. J. Tjalkens. The context-tree weighting method: basic properties. IEEE Trans. Inform. Theory, 41(3):653–664, 1995.