
# Context Tree Selection: A Unifying View

A. Garivier (aurelien.garivier@telecom-paristech.fr), LTCI, CNRS, Telecom ParisTech

F. Leonardi (florencia@usp.br), Instituto de Matemática e Estatística, Universidade de São Paulo

###### Abstract

Context tree models have been introduced by Rissanen in rissanen1983 () as a parsimonious generalization of Markov models. Since then, they have been widely used in applied probability and statistics. The present paper investigates non-asymptotic properties of two popular procedures of context tree estimation: Rissanen’s algorithm Context and penalized maximum likelihood. First showing how they are related, we prove finite horizon bounds for the probability of over- and under-estimation. Concerning over-estimation, no boundedness or loss-of-memory conditions are required: the proof relies on new deviation inequalities for empirical probabilities of independent interest. The under-estimation properties rely on classical hypotheses for processes of infinite memory. These results improve on and generalize the bounds obtained in duarte2006 (); galves2006 (); galves2008 (); leonardi2009 (), refining asymptotic results of buhlmann1999 (); csiszar2006 ().

###### keywords:
algorithm Context, penalized maximum likelihood, model selection, variable length Markov chains, Bayesian information criterion, deviation inequalities

## 1 Introduction

Context tree models (CTM), first introduced by Jorma Rissanen in rissanen1983 () as efficient tools in Information Theory, have been successfully studied and used since then in many fields of Probability and Statistics, including Bioinformatics bejerano2001a (); busch2009 (), Universal Coding willems1995 (), Mathematical Statistics buhlmann1999 () or Linguistics galves2009 (). Sometimes also called Variable Length Markov Chain (VLMC), a context tree process is informally defined as a Markov chain whose memory length depends on past symbols. This property makes it possible to represent the set of memory sequences as a tree, called the context tree of the process.

A remarkable tradeoff between expressivity and simplicity explains this success: no more difficult to handle than Markov chains, they appear to be much more flexible and parsimonious, including memory only where necessary. Not only do they provide more efficient models for fitting the data: it also appears that, in many applications, the shape of the context tree has a natural and informative interpretation. In Bioinformatics, the context trees of a sample have been useful to test the relevance of protein family databases busch2009 () and in Linguistics, tree estimation highlights structural discrepancies between Brazilian and European Portuguese galves2009 ().

Of course, practical use of CTM requires the possibility of constructing efficient estimators of the model generating the data. It could be feared that, as a counterpart of the model multiplicity, increased difficulty would be encountered in model selection. Actually, this is not the case: several procedures were soon proposed and proved to be consistent. Roughly speaking, two families of context tree estimators are available. The first family, derived from the so-called algorithm Context introduced by Rissanen in rissanen1983 (), is based on the idea of tree pruning. These estimators are somewhat reminiscent of the CART breiman:al:84:cart () pruning procedures: a measure of discrepancy between a node’s children determines whether they have to be removed from the tree or not. The second family of estimators is based on a classical approach of mathematical statistics: Penalized Maximum Likelihood (PML). For each possible model, a criterion is computed which balances the quality of fit and the complexity of the model. In the framework of Information Theory, these procedures can be interpreted as derivations of the Minimum Description Length principle barron1998 ().

In the case of bounded memory processes, the problem of consistent estimation is clear: an estimator is strongly consistent if it is equal to eventually almost surely as the sample size grows to infinity. As early as 1983, Rissanen proved consistency results for the algorithm Context in this case. Later, the possibility of handling infinite memory processes was also addressed. In csiszar2006 (), an estimator is called strongly consistent if for every positive integer , its truncation at level is equal to the truncation of eventually almost surely. With this definition, PML estimators are shown to be strongly consistent if the penalties are appropriately chosen and if the maximization is restricted to a proper set of models. This last restriction was proven to be unnecessary in the finite memory case garivier2006 ().

More recently, the problem of deriving non-asymptotic bounds for the probability of incorrect estimation was considered. In galves2006 (), non-universal inequalities were derived for a version of the algorithm Context in the case of finite context trees. These results were generalized to the case of infinite trees in galves2008 (), and to PML estimators in leonardi2009 (). Using recent advances in weak dependence theory, all these results strongly rely on mixing hypotheses of the process.

For all these results, a distinction has to be made between two potential errors: under- and over-estimation. A context of is said to be under-estimated if one of its proper suffixes appears in the estimated tree , whereas it is called over-estimated if it appears as an internal node of . Over- and under-estimation appear to be of different natures: while under-estimation is eventually avoided by the existence of a strictly positive distance between a process and all processes with strictly smaller context trees, controlling over-estimation requires bounds on the fluctuations of empirical processes.

In this article, we present a unified analysis of the two families of context tree estimators. We contribute to a completely non-asymptotic analysis: we show that for appropriate parameters and measure of discrepancy, the PML estimator is always smaller than the estimator given by the algorithm Context. To our knowledge, this is the first result comparing these two context tree selection methods.

Without restrictions on the (possibly infinite) context tree , we prove that both methods provide estimators that are with high probability sub-trees of (i.e., a node that is not in does not appear in ). These bounds are more precise and do not require the conditions assumed in galves2006 (); galves2008 (); leonardi2009 (). For this purpose, we derive “self-normalized” non-asymptotic deviation inequalities, using martingale techniques inspired by proofs of the Law of the Iterated Logarithm neveu72 (); csiszar2002 (). These inequalities prove interesting in other fields, for instance in reinforcement learning garivier2008 (); filippiCappeGarivier10KLUCRL (). On the other hand, we derive upper bounds on the probability of under-estimation by assuming classical mixing conditions on the process generating the sample: with high probability, contains every node of at moderate height. This result is based on exponential inequalities derived for a wider class of processes than in galves2006 (); galves2008 (); leonardi2009 ().

Our upper bounds on the probability of over- and under-estimation imply strong consistency of the PML estimators for a larger class of penalizing functions than in leonardi2009 (). Similarly, in the case of the algorithm Context the strong consistency can also be derived for suitable threshold parameters, generalizing the convergence in probability for this estimator obtained previously in duarte2006 ().

The paper is organized as follows. In Section 2 we set notation and definitions, describe the algorithms in detail and state our main results. The proofs of these results are given in Section 3. In Section 4 we briefly discuss our results. Appendix A contains the statement and proof of the self-normalized deviation inequalities and Appendix B is devoted to the presentation of exponential inequalities for weakly dependent processes.

## 2 Notations and results

In what follows, is a finite alphabet; its size is denoted by . denotes the set of all sequences of length over , in particular has only one element, the empty sequence. We denote by the set of all finite sequences on alphabet and will denote the set of all semi-infinite sequences of symbols in . The length of the sequence is . For , we denote and denotes the semi-infinite sequence . Given and , we denote by the sequence obtained by concatenating the two sequences and . We say that the sequence is a suffix of the sequence if there exists a sequence such that . In this case we write or . When we say that is a proper suffix of and we write or .

A set is a tree if no sequence is a proper suffix of another sequence . The height of the tree is defined as

$$h(T) = \sup\{|w| : w \in T\}.$$

If we say that is bounded and we denote by the cardinality of . If we say that is unbounded. The elements of are also called the leaves of . An internal node of is a proper suffix of a leaf. For any sequence and for any tree , we define the tree as the set of leaves in which have as a suffix, that is

$$T_w = \{u \in T : u \succeq w\}.$$

Given a tree and an integer we will denote by the tree truncated to level , that is

$$T|_K = \{w \in T : |w| \le K\} \cup \{w \in A^K : w \prec u \text{ for some } u \in T\}.$$
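The truncation can be illustrated with a minimal sketch. It assumes, purely for illustration, that a tree is stored as a Python set of strings written with the oldest symbol first, so that "suffix" corresponds to `str.endswith`; the function name `truncate_tree` and the toy tree are hypothetical.

```python
def truncate_tree(tree, K):
    """Return T|_K for a tree stored as a set of strings (oldest symbol first)."""
    # Leaves of length at most K are kept unchanged.
    short = {w for w in tree if len(w) <= K}
    # Each longer leaf is replaced by its suffix of length K.
    cut = {u[len(u) - K:] for u in tree if len(u) > K}
    return short | cut

# Toy binary tree: no element is a proper suffix of another one.
toy_tree = {"1", "10", "000", "100"}
truncated = truncate_tree(toy_tree, 2)
```

With `K = 0` every leaf collapses to the empty sequence, and for `K` at least the height of the tree the truncation returns the tree itself.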

Given two trees and we say that is included in (denoted by or ) if for any sequence there exists a sequence such that ; in other words, all leaves of are either leaves or internal nodes of .

Consider a stationary ergodic stochastic process over . Given a sequence we denote by

$$p(w) = \mathbb{P}(X_1^{|w|} = w)$$

the stationary probability of the cylinder defined by the sequence . If we write

$$p(a|w) = \mathbb{P}(X_0 = a \mid X_{-|w|}^{-1} = w).$$
###### Definition 2.1.

A sequence is a finite context for the process if it satisfies

1. ;

2. for any sequence such that and ,

$$\mathbb{P}(X_0 = a \mid X_{-|v|}^{-1} = v) = p(a|w), \quad \text{for all } a \in A;$$
3. no proper suffix of satisfies 1. and 2.

An infinite context is a semi-infinite sequence such that any of its finite suffixes , is a context. In what follows the term context will refer to a finite or infinite context.

It can be seen that the set of all contexts of the process is a tree. This is called the context tree of the process. For example, the context tree of an i.i.d. process is and the context tree of a generic Markov chain of order is . In what follows, we will denote by the context tree of the process .

Let be positive integers. Let be a sequence distributed according to . For any sequence and any symbol we denote by the number of occurrences of symbol in that are preceded by an occurrence of , that is:

$$N_n(w,a) = \sum_{t=1}^{n} \mathbf{1}\{X_{t-|w|}^{t-1} = w,\ X_t = a\}. \tag{2.2}$$

The sum is denoted by .

We will denote by the set of all sequences that appear at least once in the sample, that is

$$V_n = \{w \in A^* : N_n(w) \ge 1\}.$$
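As an illustration of the counts in (2.2), here is a minimal sketch. It assumes the sample and the symbols observed before it are given as strings written oldest symbol first; the names `transition_counts`, `context_counts` and the cap `max_len` on the length of the sequences are hypothetical choices made only for this example.

```python
from collections import Counter

def transition_counts(past, sample, max_len):
    """Count N_n(w, a) for every sequence w with |w| <= max_len.

    past   : symbols observed before the sample (at least max_len of them)
    sample : X_1 ... X_n
    """
    if len(past) < max_len:
        raise ValueError("need at least max_len past symbols")
    full = past + sample
    counts = Counter()
    for t in range(len(past), len(full)):    # position of X_t in `full`
        for k in range(max_len + 1):
            w = full[t - k:t]                # the k symbols preceding X_t
            counts[(w, full[t])] += 1
    return counts

def context_counts(counts):
    """N_n(w) is the sum of N_n(w, a) over all symbols a."""
    totals = Counter()
    for (w, _a), c in counts.items():
        totals[w] += c
    return totals

counts = transition_counts("01", "0110", max_len=2)
totals = context_counts(counts)
observed = set(totals)   # the sequences of V_n of length at most max_len
```

Every key of `totals` has a count of at least one, so `observed` is the part of the set of observed sequences made of sequences of length at most `max_len`.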
###### Definition 2.3.

We will say that a tree is acceptable if it satisfies the following conditions:

1. ; and

2. every sequence such that belongs to or has a proper suffix that belongs to .

Then, our set of candidate trees, denoted by , will be the set of all acceptable trees. Our goal is to select a tree as close as possible to , in some sense that will be formally given below. Note that may depend on , so that the set of candidate trees is allowed to grow with the sample size. The symbols are only observed to ensure that, for every candidate tree , the context of in is well defined, for every . Alternatively, if were not assumed observed, similar results would be obtained by using quasi-maximum likelihood estimators galves:garivier:gassiat:2010 (). Given a tree , the maximum likelihood of the sequence is given by

$$\hat{P}_{\mathrm{ML},T}(X_1^n) = \prod_{w \in T} \prod_{a \in A} \hat{p}_n(a|w)^{N_n(w,a)}, \tag{2.4}$$

where the empirical probabilities are

$$\hat{p}_n(a|w) = \frac{N_n(w,a)}{N_n(w)} \tag{2.5}$$

if and otherwise. For any sequence we define

$$\hat{P}_{\mathrm{ML},w}(X_1^n) = \prod_{a \in A} \hat{p}_n(a|w)^{N_n(w,a)}.$$

Hence, we have

$$\hat{P}_{\mathrm{ML},T}(X_1^n) = \prod_{w \in T} \hat{P}_{\mathrm{ML},w}(X_1^n).$$

In order to measure discrepancy between two probability measures over we use the Kullback-Leibler divergence, defined for two probability measures and on by

$$D(P;Q) = \sum_{a \in A} P(a) \log \frac{P(a)}{Q(a)}$$

where, by convention, if and if .
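The divergence can be sketched in a few lines. The example assumes that probability measures are stored as dictionaries mapping symbols to probabilities, and it uses the usual conventions (a term with P(a) = 0 contributes zero, and the divergence is infinite when P(a) > 0 while Q(a) = 0); these conventions and the name `kl_divergence` are assumptions of this sketch.

```python
import math

def kl_divergence(p, q):
    """D(P;Q) = sum over a of P(a) * log(P(a) / Q(a)) on a finite alphabet."""
    total = 0.0
    for a, pa in p.items():
        if pa == 0.0:
            continue                  # convention: 0 * log(0 / q) = 0
        qa = q.get(a, 0.0)
        if qa == 0.0:
            return math.inf           # convention: P(a) > 0 and Q(a) = 0
        total += pa * math.log(pa / qa)
    return total

p = {"0": 0.5, "1": 0.5}
q = {"0": 0.25, "1": 0.75}
d_pq = kl_divergence(p, q)
```

The divergence is not symmetric: `kl_divergence(q, p)` generally differs from `kl_divergence(p, q)`.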

### 2.1 The algorithm Context

The algorithm Context introduced by J. Rissanen in rissanen1983 () computes, for each node of a given tree, a discrepancy measure between the transition probability associated to this context and the corresponding transition probabilities of the nodes obtained by concatenating a single symbol to the context. Beginning with the largest leaves of a candidate tree, if the discrepancy measure is greater than a given threshold, the contexts are maintained in the tree; otherwise, they are pruned. The procedure continues until no more pruning of the tree can be performed.

For all sequences let

$$\Delta_n(w) = \sum_{b : bw \in V_n} N_n(bw)\, D\big(\hat{p}_n(\cdot|bw);\ \hat{p}_n(\cdot|w)\big).$$
###### Remark 2.6.

We use here the original choice of divergence proposed by J. Rissanen in rissanen1983 (), but other possibilities have been proposed in the literature (see for instance buhlmann1999 (); galves2006 ()).

We will denote the threshold used in algorithm Context on samples of length by , where is a sequence of positive real numbers such that and when . For a sequence , let be an indicator function defined for all by the following induction:

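$$C_w(X_1^n) = \begin{cases} 0, & \text{if } N_n(w) \le 1 \text{ or } |w| \ge d, \\ \max\Big\{\mathbf{1}\{\Delta_n(w) \ge \delta_n\},\ \max_{b \in A} C_{bw}(X_1^n)\Big\}, & \text{if } N_n(w) > 1 \text{ and } |w| < d. \end{cases} \tag{2.7}$$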

With these definitions, the context tree estimator is the set given by

$$\hat{T}_C(X_1^n) = \{w \in V_n : C_w(X_1^n) = 0 \text{ and } C_u(X_1^n) = 1 \text{ for all } u \prec w\} \tag{2.8}$$
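The induction defining the indicator and the resulting estimator (2.8) can be sketched as follows. This is only an illustration under stated assumptions: sequences are strings written oldest symbol first, the `d` symbols preceding the sample are passed as `past`, and the function name `context_algorithm` and its arguments are hypothetical.

```python
import math
from collections import Counter
from functools import lru_cache

def context_algorithm(past, sample, alphabet, d, threshold):
    """Sketch of the estimator (2.8) with maximal depth d and threshold delta_n."""
    if len(past) < d:
        raise ValueError("need at least d past symbols")
    full = past + sample
    N = Counter()                                # N[(w, a)] = N_n(w, a), |w| <= d
    for t in range(len(past), len(full)):
        for k in range(d + 1):
            N[(full[t - k:t], full[t])] += 1

    def occ(w):                                  # N_n(w)
        return sum(N[(w, a)] for a in alphabet)

    def delta(w):                                # Delta_n(w)
        n_w, total = occ(w), 0.0
        for b in alphabet:
            n_bw = occ(b + w)
            for a in alphabet:
                c = N[(b + w, a)]
                if c > 0:
                    total += c * math.log((c / n_bw) / (N[(w, a)] / n_w))
        return total

    @lru_cache(maxsize=None)
    def C(w):                                    # indicator C_w
        if occ(w) <= 1 or len(w) >= d:
            return 0
        keep = 1 if delta(w) >= threshold else 0
        return max(keep, max(C(b + w) for b in alphabet))

    tree, stack = set(), [""]
    while stack:                                 # descend from the root while C_w = 1
        w = stack.pop()
        if C(w) == 0:
            tree.add(w)
        else:
            stack.extend(b + w for b in alphabet if occ(b + w) >= 1)
    return tree

sample = "01" * 20                               # alternating toy sample
tree_low = context_algorithm("01", sample, "01", d=2, threshold=3.0)
tree_high = context_algorithm("01", sample, "01", d=2, threshold=1000.0)
```

On this alternating sample the last symbol determines the next one, so a moderate threshold keeps the two contexts of length one, while a very large threshold prunes everything down to the root.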

### 2.2 The penalized maximum likelihood criterion

The penalized maximum likelihood criterion for the sequence is defined by

$$\hat{T}_{\mathrm{PML}}(X_1^n) = \arg\max_{T \in \mathcal{T}_n} \Big\{ \log \hat{P}_{\mathrm{ML},T}(X_1^n) - |T| f(n) \Big\}, \tag{2.9}$$

where is some positive function such that and when .

This class of context tree estimators was first considered by Csiszár and Talata in csiszar2006 (), who introduced the Bayesian Information Criterion (BIC) for context trees and proved its consistency. The BIC leads to the choice of the penalty function . It may first appear practically impossible to compute , because the maximization in (2.9) must be performed over the set of all candidate trees. Fortunately, Csiszár and Talata showed in their article csiszar2006 () how to adapt the Context Tree Maximizing (CTM) method willems1995 () in order to obtain a simple and efficient algorithm computing . As the representation of the estimator given by this algorithm is important for the proof of our results, we briefly present it here. Define recursively, for any , with , the value

$$V_w(X_1^n) = \max\Big\{ e^{-f(n)} \hat{P}_{\mathrm{ML},w}(X_1^n),\ \prod_{b \in A : bw \in V_n} V_{bw}(X_1^n) \Big\} \tag{2.10}$$

and the indicator

$$\mathcal{X}_w(X_1^n) = \mathbf{1}\Big\{ \prod_{b \in A : bw \in V_n} V_{bw}(X_1^n) > e^{-f(n)} \hat{P}_{\mathrm{ML},w}(X_1^n) \Big\}. \tag{2.11}$$

By convention, if or if then and . As shown in csiszar2006 (), it holds that

$$\hat{T}_{\mathrm{PML}}(X_1^n) = \{w \in V_n : \mathcal{X}_w(X_1^n) = 0 \text{ and } \mathcal{X}_u(X_1^n) = 1 \text{ for all } u \prec w\}. \tag{2.12}$$
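The recursion (2.10)-(2.11) and the representation (2.12) can be sketched in the log domain. The sketch assumes strings written oldest symbol first, `d` observed past symbols, and the boundary convention that a node at depth `d`, or with no observed child, gets the value of its own penalized likelihood and indicator 0; the function name `pml_tree` and the toy sample are hypothetical.

```python
import math
from collections import Counter
from functools import lru_cache

def pml_tree(past, sample, alphabet, d, penalty):
    """Sketch of the PML estimator computed by the recursion (2.10)-(2.12)."""
    if len(past) < d:
        raise ValueError("need at least d past symbols")
    full = past + sample
    N = Counter()                                # N[(w, a)] = N_n(w, a), |w| <= d
    for t in range(len(past), len(full)):
        for k in range(d + 1):
            N[(full[t - k:t], full[t])] += 1

    def occ(w):                                  # N_n(w)
        return sum(N[(w, a)] for a in alphabet)

    def log_ml(w):                               # log of the ML probability at w
        n_w = occ(w)
        return sum(N[(w, a)] * math.log(N[(w, a)] / n_w)
                   for a in alphabet if N[(w, a)] > 0)

    @lru_cache(maxsize=None)
    def node(w):
        """Return (log V_w, indicator of (2.11))."""
        own = -penalty + log_ml(w)
        if len(w) >= d:
            return own, 0                        # assumed boundary convention
        children = [b + w for b in alphabet if occ(b + w) >= 1]
        if not children:
            return own, 0                        # assumed boundary convention
        split = sum(node(c)[0] for c in children)
        return (split, 1) if split > own else (own, 0)

    tree, stack = set(), [""]
    while stack:                                 # descend while the indicator is 1
        w = stack.pop()
        if node(w)[1] == 0:
            tree.add(w)
        else:
            stack.extend(b + w for b in alphabet if occ(b + w) >= 1)
    return tree

sample = "01" * 20                               # alternating toy sample
bic_penalty = 0.5 * math.log(len(sample))        # (|A| - 1) / 2 * log n with |A| = 2
tree_bic = pml_tree("01", sample, "01", d=2, penalty=bic_penalty)
tree_heavy = pml_tree("01", sample, "01", d=2, penalty=1000.0)
```

Working with logarithms avoids numerical underflow of the products in (2.10); the comparison in (2.11) is unchanged because the logarithm is increasing.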

### 2.3 Results

In this subsection we present the main results of this article. First, we show that the tree given by the penalized maximum likelihood estimator is always included in the empirical tree given by the algorithm Context, if the threshold does not exceed the penalization function .

###### Proposition 2.13.

For any $n$ and all sequences $X_1^n$, if $\delta_n \le f(n)$ then

$$\hat{T}_{\mathrm{PML}}(X_1^n) \preceq \hat{T}_C(X_1^n).$$

In the sequel we will assume that the cutoff sequence of the algorithm Context equals the penalization term of the penalized maximum likelihood estimator, in order to allow a unified treatment of the two algorithms. That is, we will assume that $\delta_n = f(n)$ for any $n$.

We now state a new bound on the probability of over-estimation that does not require any mixing hypotheses on the underlying process.

###### Theorem 2.14.

For every it holds that

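$$\mathbb{P}\big(\hat{T}(X_1^n) \preceq T_0\big) \ge 1 - e\,\big(\delta_n \log(n) + |A|^2\big)\, n^2 \exp\left(-\frac{\delta_n}{|A|^2}\right), \tag{2.15}$$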

where $\hat{T} = \hat{T}_C$ or $\hat{T} = \hat{T}_{\mathrm{PML}}$.
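To get a feeling for the orders of magnitude, the right-hand side of (2.15) can be evaluated numerically. The sketch below reads the bound as one minus e (δ_n log n + |A|²) n² exp(−δ_n / |A|²); this reading of the formula, the function name `overestimation_bound` and the choice of a threshold proportional to log n are assumptions made for illustration only.

```python
import math

def overestimation_bound(n, delta, alphabet_size):
    """Lower bound of (2.15) on P(T_hat is included in T_0), clipped at 0."""
    a2 = alphabet_size ** 2
    error = math.e * (delta * math.log(n) + a2) * n ** 2 * math.exp(-delta / a2)
    return max(0.0, 1.0 - error)

# Binary alphabet, threshold delta_n = 12 * log(n).
bound_small = overestimation_bound(10 ** 3, 12 * math.log(10 ** 3), 2)
bound_large = overestimation_bound(10 ** 6, 12 * math.log(10 ** 6), 2)
```

For this choice of threshold the exponential factor decays like n to the power −3, which compensates the factor n² only for fairly large samples: the bound is vacuous for n = 1000 and informative for n = 10⁶.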

###### Remark 2.16.

Theorem 2.14 is proven without assuming any bound on the height of the hypothetical trees. That is, the result remains valid even if . But if the candidate trees have only a limited number of nodes, possibly depending on (see, e.g., rissanen1983 (); csiszar2006 ()), a straightforward modification of the proof shows that

$$\mathbb{P}\big(\hat{T}(X_1^n) \preceq T_0\big) \ge 1 - 2e\,\big(\delta_n \log(n) + |A|^2\big)\, k(n) \exp\left(-\frac{\delta_n}{|A|^2}\right),$$

where is the maximal number of nodes of a candidate tree. In particular, if the height of the trees is smaller than a function (possibly constant) then .

The problem of under-estimation in context tree models is very different, and requires additional hypotheses on the process . For any with define the coefficient

$$\beta(w,r) = \max_{u \in A^r} \max_{a \in A} \big\{ |p(a|w) - p(a|uw)| \big\}.$$

The continuity rate of the process is the sequence where

$$\beta_k = \max_{w \in A^k} \sup_{r \ge 1} \{\beta(w,r)\}.$$

Define also the non-nullness coefficient

$$\alpha_0 := \sum_{a \in A} \inf_{w \in T_0} \{p(a|w)\}. \tag{2.17}$$

Our underestimation error bounds will rely on the following assumption.

###### Assumption 1.

The process satisfies the following conditions

1. $\alpha_0 > 0$ (weak non-nullness) and

2. $\beta := \sum_{k \in \mathbb{N}} \beta_k < +\infty$ (summable continuity rate).

These are classical hypotheses for processes of infinite memory, which are also referred to as chains of type A, see for instance fernandez2002 () and references therein.

To establish upper bounds for the probability of under-estimation we will consider the truncated tree , for any given constant . Note that in the case is a finite tree, coincides with for a sufficiently large constant . The bounds are stated in the following theorem.

###### Theorem 2.18.

Assume the process satisfies Assumption 1. Let and let be such that

$$\min_{w \prec u \in T_0|_K}\ \max_{r \le d - |w|} \{\beta(w,r)\} \ge \epsilon > 0. \tag{2.19}$$

Then, there exists such that for any it holds that

 P(T0|K⪯^T(Xn1)|K)≥1−3eα0/32e2|A|2(|A|β+2α0)|A|2+Kexp[−nϵ2[pdmin−8|A|df(n)ϵ2n]216(d+1)], (2.20)

where or and .

###### Remark 2.21.

It can be seen that for any there is always a value of such that (2.19) holds. This hypothesis can be avoided by letting increase with the sample size and by controlling the upper bounds in (2.20). Extensions of Theorem 2.18 can also be obtained by allowing to be a function of the sample size . In this case, the rate at which increases must be controlled together with the rate at which and decrease with the sample size. This leads to a rather technical condition, see for instance talata2009 ().

Finally, the next theorem states the strong consistency of the estimators and for appropriate threshold parameters and penalizing functions, respectively.

###### Theorem 2.22.

Assume the hypotheses of Theorem 2.18 are met. Then for any threshold parameter such that

$$\sum_{n \in \mathbb{N}} \exp\left(-\frac{\delta_n}{|A|^2} + \log\big(\delta_n \log(n)\big)\right) < +\infty$$

we have eventually almost surely as . Similarly, if we choose we have eventually almost surely as .

## 3 Proofs

### 3.1 Proof of Proposition 2.13.

We must prove that a leaf in is always a leaf or an internal node in . By the characterization of and given by equations (2.8) and (2.12), respectively, this is equivalent to proving that for all with . In fact, assume that implies , and take ; then, either and , or it holds that for all , , which implies by assumption that . Now, if , then ; otherwise, is a proper suffix of a sequence . In any case, is a leaf or an internal node of .

Assume there exists , , such that and . Note that by (2.7), implies for all , ; hence, can be chosen such that for any , . In this case we have, by the definitions (2.10) and (2.11) that

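$$\begin{aligned} e^{-f(n)} \hat{P}_{\mathrm{ML},w}(X_1^n) &< \prod_{b : bw \in V_n} V_{bw}(X_1^n) &\qquad (3.1) \\ &= \prod_{b : bw \in V_n} e^{-f(n)} \hat{P}_{\mathrm{ML},bw}(X_1^n). &\qquad (3.2) \end{aligned}$$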

The equality in the second line of the last expression follows by the fact that for any , ; therefore we must have for any , .

Now, observe that for any , and . If not, would be equal to for some and for all , implying that ; hence

$$\prod_{b : bw \in V_n} V_{bw}(X_1^n) = V_{cw}(X_1^n) = e^{-f(n)} \hat{P}_{\mathrm{ML},cw}(X_1^n) = e^{-f(n)} \hat{P}_{\mathrm{ML},w}(X_1^n)$$

and thus, by definition, . Using these facts, and taking logarithm on both sides of Inequality (3.1), we obtain

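$$\begin{aligned} \big( |\{b : bw \in V_n\}| - 1 \big) f(n) &< \sum_{b : bw \in V_n} \sum_{a \in A} N_n(bw,a) \log \frac{\hat{p}_n(a|bw)}{\hat{p}_n(a|w)} \\ &= \sum_{b : bw \in V_n} N_n(bw)\, D\big(\hat{p}_n(\cdot|bw);\ \hat{p}_n(\cdot|w)\big) = \Delta_n(w). \end{aligned}$$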

Therefore, if $\delta_n \le f(n)$ we have $\Delta_n(w) > \delta_n$, which contradicts the fact that $C_w(X_1^n) = 0$. This concludes the proof of Proposition 2.13.
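The step from Inequality (3.1) to the bound on the threshold in the proof above can be spelled out as follows. Here $k = |\{b : bw \in V_n\}|$ is a shorthand introduced only for this remark, and the computation assumes that every occurrence of $w$ counted in the sample is preceded by an observed symbol, so that $N_n(w,a) = \sum_{b : bw \in V_n} N_n(bw,a)$.

$$\begin{aligned}
-f(n) + \sum_{a \in A} N_n(w,a) \log \hat{p}_n(a|w) &< -k f(n) + \sum_{b : bw \in V_n} \sum_{a \in A} N_n(bw,a) \log \hat{p}_n(a|bw), \\
(k-1) f(n) &< \sum_{b : bw \in V_n} \sum_{a \in A} N_n(bw,a) \log \hat{p}_n(a|bw) - \sum_{a \in A} N_n(w,a) \log \hat{p}_n(a|w) \\
&= \sum_{b : bw \in V_n} \sum_{a \in A} N_n(bw,a) \log \frac{\hat{p}_n(a|bw)}{\hat{p}_n(a|w)}.
\end{aligned}$$

Since the proof shows that $k \ge 2$, the left-hand side is at least $f(n)$.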

### 3.2 Proof of Theorem 2.14.

We will prove the result for the case $\hat{T} = \hat{T}_C$. The case $\hat{T} = \hat{T}_{\mathrm{PML}}$ follows straightforwardly from Proposition 2.13 and the equality $\delta_n = f(n)$.

Let be the event . Overestimation occurs if at least one internal node of has a (not necessarily proper) suffix in ; that is, if there exists a (possibly empty) sequence such that . Thus, with a slight abuse of notation, can be written as

$$O_n = \bigcup_{s \in T_0} \bigcup_{u \in A^*} \{\Delta_n(us) > \delta_n\}.$$

For any sequence we have that are the maximum likelihood estimators of the transition probabilities , therefore we have that

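$$\begin{aligned}
\Delta_n(w) &= \sum_{b \in A} N_n(bw)\, D\big(\hat{p}_n(\cdot|bw);\ \hat{p}_n(\cdot|w)\big) \\
&= \sum_{b \in A} N_n(bw) \sum_{a \in A} \Big( \hat{p}_n(a|bw) \log \hat{p}_n(a|bw) - \hat{p}_n(a|bw) \log \hat{p}_n(a|w) \Big) \\
&= \Big( \sum_{b \in A} N_n(bw) \sum_{a \in A} \hat{p}_n(a|bw) \log \hat{p}_n(a|bw) \Big) - \sum_{b \in A} \sum_{a \in A} N_n(bw,a) \log \hat{p}_n(a|w) \\
&= \Big( \sum_{b \in A} N_n(bw) \sum_{a \in A} \hat{p}_n(a|bw) \log \hat{p}_n(a|bw) \Big) - \sum_{a \in A} N_n(w,a) \log \hat{p}_n(a|w) \\
&\le \Big( \sum_{b \in A} N_n(bw) \sum_{a \in A} \hat{p}_n(a|bw) \log \hat{p}_n(a|bw) \Big) - \sum_{a \in A} N_n(w,a) \log p(a|w) \\
&= \Big( \sum_{b \in A} N_n(bw) \sum_{a \in A} \hat{p}_n(a|bw) \log \hat{p}_n(a|bw) \Big) - \sum_{b \in A} \sum_{a \in A} N_n(bw,a) \log p(a|w) \\
&= \sum_{b \in A} N_n(bw) \sum_{a \in A} \Big( \hat{p}_n(a|bw) \log \hat{p}_n(a|bw) - \hat{p}_n(a|bw) \log p(a|w) \Big) \\
&= \sum_{b \in A} N_n(bw)\, D\big(\hat{p}_n(\cdot|bw);\ p(\cdot|w)\big).
\end{aligned}$$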

Hence, as for all it holds that we obtain

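$$\mathbb{P}\big(\Delta_n(w) > \delta_n\big) \le \mathbb{P}\Big( \sum_{b \in A} N_n(bw)\, D\big(\hat{p}_n(\cdot|bw);\ p(\cdot|bw)\big) > \delta_n \Big).$$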

Using Theorem A.7, stated in Appendix A, it follows that

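$$\begin{aligned}
\mathbb{P}(O_n) &\le \sum_{s \in T_0} \sum_{u \in A^*} \mathbb{P}\big(\Delta_n(us) > \delta_n\big) \\
&\le \sum_{s \in T_0} \sum_{u \in A^*} \mathbb{P}\Big( \sum_{b \in A} N_n(bus)\, D\big(\hat{p}_n(\cdot|bus);\ p(\cdot|bus)\big) > \delta_n \Big) \\
&\le \sum_{s \in T_0} \sum_{u \in A^*} \sum_{b \in A} \mathbb{P}\Big( N_n(bus)\, D\big(\hat{p}_n(\cdot|bus);\ p(\cdot|bus)\big) > \frac{\delta_n}{|A|} \,\Big|\, N_n(bus) > 0 \Big)\, \mathbb{P}\big(N_n(bus) > 0\big) \\
&\le 2e\,\big(\delta_n \log n + |A|^2\big) \exp\left(-\frac{\delta_n}{|A|^2}\right) \sum_{s \in T_0} \sum_{u \in A^*} \sum_{b \in A} \mathbb{P}\big(N_n(bus) > 0\big) \\
&\le 2e\,\big(\delta_n \log n + |A|^2\big) \exp\left(-\frac{\delta_n}{|A|^2}\right) \mathbb{E}[C_n],
\end{aligned}$$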

where denotes the number of different contexts of the symbols in . But is always upper-bounded by the number of (not necessarily distinct) contexts of , and the result follows.

### 3.3 Proof of Theorem 2.18.

Here we will prove the result for the case $\hat{T} = \hat{T}_{\mathrm{PML}}$. The case $\hat{T} = \hat{T}_C$ follows again from Proposition 2.13 and the assumption that $\delta_n = f(n)$.

If denotes the event then

$$U_n \subset \bigcup_{w \prec u \in T_0|_K} \{\mathcal{X}_w(X_1^n) = 0\}.$$

Let . Then we have

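$$\mathbb{P}\big(\mathcal{X}_w(X_1^n) = 0\big) = \mathbb{P}\Big( \prod_{a \in A : aw \in V_n} V_{aw}(X_1^n) \le e^{-f(n)} \hat{P}_{\mathrm{ML},w}(X_1^n) \Big). \tag{3.3}$$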

By hypothesis, there exists and such that

$$\max_{a \in A} |p(a|w) - p(a|sw)| \ge \epsilon.$$

If , denote by and let be the tree given by

 T=∪ri=2∪b∈Ai{bsriw}∪{sw}.

By definition, for any it can be shown recursively that

$$V_{aw}(X_1^n) = \max_{T' \in \mathcal{T}_n} \prod_{v \in T'_{aw}} e^{-f(n)} \hat{P}_{\mathrm{ML},v}(X_1^n)$$

see for example Lemma 4.4 in csiszar2006 (). Therefore,

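$$\begin{aligned} \mathbb{P}\Big( \prod_{a \in A : aw \in V_n} &V_{aw}(X_1^n) \le e^{-f(n)} \hat{P}_{\mathrm{ML},w}(X_1^n) \Big) \\ &\le \mathbb{P}\Big( \prod_{u \in T} e^{-f(n)} \hat{P}_{\mathrm{ML},u}(X_1^n) \le e^{-f(n)} \hat{P}_{\mathrm{ML},w}(X_1^n) \Big) \qquad (3.4) \end{aligned}$$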

by noticing that

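$$\begin{aligned} \prod_{a \in A : aw \in V_n} \max_{T' \in \mathcal{T}_n} \prod_{v \in T'_{aw}} e^{-f(n)} \hat{P}_{\mathrm{ML},v}(X_1^n) &\ge \prod_{a \in A : aw \in V_n} \prod_{v \in T_{aw}} e^{-f(n)} \hat{P}_{\mathrm{ML},v}(X_1^n) \\ &\ge \prod_{u \in T} e^{-f(n)} \hat{P}_{\mathrm{ML},u}(X_1^n). \end{aligned}$$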

Applying logarithm and using that for any we can write the probability in (3.3) by

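$$\begin{aligned} \mathbb{P}\Big( \sum_{u \in T} N_n(u)\, &D\big(\hat{p}_n(\cdot|u);\ \hat{p}_n(\cdot|w)\big) \le (|T| - 1) f(n) \Big) \\ &\le \mathbb{P}\Big( N_n(sw)\, D\big(\hat{p}_n(\cdot|sw);\ \hat{p}_n(\cdot|w)\big) \le (|T| - 1) f(n) \Big). \qquad (3.5) \end{aligned}$$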

Define the events and by

 As,wn ={Xn1:Nn(sw)D(^pn(⋅<