Nonparametric statistical inference for the context tree of a stationary ergodic process
Abstract
We consider the problem of estimating the context tree of a stationary ergodic process with finite alphabet without imposing additional conditions on the process. As a starting point, we introduce a Hamming metric on the space of irreducible context trees and use the properties of the weak topology on the space of stationary ergodic processes to prove that, if the Hamming metric is unbounded, there exist no consistent estimators for the context tree. Even in the bounded case, we show that there exist no two-sided confidence bounds. However, we prove that one-sided inference is possible in this general setting, and we construct a consistent estimator that is a lower bound for the context tree of the process, with an explicit formula for the coverage probability. We develop an efficient algorithm to compute the lower bound and apply the method to test a linguistic hypothesis about the context tree of codified written texts in European Portuguese.
This article was produced as part of the activities of FAPESP Research, Innovation and Dissemination Center for Neuromathematics, grant 2013/076990, São Paulo Research Foundation. It has also received financial support from the projects Stochastic systems: equilibrium and nonequilibrium, limits in scale and percolation, grant CNPq 474233/20120, and Stochastic chains of long range, grant FAPESP 2015/090943.
Nonparametric statistical inference for context trees
Partially supported by a CNPq-Brazil fellowship 304836/20125 and a L’Oréal Fellowship for Women in Science.
MSC subject classifications: Primary 62M09, 62G15, 62G20; secondary 60G10, 60J10.
Keywords: variable length Markov chain, context tree, confidence bounds, consistent estimation, nonparametric inference.
1 Introduction
In this work we address the question of whether there exist consistent estimators (and confidence bounds) for the context tree of a discrete-time stationary ergodic process with finite alphabet. Informally, the context tree of a stochastic process is a set of finite strings or left-infinite sequences that determines the portion of the past the process has to look at in order to decide the distribution of its next symbol. For example, an i.i.d. process has the empty string as context tree, since it has no dependence on the past. A $k$-step Markov chain has a context tree containing at least one string of length $k$, and a non-Markovian chain (sometimes called an infinite memory process) has a context tree with at least one left-infinite sequence.
Finite context trees were introduced by Rissanen (1983) as an efficient tool for data compression. The corresponding processes were originally called Variable Length Markov Chains (VLMC), and their estimation was first addressed in Bühlmann and Wyner (1999). Recently, they have received increasing attention in the applied statistics literature, being used in a wide range of problems from different areas (see, for instance, Bejerano and Yona, 2001; Dalevi, Dubhashi and Hermansson, 2006; Busch et al., 2009; Galves et al., 2012). Their success in real-world applications seems to stem from their parsimony (memory is included only where the data require it) and their capacity to capture structural dependencies in the data. The drawback of the model, when compared for instance to finite-step Markov models, is that estimation is a much more complicated task. When he introduced the model, Rissanen (1983) also provided an algorithm for recovering the context tree from a given sample. Since then, a large part of the related statistical literature has focused on consistent estimation of the context tree in the finite and infinite memory cases; an incomplete list includes Bühlmann and Wyner (1999); Galves and Leonardi (2008); Collet, Galves and Leonardi (2008); Csiszár and Talata (2006); Garivier and Leonardi (2011).
Most of the works cited above make some assumptions on the processes in addition to ergodicity, such as lower bounds on the transition probabilities or mixing conditions. In the present paper, we refer to our statistical inference problem as nonparametric precisely because we make no assumptions on the distribution of the process other than ergodicity. In this nonparametric setting, Csiszár and Talata (2006) proved the consistency of the Bayesian Information Criterion (BIC) when the context trees are truncated to a given finite length (the truncation being necessary only for infinite context trees). As far as we know, nothing has been done concerning confidence bounds.
Given a sample of a stationary ergodic process, it is natural to wonder whether this process has a finite or infinite context tree. This cannot be consistently decided in this general class (Bailey, 1976; Morvai and Weiss, 2005). That is, there exists no two-valued function of the sample which, as the sample size increases, stabilizes to the value “yes” for every process having a finite context tree and “no” for every process having an infinite context tree. Thus, when considering the discrete metric in the space of trees, the existence of a universal consistent estimator relies on assumptions that cannot be checked empirically.
This situation has its counterpart in nonparametric statistics for i.i.d. observations. For instance, Fraiman and Meloche (1999) observed that it is impossible to decide, from a random sample, whether or not the underlying distribution has a finite number of modes. Under the a priori assumption that the number of modes is finite, they can be consistently estimated.
In the present work the space of irreducible context trees with finite alphabet is equipped with the Hamming distance. Using only topological arguments we prove that if this metric in the space of trees is unbounded, there exists no consistent estimator of the context tree in the class of stationary ergodic processes. In the bounded metric case, we construct an estimator that is consistent and also a nonparametric lower bound with an explicit coverage probability, based on a result of Garivier and Leonardi (2011). Finally, following Donoho (1988), we also prove that it is not possible to obtain nonparametric upper bounds even in the smaller class of processes having finite context trees. To our knowledge, this is the first work considering the problem of construction of nonparametric confidence bounds for context trees.
Notation, definitions and main results are given in the next section. In Section 3 we show how to compute the lower confidence bound and we present a practical application, testing a linguistic hypothesis about the memory of stressed and non-stressed syllables in European Portuguese written texts. The proofs of the results are given in Section 4.
2 Definitions and results
In this section we present the main definitions and theoretical results of this paper. We begin by describing the notion of irreducible tree and we introduce a Hamming distance in the set of all irreducible trees over a finite alphabet. Then we proceed by defining the context tree of a stationary ergodic process and by establishing some topological properties of the set of all stationary ergodic probability measures with respect to the weak topology. The last part of the section is dedicated to the statements of the main results of the paper.
2.1 Metric tree space
Let be a finite set called the alphabet. For any , we denote by the string of symbols in with length . This notation is also valid for in which case we obtain a left-infinite sequence . If we let denote the empty string . The length of a string will be denoted by . For any , we let denote the set of strings in having length , in particular . We also let denote the set of all finite strings on and we denote by the set of all left-infinite sequences with symbols in .
We will need to concatenate strings; for instance, if and are strings of length and respectively, then denotes the string of length obtained by putting the symbols in after the ones in . We also extend concatenation to the case where is an infinite string on the left. We say that is a suffix of the sequence if there exists a sequence such that . When we say that is a proper suffix of .
A tree is any set of strings, or possibly of left-infinite sequences, called leaves, such that no is a proper suffix of any other . This property enables us to represent the set as a graphical rooted tree by identifying the elements in with paths from the terminal nodes of the tree to the root. As an example of a finite tree, consider the set $\{1, 00, 010, 110\}$ over the alphabet $\{0,1\}$. On the other hand, an example of an infinite tree over is given by , which has a unique infinite element, the left-infinite sequence . The graphical representation of these trees can be found in Fig. 1. Special cases of trees are given by the entire set of left-infinite sequences, denoted in this paper by , and the tree consisting of the unique empty string , denoted by .
We say that the tree is irreducible if no can be replaced by a proper suffix without violating the tree property. Both trees in Fig. 1 are irreducible, as well as and . An example of a non-irreducible tree is , because substituting 000 by 00 leads to that satisfies the tree property.
We will call a node of any finite string that is a suffix of some . Sometimes it will be convenient to identify with the set of its nodes . In fact it is easy to verify that uniquely determines and vice versa. In the case of given before, the set is the set of all strings represented in Fig. 1, that is . In the case of we have .
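For finite trees, the tree property, the node set and irreducibility can be checked mechanically. The following Python sketch is only an illustration under stated assumptions: the function names are hypothetical, strings are written with the most recent symbol last, infinite leaves are not handled, and the reducible tree with leaves 1, 10 and 000 is an example chosen for the demonstration, not necessarily the one intended in the text. The irreducible example uses the leaves 1, 00, 010 and 110 (the contexts listed in Example 2.2).

```python
def is_proper_suffix(s, w):
    """True if s is a suffix of w and s is strictly shorter than w."""
    return len(s) < len(w) and w.endswith(s)


def satisfies_tree_property(leaves):
    """No leaf may be a proper suffix of another leaf."""
    return not any(is_proper_suffix(s, w) for s in leaves for w in leaves)


def nodes(leaves):
    """All finite suffixes of the leaves, including the empty string (root)."""
    return {w[i:] for w in leaves for i in range(len(w) + 1)}


def is_irreducible(leaves):
    """A tree is irreducible if no leaf can be replaced by one of its
    proper suffixes without violating the tree property."""
    if not satisfies_tree_property(leaves):
        raise ValueError("input is not a tree")
    for w in leaves:
        rest = leaves - {w}
        for i in range(1, len(w) + 1):  # w[i:] runs over the proper suffixes
            if satisfies_tree_property(rest | {w[i:]}):
                return False
    return True


tau = {"1", "00", "010", "110"}          # contexts of Example 2.2
reducible = {"1", "10", "000"}           # illustrative assumption

print(sorted(nodes(tau), key=lambda w: (len(w), w)))
print(is_irreducible(tau), is_irreducible(reducible))
```

In the second tree, replacing 000 by 00 still yields a set with the tree property, which is exactly what the irreducibility check detects.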
Let denote the set of all irreducible trees on , with the following partial order
Given a tree and a constant , we denote by the truncated tree at level , defined by the set of its nodes
Finally, is equipped with the Hamming distance defined by
(2.1) 
where . In the summable case we have that is a bounded metric space.
2.2 Context tree of a stationary ergodic process
Let be a stationary and ergodic process assuming values in the alphabet . We denote by the stationary probability of the string , that is
If is such that we write
with the convention that if then .
A process as above is said to have law, or measure, .
Definition 2.1.
We say that the string is a context for a process with measure if it satisfies

or .

For all and all such that is suffix of
(2.2) 
No proper suffix of satisfies 2.
An infinite context is a left-infinite sequence such that its finite suffixes have positive probability but none of them is a context.
By this definition, the set of contexts of a process with measure is an irreducible tree; it will be denoted by .
Example 2.2.
Consider the stationary Markov chain of order 3 over the alphabet defined by the transition probabilities
0.2  0.8  
0.5  0.5  
0.3  0.7  
0.7  0.3 
where are arbitrary. This is an example of what is called a Variable Length Markov Chain (VLMC). By Definition 2.1, the only contexts of this process are the strings 1, 00, 010 and 110. The context tree is the tree represented in Fig. 1.
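To make the role of the contexts concrete, the following Python sketch looks up the context that a given past selects in the tree of Example 2.2. It is an illustration only: the function name `context_of` is hypothetical, pasts are written as strings with the most recent symbol last, and the function returns `None` when the observed past is too short to determine a context.

```python
def context_of(past, leaves):
    """Return the unique leaf that is a suffix of `past`, or None.

    By the tree property, at most one leaf can be a suffix of a given past.
    """
    for w in leaves:
        if past.endswith(w):
            return w
    return None


tau = {"1", "00", "010", "110"}  # contexts of Example 2.2

for past in ["0110", "1011", "100", "0010", "10"]:
    print(past, "->", context_of(past, tau))
```

For a process with this context tree, the distribution of the next symbol depends on the past only through the value returned by the lookup.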
Example 2.3.
Suppose that the process takes values in , and in order to decide the probability distribution of the next symbol based on the past realization, we only need to know the distance to the last occurrence of a . Then, for any , any and any
According to Definition 2.1, the strings , , as well as the semi-infinite sequence are contexts of this process. Therefore, the context tree is shown in Fig. 1.
2.3 The weak topology in the space of stationary ergodic processes
Let be the $\sigma$-algebra on obtained as the product of the discrete $\sigma$-algebra on . Let denote the set of all stationary ergodic probability measures over .
Define the following distance in
where
is the th order variational distance. This distance is known in the literature as the weak distance, and the topology induced by it is known as the weak topology (Shields, 1996, Section I.9).
We now state a basic lemma about the topological properties of the space with respect to the weak topology.
Lemma 2.4.
The space is a Baire space.
2.4 Consistent estimation and confidence bounds
As mentioned in the Introduction, in this paper we are interested in the estimation of properties of the context tree from samples of size of the corresponding stationary and ergodic process .
Up to now this problem has been reduced to the consistent identification of the set of contexts (in the finite case) or of a truncated version of the context tree (in the infinite case). The latter corresponds to a special case of our distance ; for instance when the interest is in estimating contexts of length at most we can consider for all . In the sequel we define the notion of consistency of a sequence of estimators in a general setting.
Let be a functional with values in some metric space .
Definition 2.5.
We say that is consistently estimable on (in probability) if there exists a sequence of statistics, with , such that for all
In this case we say that is a consistent estimator for on . We say that is strongly consistent on if the convergence takes place almost surely with respect to the probability measure , and in this case we say that is a strongly consistent estimator for on .
The following result establishes a necessary condition for the existence of consistent estimators of a bounded real functional defined on .
Proposition 2.6.
Assume is bounded (that is there exists such that for all ). If is consistently estimable on then must be continuous on a dense subset of .
In this paper we are concerned with the functional that assigns to any measure its associated context tree . The first question we address here is whether it is possible to decide, from a finite sample, if the sum of the function over the nodes of the context tree is finite or not.
Theorem 2.7.
If then the functional
is not consistently estimable on .
This result states, in particular, that the functional that attributes the value if the measure is Markovian, and otherwise, is not consistently estimable when is not summable. This is a known result; see Morvai and Weiss (2005) and references therein. However, our proof is completely different from theirs and it is mainly based on topological properties of .
Our main result about consistent estimation for the context tree on is given in the following theorem.
Theorem 2.8.
is consistently estimable on if and only if is finite.
The “only if” part of this theorem is a direct consequence of Theorem 2.7. The “if” part is proved constructively later: the estimator defined by (2.3) below will be shown to be consistent when is summable.
As mentioned before, the present work is also concerned with the construction of confidence bounds for the context tree of a stationary and ergodic process. We use the following general definition of upper and lower confidence bounds, taken from Donoho (1988). Suppose is equipped with a partial order with supremum and infimum.
Definition 2.9.
Given , a statistic is called a nontrivial upper confidence bound for on with coverage probability at least if
and
Analogously we say that is a nontrivial lower confidence bound for on with coverage probability at least if is a nontrivial upper confidence bound for on with coverage probability at least .
Our first theorem concerning confidence bounds is a negative result stating that the functional does not admit a nontrivial upper confidence bound either on or in the class of stationary ergodic measures with finite context tree.
Theorem 2.10.
If is an upper bound that satisfies then the coverage probability . This is also satisfied even in the smaller class of stationary ergodic measures having finite context tree.
The functional does however admit nontrivial lower confidence bounds on . In what follows, we construct a sequence of statistics which will be proved to be a nontrivial lower confidence bound and a consistent estimator of on , when .
We will first define a discrepancy measure between a sample and a measure . To do so, we need to introduce some more notation and definitions. Given a string , denote by the number of occurrences of in the sample ; that is
If , we define for any the estimated transition probability
Denote also by the set of children of that appear in the sample at least once, that is
and by the set of all such strings ; that is
Finally, for any context tree , let
Now, we can define our discrepancy measure as a function
if . If we define .
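The counting quantities used above can be illustrated with a short Python sketch. It is only a sketch under assumptions that are not taken from the text: the function names are hypothetical, only occurrences of a string that are followed by another symbol inside the sample are counted, and the estimated transition probability is taken to be the usual relative frequency of the next symbol. The discrepancy measure itself is not reproduced here.

```python
from collections import Counter


def transition_counts(sample, w, alphabet):
    """For each symbol a, count the occurrences of w in `sample`
    that are immediately followed by a."""
    k = len(w)
    counts = Counter()
    for i in range(len(sample) - k):  # stop early so a next symbol exists
        if sample[i:i + k] == w:
            counts[sample[i + k]] += 1
    return {a: counts[a] for a in alphabet}


def estimated_transitions(sample, w, alphabet):
    """Relative frequencies of the next symbol after w (assumed convention);
    returns None when w is never followed by a symbol in the sample."""
    counts = transition_counts(sample, w, alphabet)
    total = sum(counts.values())
    if total == 0:
        return None
    return {a: c / total for a, c in counts.items()}


sample = "0100110101"  # toy sample, chosen only for illustration
print(transition_counts(sample, "1", "01"))
print(estimated_transitions(sample, "1", "01"))
```

With this convention the estimated probabilities after a given string always sum to one, which would not be guaranteed if the final occurrence at the end of the sample were counted as well.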
We are now ready to introduce the lower bound for the functional . Given a constant , for any let be defined by
(2.3) 
where the infimum is taken with respect to the order between trees, and the logarithm is taken in base 2. Note that since the tree is the smallest element of with respect to , this infimum always exists. In Section 3 we show how to practically compute .
We now state the main result of this paper.
Theorem 2.11.
Given and , for any satisfying
(2.4) 
we have that the statistic is a nontrivial lower confidence bound for on , with nonparametric coverage probability of at least . Moreover, if , for any the sequence is a consistent estimator of on and if then is strongly consistent.
3 Computation and application of the lower confidence bound
In this section we show how to compute the confidence bound (2.3) and we present a practical application of Theorem 2.11 to linguistic data.
3.1 Tree lower bound algorithm
Let be a given sample and let be a fixed constant. To compute the tree , we will identify its nodes, i.e. the set . By definition, we know that if and only if every process satisfying has context tree with as a node. The following proposition gives a simple criterion to check whether or not we have to include a string in the set . It relies on two quantities, and , which are defined for any and any by
(3.1)  
(3.2) 
Proposition 3.1.
Let be a finite string with . Then there exists a process satisfying and having as a context if and only if the following conditions hold

For any , .

.
We now give a simple algorithm (see Fig. 2) to construct the estimated tree. Let us explain how it works. Since every context tree has the root as a node, we have , and we can initialize the algorithm with . We then proceed iteratively as follows, until we exhaust the set .
Suppose that a string has been included in . If and at least one of the conditions of Proposition 3.1 is not satisfied for , this means that there do not exist processes satisfying and having as a context. In other words, all processes such that have as a proper suffix of one of their contexts. Thus the set must belong to . On the other hand, if both conditions of Proposition 3.1 are satisfied for , then there exists at least one process such that and having as a context. In this case we let be a context of and we stop checking its descendants (strings of the form , with ).
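The iteration just described can be sketched in Python as follows. This is a minimal illustration, not the algorithm of Fig. 2 itself: the function names are hypothetical, the two conditions of Proposition 3.1 are abstracted into a user-supplied predicate `can_be_context` (an assumption, since their formulas are not reproduced here), and the children of a string are taken to be its one-symbol extensions into the past that occur in the sample.

```python
from collections import deque


def tree_lower_bound_nodes(sample, alphabet, can_be_context):
    """Grow the node set starting from the root (the empty string).

    `can_be_context(w)` stands in for the check of Proposition 3.1:
    it should return True when some compatible process may have w
    as a context, in which case the descendants of w are not explored.
    """
    nodes = {""}
    queue = deque([""])
    while queue:
        w = queue.popleft()
        if can_be_context(w):
            continue  # w is kept as a leaf candidate; stop descending
        for a in alphabet:
            child = a + w  # extend the string one step further into the past
            if child in sample and child not in nodes:
                nodes.add(child)
                queue.append(child)
    return nodes


# Toy predicate, purely illustrative: accept "1" and every string of length 2.
toy_predicate = lambda w: w == "1" or len(w) >= 2
result = tree_lower_bound_nodes("0100110101", "01", toy_predicate)
print(sorted(result, key=lambda w: (len(w), w)))
```

Because only children that occur in the sample are added, the loop terminates even when the predicate never accepts a string: no string longer than the sample can occur in it.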
3.2 One-sided test of hypotheses for context trees
In this subsection we present an application of the lower confidence bound introduced in (2.3) to test a hypothesis about the context tree of codified texts written in European Portuguese. This dataset, which is publicly available, was first analyzed in Galves et al. (2012), where a method to estimate a context tree was proposed and then applied to address a linguistic conjecture about the rhythmic distinction between European and Brazilian Portuguese. The written texts were codified into the alphabet taking into account the stressed syllables and the boundaries of words; see Galves et al. (2012) for details. The European Portuguese context tree obtained in the cited work is the one shown in Fig. 3(a). Another analysis of the same dataset with similar results can be found in Belloni and Oliveira (2015).
An interesting difference between the two languages observed in the codified data, with a corresponding linguistic interpretation, was the ramification of the string “2” into the set of contexts “02”, “12”, “32” and “42” that appears in the European Portuguese context tree in Fig. 3(a) (in the Brazilian Portuguese context tree this ramification did not occur and the string “2” was identified as a context). A natural idea is then to test whether there is enough evidence in the data to support that the European Portuguese context tree ramifies from the sequence “2”.
It is well known that tests of hypotheses can be constructed using confidence bounds. Let be a tree and suppose we want to test the hypotheses
Given and , consider the test that rejects if and only if , with . By Theorem 2.11 we have that
Thus, the test defined by the rejection region has significance level for the hypotheses vs. .
In our application the null hypothesis is defined by a tree having the string as a context. Since we impose no further condition, we let be the unique finite context of , that is
This tree is represented in Fig. 3(b). We set the significance level , and as our sample size is we have . The tree estimated with the TLB algorithm of Fig. 2 is given in Fig. 3(c). We see that belongs to the rejection region; therefore we reject the null hypothesis at the significance level , confirming in this way the results of Galves et al. (2012) about the ramification of sequence “2” in the European Portuguese context tree.
The algorithm described in Fig. 2 was coded in the R language and is available upon request.
4 Proofs
Proof of Lemma 2.4.
With respect to the weak topology, the set of all stationary probability measures over is a compact Hausdorff space (Shields, 1996) and the subspace of all stationary and ergodic probability measures over is a $G_\delta$ set (Parthasarathy, 1961, Theorem 2.1). Therefore, is a Baire space with the induced topology. ∎
Proof of Proposition 2.6.
The proof uses the same arguments as Lemma 1.1 in Fraiman and Meloche (1999). The difference is that here we do not have independent random variables and the space is not a complete metric space with respect to . But the same result can be obtained in our setting, as we show in the sequel. Recall that in the conditions of the proposition, there exists such that for all , and assume that is a consistent estimator for . Define
where is the indicator function and is the sign of . It is not hard to show that is also a consistent estimator for , for details see (Fraiman and Meloche, 1999, Lemma 1.1). As for any the function is bounded by we have that the convergence in probability to implies convergence in mean. Therefore we have that
as . Moreover,
Therefore, for each , is uniformly continuous with respect to the weak topology (induced by ) on . Then, by Lemma 2.4 and the Baire Category Theorem, the function must be continuous on a dense subset of . ∎
To continue we need two basic lemmas that constitute the core of all our negative results.
Lemma 4.1.
Any measure can be approximated with respect to by a sequence of measures in each of which has as context tree a given tree , with . In particular, can be infinite.
Proof.
We proceed in two steps: first we define a sequence of Markov measures converging to , and then for any , we construct a sequence of stationary ergodic measures each of which has context tree and that converges to . The conclusion of the proof then follows by a diagonal argument, since convergence in (or in the weak topology) corresponds to convergence of the measure of cylinders (Shields, 1996, Section I.9).
For any , let be the steps canonical Markov approximation of , which is a Markov chain of order with transition probabilities
(4.1) 
An important observation is that , since for any semi-infinite sequence the length of the context of along is at most the length of the context of . Moreover, it is well known that the sequence converges weakly to (see Rudolph and Schwarz (1977) for instance), so the first step is proven.
To continue, let us introduce the continuity rate of a process along a given past , which is the nonincreasing sequence defined as
Observe that means and therefore for all means that the infinite sequence . Let be a measure in satisfying the following three conditions:

.

For any and any , .

.
It should be clear to the reader that such a measure can always be selected. An example is the observable chain of a Hidden Markov Model, which under simple assumptions satisfies conditions (i)-(iii) above; see for instance Collet and Leonardi (2014).
Now consider any context tree such that . For all define the kernel
We have for any . Thus, this kernel satisfies (i) and let us show that it also satisfies property (iii). For any with we have
for all or if , for all and all . Conditions (i) and (iii) ensure that there exists a unique stationary ergodic measure having kernel ; see for instance Bressaud, Fernández and Galves (1999); Fernández and Galves (2002).
By the above observations, the contexts of are exactly the sequences in , since . Now, since converges uniformly to as we also have converging in to as diverges. ∎
Lemma 4.2.
Any measure can be approximated with respect to by a sequence of measures in each of which has a finite context tree.
Proof.
We prove this lemma using a two-step argument similar to the one in the previous lemma. First, we use the same sequence of canonical Markov approximations defined in (4.1) to approximate . Second, as we do not know whether these Markov measures are ergodic or not, we construct, for any , a sequence of ergodic Markov measures converging to when . The conclusion of the proof also follows from a diagonal argument.
The construction of is also carried out as in the previous lemma, by specifying the kernel of transition probabilities
It is easy to see that this definition leads to a Markovian (i.e. with finite context tree) ergodic measure, and that the sequence converges to when . As before we have that converges to when and this concludes the proof. ∎
We are now ready to prove Theorem 2.7.
Proof of Theorem 2.7.
Assume that . Then Lemmas 4.1 and 4.2 imply that any having (respectively ) is the limit in of a sequence of measures in satisfying (respectively ) for all . In other words, the functional is discontinuous (with respect to the distance) at any point of . Together with Proposition 2.6, this proves that is not consistently estimable on when . ∎
Proof of Theorem 2.8.
As we already mentioned, the “if” part of the theorem follows from Theorem 2.11, which states that is actually a consistent estimator of . It remains to prove the “only if” part. Assume and suppose there exists , a consistent estimator of on . Define by . We will prove that is a consistent estimator of , which contradicts Theorem 2.7 and concludes the proof of the theorem.
Fix . As is consistent we have that for any
We will prove that for any the ball of center and radius contains only trees where is constant and equal to . By the definition of , see (2.1), for ,
Then if we have if and only if . Therefore
which proves that is consistently estimable on . But by Theorem 2.7, is not consistently estimable on , which is a contradiction. ∎
Proof of Theorem 2.10.
Suppose satisfies
Given choose a measure such that
(4.2) 
This can always be done because the set is dense in (see Lemma 4.2). Let be an increasing sequence of finite trees and denote by the event
We have for all and . Therefore
Now let be such that
and denote by the finite tree given by . We have
By Lemma 4.1, there exists a measure with such that
Moreover we have
Therefore
As is arbitrary we have just proved that
In order to prove Theorem 2.11 we will need the following lemma.
Lemma 4.3.
Given , let be a sample of size with law . Then for any constant we have