# Minimax Learning of Ergodic Markov Chains

## Abstract

We compute the finite-sample minimax (modulo logarithmic factors) sample complexity of learning the parameters of a finite Markov chain from a single long sequence of states. Our error metric is a natural variant of total variation. The sample complexity necessarily depends on the spectral gap and minimal stationary probability of the unknown chain; for these quantities, at least in the reversible case, there are known finite-sample estimators with fully empirical confidence intervals. To our knowledge, this is the first PAC-type result with nearly matching (up to logs) upper and lower bounds for learning, in any metric, in the context of Markov chains.

Geoffrey Wolfer and Aryeh Kontorovich

Keywords: Markov chain, learning, minimax

## 1 Introduction

Approximately recovering the parameters of a discrete distribution is a classical problem in computer science and statistics (see, e.g., Han et al. (2015); Kamath et al. (2015); Orlitsky and Suresh (2015) and the references therein). Total variation (TV) is a natural and well-motivated choice of approximation metric (Devroye and Lugosi, 2001), and the metric we use throughout the paper will be derived from TV (but see Waggoner (2015) for results on other norms). The minimax sample complexity for obtaining an $\varepsilon$-approximation to the unknown distribution in TV is well known to be $\Theta(d/\varepsilon^2)$, where $d$ is the support size (see, e.g., Anthony and Bartlett (1999)).

This paper deals with learning the transition probability parameters of a finite Markov chain in the minimax setting. The Markov case is much less well understood than the iid one. The main additional complexity introduced by the Markov case is that not only the state space size $d$ and the precision parameter $\varepsilon$, but also the chain’s mixing properties must be taken into account.

#### Our contribution.

Up to logarithmic factors, we compute (apparently the first, in any metric) finite-sample PAC-type minimax sample complexity for the learning problem in the Markovian setting. The problem is to recover, from a single long run of an unknown Markov chain, the values of its transition matrix up to a tolerance of $\varepsilon$ in a certain natural TV-based metric we define below. We obtain upper and lower bounds on the sample complexity (sequence length) in terms of $\varepsilon$, the number of states, the stationary distribution, and the spectral gap of the Markov chain.

## 2 Main results

Our definitions and notation are mostly standard, and are given in Section 3. Since the focus of this paper is on statistical rather than computational complexity, we defer the (straightforward) analysis of the runtime of our learner to the Appendix, Section A.

### 2.1 Minimax learning results

Theorem (Learning sample complexity upper bound). There exists an $(\varepsilon,\delta)$-learner $\mathcal{L}$ (provided in Algorithm 1), which, for all $0<\varepsilon<2$, $0<\delta<1$, satisfies the following. If $\mathcal{L}$ receives as input a sequence $X=(X_1,\dots,X_m)$ of length at least $m_{\mathrm{UB}}$ drawn according to an unknown $d$-state Markov chain $(M,\mu)$, then it outputs $\widehat{M}$ such that

$$|||M-\widehat{M}|||<\varepsilon$$

holds with probability at least $1-\delta$. The sample complexity is upper-bounded by

where $c$ is a universal constant, $\gamma_{\mathrm{ps}}$ is the pseudo-spectral gap (3.4) of $M$, $\pi_*$ the minimum stationary probability (3.1) of $M$, and $\Pi_\mu$ is defined in (3.2).

The proof shows that for reversible $M$, the bound also holds with the spectral gap $\gamma$ (3.3) in place of the pseudo-spectral gap.

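Theorem (Learning sample complexity lower bound). For every , , and , , there exists a $d$-state Markov chain $M$ with pseudo-spectral gap $\gamma_{\mathrm{ps}}$ and stationary distribution $\pi$ such that every -learner must require in the worst case a sequence $X=(X_1,\dots,X_m)$ drawn from the unknown $M$ of length at least

$$m_{\mathrm{LB}}:=\Omega\left(\max\left\{\frac{d}{\varepsilon^2\pi_*},\ \frac{\log d}{\gamma_{\mathrm{ps}}\pi_*}\right\}\right),$$

where $\gamma_{\mathrm{ps}}$ and $\pi_*$ are as in the upper bound theorem above.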
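To get a feel for which of the two components of the lower bound dominates in a given regime, the following minimal sketch evaluates both terms with the constant hidden in $\Omega(\cdot)$ dropped. The function names `lower_bound_terms` and `dominant_regime` are hypothetical and introduced only for illustration.

```python
import math


def lower_bound_terms(d, eps, pi_star, gamma_ps):
    """Return the two components of the lower bound, ignoring the Omega constant.

    accuracy term: d / (eps^2 * pi_star)
    mixing term:   log(d) / (gamma_ps * pi_star)
    """
    accuracy_term = d / (eps ** 2 * pi_star)
    mixing_term = math.log(d) / (gamma_ps * pi_star)
    return accuracy_term, mixing_term


def dominant_regime(d, eps, pi_star, gamma_ps):
    """Name the component that attains the max in the lower bound."""
    accuracy_term, mixing_term = lower_bound_terms(d, eps, pi_star, gamma_ps)
    return "accuracy" if accuracy_term >= mixing_term else "mixing"


if __name__ == "__main__":
    # Fast-mixing chain, fine precision: the d / (eps^2 pi_*) term dominates.
    print(dominant_regime(d=10, eps=0.1, pi_star=0.1, gamma_ps=0.5))
    # Slowly mixing chain, coarse precision: the log(d) / (gamma_ps pi_*) term dominates.
    print(dominant_regime(d=10, eps=0.5, pi_star=0.1, gamma_ps=1e-4))
```

The second call illustrates the regime in which mixing, rather than the target precision, drives the required sequence length.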

The proof of the lower bound theorem actually yields a bit more than claimed in the statement. For any , a Markov chain can be constructed that achieves the $d/(\varepsilon^2\pi_*)$ component of the bound. Additionally, the $\log d/(\gamma_{\mathrm{ps}}\pi_*)$ component is achievable by a class of reversible Markov chains with spectral gap , and uniform stationary distribution.

Although the sample complexity depends on the pseudo-spectral gap and minimal stationary probability of the unknown chain, in the reversible case these can be efficiently estimated with finite-sample data-dependent confidence intervals from a single sequence of length $\tilde{O}(1/(\gamma\pi_*))$ (Hsu et al., 2017). The form of the lower bound indicates that in some regimes, estimating the spectral gap is as difficult as learning the entire transition matrix (for our choice of metric $|||\cdot|||$). We stress that our learner only requires ergodicity (and not, say, reversibility) to work.

Our results also indicate that the transition matrix may be estimated to precision of order $\sqrt{d\gamma_{\mathrm{ps}}}$ with sample complexity $\tilde{O}(1/(\gamma_{\mathrm{ps}}\pi_*))$, which is already relevant for slowly mixing Markov chains. For this level of precision in the reversible case, in light of Hsu et al. (2017), one also obtains estimates of $\gamma$ and $\pi_*$ with no increase in sample complexity.

Finally, even though the upper bound formally depends on the unknown (and, in our setting, not learnable) initial distribution $\mu$, we note that (i) this dependence is logarithmic and (ii) an upper bound on $\Pi_\mu$ in terms of the learnable quantity $\pi_*$ is available.

### 2.2 Overview of techniques

The upper bound for learning is achieved by the mildly smoothed maximum-likelihood estimator given in Algorithm 1. If the stationary distribution is bounded away from $0$, the chain will visit each state a constant fraction of the total sequence length. Exponential concentration (controlled by the spectral gap) provides high-probability confidence intervals about the expectations. A technical complication is that the empirical distribution of the transitions out of a state $i$, conditional on the number of visits $N_i$ to that state, is not binomial but actually rather complicated. This is because the sequence length is fixed, and so a large value of $N_i$ “crowds out” other observations. We overcome this via a matrix version of Freedman’s inequality. The factor $\Pi_\mu$ in the bounds quantifies the price one pays for not assuming (as we do not) stationarity of the unknown Markov chain.
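The estimator described above can be sketched in a few lines. Algorithm 1 itself is not reproduced in this section, so the snippet below is only an illustration under stated assumptions: states are encoded as integers `0, ..., d-1`, and the smoothing is an add-`alpha` (Laplace-style) rule chosen here for concreteness, which may differ from the smoothing the paper’s learner uses. The helper `max_row_l1_distance` computes the largest row-wise $\ell_1$ distance between two matrices, which is how the TV-based metric is used in the proofs.

```python
def estimate_transition_matrix(xs, d, alpha=1.0):
    """Smoothed empirical transition matrix from one trajectory xs.

    Assumption (not from the paper): add-alpha smoothing with alpha > 0,
    so that rows of never-visited states fall back to the uniform row.
    """
    counts = [[0] * d for _ in range(d)]
    for a, b in zip(xs, xs[1:]):          # consecutive pairs (X_t, X_{t+1})
        counts[a][b] += 1
    m_hat = []
    for i in range(d):
        n_i = sum(counts[i])              # number of observed transitions out of i
        m_hat.append([(counts[i][j] + alpha) / (n_i + d * alpha) for j in range(d)])
    return m_hat


def max_row_l1_distance(a, b):
    """Largest l1 distance between corresponding rows of a and b."""
    return max(sum(abs(x - y) for x, y in zip(ra, rb)) for ra, rb in zip(a, b))


if __name__ == "__main__":
    path = [0, 1, 0, 1, 0]
    m_hat = estimate_transition_matrix(path, d=2)
    print(m_hat)
    print(max_row_l1_distance(m_hat, [[0.0, 1.0], [1.0, 0.0]]))
```

With `alpha = 0` the estimator reduces to the plain maximum-likelihood ratio of counts, which is undefined for states that are never visited; this is one reason a small amount of smoothing is convenient.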

Our chief technical contribution is in establishing the sample complexity lower bounds for the Markov chain learning problem. We do this by constructing two different independent lower bounds.

The lower bound in $\log d/(\gamma_{\mathrm{ps}}\pi_*)$ is derived successively by a covering argument and a classical reduction scheme to a collection of testing problems, using a class of Markov chains we construct with a carefully controlled spectral gap. The latter can be estimated via Cheeger’s inequality, which gives sharp upper bounds but suboptimal lower bounds (Lemma D). To get the correct order of magnitude, we use a contraction-based argument. The Dobrushin contraction coefficient $\kappa$, defined in (3.6), is in general a much cruder indicator of the mixing rate than the spectral gap $\gamma$, defined in (3.3). Indeed, $\gamma\ge 1-\kappa$ holds for all reversible $M$ (Brémaud, 1999, pp. 237-238), and for some ergodic $M$, we have $\kappa=1$ (in which case it yields no information, since the latter holds for non-ergodic $M$ as well). This is in fact the case for the families of Markov chains we construct in the course of proving the lower bound theorem. Fortunately, in both cases, even though $\kappa(M)=1$, it turns out that , and coupled with the contraction property (3.7), our bound on actually yields an optimal estimate of . Although the calculation of in Lemma B is computationally intensive, the contraction coefficient is, in general, more amenable to analysis than the eigenvalues directly, and hence this technique may be of independent interest.

The lower bound in $d/(\varepsilon^2\pi_*)$ arises from the idea that learning the whole transition matrix is at least as hard as learning the conditional distribution of one of its states. From here, we design a class of matrices where one state is both hard to reach and difficult to learn, by constructing a mixture of indistinguishable distributions for that particular state, indexed by a large subset of the binary hypercube. We express the statistical distance between words of length $m$ distributed according to different matrices of this class in terms of $\pi_*$ and the KL divergence between the conditional distributions of the hard-to-reach state, by taking advantage of the structure of the class, and invoke an argument from Tsybakov to conclude.

### 2.3 Related work

Our Markov chain learning setup is a natural extension of the PAC distribution learning model of Kearns et al. (1994). Despite the plethora of literature on estimating Markov transition matrices (see, e.g., Billingsley (1961); Craig and Sendi (2002); Welton and Ades (2005)) we were not able to locate any rigorous finite-sample PAC-type results.

The minimax problem has recently received some attention, and Hao et al. (2018) have, in parallel to us, shown the first minimax learning bounds, in expectation, for the problem of learning the transition matrix of a Markov chain under a certain class of divergences. The authors consider the case where , essentially showing that for some family of smooth $f$-divergences, the expected risk is . The metric used in this paper is based on TV, which corresponds to the $f$-divergence induced by $f(t)=\frac{1}{2}|t-1|$, which is not differentiable at $t=1$. The results of Hao et al. and the present paper are complementary and not directly comparable. We do note that (i) their guarantees are in expectation rather than with high confidence, (ii) our TV-based metric is not covered by their smooth $f$-divergence family, and most importantly (iii) their notion of mixing is related to contraction as opposed to the spectral gap. In particular the -minorization assumption implies (but is not implied by) a bound of on the Dobrushin contraction coefficient $\kappa$ (defined in (3.6); see Kontorovich (2007, Lemma 2.2.2) for the latter claim). Thus, the family of -minorized Markov chains is strictly contained in the family of contracting chains, which in turn is a strict subset of the ergodic chains we consider.

## 3 Definitions and notation

We define $[d]:=\{1,\dots,d\}$ and use $m$ to denote the size of the sample received by the Markov learner. The simplex of all distributions over $[d]$ will be denoted by $\Delta_d$, and the collection of all $d\times d$ row-stochastic matrices by $\mathcal{M}_d$. For $\mu\in\Delta_d$, we will write either $\mu(i)$ or $\mu_i$, as dictated by convenience. All vectors are rows unless indicated otherwise. We assume familiarity with basic Markov chain concepts (see, e.g., Kemeny and Snell (1976); Levin et al. (2009)). A Markov chain on $d$ states is specified by an initial distribution $\mu\in\Delta_d$ and a transition matrix $M\in\mathcal{M}_d$ in the usual way. Namely, by $(X_1,\dots,X_m)\sim(M,\mu)$, we mean that

$$\mathbb{P}\big((X_1,\dots,X_m)=(x_1,\dots,x_m)\big)=\mu(x_1)\prod_{t=1}^{m-1}M(x_t,x_{t+1}).$$

We write $\mathbb{P}_{M,\mu}(\cdot)$ to denote probabilities over sequences induced by the Markov chain $(M,\mu)$, and omit the subscript when it is clear from context.
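As a concrete reading of the product formula, the sketch below evaluates the probability of a given sequence and draws a trajectory from a chain specified by an initial distribution and a transition matrix. States are encoded as `0, ..., d-1` rather than `1, ..., d`, and the function names are hypothetical, chosen only for this illustration.

```python
import random


def sequence_probability(mu, M, xs):
    """Probability of observing the state sequence xs: mu(x_1) * prod_t M(x_t, x_{t+1})."""
    p = mu[xs[0]]
    for a, b in zip(xs, xs[1:]):
        p *= M[a][b]
    return p


def sample_path(mu, M, m, rng):
    """Draw a trajectory (X_1, ..., X_m) with X_1 ~ mu and X_{t+1} ~ M(X_t, .)."""
    states = range(len(mu))
    x = rng.choices(states, weights=mu)[0]
    path = [x]
    for _ in range(m - 1):
        x = rng.choices(states, weights=M[x])[0]
        path.append(x)
    return path


if __name__ == "__main__":
    mu = [0.5, 0.5]
    M = [[0.9, 0.1], [0.2, 0.8]]
    print(sequence_probability(mu, M, [0, 0, 1]))   # 0.5 * 0.9 * 0.1
    print(sample_path(mu, M, 10, random.Random(0)))
```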

The Markov chain $(M,\mu)$ is stationary if $\mu=\pi$ for $\pi=\pi M$, and ergodic if $M^k>0$ entrywise for some $k\ge 1$. If $M$ is ergodic, it has a unique stationary distribution $\pi$ and moreover $\pi_*>0$, where

$$\pi_*=\min_{i\in[d]}\pi(i). \tag{3.1}$$

Unless noted otherwise, $\pi$ is assumed to be the stationary distribution of the Markov chain in context. To any Markov chain $(M,\mu)$, we associate

$$\Pi_\mu:=\sum_{i\in[d]}\mu(i)^2/\pi(i), \tag{3.2}$$

which is finite iff , and always .
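The quantities in (3.1) and (3.2) are easy to compute numerically for a small chain. The following is a minimal sketch, assuming an ergodic transition matrix so that plain power iteration converges; the function names are hypothetical, and power iteration is used here only for self-containedness (a linear solve would serve equally well).

```python
def stationary_distribution(M, iterations=10_000, tol=1e-13):
    """Approximate pi = pi M by power iteration (assumes M is ergodic)."""
    d = len(M)
    pi = [1.0 / d] * d
    for _ in range(iterations):
        nxt = [sum(pi[i] * M[i][j] for i in range(d)) for j in range(d)]
        if sum(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


def min_stationary_probability(pi):
    """pi_* = min_i pi(i), as in (3.1)."""
    return min(pi)


def initial_mismatch(mu, pi):
    """Pi_mu = sum_i mu(i)^2 / pi(i), as in (3.2)."""
    return sum(m * m / p for m, p in zip(mu, pi))


if __name__ == "__main__":
    M = [[0.9, 0.1], [0.2, 0.8]]
    pi = stationary_distribution(M)
    print(pi, min_stationary_probability(pi))
    print(initial_mismatch(pi, pi))           # stationary start
    print(initial_mismatch([1.0, 0.0], pi))   # start concentrated on state 0
```

Starting from the stationary distribution gives $\Pi_\mu=1$, and for any $\mu$ the elementary inequality $\Pi_\mu\le 1/\pi_*$ holds, since $\sum_i\mu(i)^2\le 1$.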

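A reversible $M\in\mathcal{M}_d$ satisfies detailed balance for some distribution $\pi$: for all $i,j\in[d]$, $\pi(i)M(i,j)=\pi(j)M(j,i)$, in which case $\pi$ is necessarily the unique stationary distribution. The eigenvalues of a reversible $M$ lie in $[-1,1]$, and these may be ordered (counting multiplicities): $1=\lambda_1\ge\lambda_2\ge\dots\ge\lambda_d$. The spectral gap is

$$\gamma=\gamma(M)=1-\lambda_2(M). \tag{3.3}$$

Paulin (2015) defines the pseudo-spectral gap by

$$\gamma_{\mathrm{ps}}:=\max_{k\ge 1}\left\{\frac{\gamma\big((M^*)^kM^k\big)}{k}\right\}, \tag{3.4}$$

where $M^*$ is the time reversal of $M$, given by $M^*(i,j):=\pi(j)M(j,i)/\pi(i)$; the expression $M^*M$ is called the multiplicative reversiblization of $M$.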
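The reason the pseudo-spectral gap is defined through the multiplicative reversiblization is that this product is reversible even when the chain itself is not, so its spectral gap (3.3) is well defined. The sketch below checks this on a small non-reversible example. It assumes the stationary distribution is supplied by the caller, uses 0-based state indices, and all function names are hypothetical.

```python
def time_reversal(M, pi):
    """Time reversal M*(i, j) = pi(j) M(j, i) / pi(i)."""
    d = len(M)
    return [[pi[j] * M[j][i] / pi[i] for j in range(d)] for i in range(d)]


def matmul(A, B):
    """Plain matrix product of two square matrices given as lists of rows."""
    d = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(d)) for j in range(d)] for i in range(d)]


def multiplicative_reversiblization(M, pi):
    """The product M* M appearing (for k = 1) in the pseudo-spectral gap."""
    return matmul(time_reversal(M, pi), M)


def satisfies_detailed_balance(P, pi, tol=1e-12):
    """Check pi(i) P(i, j) == pi(j) P(j, i) for all i, j, up to tol."""
    d = len(P)
    return all(abs(pi[i] * P[i][j] - pi[j] * P[j][i]) <= tol
               for i in range(d) for j in range(d))


if __name__ == "__main__":
    # A doubly stochastic, non-reversible chain; its stationary distribution is uniform.
    M = [[0.1, 0.6, 0.3], [0.3, 0.1, 0.6], [0.6, 0.3, 0.1]]
    pi = [1.0 / 3.0] * 3
    print(satisfies_detailed_balance(M, pi))
    print(satisfies_detailed_balance(multiplicative_reversiblization(M, pi), pi))
```

Computing $\gamma_{\mathrm{ps}}$ itself would additionally require the second eigenvalue of $(M^*)^kM^k$ for each $k$, which is omitted here.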

We use the standard $\ell_1$ norm $\|z\|_1=\sum_{i\in[d]}|z_i|$, which, in the context of distributions (and up to a convention-dependent factor of $2$) corresponds to the total variation norm. For $A\in\mathbb{R}^{d\times d}$, define

$$|||A|||:=\max_{i\in[d]}\|A(i,\cdot)\|_1 \tag{3.5}$$

(we note, but do not further exploit, that $|||\cdot|||$ corresponds to the $\ell_\infty\to\ell_\infty$ operator norm (Horn and Johnson, 1985)). For any $M\in\mathcal{M}_d$, define its Dobrushin contraction coefficient

$$\kappa(M)=\frac{1}{2}\max_{i,j\in[d]}\|M(i,\cdot)-M(j,\cdot)\|_1; \tag{3.6}$$

this quantity is also associated with Döblin’s name. The term “contraction” refers to the property

$$\|(\mu-\mu')M\|_1\le\kappa(M)\|\mu-\mu'\|_1,\qquad\mu,\mu'\in\Delta_d, \tag{3.7}$$

which was observed by Markov (1906, §5).
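The contraction coefficient (3.6) and the contraction property (3.7) can be checked numerically on a small example. This is a minimal sketch with hypothetical function names and 0-based state indices; the comparison in (3.7) is done with a small floating-point tolerance.

```python
def l1_norm(v):
    """l1 norm of a vector."""
    return sum(abs(x) for x in v)


def dobrushin_coefficient(M):
    """kappa(M) = (1/2) max_{i,j} || M(i, .) - M(j, .) ||_1, as in (3.6)."""
    d = len(M)
    return 0.5 * max(
        sum(abs(M[i][k] - M[j][k]) for k in range(d))
        for i in range(d) for j in range(d)
    )


def row_times_matrix(v, M):
    """Row vector v multiplied on the right by M."""
    d = len(M)
    return [sum(v[i] * M[i][j] for i in range(d)) for j in range(d)]


def contraction_holds(mu, nu, M, tol=1e-12):
    """Check || (mu - nu) M ||_1 <= kappa(M) || mu - nu ||_1, as in (3.7)."""
    diff = [a - b for a, b in zip(mu, nu)]
    lhs = l1_norm(row_times_matrix(diff, M))
    rhs = dobrushin_coefficient(M) * l1_norm(diff)
    return lhs <= rhs + tol


if __name__ == "__main__":
    M = [[0.9, 0.1], [0.2, 0.8]]
    print(dobrushin_coefficient(M))
    print(contraction_holds([0.3, 0.7], [0.6, 0.4], M))
```

A permutation matrix such as `[[0, 1], [1, 0]]` has coefficient $1$, illustrating that $\kappa$ alone can be uninformative about mixing.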

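Finally, we use standard $O(\cdot)$, $\Omega(\cdot)$ and $\Theta(\cdot)$ order-of-magnitude notation, as well as their tilde variants $\tilde{O}(\cdot)$, $\tilde{\Omega}(\cdot)$, $\tilde{\Theta}(\cdot)$, where lower-order log factors are suppressed.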

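Definition. An $(\varepsilon,\delta)$-learner $\mathcal{L}$ for Markov chains with sample complexity function $m_0(\cdot)$ is an algorithm that takes as input $X=(X_1,\dots,X_m)$ drawn from some unknown Markov chain $(M,\mu)$, and outputs $\widehat{M}$ such that $m\ge m_0(\varepsilon,\delta,M,\mu)\implies|||M-\widehat{M}|||<\varepsilon$ holds with probability at least $1-\delta$.

The probability is over the draw of $X$ and any internal randomness of the learner. Note that by the lower bound theorem of Section 2.1, the learner’s sample complexity must necessarily depend on the properties of the unknown Markov chain.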

## 4 Proofs

Proof of the upper bound theorem (Section 2.1). We proceed to analyze Algorithm 1, and in particular, the random variable $\widehat{M}$ it constructs, where

$$N_i=|\{t\in[m-1]:X_t=i\}|,\qquad N_{ij}=|\{t\in[m-1]:X_t=i,\ X_{t+1}=j\}|.$$

To do so, we make use of an adaptation of Freedman’s inequality (Freedman, 1975) to random matrices (Tropp et al., 2011), which is reproduced for convenience in Appendix E. Define the row vector sequence $(Y_t)$ for a fixed $i\in[d]$ by

$$Y_0=0,\qquad Y_t=\frac{1}{\sqrt{2}}\Big(\mathbf{1}[X_{t-1}=i]\big(\mathbf{1}[X_t=j]-M(i,j)\big)\Big)_{j\in[d]},$$

and notice that . We also have from the Markov property that , so that defines a valued martingale difference, and immediately,

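$$\begin{aligned}
Y_tY_t^\top=\|Y_t\|_2^2&=\sum_{j=1}^d\Big(\frac{1}{\sqrt{2}}\mathbf{1}[X_{t-1}=i]\big(\mathbf{1}[X_t=j]-M(i,j)\big)\Big)^2\\
&=\frac{1}{2}\mathbf{1}[X_{t-1}=i]\sum_{j=1}^d\Big(\mathbf{1}[X_t=j]+M(i,j)^2-2\cdot\mathbf{1}[X_t=j]M(i,j)\Big)\\
&=\frac{1}{2}\mathbf{1}[X_{t-1}=i]\Big(1+\|M(i,\cdot)\|_2^2-2M(i,X_t)\Big)\le\mathbf{1}[X_{t-1}=i],
\end{aligned}$$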

so that , and as is a real valued random variable. Construct now the matrix ,

with . Computing the row sums and column sums of this matrix in absolute value,

$$\sum_{k=1}^d|Z_{t,i,j,k}|=|\mathbf{1}[X_t=j]-M(i,j)|\sum_{k=1}^d|\mathbf{1}[X_t=k]-M(i,k)|\le\sum_{k=1}^d\mathbf{1}[X_t=k]+\sum_{k=1}^dM(i,k)=2$$

and similarly, . From Hölder’s inequality, , and from the sub-additivity of the norm and Jensen’s inequality , it follows that

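Now decomposing the error probability of the learner, while choosing an arbitrary value $n_i$ for the desired number of visits to each state,

$$\mathbb{P}_{M,\mu}\big(|||M-\widehat{M}|||>\varepsilon\big)\le\sum_{i=1}^d\mathbb{P}_{M,\mu}\Big(\|\widehat{M}(i,\cdot)-M(i,\cdot)\|_1>\varepsilon\text{ and }N_i\in[n_i,3n_i]\Big)+\mathbb{P}_{M,\mu}\big(\{\exists i\in[d]:N_i\notin[n_i,3n_i]\}\big). \tag{4.1}$$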

Since , we have

 (4.2)

and setting , it follows that for all , , and finally, from Lemma B (stated and proven in the appendix) it is possible to control the number of visits to states, such that for larger than , and we have that , and the upper bound is proven.

{remark}

Note that one can derive an upper bound for the problem with respect to the max norm , by studying the entry-wise martingales and invoking the scalar version of Freedman’s inequality (Freedman, 1975). Similarly, since for , it is the case that , we can derive the more general upper bound for the problem with respect to the norm .

Proof of the lower bound theorem (part 1): learning lower bound $\Omega(d/(\varepsilon^2\pi_*))$.

Suppose for simplicity of the analysis that we consider Markov chains with $d+1$ states instead of $d$, and that $d$ is even. We define the following class of Markov chains parametrized by a given distribution $\pi$, where the conditional distribution defined at each state of the chain is always $\pi$ except for state $d+1$, where it is only required that it has a loop of probability $\pi_*$ to itself, the remaining weights corresponding to degrees of freedom. For , and such that and .

$$\mathcal{G}_\pi=\left\{M_\eta=\begin{pmatrix}\pi_1&\dots&\pi_d&\pi_*\\\vdots&&\vdots&\vdots\\\pi_1&\dots&\pi_d&\pi_*\\\eta_1&\dots&\eta_d&\pi_*\end{pmatrix}:\eta=(\eta_1,\dots,\eta_d,\pi_*)\in\Delta_{d+1}\right\}. \tag{4.3}$$

For , and , set the number of visits to the th state. Focusing on the th state, since , it is immediate that . Introduce the subset of Markov chains in such that

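$$\eta(\sigma)=\left(\frac{1-\pi_*+16\sigma_1\varepsilon}{d},\frac{1-\pi_*-16\sigma_1\varepsilon}{d},\dots,\frac{1-\pi_*+16\sigma_{d/2}\varepsilon}{d},\frac{1-\pi_*-16\sigma_{d/2}\varepsilon}{d},\pi_*\right),$$

where $\sigma\in\{0,1\}^{d/2}$. Also define $M_\sigma:=M_{\eta(\sigma)}$, with $M_0$ the element corresponding to $\sigma=0$. A direct computation yields that for $\sigma,\sigma'\in\{0,1\}^{d/2}$, $|||M_\sigma-M_{\sigma'}|||=\frac{32\varepsilon}{d}d_H(\sigma,\sigma')$, where $d_H$ is the Hamming distance. From the Varshamov-Gilbert lemma, we know that $\exists\Sigma\subset\{0,1\}^{d/2}$, $|\Sigma|\ge 2^{d/16}$, such that $\forall\sigma,\sigma'\in\Sigma$ with $\sigma\ne\sigma'$, $d_H(\sigma,\sigma')\ge d/16$. Restricting our problem to this set $\Sigma$, and finally noticing that , from Tsybakov’s method (Tsybakov, 2009) applied to our problem,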

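$$\mathcal{R}_m\ge\frac{1}{2}\left(1-\frac{\frac{4}{2^{d/16}}\sum_{\sigma\in\Sigma}D_{\mathrm{KL}}\big(M_\sigma^m\,\|\,M_0^m\big)}{\log 2^{d/16}}\right), \tag{4.4}$$

where we wrote $D_{\mathrm{KL}}(M_\sigma^m\,\|\,M_0^m)$ to be the KL divergence between the two distributions of words of length $m$ from each of the Markov chains, and is the quantity we now compute. Recall that from the tensorization property of the KL divergence,

$$D_{\mathrm{KL}}\big(M_\sigma^m\,\|\,M_0^m\big)=\sum_{t=1}^m\mathbb{E}_{X_1,\dots,X_{t-1}}\Big[D_{\mathrm{KL}}\big(X_t\sim M_\sigma|X_1,\dots,X_{t-1}\,\|\,X_t\sim M_0|X_1,\dots,X_{t-1}\big)\Big], \tag{4.5}$$

so that successively,

$$\begin{aligned}
D_{\mathrm{KL}}\big(M_\sigma^m\,\|\,M_0^m\big)&=\sum_{t=1}^m\mathbb{E}_{X_1,\dots,X_{t-1}}\Big[D_{\mathrm{KL}}\big(X_t\sim M_\sigma|X_{t-1}\,\|\,X_t\sim M_0|X_{t-1}\big)\Big]\\
&=\sum_{t=1}^m\mathbb{E}_{X_1,\dots,X_{t-2}}\Big[\mathbb{E}_{X_{t-1}}\Big[D_{\mathrm{KL}}\big(X_t\sim M_\sigma|X_{t-1}\,\|\,X_t\sim M_0|X_{t-1}\big)\,\Big|\,X_1=x_1,\dots,X_{t-2}=x_{t-2}\Big]\Big]
\end{aligned} \tag{4.6}$$

$$\begin{aligned}
&=\sum_{t=1}^m\mathbb{E}_{X_1,\dots,X_{t-2}}\Bigg[\sum_{x_{t-1}\in[d+1]}D_{\mathrm{KL}}\big(X_t\sim M_\sigma|x_{t-1}\,\|\,X_t\sim M_0|x_{t-1}\big)\,\mathbb{P}_{M_\sigma}\big(X_{t-1}=x_{t-1}|X_{t-2}=x_{t-2}\big)\Bigg]\\
&=\pi_*\sum_{t=1}^m\mathbb{E}_{X_1,\dots,X_{t-2}}\Big[D_{\mathrm{KL}}\big(X_t\sim M_\sigma|X_{t-1}=d+1\,\|\,X_t\sim M_0|X_{t-1}=d+1\big)\Big]\\
&=\pi_*\sum_{t=1}^mD_{\mathrm{KL}}\big(M_\sigma(d+1,\cdot)\,\|\,M_0(d+1,\cdot)\big)\\
&=\pi_*\,m\,D_{\mathrm{KL}}\big(M_\sigma(d+1,\cdot)\,\|\,M_0(d+1,\cdot)\big)
\end{aligned} \tag{4.7}$$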

and, , and since also,

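$$D_{\mathrm{KL}}\big(M_\sigma(d+1,\cdot)\,\|\,M_0(d+1,\cdot)\big)=\frac{d}{2}\left(\frac{1-\pi_*+16\varepsilon}{d}\right)\ln\left(\frac{\frac{1-\pi_*+16\varepsilon}{d}}{\frac{1-\pi_*}{d}}\right)+\frac{d}{2}\left(\frac{1-\pi_*-16\varepsilon}{d}\right)\ln\left(\frac{\frac{1-\pi_*-16\varepsilon}{d}}{\frac{1-\pi_*}{d}}\right)\le 128\varepsilon^2 \tag{4.8}$$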

we finally get , so that for .

Proof of the lower bound theorem (part 2): learning lower bound $\Omega(\log d/(\gamma_{\mathrm{ps}}\pi_*))$.

We treat and , as fixed. For and , define the block matrix

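$$M_{\eta,\tau}=\begin{pmatrix}C_\eta&R_\tau\\R_\tau^\top&L_\tau\end{pmatrix},$$

where $L_\tau$, $C_\eta$, and $R_\tau$ are given by

$$L_\tau=\frac{1}{8}\operatorname{diag}\big(7-4\tau_1\varepsilon,\ 7+4\tau_1\varepsilon,\ \dots,\ 7-4\tau_{d/3}\varepsilon,\ 7+4\tau_{d/3}\varepsilon\big),$$

$$C_\eta=\begin{pmatrix}\frac{3}{4}-\eta&\frac{\eta}{d/3-1}&\dots&\frac{\eta}{d/3-1}\\\frac{\eta}{d/3-1}&\frac{3}{4}-\eta&\ddots&\vdots\\\vdots&\ddots&\ddots&\frac{\eta}{d/3-1}\\\frac{\eta}{d/3-1}&\dots&\frac{\eta}{d/3-1}&\frac{3}{4}-\eta\end{pmatrix},$$

$$R_\tau=\frac{1}{8}\begin{pmatrix}1+4\tau_1\varepsilon&1-4\tau_1\varepsilon&0&\dots&\dots&\dots&0\\0&0&1+4\tau_2\varepsilon&1-4\tau_2\varepsilon&0&\dots&0\\\vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\0&\dots&\dots&\dots&0&1+4\tau_{d/3}\varepsilon&1-4\tau_{d/3}\varepsilon\end{pmatrix}.$$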

Holding $\eta$ fixed, define the collection

$$\mathcal{H}_\eta=\big\{M_{\eta,\tau}:\tau\in\{0,1\}^{d/3}\big\} \tag{4.9}$$

of Markov matrices. Denote by $M_{\eta,0}$ the element corresponding to $\tau=0$. Note that every $M\in\mathcal{H}_\eta$ is ergodic and reversible, and its unique stationary distribution is uniform.
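The block structure of this construction can be checked mechanically. The sketch below assembles the matrix from the blocks $C_\eta$, $R_\tau$ and $L_\tau$ and verifies that it is row-stochastic and symmetric; symmetry implies that the uniform distribution satisfies detailed balance, which is the reversibility and uniform-stationarity claim made above. The function name `build_hard_chain` and the 0-based index layout (clique states first, then the rim states in consecutive pairs) are assumptions of this illustration, and the parameter values in the example are arbitrary.

```python
def build_hard_chain(d, eta, eps, tau):
    """Assemble M_{eta,tau} = [[C_eta, R_tau], [R_tau^T, L_tau]] as a list of rows.

    Layout assumption: states 0..d/3-1 form the inner clique and clique state i
    is attached to the two rim states d/3 + 2i and d/3 + 2i + 1.
    """
    if d % 3 != 0 or d // 3 < 2 or len(tau) != d // 3:
        raise ValueError("need d divisible by 3, d/3 >= 2, and len(tau) == d/3")
    k = d // 3
    M = [[0.0] * d for _ in range(d)]
    for i in range(k):
        for j in range(k):                       # block C_eta
            M[i][j] = 0.75 - eta if i == j else eta / (k - 1)
        first, second = k + 2 * i, k + 2 * i + 1
        M[i][first] = M[first][i] = (1 + 4 * tau[i] * eps) / 8     # blocks R_tau, R_tau^T
        M[i][second] = M[second][i] = (1 - 4 * tau[i] * eps) / 8
        M[first][first] = (7 - 4 * tau[i] * eps) / 8               # block L_tau
        M[second][second] = (7 + 4 * tau[i] * eps) / 8
    return M


def is_row_stochastic(M, tol=1e-12):
    return all(min(row) >= 0 and abs(sum(row) - 1.0) <= tol for row in M)


def is_symmetric(M):
    d = len(M)
    return all(M[i][j] == M[j][i] for i in range(d) for j in range(d))


if __name__ == "__main__":
    M = build_hard_chain(d=6, eta=0.1, eps=0.05, tau=[0, 1])
    print(is_row_stochastic(M), is_symmetric(M))
```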

A graphical illustration of this class of Markov chains is provided in Figure 3; in particular, every $M\in\mathcal{H}_\eta$ consists of an “inner clique” (i.e., the states indexed by $\{1,\dots,d/3\}$) and “outer rim” (i.e., the states indexed by $\{d/3+1,\dots,d\}$).

Lemma B in the Appendix establishes a key property of the elements of $\mathcal{H}_\eta$: each $M$ in this class satisfies

$$\gamma_{\mathrm{ps}}(M)=\Theta(\eta). \tag{4.10}$$

Suppose that $X=(X_1,\dots,X_m)\sim(M,\mu)$, where $M\in\mathcal{H}_\eta$ and $\mu$ is uniform. Define the random variable $T_{\mathrm{CLIQ}}$ to be the first time all of the states in the inner clique were visited,

$$T_{\mathrm{CLIQ}}=\inf\big\{t\ge 1:|\{X_1,\dots,X_t\}\cap[d/3]|=d/3\big\}. \tag{4.11}$$

Lemma B in the Appendix gives a lower estimate on this quantity:

$$m\le\frac{d}{20\eta}\ln\left(\frac{d}{3}\right)\implies\mathbb{P}\big(T_{\mathrm{CLIQ}}>m\big)\ge\frac{1}{5}. \tag{4.12}$$
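To make the definition (4.11) and the threshold in (4.12) concrete, the sketch below computes the clique cover time of a given trajectory and evaluates the sequence-length threshold below which the clique is left uncovered with probability at least $1/5$. States are 0-based here, so the inner clique is `{0, ..., clique_size - 1}`; both function names are hypothetical.

```python
import math


def clique_cover_time(path, clique_size):
    """First time t (1-based) by which every clique state has appeared in path.

    Returns None if the trajectory never covers the clique, i.e. T_CLIQ > len(path).
    """
    seen = set()
    for t, x in enumerate(path, start=1):
        if x < clique_size:
            seen.add(x)
            if len(seen) == clique_size:
                return t
    return None


def cover_time_threshold(d, eta):
    """The quantity d / (20 eta) * ln(d / 3) from (4.12)."""
    return d / (20.0 * eta) * math.log(d / 3.0)


if __name__ == "__main__":
    # Six states, clique {0, 1, 2}; the last clique state is first seen at step 6.
    print(clique_cover_time([0, 3, 1, 4, 0, 2], clique_size=3))
    print(clique_cover_time([0, 1, 0], clique_size=3))
    print(cover_time_threshold(d=6, eta=0.1))
```

The threshold grows like $d\log d/\eta$, which is how the cover-time argument feeds the mixing component of the lower bound.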

Let be the collection of all -state Markov chains whose stationary distribution is minorized by and whose pseudo-spectral gap is at least . Writing , recall that the quantity we wish to lower bound is the minimax risk for the learning problem (it will be convenient to write instead of , which only affects the constants):

 (4.13)

where the is taken over all learners and the over . We employ the general reduction scheme of Tsybakov (2009, Chapter 2.2). The first step is to restrict the to the finite subset .

 (4.14)

Define as in (4.11). Then

 (4.15)

and Lemma B implies that for ,

 (4.16)

Observe that all verify . For any estimate , define

Then for , we have

 (4.17)

whence and

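$$\mathcal{R}_m\ge\frac{1}{5}\inf_{\widehat{M}}\sup_\tau\mathbb{P}_{M_{\eta,\tau}}\big(\tau^*\ne\tau\,\big|\,T_{\mathrm{CLIQ}}>m\big)=\frac{1}{5}\inf_{\hat\tau:X\mapsto\{0,1\}^{d/3}}\sup_\tau\mathbb{P}_{M_{\eta,\tau}}\big(\hat\tau\ne\tau\,\big|\,T_{\mathrm{CLIQ}}>m\big). \tag{4.18}$$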

Since implies that for some ,

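$$\mathcal{R}_m\ge\frac{1}{5}\inf_{\hat\tau}\sup_\tau\mathbb{P}_{M_{\eta,\tau}}\big(\hat\tau_{i^*}\ne\tau_{i^*}\,\big|\,V_{i^*}=0\big). \tag{4.19}$$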

There are as many with as those with , so if is drawn uniformly at random and state has not been visited, the learner can do no better than to make a random choice of (where determines ). More formally, writing , the vector without its th coordinate, we can employ an Assouad-type of decomposition (Assouad, 1983; Yu, 1997):

 Rm≥15inf^τ21−d/3∑τ(i)∈{0,1}d/3−1[12Pτi=0(^τi≠τi|Ni=0)+12Pτi=1(^τi≠τi|Ni=0)]=21−d/310∑τ(i)∈{0,1}d/3−1inf^τ[Pτi=0(^τi=1|Ni=0)+Pτi=1(^τi=0|Ni=0)]=21