Minimax Testing of Identity
to a Reference Ergodic Markov Chain
We exhibit an efficient procedure for testing, based on a single long state sequence, whether an unknown Markov chain is identical to or $\varepsilon$-far from a given reference chain. We obtain nearly matching (up to logarithmic factors) upper and lower sample complexity bounds for our notion of distance, which is based on total variation. Perhaps surprisingly, we discover that the sample complexity depends solely on the properties of the known reference chain and does not involve the unknown chain at all, which is not even assumed to be ergodic.
Distinguishing whether an unknown distribution is identical to a reference one or is $\varepsilon$-far from it in total variation (TV) is a special case of statistical property testing. For the iid case, it is known that a sample of size $\Theta(\sqrt{d}/\varepsilon^2)$, where $d$ is the support size, is both sufficient and necessary (Batu et al., 2001; Valiant and Valiant, 2017). This is in contradistinction to the corresponding learning problem, with the considerably higher sample complexity of $\Theta(d/\varepsilon^2)$ (see, e.g., Anthony and Bartlett (1999); Waggoner (2015); Kontorovich and Pinelis (2019)). The Markovian setting has so far received little attention in the property testing framework, the notable exceptions being the recent works of Daskalakis et al. (2018) and Cherapanamjeri and Bartlett (2019), which we discuss in greater detail in Section 2. Daskalakis et al. “initiate[d] the study of Markov chain testing”, but imposed the stringent constraint of symmetry on both the reference and unknown chains. In this paper, we only require ergodicity of the reference chain, and make no assumptions on the unknown one (from which the tester receives a single long trajectory of observations) other than it having $d$ states.
We prove nearly matching upper and lower bounds on the sample complexity of the testing problem in terms of the accuracy $\varepsilon$ and the number of states $d$, as well as parameters derived from the stationary distribution and mixing. We discover that for testing, only the reference chain affects the sample complexity, and no assumptions (including ergodicity) need be made on the unknown one. In particular, we exhibit an efficient testing procedure, which, given a Markovian sequence of length

$$ m = \tilde{O}\left( \frac{1}{\pi_\star} \max\left\{ \frac{\sqrt{d}}{\varepsilon^2},\; t_{\mathrm{mix}} \right\} \right), \tag{1.1} $$

correctly identifies the unknown chain with high probability, where $\pi_\star$ and $t_{\mathrm{mix}}$ are, respectively, the minimum stationary probability and mixing time of the known reference chain. We also derive an instance-specific version of the previous bound: the $\sqrt{d}$ factor in (1.1) can be replaced with the potentially much smaller instance-specific quantity defined in (4.1). Additionally, we construct two separate worst-case lower bounds, of order $\sqrt{d}/(\pi_\star \varepsilon^2)$ and $d\, t_{\mathrm{mix}}$ respectively (up to logarithmic factors), exhibiting a regime for which our testing procedure is unimprovable.
2 Related work
We consider distribution testing in the property testing framework, a research program initiated by Batu et al. (2000), which sits within the more classical statistical hypothesis testing framework (a recent result in this vein is Barsotti et al. (2016); see also references therein).
The special case of iid uniformity testing was addressed (for various metrics) by Goldreich and Ron (2011) and Paninski (2008). Extensions to iid identity testing for arbitrary finite distributions were then obtained (Goldreich, 2016; Diakonikolas et al., 2019+), including the instance-optimal tester of Valiant and Valiant (2017), who showed that $\sqrt{d}$ may be replaced with the $2/3$-pseudo-norm of the reference distribution.
To our knowledge, Daskalakis et al. (2018) were the first to consider the testing problem for Markov chains (see references therein for previous works addressing goodness-of-fit testing under Markov dependence). Their model is based on the pseudo-distance defined by Kazakos (1978) as
where is the term-wise geometric mean of the transition kernels and is the largest eigenvalue in magnitude. This pseudo-distance has the property of vanishing on pairs of chains sharing an identical connected component. Daskalakis et al.’s sample complexity upper bound of required knowledge of the hitting time of the reference chain, while their lower bound involves no quantities related to the mixing rate at all. The authors conjectured that for , the correct sample complexity is — i.e., independent of the mixing properties of the chain. This conjecture was recently partially proven by Cherapanamjeri and Bartlett (2019), who gave an upper bound of , without dependence on the hitting time.
The present paper compares favorably with Daskalakis et al. in that the latter requires both the reference and the unknown chains to be symmetric (and, a fortiori, reversible) as well as ergodic. We only require ergodicity of the reference chain and assume nothing about the unknown one.
Additionally, we obtain nearly sharp sample complexity bounds in terms of the reference chain’s mixing properties. Finally, our metric dominates the pseudo-metric , and hence their identity testing problem is reducible to ours (see Lemma 9.1 in the Appendix), although the reduction does not preserve the convergence rate.
We note that the corresponding PAC-type learning problem for Markov chains was only recently introduced in Hao et al. (2018); Wolfer and Kontorovich (2019). The present paper uses the same notion of distance as the latter work, in which the minimax sample complexity for learning was shown to be of the order of . Our present results confirm the intuition that identity testing is, statistically speaking, a significantly less demanding task than learning: the former exhibits a quadratic reduction in the bound’s dependence on over the latter.
3 Definitions and notation
We define , denote the simplex of all distributions over by , and the collection of all row-stochastic matrices by . For , we will write either or , as dictated by convenience. All vectors are rows unless indicated otherwise. For , and any we also consider its -fold product , i.e. is a shorthand for being all mutually independent, and such that . A Markov chain on states being entirely specified by an initial distribution and a row-stochastic transition matrix , we identify the chain with the pair . Namely, writing for , by , we mean that
We write to denote probabilities over sequences induced by the Markov chain , and omit the subscript when it is clear from context. Taking the null hypothesis to be that (i.e., the chain being tested is identical to the reference one), will denote probability in the completeness case, and in the soundness case. The Markov chain is stationary if for , and ergodic if (entry-wise positive) for some . If is ergodic, it has a unique stationary distribution and moreover the minimum stationary probability , where
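Unless noted otherwise, $\pi$ is assumed to be the stationary distribution of the Markov chain in context. The mixing time of a chain is defined as the number of steps necessary for its state distribution to be sufficiently close to the stationary one (traditionally taken to be within $1/4$):

$$ t_{\mathrm{mix}} := \inf\left\{ t \geq 1 : \sup_{\mu} \left\| \mu M^t - \pi \right\|_{\mathrm{TV}} \leq \frac{1}{4} \right\}, $$

where the supremum is over all initial distributions $\mu$ and $M$ is the transition matrix of the chain.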
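To make the two reference-chain parameters concrete, the following minimal sketch computes the stationary distribution, the minimum stationary probability and the mixing time (with the traditional 1/4 threshold in total variation) of a small chain. The function names, the two-state example matrix and the brute-force matrix-power approach are assumptions made purely for illustration and are not taken from the paper.

```python
def mat_mul(a, b):
    """Plain-Python product of two square matrices given as lists of rows."""
    d = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(d)) for j in range(d)]
            for i in range(d)]


def stationary_distribution(M, iters=10_000, tol=1e-13):
    """Power iteration pi <- pi M; converges for ergodic (hence aperiodic) M."""
    d = len(M)
    pi = [1.0 / d] * d
    for _ in range(iters):
        nxt = [sum(pi[i] * M[i][j] for i in range(d)) for j in range(d)]
        if max(abs(x - y) for x, y in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


def tv_distance(p, q):
    """Total variation distance, i.e. half of the l1 distance."""
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))


def mixing_time(M, threshold=0.25, t_max=10_000):
    """Smallest t >= 1 such that every row of M^t is within `threshold` of pi in TV.

    The worst case over initial distributions is attained at a point mass,
    so it suffices to inspect the rows of M^t.
    """
    pi = stationary_distribution(M)
    power = [row[:] for row in M]  # power == M^1
    for t in range(1, t_max + 1):
        if max(tv_distance(row, pi) for row in power) <= threshold:
            return t
        power = mat_mul(power, M)
    raise ValueError("chain did not mix within t_max steps")


# Hypothetical two-state reference chain.
M_ref = [[0.9, 0.1],
         [0.2, 0.8]]
pi_ref = stationary_distribution(M_ref)
pi_star = min(pi_ref)
t_mix = mixing_time(M_ref)
```

For this example chain the stationary distribution is (2/3, 1/3), so the minimum stationary probability is 1/3, and the mixing time is 3.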
We use the standard $\ell_1$ norm $\|z\|_1 = \sum_i |z_i|$, which, in the context of distributions (and up to a convention-dependent factor of $2$) corresponds to the total variation norm. For , define
Finally, we use standard $O$, $\Omega$ and $\Theta$ order-of-magnitude notation, as well as their tilde variants, in which lower-order log factors in any parameter are suppressed.
An -identity tester for Markov chains with sample complexity function is an algorithm that takes as input a reference Markov chain and drawn from some unknown Markov chain , and outputs such that for , both and hold with probability at least . (The probability is over the draw of and any internal randomness of the tester.)
4 Formal results
Since the focus of this paper is on statistical rather than computational complexity, we defer the (straightforward) analysis of the runtimes of our tester to the Appendix, Section 8.
Theorem 4.1 (Upper bound)
There exists an -identity tester (provided at Algorithm 1), which, for all , , satisfies the following. If receives as input a -state “reference” ergodic Markov chain , as well as a sequence of length at least , drawn according to an unknown chain (starting from an arbitrary state), then it outputs such that
holds with probability at least . The sample complexity is upper-bounded by
An important feature of Theorem 4.1 is that the sample complexity only depends on the (efficiently computable, see Section 8) properties of the known reference chain. No assumptions, such as symmetry (as in Daskalakis et al. (2018); Cherapanamjeri and Bartlett (2019)) or even ergodicity, are made on the unknown Markov chain, and none of its properties appear in the bound.
Our results indicate that in the regime where the term is not dominant, the use of optimized identity iid testers as subroutines confers an -fold improvement over the naive testing-by-learning strategy.
Theorem 4.2 (Instance-specific upper bound)
There exists an -identity tester , which, for all , , satisfies the following. If receives as input a -state “reference” ergodic Markov chain , as well as a sequence of length at least , drawn according to an unknown chain (starting from an arbitrary state), then it outputs such that
holds with probability at least . The sample complexity is upper-bounded by
where and are as in Theorem 4.1, and
Since we always have , the instance-specific bound is always at least as sharp as the worst-case one in Theorem 4.1. It may, however, be considerably sharper, as illustrated by a simple random walk on a -vertex, -regular graph (Levin et al., 2009, Section 1.4), for which the instance-specific bound is — a savings of roughly .
Theorem 4.3 (Lower bounds)
For every , , and , , there exists a -state Markov chain with mixing time and stationary distribution such that every -identity tester for reference chain must require in the worst case a sequence drawn from the unknown chain of length at least
where are as in Theorem 4.1.
As the proof shows, for any , a testing problem can be constructed that achieves the component of the lower bound. Moreover, for doubly-stochastic , we have , which shows that the upper bound cannot be improved in all parameters simultaneously.
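The doubly-stochastic remark presumably rests on the following elementary fact, spelled out here as a short worked derivation. The symbols $M$ (a $d$-state ergodic, doubly-stochastic transition matrix), $\pi_\star$ and $t_{\mathrm{mix}}$ are notation assumed for this sketch.

```latex
% Assumption: every column of M sums to 1 (double stochasticity),
% and pi = (1/d, ..., 1/d) is the uniform distribution on d states.
\begin{align*}
(\pi M)(j) &= \sum_{i=1}^{d} \frac{1}{d}\, M(i,j) = \frac{1}{d} = \pi(j)
  \quad \text{for every state } j, \\
\pi_\star &= \min_{i} \pi(i) = \frac{1}{d},
  \qquad\text{hence}\qquad
  \frac{t_{\mathrm{mix}}}{\pi_\star} = d \, t_{\mathrm{mix}} .
\end{align*}
```

In other words, the uniform distribution is stationary for a doubly-stochastic chain, and by ergodicity it is the unique stationary distribution, so a quantity of order $t_{\mathrm{mix}}/\pi_\star$ coincides with one of order $d\, t_{\mathrm{mix}}$ in that case.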
5 Overview of techniques
For both upper and lower bounds, we survey existing techniques, describe their limitations vis-à-vis our problem, and highlight the key technical challenges as well as our solutions for overcoming these.
5.1 Upper bounds
Naïve approach: testing-by-learning.
We mention this approach mainly to establish a baseline comparison. Wolfer and Kontorovich (2019) showed that in order to -learn an unknown -state ergodic Markov chain under the distance, a single trajectory of length is sufficient. It follows that one can test identity with sample complexity
This naïve bound, aside from being much looser than bounds provided in the present paper, has the additional drawback of depending on the unknown and, in particular, being completely uninformative when the latter is not ergodic.
Reduction to iid testing.
Our upper bound in Theorem 4.1 is achieved via the stratagem of invoking an existing iid distribution identity tester as a black box (this is also the general approach of Daskalakis et al. (2018)). Intuitively, given the reference chain , we can compute its stationary distribution and thus know roughly how many visits to expect in each state. Further, computing the mixing time gives us confidence intervals about these expected visits (similar to Wolfer and Kontorovich (2019), via the concentration bounds of Paulin (2015)). Hence, if a chain fails to visit each state a “reasonable” number of times, our tester in Algorithm 1 rejects it. Otherwise, given that state has been visited as expected, we can apply an iid identity tester to its conditional distribution. The unknown Markov chain passes the identity test if every state’s conditional distribution passes its corresponding iid test.
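The reduction described above can be sketched as follows. This is only a schematic illustration under stated assumptions: `markov_identity_test`, `required_visits` and `plug_in_iid_tester` are hypothetical names, the per-state visit thresholds are supplied by the caller rather than derived from the stationary distribution and mixing time, and the plug-in iid tester is a simple stand-in, not the optimized iid tester the paper invokes as a black box.

```python
from collections import Counter, defaultdict


def plug_in_iid_tester(sample, reference_row, eps):
    """Stand-in iid tester (assumption): accept iff the empirical distribution
    of `sample` is within eps/2 of `reference_row` in l1 distance."""
    counts = Counter(sample)
    n = len(sample)
    l1 = sum(abs(counts[j] / n - p) for j, p in enumerate(reference_row))
    return l1 <= eps / 2


def markov_identity_test(reference, trajectory, required_visits, eps,
                         iid_tester=plug_in_iid_tester):
    """Return True (accept) or False (reject).

    reference       -- row-stochastic matrix of the known chain (list of rows)
    trajectory      -- observed state sequence of the unknown chain
    required_visits -- per-state number of transitions to test (each >= 1)
    """
    # Group the observed transitions by their starting state. The final
    # state of the trajectory has no successor, so it is not counted as a visit.
    successors = defaultdict(list)
    for current, nxt in zip(trajectory, trajectory[1:]):
        successors[current].append(nxt)

    for state, row in enumerate(reference):
        n_i = required_visits[state]
        if len(successors[state]) < n_i:
            return False  # the state was not visited a "reasonable" number of times
        if not iid_tester(successors[state][:n_i], row, eps):
            return False  # the conditional distribution of this state failed
    return True           # every state passed its iid test


# Usage on a hypothetical two-state reference chain.
reference = [[0.5, 0.5], [0.5, 0.5]]
accepted = markov_identity_test(reference, [0, 0, 1, 1, 0, 0, 1, 1, 0], [4, 4], eps=0.5)
```

Any iid identity tester with the same call signature can be passed as `iid_tester`; the visit check comes first so that a chain that starves some state is rejected before any conditional distribution is examined.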
A central technical challenge in executing this stratagem is the fact that conditioning on the number of visits introduces dependencies in the sample, thereby breaking the Markov property. To get around this difficulty, we use a scheme similar to that of Daskalakis et al. (2018).
5.2 Lower bounds
A lower bound of is immediate via a reduction from the testing problem of Daskalakis et al. (2018) to ours (see Remark 9.1). Although our construction for obtaining the sharper lower bound of shares some conceptual features with the constructions in Hao et al. (2018); Wolfer and Kontorovich (2019), a considerably more delicate analysis is required here. Indeed, the technique of tensorizing the KL divergence, instrumental in the lower bound of Wolfer and Kontorovich, would yield (at best) a sub-optimal estimate of in our setting. Intuitively, bounding TV via KL divergence is too crude for our purposes. Instead, we take the approach of reducing the problem, via a covering argument, to one of iid testing, and construct a family of Markov chains whose structure allows us to recover the Markov property even after conditioning on the number of visits to a certain “special” state. The main contribution to this argument is the decoupling technique of Lemma 7.4. The second lower bound is based on the construction of Wolfer and Kontorovich, for which the mixing time and accuracy of the test can both be controlled independently. Curiously, the aforementioned argument cannot be invoked verbatim for our problem, and so we introduce here the twist of considering half-covers of the chains (Lemma 7.5), concluding the argument with a two-point technique. This adaptation shaves a logarithmic factor off the bound for the corresponding learning problem.
6.1 Proof of Theorem 4.1
In order to prove Theorem 4.1, we design a testing procedure, describe the corresponding algorithm, and then analyze it.
6.1.1 The testing procedure
For an infinite trajectory drawn from , and for any , we denote the random hitting times to state ,
and for ,
Fixing and , let us define, following Daskalakis et al. (2018), the mapping
which outputs, for a trajectory drawn from , the first states that have been observed immediately after hitting . It is a consequence of the Markov property that the coordinates of are independent and identically distributed according to the conditional distribution defined by the th state of . Namely,
Remark: For an infinite trajectory, this mapping is well-defined almost surely, provided the chain is irreducible. For a finite draw of length , however, can be infinite for some , so that whether is well-defined is a random event that depends on and the mixing properties of the chain.
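The mapping and the caveat in the remark can be illustrated with a minimal sketch. The function name `successors_after_visits` and the convention of returning `None` when the finite trajectory does not contain enough visits are assumptions made for this example.

```python
def successors_after_visits(trajectory, state, n):
    """Return the first n states observed immediately after hitting `state`.

    Returns None when the finite trajectory contains fewer than n visits to
    `state` that are followed by another observation, i.e. when the mapping
    is not well-defined on this draw.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    observed = []
    # Pair each position with its successor; the last state has none.
    for current, nxt in zip(trajectory, trajectory[1:]):
        if current == state:
            observed.append(nxt)
            if len(observed) == n:
                return observed
    return None


# Usage on a hypothetical trajectory over states {0, 1, 2}.
trajectory = [0, 1, 0, 2, 0, 0, 1]
first_three = successors_after_visits(trajectory, state=0, n=3)   # [1, 2, 0]
too_many = successors_after_visits(trajectory, state=0, n=5)      # None
```

Under the Markov property, the entries of the returned list are iid draws from the row of the transition matrix indexed by `state`, which is what allows an iid tester to be applied to them.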
For , we define our identity tester in terms of sub-testers , , whose definition we defer until further in the analysis. Intuitively, each requires at least a “reasonable” number of visits to , i.e. a lower bound on .
6.1.2 Analysis of the tester
Consider the two following events
The probability that the tester correctly outputs for a trajectory sampled from the reference chain is
where is by definition of , stems from the fact that in the event where is well-defined,
holds, while is by definition of , and is by the following covering argument
Further setting , where is the stationary distribution of the reference chain, and from an application of the union bound,
where is the number of visits to state (not counting the final state at time ). For ,
using the Bernstein-type concentration inequalities of Paulin (2015) as made explicit in Wolfer and Kontorovich (2019, Lemma 5). Observe that no properties of the unknown chain were invoked in this deduction.
We are left with lower bounding , the probability that all state-wise testers correctly output in the idealized case where they have access to enough samples. We first recall some standard results (see, for example, Waggoner, 2015).
Lemma 6.1 (iid -testing to identity)
Let and . There exists a universal constant and a tester , such that for any reference distribution , and any unknown distribution , for a sample of size drawn iid from , can distinguish between the cases and with probability .
Lemma 6.2 (BPP amplification)
Given any -identity tester for the iid case with sample complexity , and any , we can construct (via a majority vote) an amplified tester such that for , can distinguish the cases and with confidence .
6.2 Proof of Theorem 4.2
This claim follows immediately from the analysis of the iid instance-optimal tester (Valiant and Valiant, 2017), which is invoked to test the conditional distributions of each state.
Lemma 6.3 (Valiant and Valiant 2017)
Let and . There exists a universal constant , such that for any reference distribution , there exists a tester such that for any unknown distribution , for a sample of size drawn iid from , can distinguish between the cases and with probability .
We simply have to ensure that for each state , , i.e. , whence the theorem.
6.3 Proof of Theorem 4.3, lower bound in .
The metric domination result in Lemma 9.1 immediately implies a lower bound of (see Remark 9.1). We now construct two independent and more delicate lower bounds of and . Let be the collection of all -state Markov chains whose stationary distribution is minorized by and whose mixing time is at most . Our goal is to lower bound the minimax risk, defined by
where the is over all testing procedures , and the is over all such that .
The analysis is simplified by considering -state Markov chains with even; an obvious modification of the proof handles the case of odd . Fix and , and define by
Define the collection of -state Markov chain transitions matrices,
The stationary distribution of a chain of this family is given by
and for , we have . We define the th hitting time for state as the random variable ; in words, this is the first time at which state has been visited times. Suppose that for some . For any , the th hitting time to state , stochastically dominates the random variable , where each is an independent copy distributed as . (Recall that a random variable $A$ stochastically dominates $B$ if $\mathbb{P}(A \geq x) \geq \mathbb{P}(B \geq x)$ for all $x$.) To see this, consider a similar chain where the value in the last row is replaced with with appropriately re-normalized; clearly, the modification can only make it easier to reach state . Continuing, we compute and . The Paley-Zygmund inequality implies that for ,
Define the random variable , i.e. the number of visits to state , and consider a reference Markov chain , where is the -supported uniform distribution. Restricting the problem to a subset of the family satisfying the -separation condition only makes it easier for the tester, as does taking any mixture of chains of this class in lieu of the in (6.4). More specifically, we choose
where for , and
where we wrote . It follows from Le Cam (2012, Chapter 16, Section 4) that
and so it remains to upper bound a total variation distance. For any , the statistics of the induced state sequence only differ in the visits to state .
At this point, we would like to invoke an iid testing lower bound — but are cautioned against doing so naively, as conditioning on the number of visits to a state breaks the Markov property. Instead, in Lemmas 7.2, 7.3 and 7.4 we develop a decoupling technique, which yields
We shall make use of Paninski (2008, Theorem 4), which states:
It follows that
Finally, for the mixture of chains and parameter regime in question, we have and , so that for and , it follows that . This implies a lower bound of for the testing problem.
6.4 Proof of Theorem 4.3, lower bound in .
Let us recall the construction of Wolfer and Kontorovich (2019). Taking and , fixed, and , we define the block matrix
where , , and are given by
Holding fixed, define the collection
of ergodic and symmetric stochastic matrices. Suppose that , where , and is the uniform distribution over the inner clique nodes, indexed by . Define the random variable to be the first time some half of the states in the inner clique were visited,
Lemma 7.5 lower bounds the half cover time:
while Wolfer and Kontorovich (2019, Lemma 6) establishes the key property that any element of satisfies
Let us fix some , choose as reference and as an alternative hypothesis , with . Take both chains to have the uniform distribution over the clique nodes as their initial one. It is easily verified that , so that
Further, for , we have
Since , we have
Additionally, the symmetry of our reference chain implies that . It follows, via an analogous argument that , so that
By Le Cam’s theorem (Le Cam, 2012, Chapter 16, Section 4),
Other than state and its connected outer nodes, the reference chain and the alternative chain are identical. Conditional on , the outer states connected to were never visited, since these are only connected to the rest of the chain via