Sequential Nonparametric Testing
with the Law of the Iterated Logarithm
Abstract
We propose a new algorithmic framework for sequential hypothesis testing with i.i.d. data, which includes A/B testing, nonparametric two-sample testing, and independence testing as special cases. It is novel in several ways: (a) it takes linear time and constant space to compute on the fly, (b) it has the same power guarantee as a non-sequential version of the test with the same computational constraints up to a small factor, and (c) it accesses only as many samples as are required – its stopping time adapts to the unknown difficulty of the problem. All our test statistics are constructed to be zero-mean martingales under the null hypothesis, and the rejection threshold is governed by a uniform nonasymptotic law of the iterated logarithm (LIL). For the case of nonparametric two-sample mean testing, we also provide a finite-sample power analysis, and the first nonasymptotic stopping time calculations for this class of problems. We verify our predictions for type I and II errors and stopping times using simulations.
1 Introduction
Nonparametric statistical decision theory poses the problem of making a decision between a null and alternate hypothesis over a dataset with the aim of controlling both false positives and false negatives (in statistics terms, maximizing power while controlling type I error), all without making assumptions about the distribution of the data being analyzed. Hypothesis testing is based on a “stochastic proof by contradiction” – the null hypothesis is thought of by default to be true, and is rejected only if the observed data are statistically very unlikely under the null.
There is increasing interest in solving such problems in a “big data” regime, in which the sample size can be huge. We present a sequential testing framework for this problem that is particularly suitable for two related scenarios prevalent in many applications:

The dataset is extremely large and high-dimensional, so even a single pass through it is prohibitive.

The data is arriving as a stream, and decisions must be made with minimal storage.
Sequential tests have long been considered strong in such settings. They access the data in an online/streaming fashion, assessing after every new datapoint whether it then has enough evidence to reject the null hypothesis. However, most prior work is either univariate or parametric or asymptotic, while we are the first to provide nonasymptotic guarantees on multivariate nonparametric problems.
To elaborate on our motivations, suppose we have a gigantic amount of data from each of two unknown distributions, enough to detect even a minute difference in their means if it exists. Further suppose that, unknown to us, deciding whether the means are equal is actually statistically easy (the difference between the means is large), meaning that one can conclude with high confidence by just looking at a tiny fraction of the dataset. Can we take advantage of this easiness, despite our ignorance of it?
A naive solution would be to discard most of the data and run a batch (offline) test on a small subset. However, we do not know how hard the problem is, and hence do not know how large a subset will suffice: sampling too little data might lead to incorrectly not rejecting the null, and sampling too much would unnecessarily waste computational resources. If we somehow knew the problem difficulty, we would want to choose the fewest number of samples (say $n^*$) to reject the null while controlling type I error at some target level.
Our sequential test solves the problem by automatically stopping after seeing about $n^*$ samples, while still controlling type I and II errors almost as well as the equivalent linear-time batch test. Without knowing the true problem difficulty, we are able to detect it with virtually no computational or statistical penalty. We devise and formally analyze a sequential algorithm for a variety of problems, starting with a basic test of the bias of a coin, then nonparametric two-sample mean testing, and finally general nonparametric two-sample and independence testing.
Our proposed procedure only keeps track of a single scalar test statistic, which we construct to be a zero-mean random walk under the null hypothesis. It is used to test the null hypothesis each time a new data point is processed. A major statistical issue is dealing with the apparent multiple hypothesis testing problem – if our algorithm observes its first rejection of the null at time $n$, it might raise suspicions of being a false rejection, because $n-1$ hypothesis tests were already conducted and the $n$-th may have been rejected purely by chance. Applying some kind of multiple testing correction, like the Bonferroni or Benjamini-Hochberg procedure, is exceedingly conservative and produces very suboptimal results over a large number of tests. However, since the random walk moves only a relatively small amount every iteration, the tests are far from independent. Formalizing this intuition requires adapting a classical probability result, the law of the iterated logarithm (LIL), with which we control for type I error (when $H_0$ is true).
The LIL can be described as follows: imagine tossing a fair coin, assigning $+1$ to heads and $-1$ to tails, and keeping track of the sum $S_n$ of the first $n$ coin flips. The LIL asserts that asymptotically, $S_n$ always remains bounded between $\pm\sqrt{2 n \ln\ln n}$ (and this “envelope” is tight).
When $H_1$ is true, we prove that the sequential algorithm does not need the whole dataset as a batch algorithm would, but automatically stops after processing just “enough” data points to detect $H_1$, depending on the unknown difficulty of the problem being solved. The near-optimal nature of this adaptive type II error control (when $H_1$ is true) is again due to the remarkable LIL.
As mentioned earlier, all of our test statistics can be thought of as random walks, which behave like the coin-flip sum above under $H_0$. The LIL then characterizes how these random walks behave under $H_0$ – our algorithm will keep observing new data since the random walk values will simply bounce around within the LIL envelope. Under $H_1$, this random walk is designed to have nonzero mean, and hence will eventually stray outside the LIL envelope, at which point the process stops and rejects the null hypothesis.
For practically applying this argument to finite samples and reasoning about type II error and stopping times, we cannot use the classical asymptotic form of the LIL typically stated in textbooks such as [7]; instead, we adapt a finite-time extension of the LIL by [2]. As we will see, this technical contribution is necessary to investigate the stopping time, and to control type I and II errors nonasymptotically and uniformly over all times.
In summary, our sequential testing framework has the following properties:

(A) Under $H_0$, it controls type I error, using a finite-time LIL computable in terms of empirical variance.

(B) Under $H_1$, and with type II error controlled at a target level, it automatically stops after seeing the same number of points as the corresponding computationally-constrained oracle batch algorithm.

(C) Each update takes $O(d)$ time and constant memory.
In later sections, we develop formal versions of these statements. The statistical observations, particularly the stopping time, follow from the finite-time LIL through simple concentration of measure arguments that extend to very general sequential testing settings, but have seemingly remained unobserved in the literature for decades because of the finite-time LIL necessary to make them.
We begin by describing a sequential test for the bias of a coin in Section 2. We then provide a sequential test for nonparametric two-sample mean testing in Section 3. We run extensive simulations in Section 4 to bear out our theory about its properties. We end with extensions to the general nonparametric two-sample and independence testing problems, in Section 5. Proofs are deferred to the appendices.
2 Detecting the Bias of a Coin
This section will illustrate how a simple sequential test can perform statistically as well as the best batch test in hindsight, while automatically stopping essentially as soon as possible. We will show that such early stopping can be viewed as quite a general consequence of concentration of measure. Just for this section, let represent constants that may take different values on each appearance, but are always absolute.
Consider observing i.i.d. binary flips of a coin, which may be fair or biased towards , with . We want to test for fairness, detecting unfairness as soon as possible. Concretely, we therefore wish to test, for :
For any sample size , the natural test statistic for this problem is . is a (scaled) simple meanzero random walk under . A standard hypothesis testing approach to our problem is a basic batch test involving , which tests for deviations from the null for a fixed sample size (Fig. 1, left). A basic Hoeffding bound shows that
with probability under the null, so type I error is controlled at level :
2.1 A Sequential Test
The main test we propose will be a sequential test as in Fig. 1. It sees examples as they arrive one at a time, up to a large time , the maximum sample size we can afford. The sequential test is defined with a sequence of positive thresholds . We show how to set to justify statements (A) and (B) in the introduction.
Type I Error. Just as the batch threshold is determined by controlling the type I error with a concentration inequality, the sequential test also chooses to control the type I error at :
(1) 
This inequality concerns the uniform concentration over infinite tails of , but what satisfies it? Asymptotically, the answer is governed by a foundational result, the LIL:
Theorem 1 (Law of the iterated logarithm ([13])).
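With probability $1$, the sum $S_n$ of $n$ fair $\pm 1$ coin flips satisfies $\limsup_{n \to \infty} \frac{S_n}{\sqrt{n \ln\ln n}} = \sqrt{2}$.
The LIL says that the threshold should have a $\sqrt{n \ln\ln n}$ asymptotic dependence on $n$, but does not specify its dependence on the target type I error level.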
Our sequential testing insights rely on a stronger nonasymptotic LIL proved in ([2], Theorem 2): w.p. at least , we have simultaneously for all . This choice of satisfies (1) for , and specifies the sequential test as in Fig. 1. (Choosing this way is unimprovable in all parameters up to absolute constants ([2])).
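To make the procedure concrete, the following Python sketch simulates a sequential coin test of this kind. It is only an illustration under stated assumptions: flips take values in $\{-1,+1\}$, the threshold has the shape $\sqrt{c\,n\,(\ln\ln n + \ln(1/\alpha))}$, and the constant `c`, the clamping of $\ln\ln n$ for small $n$, and all function names are hypothetical choices rather than the constants of [2].

```python
import math
import random


def lil_threshold(n, alpha, c=2.1):
    """Illustrative LIL-shaped threshold; `c` is an assumed tuning constant."""
    # ln ln n is only positive for n > e, so clamp the arguments for small n.
    loglog = math.log(max(math.log(max(n, 3)), 1.0))
    return math.sqrt(c * n * (loglog + math.log(1.0 / alpha)))


def sequential_coin_test(flips, alpha=0.05, c=2.1):
    """Consume +1/-1 flips one at a time; return (rejected, stopping_time)."""
    s = 0  # running sum of flips: a mean-zero random walk for a fair coin
    n = 0
    for n, a in enumerate(flips, start=1):
        s += a
        if s > lil_threshold(n, alpha, c):
            return True, n  # reject the null and stop early
    return False, n  # stream exhausted without rejecting


def biased_flips(p_heads, horizon, rng):
    """Generate `horizon` flips that are +1 with probability `p_heads`."""
    for _ in range(horizon):
        yield 1 if rng.random() < p_heads else -1


if __name__ == "__main__":
    rng = random.Random(0)
    for p in (0.5, 0.55, 0.7):
        rejected, tau = sequential_coin_test(biased_flips(p, 20000, rng))
        print(f"P(heads)={p}: rejected={rejected}, stopped at n={tau}")
```

The test keeps only the running sum and the round counter, so each update costs constant time and memory; the stopping time is simply the first round at which the walk exits the envelope.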
Type II Error. For practical purposes, can be treated as a small constant (even when ). Hence, (more discussion in Appendix D.1), and the power is:
(2)  
(3) 
So the sequential test is essentially as powerful as a batch test with $N$ samples (and similarly the $n$-th round of the sequential test is like an $n$-sample batch test).
Early Stopping. The standard motivation for using sequential tests is that they often require few samples to reject statistically distant alternatives. To investigate this with our working example, suppose is large and the coin is actually biased, with a fixed unknown . Then, if we somehow had full knowledge of when using the batch test and wanted to ensure a desired type II error , we would use just enough samples (written as in context):
(4) 
so that for all , since ,
(5) 
Examining (2.1), note that is a meanzero random walk. Therefore, standard lower bounds for the binomial tail tell us that suffices, and no test can statistically use much less than samples under to control type II error at .
How many samples does the sequential test use? The quantity of interest is the test’s stopping time , which is when it rejects and otherwise. In fact, the expected stopping time is close to under any alternate hypothesis:
Theorem 2.
For any and any , there exist absolute constants such that
2.2 Discussion
Before moving to the two-sample testing setting, we note the generality of these ideas. Theorem 2 is proved for biased coin flips, but it uses only basic concentration of measure ideas: upper and lower bounds on the tails of a statistic that is a cumulative sum incremented each timestep. Many natural test statistics follow this scheme, particularly those that can be efficiently updated on the fly. Our main sequential two-sample test in the next section does as well.
Theorem 2 is notable for its uniformity over and . Note that (and therefore the sequential test) are independent of both of these – we need only to set a target type I error bound . Under any alternative , the theorem holds for all simultaneously. As decreases, of course increases, but the leading multiplicative factor decreases. In fact, with an increasingly stringent , we see that ; so the sequential test in fact stops closer to , and hence is almost deterministically best possible. Indeed, the proof of Theorem 2 also shows that , so the probability of lasting steps falls off exponentially in , and is therefore quite sharply concentrated near the optimum .
This precise line of reasoning is formalized completely nonasymptotically in the analysis of our main two-sample test for the problem (6), though that result is in a stronger high-dimensional setting.
3 Two-Sample Mean Testing
Assume that we have samples and , with being unknown arbitrary continuous distributions on with means , and we need to test
(6) 
Denote covariances of by and . Define so that under . Let denote the standard Gaussian CDF, .
3.1 A Linear-Time Sequential Test
In this section, we present our main sequential two-sample test using the scheme in Fig. 1, so we only need to specify a sequence of rejection thresholds . To do this, we denote
and define our sequential test statistic as the following stochastic process evolving with :
Under , , and is a zeromean random walk.
Proposition 1.
, and
We assume for now that our data are bounded, i.e.
so that by the Cauchy-Schwarz inequality, w.p. 1,
Since the statistic has bounded differences, it exhibits Gaussian-like concentration under the null. We examine its cumulative variance process under $H_0$,
Using this, we can control the behavior of under .
Theorem 3 ([2]).
Take any . Then with probability , for all simultaneously,
where , and .
Unfortunately, we cannot use the theorem directly to get computable deviation bounds for type I error control, because the covariance matrix is unknown a priori. must instead be estimated on the fly as part of the sequential test, and its estimate must be concentrated tightly and uniformly over time, so as not to present a statistical bottleneck if the test runs for a long time. We prove such a result, necessary for sequential testing, relating to the empirical variance process .
Lemma 4.
With probability , for all simultaneously, there is an absolute constant such that
Its proof uses a self-bounding argument and is in the Appendix. Now, we can combine these to prove a novel uniform empirical Bernstein inequality to (practically) establish concentration of the statistic under $H_0$.
Theorem 5 (Uniform Empirical Bernstein Inequality for Random Walks).
Take any . Then with probability , for all simultaneously,
where , and is an absolute constant.
Its proof follows immediately from a union bound on Thm. 3 and Lem. 4. Thm. 5 depends on , which is easily calculated by the algorithm on the fly in constant time per iteration [8]. Ignoring constants for clarity, Thm. 5 effectively implies that our sequential test from Figure 1 controls type I error at by setting
(7) 
Practically, we suggest using the above threshold with a constant of to guarantee typeI error approximately (this is all one often wants anyway, since any particular choice of is anyway arbitrary). This is what we do in our experiments, with excellent success in simulations. For exact or conservative control, consider using a small constant multiple of the above threshold, such as .
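The sequential two-sample procedure can be sketched in a few lines of Python. This is a hedged illustration, not a definitive implementation: it assumes increments of the form $h_i = (x_{2i-1} - y_{2i-1})^\top (x_{2i} - y_{2i})$, which have mean zero when the two means agree, a running empirical variance $\hat{V}_n = \sum_{i \le n} h_i^2$, and a threshold of the shape $c\,(\ln(1/\alpha) + \sqrt{2 \hat{V}_n \ln(\ln \hat{V}_n / \alpha)})$. The constant `c`, the clamping for small $\hat{V}_n$, and every function name are assumptions made for the example.

```python
import math
import random


def increment(x1, y1, x2, y2):
    """h = (x1 - y1) . (x2 - y2); mean zero when both distributions share a mean."""
    return sum((a - b) * (c - d) for a, b, c, d in zip(x1, y1, x2, y2))


def threshold(v_hat, alpha, c=1.1):
    """Assumed empirical-Bernstein LIL shape; `c` is an illustrative constant."""
    # Clamp so the iterated logarithm is well defined for small v_hat.
    loglog = math.log(max(math.log(max(v_hat, math.e)), 1.0) / alpha)
    return c * (math.log(1.0 / alpha) + math.sqrt(2.0 * v_hat * loglog))


def sequential_two_sample(rounds, alpha=0.05, c=1.1):
    """`rounds` yields (x1, y1, x2, y2) tuples; returns (rejected, rounds_used)."""
    t_stat = 0.0  # running sum of increments (the random walk)
    v_hat = 0.0   # running sum of squared increments (empirical variance process)
    n = 0
    for n, (x1, y1, x2, y2) in enumerate(rounds, start=1):
        h = increment(x1, y1, x2, y2)
        t_stat += h
        v_hat += h * h
        if t_stat > threshold(v_hat, alpha, c):
            return True, n
    return False, n


def gaussian_rounds(mean_x, mean_y, sigma, n_rounds, rng):
    """Yield rounds of two fresh draws from each of two spherical Gaussians."""
    def draw(mu):
        return [rng.gauss(m, sigma) for m in mu]
    for _ in range(n_rounds):
        yield draw(mean_x), draw(mean_y), draw(mean_x), draw(mean_y)


if __name__ == "__main__":
    rng = random.Random(1)
    d = 10
    for shift in (0.0, 0.2, 0.5):
        stream = gaussian_rounds([0.0] * d, [shift] * d, 0.3, 50000, rng)
        print(shift, sequential_two_sample(stream))
```

Only two scalars are carried between rounds, and each round costs one inner product in the data dimension.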
The above sequential threshold is remarkable, because wrapped into the practically useful and simple expression is a deep mathematical result – the uniform Bernstein LIL effectively involves a union bound for the error probability over an infinite sequence of times. Any naive attempt to union bound the error probabilities for a possibly infinite sequential testing procedure will be too loose and hence too conservative – indeed, the classical LIL is known to be asymptotically tight including constants, and our nonasymptotic LIL is also tight up to small constant factors.
This type I error control with an implicit infinite union bound surprisingly does not lead to a loss in power. Indeed, our statistic possesses essentially the same power as the corresponding linear-time batch two-sample test, and also stops early for easy problems. We make this precise in the following two subsections.
3.2 A Linear-Time Batch Test
Here we study a simple lineartime batch twosample mean test, following the template in Fig. 1. Consider the lineartime statistic , where, as before, Note that the s are also i.i.d., and relies on data points from each distribution.
Let be under respectively. Recalling Proposition 1:
Then since is a sum of i.i.d. variables, the central limit theorem (CLT) implies that (where is convergence in distribution)
(8a)  
(8b) 
Based on this information, our test rejects the null hypothesis whenever
(9) 
where is the quantile of the standard normal distribution. So Eq. (8a) ensures that
giving us type I error control under .
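A batch rule of this kind can be sketched as follows. The sketch assumes the increments $h_1, \dots, h_N$ have already been computed and standardizes their sum by the sample standard deviation, a plug-in substitute for the unknown true variance; the function name and the handling of the degenerate zero-variance case are hypothetical.

```python
import math
from statistics import NormalDist, stdev


def batch_test(increments, alpha=0.05):
    """One-sided z-test on the sum of i.i.d. increments (needs N >= 2)."""
    n = len(increments)
    total = sum(increments)
    sd = stdev(increments)  # plug-in estimate of the increment standard deviation
    if sd == 0.0:
        # Degenerate case: all increments identical; reject only if positive.
        return total > 0.0
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)  # upper-alpha normal quantile
    return total / (math.sqrt(n) * sd) > z_alpha


if __name__ == "__main__":
    print(batch_test([1.0, 1.2, 0.8, 1.1, 0.9] * 20))  # clearly positive mean
    print(batch_test([1.0, -1.0] * 50))                # sum is exactly zero
```

Unlike the sequential rule, this test looks at the data once, at a sample size fixed in advance, which is why its threshold needs no iterated-logarithm term.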
In practice, we may not know , so we standardize the statistic using the empirical variance – since we assume is large, these scalar variance estimates do not change the effective power analysis. For nonasymptotic type I error control, we can use an empirical Bernstein inequality [18, Thm 11], based on an unbiased estimator of . Specifically, the empirical variance of s () can be used to reject the null whenever
(10) 
Ignoring constants for clarity, the empirical Bernstein inequality effectively suggests that the batch test from Figure 1 will have type I error control of on setting threshold
(11) 
For immediate comparison, we copy below the expression for from Eq. (7):
This similarity explains the optimal power and stopping time properties, detailed in the next subsection.
One might argue that if is large, then , and in this case we can simply derive the (asymptotic) power of the batch test given in Eq.(9) as
(12)  
Note that the second term is a constant less than . As a concrete example, when , and we denote the signaltonoise ratio as , then the power of the lineartime batch test is at least
3.3 Power and Stopping Time of Sequential Test
The striking similarity of Eq. (11) and Eq. (7), mentioned in the previous subsection, is not coincidental. Indeed, both of these arise out of nonasymptotic versions of CLTlike control and LILlike control, and we know that in the asymptotic regime for Bernoulli coinflips, CLT thresholds and LIL threshold differ by just factors. Hence, it is not surprising to see the empirical Bernstein LIL match empirical Bernstein thresholds up to factors. Since the power of the sequential test is at least the probability of rejection at the very last step, and since even for , the power of the lineartime sequential and batch tests is essentially the same. However, a sequential test that rejects at the last step is of little practical interest, bringing us to the issue of early stopping.
Early Stopping. The argument is again identical to that of Section 2, proving that the expected stopping time is nearly optimal, and arbitrarily close to optimal as the target type II error tends to zero. Once more note that the “optimal” above refers to the performance of the oracle linear-time batch algorithm that was informed about the right number of points to subsample and use for the one-time batch test. Formally, let $n^*$ denote this minimum sample size for the two-sample mean testing batch problem to achieve the target power, the $*$ indicating that this is an oracle value, unknown to the user of the batch test. From Eq. (12), it is clear that for the power becomes at least . In other words,
(13) 
Theorem 6.
3.4 Discussion
This section’s arguments illustrate the flexibility and generality of the ideas we used to test the bias of the coin. In the two-sample setting, we just design the statistic to be a mean-zero random walk under the null. As in the coin’s case, the LIL controls type I error, and the rest of the arguments are identical because of the common concentration properties of all random walks.
Our test statistic is chosen with several considerations in mind. First, the batch test is linear-time in the sample complexity, so we are comparing algorithms with the same computational budget, on a fair footing. There exist batch tests using U-statistics that have higher power than ours ([21]) for a given sample size, but they use more computational resources (quadratic rather than linear time).
Also, the batch statistic is a sum of random increments, a common way to write many hypothesis tests, and one that can be computed on the fly in the sequential setting. Note that the statistic is a scalar, so our arguments do not change with the dimension $d$, and we inherit the favorable high-dimensional statistical performance of the statistic; [21] has more relevant discussion. The statistic also has been shown to have powerful generalizations in the recent statistics literature, which we discuss in Section 5.
Though we assume data scaled to have norm for convenience, this can be loosened. Any data with bounded norm can be rescaled by a factor just for the analysis, and then our results can be used. This results in an empirical Bernstein bound like Thm. 5, but of order . The dependence on is very weak, and is negligible even when .
In fact, we only require control of the higher moments (e.g. by Bernstein conditions, which generalize boundedness and sub-Gaussianity conditions [3]) to prove the nonasymptotic Bernstein LIL in [2], exactly as is the case for the usual Bernstein concentration inequalities for averages ([3]). Therefore, our basic arguments hold for unbounded increments as well. In fact, the LIL itself, as well as the nonasymptotic LIL bounds of [2], apply to martingales – much more general versions of random walks capable of modeling dependence on the past history. Our ideas could conceivably be extended to this setting to devise more data-dependent tests, which would be interesting future work.
4 Empirical Evaluation
In this section, we evaluate our proposed sequential test on synthetic data, to validate the predictions made by our theory concerning its type I/II errors and the stopping time.
We simulate data from two multivariate Gaussians (), motivated by our discussion at the end of Section 3.2: each Gaussian has covariance matrix , one has mean and the other has for some . We keep here to keep the scale of the data roughly consistent with the biasedcoin example, though we find the scaling of the data makes no practical difference, as we discussed.
4.1 Running the Test and Type I Error
Like typical hypothesis tests, ours is designed to control type I error. When implementing our algorithmic ideas, it suffices to set as in (7), where the only unknown parameter is the proportionality constant . The theory suggests that this is an absolute constant, and prescribes an upper bound for it, which can conceivably be loose because of the analytic techniques used (as [2] discusses). On the other hand, in the asymptotic limit the bound becomes tight; the empirical converges quickly to its mean , and we know from secondmoment versions of the LIL that , and suffice. However, as we consider smaller finite times, that bound must relax (at the extremely low or when flipping a fair coin, for instance).
Nevertheless, we find that in practice, for even moderate sample sizes like the ones we test here, the same reasonable constants suffice in all our experiments: and , with following Thm. 5 and similar fixedsample Bennett bounds ([3, 2]; also see Appendix D). The situation is exactly analogous to how the Gaussian approximation is valid for even moderate sample sizes in batch testing, making possible a huge variety of common tests that are asymptotically and empirically correct with reasonable constants to boot.
To be more specific, consider the null hypothesis for the example of the coin bias testing given earlier; these fair coin flips are the most anticoncentrated possible bounded steps, and render our empirical Bernstein machinery ineffective, so they make a good test case. We choose and as above, and plot the cumulative probability of type I violations up to time for different (where is the stopping time of the test), with the results in Fig. 2. To control type I error, the curves need to be asymptotically upperbounded by the desired levels (dotted lines). This does not appear true for our recommended settings of , but the figure still indicates that type I error is controlled even for very high with our settings. A slight further raise in beyond suffices to guarantee much stronger control (Appendix G).
4.2 Type II Error and Stopping Time
Now we verify the results at the heart of the paper – uniformity over alternatives of the type II error and stopping time properties.
Fig. 3 plots the power of the sequential test against the maximum runtime using the Gaussian data, at a range of different alternatives ; the solid and dashed lines represent the power of the batch test (11) with samples, and the sequential test with maximum runtime . As we might expect, the batch test has somewhat higher power for a given sample size, but the sequential test consistently performs well compared to it. The role of here is basically to set a desired tolerance for error; increasing does not change the intermediate updates of the algorithm, but does increase the power by potentially running the test for longer. So each curve in Fig. 3 transparently illustrates the statistical tradeoff inherent in hypothesis testing against a fixed simple alternative, but the great advantage of our sequential test is in achieving all of them simultaneously with the same algorithm.
To highlight this point, we examine the stopping time compared to the batch test for the Gaussian data, in Fig. 4. We see that the distributions of are all quite concentrated, and that their medians (marked) fit well to a slope line, showing the predicted dependence on . Some more experiments are in Appendix G.1.
5 Further Extensions
A General Two-Sample Test. Given two independent multivariate streams of i.i.d. data, instead of testing for differences in mean, we could also test for differences in any moment, i.e. differences in distribution, a subtler problem which may require much more data to ascertain differences in higher moments. In other words, we would be testing
One simple way to do this is by using a kernel two-sample test, like the Maximum Mean Discrepancy (MMD) test proposed by [10]. The population MMD is defined as
where is the unit ball of functions in the Reproducing Kernel Hilbert Space corresponding to some positive semidefinite Mercer kernel . One common choice is the Gaussian kernel . With this choice, the population MMD has an interesting interpretation, given by Bochner’s theorem [24] as
where are the characteristic functions of . This means that the population MMD is nonzero iff the distributions differ (i.e. the alternative holds).
The authors of [10] propose the following (lineartime) batch test statistic after seeing samples: , where . The associated test is consistent against all fixed (and some local) alternatives where ; see [10] for a proof, and [21] for a highdimensional analysis of this test (in the limited setting of meantesting that we consider earlier in this paper). Both properties are inherited by the following sequential test.
The sequential statistic we construct after seeing batches ( samples) is the random walk , which has mean zero under the null because . The similarity with our meantesting statistic is not coincidental; when , they coincide, further motivating our choice of test statistic earlier in the paper. As before, we use the LIL to get type I error control, nearly the same power as the lineartime batch test, and also early stopping much before seeing points if the problem at hand is easy.
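The kernelized random walk can be sketched in Python as below. The increment used here, $k(x, x') + k(y, y') - k(x, y') - k(x', y)$ on two fresh draws from each stream, is the standard linear-time MMD estimator; the function names, the bandwidth default, and the pluggable `kernel` argument are assumptions made for illustration. With a linear kernel $k(a, b) = a^\top b$ the increment reduces to $(x - y)^\top (x' - y')$, which gives a quick sanity check.

```python
import math


def gaussian_kernel(u, v, bandwidth=1.0):
    """Gaussian (RBF) kernel exp(-||u - v||^2 / (2 * bandwidth^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def linear_kernel(u, v):
    """Plain inner product, used here only as a sanity check."""
    return sum(a * b for a, b in zip(u, v))


def mmd_increment(x1, y1, x2, y2, kernel=gaussian_kernel):
    """Linear-time MMD increment from two draws of each stream."""
    return kernel(x1, x2) + kernel(y1, y2) - kernel(x1, y2) - kernel(x2, y1)


def mmd_random_walk(rounds, kernel=gaussian_kernel):
    """Yield the running sum of increments after each round."""
    total = 0.0
    for x1, y1, x2, y2 in rounds:
        total += mmd_increment(x1, y1, x2, y2, kernel)
        yield total


if __name__ == "__main__":
    rounds = [([0.0], [3.0], [0.1], [2.9])] * 5
    print(list(mmd_random_walk(rounds)))
```

The running sum produced by `mmd_random_walk` can be compared against an LIL-type threshold exactly as in the mean-testing case, since only the definition of the increment has changed.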
A General Independence Test. Given a single multivariate stream of i.i.d data, where each datapoint is a pair , the independence testing problem involves testing whether is independent of or not. More formally, we want to test
(14) 
A test of linear correlation/covariance only detects linear dependence. As an alternative to this, [26] proposed a population quantity called distance covariance, given by
where are i.i.d. pairs from the joint distribution on . Remarkably, an alternative representation is
where are the characteristic functions of the marginals and joint distribution of and . Using this, the paper [26] concludes that iff . One way to form a lineartime statistic to estimate is to process the data in batches of size four, i.e. and calculate the scalar
where the summations are over all possible ways of assigning , each pair being one from . The expectation of this quantity is exactly , and the batch test statistic, given datapoints, is simply . As before, the associated test is consistent for any fixed alternatives where . Noting that under the null, our random walk after seeing batches (i.e. points) will just be . As in previous sections, the LIL results from [2] can be used to get type I error control, and early stopping much before seeing points, if the problem at hand is statistically easy.
6 Related Work
Parametric or asymptotic methods. Our statements about the control of type I/II errors and stopping times are very general, following up on early sequential analysis work. Most sequential tests operate in Wald’s framework expounded in [27]. In a seminal line of work, Robbins and colleagues delved into sequential hypothesis testing in an asymptotic sense [23]. Apart from being asymptotic, their tests were most often for simple hypotheses (point nulls and alternatives), were univariate, or parametric (assuming Gaussianity or known density). That said, two of their most relevant papers are [22] and [4], which discuss statistical methods related to the LIL. They give an asymptotic version of the argument of Section 2, using it to design sequential Kolmogorov-Smirnov tests with power one. Other classic works that mention using the LIL for testing various simple or univariate or parametric problems include [5, 6, 14, 16]. These all operate in the asymptotic limit in which the classic LIL can be used to set the rejection thresholds.
For testing a simple null against a simple alternative, Wald’s sequential probability ratio test (SPRT) was proved to be optimal by the seminal work [29], but this applies when both the null and alternative have a known parametric form. The same authors also suggested a univariate nonparametric two-sample test in [28], but presumably did not find it clear how to combine these two lines of work.
Bernsteinbased methods. Finitetime uniform LILtype concentration tools from [2] are crucial to our analysis, and we adapt them in new ways; but novelty in this respect is not our primary focus here, because less recent concentration bounds can also be used to yield similar results. It is always possible to use a weighted union bound (allocating failure probability over time as ) over fixed Bernstein bounds, resulting in a deviation bound of . A more advanced “peeling” argument, dividing time into exponentially growing epochs, improves the bound to (e.g. in [12]). This suffices in many simple situations, but in general is still arbitrarily inferior to our bound of , precisely in the case in which we expect the secondmoment Bernstein bounds to be most useful over Hoeffding bounds. A yet more intricate peeling argument, demarcating the epochs by exponential intervals in rather than , can be used to achieve our iteratedlogarithm rate, in conjunction with the wellknown secondorder uniform martingale bound due to Freedman ([9]). This serves as a sanity check on the nonasymptotic LIL bounds of [2], where it is also shown that these bounds have the best possible dependence on all parameters. However, it can be verified that even a suboptimal uniform concentration rate like would suffice for the optimal stopping time properties of the sequential test to hold, with only a slight weakening of the power.
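To give a feel for the gap between these rates, the short Python sketch below compares two time-uniform thresholds for a walk with $\pm 1$ steps. It substitutes Hoeffding bounds for Bernstein bounds to keep the example small: the first threshold is a genuine weighted union bound with failure budget $6\delta/(\pi^2 n^2)$ at time $n$, while the second only mimics the iterated-logarithm shape $\sqrt{2 n \ln(\ln n / \delta)}$, with constants that are assumed rather than proven. Function names are hypothetical.

```python
import math


def union_bound_threshold(n, delta):
    """Hoeffding threshold at time n with failure budget 6*delta/(pi^2 * n^2)."""
    delta_n = 6.0 * delta / (math.pi ** 2 * n ** 2)  # budgets sum to delta over n
    return math.sqrt(2.0 * n * math.log(1.0 / delta_n))


def iterated_log_threshold(n, delta):
    """Illustrative iterated-logarithm shape; the constants are assumptions."""
    return math.sqrt(2.0 * n * math.log(max(math.log(n), 1.0) / delta))


if __name__ == "__main__":
    for n in (10 ** 2, 10 ** 4, 10 ** 6, 10 ** 8):
        u = union_bound_threshold(n, 0.05)
        l = iterated_log_threshold(n, 0.05)
        print(f"n={n:>9}: union={u:10.1f}  iterated-log={l:10.1f}  ratio={u / l:.2f}")
```

The ratio of the two thresholds grows slowly with $n$, reflecting the extra $\sqrt{\ln n}$ versus $\sqrt{\ln\ln n}$ factor paid by the cruder union bound.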
Bernstein inequalities that only depend on empirical variance have been used for stopping algorithms in Hoeffding races [17] and other even more general contexts [19]. This line of work uses the empirical bounds very similarly to us, albeit in the nominally different context of direct estimation of a mean. As such, they too require uniform concentration over time, but achieve it with a crude union bound (failure probability ), resulting in a deviation bound of . Applying the more advanced techniques above, it may be possible to get our optimal concentration rate, but to our knowledge ours is the first work to derive and use uniform LILtype empirical Bernstein bounds.
In practice. To our knowledge, implementing sequential testing in practice has invariably relied upon CLT-type results patched together with heuristic adjustments of the CLT threshold (e.g. the well-known Haybittle-Peto scheme for clinical trials [20] has an arbitrary conservative choice of through the sequential process and at the last datapoint). These perform as loose functional versions of our uniform finite-sample LIL upper bound, though without theoretical guarantees. In general, it is unsound to use an asymptotically normal distribution under the null at stopping time – the central limit theorem (CLT) applies to any fixed time , but it may not apply to a random stopping time (see Anscombe’s random-sum CLT [1, 11] and related work). This has caused myriad practical complications in implementing such tests (see [15], Section 4). One of our contributions is to rigorously derive a directly usable finite-sample sequential test, in a way we believe can be generically extended.
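A small simulation can illustrate why a fixed-time CLT threshold is unsound at a data-dependent stopping time. The sketch below assumes a fair-coin null and a nominal two-sided 5% threshold `1.96 * sqrt(n)`. The `min_n` cutoff, the number of runs, and the function names are choices made for the example. It compares rejecting the first time the threshold is crossed against checking only at the final sample.

```python
import math
import random


def ever_rejects(n_max, z=1.96, min_n=30, rng=random):
    """Run one fair-coin random walk and report two decisions.

    Returns (peeking, fixed): `peeking` is True if |S_n| > z*sqrt(n) at any
    n in [min_n, n_max]; `fixed` applies the same threshold at n_max only.
    """
    s = 0
    peeking = False
    for n in range(1, n_max + 1):
        s += 1 if rng.random() < 0.5 else -1
        if n >= min_n and abs(s) > z * math.sqrt(n):
            peeking = True
    fixed = abs(s) > z * math.sqrt(n_max)
    return peeking, fixed


def type_one_error_rates(runs=400, n_max=1000, seed=0):
    """Estimate type I error under the null for both decision rules."""
    rng = random.Random(seed)
    results = [ever_rejects(n_max, rng=rng) for _ in range(runs)]
    peek_rate = sum(p for p, _ in results) / runs
    fixed_rate = sum(f for _, f in results) / runs
    return peek_rate, fixed_rate


peek_rate, fixed_rate = type_one_error_rates()
print("peeking:", peek_rate, "fixed-time:", fixed_rate)
```

In this setup the fixed-time rate should sit near the nominal level, while the peeking rate should be several times larger, and it keeps growing as `n_max` increases.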
We emphasize that there are several advantages to our proposed framework and analysis which, taken together, are unique in the literature. We tackle the multivariate nonparametric (possibly even high-dimensional) setting, with composite hypotheses. Moreover, we not only prove that the power is asymptotically one, but also derive finite-sample rates that illuminate dependence of other parameters on , by considering nonasymptotic uniform concentration over finite times. The fact that it is not provable via purely asymptotic arguments is why our optimal stopping property has gone unobserved for a wide range of tests, even as basic as the biased coin. In our more refined analysis, it can be verified (Thm. 2) that the stopping time diverges to $\infty$ when the required type II error $\to 0$, i.e. power $\to 1$.
7 Conclusion
We have presented a sequential scheme for multivariate nonparametric hypothesis testing against composite alternatives, which comes with a full finite-sample analysis in terms of on-the-fly estimable quantities. Its desirable properties include type I error control by considering finite-time LIL concentration; near-optimal type II error compared to linear-time batch tests, due to the iterated-logarithm term in the LIL; and most importantly, essentially optimal early stopping, uniformly over a large class of alternatives. We presented some simple applications in learning and statistics, but our design and analysis techniques are general, and their extensions to other settings are of continuing future interest.
References
 [1] Francis J Anscombe. Large-sample theory of sequential estimation. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 48, pages 600–607. Cambridge Univ Press, 1952.
 [2] Akshay Balsubramani. Sharp uniform martingale concentration bounds. arXiv preprint arXiv:1405.2639, 2015.
 [3] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.
 [4] DA Darling and Herbert Robbins. Iterated logarithm inequalities. Proceedings of the National Academy of Sciences of the United States of America, pages 1188–1192, 1967.
 [5] DA Darling and Herbert Robbins. Some further remarks on inequalities for sample sums. Proceedings of the National Academy of Sciences of the United States of America, 60(4):1175, 1968.
 [6] DA Darling and Herbert Robbins. Some nonparametric sequential tests with power one. Proceedings of the National Academy of Sciences of the United States of America, 61(3):804, 1968.
 [7] William Feller. An Introduction to Probability Theory and Its Applications: Volume One. John Wiley & Sons, 1950.
 [8] Tony Finch. Incremental calculation of weighted mean and variance. University of Cambridge, 4, 2009.
 [9] David A. Freedman. On tail probabilities for martingales. Ann. Probability, 3:100–118, 1975.
 [10] A. Gretton, K. Borgwardt, M. Rasch, B. Schoelkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012.
 [11] Allan Gut. Anscombe’s theorem 60 years later. Sequential Analysis, 31(3):368–396, 2012.
 [12] Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil’ UCB: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, 2014.
 [13] A. Ya. Khinchin. Über einen Satz der Wahrscheinlichkeitsrechnung. Fundamenta Mathematicae, 6:9–20, 1924.
 [14] Tze Leung Lai. Power-one tests based on sample sums. The Annals of Statistics, pages 866–880, 1977.
 [15] Tze Leung Lai, Zheng Su, et al. Sequential nonparametrics and semiparametrics: Theory, implementation and applications to clinical trials. Institute of Mathematical Statistics, 2008.
 [16] Hans Rudolf Lerche. Sequential analysis and the law of the iterated logarithm. Lecture Notes-Monograph Series, pages 40–53, 1986.
 [17] Po-Ling Loh and Sebastian Nowozin. Faster Hoeffding racing: Bernstein races via jackknife estimates. In Algorithmic Learning Theory, pages 203–217. Springer, 2013.
 [18] Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.
 [19] Volodymyr Mnih, Csaba Szepesvári, and Jean-Yves Audibert. Empirical Bernstein stopping. In Proceedings of the 25th International Conference on Machine Learning, pages 672–679. ACM, 2008.
 [20] R Peto, MC Pike, Philip Armitage, Norman E Breslow, DR Cox, SV Howard, N Mantel, K McPherson, J Peto, and PG Smith. Design and analysis of randomized clinical trials requiring prolonged observation of each patient. II. Analysis and examples. British Journal of Cancer, 35(1):1, 1977.
 [21] Sashank J. Reddi, Aaditya Ramdas, Barnabás Póczos, Aarti Singh, and Larry Wasserman. On the high dimensional power of a linear-time two sample test under mean-shift alternatives. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS 2015), 2015.
 [22] Herbert Robbins. Statistical methods related to the law of the iterated logarithm. The Annals of Mathematical Statistics, pages 1397–1409, 1970.
 [23] Herbert Robbins. Herbert Robbins Selected Papers. Springer, 1985.
 [24] Walter Rudin. Real and complex analysis. Tata McGraw-Hill Education, 1987.
 [25] William F Stout. A martingale analogue of Kolmogorov’s law of the iterated logarithm. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 15(4):279–290, 1970.
 [26] G.J. Székely, M.L. Rizzo, and N.K. Bakirov. Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769–2794, 2007.
 [27] Abraham Wald. Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics, 16(2):117–186, 1945.
 [28] Abraham Wald and Jacob Wolfowitz. On a test whether two samples are from the same population. The Annals of Mathematical Statistics, 11(2):147–162, 1940.
 [29] Abraham Wald and Jacob Wolfowitz. Optimum character of the sequential probability ratio test. The Annals of Mathematical Statistics, pages 326–339, 1948.
Appendix A Proof of Theorem 2
Proof.
Write as a placeholder absolute constant in the sense of Sec. 2. Then for any sufficiently high , our definitions for and tell us that
(15)  
(16) 
for , from (2.1) and the definition of . Also, using a Hoeffding bound on (15), we see that . So for any and ,
(17)  
(18) 
Here (17) sums the infinite geometric series with initial term , and (18) uses the inequality as well as . ∎
Appendix B Proof of Proposition 1
Proof of Proposition 1.
Since are all independent, . Next,
Since , we have
from which the result is immediate. ∎
Appendix C Proof of Theorem 3
We rely upon a variance-dependent form of the LIL. Upon noting that and , it is an instance of a general martingale concentration inequality from [2].
Theorem 7 (Uniform Bernstein Bound (Instantiation of [2], Theorem 4)).
Suppose w.p. for all . Fix any and define . Then with probability , for all simultaneously, and
In principle this tight control by the second moment is enough to achieve our goals, just as the second-moment Bernstein inequality for random variables suffices for proving empirical Bernstein inequalities.
However, the version we use for our empirical Bernstein bound is a more convenient though looser restatement of Theorem 7. To derive it, we refer to the appendices of [2] for the following result:
Lemma 8 ([2], Theorem 16).
Take any , and define and as in Theorem 7. With probability , for all simultaneously,
Theorem 3 follows by loosely combining the above two uniform bounds.
Proof of Theorem 3.
Recall . Theorem 7 gives that w.p. , for all , and
(19) 
Taking a union bound of (19) with Lemma 8 gives that w.p. , the following is true for all simultaneously:
For all we have bounded by the maximum of the two cases above. The result can be seen to follow, by relaxing the explicit bound to instead transform into . ∎
Appendix D Proportionality Constants and Guaranteed Correctness
After observing the first few samples, regardless of how many, it is impossible to empirically conclude with certainty that the type I error of a sequential test (Fig. 2) has ever leveled off. And although our theory can guarantee type I error control, it is reasonable to question whether our empirically recommended prescription is actually sound, even in the hypothetical case .
In fact, we can show that it is unsound. Consider first the biased coin example of Sec. 2. If is the test statistic, the number of type I error violators under the null is
w.p. 1, from the asymptotic LIL of Thm. 1.
So the sequential test will almost surely reject with , which is very undesirable. We still recommend this for two reasons.
Firstly, it appears not to be an empirical issue, because of