Accurate Inference in Adaptive Linear Models
Abstract
Estimators computed from adaptively collected data do not behave like their non-adaptive brethren. Rather, the sequential dependence of the collection policy can lead to severe distributional biases that persist even in the infinite data limit. We develop a general method – decorrelation – for transforming the bias of adaptive linear regression estimators into variance. The method uses only coarse-grained information about the data collection policy and does not need access to propensity scores or exact knowledge of the policy. We bound the finite-sample bias and variance of the estimator and develop asymptotically correct confidence intervals based on a novel martingale central limit theorem. We then demonstrate the empirical benefits of the generic decorrelation procedure in two different adaptive data settings: multi-armed bandits and autoregressive time series models.
1 Introduction
Randomized experiments have played a pivotal role in advancing our understanding in many fields of science and engineering. Throughout, we will assume that the data collected is in the form of samples $(y_i, x_i)_{i \le n}$. Here $y_i \in \mathbb{R}$ are the outcomes and $x_i \in \mathbb{R}^p$ is a vector of features or covariates associated with the $i$th sample. In the standard linear model, the outcomes and covariates are related through a parameter $\theta^* \in \mathbb{R}^p$ as:
$$y_i = \langle x_i, \theta^* \rangle + \epsilon_i. \qquad (1)$$
In this model, the ‘noise’ term $\epsilon_i$ represents inherent variation in the sample, or the variation that is not captured in the model. Parametric models of the type (1) are a fundamental building block in many regression and classification settings. A further common, and often critical, assumption is that the covariates $x_i$ are independent of the inherent variation $\epsilon_i$. This paper is motivated by experiments where the sample is not completely randomized but rather adaptively chosen. By adaptive, we mean that the choice of a new data point is guided by inferences on past data. Consider the following sequential paradigms:

Multi-armed bandits: This class of sequential decision-making problems captures the classical ‘exploration versus exploitation’ tradeoff. At each time $t$, the experimenter chooses an ‘action’ $x_t$ from a set of available actions and accrues a reward $y_t$, where $(y_t, x_t)$ follow the model (1). Here the experimenter must balance the conflicting goals of learning about the underlying model (i.e., $\theta^*$) for better future rewards, while still accruing reward in the current time step.

Active learning: In such settings, acquiring labels is costly, and the experimenter must learn with as few outcomes as possible. At time $t$, based on prior data, the experimenter chooses a new data point $x_t$ to label based on its value in the learning problem.

Time series analysis: Here, the data points are naturally ordered in time, with $(y_t)_{t \ge 1}$ denoting a time series; the covariates $x_t$ can include observations from past time points.
Here, time induces a natural sequential dependence across the samples. In the first two instances, the actions or policy of the experimenter are responsible for creating such dependence. In the case of time series data, this dependence is endogenous, and a consequence of the modeling. A common feature, however, is that the choice of the design or covariate sequence is typically not made with a view to inference on the model after the data collection is completed. This does not, of course, imply that accurate estimates of the parameters cannot be made from the data. Indeed, it is often the case that the sample is informative enough to extract consistent estimators of the underlying parameters, and this is often crucial to the success of the experimenter’s policy. For instance, notions such as ‘regret’ in sequential decision-making or the risk in active learning are intimately connected with the accurate estimation of the underlying parameters (Castro and Nowak, 2008; Audibert and Bubeck, 2009; Bubeck et al., 2012; Rusmevichientong and Tsitsiklis, 2010). Our motivation is the natural follow-up question of accurate ex post inference in the standard statistical sense:
Can adaptive data be used to compute accurate confidence regions and $p$-values?
As we will see, the key challenge is that even in the simple linear model of (1), the distribution of classical estimators can differ from the central limit behavior predicted for non-adaptive designs. In this context we make the following contributions:

Decorrelated estimators: We present a general method to decorrelate arbitrary estimators constructed from the data. This construction admits a simple decomposition into a ‘bias’ and a ‘variance’ term. In comparison with competing methods, like propensity weighting, our proposal requires little explicit information about the data-collection policy.

Bias and variance control: Under a natural exploration condition on the data collection policy, we establish that the bias and variance can be controlled at nearly optimal levels. In the multi-armed bandit setting, we prove this under an especially weak averaged exploration condition.

Asymptotic normality and inference: We establish a martingale central limit theorem under a moment stability assumption. Applied to our decorrelated estimators, this allows us to construct confidence intervals and conduct hypothesis tests in the usual fashion.

Validation: We demonstrate the usefulness of the decorrelating construction in two different scenarios: multi-armed bandits (MAB) and autoregressive (AR) time series. We observe that our decorrelated estimators retain the expected central limit behavior in regimes where the standard estimators do not, thereby facilitating accurate inference.
2 Main results: decorrelation and inference
We focus on the linear model and assume that the data pairs $(y_i, x_i)$ satisfy:
$$y_i = \langle x_i, \theta^* \rangle + \epsilon_i, \qquad (2)$$
where $\epsilon_i$ are independent and identically distributed random variables with mean zero, variance $\sigma^2$, and bounded third moment. We assume that the samples are ordered naturally in time and let $\{\mathcal{F}_i\}_{i \ge 0}$ denote the filtration representing increasing information in the sample. Formally, we let the data points be adapted to this filtration, i.e. $(y_i, x_i)$ are measurable with respect to $\mathcal{F}_i$ for all $i$.
Our goal in this paper is to use the available data to construct ex post confidence intervals and $p$-values for individual parameters, i.e. entries of $\theta^*$. A natural starting point is the standard least squares estimate:
$$\hat{\theta}^{\mathrm{LS}} = (X^\top X)^{-1} X^\top y,$$
where $X \in \mathbb{R}^{n \times p}$ is the matrix with rows $x_i^\top$ and $y = (y_1, \dots, y_n)^\top$. When the data collection is not adaptive, classical results imply that the least squares estimate is asymptotically Gaussian: $(X^\top X)^{1/2} (\hat{\theta}^{\mathrm{LS}} - \theta^*) \Rightarrow \mathsf{N}(0, \sigma^2 I)$, where $\mathsf{N}(\mu, \Sigma)$ denotes the Gaussian distribution with mean $\mu$ and covariance $\Sigma$. Lai and Wei (1982) extend these results to the current scenario:
Theorem 1 (Theorems 1, 3 (Lai and Wei, 1982)).
Let $\lambda_{\min}(n)$ (resp. $\lambda_{\max}(n)$) denote the minimum (resp. maximum) eigenvalue of $X^\top X$. Under the model (2), assume that the errors $\epsilon_i$ have finite third moment and, almost surely, $\lambda_{\min}(n) \to \infty$ with $\log \lambda_{\max}(n) = o(\lambda_{\min}(n))$. Then, almost surely, $\hat{\theta}^{\mathrm{LS}} \to \theta^*$.
Further assume the following stability condition: there exists a sequence of non-random matrices $B_n$ such that $B_n^{-1} (X^\top X)^{1/2} \to I$ in probability. Then $(X^\top X)^{1/2} (\hat{\theta}^{\mathrm{LS}} - \theta^*) \Rightarrow \mathsf{N}(0, \sigma^2 I)$.
At first blush, this allows one to construct confidence regions in the usual way. More precisely, the result implies that $\hat{\sigma}^2 = \|y - X\hat{\theta}^{\mathrm{LS}}\|_2^2 / n$ is a consistent estimate of the noise variance. Therefore, the interval $\hat{\theta}^{\mathrm{LS}}_1 \pm 1.96\, \hat{\sigma}\, [(X^\top X)^{-1}]_{11}^{1/2}$ is an asymptotic 95% two-sided confidence interval for the first coordinate $\theta^*_1$. Indeed, this result is sufficient for a variety of scenarios with weak dependence across samples, such as when the covariates $x_i$ form a Markov chain that mixes rapidly. However, while the assumptions for consistency are minimal, the additional stability assumption required for asymptotic normality poses some challenges. In particular:

The rate of convergence to the asymptotic central limit theorem depends on the quantitative rate of the stability condition. In other words, variability in the inverse covariance can cause deviations from normality of the estimator (Dvoretzky, 1972). In finite samples, this can manifest itself in the bias of the estimator as well as in higher moments.
An example of this phenomenon is the standard multi-armed bandit problem (Lai and Robbins, 1985). At each time point $t$, the experimenter (or data collecting policy) chooses an arm $a$ and observes a reward with mean $\theta^*_a$. With $\theta^*$ denoting the vector of mean rewards, this falls within the scope of model (2), where the vector $x_t$ takes the value $e_a$ (the $a$th standard basis vector) if the arm or option $a$ is chosen at time $t$. (Strictly speaking, the model (2) assumes that the errors have the same variance, which need not be true for the multi-armed bandit as discussed; we focus on the homoscedastic case where the errors have the same variance in this paper.) Other stochastic bandit problems with covariates, such as contextual or linear bandits (Rusmevichientong and Tsitsiklis, 2010; Li et al., 2010; Deshpande and Montanari, 2012), can also be incorporated fairly naturally into our framework. For the purposes of this paper, however, we restrict ourselves to the simple case of multi-armed bandits without covariates. In this setting, ordinary least squares estimates correspond to computing sample means for each arm. The stability condition of Theorem 1 requires that the number of times a specific arm is sampled be asymptotically deterministic as the sample size $n$ grows large. This is true for certain regret-optimal algorithms (Russo, 2016; Garivier and Cappé, 2011). Indeed, for such algorithms, as the sample size grows large, each suboptimal arm is sampled on the order of $c \log n$ times, for a constant $c$ that depends on $\theta^*$ and the distribution of the noise. However, in finite samples, the dependence on $\theta^*$ and the slow convergence rate lead to significant deviation from the expected central limit behavior.
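As a simple illustration of this effect (a sketch, not the paper's exact simulation), the following code runs an epsilon-greedy two-armed bandit with equal mean rewards and records the average bias of the per-arm sample means; all parameter values here are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
k, n, reps, eps = 2, 200, 1000, 0.1
theta_star = np.zeros(k)  # equal mean rewards: any bias is purely an artifact of adaptivity

bias = np.zeros(k)
for _ in range(reps):
    counts, sums = np.zeros(k), np.zeros(k)
    for t in range(n):
        if t < k:                      # pull each arm once to initialize
            a = t
        elif rng.random() < eps:       # explore uniformly at random
            a = int(rng.integers(k))
        else:                          # exploit the current sample means
            a = int(np.argmax(sums / counts))
        counts[a] += 1
        sums[a] += theta_star[a] + rng.normal()
    bias += sums / counts - theta_star
bias /= reps
# Under greedy selection, arms are abandoned after unlucky draws, so the
# per-arm sample means tend to be biased downward on average.
```

Running such a simulation reproduces the negative bias of sample means documented by Villar et al. (2015) and Nie et al. (2017).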
Villar et al. (2015) studied a variety of multi-armed bandit algorithms in the context of clinical trials. They empirically demonstrate that sample mean estimates from data collected using many standard multi-armed bandit algorithms are biased. Recently, Nie et al. (2017) proved that this bias is negative for Thompson sampling and UCB. The presence of bias in sample means demonstrates that standard methods for inference, as advocated by Theorem 1, can be misleading when the same data is used for inference. As a pertinent example, testing the hypothesis “the mean reward of arm 1 exceeds that of arm 2” based on classical theory can be significantly affected by adaptive data collection.
The papers (Villar et al., 2015; Nie et al., 2017) focus on the finite sample effect of the data collection policy on the bias and suggest methods to reduce the bias. It is not hard to find examples where higher moments or tails of the distribution can be influenced by the data collecting policy. A simple, yet striking, example is the standard autoregressive (AR) model for time series data. In its simplest form, the AR model has one covariate, i.e. $x_t = y_{t-1}$, with a scalar parameter $\theta^*$. In this case:
$$y_t = \theta^* y_{t-1} + \epsilon_t. \qquad (3)$$
Here the least squares estimate is given by $\hat{\theta}^{\mathrm{LS}} = \sum_t y_t y_{t-1} / \sum_t y_{t-1}^2$. When $|\theta^*|$ is bounded away from 1, the series is asymptotically stationary and the estimate has Gaussian tails. On the other hand, when $1 - \theta^*$ is on the order of $1/n$, the limiting distribution of the least squares estimate is non-Gaussian and depends on the gap $1 - \theta^*$ (see Chan and Wei (1987) for a description in terms of the standard Wiener process). A histogram of the errors in two cases, stationary and (nearly) non-stationary, is shown on the left in Figure 1, where the nearly non-stationary case is clearly non-Gaussian.
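The near-unit-root phenomenon can be reproduced with a short simulation; the parameter values below are illustrative choices, not the ones used in Figure 1:

```python
import numpy as np

rng = np.random.default_rng(2)

def ar1_ols_errors(theta, n=100, reps=2000):
    """Sampling distribution of sqrt(n) * (theta_hat_LS - theta) for AR(1)."""
    errs = np.empty(reps)
    for r in range(reps):
        y = np.zeros(n + 1)
        for t in range(n):
            y[t + 1] = theta * y[t] + rng.normal()
        errs[r] = np.sqrt(n) * (np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2) - theta)
    return errs

def skew(a):
    a = (a - a.mean()) / a.std()
    return float(np.mean(a ** 3))

stationary = ar1_ols_errors(theta=0.5)   # |theta| well below 1: roughly Gaussian errors
near_unit = ar1_ols_errors(theta=0.99)   # 1 - theta on the order of 1/n: left-skewed errors
```

Comparing `skew(stationary)` and `skew(near_unit)` exposes the pronounced left skew of the least squares error distribution in the nearly non-stationary regime.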
On the other hand, using the same data, our decorrelating procedure is able to obtain estimates admitting Gaussian limit distributions, as evidenced in the right panel of Figure 1. We show a similar phenomenon in the MAB setting, where our decorrelating procedure corrects for the unstable behavior of the estimator (see Section 4 for details on the empirics). Delegating discussion of further related work to Section 3, we now describe this procedure and its motivation.
2.1 Removing the effects of adaptivity: decorrelation
We propose to decorrelate the estimator $\hat{\theta}$ by constructing:
$$\hat{\theta}^{d} = \hat{\theta} + W (y - X \hat{\theta}), \qquad (4)$$
for a specific choice of a ‘decorrelating’ or ‘whitening’ matrix $W \in \mathbb{R}^{p \times n}$. This is inspired by analogous constructions for high-dimensional linear regression by Zhang and Zhang (2014); Javanmard and Montanari (2014b, a); Van de Geer et al. (2014). As we will see, these ideas can be useful also in the present regime, where we keep $p$ fixed and let $n$ grow. By rearranging:
$$\hat{\theta}^{d} - \theta^* = (I - W X)(\hat{\theta} - \theta^*) + W \epsilon. \qquad (5)$$
We interpret $(I - W X)(\hat{\theta} - \theta^*)$ as a ‘bias’ term and $W \epsilon$ as a ‘variance’ term. This interpretation is aided by the following critical constraint on the construction of the whitening matrix $W$:
Definition 1 (Welladaptedness of ).
Without loss of generality, we assume that $(y_i, x_i)_{i \le n}$ are adapted to $\{\mathcal{F}_i\}_{i \ge 0}$. Let $\{\mathcal{G}_i\}_{i \ge 0}$ be a filtration such that $(x_i)_{i \le n}$ are adapted with respect to $\{\mathcal{G}_i\}$ and $\epsilon_i$ is independent of $\mathcal{G}_i$. We say that $W$ is well-adapted if the columns of $W$ are adapted to $\{\mathcal{G}_i\}$, i.e. the $i$th column $w_i$ is measurable with respect to $\mathcal{G}_i$.
With this in hand, we have the following simple lemma.
Lemma 2.
Assuming $W$ is well-adapted, as in Definition 1, $\mathbb{E}[W \epsilon] = 0$, and hence $\mathbb{E}[\hat{\theta}^{d} - \theta^*] = \mathbb{E}[(I - W X)(\hat{\theta} - \theta^*)]$.
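The decomposition (5) can be checked numerically. The sketch below assumes the form $\hat{\theta}^d = \hat{\theta} + W(y - X\hat{\theta})$ and uses an arbitrary whitening matrix, since the identity is purely algebraic and holds for any $W$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 3
theta_star = rng.normal(size=p)

X = rng.normal(size=(n, p))
eps = rng.normal(size=n)
y = X @ theta_star + eps

theta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # the initial (least squares) estimate
W = 0.01 * rng.normal(size=(p, n))             # an arbitrary p-by-n whitening matrix

theta_d = theta_hat + W @ (y - X @ theta_hat)               # decorrelated estimator (4)
bias_term = (np.eye(p) - W @ X) @ (theta_hat - theta_star)  # 'bias' term of (5)
variance_term = W @ eps                                     # 'variance' term of (5)
# The decomposition (5) is an exact algebraic identity:
lhs = theta_d - theta_star
rhs = bias_term + variance_term
```

When $W$ is additionally well-adapted, the variance term has mean zero, which is the content of Lemma 2.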
A concrete proposal is to trade off the bias, controlled by the size of $I - W X$, with the variance, which appears through $\mathrm{tr}(W W^\top)$. This leads to the following optimization problem:
$$\min_{W} \; \|I - W X\|_F^2 + \lambda\, \mathrm{tr}(W W^\top). \qquad (6)$$
Solving the above in closed form yields ridge estimators for $\theta^*$, and by continuity, also the standard least squares estimator. Departing from Zhang and Zhang (2014); Javanmard and Montanari (2014a), we solve the above in an online fashion in order to obtain a well-adapted $W$. We define the columns recursively in $i$:
$$w_i = \frac{(I - W_{i-1} X_{i-1})\, x_i}{\lambda + \|x_i\|_2^2},$$
where $X_{i-1}$ has rows $x_1^\top, \dots, x_{i-1}^\top$ and $W_{i-1}$ has columns $w_1, \dots, w_{i-1}$.
As in the case of the offline optimization, we may obtain closed form formulae for the columns $w_i$ (see Algorithm 1). The method as specified requires additional computational overhead, which is typically minimal compared to computing $\hat{\theta}$, or even a regularized version like the ridge or lasso estimate. We refer to $\hat{\theta}^{d}$ as a decorrelated estimate.
2.2 Implicit stochastic gradient descent in reverse
While we motivated decorrelation as an online procedure for optimizing the bias-variance tradeoff objective (6), it holds a dual interpretation as implicit stochastic gradient descent (SGD) (see, e.g., Kulis and Bartlett, 2010), also known as incremental proximal minimization (Bertsekas, 2011) or the normalized least mean squares filter (Nagumo and Noda, 1967) in this context, with step size $1/\lambda$ applied to the least-squares objective. Importantly, to obtain the well-adapted form of our updates, one must apply implicit SGD in reverse, starting with the final observation and ending with the initial observation; this recipe yields the parameter updates $b_{n+1} = \hat{\theta}$ and
$$b_t = b_{t+1} + \frac{x_t\,(y_t - \langle x_t, b_{t+1} \rangle)}{\lambda + \|x_t\|_2^2}.$$
Unrolling the recursion, we obtain the equivalent form $\hat{\theta}^{d} = b_1$, which precisely matches the updates given in Algorithm 1.
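A minimal sketch of this reverse-order implicit SGD (normalized LMS) pass, assuming the proximal update form just described; the data and regularization $\lambda$ are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, lam = 200, 2, 5.0
theta_star = np.array([0.3, -0.7])

X = rng.normal(size=(n, p))
y = X @ theta_star + rng.normal(size=n)
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # least squares starting point

# Reverse-order implicit SGD / normalized LMS: initialize at theta_hat and
# sweep from the last observation back to the first.
b = theta_hat.copy()
for t in range(n - 1, -1, -1):
    x_t = X[t]
    resid = y[t] - x_t @ b
    b = b + x_t * resid / (lam + x_t @ x_t)
    # Each implicit step shrinks the current residual by exactly
    # lam / (lam + ||x_t||^2), a property of the proximal update.
    shrink = (y[t] - x_t @ b) / resid
theta_d = b  # the decorrelated estimate
```

The exact residual-shrinkage factor follows directly from the update formula, which is one way to see why larger $\lambda$ (smaller steps) reduces variance at the cost of bias.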
2.3 Bias and variance
We now examine the bias and variance control for $\hat{\theta}^{d}$. We begin with a general bound for the variance:
Theorem 3 (Variance control).
For any $\lambda$ set non-adaptively, we have that
In particular, . Further, if for all :
This theorem suggests that one must set $\lambda$ as large as possible to minimize the variance. While this is accurate, one must take into account the bias of $\hat{\theta}^{d}$ and its dependence on the regularization $\lambda$. Indeed, for large regularization, one would expect $W$ to be small, which would not help control the bias. In general, one would hope to set $\lambda$, thereby determining $W$, at a level where the bias is negligible in comparison to the variance. The following theorem formalizes this:
Theorem 4 (Variance dominates MSE).
Recall that the matrix $W$ is a function of the regularization $\lambda$. Suppose that there exists a deterministic sequence such that under the collection policy:
(7)  
(8) 
Then we have
The conditions of Theorem 4, in particular the bias condition, are quite general. In the following proposition, we verify some sufficient conditions under which the premises of Theorem 4 hold.
Proposition 5.
The following conditions suffice for the requirements of Theorem 4.

The data collection policy satisfies for some sequence and for all :
(9) (10) for a large enough constant . The choice here is .

The matrices commute and with probability at least .
It is useful to consider the intuition for the sufficient conditions given in Proposition 5. By Lemma 2, note that the bias is controlled by $\|I - W X\|$, which increases with $\lambda$. Consider a case in which the samples $x_i$ lie in a strict subspace of $\mathbb{R}^p$. In this case, controlling the bias uniformly over $\theta^*$ is impossible regardless of the choice of $W$. As an example, in a multi-armed bandit problem, if the policy does not sample a specific arm, there is no information available about the reward distribution of that arm. Intuitively, the data collection policy should ‘explore’ most of the space, which is formalized in Proposition 5. For multi-armed bandits, one can interpret this assumption as a guarantee of ‘minimum exploration’. Indeed, policies such as epsilon-greedy and Thompson sampling (Thompson, 1933) satisfy it with appropriate choices of the exploration parameters.
Given sufficient exploration, Proposition 5 recommends a reasonable value for the regularization parameter. In particular, setting $\lambda$ to a value that lower bounds $\lambda_{\min}(X^\top X)$ with high probability suffices to ensure that the decorrelated estimate is approximately unbiased. Correspondingly, the MSE (or equivalently variance) of the decorrelated estimate need not be smaller than that of the original estimate: the variance scales as $1/\lambda$, which with high probability exceeds the $1/\lambda_{\min}(X^\top X)$ scaling for the MSE. This is the cost paid for removing most of the bias in the estimate. Before we move to the specific inference results, note that the procedure requires only access to high probability lower bounds on $\lambda_{\min}(X^\top X)$, which intuitively quantifies the exploration of the data collection policy. In comparison with methods such as propensity score weighting or conditional likelihood optimization, this represents rather coarse information about the data collection process. In particular, given access to propensity scores, one can simulate the process to extract appropriate values for the regularization $\lambda$. This is the approach we take in the experiments of Section 4.
Propensity score methods are also ineffective when data collection policies make adaptive decisions that are deterministic given the history. A pertinent example is that of UCB algorithms for bandits, which make history-dependent deterministic decisions to pull arms. Finally, we note that the cost of increased variance can be mitigated by the use of information on the data collection policy, as demonstrated in (Nie et al., 2017; Dimakopoulou et al., 2017). It is an interesting open problem to develop a middle ground between these approaches.
2.4 A central limit theorem and confidence intervals
Our final result is a simple central limit theorem that provides an alternative to the stability condition of Theorem 1 and standard martingale central limit theorems. We state it for martingales of the form of the variance term $W \epsilon$, as required, but a form for general martingales also holds true. We make the following crucial moment stability assumption:
Assumption 1.
For , and positive integer
Theorem 6 (Martingale CLT).
Let the martingale increments be given by an adapted noise sequence weighted by a predictable sequence. In addition to the stability Assumption 1, suppose that the noise variables are sub-Gaussian almost surely for a constant, and that the predictable weight sequence is bounded almost surely. Then the standardized martingale converges to a Gaussian limit. In particular, for any bounded, continuous function
where is independent of .
The boundedness assumptions are made for simplicity of the proof, which uses the usual Fourier-analytic approach to prove the central limit theorem (Billingsley, 2008). These can likely be relaxed significantly to standard third moment assumptions as in a Lyapunov central limit theorem. Assumption 1 is an alternate form of stability. The essence of this assumption is that it controls the dependence of the conditional covariance on the first two conditional moments of the martingale increments. In words, it states that conditioning on the conditional covariance does not change the first two moments of the random variables by much. In particular, this holds given a quantitative version of the stability condition of Lai and Wei (1982); Dvoretzky (1972): for instance, if a non-random sequence approximates the conditional covariance sufficiently well, then Assumption 1 holds.
With a central limit theorem in hand, one can now assign confidence intervals in the standard fashion, based on the assumption that the bias is negligible. For instance, it is not hard to show the following result on two-sided confidence intervals.
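As a sketch of the resulting interval construction, assuming the bias term is negligible and that a consistent noise-variance estimate is available; the values of the decorrelated estimate and of $W$ below are placeholders, not outputs of the actual procedure:

```python
import numpy as np

rng = np.random.default_rng(5)
p, n, z = 2, 100, 1.96             # z: 95% two-sided normal quantile
sigma_hat = 1.0                    # assumed consistent noise-variance estimate
theta_d = np.array([0.42, -0.17])  # a decorrelated estimate (placeholder values)
W = rng.normal(size=(p, n)) / n    # the whitening matrix used to build theta_d

# If the bias term is negligible, coordinate j of theta_d is approximately
# N(theta_j, sigma^2 * sum_i W[j, i]^2), giving the two-sided interval:
se = sigma_hat * np.sqrt(np.sum(W ** 2, axis=1))
lower, upper = theta_d - z * se, theta_d + z * se
```

The per-coordinate standard error comes directly from the variance term $W\epsilon$ with independent noise, which is what the martingale CLT makes rigorous.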
3 Related work
There is a long line of work in the statistics literature extending the results of Lai and Wei (1982) to a variety of different models and estimators (Wei, 1985; Lai, 1994; Chen et al., 1999; Heyde, 2008). These results are of a similar flavor to Theorem 1 in that they demonstrate consistency for natural estimators such as least squares or (quasi-)likelihood optimizers under weak assumptions, and appropriate central limit theorems under an additional stability assumption similar to that of Theorem 1. In the same vein, there is much work in statistics and econometrics on non-stationary time series, including testing for unit roots in autoregressive processes and inference on (near) unit root parameters (see (Shumway and Stoffer, 2006; Enders, 2008; Dickey and Fuller, 1979; Phillips and Perron, 1988) and references therein). Methods from this line of work are focused on specific time series models and do not extend to the general setup we consider. We instead focus on literature from sequential decision-making, offline learning, policy learning and causal inference that more closely resembles our work in terms of goals, techniques and scope of applicability.
The seminal work of Lai and Robbins (Robbins, 1985; Lai and Robbins, 1985) has spurred a vast literature in statistics and computer science on multi-armed bandit problems and sequential experiments that propose allocation algorithms based on confidence bounds (see Bubeck et al. (2012) and references therein). A variety of confidence bounds and corresponding rules have been proposed (Auer, 2002; Dani et al., 2008; Rusmevichientong and Tsitsiklis, 2010; Abbasi-Yadkori et al., 2011; Jamieson et al., 2014), based largely on tools from martingale concentration theory and the law of the iterated logarithm. Such results can certainly be used to compute valid confidence intervals. However, such bounds are conservative for a few reasons. First, they do not explicitly account for bias in estimates and, correspondingly, must be wider to account for it. Second, obtaining optimal constants in the concentration inequalities can require sophisticated tools even for non-adaptive data (Ledoux, 1996, 2005). This is evidenced in all of our experiments, which show that concentration inequalities yield valid, but overly conservative, intervals.
A closely related line of work is that of learning from logged data (Li et al., 2011; Dudík et al., 2011; Swaminathan and Joachims, 2015) and policy learning (Athey and Wager, 2017; Kallus, 2017). These works focus on efficiently estimating the reward (or value) of a certain test policy using data collected from a different policy. For linear models, this reduces to accurately estimating the prediction error, which is directly related to the estimation error on the parameters $\theta^*$. While our work shares some features, particularly resembling the techniques of Kallus (2017), we focus on unbiased estimation of the parameters and obtaining accurate confidence intervals for linear functions of the parameters. Some of the work on learning from logged data also builds on propensity scores and their estimation (Imbens, 2000; Lunceford and Davidian, 2004), which are well-studied in econometrics and causal inference. In particular, our techniques also closely resemble those of Athey et al. (2016); Wang and Zubizarreta (2017), which propose balancing covariates or residuals for causal inference in the potential outcomes framework.
As covered in the previous section, Villar et al. (2015) empirically demonstrate the presence of bias for a number of multi-armed bandit algorithms. Recent work by Dimakopoulou et al. (2017) also shows a similar effect in contextual bandits. Along with a result on the sign of the bias, Nie et al. (2017) also propose conditional likelihood optimization methods to estimate parameters of the linear model. Through the lens of selective inference, they also propose methods to randomize the data collection process that simultaneously lower the bias and reduce the MSE. Their techniques rely on considerable information about (and control over) the data generating process, in particular the probabilities of choosing a specific action at each point during data collection. This can be viewed as lying on the opposite end of the spectrum from our work, which attempts to use only the data at hand, along with coarse aggregate information on the exploration inherent in the data generating process.
4 Experiments
In this section we empirically validate the decorrelated estimators introduced in the previous section in two scenarios that involve sequential dependence in covariates. Continuing from Section 2, our first scenario is a simple experiment of multi-armed bandits with Gaussian rewards. The second scenario is that of autoregressive time series data. In these cases, we compare the coverage obtained by, and typical widths of, the confidence intervals for individual parameters that we obtain from the decorrelated estimators.
4.1 Multi-armed bandits
The stochastic multi-armed bandit problem is a sequential decision making problem in which, on each time step $t$, a decision maker carries out one of a finite set of actions (formalized as choosing $x_t = e_a$ for an arm $a$) and receives a stochastic reward related to the selected action. The rewards received then inform the subsequent actions taken. Villar et al. (2015) studied this problem in the context of patient allocation in clinical trials, where each data point represents a patient in a trial, each action corresponds to a treatment that can be administered, and each outcome $y_t$ represents a response following treatment. To demonstrate the utility of our decorrelated estimator in this setting, we reproduce a modified form of their simulation.
Specifically, we sequentially assign one of two treatments to each of the patients using one of three policies: (i) an epsilon-greedy policy (called ECB, or Epsilon Current Belief), (ii) a practical UCB strategy based on the law of the iterated logarithm (Jamieson et al., 2014), and (iii) Thompson sampling (Thompson, 1933). The ECB and TS sampling strategies are Bayesian. They place an independent Gaussian prior on each unknown mean outcome parameter and form an updated posterior belief following each treatment administration and observation. For ECB, the treatment administered to patient $t$ is, with probability $1 - \epsilon$, the treatment with the largest posterior mean; with probability $\epsilon$, a uniformly random treatment is administered instead, to ensure sufficient exploration of all treatments. Note that this strategy satisfies condition (9). For TS, at each patient $t$, a sample of the mean treatment effect is drawn from the posterior belief, and the treatment assigned to patient $t$ is the one maximizing the sampled mean treatment effect. In UCB, the algorithm maintains a score for each arm that is a combination of the mean reward that the arm achieves and the empirical uncertainty of the reward. For each patient $t$, the UCB algorithm chooses the arm maximizing this score, and updates the score according to a fixed rule. For details on the specific implementation, we refer the reader to Jamieson et al. (2014).
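The Gaussian posterior update underlying ECB and TS is conjugate, so it has a simple closed form; a minimal sketch, assuming a known homoscedastic noise variance as in our model:

```python
import numpy as np

def posterior_update(mu, tau2, y, sigma2=1.0):
    """Conjugate update of a Gaussian belief N(mu, tau2) on a mean outcome
    parameter after observing one outcome y with known noise variance sigma2."""
    tau2_new = 1.0 / (1.0 / tau2 + 1.0 / sigma2)  # precisions add
    mu_new = tau2_new * (mu / tau2 + y / sigma2)  # precision-weighted mean
    return mu_new, tau2_new

# Example: a standard normal prior updated with a single observation y = 2.0
# yields posterior mean 1.0 and posterior variance 0.5.
mu, tau2 = posterior_update(0.0, 1.0, 2.0)
```

ECB then acts greedily on the posterior means (with probability $1-\epsilon$), while TS draws one sample from each arm's posterior and plays the argmax.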
In this regime, with just 2 treatments, we can expect a purely random policy to faithfully provide valid confidence intervals using the classical theory. The setup here is chosen to show how the adaptivity in data collection causes the estimators’ behavior to dramatically differ from the classical theory predicted by Theorem 1 (Lai and Wei, 1982).
We repeat this simulation many times. From each simulation, we estimate the parameters using both ordinary least squares and our decorrelated estimator, with $\lambda$ set to a lower percentile of $\lambda_{\min}(X^\top X)$ achieved by the policy. This choice is guided by Corollary 4. We compare the quality of decorrelated estimator confidence regions, Gaussian OLS confidence regions (‘OLS_gsn’), and the multi-armed bandit concentration inequality regions (‘OLS_conc’) proposed in (Abbasi-Yadkori et al., 2011, Sec. 4). Figure 2 (left column) shows that the Gaussian OLS lower tail regions typically overestimate coverage, while upper tail regions underestimate it. This is consistent with the observation that the sample means are biased negatively (Nie et al., 2017). The concentration OLS tail bounds are all very conservative, producing nearly 100% coverage, irrespective of the nominal level. Meanwhile, the decorrelated intervals provide faithful empirical coverage at all nominal levels for every scenario, except for a few cases in Thompson sampling.
Figure 2 (right column) shows the quantile-quantile plots of the OLS and decorrelated estimator errors for each parameter. As in the AR experiment of the next section, the distribution of OLS errors is distinctly non-Gaussian, with considerable excess kurtosis for every policy. Conversely, for the decorrelated estimator the excess kurtosis is vastly reduced for every policy; indeed, it is nearly zero for ECB and UCB. For Thompson sampling, the excess kurtosis is reduced by a factor of at least 4 compared to the initial values.
Figure 3 shows the mean interval widths computed by the classical OLS theory, the concentration bounds, and decorrelation, for both arms. While decorrelation has, on average, wider confidence intervals, they compare favorably with those provided by the concentration inequalities, particularly in the moderate confidence regimes. Moreover, the variation in mean widths is very small compared with that of the OLS-based methods. This shows that the stability condition assumed in Theorem 1 fails to hold for the standard methods. It would be interesting to show that the stability instead holds for decorrelation under general conditions.
4.2 Autoregressive time series
In this section we generate data according to the classical AR model, where
$$y_t = \sum_{j=1}^{d} \theta^*_j\, y_{t-j} + \epsilon_t. \qquad (11)$$
Here we consider the case $d = 1$, and demonstrate similar results for higher orders in the Supplementary Material. We generate data from this model across a range of parameter settings. As shown in Figure 1, the Gaussianity of the OLS estimate is intimately related to the stationarity of the series, in particular how close $\theta^*$ is to $1$ relative to the number of data points $n$, i.e. the length of the time series. We evaluate the hypothesis testing procedures in terms of the following metrics: the lower empirical coverage, for various choices of the nominal confidence level, the upper empirical coverage probability, and the typical width of the confidence region.
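The coverage metrics just described can be computed as follows; a small sketch with hypothetical interval endpoints:

```python
import numpy as np

def empirical_coverage(theta_true, lowers, uppers):
    """Lower, upper, and two-sided empirical coverage of a collection of
    confidence intervals [L_r, U_r] for a scalar parameter, plus mean width."""
    lowers, uppers = np.asarray(lowers, float), np.asarray(uppers, float)
    lower_cov = float(np.mean(theta_true >= lowers))  # fraction with theta >= L
    upper_cov = float(np.mean(theta_true <= uppers))  # fraction with theta <= U
    two_sided = float(np.mean((theta_true >= lowers) & (theta_true <= uppers)))
    width = float(np.mean(uppers - lowers))
    return lower_cov, upper_cov, two_sided, width

# Toy check: intervals [0, 2], [1, 3], [2, 4] for a true value of 1.5.
lc, uc, ts, w = empirical_coverage(1.5, [0, 1, 2], [2, 3, 4])
```

A well-calibrated procedure should produce lower and upper empirical coverages close to the nominal one-sided levels, with small average width.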
We plot the coverage for various nominal values in Figure 4 and the empirical widths (with standard errors on the widths) in the right panel of Figure 4. The QQ plot of the error distributions in the bottom right panel of Figure 4 shows that the OLS errors are skewed downwards, while the decorrelated errors are appropriately Gaussian. We obtain the following improvements over the comparison methods of standard OLS errors (labeled ‘OLS’) and concentration inequalities (labeled ‘Conc’) from Abbasi-Yadkori et al. (2011):

The Gaussian confidence regions consistently underestimate the nominal coverage probability in the lower tail and overestimate it in the upper tail. Meanwhile, the concentration inequalities provide very conservative intervals that cover with nearly 100% probability, irrespective of the nominal level. In contrast, our decorrelated intervals achieve empirical coverage that closely approximates the nominal confidence levels.

In terms of width comparisons, our estimated widths compare favorably to the OLS widths, as they are not much larger, and they are typically smaller than those obtained via concentration inequalities. Note also that the widths for OLS and the concentration bounds vary significantly over many runs, while our widths concentrate much better.

In Figure 4, we also plot the empirical coverage widths (labeled ‘OLS-Emp’) for the OLS error distribution as measured directly from the histogram in the left panel. Note that this can only be done because we know the model exactly, but it serves as an oracle bootstrap benchmark. Our confidence widths compare quite favorably to this oracle bootstrap as well.
We also include results of a (nearly) non-stationary AR(2) experiment in Figure 5. All the above conclusions also hold for the AR(2) model.
5 Proofs
5.1 Proofs of Theorems 3 and 4
The proofs of the main results rely on the following simple lemma.
Lemma 8.
Consider the estimate as defined in Algorithm 1. Assume . Then for any ,
(12) 
Proof.
This follows directly from the fact that and the following formula for :
(13) 
which implies:
(14) 
The result follows as is bounded uniformly. ∎
Proof of Theorem 3.
We have:
(15)  
(16) 
where in the second line we use Lemma 8 and sum over the telescoping series in . The result follows. ∎
Proof of Theorem 4.
From Lemma 2 and CauchySchwarz we have that
(17) 
Using Theorem 1, the second term is bounded by . We first show that this term is at most , under the conditions of Theorem 4. First, note that
(18)  
(19)  
(20) 
With this and condition 8, we have that:
(21)  
(22) 
Combining with condition 7 and Theorem 3 gives the result. ∎
We split the proof of Proposition 5 for the different conditions independently in the following lemmas.
Lemma 9.
Proof.
Lemma 10.
If the matrices commute, we have that
(28) 
Proof.
From the closed form in Lemma 8 and induction, we get that:
(29) 
The scalar equality extends to commuting matrices . Applying this to the terms in the product above, which commute by assumption:
(30)  
(31) 
using the fact that each factor is a contraction. Finally, employing commutativity and the fact that the relevant quantity is the minimum eigenvalue of $X^\top X$, the desired result follows. ∎
We can now prove Proposition 5.
Proof of Proposition 5.
We need to verify conditions 7 and 8 in both cases. Using either Lemma 9 or 10, with the appropriate choice of $\lambda$, we have that
(32) 
thus obtaining condition 7. In fact, this can be made polynomially small with a constant factor smaller choice of . Condition 8 only needs to be verified for the case of Lemma 9 or condition 9. It follows from a standard application of the matrix Azuma inequality Tropp (2012), the fact that and the fact that are bounded. ∎
5.2 Proof of Theorem 6: Central limit theorem
It suffices to show the case for of the theorem. In this case, it is not hard to show that the moment stability assumption of the main article subsumes the condition of the following:
Theorem 11.
Let be a martingale difference sequence, adapted to the filtration , with the predictable conditional covariance process . Define and . Suppose, additionally, that satisfies:

For all $t$, the increments are conditionally sub-Gaussian for some constant, almost surely.

For any fixed ,
(33) (34)
Then, for any bounded and continuous function we have that,
(35) 
where is independent of .
Proof.
By Lévy’s continuity theorem it suffices to consider the subclass of functions for . Note that, since is independent of , a.s. for all and . The proof is mostly standard. Using the boundedness of , note that a.s. Furthermore, for all , by definition. For simplicity, define the following errors that will ultimately be controlled by the conditions of the theorem:
(36)  
(37) 
We will show that