Bootstrapping Generalization Error Bounds for Time Series
We consider the problem of finding confidence intervals for the risk of forecasting the future of a stationary, ergodic stochastic process, using a model estimated from the past of the process. We show that a bootstrap procedure provides valid confidence intervals for the risk, when the data source is sufficiently mixing, and the loss function and the estimator are suitably smooth. Autoregressive (AR()) models estimated by least squares obey the necessary regularity conditions, even when mis-specified, and simulations show that the finite-sample coverage of our bounds quickly converges to the theoretical, asymptotic level. As an intermediate step, we derive sufficient conditions for asymptotic independence between empirical distribution functions formed by splitting a realization of a stochastic process, of independent interest.
t1Supported by grants from the Institute for New Economic Thinking (IN01100005, INO1400020) and from the National Science Foundation (DMS1207759, DMS1418124).
class=MSC] \kwd[Primary ]60K35 \kwd60K35 \kwd[; secondary ]60K35
dependent data \kwdstatistical learning \kwdweak convergence \kwdbootstrap
Suppose we have observed data points from a stationary stochastic process , and want to predict using a model estimated from the observations. As in any other statistical prediction problem, we want to be able to evaluate how well we will forecast, and in particular make an estimate of the expected performance, or risk, of the model, i.e., estimate the expected loss for some suitable loss function. While several different notions of risk have been proposed for time series [40, 27], we follow Shalizi and Kontorovich , defining risk as the long-term average of instantaneous expected losses, conditioned on the observations:
where it is understood that is estimated using only , but that when predicting , its inputs may come from any time up to .111Thus, for example, if we used an AR(1) model, with parameter , the estimator would be a function of alone, but the prediction at time would be . Shalizi and Kontorovich  show that this is well-defined for ergodic sources. Accepting this notion of risk, how might we estimate it? Suppose takes as input some fixed vector and let . Clearly, the in-sample performance of on the training data :
will generally be overly optimistic an estimate of out-of-sample performance, precisely because the model has been adjusted to fit that specific time series (and not another). We thus wish to know the “generalization error” of the model . It is particularly helpful to control the probability that exceeds any given value, i.e., to probabilistically bound how much worse than its in-sample performance might really be.
In this paper, we give asymptotically-valid generalization error bounds for time-series forecasting. To do this, it is clearly sufficient to have a consistent estimator for the quantile of the generalization error , since that lets us give a probabilistic guarantee for worst-case out-of-sample performance of our model:
Following an idea proposed in McDonald [32, sec. 8.3], we achieve this by using the block bootstrap  to directly simulate estimating a model from a time series, and then evaluating the fitted model’s ability to forecast the continuation of the same realization of the process. (See Section 2.3, Algorithm 1 for a precise statement of the algorithm.) Therefore, the main theoretical question that we consider is bootstrap consistency of our estimator of the generalization error . By this, we mean showing that the centered, normalized bootstrap process,
converges in distribution, conditional on the data, to the same distribution as the true sampling distribution,
where denotes expectation with respect to the bootstrap measure, and the precise definition of is postponed to Section 3.1.
Our results focus on models whose estimators are expressible as plug-in estimators of some -dimensional marginal distribution of the stochastic process . In this framework, we can also view the generalization error itself as a functional of distribution functions, enabling the use of weak convergence results from empirical process theory. In contrast to the constraints on the model and its estimator, essentially the only limit we put on the data source is that it has to be mixing (i.e., all correlations must decay) at a polynomial rate. In particular, we do not assume that the model is in any way well-specified. An important part of our proof is establishing sufficient conditions for the asymptotic independence of empirical processes formed by splitting one realization of a stochastic process. This was implications beyond the current problem, e.g., it allows us to show the asymptotic Gaussianity and consistency of -fold cross-validation for the risk with dependent data (Appendix B).
We show (Section 3.3) that our conditions on the model hold for linear autoregressive (AR()) models estimated by least squares — again, whether or not the actual data-generating process is a linear autoregression. Simulation studies (Section 4) indicate that convergence to the asymptotic coverage levels is quite rapid, even when the model is highly mis-specified.
1.1 Related work
If the data source were not just stationary but independent and identically distributed (IID), the definition of would reduce to the usual , and we would be on familiar ground. If we want a point estimate of the risk, we might adjust the in-sample performance through the analytical approximations of the various information criteria , or we might turn to computational procedures which attempt to directly simulate extrapolating to new observations from the same distribution, such as cross-validation  or the bootstrap 222Bunke and Droge  compares the properties of cross-validation and a bootstrapped generalization error, but in a theoretical study involving IID linear regression with Gaussian errors. Shao  studies the theoretical properties of the bootstrapped generalization error for model selection consistency in the linear regression, generalized linear regression, and autoregressive model regimes. He considers a residual resampling scheme and shows that this bootstrap is not consistent for model selection, but an out of variant of the bootstrap is consistent.. If we want an interval estimate, there are again abundant analytical results from statistical learning theory, based on notions of Vapnik-Chervonenkis dimension, Rademacher complexity, etc. [53, 38]. Many of these approaches, however, require at the very least extensive re-thinking when applied to time series.
Thus, for example, while information criteria have been extensively studied for selecting the order of time series models (mostly AR and ARMA models) since at least the work of Shibata , Hannan , it is well-known that their penalty terms they add to the in-sample risk are only accurate under correct model specification . Similarly, while forms of cross-validation for time series exist [10, 21, 44, 45], we are unaware of work which studies their properties under model mis-specification. Both information criteria and cross-validation moreover only provide point estimates of the risk, rather than probabilistic bounds.
Extensive work has been done on extending the statistical-learning approach to deriving such “probably approximately correct” bounds to time series [35, 24, 37, 36, 27, 34], but this strand of work has, from our point of view, two drawbacks. The first is that the bounds derived hold uniformly over very wide ranges of processes. This is obviously advantageous when one knows little about the data source, but means that the bounds are often extremely conservative. The second drawback is that bounds often involve difficult-to-calculate quantities, especially the beta-mixing coefficients which quantify the decay of correlations in the stochastic process (see Definition 3.1). These are typically unknown, though not inestimable . In contrast, our bounds, while only asymptotically valid, are distribution-dependent and fully calculable, though we do need to make a (weak) assumption about the mixing coefficients.
2 Proposed Estimator
In this section, we describe our algorithm in greater detail. We begin by discussing the resampling procedure, which will be used to generate both the training and the test sets. The circular block bootstrap variant that we will use can be found in Lahiri .
2.1 The Circular Block Bootstrap (CBB)
Suppose the functional of interest is a function of a -dimensional distribution function. For each observation , generate a âchunkâ given by . If an index exceeds , change the value of the index to . By wrapping the data around a circle, we ensure that each point is equally likely to be resampled.333Had we ignored chunks with indices exceeding , we would have had a moving-blocks bootstrap, which has the undesirable property that the bootstrap expected value no longer equals the sample mean.
To capture the dependence structure in the data, we will need to resample contiguous blocks of the chunks, given by , where is the block-length. Suppose the desired resample size is . Then we will resample blocks, where the blocks are uniformly chosen from . Let denote the indices corresponding to the selected blocks. The result of the bootstrap procedure is the vector given by . If the length exceeds , truncate entries at the end of the vector as needed. We can expand each to form the data matrix given by:
Note that it is also possible to perform a CBB procedure directly on to generate a bootstrapped series , and then construct the bootstrapped data matrix. This approach, however, suffers from the fact that some rows in bootstrapped data matrix contain observations from two different blocks, leading to worse finite sample performance.
2.2 Computing the Generalization Error Bound
Our proposed procedure is described in Algorithm 1, which follows McDonald [32, sec. 8.3]. Intuitively, we use the CBB to generate a training and test set. We then compute an estimator of the generalization error, and take the quantile of this estimator across all bootstrap iterations and use this to a construct confidence interval for the risk.
2.3 Choosing the Block Length
The theoretical results in the next section are compatible with a range of block-length sequences, and so allow for data-driven block-length selection. To our knowledge, there are no procedures in the literature that provide a consistent estimate of the optimal block length for quantile estimation of a general Hadamard-differentiable function. (The methods of Hall, Horowitz and Jing  and Lahiri, Furukawa and Lee  only apply to functionals in the “smooth function of the mean” framework.) For convenience, we have used the procedure of Politis and White , which is optimized for the sample mean but performs adequately in our simulation study444In additional simulations, not shown here, we used block-lengths of the form for a large grid of values of . The procedure of Politis and White  was at least comparable, in terms of coverage, to the best-in-retrospect ..
3 Main Results
In this section, we will prove the main theorem, which establishes sufficient conditions for bootstrap consistency of an estimator of the generalization error to hold. Let denote the test error. We will consider the case where . The sampling distribution that we would like to approximate with our bootstrap estimator is that of .
To derive limiting distributions, we will need assumptions about the dependence structure. We will assume that the process has a -mixing coefficient that decays at least at a cubic rate. While there are several equivalent definitions of the -mixing coefficient, below we state the version we find to be most intuitive for strictly stationary processes.
Definition 3.1 (-mixing coefficient for stationary processes).
Let be a stationary stochastic process and be the probability space induced by the Kolmogorov extension. The coefficient of absolute regularity, or -mixing coefficient is given by:
where is the total variation norm, is the joint distribution of the blocks and and is the product measure between the two blocks. We say that a process is -mixing if .
From the definition, we can see that implies asymptotic independence as the gap between the past and future of the process grows.
We will also need to impose a differentiability condition on the estimated parameters , which take as input a -dimensional distribution function. Since the input is an infinite-dimensional object, we will need a generalized notion of derivative. The notion we use is tangential Hadamard differentiability, which we define below:
Definition 3.2 (Tangential Hadamard derivative).
A map is Hadamard differentiable at , tangential to a set if there exists a continuous linear map such that:
as for all converging sequences and .
Finally, let denote the space of bounded functions mapping from to . If the second argument is omitted, take to be . Now we are ready to state the main theorem.
Let be observations from a strictly stationary -mixing process supported on for some . Suppose our loss function can be expressed as , where and , where is a subset of a normed space. Assume that and are generated by the CBB procedure described in Section 2 with the block size . Further, let , where and . Also assume the following conditions:
is appropriately tangentially Hadamard differentiable with respect to at and is differentiable with respect to with bounded mixed partial derivatives up to the th order on
is appropriately tangentially Hadamard differentiable at
Then, the bootstrap is consistent for .
3.2 Proof of Main Theorem
We will show bootstrap consistency using the functional delta method for the bootstrap. We will define additional notions needed to state this theorem below. The definitions we adopt here can be found in Kosorok .
Bootstrap consistency is known to follow from a certain conditional weak convergence, which we will define below. For ease of exposition, we do not address measurability issues in our definitions; again, the reader is referred to van der Vaart and Wellner  or Kosorok  for details.
We say that converges weakly in probability conditional on the data, or , if where is the space of bounded functions with Lipschitz norm , and is conditional expectation over the weights given the data.
Proposition 3.5 (Functional Delta Method for the Bootstrap, Kosorok , Theorem 12.1).
For normed spaces and , let be a Hadamard differentiable map at , tangential to , with derivative . Let and have values in , with , where is tight and takes values in for some sequence of constants , the maps are measurable for every almost surely, and where for a constant . Then .
As discussed in Lahiri , we will construct chunks of size such that and consider functionals of the distribution function of . Let . In addition, let denote the empirical measure for the training data and denote the empirical measure for the test data, where . Similarly, let and denote the bootstrap measure for the training set and the test set, respectively, where each measure is conditioned on points.
For , let the operator be defined element-wise: that is, iff for all . If we let be the function class given by:
It follows that the empirical distribution function for the training set and test set can be expressed point-wise as and , respectively. Analogous expressions hold for bootstrap measures. To represent the entire distribution function, we can view and as elements of and and as elements of , where is the probability space associated with the bootstrap weights. For a fixed sample path, we may view these mappings as belonging to .
Our first step is to translate and into plug-in estimators. This is done in the following lemma:
Let , and define the mappings:
Then, and are plug-in estimators for and , respectively. Furthermore, suppose that is a stationary -mixing process satisfying and is a function of and such that . Then for each :
Proof of Lemma
The fact that the training and test errors are plug-in estimators of functionals given in (10) and (11) follows immediately by inspection. To see (13), consider the measurable space , where superscripts denote product spaces, and let be a sequence of probability measures corresponding to the distribution of . By the definition of , it follows that:
where is the product measure between and . Now we will use a result from Schervish and Seidenfeld , stated in a less general form for current purposes:
Lemma 3.7 (Schervish and Seidenfeld ).
Let and be probability distributions on where is allowed to be . Define the space of histories , where and and let be a history. Suppose and permit regular conditional probabilities, denoted and , respectively. Then,
Applying this lemma with the assumed mixing conditions yields:
where is the distribution of the process and 555The fact that such an exists can be shown using the Open Mapping Theorem in functional analysis. See for example, Driver [14, Problem 25.23]., which implies that , by Borel-Cantelli Lemma. Now, we will use the fact that for all bounded , implies that:
Applying this result to ,which we assumed to be bounded, we conclude that , which implies that the Cesaro mean also converges to the desired limit.
Going back to at least Akaike , a proposal for risk in the time series setting is , where and are observations from different runs of the process (and are therefore independent). While easier to work with than many proposals for the conditional risk, it suffers from the fact that we are often interested in how our model will perform on future data from the same realization of the process. Our result here provides sufficient conditions for which this convenient notion of risk is equivalent to that of Shalizi and Kontorovich .
Summability of -mixing coefficients ensures convergence of the shifted risk to the desired limit, which is stronger than what would be minimally needed to show Cesaro convergence. One could make a similar argument based on a uniform bound on the decay of correlations in Cesaro-mean, but -mixing is used in the main theorem, so we will not introduce nonstandard notions of dependence here.
Now define , which represent centered bivariate empirical processes related to the empirical distribution functions and bootstrapped empirical distribution function, respectively:
To apply the functional delta method, we need to show that and converge to the same Gaussian Process, up to a multiplicative constant on the covariance function. We will derive the limiting distribution of in the lemma below.
Proof of Lemma
We will prove joint convergence using the following result from the literature:
Proposition 3.9 (van der Vaart and Wellner , 1.5.3).
Let and be asymptotically tight nets such that
for stochastic processes and . Then there exists versions of and with bounded sample paths and in .
That is, to show weak convergence of a vector-valued stochastic process, it suffices to show (a) marginal asymptotic tightness and (b) finite-dimensional convergence of arbitrary joint distributions. We will set , where is the function class corresponding to indicator functions defined in (9). To show (a), it is sufficient to prove that there exists a semimetric for which is totally bounded and
Generally, proving this condition is nontrivial. However, it is known to hold for the process under some additional conditions:
Let be a stationary sequence in a Polish space with marginal distribution P, and let be a class of functions in . Let . Suppose there exists a such that:
is permissible, VC and has envelope F satisfying .
Then in , where is a mean 0 Gaussian Process with covariance structure given in (21).
We can apply the result to each process marginally to show that (a) holds. See Appendix A.2 for additional details.
To show (b), we will use the Cramer-Wold device. We will need a central limit theorem for nonstationary triangular arrays satisfying some mixing condition to show univariate convergence. To this end, we will state a modification of a theorem of Ekstrom 666This paper uses the notion of weakly-approaching random variables , which does not require the existence of a limiting distribution; this is more general than what we need here., itself a modification of Politis, Romano and Wolf , which requires a slightly different notion of mixing:
Definition 3.11 (-mixing coefficient).
It is a well-known result in mixing theory that [8, Eq. 1.1].
Proposition 3.12 (Ekstrom , Theorem 1).
Let be a triangular array of random variables, with being the -mixing coefficients for the row, and being the sample mean of that row. Assume the following. For some ,
for some and all
Then, if exists,
Let , be a set of arbitrary elements of for any and let be arbitrary elements of . After applying Cramer-Wold, we are left with the following sample mean:
We will view as elements of the triangular array. B1 follows for our class . Although the process is now nonstationary, it still satisfies B2 since the mixing coefficient of is bounded by . Since we can take to be arbitrarily large and , (b) follows.
Now we will need to check that the limiting variance matches that of limiting Gaussian process defined in (21). Simple calculation reveals that:
The desired result will follow if we can show that the last two terms on the RHS goes to 0. We find conditions for this to be true below.
Suppose is a stationary stochastic process that is polynomially -mixing and let be the linear hull of . That is, is the function class corresponding to finite linear combinations of , given by:
where is the function class given in (9).Then, for any :
First, recall Davydov ’s inequality, which states that:
where . For any , . We can therefore set for an arbitrarily small . Letting and applying the above inequality, we see that:
Since we have shown both (a) and (b), we have the joint convergence of the empirical process defined in (18) to a tight element in . Furthermore, since each of the finite dimensional distributions are Gaussian, the limiting process is Gaussian. Asymptotic independence of the marginal Gaussian processes follows from asymptotic independence of finite dimensional distributions; see Kallenberg [23, Lemma 11.1].
When the stochastic process is compactly supported, this result holds for an arbitrary permissible VC class. In a more general setting, there is a trade-off between assumptions on moments and mixing rates.
For -dimensional distribution functions, this lemma holds under -mixing due to a result of Rio [47, Theorem 7.3].
Now, finite VC dimension, along with the assumed mixing conditions, allows us to invoke the following theorem due to Radulović :
Let be a stationary sequence taking values in for some with marginal distribution , and let be a class of functions in . Let Also assume that are generated by the CBB procedure with block size as , and that there exists , , and such that
is permissible, VC, and has envelope satisfying , and
Then in , where is a mean 0 Gaussian Process with covariance structure given in (21).
Therefore, it follows that:
See Appendix A.1 for details.
Since is permissible, by Yu  the measurability criteria are satisfied.
Finally, the functional delta method requires tangential Hadamard differentiability of at our limit point (F,F). Note that can be expressed as the following composition of mappings:
Due to the Chain Rule, which we will state below, it will suffice to establish tangential Hadamard differentiability for each of the intermediate mappings.
Proposition 3.15 (Chain Rule, van der Vaart and Wellner , Lemma 3.9.3).
If is Hadamard differentiable at tangentially to and is Hadamard differentiable at tangentially to , then is Hadamard differentiable at tangentially to with derivative .
Proposition 3.16 (Integration, van der Vaart and Wellner , Lemma 3.9.17).
Given a cadlag function and a function of bounded variation on an interval , define
Then, is Hadamard differentiable at each such that . The derivative is given by:
Similar reasoning holds for and Hadamard differentiability of the difference holds due to the fact that the derivative is a linear operator. Then by the Functional Delta Method for the bootstrap, the bootstrap is consistent for .
Adding an estimate of the quantile of the generalization error to the training error is not the only way to construct confidence intervals. In particular, one could also bootstrap the test error.
3.3 Application of main theorem to models under weak assumptions
The motivating example for this theorem is the setting in which we use an model to predict a stationary -mixing process. We will state a corollary of our theorem that establishes bootstrap consistency of our procedure in this setting under squared error loss.
We will start by establishing C3, appropriate tangential Hadamard differentiability of at . For notational convenience, let , where and . We will establish the required tangential Hadamard differentiability by showing that:
The expression on the LHS of (34) may be rewritten as:
For the first term in (36), there exists such that for all :
Since , this term goes to . Since and , the second term also converges to 0, and Hadamard differentiability at follows.
The only condition that remains to be checked is C4, the tangential Hadamard differentiability of the least squares estimator.
We will start by introducing some notation. Given a subset of a Banach space and another Banach space , let be the Banach space of uniformly bounded functions . Let be the subset consisting of all maps with at least one zero. Let be a map that assigns one of its zeros to each element . In the squared error case, represents the domain of the coefficients and represents the codomain of the estimating equations. Further let lin denote the linear closure of . Hadamard differentiability is established by the following:
Proposition 3.18 (Z-Estimators, van der Vaart and Wellner , Lemma 3.9.34).
Assume is uniformly norm-bounded, is one-to-one, possesses a and has an inverse (with domain ) that is continuous at 0. Let be Frechet differentiable at with derivative , which is one-to-one and continuously invertible on lin . Then the map is Hadamard differentiable at tangentially to the set of that are continuous at . The derivative is given by .
This result establishes Hadamard differentiability of the least squares estimator as a function of . We can view as a Hadamard differentiable mapping of distribution functions to complete the proof. We will show this last step in the proposition below:
Let , be the least squares estimator with parameters, where is a subset in which integration is well-defined. Let the th coordinate of be given by:
Then, is Hadamard differentiable at a distribution function , tangentially to 777Since has bounded variation in the Hardy-Krause sense, this tangent set is sufficient for well-definition of the integral by an integration by parts argument. For details, see Remark 6., with derivative of the th coordinate given by: