Prediction of dynamical time series using kernel based regression and smooth splines

Abstract

Prediction of dynamical time series with additive noise using support vector machines or kernel based regression is consistent for certain classes of discrete dynamical systems. Consistency implies that these methods are effective at computing the expected value of a point at a future time given the present coordinates. However, the present coordinates themselves are noisy, and therefore these methods are not necessarily effective at removing noise. In this article, we consider denoising and prediction as separate problems for flows, as opposed to discrete time dynamical systems, and show that the use of smooth splines is more effective at removing noise. The combination of smooth splines and kernel based regression yields predictors that are typically more accurate on benchmarks by a factor of 2 or more. We prove that kernel based regression in combination with smooth splines converges to the exact predictor for time series extracted from any compact invariant set of any sufficiently smooth flow. As a consequence of convergence, one can find examples where the combination of kernel based regression with smooth splines is superior by an even larger factor. The predictors that we analyze and compute operate on delay coordinate data and not the full state vector, which is typically not observable.

Department of Mathematics, University of Michigan (raymundo@umich.edu) (divakar@umich.edu).

1 Introduction

The problem of time series prediction is to use knowledge of a signal $x(t)$ up to the present time and infer its value at a time $T$ in the future, where the look-ahead $T$ is positive and fixed. A time series is not predictable if it is entirely white noise. Any prediction scheme has to make some assumption about how the time series is generated. A common assumption is that the observation is a projection of the state of a dynamical system with noise superposed [8]. Since the state of the dynamical system can be of dimension much higher than one, delay coordinates are used to reconstruct the state. Thus, the state at time $t$ may be captured as

$$\bigl( x(t),\; x(t-\tau),\; \ldots,\; x(t-(D-1)\tau) \bigr) \qquad (1.1)$$

where $\tau$ is the delay parameter and $D$ is the embedding dimension. Delay coordinates are (generically) effective in capturing the state correctly provided $D > 2d$, where $d$ is the dimension of the underlying dynamics [21].
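As a concrete illustration of how delay coordinate vectors are assembled from a scalar time series sampled at a uniform time step, the following sketch (hypothetical helper; NumPy assumed) builds the matrix of delay vectors and the corresponding targets a fixed look-ahead into the future.

```python
import numpy as np

def delay_embed(x, D, lag, lookahead):
    """Form delay vectors (x[i], x[i-lag], ..., x[i-(D-1)*lag]) and
    targets x[i + lookahead]; indices are in samples, not time units."""
    start = (D - 1) * lag                  # first index with a full history
    stop = len(x) - lookahead              # last index with a valid target
    rows = []
    for i in range(start, stop):
        rows.append(x[i - np.arange(D) * lag])
    X = np.array(rows)                     # shape (n_samples, D)
    y = x[start + lookahead:stop + lookahead]
    return X, y

# Toy usage on a smooth signal sampled with step 0.01.
t = np.arange(0, 50, 0.01)
x = np.sin(t) + 0.5 * np.sin(0.7 * t)
X, y = delay_embed(x, D=7, lag=10, lookahead=25)
```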

Farmer and Sidorowich [8] used a linear framework to compute predictors applicable to delay coordinates. It was soon realized that the nonlinear and more general framework of support vector machines would yield better predictors [15, 18, 19]. Detailed computations demonstrating the advantages of kernel based predictors were given by Müller et al [19] and are also discussed in the textbook of Schölkopf and Smola [22]. Kernel methods still appear to be the best, or among the best, for the prediction of stationary time series [16, 20].

A central question in the study of noisy dynamical time series is how well the noise can be removed to recover the underlying dynamics. Lalley, and later Nobel, [12, 14, 13] have examined hyperbolic maps of the form $x_{n+1} = F(x_n)$. It is assumed that observations are of the form $y_n = x_n + \epsilon_n$, where the $\epsilon_n$ are iid noise. They proved that it is impossible to recover $x_n$ from $y_n$, even if the available data is infinitely long, if the noise is normally distributed. However, if the noise satisfies $|\epsilon_n| \le a$ for a suitably small $a$, the underlying signal can be recovered. The recovery algorithm does not assume any knowledge of $F$. The phenomenon of unrecoverability is related to homoclinic points. If the noise does not have compact support, then with some nonzero probability it is impossible to distinguish between homoclinic points.

Lalley [14] suggested that the case of flows could be different from the case of maps. In discrete dynamical systems, there is no notion of smoothness across iteration. In the case of flows, the underlying signal will depend smoothly on time but the noise, which is assumed to be iid at different points in time, will not. Lalley’s algorithm for denoising relies on dynamics and, in particular, on recurrences. In the case of flows, we rely solely on smoothness of the underlying signal for denoising. As predicted by Lalley, the case of flows is different. Denoising based on smoothness of the underlying signal alone can handle normally distributed noise or other noise models. Thus, our algorithms are split into two parts: first the use of smooth splines to denoise, and second the use of kernel based regression to compute the predictor. Only the second part relies on recurrences.

Prediction of discrete dynamics, within the framework of Lalley [12], has been considered by Steinwart and Anghel [24] (also see [3]). Suppose $x_{n+1} = F(x_n)$ and $y_n = x_n + \epsilon_n$ is the noisy state vector. The risk of a function $f$ is defined as

$$\mathcal{R}(f) = \int \bigl\| f(x + \epsilon) - \bigl( F(x) + \epsilon' \bigr) \bigr\|^2 \, d\nu(\epsilon)\, d\nu(\epsilon')\, d\mu(x),$$

where $\nu$ is the distribution of the noise and $\mu$ is a probability measure invariant under $F$ and with compact support. Thus, the risk is a measure of how well the noisy future state vector can be predicted given the noisy current state vector. It is proved that kernel based regression is consistent with respect to this notion of risk for a class of rapidly mixing dynamical systems. Although the notion of risk does not require denoising, consistency of empirical risk minimization is proved for additive noise of compact support as in [12]. In the case of empirical risk minimization, compactness of the added noise is not a requirement imposed by the underlying dynamics but is assumed to make it easier to apply universality theorems.

Our results differ in the following ways. We consider flows and not discrete time maps. In addition, we work with delay coordinate embedding [21] and do not require the entire state vector to be observable. Finally, we prove convergence to the exact predictor, which goes beyond consistency. The convergence theorem we prove is not uniform over any class of dynamical systems. However, we do not assume any type of decay in correlations or rapid mixing. Non-uniformity in convergence is an inevitable consequence of proving a theorem that is applicable to any compact invariant set of a generic finite dimensional dynamical system [1, 9, 25]. This point is further discussed in section 2, which presents the main algorithm as well as a statement of the convergence theorem. Section 3 presents a proof of the convergence theorem.

In section 4, we present numerical evidence of the effectiveness of combining spline smoothing and kernel based regression. The algorithm of section 2 is compared to computations reported in [19], and the spline smoothing step is found to improve the accuracy of the predictor considerably. The numerical examples bring up two points that go beyond either consistency or convergence. First, we explain heuristically why it is not a good idea to iterate a single-step predictor several times to predict the state several steps ahead; it is much better to learn the multi-step predictor directly (see the sketch below). Second, we point out that no currently known predictor splits the distance vector between stable and unstable directions, a step which was argued to be essential for an optimal predictor by Viswanath et al [28]. The heuristic explanation for why iterating a single-step predictor is not a good idea relies on the same principle.
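To make the contrast concrete, the following sketch (hypothetical helper; any fitted regressor with a `predict` method, for example from scikit-learn, would do) shows the iterated strategy and, in the trailing comment, the direct strategy.

```python
import numpy as np

def iterate_single_step(model, v, m):
    """Iterate a fitted single-step predictor m times, feeding each
    prediction back into the delay vector; errors compound at every step.
    Assumes the prediction horizon equals the spacing of the delay
    coordinates and that v[0] is the most recent coordinate."""
    v = np.asarray(v, dtype=float).copy()
    for _ in range(m):
        nxt = float(model.predict(v.reshape(1, -1))[0])
        v = np.roll(v, 1)   # shift the history by one delay
        v[0] = nxt          # insert the newest (predicted) coordinate
    return v[0]

# The alternative advocated in this paper: train a separate model whose
# training targets are already m steps ahead, so prediction is a single
# evaluation,
#   y_hat = direct_model.predict(v0.reshape(1, -1))[0]
# rather than m compounding evaluations.
```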

The concluding discussion in section 5 points out connections to related lines of current research in parameter inference [16, 17] and optimal consistency estimates for stationary data [11].

2 Prediction algorithm and statement of convergence theorem

Let $\dot{u} = F(u)$, where $u \in \mathbb{R}^n$ and $F$ is sufficiently smooth, define a flow that may be limited to an open subset $U$ of $\mathbb{R}^n$ with compact closure. Let $\Phi_t(u_0)$ be the time-$t$ map with initial data $u_0$. It is assumed that $u(t)$, $t \ge 0$, is a trajectory of the flow whose initial point is $u_0$. Let $\mu'$ be a compactly supported invariant probability measure of the flow-map $\Phi_t$ for every $t$ and let $\mathcal{X}'$ be its support. It is assumed that the initial point $u_0$ is drawn from the measure $\mu'$. For $u_0 \in \mathcal{X}'$, the trajectory exists for all $t \ge 0$ and is unique. In addition, the flow is assumed to be ergodic with respect to the measure $\mu'$.

Let $P : U \to \mathbb{R}$ be a generic nonlinear projection. Let $x(t) = P(u(t))$ be the projection of the random trajectory $u(t)$. By the embedding theorem of Sauer et al [21], we assume that the delay coordinates give a diffeomorphism of the state space onto its image, implying that $u(t)$ can be recovered from the delay vector, with delay $\tau$,

$$\bigl( x(t),\; x(t-\tau),\; \ldots,\; x(t-(D-1)\tau) \bigr)$$

for $t \ge 0$. This delay vector is denoted by $\mathbf{x}(t)$.

As a consequence of the embedding, there is a measure $\mu$ compactly supported in $\mathbb{R}^D$ that corresponds to $\mu'$. The measure $\mu$ is ergodic and invariant under the flow lifted via the embedding. Denote the compact support of $\mu$ by $\mathcal{X}$. To every point in $\mathcal{X}'$ there corresponds a unique point in $\mathcal{X}$ and vice versa. Because the prediction algorithm is based on delay coordinates and not the state vector, it is more convenient to work in the embedding space and in terms of $\mu$ and $\mathcal{X}$. Therefore, we will rely on the bijective correspondence between $\mathcal{X}'$ and $\mathcal{X}$ and use the notation $\mu$ instead of $\mu'$ and $\mathcal{X}$ instead of $\mathcal{X}'$. With these conventions, $\mathbf{x}(t)$ can be thought of as a path in $\mathcal{X}$ with $\mathbf{x}(0)$ distributed according to $\mu$. Similarly, $x(t)$ can be thought of as a real-valued signal with $x(t)$ equal to the first component of $\mathbf{x}(t)$. In later arguments, the assumption that $\mathbf{x}(0)$ is $\mu$-distributed will be significant, and so will be the ergodicity of the flow with respect to $\mu$.

Given the signal $x(t)$, it is assumed that the recorded observations are $y_i = x(i\delta) + \epsilon_i$, where $\delta$ is the time step and the $\epsilon_i$ are iid noise. Following Eggermont and LaRiccia [5, 6], we assume that the noise has mean zero and a finite moment of sufficiently high order. To avoid inessential technicalities it is assumed that $\tau/\delta$ is an integer, so that the delay is an integral multiple of the time step $\delta$; we write $\tau = p\delta$ with $p$ a positive integer. Similarly, we assume $T = q\delta$ with $q$ a positive integer, where $T$ is the look-ahead into the future. The noisy delay coordinates are assumed to be available for $0 \le t \le N\delta$, which implies that the observation interval of $x(t)$ has length $N\delta + (D-1)\tau$.

The exact predictor is a function $\chi : \mathcal{X} \to \mathbb{R}$ such that $\chi(\mathbf{x}(t)) = x(t+T)$ for every $t$. Lemma 3 proves uniqueness and existence of the exact predictor $\chi$. The exact predictor corresponds to a fixed look-ahead $T$, but that dependence is not shown in the notation. The problem as considered by Müller et al [19] is to recover the exact predictor $\chi$ from the noisy observations $y_i$. Let $|r|_{\epsilon} = \max(0, |r| - \epsilon)$ denote Vapnik's $\epsilon$-loss function. The algorithm of Müller et al computes $f$ such that the functional

$$\frac{1}{N}\sum_{i} \bigl| y_{i+p} - f(\mathbf{y}_i) \bigr|_{\epsilon} \;+\; \lambda\, \|f\|_{\mathcal{H}}^2 \qquad (2.1)$$

is minimized for $f$ in the reproducing kernel Hilbert space $\mathcal{H}$ corresponding to the kernel $k$, where $\mathbf{y}_i = (y_i, y_{i-p}, \ldots, y_{i-(D-1)p})$ is the noisy delay vector. The kernel is assumed to be the Gaussian $k(\mathbf{u}, \mathbf{v}) = \exp\bigl( -\|\mathbf{u} - \mathbf{v}\|^2 / 2\sigma^2 \bigr)$. The kernel bandwidth parameter $\sigma$ and the Lagrange multiplier $\lambda$ are both determined using cross-validation. This method approximates the exact predictor for $T = \tau$. If $T = m\tau$ with $m > 1$, the approximation is iterated $m$ times. We will compare our predictor against that of Müller et al using some of the same examples and the same framework as they do in section 4.
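A minimal sketch of this baseline, assuming scikit-learn's SVR with an RBF kernel (its ε-insensitive loss corresponds to Vapnik's ε-loss; the helper name and the grids are illustrative, not prescribed by [19]):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# X: noisy delay vectors, y: noisy values one delay ahead (see delay_embed above).
def fit_svr_predictor(X, y):
    """Kernel regression with the epsilon-insensitive loss; the bandwidth
    (via gamma) and the regularization strength C are chosen by
    cross-validation, playing the role of the Lagrange multiplier."""
    grid = {
        "gamma": np.logspace(-3, 1, 5),     # kernel bandwidth
        "C": np.logspace(-1, 3, 5),         # inverse regularization
        "epsilon": [0.01, 0.05, 0.1],       # width of the insensitive tube
    }
    search = GridSearchCV(SVR(kernel="rbf"), grid, cv=5)
    search.fit(X, y)
    return search.best_estimator_
```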

In our algorithm, the first step is to apply spline smoothing. In particular, we apply cubic spline smoothing [4] to compute a function $\hat{x}(t)$ such that the functional

$$\frac{1}{N}\sum_{i} \bigl( y_i - g(i\delta) \bigr)^2 \;+\; \rho \int \bigl( g''(t) \bigr)^2 \, dt \qquad (2.2)$$

is minimum for $g = \hat{x}$ over the Sobolev space $W^{2,2}$ of twice-differentiable functions with the norm $\|g\|_{W^{2,2}}^2 = \|g\|_{L^2}^2 + \|g'\|_{L^2}^2 + \|g''\|_{L^2}^2$; the integral in (2.2) is over the observation interval. The smoothing parameter $\rho$ is determined using five-fold cross-validation. The minimizer $\hat{x}$ depends upon the noise-free signal $x(t)$ as well as the instantiation of the iid noise in the observations $y_i$. However, the dependence on the iid noise is not shown in the notation.
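A sketch of this smoothing step, assuming SciPy's smoothing-spline routine (available in SciPy 1.10 and later); the penalty is chosen by a simple five-fold cross-validation over a hypothetical grid.

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline
from sklearn.model_selection import KFold

def smooth_signal(t, y, lams=np.logspace(-6, 2, 20), n_folds=5):
    """Cubic smoothing spline: pick the penalty lam by 5-fold CV on
    held-out squared error, then refit on all samples."""
    best_lam, best_err = None, np.inf
    for lam in lams:
        err = 0.0
        for train, test in KFold(n_folds, shuffle=True, random_state=0).split(t):
            tr = np.sort(train)            # abscissas must be increasing
            spl = make_smoothing_spline(t[tr], y[tr], lam=lam)
            err += np.mean((spl(t[test]) - y[test]) ** 2)
        if err < best_err:
            best_lam, best_err = lam, err
    return make_smoothing_spline(t, y, lam=best_lam)

# Usage: xs = smooth_signal(t, y_noisy); denoised = xs(t)
```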

The second step of our algorithm is similar to the method of Müller et al. The predictor is computed as the minimizer over the reproducing kernel Hilbert space $\mathcal{H}$ of

$$\frac{1}{N}\sum_{i} \bigl( \hat{x}(t_i + T) - f(\hat{\mathbf{x}}(t_i)) \bigr)^2 \;+\; \lambda\, \|f\|_{\mathcal{H}}^2, \qquad (2.3)$$

where $t_i = i\delta$ and $\hat{\mathbf{x}}(t_i) = \bigl( \hat{x}(t_i), \hat{x}(t_i - \tau), \ldots, \hat{x}(t_i - (D-1)\tau) \bigr)$ is the delay vector formed from the spline-smoothed signal. Both the parameters $\sigma$ and $\lambda$ are determined using five-fold cross-validation. Here $T$, and therefore $q$, are fixed because we seek to approximate the exact predictor $\chi$ with look-ahead fixed at $T$. As explained in section 4, it is significant that the predictor directly optimizes with a look-ahead of $T$. Iterating a predictor with look-ahead $\tau$ a total of $T/\tau$ times gives worse predictions.
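A sketch of this second step, assuming scikit-learn's KernelRidge with an RBF kernel (squared loss plus an RKHS penalty); the bandwidth and the regularization are chosen by five-fold cross-validation, and the targets are taken a full look-ahead into the future rather than one step.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

def fit_kernel_predictor(X_smooth, y_future):
    """X_smooth: delay vectors built from the spline-smoothed signal.
    y_future:   smoothed signal values a look-ahead T in the future.
    Returns a Gaussian-kernel ridge regressor fit with 5-fold CV."""
    grid = {
        "alpha": np.logspace(-8, 0, 9),     # regularization (lambda)
        "gamma": np.logspace(-3, 1, 9),     # 1 / (2 sigma^2)
    }
    search = GridSearchCV(KernelRidge(kernel="rbf"), grid, cv=5)
    search.fit(X_smooth, y_future)
    return search.best_estimator_

# Usage: build X_smooth, y_future by applying delay_embed to the smoothed
# samples xs(t_grid), then predictor = fit_kernel_predictor(X_smooth, y_future).
```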

The second step (2.3) differs from the algorithm of Müller et al in using the spline-smoothed signal $\hat{x}$ in place of the noisy observations $y_i$. Our algorithm relies mainly on spline smoothing to eliminate noise. Yet another difference is that we use the least squares loss function in place of the $\epsilon$-loss function. This difference is a consequence of relying on spline smoothing to eliminate noise. As explained by Christmann and Steinwart [3], the $\epsilon$-loss function, Huber's loss, and the $L^1$ loss function are used to handle outliers. However, spline smoothing eliminates outliers, and we choose the least squares loss function because of its algorithmic advantages.

We now turn to a discussion of the convergence of the computed predictor to the exact predictor $\chi$. The first step is to assess the accuracy of spline smoothing. We quote the following lemma, which is a convenient restatement of a result of Eggermont and LaRiccia [5, 6] (see pages 132 and 133 of [6]). In the lemma, $W^{m,2}$ denotes the Sobolev space of $m$-times differentiable functions with the corresponding norm.

Lemma 1.

Assume . Suppose that is a signal defined for with . For , let , where and where are iid random variables. It is further assumed that , , and for some . Let be the spline that minimizes the functional

over . Assume

Let be the probability that

where the -norm is over the interval . Then .

Some remarks about the connection of this lemma to the algorithm given by (2.2) and (2.3) follow. First, the lemma assumes a fixed choice of the smoothing parameter (the relevant theorem in [5, 6] in fact allows it to lie in an interval). In our algorithm, the smoothing parameter is determined using cross-validation because of its practical effectiveness [29].

Second, the failure probability (which may be interpreted as the probability that spline smoothing fails to denoise effectively) depends on the signal and therefore on the particular trajectory. If it depended on the signal only through a bound on the $m$-th derivative, the bound would be uniform for all trajectories on the compact invariant set $\mathcal{X}$. The achievability part of Stone's optimality result [26] gives such a bound, but the algorithm in that proof does not appear practical. Proving a similar result for smooth splines based on the existing literature does not appear entirely straightforward. In the $L^2$ norm, some uniform bounds have been proved for smooth splines by Györfi et al [10]. A bound on the $L^2$ norm can be combined with a bound on the $m$-th derivative using a Sobolev inequality to obtain an $L^\infty$-norm bound. Although the rate of convergence would be slightly sub-optimal, it would suffice for our purposes. However, the result of Györfi et al is for expectations and not for convergence in probability, and an argument using Chebyshev's inequality does not give strong bounds.

The convergence analysis of the second half of the algorithm also alters the algorithm slightly. In particular, the use of cross-validation to choose parameters is not a part of the analysis. To state the convergence theorem, we first fix a target accuracy. By the universality theorem of Steinwart [23], we may choose a function in the reproducing kernel Hilbert space that approximates the exact predictor to within that accuracy on a compact domain that has a non-empty interior and contains the invariant set $\mathcal{X}$. The convergence theorem also makes a technical assumption that may always be satisfied by taking the relevant parameter small enough.

The choice of the kernel-width parameter $\sigma$ is important in practice. In the convergence proof, the choice of $\sigma$ is not explicitly considered. However, $\sigma$ still plays a role because the reproducing kernel Hilbert space depends upon it.

The regularization parameter $\lambda$ in (2.3) is fixed for the proof. Next we pick a covering of the invariant set $\mathcal{X}$ by boxes small enough that the variation of the approximating function (as well as that of the exact predictor and of another function defined later) within each box is bounded by the target accuracy.

Suppose $B_1, \ldots, B_M$ are boxes of this size that cover $\mathcal{X}$ in the manner hinted above. We next choose a time $t_0$ such that the measure of the trajectories (with respect to the ergodic measure $\mu$) that sample each one of the boxes adequately (in a sense that will be explained) is greater than a prescribed threshold whenever the time interval of the trajectory exceeds $t_0$.

The next parameter is a bound on the infinity-norm accuracy of the smooth spline as in Lemma 1. It is chosen small enough, relative to a constant specified later, for the estimates below to apply. The main purpose of increasing the number of observations is to make spline smoothing accurate. However, the proof also assumes a condition requiring the number of observations to be large enough.

Within this set-up, we have the following convergence theorem.

Theorem 2.

For the parameters chosen as above, the computed predictor approximates the exact predictor $\chi$ to within the prescribed accuracy when it is constructed (or learnt) from the observed signal, for initial conditions of $\mu$-measure greater than a prescribed threshold, and with probability (the probability of successful denoising in the spline-smoothing step) tending to $1$ in the limit of infinitely many observations.

Nonuniform bounds implying a form of weak consistency are considered by Steinwart, Hush, and Scovel [25]. However, the algorithm of (2.2) and (2.3) does not fit into the framework of [25]. The application of spline smoothing to produce $\hat{x}$ means that the smoothed data may not be stationary, and our method of analysis does not rely on verifying a weak law of large numbers as in [25]. The analysis summarized above and given in detail in the following section relies on $\infty$-norm bounds.

3 Proof of convergence

We begin the proof with a more complete account of how the embedding theorem is applied. Let $\dot{u} = F(u)$, where $u \in \mathbb{R}^n$ and $F$ is sufficiently smooth, be a flow. Let $\Phi_t(u_0)$ be the time-$t$ map with initial data $u_0$. Let $U \subset \mathbb{R}^n$ be an open set with compact closure. If $u_0 \in U$, it is assumed that $\Phi_t(u_0)$ is well-defined for $-(D-1)\tau \le t \le T$, where $D$ is the embedding dimension.

Assumption: For embedding dimension $D$ and a suitably chosen delay $\tau$, the map

$$E(u) = \bigl( P(u),\; P(\Phi_{-\tau}(u)),\; \ldots,\; P(\Phi_{-(D-1)\tau}(u)) \bigr)$$

is a diffeomorphism between $U$ and its image in $\mathbb{R}^D$. This assumption is generically true [21]. This map is called the delay embedding. Denote the image of $U$ under the delay embedding by $\mathcal{U}$.

The invariant measures $\mu'$ and $\mu$, as well as $\mathcal{X}'$, $\mathcal{X}$, $x(t)$, $\mathbf{x}(t)$, $\tau$, and $T$, are as defined earlier. It is assumed that $\mathcal{X}' \subset U$, which implies $\mathcal{X} \subset \mathcal{U}$.

Lemma 3.

Suppose $x(t) = P(\Phi_t(u_0))$ for $u_0 \in \mathcal{X}'$ and $t \ge 0$. Denote the delay vector

$$\bigl( x(t),\; x(t-\tau),\; \ldots,\; x(t-(D-1)\tau) \bigr)$$

by $\mathbf{x}(t)$ so that $\mathbf{x}(t) = E(\Phi_t(u_0))$. There exists a unique and well-defined function $\chi : \mathcal{X} \to \mathbb{R}$, called the exact predictor, such that

$$\chi(\mathbf{x}(t)) = x(t+T)$$

for all $t \ge 0$. In particular, the same function $\chi$ serves for all initial points $u_0 \in \mathcal{X}'$ and all $t \ge 0$.

Proof.

To map $\mathbf{x}(t)$ to $x(t+T)$, first invert the delay map to obtain the point $\Phi_t(u_0)$ in $\mathcal{X}'$, advance that point by $T$ by applying $\Phi_T$, and finally project using $P$. Each of the three maps in this composition is $C^1$ or better. The predictor must be unique because $x(t+T)$ is uniquely determined by the flow.∎
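In the notation assumed above (delay embedding $E$, time-$T$ flow map $\Phi_T$, and projection $P$), the composition described in the proof can be summarized as follows.

```latex
% Sketch in assumed notation: E = delay embedding, \Phi_T = time-T flow map,
% P = projection onto the observed coordinate.
\[
  \chi \;=\; P \circ \Phi_T \circ E^{-1} \quad\text{on } \mathcal{X},
  \qquad\text{so that}\qquad
  \chi(\mathbf{x}(t)) \;=\; P\bigl(\Phi_T(E^{-1}(\mathbf{x}(t)))\bigr) \;=\; x(t+T).
\]
```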

Remark.

The embedding theory of Sauer et al [21] may be applied to the compact invariant set $\mathcal{X}'$ without enclosing it in the open set $U$. Indeed, if the box counting dimension of $\mathcal{X}'$ is $d$, the embedding dimension need only satisfy $D > 2d$ together with mild genericity conditions. That can be advantageous because $d$ may be much smaller than $n$. However, there are two difficulties if $\mathcal{X}'$ is a fractal set. First, tangent spaces cannot be defined and we cannot assert the delay map to be a diffeomorphism, although it will be one-to-one generically. Second, we will need to extend the exact predictor to the closure of an open neighborhood of $\mathcal{X}$ in $\mathbb{R}^D$ to apply the universality theorem, and such an extension cannot be made smoothly from $\mathcal{X}$ if $\mathcal{X}$ is a fractal set. Both these difficulties go away if we take $U$ to be a submanifold that contains $\mathcal{X}'$. If $d_0$ is the dimension of that submanifold, we would only require $D > 2d_0$. For simplicity, we have assumed $U$ to be an open set.

The following convexity lemma is an elementary result of convex analysis [7]. It is stated and proved for completeness.

Lemma 4.

Let $\Lambda_1(w)$ and $\Lambda_2(w)$ be convex and continuous in $w \in B$, where $B = \{w \in \mathcal{H} : \|w\| \le R_0\}$ and $\mathcal{H}$ is a Hilbert space. If $\partial\Lambda_i(w)$ is the subgradient at $w$, assume that

$$\bigl\langle \partial\Lambda_i(w) - \partial\Lambda_i(w'),\; w - w' \bigr\rangle \;\ge\; c\,\|w - w'\|^2$$

for $i = 1, 2$, all $w, w' \in B$, and some $c > 0$. Let $w_1 = \arg\min_{w \in B} \Lambda_1(w)$ and $w_2 = \arg\min_{w \in B} \Lambda_2(w)$. Suppose that

$$\bigl\| \partial\Lambda_1(w) - \partial\Lambda_2(w) \bigr\| \;\le\; \eta$$

for $w \in B$, and assume that $\|w_1\| < R_0$ and $\|w_2\| < R_0$. Then,

$$\|w_1 - w_2\| \;\le\; \frac{\eta}{c}.$$

Proof.

Because $w_1$ minimizes $\Lambda_1$ and lies in the interior of $B$, we have $\partial\Lambda_1(w_1) = 0$. Thus,

$$\bigl\langle \partial\Lambda_1(w_2),\; w_2 - w_1 \bigr\rangle \;\ge\; c\,\|w_2 - w_1\|^2.$$

Similarly, $\bigl\langle \partial\Lambda_2(w_1),\; w_1 - w_2 \bigr\rangle \ge c\,\|w_1 - w_2\|^2$. By adding the two inequalities, we have

$$2c\,\|w_1 - w_2\|^2 \;\le\; \bigl\langle \partial\Lambda_1(w_2) - \partial\Lambda_2(w_2),\; w_2 - w_1 \bigr\rangle + \bigl\langle \partial\Lambda_2(w_1) - \partial\Lambda_1(w_1),\; w_1 - w_2 \bigr\rangle \;\le\; 2\eta\,\|w_1 - w_2\|,$$

whence $\|w_1 - w_2\| \le \eta/c$, proving the lemma. This last step relies on $\partial\Lambda_1(w_1) = \partial\Lambda_2(w_2) = 0$ and the assumption $\|\partial\Lambda_1(w) - \partial\Lambda_2(w)\| \le \eta$ on $B$. ∎

If $x(t)$ is the noise-free signal, our arguments are phrased under the assumption that the spline-smoothed signal $\hat{x}$ satisfies the $\infty$-norm accuracy bound of Lemma 1. This assumption is realized with a probability that tends to $1$ as the number of observations increases (by Lemma 1). For convenience, the probability that $x(t)$ is successfully denoised by smooth splines, so that the bound holds, is referred to below as the probability of successful denoising.

In general, a function defined on an embedded submanifold can be extended to an open neighborhood of the submanifold using a partition of unity. Because $\mathcal{U}$ is an embedded submanifold, $\mathcal{X} \subset \mathcal{U}$, and the exact predictor $\chi$ is defined on $\mathcal{X}$, it follows that there exists $r > 0$ such that $\chi$ can be extended to $\mathcal{X}_r$, where

$$\mathcal{X}_r = \bigl\{ \mathbf{z} \in \mathbb{R}^D : \operatorname{dist}(\mathbf{z}, \mathcal{X}) \le r \bigr\}.$$

We will always assume that the spline accuracy is small enough that, whenever denoising succeeds, the delay vectors formed from the spline-smoothed signal lie in $\mathcal{X}_r$. Without loss of generality, we assume $r \le 1$. The convergence proof will assess the approximation to $\chi$ with respect to the measure $\mu$. Therefore, the manner in which the extension is carried out is not highly relevant. The sole purpose of the extension is to facilitate an application of the universality theorem for Gaussian kernels.

Let

$$R \;=\; 1 + \sup_{\mathbf{z} \in \mathcal{X}} \|\mathbf{z}\|_{\infty}. \qquad (3.1)$$

Thus, $R$ is a bound on the size of the embedded invariant set with ample allowance for error in spline smoothing.

Let denote the spline-smoothed signal and the noise-free signal with . Define

where , , and is any smooth and positive kernel defined over . The kernel will be specialized to the Gaussian kernel when applying the universality theorem. Define

using the noise-free signal . Let and define

For , all three functionals are strictly convex and have a unique minimizer. The unique minimizers of , , and are denoted by , , and , respectively. The functional is the same as in (2.3), the second step of the algorithm. Thus, is the computed approximation to the exact predictor .
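For orientation, one way to write the three functionals, in the notation of (2.3) and up to the precise choice of normalization, is the following (the symbols $\Lambda_{\hat{x}}$, $\Lambda_{x}$, $\Lambda_{\infty}$ and the exact form of the third, continuous-time functional are assumptions made for this sketch).

```latex
% Sketch in assumed notation: \hat{x} = spline-smoothed signal, x = noise-free
% signal, \hat{\mathbf{x}}(t), \mathbf{x}(t) = corresponding delay vectors,
% t_i = i\delta, t_f = N\delta, k = smooth positive kernel with RKHS norm.
\begin{align*}
  \Lambda_{\hat{x}}(f) &= \frac{1}{N}\sum_{i}
      \bigl( \hat{x}(t_i + T) - f(\hat{\mathbf{x}}(t_i)) \bigr)^2
      + \lambda \|f\|_{\mathcal{H}}^2
      && \text{(spline-smoothed data, as in (2.3)),}\\
  \Lambda_{x}(f) &= \frac{1}{N}\sum_{i}
      \bigl( x(t_i + T) - f(\mathbf{x}(t_i)) \bigr)^2
      + \lambda \|f\|_{\mathcal{H}}^2
      && \text{(noise-free data),}\\
  \Lambda_{\infty}(f) &= \frac{1}{t_f}\int_{0}^{t_f}
      \bigl( x(t + T) - f(\mathbf{x}(t)) \bigr)^2 \, dt
      + \lambda \|f\|_{\mathcal{H}}^2
      && \text{(continuous-time analogue).}
\end{align*}
```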

The following lemma bounds the minimizers of , , in norm by .

Lemma 5.

The minimizer satisfies with probability greater than . The minimizers and satisfy and .

Proof.

Because minimizes , we must have . We have with probability greater than . Thus, and the stated bound for follows. The bounds for and are proved similarly. ∎

Lemma 6.

Assume and for . For with , we have . Here depends only on and the kernel . The kernel is assumed to be .

Proof.

First, we note that and , where is the directional derivative of in any direction. By a result of Zhou (part (c) of Theorem 1 of [30]), we may take and , where is the embedding dimension and the -norm is over . If we define using

(3.2)

it follows that both and (where is a directional derivative in any direction) are bounded above by .
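The estimates invoked here are of the following standard type, sketched for the Gaussian kernel with bandwidth $\sigma$ (the constants used in the text may differ).

```latex
% Sketch, assuming k(u,v) = exp(-||u-v||^2 / 2\sigma^2), so k(u,u) = 1.
% Reproducing property: f(u) = <f, k(u,.)>_H; by Zhou's derivative
% reproducing property, D_e f(u) = <f, D_e k(u,.)>_H for a unit vector e.
\[
  |f(\mathbf{u})| \;\le\; \|f\|_{\mathcal{H}}\,\sqrt{k(\mathbf{u},\mathbf{u})}
                 \;=\; \|f\|_{\mathcal{H}},
  \qquad
  |D_{\mathbf{e}} f(\mathbf{u})| \;\le\; \|f\|_{\mathcal{H}}\,
      \sqrt{\partial_{1,\mathbf{e}}\partial_{2,\mathbf{e}}\,
            k(\mathbf{u},\mathbf{v})\big|_{\mathbf{v}=\mathbf{u}}}
  \;=\; \frac{\|f\|_{\mathcal{H}}}{\sigma}.
\]
```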

We may write

(3.3)

Here is used an upper bound on . The bound of on is justified by the previous paragraph. The same bound on and follows from and .

Now, implies that

by the bound on . By replacing with if necessary, we have

The proof is completed by utilizing these bounds in (3.3) and defining as .∎

Lemma 7.

Assume . With probability greater than , .

Proof.

Follows from Lemmas 5, 6, and 4. Lemma 4 is applied with , , and . The choice of is justified by Lemma 5 and the choice of is justified by Lemma 6. To justify the choice of , note that and can both be written as with a convex functional. The identity shows that is the unique subgradient at for . Thus, if (the subgradient of is unique and may be obtained explicitly), we must have , justifying the choice of .∎

Lemma 8.

Assume . For and , we have

Proof.

We will argue as in Lemma 6 and assume that , and are bounded by .

Suppose . In the difference

we may apply the mean value theorem to the integral and argue as in Lemma 6 to upper bound the difference by . The proof is completed by summing the differences from to and dividing by . ∎

Lemma 9.

Assume . Then .

Proof.

Follows from Lemmas 5, 8, and 4. Lemma 4 is applied with , , and . The choices of , , and are justified using Lemmas 5 and 8 and an additional argument as in the proof of Lemma 7. ∎

Choose the approximation tolerance as above. At this point, we specialize to a kernel for which the universality theorem of Steinwart applies; for example, the Gaussian kernel $k(\mathbf{u}, \mathbf{v}) = \exp\bigl( -\|\mathbf{u} - \mathbf{v}\|^2 / 2\sigma^2 \bigr)$. We may then find a function in the reproducing kernel Hilbert space whose difference from the (extended) exact predictor is within the tolerance in the $\infty$-norm over $\mathcal{X}_r$. In fact, we will need the difference to be within the tolerance only on $\mathcal{X}$. The larger compact space $\mathcal{X}_r$ is needed to apply the universality theorem and for other RKHS arguments.

Lemma 10.

Let . If minimizes , we have

In addition,

Proof.

We have

because is the minimizer, and

This last inequality uses . The proof of the first part of the lemma is completed by combining the inequalities. To prove the second part, we argue similarly after noting . ∎

Consider half-open boxes in $\mathbb{R}^D$ of the form

$$[j_1 h, (j_1+1)h) \times \cdots \times [j_D h, (j_D+1)h)$$

with $h > 0$ and $j_1, \ldots, j_D$ integers. The whole of $\mathbb{R}^D$ is a disjoint union of such boxes. Because $\mathcal{X}$ is compact, we can assume that $\mathcal{X} \subset B_1 \cup \cdots \cup B_M$, where the union is disjoint, each $B_j$ is a half-open box of the form above, and $B_j \cap \mathcal{X} \neq \emptyset$ for $j = 1, \ldots, M$.

We will pick the number of boxes to be so large that each box has a diameter that is bounded as follows:

Here is determined later, and is the -norm in the function space . Lemma 10 tells us that