Prediction in Projection: A new paradigm in delay-coordinate reconstruction

# Prediction in Projection: A new paradigm in delay-coordinate reconstruction

J.
###### Abstract
\OnePageChapter{singlespace}

Delay-coordinate embedding is a powerful, time-tested mathematical framework for reconstructing the dynamics of a system from a series of scalar observations. Most of the associated theory and heuristics are overly stringent for real-world data, however, and real-time use is out of the question due to the expert human intuition needed to use these heuristics correctly. The approach outlined in this thesis represents a paradigm shift away from that traditional approach. I argue that perfect reconstructions are not only unnecessary for the purposes of delay-coordinate based forecasting, but that they can often be less effective than reduced-order versions of those same models. I demonstrate this using a range of low- and high-dimensional dynamical systems, showing that forecast models that employ imperfect reconstructions of the dynamics—i.e., models that are not necessarily true embeddings—can produce surprisingly accurate predictions of the future state of these systems. I develop a theoretical framework for understanding why this is so. This framework, which combines information theory and computational topology, also allows one to quantify the amount of predictive structure in a given time series, and even to choose which forecast method will be the most effective for those data.

Garland \otherdegreesB.S., Colorado Mesa University, 2009

This dissertation—and my entire academic journey—would have been abandoned multiple times, and likely never would have begun, if it were not for your endless love and support.

###### Acknowledgements.
\OnePageChapter{singlespace} Liz, I have been told that as you look back on your life there will be one or maybe two people that radically alter your entire life course. I have no doubt you are that person in my life. You personally introduced me to complexity science and reinvigorating me as a scientist at a time in my life I was ready to leave academics for good. You have not only shaped me as an academic but as a better human being. You opened more doors than I can count and did everything in your power to give me every possible opportunity to succeed. You are the epitome of advisor, mentor, colleague, champion and friend.

My Family, without your love and support I am nothing.

Jim Meiss, for all the trident visits and willingness to let me bounce ideas off of you over the years, from the beginning to the end of my dissertation. All our conversations about computational topology were invaluable in providing my thesis with a sound theoretical foundation.

Ryan James, for introducing me to, and patiently teaching me information theory.

Jim Crutchfield, for numerous conversations, at the Santa Fe Institute and at UC Davis, throughout the research process. Your guidance in information theory was invaluable in finalizing the story of my thesis.

Holger Kantz, for hosting me at Max-Planck-Institut Für Physik Komplexer Systemer, where I had multiple interactions with you and your students that drastically matured the overall picture of my thesis.

Bob Easton, for all the Trident seminars.

Aaron Clauset, Sriram Sankaranarayanan, Mark Muldoon, Simon DeDeo, and Zach Alexander, for interesting and useful conversations throughout this process.

Santa Fe Institute, my academic home and sanctuary every summer. Your interdisciplinary halls recharged me every year, gave me a place to forge interesting collaborations with people (and in fields) I never would have imagined and took my career to heights I could have never dreamed. \emptyLoF\emptyLoT {singlespace}

## Chapter \thechapter Overview and Motivation

Complicated nonlinear dynamics are ubiquitous in natural and engineered systems. Methods that capture and use the state-space structure of a dynamical system are a proven strategy for forecasting the behavior of systems like this, but use of these methods is not always straightforward. The governing equations and the state variables are rarely known; rather, one has a single (or perhaps a few) series of scalar measurements that can be observed from the system. It can be a challenge to model the full dynamics from data like this, especially in the case of forecast models, which are only really useful if they can be constructed and applied on faster time scales than those of the target system. While the traditional state-space reconstruction machinery is a good way to accomplish the task of modeling the dynamics, it is problematic in real-time forecasting because it generally requires input from and interpretation by a human expert. This thesis argues that that roadblock can be sidestepped by using a reduced-order variant of delay-coordinate embedding to build forecast models: I show that the resulting forecasts can be as good as—or better than—those obtained using complete embeddings, and with far less computational and human effort. I then explore the underlying reasons for this using a novel combination of techniques from computational topology and information theory.

Modern approaches to modeling a time series for forecasting arguably began with Yule’s work on predicting the annual number of sunspots [122] through what is now known as autoregression. Before this, time-series forecasting was done mostly through simple global extrapolation [119]. Global linear methods, of course, are rarely adequate when one is working with nonlinear dynamical systems; rather, one needs to model the details of the state-space dynamics in order to make accurate predictions. The usual first step in this process is to reconstruct that dynamical structure from the observed data. The state-space reconstruction techniques proposed by Packard et al. [89] in 1980 were a critical breakthrough in this regard. In 1981, Takens showed that this method, delay-coordinate embedding, provides a topologically correct representation of a nonlinear dynamical system if a specific set of theoretical assumptions are satisfied. I discuss this in detail in Section 1.1 alongside the appropriate citations.

A large number of creative strategies have been developed for using the state-space structure of a dynamical system to generate predictions, as discussed in depth in Section 3.3. Perhaps the most simple of these is the “Lorenz Method of Analogues” (LMA), which is essentially nearest-neighbor prediction [72]. Even this simple strategy, which builds predictions by looking for the nearest neighbor of a given point and taking that neighbor’s observed path as the forecast—works quite well for forecasting nonlinear dynamical systems. LMA and similar methods have been used successfully to forecast measles and chickenpox outbreaks [112], marine phytoplankton populations [112], performance dynamics of a running computer(e.g., [36, 37]), the fluctuations in a far-infrared laser [98, 119], and many more.

The reconstruction step that is necessary before any of these methods can be applied to scalar time-series data, however, can be problematic. Delay-coordinate embedding is a powerful piece of machinery, but estimating good values for its two free parameters, the time delay and the dimension , is not trivial. A large number of heuristics have been proposed for this task, but these methods, which I cover in depth in Sections 1.2 and 1.3, are computationally intensive and they require input from—and interpretation by—a human expert. This can be a real problem in a prediction context: a millisecond-scale forecast is not useful if it takes seconds or minutes to produce. If effective forecast models are to be constructed and applied in a manner that outpaces the dynamics of the target system, then, one may not be able to use the full, traditional delay-coordinate embedding machinery to reconstruct the dynamics. And the hurdles of delay-coordinate reconstruction are even more of a problem in nonstationary systems, since the reconstruction machinery is only guaranteed to work for an infinitely long noise-free observation of a single dynamical system. This means that no matter how much effort and human intuition is put into estimating , or how precise a heuristic is developed for that process, the theoretical constraints of delay-coordinate embedding can never be satisfied in practice. This means that an experimentalist can never guarantee, on any theoretical basis, the correctness of their embedding, no matter their choice of . In Section 1, I provide an in-depth discussion of these issues.

The conjecture that forms the basis for this thesis is that a formal embedding, although mandatory for detailed dynamical analysis, is not necessary for the purposes of prediction—in particular, that reduced-order variants of delay-coordinate reconstructions are adequate for the purposes of forecasting, even though they are not true embeddings [38]. As a first step towards validating that conjecture, I construct two-dimensional time-delay reconstructions from a number of different time-series data sets, both simulated and experimental, and then build forecast models in those spaces. I find that forecasts produced using the Lorenz method of analogues on these reduced-order models of the dynamics are roughly as accurate as—and often even more accurate than—forecasts produced by the same method working in the complete embedding space of the corresponding system. This exploration is detailed in Chapter Prediction in Projection: A new paradigm in delay-coordinate reconstruction.

Figure 1 shows a quick proof-of-concept example: a pair of forecasts of the so-called “Dataset A,” a time series from a far-infrared laser from the Santa Fe Institute prediction competition [119].

Even though the low-dimensional reconstruction used to generate the forecast in the right panel of the figure is not completely faithful to the underlying dynamics of this system, it appears to be good enough to support accurate short-term forecast models of nonlinear dynamics. While this example is encouraging, Dataset A is only one time series and it was drawn from a comparatively simple system—one that is well-described by a first-return map (or, equivalently, a one-dimensional surface of section). The examples presented in Chapter Prediction in Projection: A new paradigm in delay-coordinate reconstruction offer a broader validation of this thesis’s central claim by constructing forecasts using two-dimensional delay reconstructions of ensembles of data sets from a number of different systems whose dynamics are far more complex than the time series in Figure 1. Uniformly, the results indicate that the full complexity (and effort) of the delay-coordinate ‘unfolding’ process may not be strictly necessary to the success of forecast models of real-world nonlinear dynamical systems. Finally, I want to emphasize that this reduced-order strategy is intended as a short-term forecasting scheme. Dimensional reduction is a double-edged sword; it enables on-the-fly forecasting by eliminating a difficult estimation step, but it effects a guaranteed memory loss in the model. I explore this limitation experimentally in Section 9 and theoretically in Section 11.3.

While the results in Chapter Prediction in Projection: A new paradigm in delay-coordinate reconstruction are interesting from a practical perspective, in that they allow delay-coordinate reconstruction to be used in real time, they are perhaps even more interesting from a theoretical perspective. The central premise of this thesis is a heresy, according to the dogma of delay-coordinate embedding, but regardless, it works. This naturally leads to the need for a deeper exploration into why such a deviation from theory provides so much practical traction.

That exploration is precisely the focus of Chapter Prediction in Projection: A new paradigm in delay-coordinate reconstruction, where I provide two disjoint explanations of why prediction in projection—my reduced-order strategy—works. The first is from an information theoretic perspective; the second utilizes computational topology. These two disjoint branches of mathematics offer two very different, but quite complementary, tools for exploring this discontinuity between theory and practice. The prior, the subject of Section 11, provides a framework for understanding how information is stored and transmitted from past to future in delay-coordinate reconstructions. Building upon ideas from this field, I develop a novel method called time-delayed active information storage () that can be used to select forecast-optimal parameters for delay-coordinate reconstructions [42]. Using , I show that for noisy finite length time series, a two-dimensional projection (i.e., ) often provides as much—or more—information about the future than a traditional embedding. This further corroborates the central premise of this thesis. This counter-intuitive result, its source, and its implications are discussed in depth in Section 11.3. Section 12 offers an alternative view of the reconstruction process—one based on topology. As I discuss in Section 1.1, the theoretical restrictions of delay-coordinate embedding are intended to ensure a diffeomorphic reconstruction, something that is required for analysis of dynamical invariants but that is excessive for reconstruction of topology. I conjecture that one of the reasons prediction in projection works is that topology (which is preserved by a homeomorphism) becomes correct at much lower embedding dimensions than what one would expect from the associated theorems. The results in Section 12 confirm this, providing insight into the mechanics of prediction in projection, explaining why this approach exhibits so much accuracy while being theoretically wrong, and offering a deeper understanding into delay-coordinate reconstruction theory in general.

Of course, no forecast model will be ideal for every task. In fact, as a corollary of the undecidability of the halting problem [117], no single forecasting schema is ideal for all noise-free deterministic signals [119]—let alone all real-world time-series data sets. I do not want to give the impression that this reduced-order model will be effective for every time series, but I have shown that it is effective for a broad spectrum of signals. Following this line of reasoning, it is important to be able to determine when prediction is possible at all, and, if so, what forecast model is the best one to use. To this end, I have developed a Shannon information-theory based heuristic for quantifying precisely when a given time-series is predictable given the correct model [41]. This heuristic—the focus of Chapter Prediction in Projection: A new paradigm in delay-coordinate reconstruction—allows for a priori evaluation of when prediction in projection will be effective.

The rest of this thesis is organized as follows. Chapter Prediction in Projection: A new paradigm in delay-coordinate reconstruction reviews all the necessary background and related work, including the theory and practice of delay-coordinate embedding, information theory, and the forecast methods, as well as the figure of merit that I use for assessing forecast accuracy. In Chapter Prediction in Projection: A new paradigm in delay-coordinate reconstruction, I introduce the case studies used in this thesis. In Chapter Prediction in Projection: A new paradigm in delay-coordinate reconstruction, I demonstrate the effectiveness of this reduced order forecast strategy on a range of different examples, comparing it to traditional linear and nonlinear forecasting strategies, and exploring some of its limitations. In Chapter Prediction in Projection: A new paradigm in delay-coordinate reconstruction, I provide a mathematical foundation for why prediction in projection works. In Chapter Prediction in Projection: A new paradigm in delay-coordinate reconstruction, I describe a measure for quantifying time-series predictability to understand when my reduced-order method—or any forecasting strategy—will be effective. At the end of Chapters Prediction in Projection: A new paradigm in delay-coordinate reconstruction-Prediction in Projection: A new paradigm in delay-coordinate reconstruction I discuss specialized avenues of future research directly associated with the specific contribution of that chapter. In Chapter Prediction in Projection: A new paradigm in delay-coordinate reconstruction, I conclude and outline the next frontier of this work: developing strategies for grappling with nonstationary time series in the context of delay coordinate based forecasting—which, I believe, will require a combination of all aspects of this thesis to solve.

## Chapter \thechapter Background and Related Work

### 1 Reconstructing Nonlinear Deterministic Dynamics

The term nonlinear deterministic dynamical system describes a set combined with a deterministic nonlinear evolution or update rule , also called the generating equations. The set could be as simple as or a similar geometric manifold, or as abstract as a set of symbols [79]. Elements of the set are referred to as states of the dynamical system; the set is generally referred to as the state space. The update or evolution rule is a fixed mapping that gives a unique image to any particular element of the set. In the problems treated in this thesis, this update rule is deterministic and fixed: given a particular state, the next state of the system is completely determined. The theory of dynamical systems is both vast and rich. This section of this dissertation is intended to review the subset of this field that is needed to understand the core ideas of my thesis. It is not intended as a general review of this field. For more complete reviews, see [79, 14, 59].

Dynamical systems can be viewed as falling into one of two categories: those that are discrete in time and those that are continuous in time. The former are referred to as maps and denoted by {singlespace}

 →yn+1=Φ(→yn),n∈N (1)

The latter are referred to as flows and are represented by a system of first-order ordinary differential equations {singlespace}

 ddt→y(t)=Φ(→y(t)),t∈R+ (2)

When the generating equations of a dynamical system are known, the future state of any particular initial condition can be completely determined. Unfortunately, knowledge of the generating equations (or even the state space) is a luxury that is very rarely afforded to an experimentalist. In practice, the dynamical system under study is a black box that is observed at regular time intervals. In such a situation, one can reconstruct the underlying dynamics using so-called delay-coordinate embedding, the topic of the following section.

#### 1.1 Delay-Coordinate Embedding

The process of collecting a time series or trace is formally the evaluation of an observation function [99] at the true system state at time for , i.e., for . Specifically, smoothly observes the path of the dynamical system through state space at regular time intervals, e.g., measuring the angular position of a pendulum every 0.01 seconds or measuring the average number of instructions executed by a computer per cycle [83, 6]. Provided that the underlying dynamics and the observation function —are both smooth and generic, Takens [113] formally proves that the delay coordinate map {singlespace}

 F(h,Φ,τ,m)(→y(tj))=([xj xj−τ … xj−(m−1)τ])=→xj (3)

from an -dimensional smooth compact manifold to is almost always a diffeomorphism on whenever and is large enough, i.e., .

###### Definition (Diffeomorphism, Diffeomorphic).

A function is said to be a diffeomorphism if it is a bijective correspondence whose inverse is also . Two manifolds and are said to be diffeomorphic if there exists a diffeomorphism that maps onto .

What all of this means is that, given an observable deterministic dynamical system—a computer for example, a highly complex nonlinear dynamical system [83] with no obvious —I can measure a single quantity (e.g., instructions executed per cycle or L2 cache misses) and use that time series to faithfully reconstruct the underlying dynamics up to diffeomorphism. In other words, the true unknown dynamics and the dynamics reconstructed from this scalar time series have the same topology. Though this is less information than one might like, it is still very useful, since many important dynamical properties (e.g., the Lyapunov exponent that parametrizes chaos) are invariant under diffeomorphism. It is also useful for the purposes of prediction—the goal of this thesis.

The delay-coordinate embedding process involves two parameters: the time delay and the embedding dimension . For notational convenience, I denote the embedding space with dimension and time delay as . To assure topological conjugacy, the Takens proof [113] requires that the embedding dimension must be at least twice the dimension of the ambient space; a tighter bound of , the capacity dimension of the original dynamics, was later established by Sauer et al. [99].

###### Definition (Capacity Dimension [79]).

Let denote the minimum number of open sets (-balls) of diameter less than or equal to that form a finite cover of a compact metric space . Then the capacity dimension of is a real number such that: as , explicitly {singlespace}

 dcap≡−limϵ→0+lnN(ϵ)lnϵ (4)

if this limit exists.

Operationalizing either of these theoretical constraints can be a real challenge. is not known and accurate calculations are not easy with experimental data. And besides, one must first embed the data before performing those calculations.

Apropos of the central claim of this thesis, it is worth considering the intention behind these bounds on . The worst-case bound of is intended to eliminate all projection-induced trajectory crossings in the reconstructed dynamics. For most systems, and most projections, the dimensions of the subspaces occupied by these false crossings are far smaller than those of the original systems [99]; often, they are sets of measure zero. For the delay-coordinate map to be a diffeomorphism, all of these crossings must be unfolded by the embedding process. This is necessary if one is interested in calculating dynamical invariants like Lyapunov exponents. However, the near-neighbor relationships that most state-space forecast methods use in making their predictions are not invariant under diffeomorphism, so it does not make sense to place that strict condition on a model that one is using for those purposes. False crossings will, of course, cause incorrect predictions, but that is not a serious problem in practice if the measure of that set is near zero, particularly when one is working with noisy, real-world data.

My reduced-order strategy explicitly fixes . This choice takes care of one of the two free parameters in the delay-coordinate reconstruction process, but selection of a value for the delay, , is still an issue. The theoretical constraints in this regard are less stringent: must be greater than zero and not a multiple of the period of any orbit [99, 113]. In practice, however, the noisy and finite-precision nature of digital data and floating-point arithmetic combine to make the choice of much more delicate [59]. It is to this issue that I will turn next.

#### 1.2 Traditional Methods for Estimating the Embedding Delay τ

The parameter defines the amount of time separating each coordinate of the delay vectors: . The theoretical constraints on the time delay are far from stringent and this parameter does not—in theory [99, 113]—play a role in the correctness of the embedding. However, that assumes an infinitely long noise-free time series [99, 113], a luxury that is rare in practice. As a result of this practical limitation, the time delay plays a crucial role in the usefulness111Here by usefulness I mean that not only are the dynamical invariants (e.g., Lyapunov exponents and fractal dimension) and topological properties, (e.g., neighborhood relations) preserves, but also that those quantities are attainable from the reconstructed dynamics. of the embedding [33, 59, 17, 67, 18, 68, 95].

The fact that the time delay does not play into the underlying mathematical framework is a double-edged sword. Because there are no theoretical constraints, there is no practical way to derive an “optimal” lag or even know what criterion an “optimal” lag would satisfy [59]. Casdagli et al. [24] provide a theoretical discussion of this, together with some treatment of the impacts of on reconstructing an attractor using a noisy observation function. Unfortunately no practical methods for estimating came from that discussion, but it does nicely outline a range of between redundancy and irrelevance. For very small , especially with noisy observations, and are effectively indistinguishable. In this situation, the reconstruction coordinates are highly redundant [24, 46], i.e., they contain nearly the same information about the system.222This is made more rigorous in Section 2, where I discuss information theory. This is not a good choice for because additional coordinates add almost nothing new to the model. Choosing an arbitrarily large is undesirable as well. On this end of the spectrum, the coordinates of the reconstruction become causally unrelated, i.e., the measurement of is irrelevant in understanding [24]. Useful values lie somewhere between these two extrema. In practice, selecting useful values can be quite challenging, as demonstrated in the following example.

###### Example 1.

To explore the effects of on an embedding, I first construct an artificial time series by integrating the Rössler system [96] {singlespace}

 ˙x = −y−z (5) ˙y = x+ay (6) ˙z = b+z(x−c) (7)

with , , and , using a standard fourth-order Runge-Kutta integrator starting from for 100,000 time steps with a time step of . This results in a trajectory of the form , where for . This trajectory is plotted in Figure 2(a). To discard transient behavior, I remove the first 1,000 points of this trajectory. I define the observation function as , resulting in the time series: . The first 5,000 points of this time series can be seen in Figure 2(b).

To illustrate the role of in the delay-coordinate embedding process, I embed using and several different choices of . These embeddings are shown in Figure 3. In theory, each of the choices of in Figure 3 should yield correct, topologically equivalent embeddings—given the right choice of . In practice, however, that is not the case.

First consider the top-left panel of Figure 3 where . Here, the axes are spread apart so little that the embedding appears to be a noisy line. This is because and are effectively indistinguishable at this small . In the embedding in the bottom-right panel of Figure 3, the reconstruction appears to be a ball of noise with only traces of underlying structure. At this large , the coordinates of the reconstruction are causally unrelated. This is known as “overfolding.” To visualize this concept, consider the progression in Figure 3 from to . As increases, the reconstruction expands away from the diagonal and begins to resemble the original attractor. However, as increases past this point, the top corner of the reconstruction is slightly folded over.

This “melting” effect is called folding in the literature. “Overfolding” occurs when the reconstructed attractor folds back on itself, completely collapsing back to the diagonal, (as can be seen for ) and then re-expanding away from the diagonal, (as can be seen for ). Overfolding produces an unnecessarily (and in the case of noise, often incorrect) reconstruction [24, 95, 59]. Compare, for example, the bottom-left panel of Figure 3 with the actual attractor in Figure 2(a). From a theoretical standpoint, given the right choice of , these two objects are topologically equivalent; from a practical standpoint, however, the embedding is overly complex.

If the time series were noisy, this overfolding would likely introduce additional error. With knowledge of the true attractor, it is easy to say that the and embeddings most closely match its shape; without that knowledge, however, the choice is not obvious. The situation is even more delicate than this. If one knew, somehow, that and were both good reconstructions, how would one know which of these two choices was optimal? With , no folding has occurred, which is beneficial because with noisy or projected dynamics (choosing too small), foldings may cause false crossings.333A false crossing is when two trajectories intersect due to projection or measurement error, a phenomenon that cannot happen in a theoretical deterministic dynamical system. But the trajectory in is not as “space filling” as or as spread apart from the diagonal, so the coordinates are most likely more redundant. Weighing the importance of these kinds of criteria is non-trivial and, I believe, application specific. In the rest of this section, I review heuristics aimed at optimizing the estimation of by weighing these different attributes against one another.

There are dozens of methods for estimating e.g., [42, 46, 33, 88, 59, 17, 67, 18, 68, 95]. This is a central issue in my thesis, so the following section surveys this literature in some depth. Choice of is application- and system-specific [17, 59, 95]; a that works well for Lyapunov exponent calculation may not work well for forecasting. For this reason, Kantz & Schreiber [59] suggest that it may be necessary to perform additional system- and application-specific tuning of after using any generic selection heuristic. In my first set of examples, I use the method of mutual information [33, 68]—described below in detail. While this is the standard -selection method, I will show in Section 9.1 that this choice is almost always suboptimal for forecasting. In Section 11, I provide a solution to this: an alternative selection method that leverages “active information storage” to select a that is optimal for forecasting specific reconstructions [42].

##### 1.2.1 Linear Independence and Autocorrelation

A naïve strategy for selecting the time delay would be to choose a that forces the coordinates of the delay vectors to be linearly independent. This is equivalent to choosing the first zero of the autocorrelation function {singlespace}

 R(τ)=1N−τ∑j(xj−μx)(xj−τ−μx)σ2x (8)

where , and are respectively the length, average and standard deviation of the time series [59, 33] . Several other methods have been proposed that suggest instead choosing where the autocorrelation function first drops to a particular fraction of its initial value, or at the first inflection point of that function [59, 95].

An advantage to this class of methods is that its members are extremely computationally efficient; the autocorrelation function, for instance, can be calculated with the fast Fourier transform [95]. However, autocorrelation is a linear statistic that ignores nonlinear correlations. This often yields values that work well for some systems and not well for others [59, 95, 33].

##### 1.2.2 General Independence and Mutual Information

Instead of seeking linear independence between delay coordinates, it may be more appropriate to seek general independence—i.e., coordinates that share the least amount of information (also called “redundancy”) with one another. The following discussion requires some methods from information theory; for a review of these concepts, please refer to Section 2. Fraser & Swinney argue that selecting the first minimum of the time-delayed mutual information will minimize the redundancy of the embedding coordinates, maximizing the information content of the overall delay vector [33]. In that approach, one obtains generally independent delay coordinates by symbolizing the two time series and by binning, discussed in Section 2.5.1, and then computes the mutual information between and for a range of , call this . Then for each , is the amount of information shared between the coordinates and , i.e, quantifies how redundant the second axis is [33]. According to [33], choosing a that minimizes , results in generally independent delay coordinates, i.e, delay coordinates that are minimally redundant.

The argument for choosing in this way applies strictly to two-dimensional embeddings [33, 59], but was extended to work in dimensions in [68]. To accomplish this, Liebert & Schuster rewrote mutual information in terms of second-order Renyí entropies. This transformation allowed them to show that the minima of agreed with the minima of the correlation sum [49], , defined as {singlespace}

 C(ϵ;m,τ)=1N2N∑\lx@stackreli,j=1i≠jΘ[ϵ−||→xi−→xj||] (9)

where is the length of the time series, is the Heavyside function, and are the and delay vectors in . In addition to extending the argument of [33] to dimensions, the modification of [68] allowed for much faster approximations of by simply finding the minimum of , which can be done quickly with the Grassberger-Procaccia algorithm [49, 68].

The choice of the first minimum of is intended to avoid the kind of overfolding of the reconstructed attractor and irrelevance between coordinates that was demonstrated in Figure 3. This choice is discussed and empirically verified in [68] by showing that the first minimum of (so in turn ) corresponded to the most reliable calculations of the correlation dimension [49].

###### Definition (Correlation Dimension).

If the correlation sum, , decreases like a power law, , then is called the correlation dimension. Formally {singlespace}

 D=limϵ→0logC(ϵ)logϵ (10)

if this limit exists. The Grassberger-Procaccia algorithm444The term Grassberger-Procaccia algorithm is used generically for any algorithm that estimates the correlation dimension (and more generally the correlation integral) from the small- behavior of the correlation sum .[49] allows the correlation dimension to be approximated for as {singlespace}

 D=limϵ→0logC(ϵ;m,τ)logϵ (11)

It was not shown in [68], however, that this choice corresponds to the best choice of for the purposes of forecasting. In Section 9.1 I show that this is in fact not the case for all time series. Even so, it is a reasonable starting point, as this method is the gold standard in the associated literature. In my first round of experiments, and as a point of comparison, I select at the first minimum of the mutual information [33, 68] as calculated by mutual in the TISEAN package [53]. There are a few possible drawbacks to this method. For example, there is no guarantee that will ever achieve a minimum; a first-order autoregressive process, for example, does not [59]. Rosenstein et al. [95] argue that calculating mutual information is too computationally expensive. Several papers [51, 75] have argued that mutual information can give inconsistent results, especially with noisy data.

##### 1.2.3 Geometric and Topological Methods

There are several geometric and topological methods for approximating that address some of the shortcomings of mutual information, including: wavering product [67], fill factor, integral local deformation [18], and displacement from diagonal [95], among others. Most of these methods have the distinct advantage of attempting to solve for both and simultaneously, albeit at the cost of being more complicated and less computationally efficient. (This additional computational overhead is not a factor in my reduced-order framework as I explicitly fix .)

##### 1.2.4 Wavering Product

The wavering product of Liebert et al. [67] is a topological method for simultaneously determining embedding dimension and time delay. This approach focused on detecting when the attractor is properly unfolded, i.e., the situation in which projection-induced overlap disappears.

Liebert et al. focused on preserving neighborhood relations of points in . When transitioning from to , an embedding preserves neighborhood relations of every point in , i.e., inner points remain inner points, and analogously with the boundary points. If these neighborhood relations are preserved, then is a sufficient embedding dimension. The so-called “direction of projection” [67] that mitigates false crossings is associated with the best choice of , i.e., the that yields (for a fixed dimension) the smallest amount of overlap. To this end, they defined two quantities {singlespace}

 Q1(i,k,m,τ)=distτm+1(i,j(k,m))distτm+1(i,j(k,m+1)),Q2(i,k,m,τ)=distτm(i,j(k,m))distτm(i,j(k,m+1)) (12)

where, is the standard Euclidean distance measured in between an reference point in and its nearest neighbor in or similarly for , the nearest neighbor of in . To determine if the neighborhood relations are preserved in the embedding, they defined the wavering product {singlespace}

 Wi(m,τ)=((Nnb∏k=1Q1(i,k,m,τ)Q2(i,k,m,τ))1/(2Nnb) (13)

where is the number of neighbors used in each neighborhood. If , then the topological properties are preserved locally by the embedding [67]. In order to compute this globally, Liebert et al. defined the average wavering product as {singlespace}

 ¯¯¯¯¯¯W(m,τ)=ln⎡⎣1NrefNref∑i=1Wi(m,τ)⎤⎦ (14)

where is the number of reference points, typically chosen to be about 10% of the signal. A minimum of as a function of yields an optimal for that choice of . They also showed that a sufficient embedding dimension can be found when converged to zero. Choosing the embedding parameters in this way guarantees that the embedding faithfully preserves neighborhood relations. This is particularly important when forecasting based on neighbor relations.

This technique works very well on many systems, including the Rössler system and the Mackey-Glass system. In particular, Liebert et al. showed that choosing and in this way allowed for accurate estimation of the information dimension [49]. Noise, however, is a serious challenge for this heuristic [62], so it may not be useful for real-world datasets.

##### 1.2.5 Integral Local Deformation

Integral local deformation, introduced by Buzug & Pfister in [18], attempts to maintain continuity of the dynamics on the reconstructed attractor: viz., neighboring trajectories remain close for small evolution times. The underlying rationale is that false crossings will cause what look like neighboring trajectories to separate exponentially in very short evolution time. Integral local deformation quantifies this. Buzug & Pfister show that choosing and to minimize this quantity gives an approximation of that minimizes false crossings created by projection.

In my work, I rely strongly on the continuity of the reconstructed dynamics, since I use the image of neighboring trajectories for forecasting. Integral local deformation seems useful at first glance for choosing a that helps to preserve the continuity of the underlying dynamics in the face of projection. However, the computational complexity of this measure makes it ineffective for on-the-fly adaptation or selection of .

##### 1.2.6 Fill Factor

In [18], Buzug & Pfister introduced a purely geometric heuristic for estimating . This method attempts to maximally fill the embedding space by spatially spreading out the points as far as possible. To accomplish this, Buzug & Pfister calculate the average volume of a large number of -dimensional parallelepipeds, spanned by a set of arbitrarily chosen -dimensional delay vectors. They then show that the first maximum of the average of these volumes as a function of (for a fixed ) maximizes the distance between trajectories. This method is computationally efficient, as no near-neighbor searching is required. However, for any attractor with multiple unstable foci, there is no significant maximum of the fill factor as a function of [95, 18]. In addition, this method cannot take into account overfolding, as an overfolded embedding may be more space-filling than the “properly” unfolded counterpart [95]. This consideration is corrected (at the cost of additional computational complexity) in the method described next.

##### 1.2.7 Average Displacement / Displacement from Diagonal

The average displacement method introduced by Rosenstein et al. [95], which is also known as the displacement from diagonal method [59], also seeks a that causes the embedded attractor to fill the space as much as possible, while mitigating error caused by overfolding and also addressing some other concerns [18]. Rosenstein et al. define the average displacement (from diagonal) for as {singlespace}

 ⟨Sm(τ)⟩=1NN∑i=1 ⎷m−1∑j=1(xi+jτ−xi)2 (15)

For a fixed , increases with increasing (at least initially; the attractor may collapse for large due to overfolding). Rosenstein et al. suggest choosing and where the slope between successive drops to around 40% of the slope between and , where and are the first and second choices of . In noisy data sets, this leads to consistent and accurate computation of the correlation dimension. However, this—like most heuristics—was developed to correctly approximate dynamical invariants (e.g., correlation dimension), and comes with no guarantees about forecast accuracy.

###### Remark.

Several papers (e.g., [64, 77, 95, 105, 85]) have claimed that the emphasis should be placed on the window size rather than or independently. The basic premise behind this idea is that it is more important to choose to span an important time segment (e.g., mean orbital period) than the actual choice of either or independently. This is something I have not found to be the case when choosing parameters for delay reconstruction-based forecasting.

#### 1.3 Traditional Methods for Estimating the Embedding Dimension m

As the embedding dimension is not a parameter in my reduced-order algorithm, I only review a few important methods for estimating it. This discussion is important mainly because these conventions are the point of departure (and comparison) for my work.

A scalar time series measured from a dynamical system is a projection of the original state space onto a one-dimensional sub-manifold. A fundamental concern in the theoretical embedding dimension requirement is to ensure that the embedding has enough dimensions to “spread out” and thus avoid false crossings. Such crossings violate several properties of deterministic dynamical systems, e.g., determinism, uniqueness and continuity. In Figure 4(a), for example, the embedding of the -coordinate of the Rössler system contains trajectory crossings. In Figure 4(b), however, the top-right region of Figure 4(a) has folded under the attractor and the intersections on the top left of Figure 4 have become a “tunnel.”

The issue here is that the dynamics do not have enough space to spread apart in two dimensions. However when this dimension is increased, the attractor can spread out and the intersections disappear. According to [99], choosing ensures that the attractor has enough space to spread out and false crossings will not occur. More precisely, the probability of a crossing occurring in a ball of size is . Recall from Section 1.1, however, that this is for an infinitely long noise free time series and may not hold in practice, as noise can easily cause false crossings and violate the assumptions that went into this estimation. It should also be noted that, even if I knew the capacity dimension of the system—which is generally not the case—I do not necessarily want to choose to be . This is a (generally loose) sufficient bound that should ensure the correctness of the embedding. But it is often the case that the embedding unfolds completely before . The Lorenz system [71] has , for example. [99] would suggest using , but in fact this system can be embedded properly using [62].

Naïvely, it may seem that simply choosing an “extremely large” would be a simpler and completely reasonable choice. This is not true, in practice. First, the complexity of many of the algorithms that deduce information about dynamical systems scale exponentially with  [68]. Worse yet, each noisy data point creates noisy embedding points in the reconstruction  [24]. This amplification of noise quickly destroys the usefulness of an embedding. In light of both of these concerns, good values for the minimal are highly sought after. For a noisy real-valued time series, this is still an open problem, but there exist several heuristic approximations (e.g., [67, 20, 64, 18, 62, 53, 59]). Recall, too, that several of the methods presented in the previous section for estimating e.g.,wavering product [67] and integral local deformation [18]—simultaneously estimate both the delay and the dimension, —the other free parameter in the embedding process.

There are two standard classes of methods for estimating the minimal , the method of dynamical invariants and the method of false neighbors. In the following sections, I review the basics of these two families.

##### 1.3.1 Method of Dynamical Invariants

Dynamical invariants, such as correlation dimension, are topological measures of a system that persist under diffeomorphism. In theory, this means that once a particular choice of embedding dimension, say , yields a topologically valid reconstruction, increasing should have no impact on these dynamical invariants. This is the case because in theory every will be topologically conjugate, to one another and to the original dynamics. This implies that dynamical invariants will become persistent for increasing , once has been reached. Hence, choosing the first for which dynamical invariants stop changing is a good way to estimate the minimal dimension needed to obtain a topologically valid reconstruction. The class of methods that is the topic of this section follows directly from this logic: to choose , one approximates some dynamical invariant (e.g., dominant Lyapunov exponent or correlation dimension) for a range of embedding dimensions, choosing the first embedding dimension for which it becomes persistent, and then corroborates with other dynamical invariants.

For example, one can approximate the correlation dimension for a range of embedding dimensions using the Grassberger & Procaccia algorithm [49], choosing the first for which that approximation stops changing. Then one corroborates this choice by approximating the dominant Lyapunov exponent for a range of (using for example the algorithm in [120]), then choosing the first where this result stops changing. Finally, one ensures these two estimates of are consistent with each other.

Recall, though, that noise in the data can impact any dynamical invariant calculation, and that that impact increases with  [24]. It is more often the case that there is a range of embedding dimensions for which the dynamical invariant being approximated stays “fairly consistent.” Ascertaining this is computationally expensive and requires time-intensive post processing and human interpretation. For these reasons, it is common to use an alternative heuristic, such as those covered in the next section, to narrow down the search to a smaller range of embedding dimensions and then select from this range using the method of dynamical invariants.

##### 1.3.2 The Method of False Neighbors

The method of false neighbors was proposed by Kennel et al. in [62]. This heuristic searches for points that appear close only because the embedding dimension is too small. Consider a point on the top of the tunnel in Figure 4(b) and a point directly below this point on the planar part of the Rössler attractor. These two points are near neighbors in because the tunnel collapses down on the planar region; however, they are not near neighbors in because the embedded attractor inflates, separating points on the top of the tunnel from the points on the planar region. Since these two points are neighbors in and not neighbors in , they are false near(est) neighbors at . Consider, in contrast, two neighboring points on the top of the tunnel in . If the space is projected down to , these points would still be neighbors.

More formally, Kennel et al. define the nearest neighbor of as a false near(est) neighbor if {singlespace}

 (distτm+1(i,j(k,m))2−distτm(i,j(k,m))2distτm(i,j(k,m))2)1/2>Rtol (16)

where is some tolerance. Recall that the is the distance between the point and its nearest neighbor . But notice for delay vectors that , so this condition simplifies to {singlespace}

 |xi−mτ−xj(k,m)−mτ|distτm(i,j(k,m))>Rtol (17)

In particular, a neighbor is a false neighbor if the distance between the two points in is significantly more (viz., ) than the distance between the two neighbors in . Kennel et al. claim that choosing a single nearest neighbor is sufficient (i.e., ) [62]. In addition, they claim that empirically seems to give robust results. This tolerance can be interpreted as defining false neighbors as points that are 10 times farther apart in than in .

This heuristic alone is not enough to distinguish chaos from uniform noise and can incorrectly classify time series constructed from a uniform distribution as having low-dimensional dynamics. Kennel et al. found that for a uniformly distributed random time series, on average, the nearest neighbor of a point is not near at all. Rather, , where . That is, the average distance to the nearest neighbor is the size of the attractor. To handle this, they defined a secondary heuristic , where is another free parameter chosen as 2.0 without justification. I want to note that this heuristic is added to distinguish pure-uniform noise from chaotic dynamics, not to aid in estimating embedding dimension for noisy observations of a chaotic system.

For a time series with noise, near-neighbor relations—which are the basis for this class of heuristics—can cause serious problems in practice. For well-sampled, noise-free data, it makes sense to choose as the first embedding dimension for which the ratio of true to false neighbors goes to zero [62]. For noisy data however, this is unrealistic; in practice, the standard approach is to choose the first for which the percentage of false near(est) neighbors drops below 10%. If topological correctness is vitally important for the application, a range of embedding dimensions for which the percentage of false near(est) neighbors drops to around % is typically chosen and then this range is refined using the method of dynamical invariants described above. This 10% is an arbitrary threshold, however; depending on the magnitude of noise present in the data, it may need to be adjusted, as may and . For example, in the computer performance data presented in Section 6.1.2, the percentage of false near(est) neighbors rarely dropped below even 20%.

Recently an extension of the false near(est) neighbor method was proposed by Cao [20], which attempts to get around the three different tolerances (, and the percentage of false neighbors) in [62]. Cao points out that the tolerances—in particular, —need to be specified on a per-time-series and even per-dimension basis. Assigning these tolerances universally is inadvisable and in many cases will lead to inconsistent estimates. In [20], he illustrates that different choices of these three tolerances result in very different estimates for . To get around this, he defines an alternative heuristic that is tolerance free {singlespace}

 E(m)=1N−mτN−mτ∑i=1distτm+1(i,j(k,m))distτm(i,j(k,m)) (18)

While the inside of the sum is very similar to Equation 17 of [62], the numerator is slightly different: instead of . That is, the former measures the distance between an element and its -nearest neighbor in measured in , whereas the latter measures the change in distance between , and the same vectors extended in . Cao then defines and shows that when stops changing, a sufficient embedding dimension has been found. He also claims that if does not stop changing, then one is observing noise and not deterministic dynamics [20]. Cao does admit that it is sometimes hard to determine if the curve is just slowly growing but will plateau eventually (in the case of high dimensional dynamics) or just constantly growing (in the case of noise). To deal with this, he defines a secondary heuristic to help distinguish these two cases. As this method has been shown to give more consistent , I hoped that this method could provide a more accurate comparison point. However, I was never able to successfully replicate the results in [20] on any experimental data, so I chose to use the traditional version of this algorithm proposed by Kennel et al. in [62].

The astute reader may have noticed a similarity between the method of false neighbors [62], the method of wavering products  [67], and the methods of Cao [20]. It is true that these methods are quite similar. In fact, almost all the methods for determining minimum embedding dimension [67, 20, 64, 18, 62, 53, 59] are based in some way on minimizing the number of false crossings. As this parameter is not important in my work, I do not go into all of these nuances but simply use the standard false near(est) neighbor approach to which the rest of these methods are fundamentally related. In particular, I use the TISEAN [53] implementation of this algorithm (false_nearest) to choose with a threshold on the percentage of neighbors and the and selected by the TISEAN implementation. In my later discussion, I refer to the reconstruction produced in this manner as an embedding of the data. This is by no means perfect, but since it is the most widely used method for estimating , it is the most useful for the purposes of comparison.

#### 1.4 Delay-Coordinate Embedding Reality Check

As discussed in Section 1.1, the theory of delay-coordinate embedding [99, 113] outlines beautiful machinery to reconstruct—up to diffeomorphism—the dynamics of a system from a scalar observation. Unfortunately this theoretical machinery requires both infinitely-long and noise-free observations of the dynamical system: luxuries that are never afforded to a time-series analyst in practice. While there has been a tremendous amount of informative literature on estimating the free parameters of delay-coordinate embedding, at the end of the day these heuristics are just that: empirical estimates with no theoretical guarantees. This means that, even if the most careful rigorous in-depth analysis is used to estimate and , there is no way to guarantee, in the experimental context, that the reconstructed dynamics are in fact diffeomorphic to the observed dynamics.

Even worse, overestimating has drastic impacts on the usefulness of the embedding, as it exponentially amplifies the noise present in the reconstruction. If little usable structure is present in a time series in the first place, perverting this precious structure by amplifying noise is something a practitioner can ill afford to do. Moreover, the methods that are most commonly used for estimating are based on neighbor relations, which are easily corrupted by noisy data. As a result, these heuristics tend to overestimate .

In addition to noise amplification concerns and the lack of theoretical guarantees, the methods for estimating minimal embedding dimension are highly subjective, dependent on the estimate of , and require a great deal of human intuition to interpret correctly. This time-consuming, error-prone human-intensive process makes it effectively impossible to use delay-coordinate embedding for automated or ‘on-the-fly’ forecasting. As stated in Chapter Prediction in Projection: A new paradigm in delay-coordinate reconstruction, this is unfortunate because delay-coordinate embedding is such a powerful modeling framework. My reduced-order framework—the foundation of this thesis—will, I hope, at least partially rectify this shortcoming.

### 2 Information Theory Primer

In this section, I provide a basic overview of notation and concepts from Shannon information theory [103], as well as a review of some more-advanced topics that are utilized throughout the thesis. I will first cover the basics; an expert in this field can easily skip this part. I will then move on to non-traditional topics viz., multivariate information theory (Section 2.3), methods for computing information measures on real-valued time series (Section 2.5), and measures to quantify the predictability of a real-valued time series (Section 2.6).

#### 2.1 Entropy

Perhaps the most fundamental concept or building block in information theory is the concept of Shannon Entropy.

###### Definition ((Shannon) Entropy [103]).

Let be a discrete random variable with support and a probability mass function that maps a possible symbol to the probability of that symbol occurring, e.g., , where is the probability that an observation is measured to be . The average amount of information gained by taking a measurement of and thereby specifying an observation is the Shannon Entropy (or simply entropy) of , defined by {singlespace}

 H[Q]=−n∑i=1p(qi)log(p(qi)) (19)

Throughout this thesis, is calculated with base two, so that the information is in bits. The entropy can be interpreted as the amount of “surprise” in observing a measurement of a discrete random variable , or equivalently the average uncertainty in the outcome of a process, or the amount of “information” in each observation of a process.

###### Example 2 (Entropy of fair and biased coins).

First consider a fair coin: and . {singlespace}

 H[Q] = −[p(h)log(p(h))+p(t)log(p(t))] (20) = −[12log(12)+12log(12)] (21) = −(−12+−12)=1 (22)

At every flip of the coin there is one bit of “new” information, or one bit of surprise. In contrast, consider an extremely biased coin with heads on both sides: and , . Then , i.e., there are zero bits of “new” information at each toss, as the coin always gives heads.

To gain an intuitive understanding of what phrases like ‘one bit of “new” information’, or ‘one bit of surprise’ mean, it is sometimes easier to interpret Equation (19) as the average number of (optimal) yes-no questions one needs to ask in order to determine what the outcome of observing a system will be. Returning to the coin-flip example above, since the fair coin had , on average, one (optimal) question needs to be asked to determine the outcome of the coin flip: “Was the coin a head?” With the biased coin, however, the entropy was zero, which means on average no questions were needed in order to infer the observation was a head (it always is!). The following example clarifies this.

###### Example 3 (Entropy of Animal-Vegetable-Mineral [28]).

You may have, at some point during your childhood, played the game “Animal-Vegetable-Mineral.” If not, the rules are simple: player one thinks of an object, and by a series of yes-no questions, and the other players attempt to guess the object. “Is it bigger than a breadbox?” No. “Does it have fur?” Yes. “Is it a mammal?” No. This continues until the players can guess the object.

As anyone who has played this game will attest, some questions are better than others—for example, you usually try to focus on general categories first (hence, the name of the game itself—is it an animal?) then get more specific within that category. Asking on the first round “is it a dissertation?” is likely to waste time—unless, perhaps, you are playing with a graduate student who is about to defend.

If a game lasts too long, you may begin to wonder if there exists an optimal set of questions to ask. “Could I have gotten the answer sooner, if I had skipped that useless question about the fur?” A moment’s reflection shows that, in fact, the optimal set of questions depends upon the player: if someone is biased towards automobiles, it would be sensible to focus on questions that specify make, model, year, etc. You could then imagine writing down a script for this player: “first ask if it is a car; then if yes, ask if it is domestic, if no, ask if it is a Honda…;” or for my nieces: “first ask if it’s Elsa from Frozen.” (It almost always is.)

For each player and their preferences (i.e., for every probability distribution over the things that player might choose), there is an optimal script. And for each optimal script for a given person, the game will last five rounds, or ten rounds, or seven, or twenty, depending on what they choose that time. Profoundly, the number of questions you have to ask on average for a particular person and optimal script pair is given by Equation (19). In particular, we are measuring information (and uncertainty): the average number of yes-no questions we’ll need to ask to find out an answer.

Understanding entropy is important to the rest of the discussion in this chapter as it is the fundamental building block of all other information-theoretic quantities.

#### 2.2 Mutual Information

It is often interesting to consider how knowledge about something informs us about something else. People carrying umbrellas, for example, tells us something about the weather; it is not perfect but (informally) if you tell me something about the weather, you also reduce my uncertainty about umbrella-carrying [28]. To constructively introduce the so-called mutual information, I will adapt the next example from [28].

How can we quantify the information between the weather and umbrella-carrying ? For simplicity, I will assume you only get to see one person—who is either carrying () or not carrying () an umbrella, i.e., with probability .

Now assume that there are some finite number of weather types (say “rain”, “cloudy”, “windy” etc., labeled with ), each with a probability of occurring, . Then from Section 2.1, the uncertainty in the weather is simply {singlespace}

 H[W]=−N∑i=1p(wj)log(p(wj)) (23)

We are interested in the probability of seeing a particular weather type given that we see the person carrying an umbrella. For this, consider the conditional probability of weather type given that you see someone carrying an umbrella—. Generally, will be higher than when is labeling weather with precipitation and is not, so the uncertainty of the weather given that someone is carrying an umbrella is then {singlespace}

 H[W|u1]=−p(u1)∑jp(wj|u1)log(p(wj|u1)) (24)

or in words, “the uncertainty about the weather, given that the person who walked in was carrying an umbrella.” Similarly, for the reverse case, we could compute the associated uncertainty , to determine “the uncertainty about the weather, given that the person who walked in was not carrying an umbrella.” Combining (summing) these two we get the conditional entropy between two variables (weather type and state of umbrella-carrying in this example).

###### Definition (Conditional Entropy [103]).

Define and to be discrete random variables with support and respectively. Then the conditional entropy is defined as {singlespace}

 H[Q|R]=−∑ip(ri)∑jp(qj|ri)log(p(qj|ri)) (25)

where is the conditional probability of given .

We can then quantify the “reduction in uncertainty” in the weather given that someone is carrying an umbrella by , and the reverse with . Note that the reduction can be positive or negative—in some climates, seeing your colleague not carrying an umbrella will make you more uncertain about the weather. Consider, for example, an extremely rainy climate; it is either sunny, cloudy, or rainy, but most often rainy. You are generally quite certain about the weather before you see your colleague (it is raining). So when they walk through the door without their umbrella, you think it is less likely to be raining, and so you are more uncertain (the options sunny, cloudy, or rainy are now more evenly balanced).

Now consider the “average reduction in uncertainty” of the weather given the state of umbrella carrying {singlespace}

 I[W,U] = p(u1)(H[W]−H[W|u1])+p(u2)(H[W]−H[W|u2]) (26) = H[W]−(p(u1)H[W|u1]+p(u2)H[W|u2]) (27) = H[W]−H[W|U] (28)

This is called the mutual information; it tells us how much less uncertain we are, on average, about given that we know .

###### Definition (Mutual Information).

Define and to be discrete random variables with support and respectively, and let be the entropy of and be the conditional entropy. Then the mutual information between and is defined as {singlespace}

 I[Q,R] = −∑i,jp(qj,ri)logp(qj,ri)p(qj)p(ri) (29) = H[Q]−H[Q|R] (30)

Note: [33].

In the next section, I extend this discussion to information shared between more than two variables. In the language of this section, that is equivalent to the situation where I have two or more colleagues with umbrellas and and I want to know the average reduction in uncertainty of the weather given the state of and , i.e., . This is unfortunately not a straightforward generalization and there is little agreement in the literature about interpreting or even defining multivariate mutual information.

#### 2.3 I-Diagrams and Multivariate Mutual Information

The mathematical definitions of multivariate mutual information can get quite confusing to interpret, especially when comparing and contrasting the difference in these definitions. To clarify this discussion, I will use -Diagrams of Yeung [121]—a highly useful visualization technique for interpreting information theoretic quantities.

##### 2.3.1 I-Diagrams

Figure 5 shows -diagrams of some of the important quantities introduced in Sections 2.1 and  2.2: (a) entropy , (b) joint entropy (c) conditional entropy and (d) mutual information . In -Diagrams, each circle represents the uncertainty in a particular variable and the shaded region is the information quantity of interest, e.g., in (a) we are interested in —the uncertainty in —so the entire circle is shaded. Figure 5(b) introduces a new measure: joint entropy . is uncertainty about processes and ; this is easily depicted in an -Diagram by simply shading both circles.

The real magic of -Diagrams comes from their ability to depict more-complex information theoretic measures by simply manipulating shaded regions. For example, recall from Section 2.2 that conditional entropy —Figure 5(b)—is the uncertainty about process given knowledge of . One way of writing this is : i.e., subtracting the shaded regions in (a) and (b) produces the shaded region in (c). The same can be done with mutual information. Recall from Section 2.2 that is the shared uncertainty between and or : i.e., subtracting the shaded region in (a) from the shaded region in (c) produces the shaded region in (d). While obviously not a proof, this kind of approach allows us to easily build intuition about more complicated identities, e.g., symmetry of mutual information:

In the next section, I will use -diagrams to explore three common interpretations of multivariate mutual information, interaction information [76] (also commonly called the co-information [10]), the binding information [87] (also called the dual total correlation [50]), and total correlation [118] (also commonly called multi-information [111]).

#### 2.4 Multivariate Mutual Information

When interpreting using -Diagrams, the situation is quite simple, as there is exactly one region of “shared uncertainty;” when generalizing even to three variables , the situation becomes much more confusing—and this is reflected in the mathematical uncertainties as well. Consider the generic three-variable -Diagram in Figure 6. Instead of having one region of overlap as in Figure 5(d), there are now four. There are three standard ways of shading each these regions to quantify .

One interpretation is the so-called interaction information [76, 10] {singlespace}

 C[Q,R,S]≡I[Q,R,S]≡ −(H[Q]+H[R]+H[S]) +(H[Q,R]+H[Q,S]+H[R,S]) −H[Q,R,S] (31)

As depicted in Figure 6(b), this is the intersection of , and . It describes the reduction in uncertainty that any two processes (e.g., and ), together, provide regarding the third process (e.g., and ). While this may seem like the natural extension of mutual information, it does not take into account the information that is shared between the two process but not with the third. One common criticism of this interpretation is that is quite often negative. For example, when the shared information between is due entirely to information in , the interaction information can be negative as well as positive. Many interpretations of negative information have been provided—e.g., that the variable inhibits (i.e., accounts for or explains some of) the correlation between {Q,R}—but in general negative information is frowned upon [1].

The next obvious step is to take into account the information that is shared between any two process but not shared with the third, as well as the information shared between all three processes. This is called the binding information [87, 50] {singlespace}

 B[Q,R,S]≡I[Q,R,S]≡H[Q,R,S]+[3∑i=1H[X(i−1)%3,X(i+1)%3])−H[Q,R,S]] (32)

where , , and , and % is the modulus operator. This quantity is depicted in Figure 6(c). has the nice feature that it is always positive, but it equally weights information contained in two variables as information contained in all three. The total correlation [118, 111] {singlespace}

 M[Q,R,S]≡I[Q,R,S]≡I[X0,X1,X2]≡2∑\mathclapi=0(H[Xi])−H[X0,X1,X2] (33)

depicted in Figure 6(d) addresses this shortcoming, but is equally criticized for over emphasizing information that is shared by all three variables.

The total correlation and binding information are both always positive but their relative merits are a subject of contention. For a nice comparison of these and discussion of the associated issues, please see [1]. The takeaway of this section should be that extending mutual information as defined in Section 2.2 to even the three-variable case, let alone beyond that, is non-trivial and not well understood at all. This will become very important in Section 11, where I propose a new information-theoretic method for selecting delay reconstruction parameters.

#### 2.5 Estimating Information from Real-Valued Time-Series Data

Note that all the information theory discussed thus far has been on discrete random variables. The topic of this thesis, however, involves real-valued time series. To compute any information measure on real-valued time series, one must “symbolize” that data, i.e., map the real values to a set of discrete symbols. Ideally, this symbolization should preserve the information and/or dynamics of the original time series, but this can be hard to accomplish in practice. The processes by which this is accomplished, and the issues that make it difficult, are the focus of this section.

##### 2.5.1 Simple Binning

A common (and by far the simplest) symbolization method is binning. To symbolize a real-valued time series by binning, one breaks the time series support into bins, which need not be equally spaced. Then one defines a discrete random variable to have a symbol for each bin , i.e., has support . The associated probability mass function is then computed using {singlespace}

 p(bi)=|{j|xj∈bi}|N (34)

For example, consider a time series with support on and bins and . Then one simply estimates the probability mass function associated with and by counting the number of time-series elements that appear in each subinterval and .

This method is extremely simple, but simplicity is often a double-edged sword. Binning is a very fast and efficient symbolization, but it is known to introduce severe biasing and spurious dynamics if the bin boundaries do not happen to create a so called generating partition of the dynamics [63, 13].

###### Definition (Generating Partition).

Given a dynamical system on a measure space , a finite partition is said to be generating if the union of all images and preimages of gives the set of all -measurable sets . In other words, the “natural” tree of partitions always generates some sub--algebra, but if it gives the full -algebra of all measurable sets , then is called generating [97].

Unfortunately, even for most canonical dynamical systems, let alone all real-valued time series, the generating partition is not known or computable. Even when the partition is known it can be fractal, as is the case with the Hénon map, for example, and thus useless for creating a finite partition. For a good review of the difficulties in finding a generating partition see [26]. I review this in more detail in Section 2.6.

##### 2.5.2 Kernel estimation methods

A useful alternative to simple binning is a class of methods known as kernel estimation [100, 34], in which the relevant probability density functions are estimated via a function with a resolution or bandwidth that measures the similarity between two points in space.555 In the case of delay-coordinate reconstruction, Given points and in , one can define {singlespace}

 ^pρ(qi,ri)=1NN∑i′=1Θ(qi−q′iri−r′i−ρ) (35)

where and . That is, is the proportion of the pairs of points in space that fall within the kernel bandwidth of , i.e., the proportion of points similar to . When is the max norm, this is the so-called box kernel. This too, however, can introduce bias [70] and is obviously dependent on the choice of bandwidth . After these estimates, and/or the analogous estimates for , are produced, they are then used directly to compute local estimates of entropy or mutual information for each point in space, which are then averaged over all samples to produce the entropy or mutual information of the time series. For more details on this procedure, see [70].

A less biased method to perform kernel estimation when one is interested in computing mutual information is the Kraskov-Stügbauer-Grassberger (KSG) estimator [63]. This approach dynamically alters the kernel bandwidth to match the density of the data, thereby smoothing out errors in the probability density function estimation process. In this approach, one first finds the nearest neighbor for each sample (using max norms to compute distances in and ), then sets kernel widths and accordingly and performs the pdf estimation. There are two algorithms for computing with the KSG estimator [70]. The first is more accurate for small sample sizes but more biased; the second is more accurate for larger sample sizes. I use the second of the two in the results reported in this dissertation, as I have fairly long time series. This algorithm sets and to the and distances to the nearest neighbor. One then counts the number of neighbors within and on the boundaries of these kernels in each marginal space, calling these sums and , and finally calculates {singlespace}

 I[Q,R]=ψ(k)−1k−⟨ψ(nq)+ψ(nr)⟩+ψ(n) (36)

where is the digamma function666The formula for the other KSG estimation algorithm is subtly different; it sets and to the maxima of the and distances to the nearest neighbors.. This estimator has been demonstrated to be robust to variations in as long as  [70].

In this thesis, I employ the Java Information Dynamics Toolkit (JIDT) implementation of the KSG estimator [70]. The computational complexity of this implementation is , where is the length of the time series and is the number of neighbors being used in the estimate. While this is more expensive than traditional binning ), it is bias corrected, allows for adaptive kernel bandwidth to adjust for under- and over-sampled regions of space, and is both model and parameter free (aside from , to which it is very robust).

#### 2.6 Estimating Structural Complexity and Predictability

An understanding of the predictive capacity of a real-valued time series—i.e., whether or not it is even predictable—is essential to any forecasting strategy. In joint work with Ryan James, I propose to quantify the complexity of a signal by approximating the entropy production of the system that generated it. In general, estimating the entropy (production) of an arbitrary, real-valued time series is a challenging problem, as discussed above, but recent advances in Shannon information theory—in particular, permutation entropy [9, 32]—have reduced this challenge. I review this class of methods in this section.

For the purposes of this thesis, I view the Shannon entropy—in particular, its growth rate with respect to word length (the Shannon entropy rate)—as a measure of the complexity and hence the predictability for a time series. Time-series data consisting of i.i.d. random variables, such as white noise, have high entropy rates, whereas highly structured time-series—for example, those that are periodic—have very low (or zero) entropy rates. A time series with a high entropy rate is almost completely unpredictable, and conversely. This can be made more rigorous: Pesin’s relation [91] states that in chaotic dynamical systems, the Kolmogorov-Sinai (KS) entropy is equal to the sum of the positive Lyapunov exponents . These exponents directly quantify the rate at which nearby states of the system diverge with time: . The faster the divergence, the larger the entropy. The KS entropy is defined as the supremum of the Shannon entropy rates of all partitions—i.e., all possible choices for binning [92]. As an aside, an alternative definition of the generating partition defined above is a partition that achieves this supremum.

From a different point of view, I can consider the information (as measured by the Shannon entropy) contained in a single observable of the system at a given point in time. This information can be partitioned into two components: the information shared with past observations—i.e., the mutual information between the past and present—and the information in the present that is not contained in the past (viz., “the conditional entropy of the present given the past”). The first part is known as the redundancy; the second is the aforementioned Shannon entropy rate. Again working with R. G. James, I establish that the more redundancy in a signal, the more predictable it is [41, 40]. This is discussed in more detail in Chapter Prediction in Projection: A new paradigm in delay-coordinate reconstruction.

Previous approaches to measuring temporal complexity via the Shannon entropy rate [102, 74] required categorical data: for some finite or countably infinite alphabet . Data taken from real-world systems are, however, effectively777Measurements from finite-precision sensors are discrete, but data from modern high-resolution sensors are, for the purposes of entropy calculations, effectively continuous. real-valued. So for this reason I need to symbolize the time series, as discussed above. The methods discussed above however, are generally biased or fragile in the face of noise.

Bandt and Pompe introduced the permutation entropy (PE) as a “natural complexity measure for time series” [9]. Permutation entropy involves a method for symbolizing real-valued time series that follows the intrinsic behavior of the system under examination. This method has many advantages, including robustness to observational noise, and its application does not require any knowledge of the underlying mechanisms of the system. Rather than looking at the statistics of sequences of values, as is done when computing the Shannon entropy, permutation entropy looks at the statistics of the orderings of sequences of values using ordinal analysis. Ordinal analysis of a time series is the process of mapping successive time-ordered elements of a time series to their value-ordered permutation of the same size. By way of example, if then its ordinal pattern, , is since . The ordinal pattern of the permutation is .

###### Definition (Permutation Entropy).

Given a time series , define as all permutations of order . For each , define the relative frequency of that permutation occurring in {singlespace}

 (37)

where quantifies the probability of an ordinal and is set cardinality. The permutation entropy of order is defined as {singlespace}

 PE(ℓ)=−∑π∈Sℓp(π)log2p(π) (38)

Notice that [9]. With this in mind, it is common in the literature to normalize permutation entropy as follows: . With this convention, “low” PE is close to 0 and “high” PE is close to 1. Finally, it should be noted that the permutation entropy has been shown to be identical to the Kolmogorov-Sinai entropy for many large classes of systems [7], as long as observational noise is sufficiently small. As mentioned before, PE is equal to the Shannon entropy rate of a generating partition of the system. This transitive chain of equalities, from permutation entropy to Shannon entropy rate via the KS entropy, allows one to approximate the redundancy of a signal—being the dual of the Shannon entropy rate—by .

In this thesis, I utilize a variation of the basic permutation entropy technique, the weighted permutation entropy (WPE), which was introduced in [32]. The intent behind the weighting is to correct for observational noise that is larger than the trends in the data, but smaller than the larger-scale features. Consider, for example, a signal that switches between two fixed points and contains some additive noise. The PE is dominated by the noise about the fixed points, driving it to , which in some sense hides the fact that the signal is actually quite structured. To correct for this, the weight of a permutation is taken into account {singlespace}

 w(xℓi+1)=1ℓi+ℓ∑j=i(xj−¯xℓi+1)2 (39)

where is a sequence of values , and