# Empirical risk minimization and complexity of dynamical models

Kevin McGoff, Andrew B. Nobel
###### Abstract.

A dynamical model consists of a continuous self-map of a compact state space and a continuous observation function . Dynamical models give rise to real-valued sequences when is applied to repeated iterates of beginning at some initial state. This paper considers the fitting of a parametrized family of dynamical models to an observed real-valued stochastic process using empirical risk minimization. More precisely, at each finite time, one selects a model yielding values that minimize the average per-symbol loss with the observed process. The limiting behavior of the minimum risk parameters is studied in a general setting where the observed process need not be generated by a dynamical model in the family and minimal conditions are placed on the loss function.

We establish a general convergence theorem for minimum risk estimators and ergodic observations, showing that the limiting parameter set is characterized by the projection of the observed process onto the set of processes associated with the dynamical family, where the projection is taken with respect to a joining-based distortion between stationary processes that is a generalization of the $\bar{d}$-distance. We then turn our attention to observations generated from an ergodic process plus an i.i.d. noise process and study conditions under which empirical risk minimization can effectively separate the signal from the noise. The key, necessary condition in the latter results is that the family of dynamical models has limited complexity, which is quantified through a notion of entropy for families of infinite sequences. Close connections between entropy and limiting average mean widths for stationary processes are established.

## 1. Introduction

Empirical risk minimization is a common approach to model fitting and estimation in a variety of parametric and non-parametric problems. In this paper we investigate the use of empirical risk minimization to fit a family of dynamical models to an observed stochastic process. Formally, a dynamical model $(T, f)$ consists of a continuous transformation $T : \mathcal{X} \to \mathcal{X}$ on a compact metric space $\mathcal{X}$, and a continuous observation function $f : \mathcal{X} \to \mathbb{R}$. Let $T^k$ denote the $k$-fold composition of $T$ with itself, and let $T^0$ be the identity map on $\mathcal{X}$. From each initial state $x \in \mathcal{X}$ the dynamical model yields a real-valued sequence

$$\{ f(T^k x) = f \circ T^k(x) : k \geq 0 \} \subseteq \mathbb{R},$$

obtained by applying the observation function $f$ to a deterministic sequence of states generated by repeated iteration of the transformation $T$. In general, $f$ need not be injective, so one cannot necessarily recover the underlying states from the sequence $(f(T^k x))_{k \geq 0}$.
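As a rough illustration of how a single dynamical model produces observations, the following Python sketch iterates a map and records the observed values. The doubling map on the circle and the observation function `f(x) = |x - 1/2|` are assumptions chosen for illustration and do not come from the paper; `observe` is a hypothetical helper name.

```python
from typing import Callable, List


def observe(T: Callable[[float], float], f: Callable[[float], float],
            x0: float, n: int) -> List[float]:
    """Return [f(x0), f(T x0), ..., f(T^{n-1} x0)]."""
    out = []
    x = x0
    for _ in range(n):
        out.append(f(x))
        x = T(x)
    return out


# Assumed example model: the doubling map on the circle [0, 1),
# observed through a continuous function that is not injective.
def T(x: float) -> float:
    return (2.0 * x) % 1.0


def f(x: float) -> float:
    return abs(x - 0.5)


seq_a = observe(T, f, 0.25, 6)
seq_b = observe(T, f, 0.75, 6)
```

In this sketch the initial states 0.25 and 0.75 produce the same observed sequence, which is one simple way in which the underlying states can fail to be recoverable from the observations.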

In what follows we consider an indexed family of dynamical models defined on a common compact metric space satisfying the following conditions:

1. the index set is a compact metric space;

2. the map from to is continuous;

3. the map from to is continuous.

Condition (D2) ensures that each transformation is continuous and that the action of is continuous in . Condition (D3) ensures that each observation function is continuous and that observations vary continuously with . In particular, there exists a constant such that for every and .

Dynamical models are inherently deterministic, as the sequence of observations generated by a model is fully determined once the initial condition is given. As noted in Section 1.2.1, each dynamical model has a set of invariant measures and these measures give rise to a family of stationary processes. In the large sample limit, fitting a family of dynamical models leads directly to a variational problem involving its associated processes. Conditions (D1)-(D3) ensure that the set of associated processes is non-empty and that the limiting variational problem is well-defined. Note, however, that we make no explicit assumptions about the invariant measures of any individual model.

In this paper our primary focus is on model families having limited complexity. Roughly speaking, we require that the -covering numbers of the length- real sequences produced by models in the family grow sub-exponentially in at all scales. A precise notion of entropy for model families is given in Section 1.4. In general, the elements of the family may reflect chaotic behavior, or they may capture low order regularities such as periodicity, almost periodicity, or hierarchical structure. Examples and further discussion of model families can be found in Section 1.1. Fitting a family of such models to observations of a stochastic process amounts to looking for these types of regularities in the process.

Let be a family of dynamical models that capture some behaviors of interest, and let be an observed stationary ergodic process. Suppose that we wish to identify regularities in by fitting the observed values of the process with models in . Note that we do not assume that the observed process is generated by a process in , or that itself is of limited complexity. Let be a nonnegative loss function that is jointly lower semicontinuous in its arguments. We require the following integrability condition:

$$\mathbb{E}\Bigl[ \sup_{|u| \leq K_{\mathcal{D}}} \ell(u, Y_0) \Bigr] < \infty. \tag{C1}$$

(If the supremum in (C1) is not measurable, then one may replace the expectation by an outer expectation.) For each , , and define

$$R_n(\theta : x) = \frac{1}{n} \sum_{k=0}^{n-1} \ell\bigl( f_\theta \circ T_\theta^k(x), Y_k \bigr)$$

to be the empirical risk of the model with initial state relative to the first observations of . We formalize empirical risk minimization as follows.

###### Definition 1.1.

A sequence of measurable functions , , will be called (empirical) minimum risk estimates for if

$$\lim_n \inf_x R_n(\hat{\theta}_n : x) = \lim_n \inf_\theta \inf_x R_n(\theta : x) \quad \text{w.p.1}, \tag{1.1}$$

where .

###### Remark 1.2.

Existence of the limit on the right hand side of (1.1) follows from Kingman’s subadditive ergodic theorem (under (C1)). Existence of the limit on the left hand side of (1.1) is part of the definition.
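The empirical risk and its minimization can be sketched numerically as follows. This is a minimal sketch under stated assumptions, not the authors' procedure: the family of circle rotations observed through a cosine, the squared loss, the noise-free observations, and the finite grid that stands in for the infima over the parameter and the initial state are all choices made here for illustration.

```python
import math
from itertools import product


def empirical_risk(theta, x, ys, T, f, loss):
    """R_n(theta : x): average loss between the model output and ys."""
    total, state = 0.0, x
    for y in ys:
        total += loss(f(theta, state), y)
        state = T(theta, state)
    return total / len(ys)


# Assumed family: rotations of the circle [0, 1) observed through a cosine.
def T(theta, x):
    return (x + theta) % 1.0


def f(theta, x):
    return math.cos(2.0 * math.pi * x)


def squared_loss(u, y):
    return (u - y) ** 2


grid = [i / 20 for i in range(20)]      # finite grid standing in for Theta and X
theta_true, x_true = grid[6], grid[2]   # 0.3 and 0.1

# Noise-free observations generated by a model in the family.
ys, state = [], x_true
for _ in range(50):
    ys.append(f(theta_true, state))
    state = T(theta_true, state)

# Grid search: the infima over theta and x are replaced by minima over the grid.
best_risk, theta_hat, x_hat = min(
    (empirical_risk(th, x, ys, T, f, squared_loss), th, x)
    for th, x in product(grid, grid)
)
```

Because the cosine is even, the pair (0.7, 0.9) generates the same sequence as (0.3, 0.1) up to rounding, so two parameter values fit these observations equally well. This is a small instance of why the limiting object is naturally a set of parameters rather than a single point.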

Our primary goal is to investigate and characterize the limiting behavior of minimum risk estimates. Three main results are presented, corresponding to three levels of generality. At the highest level, we provide a variational characterization of the limiting behavior of minimum risk estimates (Theorem 1.12). We then focus on a natural signal plus noise setting and show that if the family of dynamical models has low complexity then empirical risk minimization effectively separates the signal and the noise (Theorem 1.18). In the low complexity setting, we show that if the signal arises from a model in the family and the noise is appropriately centered with respect to the loss, then the minimum risk estimates are consistent (Theorem 1.21). A negative result (Proposition 1.22) shows that empirical risk minimization can be inconsistent when the family of dynamical models has high complexity.

Beyond the results above, the main contribution of our work is a systematic treatment of identifiability and complexity for families of dynamical models, with a focus on the misspecified case in which the observed process is not generated by a model in the family. We avoid assumptions on the invariant measures of the models in by working with the processes they generate. In particular, the limit set of empirical risk estimators is characterized in terms of a distortion-based projection of the observed process onto the family of processes associated with . In addition, we introduce an entropy-based definition of complexity for families of dynamical models and families of infinite sequences that may be of independent interest. Our notion of entropy has close connections with topological entropy, studied in dynamical systems, and with stochastic mean widths, studied in empirical process theory.

Both the statements and proofs of our results rely on the concept of joinings, which are stationary couplings of stochastic processes. Joinings, introduced by Furstenberg [8], have been well-studied in ergodic theory, but have not been widely applied to problems of statistical inference. Our results show that joinings are intimately connected with minimum risk fitting of dynamical models. Several tools from the theory of joinings, including disjointness and relatively independent joinings, play an important role in our analysis.

### 1.1. Examples

Here we present several examples of families of dynamical models that capture potentially interesting types of regularities. Under some additional assumptions, these families have zero entropy, and therefore our general results apply to these models. In some cases, these or similar families have already been fit to data by applied scientists (e.g. [23, 41]), albeit without any theoretical guarantees of consistency.

###### Example 1.3.

(Toral rotations and almost periodicity) Let the state space be the -dimensional torus , which is the direct product of circles, . For a vector , define the transformation to be the rotation of by the angle vector , i.e. (addition in ). Then let be a compact set of continuous functions from to (with respect to the topology induced by the supremum norm). Let , and define the family of dynamical models . With these definitions, is a continuous family of dynamical models, and . Fitting this family to a process amounts to looking for periodic or “almost periodic” (also known as “quasi-periodic”) structure in the observations. Intuitively, one is looking for up to independent “periods” in a process.
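A toral rotation model of this kind can be sketched in a few lines of Python. The sketch assumes that each circle is identified with the interval [0, 1), so that angles are measured in turns, and it uses a hypothetical observation function (a sum of cosines); neither choice comes from the paper.

```python
import math


def rotate(alpha, x):
    """Rotation of the d-torus [0, 1)^d by the angle vector alpha."""
    return tuple((xi + ai) % 1.0 for xi, ai in zip(x, alpha))


def f(x):
    # Hypothetical continuous observation function on the torus.
    return sum(math.cos(2.0 * math.pi * xi) for xi in x)


def observations(alpha, x0, n):
    """First n observed values f(x0), f(T x0), ..., for the rotation by alpha."""
    out, x = [], x0
    for _ in range(n):
        out.append(f(x))
        x = rotate(alpha, x)
    return out


# Rational angles give a periodic sequence (period 4 here).
periodic = observations((0.25, 0.5), (0.0, 0.0), 8)
# Rationally independent angles give an almost periodic sequence.
quasi = observations((math.sqrt(2) - 1.0, math.sqrt(3) - 1.0), (0.0, 0.0), 8)
```

With two rationally independent angles, the observed sequence never repeats exactly but returns arbitrarily close to its earlier values, which is the kind of structure a fit of this family would look for.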

As a specific example of a setting in which such models might arise, one may consider restricted classes of dynamic gene regulatory networks that exhibit periodic behavior. Inference of gene regulatory networks from observed data is considered an important problem in systems biology [21]. In recent years, it has become increasingly feasible for experimentalists to assay the abundance of all the genes in a given system with regular frequency over time. In such cases, one would like to infer the structure of the underlying network from the observed gene expression dynamics [23].

In many situations, one would expect gene regulatory networks to be zero entropy systems. In particular, the networks studied in chronobiology should exhibit periodic dynamics by definition. Examples of such systems include the cell cycle and circadian oscillators.

###### Example 1.4.

(Subcritical logistic family and ecology) Since at least the early work of May [22], simple parametric families of dynamical systems have been used by ecologists as models of the population dynamics of many species [20]. In many instances, various types of deterministic models have been fit to ecological data (e.g. [41]).

The prototypical family in this context is the logistic family, which may be parametrized as follows. Consider the state space and the family of maps , where for . If we restrict to the region , then the family of dynamical models will have zero entropy. This situation is thought to occur in many naturally occurring populations (see results and discussion from [13]).
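For concreteness, the following Python sketch iterates a logistic map. It assumes the standard parametrization `T_a(x) = a * x * (1 - x)` on the unit interval. The parameter values 2.5 and 3.2 are illustrative choices only and are not meant to delimit the zero-entropy region referred to above.

```python
def logistic(a, x):
    """Logistic map T_a(x) = a * x * (1 - x) on [0, 1] (assumed parametrization)."""
    return a * x * (1.0 - x)


def orbit(a, x0, n):
    """First n points x0, T_a(x0), T_a^2(x0), ... of the orbit of x0."""
    xs, x = [], x0
    for _ in range(n):
        xs.append(x)
        x = logistic(a, x)
    return xs


settled = orbit(2.5, 0.2, 300)      # approaches the fixed point 1 - 1/a = 0.6
two_cycle = orbit(3.2, 0.2, 1000)   # approaches a cycle of period 2
```

For these two parameter values the orbits settle onto a fixed point and a two-point cycle respectively, which are simple examples of the regular behavior that a low-complexity family can capture.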

###### Example 1.5.

(Symbolic dynamics and quasicrystals) In recent years, interest has grown in certain materials called quasicrystals. These materials were discovered in 1982 by Shechtman and exhibit an interesting mix of properties [36]. In particular, they have long-range aperiodic order (in contrast to crystals, which have long-range periodic order). Such systems often exhibit a rigid hierarchical structure at all scales, and they are frequently modeled using symbolic dynamical systems [5, 35, 38], such as substitution systems. A full discussion of these systems is beyond the scope of this paper. However, if one were interested in looking for hierarchical structure in observational data, then these types of dynamical models could be used. Many families of such models naturally have zero entropy, as it is often forced by the long-range order and rigid hierarchical structure.

### 1.2. Preliminaries

In this section we introduce some concepts and notation that will be useful in what follows.

#### 1.2.1. Processes associated with dynamical models

Let be a dynamical model on a compact metrizable state space . Recall that a Borel probability measure on is said to be invariant under if for all Borel sets . Let be the set of Borel measures on that are invariant under , which is nonempty (see [45, p.152]). To each measure there is an associated real-valued process

$$U = f(X), f(TX), f(T^2 X), \ldots$$

where has distribution . The invariance of under ensures that is stationary. Here and in what follows we will regard real-valued processes as measures on the infinite product space equipped with its Borel sigma-field in the standard product topology.

###### Definition 1.6.

Let be a family of dynamical models. For each let

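$$\mathcal{Q}_\theta = \bigl\{ U = (f_\theta \circ T_\theta^k(X))_{k \geq 0} : X \sim \mu \text{ with } \mu \in \mathcal{M}(\mathcal{X}, T_\theta) \bigr\}$$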

be the set of stationary processes associated with , and let be the set of processes associated with the entire family of models .

#### 1.2.2. Joinings and distortion for stationary processes

The statements and proofs of our principal results rely critically on stationary couplings of stationary processes, which are known as joinings.

###### Definition 1.7.

A joining of two stationary processes and is a stationary process such that has the same distribution as and has the same distribution as . Let denote the family of all joinings of and .

By definition, a joining of two stationary processes is a coupling of the processes that is itself stationary. Note that the family always contains the so-called independent joining under which and are independent copies of and , respectively. Joinings were introduced by Furstenberg [8], and have been widely studied in ergodic theory [6, 9]. For notational convenience, we will frequently use to denote a joining of with . Also, note that the joining of three or more stationary processes may be defined analogously.

###### Definition 1.8.

Let be a nonnegative loss function. The -distortion between two stationary processes and is given by

$$\gamma_L(U, V) = \inf_{\mathcal{J}(U, V)} \mathbb{E}\bigl[ L(U_0, V_0) \bigr].$$

###### Remark 1.9.

Joinings were used by Ornstein [30, 31, 32] to define the $\bar{d}$-distance between finite alphabet stationary processes based on the Hamming metric $L(u, v) = \mathbb{I}(u \neq v)$. The $\bar{d}$-distance was extended by Gray et al. [10] to stationary processes with general alphabets and to arbitrary metrics $L$. The distortion $\gamma_L$ is a straightforward generalization of these distances to nonnegative loss functions $L$ that need not be metrics on $\mathbb{R}$.

###### Remark 1.10.

The fact that the infimum defining runs over the set of joinings, rather than the set of couplings, is critical. A minimizing joining makes the average loss between elements of the process as small as possible over the entire future. By contrast, a minimizing coupling would make the processes as close as possible at time zero, without regard to their behavior in the future. Moreover, ergodic properties of the processes can severely constrain the set of possible joinings. For instance, for many pairs of processes and , the only joining in is the independent joining. Such processes are called disjoint.
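One consequence of this remark can be written out explicitly. Since the independent joining always belongs to the set of joinings, it gives an upper bound on the distortion, and the bound is attained when the two processes are disjoint. In the display below, $P_{U_0}$ and $P_{V_0}$ denote the distributions of the time-zero coordinates; this notation is introduced here for illustration only.

```latex
\gamma_L(U, V)
  = \inf_{\mathcal{J}(U, V)} \mathbb{E}\bigl[ L(U_0, V_0) \bigr]
  \;\leq\; \iint L(u, v) \, dP_{U_0}(u) \, dP_{V_0}(v),
\qquad \text{with equality if } U \text{ and } V \text{ are disjoint.}
```

The right hand side depends only on the one-dimensional marginals, so for disjoint processes the distortion carries no information about how the two processes evolve in time.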

### 1.3. Convergence of Minimum Risk Estimates

As noted above, a family of dynamical models corresponds to a family of stationary processes. The problem of fitting models in to an observed ergodic process using the loss has a population analog in which we seek processes in that minimize the distortion with . The solution to the population problem is the -projection of onto , and the corresponding set of parameters is a natural limit set for empirical risk estimators. This leads to the following definition, which is given for general loss functions.

###### Definition 1.11.

Let be a family of dynamical models parametrized by . Given a nonnegative loss function and a stationary ergodic process , let

$$\Theta_L(Y) = \operatorname*{argmin}_{\theta \in \Theta} \min_{U \in \mathcal{Q}_\theta} \gamma_L(U, Y),$$

which is the set of parameters such that some process in minimizes the distortion with .

The proof of the following theorem, which relies on results of McGoff and Nobel [26], is presented in Section 2.

###### Theorem 1.12.

Let be a family of dynamical models satisfying (D1)-(D3), and let be a lower semicontinuous loss function. If is a stationary ergodic process satisfying (C1), then is non-empty and compact and

1. any sequence of minimum risk estimators converges almost surely to ;

2. for each , there exists a sequence of minimum risk estimators that converges almost surely to .

We emphasize that there is no assumed relationship between the observations and the family . Additionally, identifiability of parameters is addressed in a direct way, through the distortion via the families .

Theorem 1.12 shows that the limiting behavior of minimum risk estimators is characterized by the family , and in this way the theorem reduces the asymptotic analysis of empirical risk minimization to the analysis of this limit set. We show below how analysis of yields both positive results (e.g. consistency) and negative results (inconsistency) in a signal plus noise setting.

### 1.4. Entropy for sequence families

A key issue in nonparametric inference is how to assess the complexity of a family of models. Complexity measures play an important role in establishing consistency, convergence rates, and optimality for a variety of inference procedures. Although fitting nonlinear dynamical models to ergodic processes is substantially different from model fitting for classification or regression, complexity still plays an important role in the consistency of empirical risk minimization.

We assess the complexity of a family through the covering numbers of the infinite real-valued sequences generated by its constituent models. From a statistical point of view, it is natural to consider empirical $L_p$ covering numbers with $p < \infty$, while from a dynamical systems point of view, it is natural to consider empirical $L_\infty$ covering numbers, as is done with topological entropy [45]. As we show below, these two approaches coincide.

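Let $u = (u_k)_{k \geq 0}$ and $v = (v_k)_{k \geq 0}$ denote infinite sequences in $\mathbb{R}^{\mathbb{N}}$. For each $n \geq 1$ and $1 \leq p \leq \infty$, define pseudo-metrics $d_{n,p}$ as follows: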

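$$d_{n,p}(u, v) = \begin{cases} \Bigl( n^{-1} \sum_{k=0}^{n-1} |u_k - v_k|^p \Bigr)^{1/p} & \text{if } 1 \leq p < \infty, \\ \max_{0 \leq k \leq n-1} |u_k - v_k| & \text{if } p = \infty. \end{cases}$$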

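Let $\mathcal{U} \subseteq \mathbb{R}^{\mathbb{N}}$ be a family of infinite sequences. For each $r > 0$ let $N(\mathcal{U}, r, d_{n,p})$ denote the covering number of the set $\mathcal{U}$ under the pseudo-metric $d_{n,p}$ at radius $r$. Let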

$$h_p(\mathcal{U}, r) = \limsup_n \frac{1}{n} \log N(\mathcal{U}, r, d_{n,p}),$$

which is the exponential growth rate of the covering numbers at radius $r$, and define the entropy of the family $\mathcal{U}$ as the supremum of these growth rates, namely

$$h_p(\mathcal{U}) = \lim_{r \searrow 0} h_p(\mathcal{U}, r).$$

The following result is established in Section 3.

###### Theorem 1.13.

The entropies for are all equal.
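The pseudo-metrics and covering numbers above can be explored numerically. The following Python sketch is only suggestive: it uses a finite sample of sequences from an assumed family (circle rotations observed through a cosine), and a greedy construction whose size is an upper bound on the covering number of that finite sample. The names `d_np` and `greedy_cover` are hypothetical.

```python
import math


def d_np(u, v, n, p):
    """Pseudo-metric d_{n,p} on the first n coordinates of u and v."""
    diffs = [abs(u[k] - v[k]) for k in range(n)]
    if p == math.inf:
        return max(diffs)
    return (sum(d ** p for d in diffs) / n) ** (1.0 / p)


def greedy_cover(seqs, r, n, p):
    """Greedily pick centers so that every sequence is within r of a center."""
    centers = []
    for s in seqs:
        if all(d_np(s, c, n, p) > r for c in centers):
            centers.append(s)
    return centers


# Assumed family: sequences cos(2*pi*(x + k*theta)) over a grid of (theta, x).
N_MAX = 40
grid = [i / 12 for i in range(12)]
family = [
    [math.cos(2.0 * math.pi * (x + k * th)) for k in range(N_MAX)]
    for th in grid for x in grid
]

# Size of the greedy cover and its per-symbol logarithm as n grows.
for n in (10, 20, 40):
    size = len(greedy_cover(family, 0.25, n, math.inf))
    print(n, size, math.log(size) / n)
```

Since $d_{n,p} \leq d_{n,\infty}$ for every $p$, any cover under the maximum pseudo-metric is also a cover under the averaged ones. The content of the theorem is that the exponential growth rates nevertheless agree in the limit of small radius.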

###### Remark 1.14.

Although it is not needed here, we note that Theorem 1.13 holds more generally for sets of sequences where is any metric space such that

$$\lim_{r \searrow 0} r \log N(A, r, \rho) = 0$$

and the pseudo metrics are defined in terms of .

###### Definition 1.15 (Entropy of a dynamical family).

The entropy of a family of dynamical models is the common value of , where

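$$\mathcal{U}_{\mathcal{D}} = \bigl\{ (f_\theta \circ T_\theta^k(x))_{k \geq 0} : x \in \mathcal{X}, \theta \in \Theta \bigr\} \subseteq \mathbb{R}^{\mathbb{N}} \tag{1.2}$$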

is the set of infinite sequences generated by models in .

###### Remark 1.16.

It is straightforward to show that is a compact subset of in its product topology. Let be the left-shift map defined by for . Then it is easy to see that is continuous and . Thus is a topological dynamical system that captures the dynamics of the family , and the entropy defined above is the topological entropy of this system.
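The invariance of the sequence family under the left shift can be checked directly on finite prefixes: shifting the sequence generated from an initial state gives the sequence generated from the image of that state under the transformation. The Python sketch below assumes a circle rotation observed through a cosine; the model and the helper names are illustrative and not taken from the paper.

```python
import math


def T(theta, x):
    # Assumed example: rotation of the circle [0, 1) by theta.
    return (x + theta) % 1.0


def f(theta, x):
    # Assumed continuous observation function.
    return math.cos(2.0 * math.pi * x)


def sequence(theta, x, n):
    """First n terms of (f_theta(T_theta^k x))_{k >= 0}."""
    out = []
    for _ in range(n):
        out.append(f(theta, x))
        x = T(theta, x)
    return out


def left_shift(u):
    """Drop the first coordinate: (u_0, u_1, ...) -> (u_1, u_2, ...)."""
    return u[1:]


theta, x0, n = 0.3, 0.1, 25
shifted = left_shift(sequence(theta, x0, n + 1))
restarted = sequence(theta, T(theta, x0), n)
```

The two lists agree term by term, which is the finite-prefix version of the statement that the shifted sequence again belongs to the family.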

We note that the entropy may also be characterized in terms of the entropies of the processes in , which we define in Section 3.3. The following lemma is established in Appendix A.

###### Lemma 1.17.

For any family of dynamical models satisfying (D1)-(D3),

$$h(\mathcal{D}) = \sup_{U \in \mathcal{Q}_{\mathcal{D}}} h(U).$$

### 1.5. Signal Plus Noise

We now turn our attention to empirical risk minimization when the observed process has a signal plus noise structure. We assume in what follows that for each , where is a stationary ergodic process and is an i.i.d. noise process that is independent of . We indicate this relationship using the process-sum notation . We require the following integrability conditions:

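$$\mathbb{E}\Bigl[ \sup_{|x| \leq K_{\mathcal{D}}} \ell(x, Y_0) \Bigr] < \infty; \tag{C1}$$

$$\mathbb{E}\Bigl[ \sup_{|x| \leq K_{\mathcal{D}}} \ell(x, V_0) \Bigr] < \infty; \tag{C2}$$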

Note that (C1) is the same condition required in the general setting, and (C2) and (C3) refer only to and , respectively.

Theorem 1.12 ensures that any sequence of minimum risk estimators will converge to the set of optimal parameters for . Of interest here is when and whether empirical risk minimization can decouple the signal from the noise and recover the optimal parameters for the signal process . We begin with the following general result.

###### Theorem 1.18.

Let satisfy (C1)-(C3), and let satisfy (D1)-(D3). If , then any sequence of minimum -risk estimates converges almost surely to , where .

###### Remark 1.19.

Since is nonnegative and lower semicontinuous, the auxiliary loss function has the same properties (using Fatou’s Lemma for the lower semicontinuity).

###### Remark 1.20.

If a given process can be expressed in two different ways as and , then the proof of Theorem 1.18 shows that , where is defined using in place of .

Theorem 1.18 shows that if , then empirical risk minimization does indeed decouple the signal from the noise. However, the presence of the auxiliary loss function in the conclusion of the theorem raises the question of whether the limit set is equal to the limiting parameter set associated with the signal, as one would like. The following two results address this question under additional hypotheses.

###### Theorem 1.21.

Let satisfy (C1)-(C3), and let be a family of dynamical models satisfying (D1)-(D3) with . Let be any sequence of minimum risk estimators based on .

1. If is a Bregman divergence and , then converges almost surely to .

2. If is an ergodic process in and for all , with equality if and only if , then converges almost surely to .

Theorem 1.21 establishes that minimum risk fitting of a zero-entropy dynamical family effectively isolates the signal under two types of hypotheses. The first places restrictions on the loss function but allows the observation process to be quite general. The second places restrictions on the signal and on the joint behavior of the loss function and the noise processes.

Without the entropy condition , the conclusion of Theorem 1.21 does not hold in general, as the following proposition highlights. For , let be the family consisting of the single dynamical model associated to .

###### Proposition 1.22.

Let be a family of dynamical models satisfying (D1)-(D3). Suppose that

• ;

• there exists such that and contains an ergodic process .

Then there exists such that for every i.i.d. process with for , the least squares estimates derived from converge to a compact set such that . Moreover, such a family exists.

Under the conditions of this proposition, inconsistency holds almost surely despite the fact that the signal process is generated by a dynamical model in the family. The idea underlying this phenomenon is that since has positive entropy, it is capable of tracking the noise, and then the least squares estimates will overfit the observed sequence.

### 1.6. Squared loss and mean width

In this section we give special attention to the squared loss, . Analysis of the squared loss naturally leads to another measure of complexity of the family based on the mean width of sequences generated by with respect to the noise.

###### Definition 1.23.

Let be a family of dynamical models, and let be an i.i.d. process with mean zero and finite variance. The -sample mean width of relative to is

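$$\kappa_n(\mathcal{D} : \varepsilon) = \mathbb{E}\Bigl[ \sup_{x, \theta} \sum_{k=0}^{n-1} f_\theta \circ T_\theta^k(x) \cdot \varepsilon_k \Bigr]. \tag{1.3}$$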

Define the mean width of relative to to be the limiting linear growth rate of the finite sample mean widths,

$$\kappa(\mathcal{D} : \varepsilon) = \lim_n \frac{1}{n} \kappa_n(\mathcal{D} : \varepsilon), \tag{1.4}$$

which exists by subadditivity (see Remark 3.2). When we write as and refer to this quantity as the Gaussian mean width of the family .

Finite sample mean widths have been widely studied in machine learning and empirical process theory, with an emphasis on Rademacher and Gaussian noise [3, 18]. The mean width of has close connections with the entropy of .
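A finite sample mean width can be approximated by simulation. The following Python sketch is a rough Monte Carlo estimate under several assumptions made here for illustration: Rademacher noise, a family of circle rotations observed through a cosine, and a finite grid of parameters and initial states in place of the full supremum (so the estimate is biased downward). The function name `mean_width_estimate` is hypothetical.

```python
import math
import random


def mean_width_estimate(seqs, n, trials, rng):
    """Monte Carlo estimate of E[ max_s sum_{k<n} s[k] * eps_k ], Rademacher eps."""
    total = 0.0
    for _ in range(trials):
        eps = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        total += max(sum(s[k] * eps[k] for k in range(n)) for s in seqs)
    return total / trials


# Assumed family: circle rotations observed through a cosine, on a finite grid.
N = 30
grid = [i / 20 for i in range(20)]
family = [
    [math.cos(2.0 * math.pi * (x + k * th)) for k in range(N)]
    for th in grid for x in grid
]

rng = random.Random(0)
kappa_n = mean_width_estimate(family, N, trials=50, rng=rng)
per_symbol = kappa_n / N   # finite-n proxy for the limiting mean width
```

Repeating the computation for increasing `N` and watching whether `per_symbol` decays gives a heuristic indication of whether the limiting mean width is zero, although a simulation of this kind proves nothing about the limit.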

###### Theorem 1.24.

Let be i.i.d. with mean zero and finite variance. If then . Moreover, the Gaussian mean width if and only if .

###### Remark 1.25.

In general, one cannot expect a more quantitative relationship between asymptotic mean width and entropy. Indeed, the entropy is invariant under any continuous, invertible change of coordinates (see [45, p. 167]), whereas the mean width is not invariant under such operations. In particular, the asymptotic mean width scales linearly if the observation functions are multiplied by a fixed constant, while the entropy remains unchanged.
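The scaling claim in this remark can be made explicit. For a constant $c > 0$, write $\mathcal{D}_c$ for the family obtained by replacing each observation function $f_\theta$ with $c f_\theta$; this notation is introduced here for illustration and does not appear elsewhere in the paper.

```latex
\kappa_n(\mathcal{D}_c : \varepsilon)
  = \mathbb{E}\Bigl[ \sup_{x, \theta} \sum_{k=0}^{n-1} c \, f_\theta \circ T_\theta^k(x) \cdot \varepsilon_k \Bigr]
  = c \, \kappa_n(\mathcal{D} : \varepsilon),
\qquad\text{hence}\qquad
\kappa(\mathcal{D}_c : \varepsilon) = c \, \kappa(\mathcal{D} : \varepsilon).

N(\mathcal{U}_{\mathcal{D}_c}, r, d_{n,p}) = N(\mathcal{U}_{\mathcal{D}}, r / c, d_{n,p}),
\qquad\text{hence}\qquad
h_p(\mathcal{U}_{\mathcal{D}_c})
  = \lim_{r \searrow 0} h_p(\mathcal{U}_{\mathcal{D}}, r / c)
  = h_p(\mathcal{U}_{\mathcal{D}}).
```

Rescaling changes the radius at which the sequences are covered but not the limit as the radius tends to zero, whereas the mean width is multiplied by $c$.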

Let denote the -distortion in the special case that is the squared loss. Note that is in fact a metric on the space of -valued stationary stochastic processes. As the squared loss is a Bregman divergence, minimum risk fitting of a zero entropy family will converge to the optimal parameter set for the signal by Theorem 1.21. The next theorem extends this result to the case where the mean width of the family is zero.

###### Theorem 1.26.

Let , where is ergodic and is an i.i.d. process with mean zero and finite variance. If , then any sequence of least squares estimators converges almost surely to .

In our final result of this section, we establish the consistency of least squares estimation for a family of transformations on a compact state space in where each observation function is the identity. Suppose is compact and is a family of transformations on such that is a compact metric space and is continuous. Further, suppose that where is distributed according to an ergodic measure , and is i.i.d. with mean zero and finite variance and is independent of .

###### Corollary 1.27.

If the topological entropy of is zero for all , then any sequence of least squares estimators converges almost surely to the set .

The limit set in Corollary 1.27 contains and serves as the natural identifiability class of in this setting.

### 1.7. Related work

Following the development of the statistical learning theory for i.i.d. data by Vapnik and Chervonenkis [43], there has been substantial interest in learning from dependent data (e.g. [1, 2, 11, 14, 28, 47]). In many such cases, the authors consider learning from dependent processes satisfying some type of mixing condition (see [4]), and the results yield learning rates for prediction. Other results provide sufficient (and sometimes necessary) conditions for the consistency of certain prediction and classification procedures from dependent processes [27, 29]. The present work is concerned with parameter or feature estimation, which is related to yet distinct from the problems considered in the previous works.

The work of Ornstein and Weiss [33] studies the estimation of a stochastic process from its samples. In that work, the authors propose a particular inference procedure and characterize when it produces consistent estimates of the stochastic process in the $\bar{d}$ metric. In our work, we focus on a different procedure, based on empirical risk minimization as opposed to matching block frequencies, and we consider consistency in the context of restricted families of models, rather than in the class of all stationary ergodic processes.

Our work is also somewhat related to a line of research concerning least squares estimation (LSE) of individual sequences from noisy observations, which has been considered in a variety of contexts by several authors (see [34, 42, 46], among others). The most recent of these works ([34]) employs the tools of empirical process theory to prove results on both consistency and asymptotic normality of LSE of individual sequences from signal plus noise. In the present work, we consider only specific sets of individual sequences: those that arise from a continuous family of dynamical models, as in (1.2), and we are interested in inference of a dynamical invariant parameter (i.e. ), rather than the signal sequence itself. The dynamical invariance of our model sequences, along with the invariance of our inference target , allows us to obtain consistency of LSE under our general mean width (or entropy) conditions, rather than under the more restrictive complexity conditions appearing in [34]. Furthermore, our setting distinguishes itself from this related work by naturally handling a substantial degree of model misspecification.

In Furstenberg’s original work on joinings [8], he includes an application of joinings to a nonlinear filtering problem. Beyond this application, we are not aware of other uses of joinings in the literature on statistical inference.

In recent work of Lev, Peled, and Peres [19], some of Furstenberg’s original results have been substantially extended. This work concerns two problems: given an infinite sequence of signal plus noise, first detect whether the signal is non-zero, and second recover the signal from the given sequence. Although somewhat related to our work, these results differ significantly from ours. For one thing, the detection and filtering problems mentioned above place no restrictions beyond measurability on the detection and filtering mappings, whereas we specifically consider finitary inference via empirical risk minimization. Additionally, they consider arbitrary sets of individual sequences with proper specification (as in [34]), whereas we consider potentially misspecified parameter estimation of dynamical models.

Finally, we mention that many questions about statistical inference in the context of dynamical systems have been considered in a wide variety of subject areas; see the survey [25] for a broad overview and references. Let us mention that dynamical systems in the observational noise setting have been studied in [15, 16, 24], and statistical prediction in the context of dynamical systems has been considered in [11, 12, 39, 44].

### 1.8. Generalizations

Generalization of all of the definitions and results of the paper to -valued models and processes is straightforward, requiring only minor changes of notation. We omit the details. In a different direction, one could analyze families of dynamical models defined on a non-compact state space with uniformly bounded observation functions, requiring only measurability of the maps and . For families of this more general type, the set of associated sequences would not necessarily be a closed (hence compact) subset of , and in this case one needs to consider the closure of , along with all the stationary processes supported on this set. The analysis here can be carried out in this more general setting, but the corresponding results are difficult to interpret in the context of the original inference problem.

### 1.9. Outline of the rest of the paper

Section 2 contains the proof of the general convergence theorem. Proofs of the results on entropy and mean width appear in Section 3. The main inference results for the signal plus noise setting are established in Section 4. Section 5 contains the proof of the negative result Proposition 1.22.

## 2. Optimal tracking and proof of Theorem 1.12

In this section we discuss the connections between fitting dynamical models and the optimal tracking problem studied in [26]. In particular, we construct a single dynamical system that captures the important features of the family of dynamical models, and we show how this system may be analyzed in the context of optimal tracking.

### 2.1. Optimal tracking

The tracking problem for dynamical systems concerns two systems: a model system , where is compact and metrizable and is continuous, and an observed system , where is a separable completely metrizable space and is Borel measurable. Given an initial segment of a trajectory from the observed system, one seeks a corresponding initial condition such that the trajectory from the model system “tracks” the given trajectory from the observed system. An optimal tracking trajectory is chosen by minimizing an additive cost functional

$$ \sum_{k=0}^{n-1} c(S^k y, T^k z), $$

where is a fixed lower semicontinuous cost function.
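The minimization of the additive cost can be sketched numerically. The following is a minimal illustration under assumptions that are not taken from the paper: the model system is a hypothetical circle rotation `rotate`, the cost `circle_cost` is the squared circle distance, and the search runs over a finite grid of candidate initial conditions rather than the whole model state space.

```python
def tracking_cost(y_traj, z0, T, c):
    """Additive cost sum_{k<n} c(S^k y, T^k z0); y_traj lists the points S^k y."""
    total, z = 0.0, z0
    for y in y_traj:
        total += c(y, z)
        z = T(z)  # advance the model trajectory by one step
    return total

def best_initial_condition(y_traj, candidates, T, c):
    """Return the candidate initial state with the smallest tracking cost."""
    return min(candidates, key=lambda z0: tracking_cost(y_traj, z0, T, c))

# Hypothetical model system: rotation of the circle [0, 1) by ALPHA.
ALPHA = 0.1

def rotate(z):
    return (z + ALPHA) % 1.0

def circle_cost(y, z):
    """Squared distance on the circle (an illustrative choice of cost)."""
    d = abs(y - z) % 1.0
    return min(d, 1.0 - d) ** 2

# Hypothetical observed trajectory: the same rotation started at 0.3.
n = 20
observed = [(0.3 + ALPHA * k) % 1.0 for k in range(n)]
grid = [i / 100 for i in range(100)]
z_hat = best_initial_condition(observed, grid, rotate, circle_cost)
```

In this toy case the observed trajectory is itself a model trajectory, so the minimizer over the grid is the true initial state; the results of this section concern the general case where no exact match need exist.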

The results from [26] consider the situation when the observed initial condition is drawn from an ergodic measure and is bounded above by a function in . Here we state a version of the previous results that is sufficient for our purposes, for which we require a bit more notation. Let be a compact metrizable space, and let be a continuous map satisfying . We denote by the set of joinings of the process , where , with any process of the form , where for some such that .

###### Theorem A ([26]).

Let , , , , , and be as above. If and the following equality holds almost surely,

$$ \lim_n \frac{1}{n} \sum_{k=0}^{n-1} c(S^k y, T^k \hat{z}_n) = \lim_n \inf_{z \in Z} \frac{1}{n} \sum_{k=0}^{n-1} c(S^k y, T^k z), \tag{2.1} $$

then converges ( almost surely) to the non-empty, compact set

$$ \Theta_{\min} = \operatorname*{argmin}_{\theta \in \Theta} \, \min_{\mathcal{J}(\nu:\theta)} \mathbb{E}[c(Y_0, Z_0)]. \tag{2.2} $$

Furthermore, for any , there exists such that (2.1) holds and converges to .

### 2.2. Proof of Theorem 1.12

To begin the proof, we describe how fitting a continuous family of dynamical models to an observed stochastic process can be cast as a tracking problem. As an important first step, we define a single dynamical system that encapsulates the entire family of dynamical models. Consider the state space

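$$ Z = \bigl\{ (\theta, (f_\theta \circ T_\theta^k(x))_{k \ge 0}) : \theta \in \Theta, \ x \in X \bigr\} \subseteq \Theta \times \mathbb{R}^{\mathbb{N}}, $$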

and define the transformation by , which is clearly continuous (in the product topology). We now establish some basic properties of the dynamical system associated with the family of dynamical models.

###### Lemma 2.1.

The set is a compact subset of when is equipped with the product topology, and the map is continuous. If is an ergodic element of , then there exists and an ergodic process with distribution such that .

###### Proof.

By our hypotheses on the family , both the parameter space and the state space are compact, and therefore is compact. Define the map by

$$ \pi(\theta, x) = \bigl(\theta, (f_\theta \circ T_\theta^k(x))_{k \ge 0}\bigr). $$

It is clear from the definition of that is the image of under . To show that is compact, it now suffices to check that is continuous.

Let be a sequence converging to in . Let . The continuity conditions (D2) and (D3) imply that for ,

$$ \lim_n f_{\theta_n} \circ T_{\theta_n}^k(x_n) = f_\theta \circ T_\theta^k(x). $$

As was arbitrary, we have shown that converges to in in the product topology, and therefore is continuous.

The left-shift is continuous, and therefore is continuous.

For the last statement of the lemma, define the map as the skew-product over the identity, . By construction, we have . Thus, is a factor map from onto . It follows that the push-forward map from to (given by ) is a surjection [7, p. 19].

Now let be ergodic. Since the push-forward map from to (given by ) is a surjection, there exists an ergodic such that . Since , the induced measure on must be invariant under the identity map. Also, it must be ergodic, since is ergodic. As the only ergodic measures for the identity map are the point masses, we see that there exists such that . Then for some ergodic measure . Finally, we conclude that , where is the distribution of an ergodic process in . ∎
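The factor relation used in the last part of the proof can be checked on a toy example. The sketch below rests on assumptions that are not from the paper: a hypothetical rotation family `T(theta, x)` with observation `f(theta, x)`, sequences truncated to finitely many terms, and the combined map taken to be the parameter paired with the left-shifted sequence, which is one reading of the construction above.

```python
import math

def T(theta, x):
    """Hypothetical model map: rotation of [0, 1) by theta."""
    return (x + theta) % 1.0

def f(theta, x):
    """Hypothetical observation function (it ignores theta in this toy family)."""
    return math.cos(2 * math.pi * x)

def observe(theta, x, n):
    """First n terms of the sequence (f_theta(T_theta^k x))_{k >= 0}."""
    out = []
    for _ in range(n):
        out.append(f(theta, x))
        x = T(theta, x)
    return out

def pi(theta, x, n):
    """Truncated version of the map (theta, x) -> (theta, observation sequence)."""
    return (theta, observe(theta, x, n))

def shift(point):
    """Assumed combined map: keep theta, drop the first term of the sequence."""
    theta, seq = point
    return (theta, seq[1:])

theta, x, n = 0.123, 0.4, 12
lhs = pi(theta, T(theta, x), n - 1)   # apply the skew product, then pi
rhs = shift(pi(theta, x, n))          # apply pi, then the combined map
```

Both sides evaluate the same floating-point operations in the same order, so `lhs` and `rhs` agree exactly, mirroring the intertwining identity that makes the map a factor map.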

We now proceed with the proof of Theorem 1.12. The observed process gives rise to an observed dynamical system in the tracking problem, where , is the left shift , and is the distribution of on . Finally, as a cost function, we choose , where

$$ c(v, (\theta, u)) = \ell(u_0, v_0). $$

Then an application of Theorem A yields that any sequence of minimum -risk parameters converges almost surely to the set . Furthermore, this projection is nonempty and compact, and Theorem 1.12 (2) holds. We have thus proved Theorem 1.12.

## 3. Entropy and mean width

In this section we study the notions of entropy and mean width for families of dynamical models. We begin with entropy.

### 3.1. Entropy for families of dynamical models

Recall that the pseudo-metrics and the entropies were defined in Section 1.4. Additionally, we define to be the -ball of radius centered at .

Before proving the main results presented in Section 1.4, we make a simple observation. For , we have that . Hence, for any and ,

$$ N(U, \delta, d_{n,p}) \le N(U, \delta, d_{n,\infty}). \tag{3.1} $$
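The comparison behind (3.1) can be illustrated numerically. The sketch below assumes, since Section 1.4 is not reproduced here, that the finite-sample pseudo-metrics are the normalized p-th power average and the maximum over the first n coordinates; the function names `d_np` and `d_ninf` are hypothetical.

```python
import random

def d_np(u, v, n, p):
    """Assumed form: ((1/n) * sum_{k<n} |u_k - v_k|^p)^(1/p)."""
    return (sum(abs(a - b) ** p for a, b in zip(u[:n], v[:n])) / n) ** (1.0 / p)

def d_ninf(u, v, n):
    """Assumed form: max_{k<n} |u_k - v_k|."""
    return max(abs(a - b) for a, b in zip(u[:n], v[:n]))

rng = random.Random(0)  # fixed seed so the check is reproducible
n = 50
pairs = [
    ([rng.uniform(-1, 1) for _ in range(n)], [rng.uniform(-1, 1) for _ in range(n)])
    for _ in range(200)
]
# The averaged distance never exceeds the maximum distance.
dominated = all(
    d_np(u, v, n, p) <= d_ninf(u, v, n) + 1e-12
    for u, v in pairs
    for p in (1, 2, 4)
)
```

Under these assumed definitions, every covering set for the maximum pseudo-metric is also a covering set for the averaged one at the same radius, which is the content of (3.1).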

The following lemma, which bounds the cardinality of -separated sets that are contained in a single -ball, is used to prove Theorem 1.13. For notation, let denote the maximum cardinality of any set in that is -separated with respect to .

###### Lemma 3.1.

Let , and . Set . Then for any and ,

$$ M_n^\infty(B_{n,p}(u, \epsilon), \delta) \le (3K/\delta)^{\delta n/2} \cdot 2^{H(\delta/2) n}, $$

where .

###### Proof.

Suppose and there exists such that for . For the sake of notation, let and . By subtracting from all these sequences if necessary, we assume without loss of generality that for all .

The idea of the proof is to bound by estimating the number of coordinates of each that deviate from by more than . To begin, we define the following subsets of : , and , for , where . Now we code the points according to this partition: define by the relation , and let be given by

$$ \pi(v_j) = (r_j(0), \dots, r_j(n-1)). $$

First, observe that is injective. Indeed, if , then , and hence there exists such that , which implies . Second, observe that

$$ s = \lceil 2K/\delta \rceil \le 2K/\delta + 1 \le 3K/\delta. \tag{3.2} $$

Now we proceed with the main bounds. Since , we have that for each ,

$$ \epsilon^p n \ge \sum_{k=1}^{n} |v_j(k)|^p. $$

Furthermore, by construction, if , then , and therefore

$$ \epsilon^p n \ge \sum_{k=1}^{n} |v_j(k)|^p \ge (\delta/2)^p \, \bigl| \{ k : r_j(k) \ne 0 \} \bigr|. $$

From this inequality and the choice of , we deduce that

$$ \{ \pi(v_1), \dots, \pi(v_M) \} \subset \bigl\{ z \in \{0, \dots, s\}^n : \bigl| \{ k : z(k) \ne 0 \} \bigr| \le \delta n/2 \bigr\}. \tag{3.3} $$

By the facts established above (injectivity of , (3.3), and (3.2)) and a well-known inequality for binomial sums (see [37]),

$$ M \le (3K/\delta)^{\delta n/2} \sum_{k=0}^{\delta n/2} \binom{n}{k} \le (3K/\delta)^{\delta n/2} \cdot 2^{H(\delta/2) n}, $$

which completes the proof of the lemma. ∎
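The binomial sum inequality invoked in the last step can be checked numerically. The sketch below assumes that H denotes the binary entropy function in bits and that the inequality is applied with a ratio a of at most 1/2 (here a plays the role of δ/2); the helper names are hypothetical.

```python
import math

def binary_entropy(a):
    """Binary entropy in bits, with H(0) = H(1) = 0."""
    if a <= 0.0 or a >= 1.0:
        return 0.0
    return -a * math.log2(a) - (1 - a) * math.log2(1 - a)

def binomial_partial_sum(n, a):
    """Sum of C(n, k) over integers 0 <= k <= a * n."""
    return sum(math.comb(n, k) for k in range(math.floor(a * n) + 1))

def bound_holds(n, a):
    """Check sum_{k <= a n} C(n, k) <= 2^(H(a) n) on the log2 scale."""
    return math.log2(binomial_partial_sum(n, a)) <= binary_entropy(a) * n + 1e-9

# Scan a grid of lengths n and ratios a in (0, 1/2].
all_ok = all(
    bound_holds(n, a)
    for n in range(1, 80)
    for a in (0.05, 0.1, 0.2, 0.3, 0.4, 0.5)
)
```

Working on the log scale avoids overflow for large n, and the small tolerance only absorbs floating-point rounding.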

With this lemma in place, we now turn to the proof of Theorem 1.13.

Proof of Theorem 1.13. The inequality follows easily from (3.1) and the definition of entropy. The remainder of the proof is devoted to showing the reverse inequality. As in the definition of entropy, we let

$$ U = \bigl\{ (f_\theta \circ T_\theta^k(x))_{k \ge 0} : x \in X, \ \theta \in \Theta \bigr\}. $$

Since and are compact and the map is continuous, there exists such that for all and . Thus .

Let , and let be a maximal -separated set for with respect to . Note that . Now let , and let be an -covering set for with respect to with minimal cardinality. Note that . By the union bound,

 (3.4)

Applying Lemma 3.1 and the fact that in (3.4), we obtain

$$ M_n^\infty(U, \delta) \le N(U, \epsilon, d_{n,p}) \, (3K/\delta)^{\delta n/2} \, 2^{H(\delta/2) n}. \tag{3.5} $$

Since any maximal -separated set must be a -covering set, we have , and then from (3.5), we see that

$$ N(U, \delta, d_{n,\infty}) \le N(U, \epsilon, d_{n,p}) \, (3K/\delta)^{\delta n/2} \, 2^{H(\delta/2) n}. $$

Taking logarithms and dividing by $n$ yields

$$ \frac{1}{n} \log N(U, \delta, d_{n,\infty}) \le \frac{1}{n} \log N(U, \epsilon, d_{n,p}) + \frac{\delta}{2} \log(3K/\delta) + H(\delta/2) \log 2. $$

Thus letting $n$ tend to infinity gives

$$ h_\infty(U, \delta) \le h_p(U, \epsilon) + \frac{\delta}{2} \log(3K/\delta) + H(\delta/2) \log 2. $$

Since , taking the limit as decreases to zero allows us to conclude that .
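The final limit relies on the additive error term in the preceding bound vanishing as δ decreases to zero. A minimal numerical sketch of this, assuming H is the binary entropy in bits and taking the illustrative value K = 1 (the actual bound K depends on the family):

```python
import math

def binary_entropy(a):
    """Binary entropy in bits, with H(0) = H(1) = 0."""
    if a <= 0.0 or a >= 1.0:
        return 0.0
    return -a * math.log2(a) - (1 - a) * math.log2(1 - a)

def error_term(delta, K=1.0):
    """(delta / 2) * log(3K / delta) + H(delta / 2) * log 2, with natural logs."""
    return 0.5 * delta * math.log(3 * K / delta) + binary_entropy(delta / 2) * math.log(2)

# Evaluate along delta = 10^-1, ..., 10^-6 (K = 1 is an assumption).
deltas = [10.0 ** (-j) for j in range(1, 7)]
values = [error_term(d) for d in deltas]
```

Along this sequence the values decrease steadily toward zero, consistent with both summands being of order δ log(1/δ).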

### 3.2. Variational characterization of mean width

Here we collect a few facts regarding mean width, which are used elsewhere. Let us begin with the fact that the sequence of finite sample mean widths is subadditive.

###### Remark 3.2.

Using definition (1.3), one may easily check that for ,

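$$
\begin{aligned}
\kappa_{m+n}(D : \varepsilon) &\le \mathbb{E}\Bigl[ \sup_{x, \theta} \sum_{k=0}^{m-1} f_\theta \circ T_\theta^k(x) \cdot \varepsilon_k \Bigr] + \mathbb{E}\Bigl[ \sup_{x, \theta} \sum_{k=m}^{m+n-1} f_\theta \circ T_\theta^k(x) \cdot \varepsilon_k \Bigr] \\
&\le \kappa_m(D : \varepsilon) + \kappa_n(D : \varepsilon).
\end{aligned}
$$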

The last inequality above is a consequence of the stationarity of and the fact that for any and . Thus the sequence is subadditive, and therefore the limit in (1.4) exists.
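Subadditivity can be verified exactly on a small example. The sketch below makes several assumptions that are not from the paper: the finite-sample mean width is taken to be the unnormalized expected supremum, the noise is i.i.d. with values ±1 of equal probability, and the family is a hypothetical pair of rational rotations on a grid of `Q` points, chosen so that the state space is finite and closed under each map.

```python
import itertools
import math

Q = 7            # grid size: states are j / Q for j = 0, ..., Q - 1
PARAMS = (1, 2)  # hypothetical parameters: rotation by p / Q

def sequence(p, j, n):
    """First n terms of cos(2 pi x_k), where x_k = (j + p k) / Q mod 1."""
    return [math.cos(2 * math.pi * ((j + p * k) % Q) / Q) for k in range(n)]

def mean_width(n):
    """Exact E[sup_{x, theta} sum_{k<n} f(T^k x) eps_k] by enumerating all signs."""
    seqs = [sequence(p, j, n) for p in PARAMS for j in range(Q)]
    total = 0.0
    for signs in itertools.product((-1, 1), repeat=n):
        total += max(sum(u * e for u, e in zip(s, signs)) for s in seqs)
    return total / 2 ** n

kappa = {n: mean_width(n) for n in range(1, 9)}
subadditive = all(
    kappa[m + n] <= kappa[m] + kappa[n] + 1e-9
    for m in range(1, 5)
    for n in range(1, 5)
)
```

Because the grid is invariant under each rotation and the signs are i.i.d., the argument of the remark applies verbatim, so the check holds up to floating-point rounding.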

The following result provides a variational characterization of the mean width.

###### Theorem 3.3.

If is a family of dynamical models satisfying (D1)-(D3) and is a stationary ergodic process with finite mean, then

and the supremum is achieved.

###### Proof.

Using the same system appearing in Section 2 as the model system, the noise process in place of the observation process, and a different cost function (product instead of loss), Theorem 3.3 is a consequence of [26, Theorem 1.4]. ∎

### 3.3. Connecting entropy and mean width

In this section we investigate connections between the notions of entropy and mean width for continuous families of dynamical models. We begin by proving that a family with zero entropy must have zero mean width relative to centered i.i.d. processes. Our proof relies on a foundational result of Furstenberg concerning joinings, stated below as Theorem B.

Let be a stationary stochastic process taking values in a separable completely metrizable space . Let be a finite Borel partition of . For , and , let

and consider

$$ H_n(U, \pi) = - \sum_{A_0^{n-1} \in \pi^n} p(A_0^{n-1}) \log p(A_0^{n-1}), $$

with the convention that . By subadditivity, we may take the limit as tends to infinity:

$$ h(U, \pi) = \lim_n \frac{1}{n} H_n(U, \pi). $$

Then taking the supremum over all finite Borel partitions of gives the entropy of the process :

$$ h(U) = \sup_\pi h(U, \pi). $$
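The block entropies can be computed exactly for a simple process. The sketch below uses a hypothetical stationary two-state Markov chain (transition matrix `P`, stationary vector `PI`) with the partition into single symbols, and assumes that p of a word denotes the probability that the process visits the corresponding cells in order; none of these choices come from the paper.

```python
import itertools
import math

P = [[0.9, 0.1], [0.4, 0.6]]  # hypothetical transition matrix
PI = [0.8, 0.2]               # its stationary distribution (PI P = PI)

def cylinder_prob(word):
    """Probability that the stationary chain starts with the given word."""
    prob = PI[word[0]]
    for a, b in zip(word, word[1:]):
        prob *= P[a][b]
    return prob

def block_entropy(n):
    """H_n = -sum over words of length n of p(word) log p(word), with 0 log 0 = 0."""
    total = 0.0
    for word in itertools.product((0, 1), repeat=n):
        q = cylinder_prob(word)
        if q > 0:
            total -= q * math.log(q)
    return total

def entropy_rate():
    """Closed-form limit of H_n / n for a stationary Markov chain."""
    return -sum(PI[i] * P[i][j] * math.log(P[i][j]) for i in range(2) for j in range(2))

rates = [block_entropy(n) / n for n in range(1, 11)]
```

For a stationary Markov chain the block entropy satisfies H_n = H_1 + (n - 1) h, so the averages in `rates` decrease toward `entropy_rate()`.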

In proving Theorem 1.24, we will rely on the following result of Furstenberg.

###### Theorem B.

[8, Theorem I.2] If and is i.i.d., then the only joining of and is the independent joining.

Proof of Theorem 1.24. Let . Since , Lemma 1.17 implies that . Then by the result of Furstenberg (Theorem B above), the only joining of with is the independent joining. Thus, for any joining with , we have