# No-Regret Prediction in Marginally Stable Systems

## Abstract

We consider the problem of online prediction in a marginally stable linear dynamical system subject to bounded adversarial or (non-isotropic) stochastic perturbations. This poses two challenges. Firstly, the system is in general unidentifiable, so recent and classical results on parameter recovery do not apply. Secondly, because we allow the system to be marginally stable, the state can grow polynomially with time; this causes standard regret bounds in online convex optimization to be vacuous. In spite of these challenges, we show that the online least-squares algorithm achieves sublinear regret (improvable to polylogarithmic in the stochastic setting), with polynomial dependence on the system’s parameters. This requires a refined regret analysis, including a structural lemma showing the current state of the system to be a small linear combination of past states, even if the state grows polynomially. By applying our techniques to learning an autoregressive filter, we also achieve logarithmic regret in the partially observed setting under Gaussian noise, with polynomial dependence on the memory of the associated Kalman filter.

## 1 Introduction

We consider the problem of sequential state prediction in a linear time-invariant dynamical system, subject to perturbations:

$$x_t = A x_{t-1} + B u_{t-1} + \xi_t.$$

This is a central object of study in control theory and time-series analysis, dating back to the foundational work of Kalman [kalman1960new], and has recently received considerable attention from the machine learning community. In a typical learning setting, the system parameters are unknown, and only the past states and exogenous inputs are observed. Sometimes, another layer of difficulty is imposed: the latent state can only be observed noisily or through a low-rank transformation. These models serve as an abstraction for learning from correlated data in stateful environments, and have helped to understand empirical successes in reinforcement learning and of recurrent neural networks.

Many recent works are concerned with finite-sample system identification, in which the matrices can be recovered due to structure in the perturbations or inputs. These results rely on matrix concentration inequalities and careful error propagation applied to classic primitives in linear system identification, and have settled some important statistical questions about these methods. However, as in the classical control theory literature, these theorems require unrealistic assumptions of i.i.d. isotropic random perturbations and recoverability of the system, and often require the user to select the "exploration" inputs. Furthermore, under model misspecification, the guarantees of parameter identification pipelines break down.

Another line of work seeks to obtain more flexible guarantees via the online learning framework. Here, the goal of parameter recovery is replaced with regret minimization, the excess prediction loss compared to the best-fit system parameters in hindsight. This approach gives rise to algorithms which adapt to adversarially perturbed data and model misspecification, and can be extended beyond prediction to obtain new methods for robust control. However, these algorithms can diverge significantly from the classical parameter identification pipeline. In particular, they can be improper, in that they may use an intentionally misspecified (e.g. overparameterized) model. Thus, these algorithms can be incompatible with parameter recovery and downstream methods.

In this work, we show that the same algorithm used for parameter identification (online least squares) has a no-regret guarantee, even in the challenging setting of prediction under marginal stability and adversarial perturbations, where recovery is impossible. More precisely, in this setting, where the state is allowed to grow polynomially with time, we show that the regret of this algorithm is sublinear, with a polynomial dependence on the system’s parameters. This does not follow from the usual analysis of online least squares: the magnitude of loss functions (and associated gradient bounds) can scale polynomially with time, causing standard regret bounds to become vacuous. Instead, we conduct a refined regret analysis, including a structural volume-doubling lemma showing the current state to be a small linear combination of past states.

By replacing the worst-case structural lemma with a stronger martingale analysis, we also show a polylogarithmic regret bound for least squares in the stochastic setting. Again, this analysis does not go through parameter convergence, and thus applies in the setting of unidentifiable systems and non-isotropic noise. The same techniques allow us to prove a logarithmic regret bound in the partially observed setting under Gaussian noise, with polynomial dependence on the memory of the associated Kalman filter.

Paper structure. This paper is organized as follows. In Section 2, we formally introduce the problem and the natural online least squares algorithm. In Section 3, we give an overview of related work. In Section 4, we state our main results. In Section 5, we sketch the proofs. In Section B, we show that a structural condition on a time series gives a regret bound for online least squares. We then prove our main theorems for fully observed LDS in the adversarial setting (Section C), fully observed LDS in the stochastic setting (Section D), and for partially observed LDS in the stochastic setting (Section E), through establishing this structural condition.

## 2 Problem setting and algorithm

The problem of online state prediction for a LDS falls within the framework of online least squares. We first introduce the general problem of online least squares (Section 2.1), and then specialize to the prediction problem for a fully observed (Section 2.2) or partially observed (Section 2.3) LDS. Because the observations come from a sequential process, they have extra structure that we will leverage to obtain better guarantees than for black-box online least squares. In Section 2.4, we describe the challenges associated with marginally stable systems.

### 2.1 The online least squares problem

In the problem of online least squares, at each time $t$ we are given $x_t$, and asked to predict $y_t$. We choose a matrix $A_t$ and predict $\hat{y}_t = A_t x_t$. The desired output $y_t$ is then revealed and we suffer the squared loss $\|\hat{y}_t - y_t\|_2^2$. A natural goal in this setting is to predict as well as if we had known the best matrix in hindsight; hence, the performance metric is given by the regret with respect to a fixed matrix $A$, defined by

$$R_T(A) = \sum_{t=1}^{T} \|A_t x_t - y_t\|_2^2 - \sum_{t=1}^{T} \|A x_t - y_t\|_2^2.$$

Define the regret with respect to a given set $\mathcal{A}$ of matrices by $R_T(\mathcal{A}) = \sup_{A \in \mathcal{A}} R_T(A)$. In general, we would like to achieve regret that is sublinear in $T$, or equivalently, average regret that converges to 0. In some cases, we can do better, and achieve regret that is polylogarithmic in $T$.

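A natural algorithm for online least squares (Algorithm 1) is to choose $A_t$ that minimizes the total squared prediction error for all the pairs $(x_s, y_s)$ seen so far, plus a regularization term with parameter $\mu$:

$$A_t = \arg\min_{A} \Big[ \mu \|A\|_F^2 + \sum_{s < t} \|y_s - A x_s\|_2^2 \Big].$$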

We state the standard regret bound for Algorithm 1.

###### Theorem 2.1 (OLS regret bound; Thm. 11.7, [cesa2006prediction]).

In the online least squares setting, suppose that $\|x_t\|_2 \le M$ for all $t$. Then, Algorithm 1 incurs regret

$$R_T(A) \le \mu \|A\|_F^2 + \max_{0 \le t \le T-1} \|y_t - A_t x_t\|_2^2 \; mn \ln\left(1 + \frac{T M^2}{n}\right).$$

Thus, if there is a uniform bound on the prediction errors $\|y_t - A_t x_t\|_2$, online least squares achieves logarithmic regret. Such a bound follows immediately in the usual OLS setting, where $x_t$ and $y_t$ are bounded. However, in the case where they can grow with time, as in marginally stable systems, a more sophisticated analysis will be necessary to get sublinear regret bounds.
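To make the update concrete, here is a minimal sketch of regularized online least squares in the scalar case, run on a scalar marginally stable system $x_{t+1} = x_t + \xi_{t+1}$. It is an illustration under stated assumptions, not the paper's Algorithm 1 verbatim: the function names `online_ridge_1d`, `regret`, and `demo`, the uniform noise, and the choice $\mu = 1$ are all hypothetical choices made for the example.

```python
import random


def online_ridge_1d(xs, ys, mu=1.0):
    """Scalar online least squares (illustrative sketch).

    At each step t, the coefficient a_t minimizes
        mu * a^2 + sum_{s < t} (y_s - a * x_s)^2
    using only the pairs seen so far, then predicts a_t * x_t.
    """
    sxy, sxx = 0.0, 0.0
    coefs, preds = [], []
    for x, y in zip(xs, ys):
        a_t = sxy / (mu + sxx)  # closed-form ridge solution from past pairs
        coefs.append(a_t)
        preds.append(a_t * x)
        sxy += y * x            # y is revealed only after predicting
        sxx += x * x
    return coefs, preds


def regret(xs, ys, preds, a):
    """Squared loss of the learner minus the loss of the fixed comparator a."""
    learner = sum((p - y) ** 2 for p, y in zip(preds, ys))
    comparator = sum((a * x - y) ** 2 for x, y in zip(xs, ys))
    return learner - comparator


def demo(T=2000, seed=0):
    """Assumed toy system: x_{t+1} = x_t + xi_{t+1}, xi uniform on [-1, 1]."""
    rng = random.Random(seed)
    states = [1.0]
    for _ in range(T):
        states.append(states[-1] + rng.uniform(-1.0, 1.0))
    xs, ys = states[:-1], states[1:]
    coefs, preds = online_ridge_1d(xs, ys, mu=1.0)
    return regret(xs, ys, preds, a=1.0), coefs[-1]
```

Calling `demo()` returns the regret against the true coefficient $a = 1$ together with the final estimate; the point of the sketch is only the order of operations (predict, observe, update), which is what the regret definition above measures.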

### 2.2 Prediction in fully-observed linear dynamical systems

A special case of online least squares is state prediction in a time-invariant linear dynamical system (LDS), defined as follows. Given an initial state $x_0$, matrices $A$ and $B$, inputs $u_0, \dots, u_{T-1}$, and a sequence of perturbations $\xi_1, \dots, \xi_T$, the LDS produces a time series of states $x_1, \dots, x_T$ according to the following dynamics:

$$x_t = A x_{t-1} + B u_{t-1} + \xi_t, \qquad 1 \le t \le T. \tag{1}$$

This setting generalizes the linear Gaussian model from control theory and time-series analysis, in which each $\xi_t$ is drawn i.i.d. from a Gaussian distribution. Aside from modeling disturbances, $\xi_t$ can also represent model uncertainty or misspecification.

In the prediction problem for LDS, we are asked to predict $x_{t+1}$ as a linear function of the current state $x_t$ and the input $u_t$. We can treat this as an online least squares problem, by casting $(x_t, u_t)$ as the input at time $t$, and $x_{t+1}$ as the desired output. At each step, the learner produces $A_t$ and $B_t$ and predicts $\hat{x}_{t+1} = A_t x_t + B_t u_t$. Thus, we can adapt Algorithm 1 to this setting with the substitution $x_t \leftarrow (x_t, u_t)$, $y_t \leftarrow x_{t+1}$, $A_t \leftarrow (A_t \; B_t)$, $A \leftarrow (A \; B)$, and obtain Algorithm 2. Translated to this setting, the goal of regret minimization becomes that of predicting as accurately as if one had known the system’s underlying matrices $A$ and $B$.

Note that in the stochastic setting, when the covariance of the noise is lower-bounded in each direction, OLS gives a consistent estimator for $A$ and $B$; convergence rates for recovery are analyzed in [sarkar2019optimal, simchowitz2018learning]. However, in the adversarial setting, recovery of $A$ and $B$ is an ill-posed problem. The perturbations can be biased or rank-deficient, causing the recovery problem to be underdetermined in general, and the best-fit parameters may change over time.

### 2.3 Prediction in partially-observed linear dynamical systems

A partially-observed linear dynamical system is defined by

$$h_t = A h_{t-1} + B u_{t-1} + \xi_t, \tag{2}$$

$$y_t = C h_t + \eta_t, \tag{3}$$

where $u_t$ are inputs, $h_t$ are hidden states, $y_t$ are observations, $A$, $B$, and $C$ are matrices, and $\xi_t$ and $\eta_t$ are perturbations. We consider the stochastic setting, so that $\xi_t$ and $\eta_t$ are independent zero-mean noise terms. Crucially, only the $y_t$, and not the $h_t$, are observed.

For prediction in this setting, we use Algorithm 3, regressing $y_t$ on the previous $\ell$ observations and inputs, so we slightly modify the definition of the regret from Section 2.1 to start accruing from $t = \ell + 1$:

$$R_T(A, B, C) = \sum_{t=\ell+1}^{T} \|\hat{y}_t - y_t\|^2 - \sum_{t=\ell+1}^{T} \|\hat{y}_{\mathrm{KF},t} - y_t\|^2,$$

where $\hat{y}_{\mathrm{KF},t}$ is the prediction of the steady-state Kalman filter for the system $(A, B, C)$; see Appendix E for a review and formal definitions. Note that we will learn the system in an improper manner: that is, we will predict using a general autoregressive filter, rather than the Kalman filter of some system.

### 2.4 Marginally stable systems

In this work, we are interested in prediction in marginally stable systems. In both the fully-observed and partially-observed cases, the spectral radius $\rho(A)$ of the system is defined to be the largest magnitude of an eigenvalue of the transition matrix $A$, as in Equations 1 and 2. An LDS is marginally stable if $\rho(A) \le 1$.

As opposed to strictly stable systems (ones for which $\rho(A) < 1$), these systems model phenomena where the state does not reset itself over time, often representing physical systems which experience little or no dissipation. As discussed in Section 3, their capacity to represent long-term dependences presents algorithmic and statistical challenges. An inverse spectral gap factor $\frac{1}{1 - \rho(A)}$ appears in the computational and statistical guarantees for many learning algorithms in these settings (see, e.g., [hardt2016gradient]), as a finite-impulse truncation length or mixing time, rendering those results inapplicable.

Among marginally stable systems, the hardest cases are those with large Jordan blocks corresponding to large eigenvalues. Defining the Jordan matrix

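$$J_{\lambda,r} := \begin{pmatrix} \lambda & 1 & 0 & \cdots & 0 \\ 0 & \lambda & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \lambda & 1 \\ 0 & 0 & 0 & 0 & \lambda \end{pmatrix} \in \mathbb{R}^{r \times r},$$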

we see that for marginally stable systems, the entries of $J_{\lambda,r}^t$ can grow polynomially in $t$, on the order of $t^{r-1}$ when $|\lambda| = 1$. These occur naturally in physical systems as discrete-time integrators of $r$-th degree ordinary differential equations. The primary challenge we overcome in this work is to show the sublinear regret of least squares, even when the state grows polynomially. As is also the case in our work, recent advances in parameter recovery of marginally stable systems [simchowitz2018learning, sarkar2019optimal] exhibit exponential dependences on the largest Jordan block order $r$.
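The polynomial growth can be checked directly. The following sketch (helper names `jordan_block`, `matmul`, `matpow`, and `top_right_entry` are hypothetical, introduced only for this illustration) builds a Jordan block and raises it to the power $t$; for $\lambda = 1$ the top-right entry equals the binomial coefficient $\binom{t}{r-1}$, which is a polynomial of degree $r-1$ in $t$.

```python
from math import comb


def jordan_block(lam, r):
    """r x r Jordan block: lam on the diagonal, 1 on the superdiagonal."""
    return [[lam if i == j else (1.0 if j == i + 1 else 0.0)
             for j in range(r)] for i in range(r)]


def matmul(P, Q):
    """Plain-Python matrix product of two lists of rows."""
    inner, cols = len(Q), len(Q[0])
    return [[sum(row[k] * Q[k][j] for k in range(inner)) for j in range(cols)]
            for row in P]


def matpow(M, t):
    """M raised to the non-negative integer power t by repeated multiplication."""
    r = len(M)
    out = [[1.0 if i == j else 0.0 for j in range(r)] for i in range(r)]
    for _ in range(t):
        out = matmul(out, M)
    return out


def top_right_entry(lam, r, t):
    """Entry (0, r-1) of J_{lam,r}^t, which equals C(t, r-1) * lam^(t-r+1)."""
    return matpow(jordan_block(lam, r), t)[0][r - 1]


if __name__ == "__main__":
    for t in (10, 100, 1000):
        # For lam = 1 and r = 3 the entry is C(t, 2), i.e. quadratic in t.
        print(t, top_right_entry(1.0, 3, t), comb(t, 2))
```

The diagonal entries stay at $\lambda^t$, so the spectral radius does not change; the growth comes entirely from the off-diagonal coupling.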

## 3 Related work

Linear dynamical systems have been studied in a number of disciplines, including control theory [ghahramani1996parameter, kalman1963mathematical], astronomy [chiappori1992sunspot], econometrics [hendry1994modelling], biology [saunders1994evolution], and chemical kinetics. They capture many popular models in statistics and machine learning [ghahramani1996parameter]. We first describe the results on parameter estimation and prediction in fully observed systems, and then describe results more broadly applicable to partially observed systems. Unless noted otherwise, the results hold under the assumption that the noise is i.i.d.; some results also require that it be Gaussian.

Fully-observed LDS. [dean2017sample] show that when given independent rollouts of a LDS, the least-squares estimator of the parameters is sample-efficient. Using this, they obtain sub-optimality bounds for control. [simchowitz2018learning] consider the more challenging case when only a single trajectory is given, and show that the least-squares estimator is still efficient, despite correlations across timesteps. Their results hold for marginally stable systems. Improving over [simchowitz2018learning] and [faradonbeh2018finite], [sarkar2019optimal] offer bounds applicable even to explosive systems, with the restriction that explosive eigenvalues have unit geometric multiplicity. In order to obtain results for parameter recovery, all these results assume that the covariance of the noise is lower bounded. Because we are concerned with prediction, this requirement will not be necessary for our results.

Partially-observed LDS. For a system with known parameters, the celebrated Kalman filter [kalman1960new] provides an analytic solution for the posterior distribution of the latent states and future observations given a series of observations. When the underlying parameters are unknown, the Expectation Maximization (EM) algorithm can be used to learn them [ghahramani1996parameter]. However, due to nonconvexity of the problem, EM is only guaranteed to converge to local optima. In the absence of process noise, [hardt2016gradient] show that gradient descent on the maximum likelihood objective converges to the global minimum; however, they require the roots of the associated characteristic polynomial to be well-separated and the system to be strictly stable.

Subspace identification methods circumvent the nonconvexity of maximum likelihood estimation. For strictly stable systems, [oymak2018non] demonstrate that the Ho-Kalman algorithm learns the Markov parameters of the underlying system at an optimal rate, and identifies the parameters approximately up to an equivalent realization under further assumptions of observability and controllability. [simchowitz2019learning] showed that a prefiltered variant of least-squares offers stronger guarantees that apply even to marginally stable systems and systems with adversarial noise. For strictly stable stochastic systems, [sarkar2019finite] improve upon previous works to give an optimal rate for parameter identification. We note that these works require the control inputs, and often the noise, to be Gaussian. This may not hold when the control inputs are exogenous (not under user control). In contrast, our results can handle arbitrary (bounded) control inputs.

In a notable departure from this trend, [Tsiamis2019FiniteSA] demonstrate optimal recovery of system parameters in the absence of control inputs for marginally stable systems. Under similar conditions, [tsiamis2019sample] prove the first result that integrates former system identification results with a perturbation analysis for the Kalman filter to obtain error bounds on prediction. These results only apply to stochastic systems subject to persistent excitation, which our stochastic-case result does not require.

Prediction via improper learning. For strictly stable partially observed systems without process noise, it is sufficient to learn a finite impulse response (FIR) filter on the inputs, as observed by e.g. [hardt2016gradient]. [tu2017non] give near-optimal sample complexity bounds for learning a FIR filter under design inputs. [hazan17learning, hazan2018spectral, arora2018towards] instead use spectral filtering on the inputs to achieve regret bounds that apply in the presence of adversarial dynamics and marginally stable systems, much like the present work. However, in the presence of process noise, the regret compared to the optimal filter can grow linearly. This is an inherent limitation of any FIR-based approach; see [lee2019robust] for a discussion.

[Kozdoba2019OnLineLO] note that if the LDS is observable, the associated Kalman filter is strictly stable, and hence can be arbitrarily well-approximated by an autoregressive (AR) model. [AnavaHMS13] give algorithms for prediction in ARMA models with adversarial noise. However, their results hold under conditions more stringent than even strict stability. [Kozdoba2019OnLineLO] show that online gradient descent on the AR model gives regret bounds scaling with the size of the observations. As discussed in Section 2.4, in the marginally stable case this could be polynomial in the time $T$. [lee2019robust] give guarantees for learning an autoregressive filter in a stricter notion of norm, but require the system to be strictly stable.

Online learning. We use tools from online learning (see [hazan2016introduction, shalev2012online, cesa2006prediction] for surveys). The standard regret bounds for online least-squares scale with an upper bound on the maximum instantaneous loss, through the gradient norm or the exp-concavity factor [zinkevich2003online, hazan2007logarithmic]. Our core argument shows that this quantity is sublinear in $T$ in the LDS setting. This cannot be true for online least-squares for arbitrary polynomially growing $x_t$, so black-box results cannot apply; see Appendix A. We note the similarity of our approach to [rakhlin2012online, rakhlin2013optimization], where the authors show that approximate knowledge of cost functions or gradients revealed one step in advance can give "beyond worst-case" regret bounds.

## 4 Our results

### 4.1 Fully-observed LDS

For prediction in a fully-observed LDS, we show that we can achieve sublinear regret in the adversarial setting and polylogarithmic regret in the stochastic setting.

We will make the following assumptions for both theorems:

###### Assumption 4.1.

The linear dynamical system

$$x_t = A x_{t-1} + B u_{t-1} + \xi_t$$

satisfies the following:

• The initial state is bounded: .

• The inputs are bounded: .

• The perturbations are bounded: for .

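• $A \in \mathbb{R}^{d \times d}$, $B \in \mathbb{R}^{d \times m}$, and $A$ can be written in Jordan form as $A = S J S^{-1}$, where $J$ has Jordan blocks of size at most $r$ and $\rho(A) \le 1$.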

• satisfies .

We will let .

We note that the bound on the perturbations is necessary. This prevents, for example, the pathological case in which the system switches between two very different linear dynamical systems, where linear regret is unavoidable. We also note that the condition number of the change-of-basis matrix in the Jordan form is a standard quantity that often appears in learning guarantees.

Our main theorem in the adversarial setting is the following; see the appendix for more precise dependences on individual constants.

###### Theorem 4.2 (Sublinear regret in the adversarial setting).

Suppose Assumption 4.1 holds. Then, there is an explicit choice of regularizer such that Algorithm 1 achieves regret

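$$\begin{aligned} R_T(A, B) \le{} & T^{\frac{2r+1}{2r+2}} \, \mathrm{poly}\big(C_{\mathrm{sys}}, R, (d+m), (\ln T), (\ln C_{\mathrm{sys}})\big) \\ & + T^{\frac{1}{2r+2}} \, \mathrm{poly}\big(C_{\mathrm{sys}}, R, (d+m)^r, (\ln T)^r, (\ln C_{\mathrm{sys}})^r\big). \end{aligned} \tag{4}$$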

If and ,

$$R_T(A, B) = O\left(m^3 d^3 (m+d) r^2 \cdot T^{\frac{2r+1}{2r+2}} \ln^3 T\right)$$

as . The dependence of on is .

###### Remark.

This is a pessimistic bound. The worst case is when the eigenvalues of the large Jordan blocks are close to 1. If , then we can replace the dependence on with , and instead suffer a dependence.

In the case where $A$ is diagonalizable, Theorem 4.2 implies regret scaling as $T^{3/4}$, up to polynomial factors in the other parameters:

###### Corollary 4.3.

Suppose Assumption 4.1 holds, and further suppose $A$ is diagonalizable. There is an explicit choice of regularizer such that Algorithm 2 achieves regret

$$R_T(A, B) \le T^{3/4} \, \mathrm{poly}(C_{\mathrm{sys}}, R, d, m, \ln T). \tag{5}$$

When and , we have as . The dependence of on is .
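As a quick worked check of how the exponent of $T$ in (4) depends on the largest Jordan block size $r$, the arithmetic below uses only the leading exponent $\frac{2r+1}{2r+2}$ from Theorem 4.2; the shorthand $e(r)$ is introduced here for illustration and does not appear elsewhere in the paper.

```latex
e(r) \;=\; \frac{2r+1}{2r+2} \;=\; 1 - \frac{1}{2r+2},
\qquad
e(1) = \tfrac{3}{4}, \quad e(2) = \tfrac{5}{6}, \quad e(3) = \tfrac{7}{8},
\qquad
\lim_{r \to \infty} e(r) = 1 .
```

So the diagonalizable case $r = 1$ recovers the $T^{3/4}$ rate of Corollary 4.3, and the bound stays sublinear for every fixed $r$ while approaching linear growth as $r$ increases.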

Our main theorem in the stochastic setting is the following.

###### Theorem 4.4 (Polylogarithmic regret in stochastic setting).

Suppose Assumption 4.1 holds, and further that is a random variable satisfying . Then with probability at least , Algorithm 2 with achieves regret

$$R_T(A, B) \le \mathrm{poly}\big(C_{\mathrm{sys}}, R, d^r, (\ln T)^r, (\ln C_{\mathrm{sys}})^r, \ln(1/\delta)\big). \tag{6}$$

Note that there is no requirement that the noise be i.i.d., nor that their covariance is greater than some multiple of the identity. At the expense of a factor, the theorem can be applied to subgaussian random variables, by first conditioning on the event that for all .

### 4.2 Partially-observed LDS

Our assumptions in the partially observed setting are the following. We will assume that the noise is i.i.d. Gaussian; this is the analogue of the linear-quadratic estimation (LQE) setting where we only care about predicting the observation.

###### Assumption 4.5.

The partially-observed LDS defined by (2)–(3) satisfies the following:

• The initial state has steady-state covariance: with .

• The inputs are bounded: .

• The perturbations are Gaussian: and .

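• $A \in \mathbb{R}^{d \times d}$, $B \in \mathbb{R}^{d \times m}$, and $A$ can be written in Jordan form as $A = S J S^{-1}$, where $J$ has Jordan blocks of size at most $r$ and $\rho(A) \le 1$.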

• and satisfy and .

We will let .

For simplicity, our result assumes that has steady-state covariance. If this is not the case, then one would need quantitative bounds on how quickly the time-varying Kalman filter converges to the steady-state Kalman filter, to bound the additional regret incurred by using a fixed filter.

The theorem also depends on the sufficient length of the Kalman filter, which is roughly the length at which we can truncate the unrolled filter to incur an error of at most ; see Definition E.2 for a precise account. If the filter decays exponentially, then . It remains an interesting problem to handle the case where the filter is also marginally stable.

###### Theorem 4.6 (Polylogarithmic regret for LQE).

Assume Assumption 4.5. Suppose the corresponding Kalman filter is given by , , and . Let denote the sufficient length of the Kalman filter system, and choose . Let be the unrolled Kalman filter, and suppose .

Then with probability , Algorithm 3 with achieves regret

$$R_T(A, B, C) \le \mathrm{poly}\big(\ell, C_{\mathrm{sys}}, R, (d+m)^r, (\ln T)^r, (\ln C)^r, \ln(1/\delta)\big). \tag{7}$$

## 5 Outlines of main proofs

In this section, we explain the key ideas behind our results, using a fully observable LDS with adversarial noise as an example. For simplicity, we sketch the proof in the case that the matrix is diagonalizable, matrix is zero, and , which already captures the core difficulty of the problem. In the proof sketch, we assume that all relevant parameters for the LDS are . Full proofs are in Sections B--E.

### 5.1 Regret bounds for online least squares with large inputs

Our starting point is the regret bound for online least squares, Theorem 2.1, which depends on the maximum prediction error $\max_t \|y_t - A_t x_t\|_2$. To obtain sublinear regret using this bound, we must show that the maximum squared prediction error grows sublinearly in $T$. We show in Theorem B.2 that this holds as long as the following structural condition (formally defined in Definition B.1) on the regressor sequence $(x_t)$ holds:

###### Definition 5.1 (Anomaly-free sequences; informal).

A sequence $(x_t)$ is anomaly-free if, whenever the projection of any $x_t$ onto a unit vector $w$ is large, there must have been many earlier indices $s < t$ for which the projection of $x_s$ onto $w$ is also large; Definition B.1 makes the thresholds precise.

Intuitively, the inputs are anomaly-free if no input is large in a direction where we have not already seen many inputs. Note this does not hold in the general case of polynomially-bounded $x_t$; see the counterexample in Appendix A. To prove Theorem B.2, we first express the prediction error $A_t x_t - y_t$ in terms of the preceding states and errors (Lemma B.3). Next, we show this expression is bounded in the 1-dimensional case (Lemma B.4). Finally, we reduce the general higher-dimensional case to the 1-dimensional case by diagonalizing the sample covariance matrix of the past inputs. In the reduction, we project onto the eigendirections; this is why we want there to be many large inputs when projected onto any direction.

Note that Theorem B.2 is stated more generally, allowing indices where is large. This allows superlinear growth in the , and hence can be applied to dynamical systems with matrices having Jordan blocks.

### 5.2 Proving LDS states are anomaly-free

Our main result (Theorem 4.2) follows by verifying that LDS states are anomaly-free. The main idea is that the evolution of the LDS ensures that is always approximately a linear combination of past states with small coefficients. More precisely, we need with small and (Lemma C.5). Once we have this, projecting onto gives , showing that one of the projections is large. To obtain many indices for which this is large, we apply the same argument to the -step dynamical systems defined by , , and so forth, keeping track of how many times an index can be overcounted. This shows the states are anomaly-free and finishes the proof of Theorem 4.2.

We provide two approaches to decompose $x_t$ into previous states as needed. The simpler approach provides coefficients of size $\exp(d)$, while a more involved approach provides $\mathrm{poly}(d)$-sized coefficients. In order to have a $\mathrm{poly}(d)$ dependence in the final regret bound, the latter approach is necessary.

### 5.3 exp(d)-sized coefficients using the Cayley-Hamilton theorem

To show that $x_t$ is always a linear combination of previous states with small coefficients, a first idea is to use the Cayley-Hamilton theorem. In the noiseless case, the theorem implies that the $x_t$ satisfy a linear recurrence of order $d$ whose coefficients are, up to sign, the coefficients of the characteristic polynomial of $A$. Adding the noise back, we get an additional error term, whose size is obtained by inspecting how the noise propagates through this recurrence. Even though the coefficients can be exponentially large in $d$, this suffices to get a bound that is sublinear in $T$ while exponential in $d$. For ease of reading, we first present this weaker result in Lemma C.2 in Section C.1.
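The noiseless recurrence can be sanity-checked in the smallest nontrivial case. For a $2 \times 2$ matrix, Cayley-Hamilton gives $A^2 = \operatorname{tr}(A)\,A - \det(A)\,I$, so noiseless states with $B = 0$ satisfy $x_t = \operatorname{tr}(A)\,x_{t-1} - \det(A)\,x_{t-2}$. The sketch below (function names `step`, `trajectory`, and `recurrence_residuals` are hypothetical) measures how far a simulated trajectory is from this recurrence; it is only an illustration of the idea, restricted to $d = 2$ as an assumption of the example.

```python
def step(A, x):
    """One noiseless step x -> A x for a 2 x 2 matrix A."""
    return [A[0][0] * x[0] + A[0][1] * x[1],
            A[1][0] * x[0] + A[1][1] * x[1]]


def trajectory(A, x0, T):
    """States x_0, ..., x_T of the noiseless system x_t = A x_{t-1}."""
    xs = [x0]
    for _ in range(T):
        xs.append(step(A, xs[-1]))
    return xs


def recurrence_residuals(A, x0, T):
    """Max-norm gap between x_t and tr(A) x_{t-1} - det(A) x_{t-2}, t = 2..T."""
    tr = A[0][0] + A[1][1]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    xs = trajectory(A, x0, T)
    residuals = []
    for t in range(2, T + 1):
        pred = [tr * xs[t - 1][i] - det * xs[t - 2][i] for i in range(2)]
        residuals.append(max(abs(pred[i] - xs[t][i]) for i in range(2)))
    return residuals


if __name__ == "__main__":
    # Jordan block with eigenvalue 1: the state grows linearly,
    # yet the two-term recurrence holds with coefficients 2 and -1.
    jordan = [[1.0, 1.0], [0.0, 1.0]]
    print(max(recurrence_residuals(jordan, [0.0, 1.0], 50)))
```

In dimension $d$ the same construction uses all $d$ characteristic-polynomial coefficients; for example, for the identity matrix the characteristic polynomial is $(z-1)^d$, whose coefficients are binomial coefficients and can be as large as roughly $2^d / \sqrt{d}$, which is the source of the exponential dependence on $d$ in this approach.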

### 5.4 poly(d)-sized coefficients via volume doubling

As an alternative, we now present a novel volume-doubling argument leading to a recurrence with only $\mathrm{poly}(d)$ coefficient size, which may be of independent interest. For ease of presentation, we introduce some notation below:

• -span of . . For short, we denote .

• Norm with respect to . .

• Set of ‘‘outlier’’ indices. .

Now, our goal is summarized as bounding . To this end, we first prove a upper bound the size of using a general potential-based argument, and then relate it to the norm by unrolling the dynamics appropriately.

Bounding the number of outliers. We have to show that is at most polynomial in and logarithmic in . We prove a general lemma that if is large enough (), then adding to the increases its volume significantly: (Lemma C.7). Applied to our situation, this shows that if , then . The total volume is bounded by , so the number of outliers at or before time is bounded by , which is polynomial in and logarithmic in (Lemma C.8).

Bounding and . We obtain an inequality showing that is not much larger than for small delays . To see this, note that is generated from by evolving the LDS times, keeping track of noise. The contribution from the noise here is at most . Now, it suffices to find a small such that is small, or in other words is not an outlier. Because the number of outliers is , we can choose , and we conclude is at most . This argument is formalized in the proof of Lemma C.5.

Controlling overcounting using prime index gaps. One technicality is that we have to apply the same argument to the dynamical systems for different values of . For each value of , we get an index such that is large. To make sure we obtain enough indices this way, in the proof of Lemma C.11 we only take to be prime, and we use a lower bound on primorials to show that we can collect enough distinct indices.

### 5.5 Stochastic cases

Finally, we provide a brief comment on how to prove Theorems 4.4 and 4.6. In the fully-observed setting, we can use a martingale argument, rather than the structural result, to obtain regret. To do this, we show that with high probability, is bounded by (Lemma D.1).

Let be the matrix predicted by online least squares with regularization parameter . By Lemma B.3, . The main term we need to bound is the first one. Let . Pretending for a moment that is -valued, by Azuma’s inequality it suffices to bound the variation . We have already shown that with , so we can bound the variation in terms of . By definition , so .

However, we can’t apply Azuma’s inequality directly, since depends on , and thus is not -valued. However, the dependence on non--valued random variables is only through , which does not depend on , so we can use Azuma’s inequality on an -net of possible values for . More precisely, note that has the property that is small, so it suffices to bound for a -net of such that is small. These ’s live in a -dimensional space, so we only incur factors of .

For the partially-observed setting, we reduce to Theorem 4.4 by lifting the state: we choose a large enough horizon such that truncating the unrolled Kalman filter to length incurs an approximation error of at most . Then, by letting the state space be the past observations and inputs, the partially observed LDS is approximately described by a fully-observed LDS. This incurs an additional polynomial factor in the length .

## 6 Conclusion

We have shown that online least-squares, with a carefully chosen regularization and a refined analysis, has a sublinear regret guarantee in marginally stable linear dynamical systems, even in the most difficult cases when the state can grow polynomially. In the stochastic setting, adopting the same view of low-regret prediction as opposed to parameter recovery, we have shown logarithmic regret bounds and bypassed usual isotropic noise assumptions. Several fundamental questions come to mind:

1. Is the $T^{\frac{2r+1}{2r+2}}$ rate optimal? Even in the diagonalizable ($r = 1$) case, this is unresolved.

2. Is there a simpler way to get $\mathrm{poly}(d)$-sized coefficients? The number-theoretic lemma required by Lemma C.11 to control overcounting is somewhat delicate, and we may have overlooked an elementary proof.

3. What is the rate for partially observed LDS when the noise is not Gaussian? We stated Theorem 4.6 for Gaussian noise only, but a similar result will hold as long as at steady state, is given by a linear function of , , and the estimated state . This is required in order for the random variable to be a linear function of past observations and inputs, plus a random variable with zero mean. In general, if the noise is not Gaussian, then is not zero-mean (even if is zero-mean). We use the same machinery as in the proof of Theorem 4.4 to conclude Theorem 4.6, so our proof strategy cannot handle arbitrary zero-mean noise .

For non-Gaussian zero-mean noise , we can instead treat the as adversarial noise, and use the machinery behind Theorem 4.2 to obtain a regret bound. It is an interesting question whether we can obtain polylogarithmic regret with respect to the best linear filter in this case.

We leave these for future work.

## Appendix A Impossibility of sublinear regret without the LDS

For the sake of completeness, we include a simple one-dimensional counterexample showing the insufficiency of black-box regret bounds for online least-squares when it is only assumed that an upper bound for the regressors grows with time. Even without adversarial perturbations, this information-theoretic lower bound shows that we require a refined notion of gradual growth of the regressors $x_t$, as in the structural results of Section 5.2, to achieve sublinear regret.

###### Proposition A.1.

There is a joint distribution over and length- sequences , for which and , but any online algorithm incurs at least expected regret.

###### Proof.

We construct this distribution, choosing with equal probability. We choose for all , then choose , so that . In this example, at time , all previous feedback is independent of , so the best prediction at time is , which suffers expected least-squares loss . ∎

## Appendix B Anomaly-free inputs imply sublinear OLS regret

We begin by defining a structural condition on OLS inputs, whereby if any $x_t$ is large in a direction, then many previous $x_s$'s are also large in that direction.

###### Definition B.1 (Anomaly-free sequences).

A sequence is -anomaly-free if for any and any unit vector , if , there exist at least indices such that .

We show that when every $y_t$ is obtained from an anomaly-free $x_t$ by a fixed linear transformation, plus a bounded perturbation, then OLS attains sublinear regret.

###### Theorem B.2.

For constants , , , suppose an online least-squares problem satisfies the following conditions:

1. There exists such that for all , with .

2. The input vectors are bounded: .

3. is -anomaly-free.

Then online least squares with regularization parameter incurs regret

 Rt(A) ≤μ∥A∥2F+O(m3nmax⎧⎪ ⎪⎨⎪ ⎪⎩Cξc2tμ+∥A∥2c,Cξc22,1c1α+21c2α+22⎛⎜⎝∥A∥2μ1α+2+Cξt12μ12−1α+2⎞⎟⎠⎫⎪ ⎪⎬⎪ ⎪⎭2 ⋅ln(1+tB2n)).

If (1a) , (1b) , (2) , and (3) , then If , , and satisfies these conditions, then

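$$R_t(A) = O\left( \frac{m^{2 + \frac{1}{\alpha+1}} \, n \, t^{\frac{\alpha+2}{2(\alpha+1)}} \ln\left(1 + \frac{t B^2}{n}\right)}{c_1^{\frac{1}{\alpha+1}} \, c_2^{\frac{2}{\alpha+1}}} \right) \tag{8}$$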

Note that if while the other constants are fixed, the inequalities do hold for the choice of . The only inequality that is not immediately clear is (1a); it follows from comparing the exponents of : for .

By Theorem 2.1, it suffices to bound the maximum prediction error $\max_t \|y_t - A_t x_t\|_2$, which is done in Lemma B.5. We start with the following calculation.

###### Lemma B.3.

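Let $A$ be a matrix, and let $x_t \in \mathbb{R}^m$ and $\xi_t$ be vectors for $t \ge 0$. Suppose $y_t = A x_t + \xi_t$ for each $t$. Let $A_t = \arg\min_{A'} \big[ \mu \|A'\|_F^2 + \sum_{s=0}^{t-1} \|y_s - A' x_s\|_2^2 \big]$. Let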

$$\Sigma_t = \mu I_m + \sum_{s=0}^{t-1} x_s x_s^\top.$$

Then

$$A_t x_t - y_t = \sum_{s=0}^{t-1} \xi_s x_s^\top \Sigma_t^{-1} x_t - \mu A \Sigma_t^{-1} x_t - \xi_t. \tag{9}$$

###### Proof.

We calculate

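$$\begin{aligned}
A_t &= \arg\min_{A'} \Big[ \mu \|A'\|_F^2 + \sum_{s=0}^{t-1} \|y_s - A' x_s\|_2^2 \Big] = \sum_{s=0}^{t-1} y_s x_s^\top \Sigma_t^{-1} = \sum_{s=0}^{t-1} \big( A x_s x_s^\top \Sigma_t^{-1} + \xi_s x_s^\top \Sigma_t^{-1} \big), \\
A &= A \Big( \sum_{s=0}^{t-1} x_s x_s^\top + \mu I_m \Big) \Sigma_t^{-1} = \sum_{s=0}^{t-1} A x_s x_s^\top \Sigma_t^{-1} + \mu A \Sigma_t^{-1}, \\
A_t - A &= \sum_{s=0}^{t-1} \xi_s x_s^\top \Sigma_t^{-1} - \mu A \Sigma_t^{-1}, \\
A_t x_t - y_t &= (A_t - A) x_t - \xi_t = \sum_{s=0}^{t-1} \xi_s x_s^\top \Sigma_t^{-1} x_t - \mu A \Sigma_t^{-1} x_t - \xi_t.
\end{aligned}$$

∎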
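Identity (9) is easy to verify numerically in the scalar case, where $\Sigma_t = \mu + \sum_{s<t} x_s^2$ is a number. The sketch below is an illustrative check under that assumption; the function name `check_identity` and the sample values in the usage line are hypothetical.

```python
def check_identity(a, xs, xis, mu):
    """Return both sides of identity (9) at the last index, scalar case.

    xs  : regressors x_0, ..., x_t
    xis : perturbations xi_0, ..., xi_t, with y_s = a * x_s + xi_s
    mu  : regularization parameter (mu > 0)
    """
    t = len(xs) - 1
    ys = [a * x + xi for x, xi in zip(xs, xis)]
    # Sigma_t = mu + sum_{s < t} x_s^2
    sigma = mu + sum(x * x for x in xs[:t])
    # Ridge solution a_t from the first t pairs only
    a_t = sum(y * x for y, x in zip(ys[:t], xs[:t])) / sigma
    lhs = a_t * xs[t] - ys[t]
    noise_term = sum(xi * x for xi, x in zip(xis[:t], xs[:t])) / sigma * xs[t]
    bias_term = mu * a / sigma * xs[t]
    rhs = noise_term - bias_term - xis[t]
    return lhs, rhs


if __name__ == "__main__":
    lhs, rhs = check_identity(0.9, [1.0, 2.0, -1.5, 3.0],
                              [0.1, -0.2, 0.05, 0.3], mu=0.5)
    print(lhs, rhs)  # the two values agree up to floating-point rounding
```

The three terms on the right-hand side separate the effect of past noise, the shrinkage caused by regularization, and the fresh perturbation at time $t$, which is the decomposition the subsequent bounds work with.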

We now bound the size of the residual when condition 3 from Theorem B.2