A Proof of Theorem 3.1

# NonSTOP: A NonSTationary Online Prediction Method for Time Series

## Abstract

We present online prediction methods for univariate and multivariate time series that allow us to handle nonstationary artifacts present in most real time series. Specifically, we show that applying appropriate transformations to such time series can lead to improved theoretical and empirical prediction performance. Moreover, since these transformations are usually unknown, we employ the learning with experts setting to develop a fully online method (NonSTOP) for predicting nonstationary time series. This framework allows for seasonality and/or other trends in univariate time series and cointegration in multivariate time series. Our algorithms and regret analysis subsumes recent related work while significantly expanding the applicability of such methods. For all the methods, we provide sub-linear regret bounds using relaxed assumptions. We note that the theoretical guarantees do not fully capture the benefits of the nonstationary transformations, thus we provide a data-dependent analysis of the follow-the-leader algorithm for least squares loss that provides insight into the success of using nonstationary transformations. We support all of our results with experiments on simulated and real data.

## 1 Introduction

In the analysis of time series, AutoRegressive Moving Average (ARMA) models [Box et al., 2008, Brockwell and Davis, 2009, Hamilton, 1994] are simple and powerful descriptors of weakly stationary processes. As such, they have found tremendous application in many domains including linear dynamical systems, econometrics and forecasting resource consumption [Hamilton, 1994]. It is imminent that with the advent of internet of things (IoT), connected devices will generate large quantities of time series data and thus efficient estimation and prediction with such models will become much more relevant.

Despite a large amount of literature on estimation and prediction for these models, most of it remains within the confines of the statistical assumption of Gaussianity. Such assumptions are often unrealistic [Thompson, 1994] and lead to poor prediction performance. Moreover, since the noise sequence is not known beforehand, standard methods of ARMA estimation rely on conditional likelihood estimation. These methods usually lead to nonlinear estimation problems and only hold for Gaussian residual sequences and the least squares loss.

In the setting of streaming or high-frequency time series, one would ideally like to have methods that update the model, predict sequentially, and do not rely on any restricting assumptions on the noise sequence or the loss function. This brings attention to the paradigm of online learning [Cesa-Bianchi and Lugosi, 2006]. In that vein, Anava et al. [2013] recently presented online gradient and online Newton methods (ARMA-OGD and ARMA-ONS) for ARMA prediction. Using a truncated auto-regressive (AR) representation of an ARMA process, the authors provide online ARMA prediction algorithms with sublinear regret, where the regret is with respect to the best conditionally expected one-step ARMA prediction loss in hindsight (See Anava et al. [2013] for more details).

While Anava et al. [2013] make no assumption about the stationarity of the generating ARMA process, in the absence of appropriate modifications, the empirical performance of ARMA-OGD suffers in the presence of seasonality and/or trends which are very common in real time series [Box et al., 2008]. To handle a trend in the time series, Liu et al. [2016] recently presented ARIMA-OGD, a straigtforward extension of ARMA-OGD also with sublinear regret. This still doesn’t acount for seasonality in the time series or the presense of both a trend and seasonality. More importantly, the trend transform and its parameters are assumed to be known. This is rarely the case in most online settings as one typically needs adequate data to test for such nonstationarities. Also, a fixed transform may not adapt well to the changes in the incoming data. Moreover, existing methods (Liu et al. [2016] or Anava et al. [2013]) do not carry over to the multivariate domain. These shortcomings of existing work necessitate the development of broader methods that take into account different types of nonstationarities (with unknown parameters) with extensions to multivariate time series.

In this paper, we present general methods for online time series prediction that account for possible nonstationarities in the data. When the form of these nonstationarities are known, this leads to transformation of the data (depending on the type of nonstationarity) before prediction in the univariate case. For the general multivariate case with cointegration, we propose a novel algorithm for prediction in potentially nonstationary vector time series generated by error corrected VARMA (EC-VARMA) [Tsay, 2013, Lütkepohl, 2005] processes. Estimating EC-VARMA models are non-trivial in general and the algorithm we propose simultaneously estimates both the cointegrating rank and the VARMA matrix parameters. Please see Tsay [2013] for more details on error corrected models. In the realistic unknown transform setting, we unify the above methods into a meta-algorithm called NonSTOP that is essentially the weighted majority method (Cesa-Bianchi and Lugosi [2006]) wherein each expert corresponds to different parameter settings of the nonstationary transformation (e.g. trend only, trend and seasonality, no trend/seasonality, etc). The weighted majority paradigm allows us to quickly hone in on the correct transformation and make improved predictions. As expected, NonSTOP also allows for flexibility in adapting to changes in the data. Relaxing the assumptions compared to previous work, we provide sublinear regret guarantees for all methods.

Our regret guarantees and that of [Anava et al., 2013, Liu et al., 2016] do not completely explain why the convergence for these online prediction methods is faster for seasonal/trend adjusted data. We conjecture that these bounds are missing data dependent terms that capture correlations inherent in many real nonstationary time series. To give a flavor of what a satisfactory data dependent regret bound might look like, we analyze the regret for the Follow-The-Leader (FTL) algorithm in the case of least squares loss and show that these bounds depend on a data dependent term and can be compared across the different spectrum of real time series (stationary/trend/seasonal etc).

### 1.1 Contributions

Our contributions in this paper can be highlighted as follows:

1. We provide general methods for time series prediction using Online Gradient Descent (OGD) [Zinkevich, 2003] that allow for appropriate modifications to nonstationary time series before making a prediction. Our approach, being more general, includes existing work while expanding the applicability of such online methods to realistic time series settings.

1. For the univariate case, we propose simple modifications to ARMA-OGD that lead to improved empirical and theoretical performance.

2. In the multivariate setting, we provide a novel algorithm that learns the error correction term alongside the VARMA matrix coefficients.

2. Since the appropriate transformations are usually unknown in the online setting, we propose NonSTOP, an algorithm that allows us to model each transformation as an expert in the weighted majority setting. This lets us deal with the uncertainty in the transform and changes in the incoming data in case the nonstationarity changes over time.

3. Our regret analysis only requires invertibility of the moving average polynomial (See Box et al. [2008] for a discussion of invertibility), while the assumptions in Anava et al. [2013] and Liu et al. [2016] are less natural. Moreover, we don’t require an upper bound on the data as nonstationary data can be unbounded.

4. To emphasize the effect of the these nonstationary transformations, we prove a data dependent regret guarantee for FTL (for least squares loss) that gives insights into why adjusting for nonstationarities can give faster convergence.

Note that we can easily extend our algorithms and analysis for the online Newton step method of Anava et al. [2013] . But for simplicity and efficiency, we limit our analysis to OGD. All proofs of theorems can be found in the Supplement.

## 2 Preliminaries: Time Series Modeling

In this section, we provide a summary of seasonal and/or integrated extensions to standard ARMA processes. We also provide a brief introduction to VARMA and EC-VARMA processes. For an introduction to online convex optimzation and online gradient descent, please see [Shalev-Shwartz, 2011, Zinkevich, 2003, Hazan et al., 2007]. We introduce standard notation for time series in Table 1.

### 2.1 SARIMA Processes

Time series exhibiting seasonal patterns can be modeled by Seasonal AutoRegressive Integrated Moving Average (SARIMA) Processes. SARIMA processes are described by the following equation:

 ϕ(L)Φ(Ls)ΔdΔ~Dsxt=θ(L)Θ(Ls)εt (1)

where , and . Note that

1. implies a ARIMA() process.

2. implies a ARMA() process.

SARIMA processes explicitly model trend and seasonal nonstationarities by assuming that the differenced process is an ARMA process with AR lag polynomial and MA lag polynomial . We denote the order of the underlying AR and MA lag polynomials as and , respectively. For SARIMA processes, Eq. 1 gives us that and .

If the MA lag polynomial has all of its roots outside of the complex unit circle, then the SARIMA process is defined as invertible. Let be the scalar coefficients of the MA lag polynomial. Invertibility is equivalent to saying that the companion matrix

 F=⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣−β1−β2……−βlm10……0010…⋮⋮0⋱⋱⋮0⋮⋮10⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦ (2)

has eigenvalues less than 1 in magnitude. If this is the case, then the underlying ARMA process can be written as an AR process and can be approximated by a finite truncated AR process.

### 2.2 VARMA Processes

Vector AutoRegressive Moving Average (VARMA) processes provide a parsimonious description of modeling linear multivariate time series. Let . A VARMA process is described by:

 xt=p∑i=1Φixt−i+q∑i=1Θiεt−i+εt (3)

which can also be written in lag polynomial form:

 Φ(L)xt=Θ(L)εt (4)

with The requirements for invertibility are very similar to the univariate case. We require that must have all of its roots outside of the complex unit circle. Again, this is equivalent to saying that the companion matrix has eigenvalues less than 1 in magnitude [Lütkepohl, 2005, Tsay, 2013]. If the process is invertible, then it can be rewritten as a VAR process.

### 2.3 EC-VARMA Model

In many cases, a collection of time series may follow a common trend. This phenomenon, known as cointegration, is ubiquitous in economic times series [Tsay, 2013]. Let be described by Eq. 3. Formally, is cointegrated if is stationary and there exists a vector such that is a stationary process. If is cointegrated, then we can rewrite the original VARMA representation of as

 Δxt=Πxt−1+p−1∑i=1ΓiΔxt−i+q∑i=1Θiεt−i+εt (5)

where is low rank, and for . Eq. 5 is known as an Error-Corrected VARMA (EC-VARMA) model. Note that this looks like a VARMA process in the differenced series , except that there is the error-correction term . See [Lütkepohl, 2005, Tsay, 2013] for more details.

This family of models is the multivariate analogue of the ARIMA model. However, differencing in multivariate time series may lead to a phenomenon called over-differencing and thus we must consider the notion of error correction models. We refer the reader to Tsay [2013] for a discussion of this issue.

## 3 Univariate Methods

In this section, we present algorithmic methods for online univariate time series prediction which subsumes recent works such as ARMA-OGD as presented in Anava et al. [2013] and its the extension to trend nonstationarities ARIMA-OGD as presented in Liu et al. [2016].

Precisely, we show that time series with certain characteristics (such as a trend and/or seasonality) can be appropriately transformed before prediction to give better theoretical and empirical results. To achieve this goal, we present a unified template for time series prediction using OGD, denoted TSP-OGD, that allows for prediction of transformed time series. We will show that the choice of the transformation, dependent on the underlying data generation process (DGP), can lead to improved regret guarantees, partially explaining why these transformations lead to better empirical performance.

This framework includes some of the commonly used transformations of seasonal and non-seasonal differencing [Box et al., 2008]. Table 2 shows the explicit form of of such transformations. In practice, the order of differencing is usually determined by statistical tests (e.g. Elliot et al. [1996]) on a given dataset, which is not realistic when considering the online setting.

### 3.1 Tsp-Ogd

We assume the following for the remainder of this section:

• is generated by a DGP such that there exists a transformation which results in an invertible ARMA process. Moreover, there corresponds an inverse transformation that satisfies . Examples of such processes are ARMA, ARIMA, and SARIMA processes.

• The noise sequence of the process is independent. Also, it satisfies that .

• is a convex loss function with Lipschitz constant .

• We assume the companion matrix (as defined in Eq. (2)) of the MA lag polynomial is diagonalizable, i.e. where is a diagonal matrix of eigenvalues. Denote as the magnitude of the largest eigenvalue ( by definition of invertibility), and such that .

In Algorithm 1, the model parameters of the stochastic process are fixed by an adversary. At time , and are generated by the DGP. Before is revealed to us, the learner (see Algorithm 1) makes a prediction which incurs a prediction loss of . This prediction is preceded by a transform (See Table 2) that may require data points from previous rounds (we suppress that dependence in the notation for convenience). The prediction is computed using an AR model of order to approximate the underlying invertible ARMA process. Then it is inverted with and incurs a loss

 ℓMt(γ):=ℓt(xt,ζ(τ(~xt)))=ℓt(xt,ζ(M∑i=1γiτ(xt−i))) (6)

where is the vector of parameters of the approximating AR model. The prediction performance is evaluated using an “extended” notion of regret that looks at the prediction loss of the best process in hindsight. Precisely, let denote the set of AR and MA parameters, respectively, of the underlying ARMA process . Define

 ft(α,β)=ℓt(xt,ζ(E[τ(xt)|{τ(xt)}t−1t=1;α,β])) (7)

Note that depends on the transformations in U1. The extended regret is defined as comparing the accumulated loss in Eq. (6) to the loss of the best process in hindsight:

 Regret=T∑t=1ℓMt(γ(t))−minα,β∈KT∑t=1E[ft(α,β)] (8)

where is the set of invertible ARMA processes. Note that the randomness in the expectation is w.r.t. the noise sequence while the data is fixed.

Furthermore, let be a convex set of approximating AR models, i.e. . should be chosen to be large enough to include a valid approximation to the DGP described in U1. However, since the DGP is unknown in practice, one usually chooses something simple such as . Let , and for some monotonically increasing . This assumption stems from the fact that we allow the time series to be potentially unbounded. As an example, the norm of the gradient for the squared loss depends on the bound on the data. Let denote the projection operator onto the set .

We present a general regret bound for Algorithm 1:

###### Theorem 3.1.

Let . Then for any data sequence that satisfies assumptions U1-U4, Algorithm 1 generates a sequence in which

 T∑t=1ℓMt(γ(t))−minα,β∈KT∑t=1E[ft(α,β)]=O(DG(T)√T)

Remark 1: Note that plugging in the ARMA transformation and ARIMA transformation in Table 2 to Algorithm 1 recovers ARMA-OGD as presented in Anava et al. [2013] and ARIMA-OGD as presented in Liu et al. [2016], respectively. Plugging in the SARIMA transformation results in a variation which we will denote as SARIMA-OGD.

For the following remarks, assume that is squared loss, the DGP is a SARIMA process, and (note that the transformation is commonly employed as a variance stabilizer in many time series domains).

Remark 2: With these assumptions, Table 3 shows the regret bounds obtained by using different transformations/algorithms. The differencing transforms remove any growth trends in the data; as a consequence the transformed time series is bounded by a constant. In our case, this implies (a constant), which leads to an improvement over the regret bound obtained from ARMA-OGD. This improvement can be seen in the empirical results section of Liu et al. [2016].

Remark 3: When the DGP is assumed to be SARIMA, we require that as mentioned in Section 2, i.e. both need to essentially be a multiplicative factor larger than . This affects the length of the required AR approximation as described in line 1 of Algorithm 1.

### 3.2 Data Transformation Dependent Regret

The transformations discussed in the previous sections essentially diminish the effect of serial correlation in the data due to any existing nonstationary trends. However, our regret bounds (shown in Table 3) do not accurately reflect this. We conjecture that these bounds are missing data-dependent terms that capture correlations inherent in many nonstationary time series. To give a flavor of what a satisfactory data dependent regret bound might look like, we analyze the regret for the FTL algorithm for the case of least squares loss

 ℓt(γ)=12(xt−γ⊺ψt)2 (9)

and show that these bounds depend on a data dependent term. We look at the standard notion of regret and hence the result in this section is much more general than time series prediction and is also relevant to general regression problems.

The FTL algorithm follows a simple update [Shalev-Shwartz, 2011]:

 γt+1∈\operatornamewithlimitsargminγt∑i=1ℓt(γ) (10)

Plugging Eq. 9 in Eq. 10 reveals that the FTL algorithm for least squares loss is just recursive least squares (RLS). Using the relevant RLS update equations Ljung [1998], Lai and Wei [1982], we present a data dependent regret bound for FTL with least squares loss.

###### Theorem 3.2.

Let be defined in Eq. 9 with Lipschitz constant . Then FTL generates a sequence in which

 T∑t=1ℓt(γt)−minγT∑t=1ℓt(γ)=O(T∑t=11tλmin(t))

where

 λmin(t)=λmin(1tt∑i=1ψiψ⊺i).

At the heart of our framework in Algorithm 1, we are approximating an ARMA process with an AR model. In order to apply Theorem 3.2 to our time series prediction setting for DGPs as described in assumption U1, we use FTL and least squares loss to predict the underlying ARMA process with an AR model , where and . This results in , which is the empirical non-centered autocovariance of the transformed data. Ideally, we want this quantity to be large, meaning that the invidividual samples are not highly correlated.

To empirically assess the regret bound when accounting for the appropriate nonstationarities, we calculate the bound for the three transforms in Table 2. We simulated a SARIMA process times with . We then averaged the calculated regret bound across the 50 simulated datasets using each transformation. The result is shown in Figure 1. The appropriate transformations essentially decrease correlations making the data more like realizations of a stationary ARMA process; we can see that accounting for the appropriate nonstationarities results in tighter regret bounds.

Missing from the analysis present in this section and other related works is the case of multivariate time series. The issue of nonstationarity in such series is complicated by the fact that differencing transforms do not preserve the conintegrating structure prevalent in many real time series. Thus, online methods must additionally infer this relationship from the data.

## 4 Multivariate Methods

Online prediction using multivariate nonstationary models present an additional difficulty due to the notion of cointegration (Section 2). Estimating EC-VARMA models in the static setting is non-trivial in general since the cointegrating rank is unknown and is typically determined by statistical tests (e.g. trace statistic of Johansen [1988]), which again is not realistic in the online setting. We propose a novel online method for cointegrated vector valued time series that simultaneously updates both the cointegrating matrix (including its rank) and the approximating VAR matrix parameters in order to accurately adapt to the underlying DGP and make predictions.

### 4.1 Approximating an EC-VARMA Process

Given that an EC-VARMA process starts at some fixed time with fixed initial values, we can write Eq. 5 in a pure EC-VAR form Lütkepohl [2006]:

 Δxt=Π∗xt−1+t−1∑i=1Γ∗iΔxt−i+εt,   t∈N (11)

This allows us to approximate an EC-VARMA process with an EC-VAR model. To use EC-VARMA as a DGP in Algorithm 1, we edit line 5 to be:

 Δ~xt=^Πxt−1+M∑i=1^ΓiΔxt−i

where are the approximating EC-VAR parameters.

### 4.2 Online Prediction for EC-VARMA Models

We generalize the assumptions U1-U4 to the multivariate setting:

• is generated by an EC-VARMA process. The noise sequence of the underlying VARMA process is independent. Also, it satisfies that .

• We overload notation for the vector case and let be a convex loss function with Lipschitz with constant .

• We assume the companion matrix of the MA lag polynomial is diagonalizable. and are the same as in assumption U4.

The resulting algorithm is summarized in Algorithm 2, denoted EC-VARMA-OGD. The setup of this algorithm is the same as described in Section 3. We overload more notation to generalize Equations 6 and 7:

 ℓMt(γ) Missing \left or extra \right (12) ft(Π,Γ,Θ) =ℓt(xt,xt−1+Πxt−1+p−1∑i=1ΓiΔxt−i+q∑i=1Θiεt−i)

The regret as defined in Eq. 8 can be generalized to

 Regret=T∑t=1ℓMt(γt)−minΠ,Γ,Θ∈KT∑t=1E[ft(Π,Γ,Θ)] (13)

where is the set of invertible EC-VARMA processes.

To encourage to be low rank, we project it onto , which is the nuclear norm ball of radius . This involves projecting the singular values of onto an -ball and can be efficiently done [Duchi et al., 2008]. In our framework, this is handled by letting the convex set be described as and plugging it into OGD where projections are made at each iteration. For convenience of notation, let . As in Section 3, should be chosen to be large enough to encompass a valid approximation to the true DGP.

We present the following regret bound:

###### Theorem 4.1.

Let . Then for any data sequence that satisfies assumptions M1-M3, Algorithm 2 generates a sequence in which

 T∑t=1ℓMt(γt)−minΠ,Γ,Θ∈KT∑t=1E[ft(Π,Γ,Θ)]=O(DG(T)√T) (14)

For the remainder of the section, we assume that is the squared loss and .

Remark 1: With the above assumptions, the resulting regret bound of EC-VARMA-OGD is .

Remark 2: By setting and using in place of (i.e. not differencing) in Algorithm 2, we use a VARMA process as the DGP and achieve an equivalent regret bound as in the previous remark. Denote this adaptation as VARMA-OGD. However, if the DGP is EC-VARMA, we expect this to empirically perform worse than EC-VARMA-OGD since the latter exploits a valid transformation of the data.

Remark 3: Assume that the DGP is an EC-VARMA process and . Then the regret bound obtained is . In Section 6, we find that this choice of works well empirically.

## 5 NonSTOP

Algorithms 1 and 2 assume that the appropriate transformation is known apriori. Typically, statistical tests are used to determine the degree of differencing on a fixed dataset (e.g. Elliot et al. [1996]) and these usually come with assumptions and sample size requirements. In the online setting, these requirements are not realistic and an ideal method must adapt to the incoming data leading to a possibly time dependent sequence of transformations. We approach this problem by using the online learning with experts (OLE) setup wherein each expert corresponds to a specific transformation (including none at all). Specifically, we adapt the (randomized) weighted majority algorithm (WM, for details see [Shalev-Shwartz, 2011]) as a meta-algorithm to select a transformation at each time step and allow for potentially unbounded loss functions.

More precisely, let be the set of experts we consider. The set of experts can either be instantiations of Algorithm 1 or 2. For example, in the univariate setting, we could have with , and in the multivariate setting we can have . We assume that the seasonal period is known.

The resulting algorithm is summarized in Algorithm 3, denoted NonSTOP (NonSTationary Online Prediction). In each round, the online meta-algorithm randomly selects a prediction from one of its experts. After receiving the loss, it then updates its view about its experts, while the experts themselves are adapting to the data. We scale the loss function with a sliding window maximum such that the losses stay bounded. Since , and as shown in Algorithm 1 and 2 are now dependent on the specific transformation, we denote this as for a model . With these definitions in hand, we give the following theorem:

###### Theorem 5.1.

Let . Define . Then Algorithm 3 plays a sequence of predictions that satisfies

 T∑t=1E[ℓt(ht)]−minα,β∈KT∑t=1E[ft(α,β)]=O(max{BT,D∗G∗(T)}√T) (15)

where . Note that this bound is the same as if you used in place of .

###### Remark 5.1.1.

When using least squares loss, and the regret bound defaults to .

## 6 Empirical Results

In this section, we show empirically the effectiveness of methods described in Sections 3, 4, and 5 on synthetic and real datasets. In each scenario, we use squared loss and plot the log average squared loss vs. iteration. For all experiments, we set , initialize all parameters to 0, and set the sliding window length . For all real world datasets, we log transform the time series. Plots of these datasets can be found in the Supplement.

### 6.1 Univariate Setting

We first simulate 20 synthetic time series with from the following SARIMA model:

 ΔΔ12xt=(1−0.95L)(1−0.4L12)ϵt (16)

We compare NonSTOP to each algorithm in Table 3 on the generated data. We run the algorithms on each generated time series and average the log average loss in Figure a. As expected, SARIMA-OGD outperforms ARMA-OGD and ARIMA-OGD, converging quickly as it accounts for the appropriate nonstationarities. This behaviour is consistent with our hypothesis that in the absence of an appropriate transformation, existing methods will underperform. NonSTOP gradually adapts and learns to heavily weight the correct transformation (expert) and outperforms ARMA-OGD and ARIMA-OGD. Note that the initial bump in the convergence of ARMA-OGD is due to the fact that we are plotting (log) average loss and there is an initial period where the average loss grows faster than , and then it grows slower.

To showcase the adaptability of NonSTOP, we simulate 20 synthetic time series from Eq. (16) for timesteps, and then simulate data from an ARIMA model for another timesteps. Results are shown in Figure b. NonSTOP learns to weight SARIMA-OGD, but quickly adapts at to weight ARIMA-OGD. In fact, at the end of the run, it actually outperforms all experts, showing the power of this fully online adaptable algorithm.

Next, we consider a dataset that contains daily electricity demand in Turkey from January 1, 2000, to December 31, 2008. The seasonality in this dataset is biannual (rounded down to days). The results of running the algorithms are shown in Figure c. Again, SARIMA-OGD accounts for the appropriate nonstationarities and performs the best. ARMA-OGD suffers severely due to not accounting for nonstationarity. As such, NonSTOP quickly finds that ARMA-OGD is not a reliable expert and performs well in comparison to the other experts. For the daily recorded births in Quebec from Jan. 01, 1977 to Dec. 31, 1990 there is a weekly seasonality pattern with . Figure d reveals that the results here are similar to previous datasets. Because NonSTOP starts with an equal weight for each expert, it pays a large penalty for selecting ARMA-OGD in initial iterations. However, it approaches the performance of the other algorithms as it learns to optimally weight the correct transformation.

Lastly, we consider a dataset consisting of daily river flow values from the Saugeen River from Jan 01, 1915 to Dec 31, 1979. This data exhibits a yearly () seasonality pattern. The results are plotted in Figure e. While accounting for any non-stationarity improves convergence, accounting for the seasonality actually hurts the performance compared to accounting for the trend only as SARIMA-OGD is outperformed by ARIMA-OGD. In our experience, sometimes ARIMA can outperform SARIMA even on seasonal data. Yet NonSTOP learns to weight ARIMA-OGD and quickly approaches the best performance. This showcases the efficacy of the NonSTOP algorithm in a fully online setting.

### 6.2 Multivariate Setting

In the multivariate setting we show empirically that accounting for cointegration results in faster convergence. We look at the results of running EC-VARMA-OGD as described in Algorithm 2 vs. VARMA-OGD on two real datasets.

We collected 7 time series of stock prices from Yahoo Finance (http://finance.yahoo.com/) of large technology companies, and also includes the S&P500 index. By including the S&P500, which is essentially an weighted average of 500 company stock prices, we have partially introduced cointegration into the time series. We set and ran all algorithms with the resulting plots in Figure a. Accounting for cointegration results in considerably stronger performance. There is a bump in the convergence plot due to a spike in the data (see Supplement). We also evaluated the algorithms on the Google Flu dataset (https://www.google.com/publicdata/explore/) which contains influenza rates of countries. There are two distinct seasonality patterns: the northern hemisphere countries have flu incidents that peak in one part of the year while the southern hemisphere countries have flu incidents that peak in the other part of the year. Thus, it makes sense to believe that the time series exhibit cointegration. This dataset exhibits yearly seasonality, thus we set to be larger than one seasonal period. We choose and plot the results are given in Figure b. Again, adjusting for the cointegration dramatically increases predictive performance.

On both datasets, NonSTOP pays a penalty for selecting VARMA-OGD in the initial iterations before learning to heavily weight EC-VARMA-OGD. Note that NonSTOP outperforms VARMA-OGD by at least a factor of 3 on the original scale for both datasets.

## 7 Conclusions and Future Work

We presented general online algorithms that account for time series to account for nonstationary artifacts in both univariate and multivariate data. If known in advance, we demonstrate that these transformations lead to superior theoretical and empirical performance. In the case that these are unknown, we incorporate a finite set of possible transformations into a OLE framework called NonSTOP that can learn to appropriately weight the correct transformation. Speculating that accounting for nonstationary artifacts reduces correlation in the data, we presented a data dependent bound for FTL in the case of squared loss. In future work, we plan to explore extensions that hold for more complicated models like ARFIMA.

## Appendix A Proof of Theorem 3.1

We give a proof similar to Anava et al. [2013] and Liu et al. [2016] using our transformation notation, and with the more natural and relaxed assumption of invertibility of the MA process.

###### Proof.

Step 1: Assume that is a linear function such as the ones given in Table 2. Then are convex loss functions, and we may invoke [Zinkevich, 2003] with a fixed step size :

 T∑t=1ℓMt(γt)−minγT∑t=1ℓMt(γ)=O(DG(T)√T)

Note that the proof in [Zinkevich, 2003] uses a constant upper bound on the gradients. Since we assume is a monotonically increasing function, the proof in [Zinkevich, 2003] follows through straightforwardly.

Step 2: Let denote the parameters of the underlying ARMA process. We define a few things:

 τ(x∞t(α,β)) =la∑i=1αiτ(xt−i)+lm∑i=1βi(τ(xt−i)−τ(x∞t−i(α,β))) x∞t(α,β) =ζ(τ(x∞t(α,β)))

with initial condition for . For convenience, assume that we have fixed data so that exists. Denote

 f∞t(α,β)=ℓt(xt,x∞t(α,β))

With this definition, we can write , i.e. as a growing AR process. Next, we define

 τ(xmt(α,β)) =la∑i=1αiτ(xt−i)+lm∑i=1βi(τ(xt−i)−τ(xm−it−i(α,β))) xmt(α,β) =ζ(τ(xmt(α,β)))

with initial condition for . We relate and with this relation: . With this definition, we can write , i.e. as a fixed length AR process. Denote

 fmt(α,β)=ℓt(xt,xmt(α,β))

Let . Recall that the only random part of the expectation is . is fixed in this quantity.

Lemma A.1 gives us that

 minγT∑t=1ℓMt(γ)≤T∑t=1fmt(α∗,β∗)

Lemma A.3 says that choosing results in

 ∣∣ ∣∣T∑t=1E[fmt(α∗,β∗)]−T∑t=1E[f∞t(α∗,β∗)]∣∣ ∣∣=O(1)

Lemma A.2 gives us that

 ∣∣ ∣∣T∑t=1E[f∞t(α∗,β∗)]−T∑t=1E[ft(α∗,β∗)]∣∣ ∣∣=O(1)

Chaining all of these gives us the final result:

 T∑t=1ℓmt(γt)−minα,β∈KT∑t=1E[ft(α,β)]=O(DG(T)√T)

###### Lemma A.1.

For all and that satisfies the assumptions U1-U4, we have that

 minγT∑t=1ℓmt(γ)≤T∑t=1fmt(α∗,β∗)
###### Proof.

We simply set and get . Thus, the minimum holds trivially. Note that we assume

###### Lemma A.2.

For any data sequence that satisfies the assumptions U1-U4, it holds that

 ∣∣ ∣∣T∑t=1E[f∞t(α∗,β∗)]−T∑t=1E[ft(α∗,β∗)]∣∣ ∣∣=O(1)
###### Proof.

Let denote the parameters that generated the signal. Thus,

 ft(α′,β′)=ℓt(xt,xt−εt)

for all . Since is independent of , the best prediction at time will cause a loss of at least Since and is convex, it follows that and that

 ft(α∗,β∗)=ℓt(xt,xt−εt)

We define a few things first. Let

 yt=τ(xt)−τ(x∞t(α∗,β∗))−εt,    yt=⎡⎢ ⎢ ⎢ ⎢ ⎢⎣ytyt−1⋮yt−q+1⎤⎥ ⎥ ⎥ ⎥ ⎥⎦

WLOG (and by assumption), we can assume that , where is some positive constant. Next we show that

 E[|yt|]=E[|τ(xt)−τ(x∞t(α∗,β∗))−εt|]≤κλtmaxρ

We have that

 τ(xt)−τ(x∞t(α∗,β∗))−εt= la∑i=1α∗iτ(xt−i)+lm∑i=1β∗iεt−i+εt −la∑i=1α∗iτ(xt−i)−lm∑i=1β∗i(τ(xt−i)−τ(x∞t−i(α∗,β∗)))−εt = −lm∑i=1β∗i(τ(xt−i)−τ(x∞t−i(α∗,β∗))−εt−i)

which shows that . The companion matrix to this difference equation is exactly as defined in Eq. 2. Thus,

 yt=Fyt−1

Next, we note that

 |yt| ≤∥yt∥2=∥Fyt−1∥2 =∥F2yt−2∥2 =∥Fty0∥2 =∥TΛtT−1y0∥2 ≤∥T∥2∥T−1∥2∥Λt∥2∥y0∥2 =σmax(T)σmin(T)λtmax∥y0∥2 ≤κλtmax∥y0∥2

Taking the expectation gives us

Now we combine this with the Lipschitz continuity of to get

 ∣∣E[f∞t(α∗,β∗)]−E[ft(α∗,β∗)]∣∣ =∣∣E[ℓt(xt,x∞t(α∗,β∗))]−E[ℓt(xt,xt−εt)]∣∣ ≤E[|ℓt(xt,x∞t(α∗,β∗))−ℓt(xt,xt−