
Penalised inference for autoregressive moving average models with time-dependent predictors

Hamed Haselimashhadi and Veronica Vinciotti

Department of Mathematics, Brunel University London, UK

Abstract: Linear models that contain a time-dependent response and explanatory variables have attracted much interest in recent years. The most general form of the existing approaches is a linear regression model with autoregressive moving average residuals. The addition of the moving average component results in a complex model with a very challenging implementation. In this paper, we propose to account for the time dependency in the data by explicitly adding autoregressive terms of the response variable in the linear model. In addition, we consider an autoregressive process for the errors in order to capture complex dynamic relationships parsimoniously. To broaden the application of the model, we present a penalized likelihood approach for the estimation of the parameters and show how adaptive lasso penalties lead to an estimator which enjoys the oracle property. Furthermore, we prove the consistency of the estimators with respect to the mean squared prediction error in high-dimensional settings, an aspect that has not been considered by the existing time-dependent regression models. A simulation study and real data analyses show successful applications of the model to financial data on stock indices.

Keywords: time series, high dimensional models, lasso

## 1 Introduction

This paper deals with fitting a general time series-regression model using regularized inference. In the context of linear models, penalized approaches have received great interest in recent years as they allow one to perform variable selection and parameter estimation simultaneously for any data, including high-dimensional datasets, where classical approaches to parameter estimation break down, e.g. [14, 7, 10, 17, 12]. [17] have shown that a model where penalties are adapted to each individual regressor enjoys oracle properties. Most of the advances in regularized regression models have been for the case of independent and identically distributed data. A recent line of research has concentrated on regularized models in time-dependent frameworks. Amongst these, [15] showed the successful application of penalised inference in the context of autocorrelated residuals of a fixed order, by proposing the model

$$y_t = \sum_{i=1}^{r} x'_{ti}\beta_i + \sum_{j=1}^{q}\theta_j\epsilon_{t-j} + e_t,$$

and studied the properties of this model in low-dimensional settings. [11] studied the theoretical properties of a regularized autoregressive process for both low- and high-dimensional cases, whereas [13] studied the estimation of vector autoregressive models. In both cases, no exogenous variables are included in the model. [9] studied the asymptotic properties of adaptive lasso in high-dimensional time series models when the number of variables increases as a function of the number of observations. Their model covers a lagged regression in the presence of exogenous variables, but does not consider autocorrelated residuals. Recently, [16] proposed an extension of the model of [15] by adding a moving average term, that is, they propose a model of the form

$$y_t = \sum_{i=1}^{r} x'_{ti}\beta_i + \epsilon_t,\qquad \epsilon_t = \sum_{j=1}^{q}\theta_j\epsilon_{t-j} + e_t + \sum_{j=1}^{q}\phi_j e_{t-j}.$$

Similarly to [15], they proved the consistency of the model in low-dimensional cases. Despite the generality of this model, considering an ARMA process for the errors results in a complex model with a challenging implementation.

In this paper, we propose to account for the time dependency in the data by explicitly adding autoregressive terms of the response variable in the linear model, as in [11], as well as an autocorrelated process for the residuals, as in [15], in order to capture complex dynamics parsimoniously. In particular, given fixed orders $p$ and $q$, we propose the model

$$y_t = x'_t\beta + \sum_{j=1}^{p}\phi_j y_{t-j} + \sum_{i=1}^{q}\theta_i\epsilon_{t-i} + e_t. \qquad (1)$$

We name the terms on the right-hand side of (1) the REGression term, the AutoRegressive term and the Moving Average term, respectively, and call the resulting model REGARMA. We assume that all time-dependent components in REGARMA are stationary and ergodic. Figure (1) illustrates a schematic view of the model.

In Section 2, we formulate the model and present a penalized likelihood approach for the estimation of the parameters. In Section 3, we prove the asymptotic properties of the model and show how the adaptive lasso penalty leads to estimators which enjoy the oracle property. Furthermore, we prove the consistency of the model with respect to the mean squared prediction error in high-dimensional settings, an aspect that has not been considered by the existing time-dependent regression models. In Section 4, we discuss the implementation of REGARMA. A simulation study, given in Section 5, accompanies the theoretical results. In Section 6 we apply the model to two real datasets, in finance and macroeconomics, respectively. Finally, we draw some conclusions in Section 7.

## 2 L1 penalised parameter estimation of REGARMA

The general form of REGARMA consists of a lagged response variable, covariates and autocorrelated residuals. Consider the following Gaussian REGARMA model of orders $p$ and $q$,

$$y_t = x'_t\beta + \sum_{j=1}^{p}\phi_j y_{t-j} + \sum_{i=1}^{q}\theta_i\epsilon_{t-i} + e_t,\qquad e_t \overset{iid}{\sim} N(0,\sigma^2),\quad t=1,2,3,\dots,T$$

where $x_t$ is the $t$-th row of the matrix of predictors $X$, and $y_t$ and $\epsilon_t$ follow stationary time series processes, that is, all roots of the autoregressive and moving average polynomials are distinct and lie outside the unit circle. The $e_t$ are independent and identically distributed Gaussian noises with mean zero and finite fourth moments, and the orders $p$ and $q$ are both less than the number of observations $T$. Moreover, we assume that the errors $e_t$ and the explanatory variables in $X$ are independent of each other. To remove the constants from the model we follow the literature on regularized models, e.g. [14, 5], and standardize the covariates and response to zero mean and unit variance.

Given the first $T_\circ$ observations, maximizing the penalized conditional likelihood of the model is equivalent to minimizing

$$Q_n(\Theta) = \sum_{t=T_\circ+1}^{T}\Big((y_t - x'_t\beta) - \sum_{i=1}^{p}\phi_i y_{t-i} - \sum_{j=1}^{q}\theta_j\epsilon_{t-j}\Big)^2 + \lambda\sum_{i=1}^{r}|\beta_i| + \gamma\sum_{j=1}^{p}|\phi_j| + \tau\sum_{k=1}^{q}|\theta_k| \qquad (2)$$

where $\lambda$, $\gamma$ and $\tau$ are tuning parameters and $\Theta = (\beta',\phi',\theta')'$ is the vector of regression, autoregressive and moving average parameters. Following the literature, and given the superior properties of adaptive lasso models [17], we also propose an adaptive version of REGARMA penalised estimation as follows

$$Q^*_n(\Theta) = \sum_{t=T_\circ+1}^{T}\Big((y_t - x'_t\beta) - \sum_{i=1}^{p}\phi_i y_{t-i} - \sum_{j=1}^{q}\theta_j\epsilon_{t-j}\Big)^2 + \sum_{i=1}^{r}\lambda^*_i|\beta_i| + \sum_{j=1}^{p}\gamma^*_j|\phi_j| + \sum_{k=1}^{q}\tau^*_k|\theta_k|$$

where the $\lambda^*_i$, $\gamma^*_j$ and $\tau^*_k$ are coefficient-specific tuning parameters.

### 2.1 Matrix representation of the model

For convenience, we write the model in matrix representation. Let $H$ be the matrix collecting the lags of the autoregressive terms ($H^{(p)}$), the moving average terms ($H^{(q)}$) and the explanatory variables ($X$). Let $\Theta = (\beta',\phi',\theta')'$ denote the vector of corresponding parameters, $e$ the vector of errors, and $Y$ the response vector, as previously defined. Then, in matrix form, the model can be written as

$$Y = H'\Theta + e$$

and maximizing the penalized conditional likelihood given the first $T_\circ$ observations is equivalent to minimizing

$$Q_n(\Theta) = L(\Theta) + \lambda'|\beta| + \gamma'|\phi| + \tau'|\theta|,$$

where $L(\Theta) = (Y - H'\Theta)'(Y - H'\Theta)$. Similarly, the adaptive form of the model is given by

$$Q^*_n(\Theta) = L(\Theta) + \lambda'_*|\beta| + \gamma'_*|\phi| + \tau'_*|\theta|, \qquad (3)$$

where the parameters are given by

$$\lambda'_* = (\lambda^*_1,\lambda^*_2,\dots,\lambda^*_r),\quad \gamma'_* = (\gamma^*_1,\gamma^*_2,\dots,\gamma^*_p),\quad \tau'_* = (\tau^*_1,\tau^*_2,\dots,\tau^*_q),\quad \Theta = (\beta',\phi',\theta')'.$$

## 3 Theoretical properties of REGARMA and adaptive-REGARMA

In order to study the theoretical properties of REGARMA and adaptive-REGARMA, we denote the true coefficients by $\Theta_\circ = (\beta'_\circ,\phi'_\circ,\theta'_\circ)'$ and assume that some of these coefficients are zero. The indexes of non-zero coefficients in each group of coefficients, $\beta_\circ$, $\phi_\circ$ and $\theta_\circ$, are denoted by $s_1$, $s_2$ and $s_3$ respectively, whereas $s^c_1$, $s^c_2$ and $s^c_3$ are the complementary sets and contain the indexes of the zero coefficients. The (REGARMA) estimates are denoted by $\hat\Theta = (\hat\beta',\hat\phi',\hat\theta')'$, and the adaptive-REGARMA estimates by $\hat\Theta^* = (\hat\beta^{*\prime},\hat\phi^{*\prime},\hat\theta^{*\prime})'$. Finally, different combinations of the model parameters will be used, with obvious meaning.

### 3.1 Assumptions

To prove the theoretical properties of the estimators, in line with the literature, we make use of the following assumptions:

1. The $e_t$ are i.i.d. Gaussian random variables with finite fourth moments

2. The covariates, $x_t$, and response variable, $y_t$, are stationary and ergodic with finite second-order moments. Also, we assume that all roots of the autoregressive and moving average polynomials are distinct and lie outside the unit circle

3. The covariates $x_t$ are independent of the errors $e_t$

4. and .

Assumptions 1–3 are standard assumptions for dealing with stationary time series. Assumption 4 is used to guarantee that the explanatory variables have finite expectations.

### 3.2 Theoretical properties of REGARMA when r<n

In the following theorems, we extend the theorems of [15] to cover a model with a lagged response.

###### Theorem 1.

Assume $\sqrt{n}\,\lambda_n \to \lambda_\circ \ge 0$, $\sqrt{n}\,\gamma_n \to \gamma_\circ \ge 0$ and $\sqrt{n}\,\tau_n \to \tau_\circ \ge 0$. Then, under assumptions 1–4, it follows that $\sqrt{n}(\hat\Theta - \Theta_\circ) \xrightarrow{d} \operatorname{argmin}_{\delta} k(\delta)$, where

$$\begin{aligned} k(\delta) = {} & -2\delta'W + \delta'U_B\delta \\ & + \lambda_\circ\sum_{i=1}^{r}\big\{u_i\,\mathrm{sign}(\beta_{\circ i})\,I(\beta_{\circ i}\neq 0) + |u_i|\,I(\beta_{\circ i}=0)\big\} \\ & + \gamma_\circ\sum_{j=1}^{p}\big\{v_j\,\mathrm{sign}(\phi_{\circ j})\,I(\phi_{\circ j}\neq 0) + |v_j|\,I(\phi_{\circ j}=0)\big\} \\ & + \tau_\circ\sum_{k=1}^{q}\big\{w_k\,\mathrm{sign}(\theta_{\circ k})\,I(\theta_{\circ k}\neq 0) + |w_k|\,I(\theta_{\circ k}=0)\big\} \end{aligned}$$

with $\delta = (u',v',w')'$ a vector of perturbations partitioned conformably with $\Theta = (\beta',\phi',\theta')'$, $W$ a multivariate normal random vector, and $U_B$ the limiting second-moment matrix of the regressors $(X', H^{(p)}, H^{(q)})$.

The proof is given in the Appendix. Theorem 1 shows that the REGARMA estimator has a Knight–Fu type asymptotic property [7], and it implies that the tuning parameters in (2) cannot shrink to zero at a speed faster than $n^{-1/2}$. Otherwise, $\lambda_\circ$, $\gamma_\circ$ and $\tau_\circ$ are zero and $k(\delta)$ becomes a standard quadratic function,

$$k(\delta) = -2\delta'W + \delta'U_B\delta$$

which does not produce a sparse solution. In addition, the proof of Theorem 1 requires the errors to be independent and identically distributed, but we do not make a strong assumption on the type of distribution of the errors, due to the use of the martingale central limit theorem for large $n$.

[7] proved that lasso optimization returns estimates of the non-zero parameters that suffer from an asymptotic bias. This applies also to the REGARMA model, as we show in the following remark.

###### Remark 1.

Consider a special case of REGARMA where $\phi_\circ = 0$ and $\theta_\circ = 0$, while $\beta_{\circ i} \neq 0$ for all $i$. If minimizing $k(\delta)$ correctly identifies the zero coefficients, then $v = 0$ and $w = 0$. That is, $u$ must satisfy

$$\begin{aligned} \frac{\partial k(\delta)}{\partial u} &= \frac{\partial k(u,0,0)}{\partial u} \\ &= \frac{\partial}{\partial u}\Big(-2(u',0,0)W + (u',0,0)U_B(u',0,0)' + \big(n\lambda'_n\big|\beta_\circ + \tfrac{u}{\sqrt{n}}\big| - n\lambda'_n|\beta_\circ|\big)\Big) \\ &= -2W_{1:r} + 2u'U_{B\,1:r} + \lambda_\circ \mathbf{1}_{r\times 1} = 0 \\ &\implies u' = \tfrac{1}{2}\big(2W_{1:r} - \lambda_\circ\mathbf{1}_{r\times 1}\big)U_{B\,1:r}^{-1}. \end{aligned}$$

Then, using Theorem 1, $\sqrt{n}(\hat\beta - \beta_\circ) \xrightarrow{d} u$, where $U_{B\,1:r}$ is the matrix formed by the first $r$ rows of $U_B$, corresponding to the covariates.

If the limits $\lambda_\circ$, $\gamma_\circ$ and $\tau_\circ$ are positive, Remark 1 shows that the estimator suffers an asymptotic bias and differs from the oracle estimator. In other words, REGARMA is not asymptotically consistent unless $\lambda_\circ = 0$. The remark can be extended to the other groups of coefficients.

### 3.3 Theoretical properties of adaptive-REGARMA when r<n

Following the notation of section 2, we consider the adaptive version of the penalised likelihood and estimate the model parameters by minimizing

$$Q^*_n(\Theta) = L_n(\Theta) + n\lambda'_*|\beta| + n\gamma'_*|\phi| + n\tau'_*|\theta|$$

where

$$L_n(\Theta) = (Y - X'\beta - H^{(p)}\phi - H^{(q)}\theta)'(Y - X'\beta - H^{(p)}\phi - H^{(q)}\theta),$$
$$\lambda'_* = (\lambda^*_1,\dots,\lambda^*_r),\quad \gamma'_* = (\gamma^*_1,\dots,\gamma^*_p),\quad \tau'_* = (\tau^*_1,\dots,\tau^*_q),\quad \Theta = (\beta',\phi',\theta')'.$$

Following [15] and [3], we define the maximum and minimum penalties for significant and insignificant coefficients by

$$a_n = \max\big(\lambda^*_{i_1},\gamma^*_{i_2},\tau^*_{i_3};\ i_1\in s_1,\ i_2\in s_2,\ i_3\in s_3\big),\qquad b_n = \min\big(\lambda^*_{i^c_1},\gamma^*_{i^c_2},\tau^*_{i^c_3};\ i^c_1\in s^c_1,\ i^c_2\in s^c_2,\ i^c_3\in s^c_3\big),$$

and prove a number of results on the theoretical properties of adaptive REGARMA.

###### Theorem 2.

Assume $a_n \to 0$ as $n \to \infty$. Then, under assumptions 1–4, there is a local minimiser $\hat\Theta^*$ of $Q^*_n(\Theta)$ such that

$$\big(\hat\Theta^* - \Theta_\circ\big) = O_p\big(n^{-1/2} + a_n\big).$$

The proof of the theorem is in the Appendix. Theorem 2 proves that there exists a $\sqrt{n}$-consistent local minimiser $\hat\Theta^*$ when the tuning parameters of the significant variables converge to zero at a speed faster than $n^{-1/2}$ (since then $a_n = o(n^{-1/2})$).

As the next step, we prove that if the tuning parameters associated with the insignificant variables in REGARMA shrink to zero at a speed slower than $n^{-1/2}$, then their associated REGARMA coefficients will be estimated exactly as zero with probability tending to 1.

###### Theorem 3.

Assume $\sqrt{n}\,a_n \to 0$ and $\sqrt{n}\,b_n \to \infty$; then

$$\Pr\big(\hat\beta^*_{s^c_1} = 0\big)\to 1,\qquad \Pr\big(\hat\phi^*_{s^c_2} = 0\big)\to 1,\qquad \Pr\big(\hat\theta^*_{s^c_3} = 0\big)\to 1.$$

The proof of the theorem is in the Appendix. Theorems 2 and 3 indicate that, under certain conditions on the tuning parameters, the adaptive REGARMA estimator is both $\sqrt{n}$-consistent and sparse, leading to the following result:

###### Theorem 4.

Assume $\sqrt{n}\,a_n \to 0$ and $\sqrt{n}\,b_n \to \infty$. Then, under assumptions 1–4, the non-zero component $\hat\Theta^*_1$ of the local minimiser of $Q^*_n(\Theta)$ in Theorem 3 satisfies

$$\sqrt{n}\big(\hat\Theta^*_1 - \Theta_{\circ 1}\big) \xrightarrow{d} MVN\big(O, \sigma^2 U_0^{-1}\big)$$

where $U_0$ is the sub-matrix of $U_B$ corresponding to the non-zero parameters $\Theta_{\circ 1}$.

The proof of the theorem is in the Appendix. Theorem 4 implies that if the penalties of the significant variables tend to zero faster than $n^{-1/2}$ while those of the insignificant variables vanish more slowly than $n^{-1/2}$, then adaptive REGARMA is asymptotically an oracle estimator. In the next subsection, we consider the theoretical properties of adaptive REGARMA for high-dimensional problems.

### 3.4 Theoretical properties of adaptive REGARMA when n≪r

In the proofs of the low-dimensional results (see the proof of Theorem 1), we rely on a unique path reaching the maximum of the log-likelihood. This is not true in high-dimensional cases, so different arguments are needed. In this section we follow a strategy similar to [2] to prove theorems in the high-dimensional case, an aspect which has not been considered by existing time-dependent regression models, such as those of [15] and [16].

In order to study the consistency of REGARMA in high-dimensional situations, we show that under assumptions 1–4, REGARMA is consistent with respect to the mean squared prediction error.
Without loss of generality, we define the REGARMA model as a constrained optimization [14]. Thus, we have

$$\min\ \big\{(y - X'\beta - H^{(p)}\phi - H^{(q)}\theta)'(y - X'\beta - H^{(p)}\phi - H^{(q)}\theta)\big\} \qquad (4)$$
$$\text{subject to}\quad \sum_{j=1}^{r}|\beta_j|\le K_\lambda,\quad \sum_{k=1}^{p}|\phi_k|\le K_\gamma,\quad \sum_{l=1}^{q}|\theta_l|\le K_\tau,\quad \text{and}\ K_\lambda\ge 0,\ K_\gamma\ge 0,\ K_\tau\ge 0$$

where there is a one-to-one correspondence between the penalties $\lambda$, $\gamma$ and $\tau$ in REGARMA and the constraints $K_\lambda$, $K_\gamma$ and $K_\tau$ in (4). Define the Mean Squared Prediction Error, $MSPE$, and its estimated value, $\widehat{MSPE}$, by

$$MSPE(\hat\beta,\hat\phi,\hat\theta) = E\big(\|\hat Y - Y_\circ\|^2\big),\qquad \widehat{MSPE}(\hat\beta,\hat\phi,\hat\theta) = \frac{1}{n}\|\hat Y - Y_\circ\|^2$$

where $Y_\circ$ and $\hat Y$ are the REGARMA predictions of $Y$ based on the true parameters and on the REGARMA estimates from (4), respectively. Then the following theorem holds.

###### Theorem 5.

Under assumptions 1–4, let $\hat\beta$, $\hat\phi$ and $\hat\theta$ be the REGARMA estimates from (4), obtained with bounds $K_\lambda$, $K_\gamma$ and $K_\tau$ such that

$$\sum_{j=1}^{r}|\beta_j|\le K_\lambda<\infty,\qquad \sum_{k=1}^{p}|\phi_k|\le K_\gamma<\infty,\qquad \sum_{l=1}^{q}|\theta_l|\le K_\tau<\infty.$$

Then

$$\widehat{MSPE}(\hat\beta,\hat\phi,\hat\theta) \le \frac{2K_{\max}M_{\max}\sigma}{\sqrt{n}}\Big(\sqrt{2\log(2r)} + \sqrt{2\log(2p)} + \sqrt{2\log(2q)}\Big). \qquad (5)$$

The proof of the theorem is in the Appendix. Note that in the situation where $p = 0$ and $q = 0$, equation (5) reduces to the standard lasso consistency bound in [2]. When $K_{\max}$ is correctly chosen, equation (5) also shows that REGARMA is prediction consistent when $\log(r)/n \to 0$. It is also possible to extend this result to the population $MSPE$.
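To see the rate in (5) concretely, the right-hand side vanishes even when the number of covariates grows polynomially in the sample size. The sketch below evaluates the bound numerically; the constants $K_{\max}$, $M_{\max}$ and $\sigma$ are arbitrary illustrative choices, not values from the paper:

```python
import math

def regarma_bound(n, r, p, q, K_max=1.0, M_max=1.0, sigma=1.0):
    """Right-hand side of the high-dimensional bound (5)."""
    s = (math.sqrt(2 * math.log(2 * r)) + math.sqrt(2 * math.log(2 * p))
         + math.sqrt(2 * math.log(2 * q)))
    return 2 * K_max * M_max * sigma / math.sqrt(n) * s

# the bound shrinks as n grows, even with r = n**2 covariates
bounds = [regarma_bound(n, r=n**2, p=2, q=2) for n in (10**2, 10**4, 10**6)]
```

The logarithmic dependence on $r$ is what allows the number of covariates to far exceed the number of observations, as long as $\log(r)/n \to 0$.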

###### Remark 2.

Under the same conditions as Theorem (5),

$$MSPE(\hat\beta,\hat\phi,\hat\theta) \le \frac{2K_{\max}M_{\max}\sigma}{\sqrt{n}}\sum_{i=1}^{3}\sqrt{2\log(2a_i)} + 8K^*\sum_{i,j=1}^{3}\Bigg(M_iM_j\sqrt{\frac{2\log(2a_ia_j)}{n}}\Bigg),$$

where $K_{\max}$ and $M_{\max}$ are defined as before, $K^*$ is defined in the Appendix, $(a_1,a_2,a_3) = (r,p,q)$, and $M_1$, $M_2$ and $M_3$ are the corresponding bounds for the three groups of regressors.

Remark 2 shows that REGARMA is consistent provided the logarithmic terms grow more slowly than $\sqrt{n}$. Given that relatively small orders $p$ and $q$ are sufficient for most time series analyses, the consistency of the estimator is mainly dominated by the high-dimensional regression part. If $p$ and $q$ are negligible relative to $r$, then the above bound approximately reduces to a form similar to the standard lasso results in [2],

$$MSPE(\hat\beta,\hat\phi,\hat\theta) \le \frac{2K_\lambda M_1\sigma}{\sqrt{n}}\sqrt{2\log(2r)} + 8K^*M_1^2\sqrt{\frac{2\log(2r^2)}{n}}.$$

## 4 Algorithm

Since REGARMA is a special case of adaptive-REGARMA, obtained when the penalties are equal within each group, and given the improved properties of adaptive REGARMA, we mainly focus on adaptive REGARMA in this section. Our formulation of the model lends itself naturally to implementation, in contrast to the more complex implementation of the model of [16].

As the model contains regression, autoregressive and moving average coefficients, we use the two-step optimization procedure

$$\text{First step: } \hat\epsilon = Y - X'\hat\beta - H^{(p)}\hat\phi,\qquad \text{Second step: } Y = X'\beta + H^{(p)}\phi + \hat H^{(q)}\theta.$$

Steps 1 and 2 provide a solution to REGARMA using the adaptive-Lasso algorithm of [17].

In terms of the selection of the penalties $\lambda$, $\gamma$ and $\tau$, these can be chosen using K-fold cross-validation or an information criterion such as BIC, AIC or Mallows' Cp, similarly to [15], [16] and [4]. The weights in adaptive-REGARMA are defined using the (non-adaptive) REGARMA estimates. Some notes are needed about the selection of the orders $p$ and $q$ in the REGARMA model. We propose two general approaches to choose the optimal orders for the model: (a) setting upper bounds $p_{\max}$ and $q_{\max}$ and choosing the model that minimizes BIC or AIC inside these bounds; (b) setting upper bounds $p_{\max}$ and $q_{\max}$ and letting the model choose the best orders by keeping or eliminating the time series coefficients under the sparsity constraints. These two approaches are very similar, but there is a slight difference between them: in the second approach, the fitting is always based on the same number of time points, whereas in the first approach, the number of time points depends on the orders $p$ and $q$. A rule of thumb is then to use the first approach when the number of observations is low and the second approach when there are enough observations.
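The two-step procedure can be sketched in a few lines of NumPy. The sketch below uses a plain coordinate-descent lasso as the inner L1 solver and a simple lag construction; all names, the solver choice and the single shared penalty are our own simplifications, not the authors' implementation (adaptive weights could be folded in by rescaling columns):

```python
import numpy as np

def lasso_cd(Z, y, lam, n_iter=200):
    """Plain coordinate-descent lasso: min ||y - Z b||^2 + lam * sum|b_j|.
    Assumes the columns of Z are non-degenerate."""
    n, d = Z.shape
    b = np.zeros(d)
    col_ss = (Z ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(d):
            r = y - Z @ b + Z[:, j] * b[j]          # partial residual
            rho = Z[:, j] @ r
            b[j] = np.sign(rho) * max(abs(rho) - lam / 2, 0) / col_ss[j]
    return b

def regarma_two_step(y, X, p, q, lam):
    """Two-step REGARMA fit: (1) penalised regression of y on X and its
    own lags to obtain residuals; (2) refit with lagged residuals added."""
    T = len(y)
    m = max(p, q)
    ylags = np.column_stack([y[m - j - 1:T - j - 1] for j in range(p)])
    Z1 = np.hstack([X[m:], ylags])
    b1 = lasso_cd(Z1, y[m:], lam)
    eps = y[m:] - Z1 @ b1                            # step-1 residuals
    elags = np.column_stack([np.r_[np.zeros(i + 1), eps[:-(i + 1)]]
                             for i in range(q)])
    Z2 = np.hstack([Z1, elags])
    b2 = lasso_cd(Z2, y[m:], lam)
    r = X.shape[1]
    return b2[:r], b2[r:r + p], b2[r + p:]           # beta, phi, theta
```

On an orthogonal design the inner solver reduces to exact soft-thresholding, which is a quick way to sanity-check the implementation.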

We are in the process of implementing the methods into an R package. This is particularly needed in this area as, to the best of our knowledge, there is no implementation available. The current version of the package is available at http://people.brunel.ac.uk/~mastvvv/Software/.

## 5 Simulation study

We design a simulation study to compare the REGARMA model with existing methods. In the simulation, we:

1. Set the proportion of zero coefficients to 90%, 50% or 10%.

2. Assign unequal random numbers to each non-zero coefficient.

3. Generate the design matrix, $X$, using stationary Gaussian processes, for different values of $r$ and $n$.

4. Generate Gaussian errors $e_t$ with mean zero and variance $\sigma^2$,

5. Set unequal AR and MA parameters, under the constraint that the roots of the stationarity polynomials sit outside the unit circle, and simulate data from a REGARMA model with fixed orders $p$ and $q$.

6. Repeat each combination of models 10 times.
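The data-generating process in steps 1–5 can be sketched as follows. Dimensions, the signal range and the AR/MA coefficient values below are illustrative choices of ours, not the paper's exact settings:

```python
import numpy as np

def simulate_regarma(n=200, r=20, p=1, q=1, zero_prop=0.5, sigma=1.0, seed=0):
    """Simulate from a REGARMA(p, q) model:
    y_t = x_t' beta + sum_j phi_j y_{t-j} + eps_t,
    eps_t = sum_i theta_i eps_{t-i} + e_t  (AR errors)."""
    rng = np.random.default_rng(seed)
    # sparse coefficient vector with unequal non-zero entries
    beta = rng.uniform(0.5, 2.0, r) * rng.choice([-1, 1], r)
    beta[rng.choice(r, int(zero_prop * r), replace=False)] = 0.0
    phi = np.full(p, 0.5 / p)        # AR part, roots outside the unit circle
    theta = np.full(q, 0.3 / q)      # error-process part
    # stationary Gaussian AR(1) covariates
    X = np.zeros((n, r))
    for t in range(1, n):
        X[t] = 0.5 * X[t - 1] + rng.standard_normal(r)
    e = sigma * rng.standard_normal(n)
    y, eps = np.zeros(n), np.zeros(n)
    for t in range(n):
        ar = sum(phi[j] * y[t - 1 - j] for j in range(p) if t - 1 - j >= 0)
        ma = sum(theta[i] * eps[t - 1 - i] for i in range(q) if t - 1 - i >= 0)
        eps[t] = ma + e[t]           # autocorrelated error process
        y[t] = X[t] @ beta + ar + eps[t]
    return y, X, beta, phi, theta
```

Varying `zero_prop` over 0.9, 0.5 and 0.1 reproduces the three sparsity regimes of step 1.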

We compare the adaptive REGARMA model with adaptive lasso, as it is the closest model in the literature for which an implementation is available. Similar results were found in the comparison of the non-adaptive versions (results not reported). BIC was used to choose the optimal penalties, whereas the autoregressive and moving average orders were fixed as the true ones.

Figures (2) to (5) compare adaptive REGARMA and adaptive lasso with respect to mean squared prediction error, BIC and the mean squared error of the regression coefficients, for different numbers of observations, covariates and noise levels. Overall, the figures show how REGARMA dominates lasso both for low- and high-dimensional problems. Figure (2) shows that, as the number of data points increases, the relative outperformance of REGARMA over lasso with respect to MSPE increases. Figure (3) shows how REGARMA achieves lower BIC values than lasso, particularly in low-dimensional cases but also for some high-dimensional ones. Figure (4) compares REGARMA and lasso with respect to the mean squared error of the regression coefficients, averaged over the different coefficients. This plot shows the advantage of using REGARMA on time-dependent data in comparison with lasso. Finally, Figure (5) shows the outperformance of REGARMA over lasso, regardless of the noise level $\sigma$.

## 6 Real data analysis

For the first application, we consider REGARMA in a low-dimensional problem. In particular, we consider financial data on daily returns of the Istanbul Stock Exchange (ISE) together with seven other international indices, SP, DAX, FTSE, NIKKEI, BOVESPA, MSCE EU, MSCI EM, for a period of two years from 2009 to 2011. The data are publicly available at http://archive.ics.uci.edu/ml and are also considered by [1]. The goal of the analysis is to detect the most effective indices in relation to the ISE index.
We set a maximum order of 4 for both $p$ and $q$ and use BIC to select the optimal penalty parameters (i.e. method (b) of Section 4). Table (1) shows a comparison of REGARMA with adaptive lasso. For REGARMA, we consider also the sub-model with only autoregressive terms, REGAR, and the one with only moving average terms, REGMA (which is essentially the model of [15]), as well as the full REGARMA model. All four models choose BOVESPA, EU and EM as the most effective indices for the Istanbul exchange market. These are within the six variables selected by [1]. From Table (1) and the residual analysis in Figure (6), we can conclude that the REGARMA family shows a better performance with respect to Mean Squared Error (MSE), Mean Absolute Error (MAE) and BIC compared to adaptive lasso.

For a high-dimensional example, we consider the S&P500 indices. The S&P500 is one of the leading stock market indices for US equity: it is based on 500 leading companies and captures approximately 80% of the available market capitalization. The goal of the analysis is to find the S&P500 indices most related to the AT&T Inc index, based on monthly data over a period of fourteen years from 2000 to 2014. These data are publicly available at http://thomsonreuters.com/. After removing variables with a majority of missing values, the dataset contains 416 variables and 170 datapoints. As before, we apply adaptive lasso and adaptive REGARMA to these data and choose the optimal penalties by BIC. Moreover, we set a maximum order of 4 for both $p$ and $q$ and let the model choose the optimal orders (using method (b) of Section 4).

Table (2) summarises the results in terms of MSE, MAE, BIC and the number of non-zero coefficients. Moreover, Figure (7) illustrates the residual analysis of these four models. Both Table (2) and Figure (7) show an improved performance of REGARMA compared to the other methods.

## 7 Conclusion

In this paper we extend the idea of regression-time series models proposed in [15] to a more general class of models, thus covering a wide spectrum of applications involving multivariate time-dependent data. In particular, we study an autoregressive moving average model with time-dependent explanatory variables and present penalised inference for the estimation of its parameters. Our model lends itself naturally to parameter estimation and implementation, contrary to the linear regression with ARMA errors of [16]. We prove asymptotic properties of the proposed model in low and high dimensional situations, with the latter not considered by the existing literature on time-series regression models. We test the performance of the model on a simulation study and show a successful application on financial data.

## 8 Appendix (Proof of theorems)

###### Proof 1 (Theorem 1).

Assume $\sqrt{n}\,\lambda_n \to \lambda_\circ$, $\sqrt{n}\,\gamma_n \to \gamma_\circ$ and $\sqrt{n}\,\tau_n \to \tau_\circ$. Define

$$k_n(\delta) = Q_n\big(\Theta_\circ + n^{-1/2}\delta\big) - Q_n(\Theta_\circ). \qquad (6)$$

Note that $k_n(\delta)$ achieves its minimum at $\hat\delta = \sqrt{n}(\hat\Theta - \Theta_\circ)$. Using (2), it follows that

$$\begin{aligned} k_n(\delta) = {} & \Big(L_n\big(\Theta_\circ + \tfrac{\delta}{\sqrt{n}}\big) - L_n(\Theta_\circ)\Big) & (7a)\\ & + \Big(n\lambda'_n\big|\beta_\circ + \tfrac{u}{\sqrt{n}}\big| - n\lambda'_n|\beta_\circ|\Big) & (7b)\\ & + \Big(n\gamma'_n\big|\phi_\circ + \tfrac{v}{\sqrt{n}}\big| - n\gamma'_n|\phi_\circ|\Big) & (7c)\\ & + \Big(n\tau'_n\big|\theta_\circ + \tfrac{w}{\sqrt{n}}\big| - n\tau'_n|\theta_\circ|\Big). & (7d) \end{aligned}$$

The last three terms can be simplified as

$$\Big(n\lambda'_n\big|\beta_\circ+\tfrac{u}{\sqrt{n}}\big| - n\lambda'_n|\beta_\circ|\Big) = \sqrt{n}\,\lambda'_n\,u\,\frac{\big|\beta_\circ+u/\sqrt{n}\big| - |\beta_\circ|}{u/\sqrt{n}} \xrightarrow[n\to\infty]{} \lambda_\circ\sum_{i=1}^{r}\big\{u_i\,\mathrm{sign}(\beta_{\circ i})\,I(\beta_{\circ i}\neq 0) + |u_i|\,I(\beta_{\circ i}=0)\big\}.$$

Similarly, for the other two terms, we have

$$(7c) \xrightarrow[n\to\infty]{} \gamma_\circ\sum_{j=1}^{p}\big\{v_j\,\mathrm{sign}(\phi_{\circ j})\,I(\phi_{\circ j}\neq 0) + |v_j|\,I(\phi_{\circ j}=0)\big\},\qquad (7d) \xrightarrow[n\to\infty]{} \tau_\circ\sum_{k=1}^{q}\big\{w_k\,\mathrm{sign}(\theta_{\circ k})\,I(\theta_{\circ k}\neq 0) + |w_k|\,I(\theta_{\circ k}=0)\big\}.$$

For (7a), we have

$$(7a) = -e'e + \Big((Y - H^{(q)}\theta_\circ - H^{(p)}\phi_\circ - X'\beta_\circ) - (X',H^{(p)},H^{(q)})\tfrac{\delta}{\sqrt{n}}\Big)'\Big((Y - H^{(q)}\theta_\circ - H^{(p)}\phi_\circ - X'\beta_\circ) - (X',H^{(p)},H^{(q)})\tfrac{\delta}{\sqrt{n}}\Big).$$

Set $A = (X', H^{(p)}, H^{(q)})$ and recall that $e = Y - H^{(q)}\theta_\circ - H^{(p)}\phi_\circ - X'\beta_\circ$. Then, we have

$$Q_n\big(\Theta_\circ + \tfrac{\delta}{\sqrt{n}}\big) - Q_n(\Theta_\circ) = \Big(e' - \tfrac{\delta'A'}{\sqrt{n}}\Big)\Big(e - \tfrac{A\delta}{\sqrt{n}}\Big) - e'e + (7b) + (7c) + (7d).$$

The right-hand side of the last equation is equivalent to

$$\Big(\tfrac{\delta'A'}{\sqrt{n}}\Big)\Big(\tfrac{A\delta}{\sqrt{n}}\Big) - \Big(\tfrac{\delta'A'}{\sqrt{n}}\Big)e - e'\Big(\tfrac{A\delta}{\sqrt{n}}\Big) + (7b) + (7c) + (7d). \qquad (8)$$

From left to right, we now prove that the first term in (8) is bounded and the two other terms follow (asymptotically) normal distributions, i.e.

$$\Big(\tfrac{\delta'A'}{\sqrt{n}}\Big)\Big(\tfrac{A\delta}{\sqrt{n}}\Big) \to O(1) \qquad (9)$$
$$\Big(\tfrac{\delta'A'}{\sqrt{n}}\Big)e \to f_1 \qquad (10)$$
$$e'\Big(\tfrac{A\delta}{\sqrt{n}}\Big) \to f'_1 = f_1. \qquad (11)$$

Let $H_2 = \tfrac{1}{\sqrt{n}}(X', H^{(p)}, H^{(q)})$. Then,

$$H'_2e = \tfrac{1}{\sqrt{n}}(X',H^{(p)},H^{(q)})'e,\qquad \sqrt{n}\,H'_2e = (X',H^{(p)},H^{(q)})'e,\qquad H^\circ_t = \sqrt{n}\,H'_{2t}e_t = (X'_t,H^{(p)}_t,H^{(q)}_t)'e_t.$$

$H^\circ_t$ is a martingale difference sequence (mds, for short) because

$$E\big(H^\circ_t \mid \mathcal F_{t-1}\big) = E\big((X'_t,H^{(p)}_t,H^{(q)}_t)'e_t \mid \mathcal F_{t-1}\big) = 0,$$

where $\mathcal F_{t-1}$ denotes the information up to time $t-1$, since $e_t$ is independent of the past of the process and of the regressors.

In order to establish that the conditions of the mds central limit theorem are satisfied (refer to [8, p. 51] for the mds central limit theorem and its conditions), define

$$\bar\mu = \frac{1}{n}\sum_{t=T_0+1}^{T}H^\circ_t,\qquad \bar\sigma^2 = \frac{1}{n}\sum_{t=T_0+1}^{T}\mathrm{Var}(H^\circ_t) = \sigma^4 U_B.$$

To show the boundedness condition in the martingale central limit theorem, consider the fourth moments, so that

$$E\big(|H^\circ_t|^4\big) = E(e_t^4)\,E\big((X'_t,H^{(p)}_t,H^{(q)}_t)^4\big).$$

Under assumption 1, $E(e_t^4) < \infty$, and it can be shown that the second factor is also finite, provided $x_t$ and $y_t$ are stationary and ergodic. Moreover,

$$\frac{1}{n}\sum_{t=T_0+1}^{T}e_t^2\big(X_t,H'^{(p)}_t,H'^{(q)}_t\big)^2 = \frac{1}{n}\sum_{t=T_0+1}^{T}(e_t^2-\sigma^2)\big(X'_t,H^{(p)}_t,H^{(q)}_t\big)^2 + \sigma^2\,\frac{1}{n}\sum_{t=T_0+1}^{T}\big(X'_t,H^{(p)}_t,H^{(q)}_t\big)^2. \qquad (12)$$

The first term in (12) is an mds with mean zero, so, using the weak law of large numbers, it converges to zero in probability. The second term on the right-hand side of (12) also converges to its expectation. As a result,

 1nT