Penalised inference for autoregressive moving average models with timedependent predictors
[3mm] Hamed Haselimashhadi and Veronica Vinciotti
Department of Mathematics, Brunel University London, UK
Abstract: Linear models that contain a timedependent response and explanatory variables have attracted much interest in recent years. The most general form of the existing approaches is of a linear regression model with autoregressive moving average residuals. The addition of the moving average component results in a complex model with a very challenging implementation. In this paper, we propose to account for the time dependency in the data by explicitly adding autoregressive terms of the response variable in the linear model. In addition, we consider an autoregressive process for the errors in order to capture complex dynamic relationships parsimoniously. To broaden the application of the model, we present an penalized likelihood approach for the estimation of the parameters and show how the adaptive lasso penalties lead to an estimator which enjoys the oracle property. Furthermore, we prove the consistency of the estimators with respect to the mean squared prediction error in highdimensional settings, an aspect that has not been considered by the existing timedependent regression models. A simulation study and real data analysis show the successful applications of the model on financial data on stock indexes.
Keywords: time series, high dimensional models, lasso
1 Introduction
This paper deals with fitting a general time seriesregression model using regularized inference. In the context of linear models, penalized approaches have received great interest in recent years as they allow to perform variable selection and parameter estimation simultaneously for any data, including highdimensional datasets, where classical approaches for parameter estimation break down, e.g. [14, 7, 10, 17, 12]. [17] have shown that a model where penalties are adapted to each individual regressor enjoys oracle properties. Most of the advances in regularized regression models have been for the case of independent and identically distributed data. A recent line of research has concentrated on regularized models in time dependent frameworks. Amongst these, [15] showed the successful application of penalised inference in the context of autocorrelated residuals for a fixed order, by proposing the model
and studied the properties of this model in lowdimensional settings. [11] studied the theoretical properties of a regularized autoregressive process on for both low and dimensional cases, whereas [13] studied the estimation of vector autoregressive models. In both cases, no exogenous variables are included in the model. [9] studied the asymptotic properties of adaptive lasso in high dimensional time series models when the number of variables increases as a function of the number of observations. Their model covers a lagged regression in the presence of exogenous variables, but does not consider autocorrelated residuals. Recently, [16] proposed an extension of the model of [15] by adding a moving average term, that is they propose a model of the form
Similarly to [15], they proved the consistency of the model in lowdimensional cases. Despite the generality of this model, considering an ARMA process for the errors results in a complex model with a challenging implementation.
In this paper, we propose to account for the time dependency in the data by explicitly adding autoregressive terms of the response variable in the linear model, as in [11], as well as an autocorrelated process for residuals, as in [15], in order to capture complex dynamics parsimoniously. In particular, given fixed orders and , we propose the model
(1) 
We name the terms in the right hand side of (1) as REGression term, AutoRegressive term and Moving Average term respectively and call the resulting model REGARMA. We assume that all time dependent components in REGARMA are stationary and ergodic. Figure (1) illustrates a schematic view of the model.
In Section 2, we formulate the model and present an penalized likelihood approach for the estimation of the parameters. In Section 3, we prove the asymptotic properties of the model and show how the adaptive lasso penalty leads to estimators which enjoy the oracle property. Furthermore, we prove the consistency of the model with respect to the mean squared prediction error in highdimensional settings, an aspect that has not been considered by the existing timedependent regression models. In section 4, we discuss the implementation of REGARMA. A simulation study, given in section 5, will accompany the theoretical results. In section 6 we apply the model to two real datasets in finance and macroeconomic, respectively. Finally, we draw some conclusions in section 7.
2 penalised parameter estimation of REGARMA
The general form of REGARMA consists of a lagged response variable, covariates and autocorrelated residuals. Consider the following Gaussian REGARMA model of order and ,
where is the row of the matrix of predictors , and follow stationary time series processes, that is all roots of the polynomials and are unequal and outside the unit circle, s are independent and identical Gaussian noises with mean of zero and finite fourth moments, and and are both less than the number of observations . Moreover, we assume that the errors and explanatory variables in are independent of each other. To remove the constants from the model we follow the literature on regularized models, e.g. [14, 5], and standardize the covariates and response to zero means and unit variance.
Given the first observations, maximizing the penalized conditional likelihood of the model is equivalent to minimizing
(2) 
where are tuning parameters and is the vector of regression, autoregressive and moving average parameters. Following the literature, and given the superior properties of adaptive lasso models [17], we also propose an adaptive version of REGARMA penalised estimation as follows
where are tuning parameters.
2.1 Matrix representation of the model
For convenience, we write the model in matrix representation. Let be a matrix including lags of autoregressive (), moving average (), and explanatory variables (). Let denote the vector of corresponding parameters, be the vector of errors, and , as previously defined. Then, in matrix form, the model can be written as
and the penalized conditional likelihood given the first observation is equivalent to
where . Similarly, the adaptive form of the model is given by
(3) 
where the parameters are given by
3 Theoretical properties of REGARMA and adaptiveREGARMA
In order to study the theoretical properties of REGARMA and adaptiveREGARMA, we define the true coefficients by and assume that some of these coefficients are zero. The indexes of nonzero coefficients in each group of coefficients, and , are denoted by and respectively, whereas are the complementary sets and contain the indexes of zero coefficients. We also define and their corresponding (REGARMA) estimations by . Similarly, adaptiveREGARMA estimations are denoted by . Finally, different combinations of model parameters are going to be used, with obvious meaning, in particular , , , , , .
3.1 Assumptions
To prove the theoretical properties of the estimators, in line with the literature, we make use of the following assumptions:

s are i.i.d Gaussian random variables with finite fourth moments

The covariates, , and response variable, , are stationary and ergodic with finite second order moments. Also, we assume that none of the roots of and/or are equal and outside of the unit circle

are independent of the errors

and .
Assumptions are standard assumptions for dealing with stationary time series. Assumption are used to guarantee that explanatory variables have finite expectations.
3.2 Theoretical properties of REGARMA when
In the following theorems, we extend the theorems of [15] to cover a model with a lagged response.
Theorem 1.
Assume , , and . Then under assumptions , it follows that where
with is a vector of parameters in , and .
The proof is given in the Appendix. Theorem (1) shows that the REGARMA estimator has a KnightFu type asymptotic property [7] and it implies that the tuning parameters in cannot shrink to zero at a speed faster than . Otherwise, are zero and becomes a standard quadratic function,
which does not produce a sparse solution. In addition, the proof of theorem (1) requires the errors to be independent and identically distributed but we do not make a strong assumption on the type of distribution for the errors, due to the use of the martingale central limit theorem for large .
[7] proves that a lasso optimization returns estimates of nonzero parameters that suffer an asymptotic bias. This applies also to the REGARMA model, as we show with the following remark.
Remark 1.
Consider a special case of REGARMA when but and for . If minimizing can correctly identify , it means that and . That is, must satisfy
Then using Theorem 1, where is the matrix with the first rows of corresponding to the covariates.
If and are positive, remark (1) shows that suffers an asymptotic bias and is different from the oracle estimator, . In other words, REGARMA is not asymptotically consistent unless . The following remark can be extended to other groups of coefficients.
3.3 Theoretical properties of adaptiveREGARMA when
Following the notation of section 2, we consider the adaptive version of the penalised likelihood and estimate the model parameters by minimizing
where
Following [15] and [3], we define the maximum and minimum penalties for significant and insignificant coefficients by
and prove a number of results on the theoretical properties of adaptive REGARMA.
Theorem 2.
Assume as . Then under assumptions , there is a local minimiser of such that
The proof of the theorem is in the Appendix. Let , then, theorem (2) proves that there exists a local minimiser , when the tuning parameters (for significant variables) of REGARMA converge to zero at the speed faster than (since ).
As the next step, we prove that if the tuning parameter associated with insignificant variables in REGARMA shrink to zero at a speed slower than , then their associated REGARMA coefficients will be estimated exactly equal to zero with probability tending to 1.
Theorem 3.
Assume and then
The proof of the theorem is in the Appendix. Theorem (2) and (3) indicate that estimator satisfies under certain conditions on the tuning parameters, leading to the following result:
Theorem 4.
Assume and . Then, under assumptions , the component of the local minimiser of in Theorem 3 satisfies
where is the submatrix corresponding to .
The proof of the theorem is in the Appendix. Theorem (4) implies that if tends to zero at the speed faster than and simultaneously increases at the speed slower than , then adaptive REGARMA is asymptotically an oracle estimator. In the next subsection, we consider the theoretical properties of adaptive REGARMA for highdimensional problems.
3.4 Theoretical properties of adaptive REGARMA when
In the proofs of the lowdimensional results (refer to proof of theorem 1), we rely on a unique path of reaching the maximum of the log likelihood. This is not true in highdimensional cases, so different results are needed in this case. In this section we follow a similar strategy to [2] to prove theorems in the high dimensional case, an aspect which has not been considered by existing timedependent regression models, such as those of [15] and [16].
In order to study the consistency of REGARMA in highdimensional situations, we show that under assumptions , REGARMA is consistent with respect to the mean squared prediction error.
Without loss of generality, we define the REGARMA model as a constrained optimization [14]. Thus, we have
(4)  
where there is a onetoone correspondence between and in REGARMA, and and in (4). Define the Mean Squared Prediction Error, , and its estimated value, , by
where and are the REGARMA predictions of based on the true parameters and REGARMA estimates from (4), respectively. Then the following theorem holds.
Theorem 5.
Under assumptions and , and , let and be the REGARMA estimates, and such that
Then
(5) 
The proof of the theorem is in the Appendix. Note that in the situation where and , equation (5) results in the standard lasso consistency formula in [2]. When is correctly chosen, equation (5) also shows that REGARMA is prediction consistent when .
It is also possible to extend this result to .
Remark 2.
Under the same conditions as Theorem (5),
where and are defined as before, is defined in the Appendix and and .
Remark (2) shows that if then REGARMA is consistent. Given that relatively small orders and are sufficient for most time series analyses, the consistency of the estimator is mainly dominated by the highdimensional regression part. If then the above equation approximately reduces to a form similar to the standard lasso results in [2],
4 Algorithm
Since , that is REGARMA is a subset of adaptiveREGARMA, and given the improved properties of adaptive REGARMA, we mainly focus on adaptive REGARMA in this section. Our formulation of the model lends itself naturally to its implementation, in contrast to the more complex implementation of the model of [16].
As the model contains regressions, moving averages and autoregressive coefficients, we use the twostep optimization procedure
Steps 1 and 2 provide a solution to REGARMA using the adaptiveLasso algorithm of [17].
In terms of the selection of the penalties , and , these can be chosen using Kfold crossvalidation or using an information criterion such as BIC or AIC, CP similarly to [15], [16] and [4]. The weights in adaptiveREGARMA are defined by using the (nonadaptive) REGARMA estimates. Some notes are needed about the selection of the orders and in the REGARMA model. We propose two general approaches to choose the optimal orders for the model: (a) setting an upper bound and and choosing the model that minimizes BIC or AIC inside these bounds (b) setting an upper bound and and letting the model choose the best orders by keeping or eliminating the time series coefficients under sparsity constraints. These two approaches are very similar but there is a slight difference between them: in the second approach, the fitting is based on time points, whereas in the first approach, the number of time points depends on the orders p and q. Then a rule of thumb is to use the first approach when the number of observations is low and choose the second approach when there are enough observations.
We are in the process of implementing the methods into an R package. This is particularly needed in this area as, to the best of our knowledge, there is no implementation available. The current version of the package is available at http://people.brunel.ac.uk/~mastvvv/Software/.
5 Simulation study
We design a simulation study to compare the REGARMA model with existing methods. In the simulation, we:

Set the proportion of zero coefficients to 90%, 50% or 10%.

Assign unequal random numbers in to each nonzero coefficient.

Generate the design matrix, , using stationary Gaussian processes, with and .

Generate where ,

Set unequal AR and MA parameters, under the constraint that the roots of stationary polynomials site outside the unit circle and simulate data from a REGARMA model with and .

Repeat each combination of models 10 times.
We compare the adaptive REGARMA model with adaptive lasso, as it is the closest model in the literature for which an implementation is available. Similar results were found in the comparison of the nonadaptive versions (results not reported). BIC was used to choose the optimal penalties, whereas the autoregressive and moving average orders were fixed as the true ones.
Figure (2) to (5) compare adaptive REGARMA and adaptive Lasso with respect to mean squared prediction error, BIC and mean squared error of for , and . The figures show overall how REGARMA dominates lasso both for low and highdimensional problems. Figure (2) shows that as the number of data points increases, the relative outperformance of REGARMA versus lasso with respect to MSPE increases. Figures (3) shows how REGARMA achieves lower BIC values than lasso, particularly when but also for some highdimensional cases. Figure (4) compares REGARMA and lasso with respect to the mean squared error of , averaged over the different regression coefficients. This plot shows the advantage of using REGARMA on timedependent data in comparison with lasso. Finally, Figure (5) shows an outperformance of REGARMA over lasso, regardless of the level of noise .
6 Real data analysis
For the first application, we consider REGARMA in a lowdimensional problem. In particular, we consider financial data on daily returns of the Istanbul Stock Exchange(ISE) with seven other international indices, SP, DAX, FTSE, NIKKEI, BOVESPA, MSCE EU, MSCI EM, for a period of two years from 2009 to 2011.
The data are publicly available at http://archive.ics.uci.edu/ml and are considered also by [1]. The goal of the analysis is to detect the most effective indices in relation to the ISE index.
We set a maximum order of 4 for both and and use BIC to select the optimal penalty parameters (i.e. method b on page 4). Table (1) shows a comparison of REGARMA with adaptive lasso. For REGARMA, we consider also the submodel with only autoregressive terms, REGAR, and the one with only moving average terms, REGMA (which is essentially the model of [15]), as well as the full REGARMA model. All four models choose BOVESPA, EU and EM as the most effective indices for the Istanbul exchange market. These are within the 6 variables selected by [1]. From Table (1) and the residual analysis in Figure (6), we can conclude that the REGARMA family shows a better performance with respect to Mean Squared Error (MSE), Mean Absolute Error (MAE) and BIC compared to adaptive lasso.
MAX ARMA orders: (4,4)  ADAPTIVELASSO  REGAR(2)  REGMA(1)  REGARMA(2,1) 

MEAN SQUARED ERROR  0.4194  0.4191  0.4192  0.4079 
MEAN ABSOLUTE ERROR  0.4932  0.4927  0.4927  0.4907 
BIC  552.44  549.12  549.12  541.41 
For a hight dimensional example, we consider S&P500 indices. S&P500 is one of the leading stock market index for US equity: it is based on 500 leading companies and captures approximately 80% coverage of the available market capitalization. The goal of the analysis is to find the S&P500 indices most related to the AT&T Inc index based on monthly data in a period of fourteen years from 2000 to 2014. These data are publicly available at http://thomsonreuters.com/.
After removing variables with the majority of missing values, the dataset contains 416 variables and 170 datapoints. As before, we apply adaptive lasso and adaptive REGARMA to these data and choose the optimal penalties by BIC. Moreover, we set a maximum order of 4 for both and and let the model choose the optimal orders (using method b on page 4).
Table (2) summarises the results in terms of MSE, MAE, BIC and the number of nonzero coefficients. Moreover, Figure (7) illustrates the residual analysis of these four models. Both Table (2) and Figure (7) show an improved performance of REGARMA compared to the other methods.
MAX ARMA orders: (4, 4)  ADAPTIVELASSO  REGAR(2)  REGMA(1)  REGARMA(2,3) 

MEAN SQUARED ERROR  1.34  1.84  4.18  .083 
MEAN ABSOLUTE ERROR  87.01  112.30  182.07  75.34 
BIC  28.83  27.81  31.2  22.63 
NONZERO COEFFICIENTS  272  266  282  263 
7 Conclusion
In this paper we extend the idea of regressiontime series models proposed in [15] to a more general class of models, thus covering a wide spectrum of applications involving multivariate timedependent data. In particular, we study an autoregressive moving average model with timedependent explanatory variables and present penalised inference for the estimation of its parameters. Our model lends itself naturally to parameter estimation and implementation, contrary to the linear regression with ARMA errors of [16]. We prove asymptotic properties of the proposed model in low and high dimensional situations, with the latter not considered by the existing literature on timeseries regression models. We test the performance of the model on a simulation study and show a successful application on financial data.
8 Appendix (Proof of theorems)
Proof 1 (Theorem 1).
Assume , and . Define
(6) 
Note that achieves a minimum at . Using (2), it implies that
(7a)  
(7b)  
(7c)  
(7d) 
The last three terms can be simplified as
Similarly, for the other two terms, we have
For (7a), we have
Set and recall that . Then, we have
The righthand side of the last equation is equivalent to
(8) 
From left to right, we now prove that the first term in (8) is bounded and the two other terms follow (asymptotically) normal distributions, i.e.
(9)  
(10)  
(11) 
Let . Then,
is a martingale difference sequence(in short mds) because
In order to establish that the conditions of the mds central limit theorem are satisfied (refer to martin2012econometric [8, p 51] for the mds central limit theorem and conditions), define
To show the boundedness condition in the martingales central limit theorem, choose , so that
Under assumption , and it can be shown that , provided and are stationary and ergodic. Moreover
(12) 
The first term in (1) is a mds, which has mean zero. So using the weak law of large numbers, we have that . The second term in the right hand side of (1) also tends to . As a result