
# Sparse Identification and Estimation of High-Dimensional Vector AutoRegressive Moving Averages

Ines Wilms¹, Sumanta Basu, Jacob Bien, and David S. Matteson

¹Corresponding author. E-mails and URLs: ines.wilms@kuleuven.be, http://feb.kuleuven.be/ines.wilms (I. Wilms); sumbose@cornell.edu, http://faculty.bscb.cornell.edu/~basu/ (S. Basu); jbien@usc.edu, http://faculty.bscb.cornell.edu/~bien/ (J. Bien); matteson@cornell.edu, http://www.stat.cornell.edu/~matteson/ (D.S. Matteson).
Department of Statistical Science, Cornell University, Ithaca, NY, USA
Faculty of Economics and Business, KU Leuven, Belgium
Data Sciences and Operations, University of Southern California, Los Angeles, CA, USA

Abstract. The Vector AutoRegressive Moving Average (VARMA) model is fundamental to the study of multivariate time series. However, estimation becomes challenging in even relatively low-dimensional VARMA models. With growing interest in the simultaneous modeling of large numbers of marginal time series, many authors have abandoned the VARMA model in favor of the Vector AutoRegressive (VAR) model, which is seen as a simpler alternative, both in theory and practice, in this high-dimensional context. However, even very simple VARMA models can be very complicated to represent using only VAR modeling. In this paper, we develop a new approach to VARMA identification and propose a two-phase method for estimation. Our identification and estimation strategies are linked in their use of sparsity-inducing convex regularizers, which favor VARMA models that have only a small number of nonzero parameters. We establish sufficient conditions for consistency of sparse infinite-order VAR estimates in high dimensions, a key ingredient for our two-phase sparse VARMA estimation strategy. The proposed framework has good estimation and forecast accuracy under numerous simulation settings. We illustrate the forecast performance of the sparse VARMA models for several application domains, including macro-economic forecasting, demand forecasting, and volatility forecasting. The proposed sparse VARMA estimator gives parsimonious forecast models that lead to important gains in relative forecast accuracy.

Keywords. Forecasting, High-dimensional Time Series, Identifiability, Multivariate Time Series, Sparse Estimation, VARMA

## 1 Introduction

Vector AutoRegressive Moving Average (VARMA) models enjoy many theoretical properties which make them fundamental tools for modeling multivariate time series. In a VARMA model, a stationary $d$-dimensional vector time series $y_t$ is modeled as a function of its own past values and lagged error terms. More precisely,

$$y_t = \sum_{\ell=1}^{p} \Phi_\ell\, y_{t-\ell} + \sum_{m=1}^{q} \Theta_m\, a_{t-m} + a_t, \qquad (1)$$

where $\Phi_1, \dots, \Phi_p$ are $(d \times d)$ autoregressive coefficient matrices, $\Theta_1, \dots, \Theta_q$ are $(d \times d)$ moving average coefficient matrices, and $a_t$ denotes a $d$-dimensional mean-zero white noise vector time series with nonsingular contemporaneous covariance matrix $\Sigma_a$. We assume, without loss of generality, that all time series are mean-centered such that no intercept is included. VARMA models are, however, rarely used in practice due to identifiability concerns and computational challenges. As such, multivariate time series are primarily modeled using Vector AutoRegressive (VAR) models rather than VARMA models. A VAR model is a special case of the VARMA in which the time series are modeled only as a function of their own past values and no moving average coefficients are included. Hence, VAR models, in contrast to VARMA models, do not account for serial correlation in the errors. Applications of Vector AutoRegressions are found in diverse fields such as biostatistics (e.g., Kirch et al., 2015), finance (e.g., Tao et al., 2011), economics (e.g., Matteson and Tsay, 2011), and marketing (e.g., Gelper et al., 2016).

The dominant focus on VAR models has led to a well-developed and active research area on VAR estimation. Recently, a growing interest has arisen in VAR models that are high-dimensional, i.e., having a large number of time series $d$ relative to the time series length $T$. As a result, many advances have been made in developing regularized estimators for these models (e.g., Hsu et al., 2008, Davis et al., 2016, Gelper et al., 2016 or Nicholson et al., 2017), and in discussing theoretical properties of such regularized estimators (e.g., Basu and Michailidis, 2015). Many of these papers focus on adding an $\ell_1$-penalty (Tibshirani, 1996) to the objective function. As such, sparse parameter estimates are obtained, meaning that many autoregressive parameters are estimated as exactly zero.

Although VAR models are more intensively investigated and used by practitioners, several reasons exist for preferring the more general class of VARMA models over VARs. First and foremost, VARMA models allow for more parsimonious representations of the data generating process. This parsimonious finite VARMA parametrization of potentially infinite-order VARs gives important advantages in terms of improved estimation and prediction accuracy while also being more amenable to interpretation. Kascha (2012) discusses the improved estimation performance of VARMA over VAR models; Athanasopoulos and Vahid (2008) provide empirical evidence of the improved prediction performance of VARMA over VAR. Further theoretical arguments for VARMA models over VAR models are discussed by Lütkepohl and Poskitt (1996).

Despite these justifications for using VARMA models over VAR models, only a few estimation procedures (e.g., Chan et al., 2016) for high-dimensional VARMA models have been introduced. This paper aims to (i) introduce a new identification strategy for VARMA models, together with (ii) a computationally efficient estimation procedure for high-dimensional VARMA, and (iii) discuss theoretical properties of the proposed sparse VARMA estimator.

In view of achieving identifiability, it is important to formulate a reasonably parsimonious model (Dufour and Jouini, 2014). We provide a novel and parsimonious identification strategy. Unlike pre-existing identification strategies, such as the echelon form or the final equations form (Lütkepohl, 2005; Chapter 12), our identification strategy is purposely aligned with our estimation strategy. Our estimator involves two phases and is simple to compute in both low-dimensional and high-dimensional settings. In Phase I, we approximate the true lagged errors of the VARMA model (1) by fitting a high-order VAR. To ensure estimation is feasible in high dimensions, we use a sparse estimator. Unlike most papers on sparse VAR estimation, we build on recent advances to incorporate lag selection into the estimation procedure by using an “HLag” penalty (Nicholson et al., 2016) instead of the standard $\ell_1$-penalty. In Phase II, we approximate the VARMA model with a lagged regression model by replacing the true lagged errors in equation (1) with the approximated lagged errors from Phase I, and sparsely estimate the model parameters. We show in several simulation settings and forecast applications that this parsimonious VARMA model leads to important gains in forecast accuracy compared to a sparsely estimated VAR.

Furthermore, we discuss some theoretical properties of the proposed sparse VARMA estimator. To our knowledge, theoretical properties of VARMA estimates have been studied only in low-dimensional settings (Kascha, 2012; Dufour and Jouini, 2005). Some recent works have focused on asymptotic analysis of $\ell_1$-penalized VAR($p$) estimates in high dimensions (Basu and Michailidis, 2015; Han et al., 2015) for finite $p$. In comparison to VAR, asymptotic analysis of VARMA poses additional challenges due to its two-phase estimation strategy, where Phase I alone requires estimating an approximating VAR whose order grows with the sample size. Dufour and Jouini (2005) rely on the analysis of Lewis and Reinsel (1985) for Phase I estimation. In this work, we extend some of these results to the high-dimensional setting and establish sufficient conditions for consistency of Phase I AR estimates and residuals. As a first step, we only consider Gaussian VARMA models and $\ell_1$-penalization in this work and leave generalization to non-Gaussian models and the HLag penalty for future work.

The remainder of this paper is organized as follows. In Section 2, we introduce a sparse identification procedure for the VARMA model. The estimation methodology is discussed in Section 3, including details on the computational algorithm. Theoretical results are investigated in Section 4. Simulations are presented in Section 5. Forecast applications are given in Section 6. Section 7 concludes. The appendix includes proofs, an overview of the computational algorithm, and figures of the data used in the forecast applications.

Notation. We use $\|\cdot\|$ to denote the Euclidean norm of a vector and the operator norm of a matrix. We reserve $\|\cdot\|_0$, $\|\cdot\|_1$, and $\|\cdot\|_\infty$ to denote the number of nonzero elements and the $\ell_1$ and $\ell_\infty$ norms of a vector or the vectorized version of a matrix, and $\|\cdot\|_F$ to denote the Frobenius norm of a matrix. We use $\Lambda_{\max}(\cdot)$ and $\Lambda_{\min}(\cdot)$ to denote the maximum and minimum eigenvalues of a (symmetric or Hermitian) matrix. We use $|\cdot|$ to denote the absolute value of a real or complex number. We denote the sets of integers, real, and complex numbers by $\mathbb{Z}$, $\mathbb{R}$, and $\mathbb{C}$, respectively.

## 2 Sparse identification of the VARMA

Establishing identification of VARMA models is a notoriously difficult problem, largely explaining why no previous attempts have been made to introduce identification strategies for high-dimensional VARMA models. In this section, we introduce a parsimonious identification strategy for both low- and high-dimensional VARMA models. The identification strategy detailed below favors parsimonious autoregressive and moving average matrices that contain only a few non-zero elements.

Consider the Vector AutoRegressive Moving Average model of equation (1) with fixed autoregressive order $p$ and moving average order $q$. We assume the model to be invertible and stable (see e.g., Brockwell and Davis 1991, Chapter 11). The model can be written in compact lag operator notation as

$$\Phi(L)\, y_t = \Theta(L)\, a_t,$$

where the AR and MA operators are respectively given by

$$\Phi(L) = I_d - \Phi_1 L - \Phi_2 L^2 - \dots - \Phi_p L^p \quad \text{and} \quad \Theta(L) = I_d + \Theta_1 L + \Theta_2 L^2 + \dots + \Theta_q L^q,$$

with the lag operator $L$ defined as $L y_t = y_{t-1}$. The process has an infinite-order VAR representation

$$\Pi(L)\, y_t = a_t,$$

where

$$\Pi(L) = \Theta^{-1}(L)\,\Phi(L) = I_d - \Pi_1 L - \Pi_2 L^2 - \cdots$$

The $\Pi_\tau$ matrices can be computed recursively from the autoregressive matrices $\Phi_\ell$ and moving average matrices $\Theta_m$ (see, e.g., Brockwell and Davis 1991, Chapter 11). The VARMA is uniquely defined in terms of the operator $\Pi(L)$, but not in terms of the AR and MA operators $\Phi(L)$ and $\Theta(L)$. That is, for a given $\Pi(L)$, $p$, and $q$, one can define an equivalence set of AR and MA matrix pairs,

$$E_{p,q}(\Pi(L)) = \left\{ (\Phi, \Theta) : \Phi(L) = \Theta(L)\,\Pi(L) \right\},$$

where $\Phi = [\Phi_1, \dots, \Phi_p]$ and $\Theta = [\Theta_1, \dots, \Theta_q]$. This set can in general consist of more than one such pair, meaning that identification restrictions on the AR and MA matrices are needed. Typically, the echelon form or the final equations form are used to specify a set of conditions that ensure uniqueness of the VARMA representation in terms of the AR and MA matrices. For more details, we refer the reader to Tsay (2014), Chapter 4 or Lütkepohl (2005), Chapter 12.
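For concreteness, matching coefficients of $L^j$ in $\Phi(L) = \Theta(L)\,\Pi(L)$ gives the recursion $\Pi_j = \Phi_j + \Theta_j - \sum_{m=1}^{j-1} \Theta_m \Pi_{j-m}$, with $\Phi_j = 0$ for $j > p$ and $\Theta_j = 0$ for $j > q$. This recursion can be sketched in a few lines of numpy; the function name and the example coefficients below are ours, purely for illustration:

```python
import numpy as np

def varma_to_var(Phi, Theta, n_terms):
    """Recursively compute the VAR(infinity) matrices Pi_1, ..., Pi_{n_terms}
    from VARMA AR matrices Phi (list of d x d arrays) and MA matrices Theta.

    Matching coefficients of L^j in Phi(L) = Theta(L) Pi(L) gives
    Pi_j = Phi_j + Theta_j - sum_{m=1}^{j-1} Theta_m Pi_{j-m},
    with Phi_j = 0 for j > p and Theta_j = 0 for j > q.
    """
    d = Phi[0].shape[0]
    zero = np.zeros((d, d))
    Pi = []
    for j in range(1, n_terms + 1):
        term = (Phi[j - 1] if j <= len(Phi) else zero).copy()
        term += Theta[j - 1] if j <= len(Theta) else zero
        for m in range(1, j):
            if m <= len(Theta):
                term -= Theta[m - 1] @ Pi[j - m - 1]
        Pi.append(term)
    return Pi

# Scalar ARMA(1,1) sanity check: Pi_j = (phi + theta) * (-theta)**(j-1)
Pi = varma_to_var([np.array([[0.5]])], [np.array([[0.3]])], 4)
```

For a scalar ARMA(1,1) with AR coefficient 0.5 and MA coefficient 0.3, the recursion reproduces the known weights $(0.5 + 0.3)(-0.3)^{j-1}$.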

In this paper, we rely on the concept of sparsity, or parsimony, to establish identification for VARMA models. Among all feasible AR and MA matrix pairs, we look for the one that gives the most parsimonious representation of the VARMA. Specifically, we measure parsimony through a pair of sparsity-inducing convex regularizers, $P_{\mathrm{AR}}(\cdot)$ and $P_{\mathrm{MA}}(\cdot)$. A standard choice for these regularizers would be the $\ell_1$-norm; however, our identification results apply equally well to any other convex function. Our focus will be on the HLag penalty, to be defined in Section 3, since it provides a structured form of sparsity that is particularly simple to interpret. Define the sparse equivalence set of VARMA representations by

$$SE_{p,q}(\Pi(L)) = \operatorname*{argmin}_{\Phi,\Theta} \left\{ P_{\mathrm{AR}}(\Phi) + P_{\mathrm{MA}}(\Theta) \;\; \text{s.t.} \;\; \Phi(L) = \Theta(L)\,\Pi(L) \right\}. \qquad (2)$$

This sparse equivalence set is a subset of the equivalence set $E_{p,q}(\Pi(L))$, containing the sparse VARMA representations, meaning that many of the elements of the autoregressive matrices and/or moving average matrices are exactly equal to zero. If the objective function in (2) is strictly convex, then the sparse equivalence set consists of one unique AR and MA matrix pair, in which case identification is established. However, for the HLag penalty and for the $\ell_1$-norm, the objective function is convex but not strictly convex. Hence, to ensure identification, we add two extra terms to the objective function and consider

$$(\Phi^{(\alpha)}, \Theta^{(\alpha)}) = \operatorname*{argmin}_{\Phi,\Theta} \left\{ P_{\mathrm{AR}}(\Phi) + P_{\mathrm{MA}}(\Theta) + \frac{\alpha}{2}\|\Phi\|_F^2 + \frac{\alpha}{2}\|\Theta\|_F^2 \;\; \text{s.t.} \;\; \Phi(L) = \Theta(L)\,\Pi(L) \right\}, \qquad (3)$$

where $\alpha > 0$, with $\Phi^{(\alpha)}$ denoting the resulting AR solution and analogously $\Theta^{(\alpha)}$ the MA solution. The optimization problem in (3) is strongly convex and thus has a unique solution pair $(\Phi^{(\alpha)}, \Theta^{(\alpha)})$ for each value of $\alpha$.

We are now ready to define our sparsely identified VARMA representation. For any stable, invertible VARMA process, its unique sparse representation in terms of the autoregressive and moving average matrices is defined as

$$(\Phi^{(0)}, \Theta^{(0)}) = \lim_{\alpha \to 0^+} (\Phi^{(\alpha)}, \Theta^{(\alpha)}). \qquad (4)$$

The following theorem, proved in Appendix A.1, establishes that $(\Phi^{(0)}, \Theta^{(0)})$ is in the sparse equivalence set and, furthermore, is the unique pair of autoregressive and moving average matrices in this set having smallest Frobenius norm.

###### Theorem 1.

The limit in (4) exists and is the unique pair in $SE_{p,q}(\Pi(L))$ whose Frobenius norm is smallest:

$$(\Phi^{(0)}, \Theta^{(0)}) = \operatorname*{argmin}_{\Phi,\Theta} \left\{ \|\Phi\|_F^2 + \|\Theta\|_F^2 \;\; \text{s.t.} \;\; (\Phi, \Theta) \in SE_{p,q}(\Pi(L)) \right\}.$$

## 3 Sparse estimation of the VARMA

The Vector AutoRegressive Moving Average model of equation (1) can be seen as a multivariate regression of the responses $y_t$ on the predictors $y_{t-1}, \dots, y_{t-p}, a_{t-1}, \dots, a_{t-q}$. However, this multivariate regression cannot be directly estimated since it contains the unobservable (latent) lagged errors $a_{t-1}, \dots, a_{t-q}$. Therefore, we proceed in two phases, in the spirit of Spliid (1983), Dufour and Jouini (2014), and references therein. In Phase I, we approximate these unobservable errors. In Phase II, we estimate the VARMA of equation (1) with the approximated lagged errors instead of the unobservable lagged errors.

#### Phase I: Approximating the unobservable errors.

The VARMA of equation (1) has a pure Vector AutoRegressive VAR($\infty$) representation if the VARMA process is invertible (see e.g., Brockwell and Davis 1991, Chapter 11). Therefore, we propose to approximate the error terms $a_t$ by the residuals of a VAR($\tilde p$) given by

$$y_t = \sum_{\tau=1}^{\tilde p} \Pi_\tau\, y_{t-\tau} + \varepsilon_t, \qquad (5)$$

for $t = \tilde p + 1, \dots, T$, with $\tilde p$ a finite number, $\Pi_1, \dots, \Pi_{\tilde p}$ the autoregressive coefficient matrices, and $\varepsilon_t$ a mean-zero white noise vector time series with nonsingular contemporaneous covariance matrix $\Sigma_\varepsilon$.

Estimating the VAR($\tilde p$) of equation (5) is challenging since $\tilde p$ needs to be sufficiently large for the residuals $\hat\varepsilon_t$ to accurately approximate the errors $a_t$. As a consequence, a large number of parameters, relative to the time series length $T$, needs to be estimated. Therefore, we propose to use penalized estimation. We discuss the corresponding Phase I estimator in Section 3.1.

#### Phase II: Estimating the VARMA.

In Phase II, we use the approximated lagged errors $\hat\varepsilon_{t-1}, \dots, \hat\varepsilon_{t-q}$ instead of the true errors in equation (1). The resulting model

$$y_t = \sum_{\ell=1}^{p} \Phi_\ell\, y_{t-\ell} + \sum_{m=1}^{q} \Theta_m\, \hat\varepsilon_{t-m} + u_t, \qquad (6)$$

is a lagged regression model of the responses $y_t$ on the predictors $y_{t-1}, \dots, y_{t-p}, \hat\varepsilon_{t-1}, \dots, \hat\varepsilon_{t-q}$, with $u_t$ a mean-zero white noise vector time series with nonsingular contemporaneous covariance matrix $\Sigma_u$.

Model (6) is, however, heavily overparametrized, i.e., a large number of “regression” parameters needs to be estimated. To deal with this overparametrization problem, we again use penalized estimation. Details on the Phase II estimator are given in Section 3.2.

### 3.1 Phase I estimator

For ease of notation, first rewrite model (5) in compact matrix notation

$$Y = \Pi Z + E,$$

where $Y$ collects the responses $y_t$, $\Pi = [\Pi_1, \dots, \Pi_{\tilde p}]$, $Z$ collects the corresponding stacked lagged values, and $E$ collects the errors $\varepsilon_t$.

To obtain $\hat\Pi$, we use a Hierarchical VAR estimator (Nicholson et al., 2016), which places a lag-based hierarchical group lasso penalty on the autoregressive parameters. The autoregressive estimates are obtained as

$$\hat\Pi = \operatorname*{argmin}_{\Pi} \left\{ \frac{1}{2}\|Y - \Pi Z\|_F^2 + \lambda_\Pi \sum_{i=1}^{d}\sum_{j=1}^{d}\sum_{\tau=1}^{\tilde p} \left\|\Pi_{(\tau:\tilde p), ij}\right\|_2 \right\}, \qquad (7)$$

where we use the squared Frobenius norm as a loss function and a hierarchical “HLag” penalty with $\Pi_{(\tau:\tilde p), ij} = (\Pi_{\tau,ij}, \dots, \Pi_{\tilde p, ij})^\top$. This estimator performs automatic lag selection by forcing lower-order lagged coefficients of a time series in one of the equations of the VAR to be selected before its higher-order lagged coefficients. At the same time, each time series has its own lag structure in each equation of the VAR. The penalty parameter $\lambda_\Pi$ regulates the degree of sparsity in $\hat\Pi$: the larger $\lambda_\Pi$, the sparser $\hat\Pi$, meaning that more of its elements will be set to zero. The optimization problem in equation (7) can be efficiently solved using Algorithm 1 in Nicholson et al. (2016).
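To give a flavor of the computations involved: the key building block is the proximal operator of the nested suffix-group penalty in (7) applied to a single lag-coefficient vector $(\Pi_{1,ij}, \dots, \Pi_{\tilde p,ij})$. For tree-nested groups this proximal operator can be computed by group soft-thresholding from the innermost group outward (a result of Jenatton et al., 2011). The sketch below is ours, not the authors' implementation; the function name is illustrative:

```python
import numpy as np

def prox_hlag(v, lam):
    """Proximal operator of lam * sum_{tau} ||v[tau:]||_2 for nested
    (suffix) groups: group soft-threshold each suffix, starting from
    the innermost group (last lag) and moving outward.  This is valid
    because the suffix groups form a nested (tree-structured) family.
    """
    v = np.asarray(v, dtype=float).copy()
    p = len(v)
    for tau in range(p - 1, -1, -1):  # innermost suffix first
        nrm = np.linalg.norm(v[tau:])
        if nrm <= lam:
            v[tau:] = 0.0            # the whole suffix is zeroed out
        else:
            v[tau:] *= 1.0 - lam / nrm
    return v
```

Applying this operator elementwise over all $(i, j)$ lag vectors after a gradient step on the squared Frobenius loss yields one proximal gradient iteration. By construction, once a lag is zeroed, all higher lags for that coefficient series are zero too, which is exactly the hierarchical lag-selection behavior described above.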

### 3.2 Phase II estimator

For ease of notation, rewrite the lagged regression model (6) in compact matrix notation

$$Y = \Phi Z + \Theta X + U,$$

where $Y$, $Z$, and $U$ are defined analogously to Phase I (with $\tilde p$ replaced by $p$), $\Phi = [\Phi_1, \dots, \Phi_p]$, $\Theta = [\Theta_1, \dots, \Theta_q]$, and $X$ collects the lagged approximated errors $\hat\varepsilon_{t-1}, \dots, \hat\varepsilon_{t-q}$. Our Phase II estimator is similar to the Phase I estimator from Section 3.1. We use a penalized estimator with a hierarchical penalty on both the autoregressive parameters $\Phi$ and the moving average parameters $\Theta$. The autoregressive and moving average estimates are obtained as

$$(\hat\Phi, \hat\Theta) = \operatorname*{argmin}_{\Phi,\Theta} \left\{ \frac{1}{2}\|Y - \Phi Z - \Theta X\|_F^2 + \lambda_\Phi \sum_{i=1}^{d}\sum_{j=1}^{d}\sum_{\ell=1}^{p} \left\|\Phi_{(\ell:p), ij}\right\|_2 + \lambda_\Theta \sum_{i=1}^{d}\sum_{j=1}^{d}\sum_{m=1}^{q} \left\|\Theta_{(m:q), ij}\right\|_2 \right\}, \qquad (8)$$

where $\lambda_\Phi, \lambda_\Theta \geq 0$ are two penalty parameters regulating the degree of sparsity in $\hat\Phi$ and $\hat\Theta$, respectively. Similar to the Phase I optimization problem in (7), the Phase II optimization problem in (8) can be solved in parallel via the Proximal Gradient Algorithm in Appendix B. A schematic overview of the entire Two-Phase VARMA estimation procedure is provided in Algorithm 1.

### 3.3 Practical implementation

In practice, one needs to specify the maximal order $\tilde p$ in Phase I, and the penalty parameter $\lambda_\Pi$ needs to be selected. Similarly, in Phase II, the maximal autoregressive order $p$ and moving average order $q$ need to be specified, and the penalty parameters $\lambda_\Phi$ and $\lambda_\Theta$ need to be selected.

#### Specifying the maximal lag orders.

Throughout the paper, we use a fixed choice of these maximal lag orders. We performed several numerical experiments to investigate the sensitivity of the outcome of the algorithm to these choices. The results are not sensitive to these choices, provided that the orders are chosen large enough. Overselecting is less severe than underselecting since the Hierarchical VAR estimator performs automatic lag selection. As such, it can reduce the effective maximal order of each time series in each equation of the VAR (in Phase I) or the VARMA (in Phase II).

#### Selecting the penalty parameter.

Following Friedman et al. (2010), we use in Phase I a grid of penalty parameters starting from $\lambda_{\max}$, the smallest value for which all parameter estimates are zero, and then decreasing in log-linear increments until a small fraction of $\lambda_{\max}$ is reached (we take 10 values along this grid).

We select the penalty parameter using the following time series cross-validation approach. For each time point $t = S, \dots, T - h$, we estimate the model and obtain parameter estimates. This results in ten different parameter estimates, one for each value of the penalty parameter in the grid. From these parameter estimates, we compute the $h$-step-ahead direct forecasts $\hat y^{(\lambda)}_{t+h}$ obtained with penalty parameter $\lambda$. We select the value of the penalty parameter that gives the most regularized model whose Mean Squared Forecast Error is within one standard error (see the “one-standard-error rule”, Hastie et al., 2009; Chapter 7) of the minimal Mean Squared Forecast Error,

$$\mathrm{MSFE}(\lambda)_h = \frac{1}{T - h - S + 1} \sum_{t=S}^{T-h} \frac{1}{d} \left\| y_{t+h} - \hat y^{(\lambda)}_{t+h} \right\|^2.$$

In the simulation study, we take $h = 1$. In the forecast applications, we also consider other forecast horizons. In Phase II, we proceed similarly but using a two-dimensional grid of penalty parameters $(\lambda_\Phi, \lambda_\Theta)$.
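The grid construction and the one-standard-error selection can be sketched as follows, assuming the cross-validated MSFE and its standard error have already been computed for each grid value; the function names, the grid endpoint fraction, and the example inputs are illustrative, not the paper's exact choices:

```python
import numpy as np

def lambda_grid(lam_max, n=10, ratio=1e-3):
    """Log-linear grid of n penalty values from lam_max down to
    ratio * lam_max (the endpoint fraction is an assumed choice)."""
    return np.exp(np.linspace(np.log(lam_max), np.log(ratio * lam_max), n))

def one_se_rule(lambdas, msfe, se):
    """Pick the largest lambda (most regularized model) whose MSFE is
    within one standard error of the minimal MSFE."""
    lambdas, msfe, se = map(np.asarray, (lambdas, msfe, se))
    best = np.argmin(msfe)
    threshold = msfe[best] + se[best]
    eligible = np.where(msfe <= threshold)[0]
    return lambdas[eligible[np.argmax(lambdas[eligible])]]
```

For a two-dimensional grid in Phase II, the same rule applies with the minimum and the eligibility check taken over all $(\lambda_\Phi, \lambda_\Theta)$ pairs.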

## 4 Theoretical properties

Theoretical analysis of our proposed VARMA estimator can be carried out in two parts: (a) analysis of the AR estimates $\hat\Pi_\tau$ and residuals $\hat\varepsilon_t$ in Phase I; (b) analysis of the AR and MA estimates $\hat\Phi$ and $\hat\Theta$ in Phase II. In this section, we focus on part (a), i.e., the analysis of Phase I estimates, in high dimensions. Note that Phase I is essentially a VAR($\tilde p$) estimation problem and does not rely on the VARMA identification strategy. In the low-dimensional VARMA analysis of Dufour and Jouini (2005), consistency of Phase I estimates was established using the results of Lewis and Reinsel (1985) on infinite-order VAR. Our analysis aims to provide consistency results similar to Lewis and Reinsel (1985) in a high-dimensional regime.

We start with a brief discussion of the spectral density based dependence measures (Basu and Michailidis, 2015) used in our analysis. Next, we provide deterministic upper bounds on the Phase I estimation error under a suitable restricted eigenvalue (RE) condition (Loh and Wainwright, 2012; Basu and Michailidis, 2015) on the autoregressive design matrix. These deterministic upper bounds depend on the fixed realization of the random variables $y_1, \dots, y_T$. We then present two results to show how these upper bounds change, and how often the RE assumption holds, when the random variables are generated from a sparse VARMA process.

For our theoretical analysis, we consider $\{y_t\}$, a $d$-dimensional centered, Gaussian stationary time series generated from a stable, invertible VARMA model

$$y_t = \sum_{\ell=1}^{p} \Phi_\ell\, y_{t-\ell} + \sum_{m=1}^{q} \Theta_m\, a_{t-m} + a_t, \qquad (9)$$

where $\Phi_\ell$ and $\Theta_m$ are $(d \times d)$ matrices, and $a_t$ is a centered Gaussian white noise process with contemporaneous variance-covariance matrix $\Sigma_a$. Recall from previous sections the compact VARMA representation $\Phi(L)\, y_t = \Theta(L)\, a_t$, where $\Phi(L) = I_d - \sum_{\ell=1}^{p} \Phi_\ell L^\ell$ and $\Theta(L) = I_d + \sum_{m=1}^{q} \Theta_m L^m$, and the infinite-order VAR representation $\Pi(L)\, y_t = a_t$, where $\Pi(L) = \Theta^{-1}(L)\,\Phi(L)$ is an infinite-order lag polynomial. In Phase I, we work with a truncated version of $\Pi(L)$, i.e., $\Pi^{[\tilde p]}(L) = I_d - \sum_{\tau=1}^{\tilde p} \Pi_\tau L^\tau$, for an integer $\tilde p \geq 1$.

### 4.1 Spectral density based measures of dependence

We adopt spectral density based measures of dependence for Gaussian time series, introduced in Basu and Michailidis (2015), to conduct our theoretical analysis. Consider a $d$-dimensional centered stationary Gaussian time series $\{x_t\}$ with autocovariance function $\Gamma_x(\ell) = \mathrm{Cov}(x_t, x_{t+\ell})$ for $\ell \in \mathbb{Z}$. We make the following assumption:

###### Assumption 1.

The spectral density function

$$f_x(\theta) := \frac{1}{2\pi} \sum_{\ell=-\infty}^{\infty} \Gamma_x(\ell)\, e^{-i\ell\theta}, \qquad \theta \in [-\pi, \pi], \qquad (10)$$

exists, and its maximum and minimum eigenvalues are bounded a.e. on $[-\pi, \pi]$, i.e.,

$$\mathcal{M}(f_x) := \operatorname*{ess\,sup}_{\theta \in [-\pi,\pi]} \Lambda_{\max}(f_x(\theta)) < \infty, \qquad \mathfrak{m}(f_x) := \operatorname*{ess\,inf}_{\theta \in [-\pi,\pi]} \Lambda_{\min}(f_x(\theta)) > 0. \qquad (11)$$

For stable, invertible VARMA processes of the form (9), this assumption is satisfied whenever $\Lambda_{\min}(\Sigma_a) > 0$, and the spectral density takes the form

$$f_y(\theta) = \frac{1}{2\pi}\, \Phi^{-1}(e^{-i\theta})\, \Theta(e^{-i\theta})\, \Sigma_a\, \Theta^{*}(e^{-i\theta}) \left[\Phi^{-1}(e^{-i\theta})\right]^{*}.$$

This provides the following upper and lower bounds on $\mathcal{M}(f_y)$ and $\mathfrak{m}(f_y)$:

$$\mathfrak{m}(f_y) \geq \frac{1}{2\pi}\, \Lambda_{\min}(\Sigma_a)\, \frac{\mu_{\min}(\Theta)}{\mu_{\max}(\Phi)}, \qquad \mathcal{M}(f_y) \leq \frac{1}{2\pi}\, \Lambda_{\max}(\Sigma_a)\, \frac{\mu_{\max}(\Theta)}{\mu_{\min}(\Phi)}, \qquad (12)$$

where

$$\mu_{\min}(\Phi) := \min_{|z|=1} \Lambda_{\min}\!\left(\Phi^{*}(z)\,\Phi(z)\right), \qquad \mu_{\max}(\Phi) := \max_{|z|=1} \Lambda_{\max}\!\left(\Phi^{*}(z)\,\Phi(z)\right), \qquad (13)$$

and $\mu_{\min}(\Theta)$, $\mu_{\max}(\Theta)$ are defined accordingly. Note that for stable, invertible VARMA processes, $\det \Phi(z)$ and $\det \Theta(z)$ do not vanish on the unit circle $\{z \in \mathbb{C} : |z| = 1\}$. Proposition 2.2 in Basu and Michailidis (2015) provides additional results on how these quantities are related to the model parameters $\Phi$ and $\Theta$.
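The extreme quantities in (13) can be approximated numerically by evaluating the operator on a fine grid of the unit circle. A small sketch under assumed names (for the scalar AR operator $\Phi(L) = 1 - 0.5L$, the exact values are $\mu_{\min}(\Phi) = 0.25$ and $\mu_{\max}(\Phi) = 2.25$):

```python
import numpy as np

def mu_extremes(coefs, sign, n_grid=512):
    """Approximate mu_min and mu_max of A(z) = I + sign*(A_1 z + A_2 z^2 + ...)
    over the unit circle by gridding theta and taking the extreme
    eigenvalues of A*(z) A(z).  Use sign=-1 for Phi(L), sign=+1 for Theta(L).
    """
    d = coefs[0].shape[0]
    mins, maxs = [], []
    for theta in np.linspace(0, 2 * np.pi, n_grid, endpoint=False):
        z = np.exp(-1j * theta)
        A = np.eye(d, dtype=complex)
        for k, C in enumerate(coefs, start=1):
            A += sign * C * z**k
        eig = np.linalg.eigvalsh(A.conj().T @ A)  # real, ascending order
        mins.append(eig[0])
        maxs.append(eig[-1])
    return min(mins), max(maxs)
```

Since the grid contains both $\theta = 0$ and $\theta = \pi$, the scalar example above is recovered exactly at grid points.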

In the analysis of Phase I, we often work with the truncated polynomial $\Pi^{[\tilde p]}(L)$. When the VARMA process is stable and invertible, $\mu_{\max}(\Pi^{[\tilde p]}) < \infty$. In addition, we make the following assumption on the data generating VARMA process:

###### Assumption 2.

There exists an integer $\tilde p_0 \geq 1$ such that the truncated operator $\Pi^{[\tilde p]}(L)$ is stable for all $\tilde p \geq \tilde p_0$.

In other words, we assume that finite-order VAR approximations of the stable, invertible VARMA process are also stable as long as the order of approximation is large enough.

More generally, for any two $d$-dimensional processes $\{x_t\}$ and $\{y_t\}$, the cross-spectral density is defined as

$$f_{xy}(\theta) = \frac{1}{2\pi} \sum_{\ell=-\infty}^{\infty} \Gamma_{x,y}(\ell)\, e^{-i\ell\theta}, \qquad \theta \in [-\pi, \pi],$$

where $\Gamma_{x,y}(\ell) = \mathrm{Cov}(x_t, y_{t+\ell})$, for $\ell \in \mathbb{Z}$. If the joint process $\{(x_t^\top, y_t^\top)^\top\}$ satisfies Assumption 1, we define the following cross-spectral measure of dependence

$$\mathcal{M}(f_{xy}) := \operatorname*{ess\,sup}_{\theta \in [-\pi,\pi]} \sqrt{\Lambda_{\max}\!\left(f_{xy}^{*}(\theta)\, f_{xy}(\theta)\right)}.$$

### 4.2 Estimation Error in Phase I

Suppose we have data indexed in the form $y_{1-\tilde p}, \dots, y_0, y_1, \dots, y_T$. In Phase I, we regress $y_t$ on its $\tilde p$ most recent lags:

$$y_t = \sum_{\tau=1}^{\tilde p} \Pi_\tau\, y_{t-\tau} + \varepsilon_t, \qquad \text{where} \quad \varepsilon_t = a_t + \sum_{\tau=\tilde p + 1}^{\infty} \Pi_\tau\, y_{t-\tau}. \qquad (14)$$

The autoregressive design takes the form

$$\underbrace{\begin{bmatrix} y_T^\top \\ y_{T-1}^\top \\ \vdots \\ y_1^\top \end{bmatrix}}_{Y:\; T \times d} = \underbrace{\begin{bmatrix} y_{T-1}^\top & y_{T-2}^\top & \dots & y_{T-\tilde p}^\top \\ y_{T-2}^\top & y_{T-3}^\top & \dots & y_{T-1-\tilde p}^\top \\ \vdots & \vdots & \ddots & \vdots \\ y_0^\top & y_{-1}^\top & \dots & y_{1-\tilde p}^\top \end{bmatrix}}_{X:\; T \times d\tilde p} \underbrace{\begin{bmatrix} \Pi_1^\top \\ \Pi_2^\top \\ \vdots \\ \Pi_{\tilde p}^\top \end{bmatrix}}_{B:\; d\tilde p \times d} + \underbrace{\begin{bmatrix} \varepsilon_T^\top \\ \varepsilon_{T-1}^\top \\ \vdots \\ \varepsilon_1^\top \end{bmatrix}}_{E:\; T \times d} \qquad (15)$$

Vectorizing the above, we have a regression problem with $N = Td$ samples and $r = d^2 \tilde p$ parameters:

 vec(Y)Y=(I⊗X)Zvec(B)β∗+vec(E)E

In Phase I, we consider an $\ell_1$-penalized least squares ($\ell_1$-LS) estimate

$$\hat\beta = \operatorname*{argmin}_{\beta \in \mathbb{R}^r} \frac{1}{N} \|\mathcal{Y} - Z\beta\|^2 + \lambda \|\beta\|_1. \qquad (16)$$

Note that the first-stage estimates $\hat\Pi_\tau$, $\tau = 1, \dots, \tilde p$, are related to $\hat\beta$ as $\hat\beta = \mathrm{vec}(\hat B)$, where $\hat B = [\hat\Pi_1, \dots, \hat\Pi_{\tilde p}]^\top$. We also denote the residuals of the Phase I regression as $\hat\varepsilon_t = y_t - \sum_{\tau=1}^{\tilde p} \hat\Pi_\tau\, y_{t-\tau}$.
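Because $Z = I_d \otimes X$ is block-diagonal over the $d$ response columns, the vectorized problem (16) decouples into $d$ separate lasso regressions of each column of $Y$ on $X$. A sketch using scikit-learn's `Lasso` as an off-the-shelf stand-in (note that scikit-learn penalizes $\alpha\|w\|_1$ against $\frac{1}{2n}\|y - Xw\|^2$, so its `alpha` is not on the same scale as the $\lambda$ in (16); the function names here are ours):

```python
import numpy as np
from sklearn.linear_model import Lasso

def lag_matrix(y, p_tilde):
    """Build targets and the stacked-lag design of (15).
    y: (n, d) array; returns Y of shape (n - p_tilde, d) and
    X of shape (n - p_tilde, d * p_tilde), lag 1 first."""
    n, d = y.shape
    Y = y[p_tilde:]
    X = np.hstack([y[p_tilde - tau:n - tau] for tau in range(1, p_tilde + 1)])
    return Y, X

def phase1_l1(y, p_tilde, lam):
    """l1-penalized VAR(p_tilde): fit one lasso per response column
    (the vectorized problem decouples column-wise).  Returns the
    stacked coefficient matrix B (d*p_tilde x d) and the residuals."""
    Y, X = lag_matrix(y, p_tilde)
    B = np.column_stack([
        Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(X, Y[:, i]).coef_
        for i in range(Y.shape[1])
    ])
    resid = Y - X @ B
    return B, resid
```

The rows of the returned residual matrix are the Phase I approximations $\hat\varepsilon_t$ that feed into the Phase II lagged regression.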

Our next proposition provides a deterministic upper bound on the estimation error of the above regression for a given realization of the data from the VARMA model (9). We first provide an upper bound on the deviation of the estimated residuals $\hat\varepsilon_t$ around $\varepsilon_t$ without making any assumption on the design matrix $X$. This is essentially the so-called “slow rate” of convergence of lasso estimates (Greenshtein and Ritov, 2004). Next, we provide a tighter upper bound on the above deviation, and on the deviation of $\hat\beta$ around $\beta^*$, under a restricted eigenvalue (RE) condition used in Basu and Michailidis (2015):

###### Assumption 3 (Restricted Eigenvalue, RE).

A symmetric matrix $\Gamma \in \mathbb{R}^{r \times r}$ satisfies the restricted eigenvalue (RE) condition with curvature $\alpha > 0$ and tolerance $\tau \geq 0$ if

$$v^\top \Gamma v \geq \alpha \|v\|^2 - \tau \|v\|_1^2, \qquad \text{for all } v \in \mathbb{R}^r. \qquad (17)$$

These upper bounds involve the curvature and tolerance parameters $\alpha$ and $\tau$ as well as the quantity $\|Z^\top \mathcal{E}/T\|_\infty$, and do not relate directly to the VARMA model parameters. The next two propositions provide insight into how these quantities depend on the VARMA parameters when we have a random realization from a stable, invertible VARMA model (9).

###### Proposition 1.

Consider any solution $\hat\beta$ of the $\ell_1$-LS problem (16) using a given realization of $y_{1-\tilde p}, \dots, y_T$ from the VARMA model (9). Then, for any choice of the penalty parameter satisfying $\lambda \geq 2\|Z^\top \mathcal{E}/T\|_\infty$, we have

$$\frac{1}{T} \sum_{t=1}^{T} \|\hat\varepsilon_t - \varepsilon_t\|^2 \leq 2\lambda \sum_{\tau=1}^{\tilde p} \|\Pi_\tau\|_1. \qquad (18)$$

In addition, assume the matrices $\Pi_\tau$ are sparse in the sense that $\sum_{\tau=1}^{\tilde p} \|\Pi_\tau\|_0 \leq k$ (the assumption of strict sparsity is made for ease of exposition; the coefficients $\Pi_\tau$ may not be exactly sparse even when the coefficients of $\Phi(L)$ and $\Theta(L)$ are, but some form of weak sparsity is expected to hold, i.e., there are a few large and many small coefficients, and the results presented here can be generalized to this setting using recent advances in the theory of misspecified lasso, cf. van de Geer 2016), and assume the sample Gram matrix $Z^\top Z/T$ satisfies RE($\alpha$, $\tau$) of Assumption 3 for model-dependent quantities $\alpha > 0$ and $\tau \geq 0$ such that $k\tau$ is sufficiently small relative to $\alpha$. Then, for any such choice of $\lambda$, we have the following upper bounds

$$\sum_{\tau=1}^{\tilde p} \left\|\hat\Pi_\tau - \Pi_\tau\right\|_1 \leq 64\, k \lambda / \alpha, \qquad (19)$$

$$\left[\sum_{\tau=1}^{\tilde p} \left\|\hat\Pi_\tau - \Pi_\tau\right\|_F^2\right]^{1/2} \leq 64\, \sqrt{k}\, \lambda / \alpha, \qquad (20)$$

$$\frac{1}{T} \sum_{t=1}^{T} \|\hat\varepsilon_t - \varepsilon_t\|^2 \leq 128\, k \lambda^2 / \alpha. \qquad (21)$$

The proof of this proposition follows standard arguments in the literature on high-dimensional statistics (Bickel et al., 2009; Loh and Wainwright, 2012; Basu and Michailidis, 2015). We provide a proof in the appendix for completeness.

Our next proposition provides a non-asymptotic upper bound on $\|Z^\top \mathcal{E}/T\|_\infty$ which holds with high probability for large $T$. If $\lambda$ is chosen to be of this order, inequalities (18)-(21) then show how the numerators of the upper bounds vary with the model parameters.

###### Proposition 2 (Deviation Condition).

If $y_{1-\tilde p}, \dots, y_T$ is a random realization from a stable, invertible VARMA model (9), then there exist universal constants $c_i > 0$, independent of the model parameters, such that with probability at least $1 - c_1 \exp\left(-c_2 (\log \tilde p + 2 \log d)\right)$,

$$\left\|Z^\top \mathcal{E}/T\right\|_\infty \leq 2\pi \left[ \mathcal{M}(f_y) + \mu_{\max}(\Pi^{[\tilde p]}) + \mathcal{M}(f_y)\, \mu_{\max}(\Pi^{[\tilde p]}) \right] \sqrt{\frac{\log \tilde p + 2 \log d}{T}} + 2\pi\, \mathcal{M}(f_y) \sum_{\tau=\tilde p + 1}^{\infty} \|\Pi_\tau\|.$$

The second term on the right-hand side is not a function of the sample size $T$, and depends only on the tail decay of the AR coefficients $\Pi_\tau$. For this term to be ignorable for large $T$, one needs to choose $\tilde p$ large enough so that the tail sum $\sum_{\tau > \tilde p} \|\Pi_\tau\|$ is of smaller order than $\sqrt{(\log \tilde p + 2 \log d)/T}$. Note that for stable, invertible VARMA models, there exist constants $C > 0$ and $\rho \in (0, 1)$ such that $\|\Pi_\tau\| \leq C \rho^\tau$ (Dufour and Jouini, 2005).

The last proposition of this section investigates sample size requirements for the RE condition to hold with high probability, and also provides insight into how the tolerance and curvature parameters depend on the VARMA model parameters.

###### Proposition 3 (Verifying Restricted Eigenvalue Condition).

Consider a random realization of the data from a VARMA model (9) satisfying Assumptions 1 and 2. Then there exist universal constants $c_i > 0$ such that for $T \gtrsim \max\{\omega^2, 1\} (\log d + \log \tilde p)$, the matrix $X^\top X/T$ satisfies RE($\alpha$, $\tau$) with probability at least $1 - c_1 \exp\left(-c_2\, T \min\{\omega^{-2}, 1\}\right)$, where

$$\alpha = \pi\, \mathfrak{m}(f_y), \qquad \omega = c_3\, \tilde p\, \mathcal{M}(f_y) / \mathfrak{m}(f_y), \qquad \tau = \alpha \max\{\omega^2, 1\} (\log d + \log \tilde p)/T.$$

## 5 Simulation Study

We investigate the performance of the sparse VARMA estimator through a simulation study. We generate data from a VARMA model with fixed time series length $T$. To ensure identification of the VARMA, we take diagonal autoregressive matrices with fixed diagonal elements, and we fix the autoregressive order and the error covariance matrix.

We consider several settings for the moving average parameters, taking the moving average matrices to be lower triangular. One parameter regulates the strength of the moving average signal; a second regulates the moving average order. We investigate the effect of the

1. moving average signal strength: we vary the signal strength parameter. The larger its value, the stronger the moving average signal. Note that when it equals zero, the true model is a VAR.

2. moving average order: we vary the order parameter. The larger its value, the higher the moving average order.

3. number of time series: we vary the number of time series $d$. The higher $d$, the more parameters need to be estimated relative to the fixed time series length $T$.

In all considered settings, the resulting VARMA models are invertible and stable.

#### Estimators.

We compare the performance of four estimators: (i) “VARMA()”: the VARMA estimator of model (1) with an oracle providing the true errors and the orders $p$ and $q$; (ii) “VARMA()”: the VARMA estimator of model (6) with approximated errors and an oracle providing the orders $p$ and $q$; (iii) “VARMA()”: the VARMA estimator of model (6) with approximated errors and specified maximal orders; (iv) “VAR()”: the VAR estimator of model (5) with specified order $\tilde p$.

#### Performance Measure.

We compare the performance of the estimators in terms of out-of-sample forecast accuracy. We generate time series of length $T + 1$ and use the last observation to measure forecast accuracy. We compute the Mean Squared Forecast Error

$$\mathrm{MSFE} = \frac{1}{N} \sum_{s=1}^{N} \frac{1}{d} \left\| y^{(s)}_{T+1} - \hat y^{(s)}_{T+1} \right\|^2,$$

where $y^{(s)}_{T+1}$ is the vector of time series at time point $T + 1$ in the $s$-th simulation run, and $\hat y^{(s)}_{T+1}$ is its predicted value, with $N$ the number of simulation runs. We focus on out-of-sample forecast accuracy in the simulation study, in line with the discussion of the applications in Section 6. We also compared the estimators in terms of estimation accuracy; similar conclusions are obtained.
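The performance measure can be computed directly; a small sketch with hypothetical arrays of realized and predicted values (function name ours):

```python
import numpy as np

def msfe(y_true, y_pred):
    """Mean squared forecast error across simulation runs:
    the average over runs of (1/d) times the squared Euclidean
    error of the one-step-ahead forecast.
    y_true, y_pred: (N, d) arrays of realized and predicted vectors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    d = y_true.shape[1]
    return np.mean(np.sum((y_true - y_pred) ** 2, axis=1) / d)
```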

### 5.1 Effect of the moving average signal strength

Figure 1 shows the Mean Squared Forecast Errors (averaged over the simulation runs) of the four estimators for values of the moving average signal strength parameter in {0, 0.3, 0.6, 0.9}. We report the results for one representative setting of the other design parameters.

If the true model is a VARMA (i.e., a nonzero moving average signal), the VARMA estimators perform better than the VAR estimator, as expected. The stronger the moving average signal, the larger the gain of the VARMA estimators over the VAR estimator. The differences in prediction accuracy between the VARMA estimators and the VAR estimator are all significant, as confirmed by a paired $t$-test. Among the VARMA estimators, “VARMA()” and “VARMA()” perform equally well for weaker moving average signals. Hence, there is no considerable loss in using the two-phase approach. For the strongest signal, the VARMA with true errors performs best. The loss in prediction accuracy from using the approximated errors instead of the true errors is, however, limited to, on average, 15%. Furthermore, the VARMA estimator with estimated errors and selected orders (i.e., “VARMA()”) performs very similarly, across all signal strengths, to the one with known orders. There is, hence, no considerable loss in not knowing the autoregressive or moving average order.

If the true model is a VAR (i.e., a zero moving average signal), the VARMA estimators with known orders both reduce to a VAR estimator since the true moving average parameters are zero. They give the lowest MSFE of, on average, 1.25. However, in practice, the orders of the model are not known. For unknown orders, the VARMA estimator (MSFE = 1.32) performs significantly better than the VAR estimator (MSFE = 1.36). The VARMA estimator gives, on average, a more parsimonious model than the VAR estimator: for the VARMA model, on average, only 20% of the autoregressive and moving average parameters are estimated as non-zero, whereas for the VAR model, on average, 40% of the autoregressive parameters are estimated as non-zero. We thus find that the more parsimonious VARMA model significantly improves prediction accuracy over the VAR model even if the true data generating process is a VAR.

### 5.2 Effect of the moving average order

Figure 2 shows the Mean Squared Forecast Errors of the four estimators for different values of the moving average order $q \in \{1, 2, 4, 8\}$. We report the results for a fixed setting of the remaining simulation parameters.

For all values of $q$, the VARMA estimators perform significantly better than the VAR estimator. The VARMA estimator with true errors performs best, but is closely followed by the VARMA estimator with approximated errors (and known orders). The latter's MSFEs are, on average, less than 10% higher than the former's.

The difference between the “VARMA()” and the “VARMA()” estimators is most pronounced for the largest moving average order. For large $q$, the moving average orders are (on average) underestimated, since the moving average parameters at the highest lags are very small in magnitude. Nevertheless, the “VARMA()” estimator still significantly outperforms the VAR estimator in this case: an MSFE of 3.20 compared to 3.36, respectively.

### 5.3 Effect of the number of time series

Figure 3 shows the Mean Squared Forecast Errors of the four estimators for different values of the number of time series $d \in \{5, 10, 20, 40\}$. We report the results for a fixed setting of the remaining simulation parameters. As the number of time series increases relative to the fixed time series length $T$, it becomes more difficult to accurately estimate the model. As a consequence, the MSFEs of all estimators increase with the number of time series.

For all values of $d$, the VARMA estimators attain lower MSFEs than the VAR estimator. All differences are significant, except for the difference between the VAR estimator and the “VARMA()” estimator for one value of $d$. Nevertheless, for larger values of $d$, the latter significantly improves prediction accuracy over the VAR estimator by, on average, 14%. Furthermore, for all values of $d$, the loss in prediction accuracy from not knowing the autoregressive and moving average order is limited, since the two VARMA estimators “VARMA()” and “VARMA()” perform very similarly. For some values of $d$, both estimators even perform comparably to the VARMA estimator with true errors.

### 5.4 Summary of simulation results

The VARMA estimators attain good forecast performance in the considered simulation scenarios. If the true data generating process is a VARMA, the VARMA estimators perform better than the VAR estimator in both low- and high-dimensional settings (cf. Section 5.3). The improvement of VARMA over VAR is larger if the moving average signal is stronger (cf. Section 5.1). For VARMA models with limited moving average order, there is no substantial loss in forecast accuracy from not knowing the autoregressive or moving average order (cf. Section 5.2). Finally, even if the true data generating process is a VAR, the VARMA estimators may perform better than the VAR estimator by giving more parsimonious models (cf. Section 5.1).

## 6 Forecast applications

We present three forecast applications of the sparse VARMA estimator: (i) macro-economic forecasting, (ii) demand forecasting, and (iii) volatility forecasting. In all considered cases, the number of time series $d$ is large relative to the time series length $T$.

#### Macro-economic data.

We use data on quarterly macro-economic time series collected by Stock and Watson (2002). The macro-economic indicators start in 1966, Quarter 3. A full list of the time series is available in Koop (2013) (Table BIII of the Data Appendix), along with the transformation codes used to make the data approximately stationary. Time plots of the macro-economic time series are given in Figures 6 and 7, Appendix C.

#### Demand data.

Weekly sales (or demand) data (in dollars) are collected for a number of product categories of Dominick’s Finer Foods, a US supermarket chain. Sales data are collected over the period January 1993 to July 1994, and are publicly available from https://research.chicagobooth.edu/kilts/marketing-databases/dominicks. To ensure stationarity of the time series, we take each sales time series in log-differences and, hence, model sales growth. Time plots of these time series are given in Figure 8, Appendix C. Augmented Dickey-Fuller tests confirm that the sales growth time series are stationary.
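As an illustration of this preprocessing step, the following sketch applies the log-difference transformation to a hypothetical sales series and checks stationarity. The function `df_tstat` is our own simplified, non-augmented Dickey-Fuller statistic (no constant, no lag augmentation), not the exact test used in the paper; in practice one would use a full Augmented Dickey-Fuller implementation.

```python
import numpy as np

def df_tstat(x):
    """t-statistic of rho in the simplified Dickey-Fuller regression
    diff(x)_t = rho * x_{t-1} + e_t; strongly negative values
    indicate stationarity (rejection of a unit root)."""
    dx, xlag = np.diff(x), x[:-1]
    rho = (xlag @ dx) / (xlag @ xlag)
    resid = dx - rho * xlag
    se = np.sqrt((resid @ resid) / (len(dx) - 1) / (xlag @ xlag))
    return rho / se

# Hypothetical weekly sales series for one product category:
# a random walk in logs, so its levels are non-stationary.
rng = np.random.default_rng(0)
sales = np.exp(5 + np.cumsum(rng.normal(0.01, 0.05, size=80)))

# Log-differences = weekly sales growth, which is stationary here.
growth = np.diff(np.log(sales))
print(df_tstat(growth))  # strongly negative: growth is stationary
```

The same transformation applied to each category's series yields the sales growth data modeled in this application.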

#### Volatility data.

We use data on monthly realized variances for stock market indices. Realized variance measures computed from five-minute returns are obtained from the Oxford-Man Institute of Quantitative Finance (publicly available at http://realized.oxford-man.ox.ac.uk/data/download). Data are collected over the period January 2000 to April 2016. Following standard practice, the realized variances are log-transformed. Time plots of the log-realized variances are given in Figure 9, Appendix C. Augmented Dickey-Fuller tests confirm that the log-realized variances are stationary.

First, we discuss the model parsimony of the estimated VARMA and VAR models. Second, we compare the forecast accuracy of the sparse VARMA model to the sparse VAR model for different forecast horizons.

### 6.1 Model parsimony

Since the sparse VARMA and VAR estimators both perform automatic lag selection, they give information on the effective maximum autoregressive and moving average orders. Consider the moving average lag matrix $\hat{L}_{\hat{\Theta}}$ of the estimated VARMA model, whose elements are equal to

$$\hat{L}_{\hat{\Theta},ij} = \max\{m : \hat{\Theta}_{m,ij} \neq 0\},$$

and where $\hat{L}_{\hat{\Theta},ij} = 0$ if $\hat{\Theta}_{m,ij} = 0$ for all $m$. This lag matrix shows the maximal moving average lag of each time series in each equation of the estimated VARMA model. If entry $(i,j)$ is zero, all lagged moving average coefficients of time series $j$ in equation $i$ are estimated as zero. If entry $(i,j)$ is, for instance, three, the largest moving average lag of time series $j$ estimated as non-zero in equation $i$ is three. Similarly, one can construct the autoregressive lag matrix of the estimated VARMA model and the autoregressive lag matrix of the estimated VAR model.
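The lag-matrix definition above can be computed in a few lines. A minimal sketch, assuming the estimated moving average coefficient matrices are stacked in a three-dimensional NumPy array (the names `lag_matrix` and `Theta_hat` are ours, for illustration only):

```python
import numpy as np

def lag_matrix(coefs, tol=0.0):
    """Maximal non-zero lag per (i, j) entry.

    coefs: array of shape (L, d, d) stacking the estimated lag
    matrices, so coefs[m-1] holds the coefficients at lag m.
    Returns a (d, d) integer matrix whose (i, j) entry equals
    max{m : |coefs[m-1, i, j]| > tol}, or 0 if all lags are zero.
    """
    L, d, _ = coefs.shape
    nonzero = np.abs(coefs) > tol            # (L, d, d) boolean mask
    lags = np.arange(1, L + 1)[:, None, None]
    return (nonzero * lags).max(axis=0)      # 0 where every lag is zero

# Toy example with d = 2 series and maximal lag L = 3.
Theta_hat = np.zeros((3, 2, 2))
Theta_hat[0, 0, 0] = 0.5   # lag 1 of series 1 in equation 1
Theta_hat[2, 0, 1] = -0.2  # lag 3 of series 2 in equation 1
print(lag_matrix(Theta_hat))
# [[1 3]
#  [0 0]]
```

The autoregressive lag matrices are obtained the same way from the stacked autoregressive coefficient estimates.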

Figure 4 shows the lag matrices of the estimated VARMA and VAR models on the demand data set. Similar findings are obtained for the macro-economic and volatility data sets and are therefore omitted. The moving average lag matrix of the estimated VARMA model (top left) is very sparse: 243 out of 256 entries are equal to zero. Though sparse, some lagged moving average parameters are still estimated as non-zero. By adding these terms to the model, serial correlation in the error terms is captured, and a more parsimonious VARMA model is obtained compared to the estimated VAR model. Indeed, only 85 of the autoregressive and moving average VARMA parameters (or 1.28%) are estimated as non-zero, whereas 901 of the autoregressive VAR parameters (or 27.07%) are estimated as non-zero. We find the more parsimonious VARMA model to give more accurate forecasts than the VAR model, as discussed in the following subsection.
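The parsimony percentages quoted above are simple non-zero counts pooled over the estimated coefficient arrays. A sketch (the helper name is ours):

```python
import numpy as np

def nonzero_fraction(*coef_arrays):
    """Fraction of estimated coefficients that are non-zero,
    pooled over all supplied arrays (e.g. the AR and MA parts)."""
    total = sum(a.size for a in coef_arrays)
    nonzero = sum(int(np.count_nonzero(a)) for a in coef_arrays)
    return nonzero / total

# Toy check: 2 non-zeros out of 8 pooled entries -> 0.25.
Phi_hat = np.array([[0.4, 0.0], [0.0, 0.0]])     # AR estimates
Theta_hat = np.array([[0.0, -0.1], [0.0, 0.0]])  # MA estimates
print(nonzero_fraction(Phi_hat, Theta_hat))  # 0.25
```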

### 6.2 Forecast Accuracy

We compare the forecast accuracy of the VARMA and VAR estimators. To assess forecast performance, we use an expanding window forecast approach. Let $h$ be the forecast horizon. At each time point $t = S, \dots, T-h$, we use the sparse estimators to estimate the VARMA and VAR models on the standardized time series. For each data set, we take $S$ such that forecasts are computed for the last 15% of the observations. We then obtain $h$-step-ahead direct forecasts $\hat{y}_{t+h}$ and the corresponding forecast errors for each time series. The overall forecast performance is measured by computing the Mean Squared Forecast Error

$$\text{MSFE}_h = \frac{1}{T-h-S+1} \sum_{t=S}^{T-h} \frac{1}{d} \left\| y_{t+h} - \hat{y}_{t+h} \right\|_2^2,$$

where the average is taken over all forecast origins and all $d$ time series. The Mean Squared Forecast Error depends on the forecast horizon $h$. For the quarterly macro-economic data set, the weekly demand data set, and the monthly volatility data set, we report the results for several forecast horizons. To assess the difference in prediction performance between the VARMA and VAR estimators, we use a Diebold-Mariano test (DM-test, Diebold and Mariano, 1995).
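The MSFE above and the Diebold-Mariano comparison can be computed as follows. This is a minimal sketch with our own function names, using squared-error loss and a plain DM statistic without the long-run variance (autocorrelation) correction one would typically add for multi-step horizons:

```python
import math
import numpy as np

def msfe(y_true, y_hat):
    """MSFE_h: squared forecast errors averaged over the forecast
    origins (rows) and the d time series (columns)."""
    y_true, y_hat = np.asarray(y_true), np.asarray(y_hat)
    d = y_true.shape[1]
    return float(np.mean(np.sum((y_true - y_hat) ** 2, axis=1) / d))

def dm_test(y_true, y_hat1, y_hat2):
    """Basic Diebold-Mariano statistic and two-sided normal p-value
    for the per-period squared-error loss differential."""
    y_true = np.asarray(y_true)
    loss1 = np.sum((y_true - np.asarray(y_hat1)) ** 2, axis=1)
    loss2 = np.sum((y_true - np.asarray(y_hat2)) ** 2, axis=1)
    dbar = loss1 - loss2
    stat = dbar.mean() / math.sqrt(dbar.var(ddof=1) / dbar.size)
    pval = math.erfc(abs(stat) / math.sqrt(2))  # 2 * (1 - Phi(|stat|))
    return stat, pval

# Toy example: two forecast origins, d = 2 series.
y  = [[1.0, 1.0], [2.0, 2.0]]
f1 = [[0.0, 1.0], [2.0, 0.0]]
print(msfe(y, f1))  # (1/2) * ((1 + 0)/2 + (0 + 4)/2) = 1.25
```

In the applications below, the two forecast sequences compared in the DM-test are those of the sparse VARMA and the sparse VAR estimators.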

#### Results macro-economic data.

The Mean Squared Forecast Errors for the VARMA and VAR estimators are given in Table 1. For all considered horizons, the VARMA estimator gives a lower MSFE than the VAR estimator. At its best horizon, the VARMA estimator attains the largest gain in MSFE over the VAR estimator: it improves forecast accuracy by, on average, 40%. The $p$-values of the Diebold-Mariano test confirm that the MSFE of the VARMA estimator is significantly lower than that of the VAR estimator at this horizon (at the 5% significance level) and at one further horizon (at the 10% significance level).

#### Results demand data.

The Mean Squared Forecast Errors for the VARMA and VAR estimators are given in Table 2. The VARMA estimator improves forecast accuracy compared to the VAR estimator, regardless of the considered forecast horizon. On average, the VARMA attains a gain of 16% over the VAR model. For both estimators, forecast accuracy decreases as the forecast horizon grows, since the forecast uncertainty increases.

#### Results volatility data.

The Mean Squared Forecast Errors for the VARMA and VAR estimators are given in Table 3. For all considered horizons, the VARMA estimator gives a lower MSFE than the VAR estimator. The Diebold-Mariano test confirms that all improvements in forecast accuracy are significant at the 5% significance level. The gain in forecast accuracy over the VAR estimator is largest for the longer forecast horizons: at the longest horizon, the MSFE of the VAR estimator is more than three times that of the VARMA estimator.