High Dimensional Forecasting via Interpretable Vector Autoregression

The authors thank Gary Koop for providing his data transformation script. This research was supported by an Amazon Web Services in Education Research Grant. JB was supported by NSF DMS-1405746, and DSM was supported by NSF DMS-1455172 and a Xerox Corporation Ltd. faculty research award.
Abstract
Vector autoregression (VAR) is a fundamental tool for modeling multivariate time series. However, as the number of component series increases, the VAR model becomes overparameterized. Several authors have addressed this issue by incorporating regularized approaches, such as the lasso, into VAR estimation. Traditional approaches instead address overparameterization by selecting a low lag order, based on the assumption of short-range dependence, and by assuming that a universal lag order applies to all components. Such an approach constrains the relationship between the components and impedes forecast performance. The lasso-based approaches work much better in high-dimensional situations but do not incorporate the notion of lag order selection.
We propose a new class of regularized VAR models, called hierarchical vector autoregression (HVAR), that embed the notion of lag selection into a convex regularizer. The key modeling tool is a group lasso with nested groups which guarantees that the sparsity pattern of lag coefficients honors the VAR’s ordered structure. The HVAR framework offers three structures, which allow for varying levels of flexibility. A simulation study demonstrates improved performance in forecasting and lag order selection over previous approaches, and two macroeconomic applications further highlight forecasting improvements as well as HVAR’s convenient, interpretable output.
1 Introduction
Vector autoregression (VAR) has emerged as the standard-bearer for macroeconomic forecasting since the seminal work of sims1980. VAR is also widely applied in numerous fields, including climatology, neuroscience, and signal processing. The number of VAR parameters grows quadratically with the number of component series, and, in the words of Sims, this “profligate parameterization” becomes intractable for large systems of variables. Without further assumptions, VAR modeling is infeasible except in limited situations in which the number of components and the lag order are small.
Many approaches have been proposed for reducing the dimensionality of vector time series models, including canonical correlation analysis (box1977canonical), factor models (pena1987identifying), scalar component models (tiao1989model), independent component analysis (back1997first), principal component analysis (stock2002forecasting), generalized dynamic factor models (forni2000generalized), and dynamic orthogonal component models (MattesonTsay2011).
Many recent approaches have instead focused on imposing sparsity in the estimated coefficient matrices through the use of convex regularizers such as the lasso (tibs). Most of these methods are adapted from the standard regression setting and do not specifically leverage the ordered structure inherent to the lag coefficients in a VAR. We propose a new class of regularized VAR models, called hierarchical vector autoregression (HVAR), that embed lag order selection into a convex regularizer to provide more reliable forecasts and more interpretable output.
The HVAR shifts the focus from obtaining estimates that are generally sparse (as measured by the number of nonzero autoregressive coefficients) to attaining estimates with low maximal lag order. While our motivating goal is to produce interpretable models with improved forecast performance, a convenient byproduct of the HVAR framework is a flexible and computationally efficient method for lag order selection.
Lag order selection procedures have been developed since the inception of VAR. Early attempts utilize least squares estimation with an information criterion or hypothesis testing (lutk1). The asymptotic theory of these approaches is well developed in the fixed-dimensional setting, in which the length of the series grows while the number of components and the maximal lag order are held fixed (white2001asymptotic); however, for small sample sizes, it has been observed that no criterion works well (nick). Gonz find that, for a fixed number of components and a fixed maximal lag order, when the series length is relatively small, Akaike’s Information Criterion (AIC) tends to overfit whereas Schwarz’s Information Criterion (BIC) tends to severely underfit. Despite their shortcomings, AIC, BIC, and corrected AIC (AICc, Hurvich89) are still the preferred tools for lag order selection by most practitioners (Lutk, Pfaff, tsay2013multivariate).
A drawback with such approaches is that they typically require the strong assumption of a single, universal lag order that applies across all components. While this reduces the computational complexity of model selection, it has little statistical or economic justification, it unnecessarily constrains the dynamic relationship between the components, and it impedes forecast performance. karlsson show that violation of the universal lag order assumption can lead to overparameterized models or the imposition of false zero restrictions. They instead suggest considering componentwise specifications that allow each marginal regression to have a different lag order (which is sometimes referred to as an asymmetric VAR). One such procedure (hsiao) starts from univariate autoregressive models and sequentially adds lagged components according to Akaike’s “Final Prediction Error” criterion (akaike). However, this requires an a priori ranking of components based on their perceived predictive power, which is inherently subjective. Keating offers a more general method which estimates all potential componentwise VAR specifications and utilizes AIC or BIC for lag order selection. Such an approach is computationally intractable and standard asymptotic justifications are inapplicable if the number of components is large. Ding present several specifications which allow for varying lag order within a Bayesian framework. Markov chain Monte Carlo estimation methods with spike and slab priors are proposed, but these are computationally intensive, and estimation becomes intractable in high dimensions.
Given the difficulties with lag order selection in VAR models, many authors have turned instead to shrinkage-based approaches, which impose sparsity, or other economically motivated restrictions, on the parameter space to make reliable estimation tractable. Early shrinkage methods, such as Litterman1979, take a pragmatic Bayesian perspective. Many such approaches apply the Minnesota prior, which uses natural conjugate priors to shrink the VAR toward either an intercept-only model or toward a vector random walk, depending on the context. The prior covariance is specified so as to incorporate the belief that a series’ own lags are more informative than other lags and that lower lags are more informative than higher lags. With this prior structure, coefficients with high lags will have a prior mean of zero and a prior variance that decays with the lag. Hence, coefficients with higher lags are shrunk more toward zero; however, as in ridge regression, coefficients will not be estimated as exactly zero.
More recent shrinkage approaches have incorporated the lasso (tibs). Hsu considers the lasso with common information criterion methods for model selection. The use of the lasso mitigates the need to conduct an exhaustive search over the space of all possible models but does not explicitly encourage lags to be small. BickelSong use a group lasso (yuan) penalty structure and downweight higher lags by scaling the penalty parameter by an increasing function of the coefficients’ lag. The authors note that the functional form of these weights is arbitrary, yet the estimates are sensitive to the choice of weights.
In Section 2 we introduce the HVAR framework, which attacks the traditional lag order selection problem through convex regularization. HVAR forces low lag coefficients to be selected before corresponding high lag coefficients, thereby specifically shrinking toward low lag order solutions. This is in contrast to approaches such as BickelSong, which increase the weight of the penalty parameter with the coefficients’ lag without explicitly enforcing a low-lag structure. In Section 2.1 we introduce three hierarchical lag structures that may be desirable when fitting VAR models to data. These structures vary in the degree to which lag order selection is common across different components. For each lag structure, a corresponding HVAR model is detailed in Section 2.2 for attaining that sparsity structure. The proposed methodology allows for flexible HVAR estimation in the high dimensional setting with a single tuning parameter. We develop algorithms in Section 3 that are computationally efficient and parallelizable across components. A simulation study in Section 4 and two macroeconomic applications in Section 5 highlight the advantages of HVAR in forecasting and lag order selection. The appendix provides additional details of our simulation methodology.
2 Methodology
Let $\{y_t\}_{t=1}^{T}$ denote a $k$-dimensional vector time series of length $T$. A $p$th order vector autoregression may be expressed as a multivariate regression

(2.1)  $y_t = \nu + \Phi^{(1)} y_{t-1} + \cdots + \Phi^{(p)} y_{t-p} + u_t, \qquad t = 1, \ldots, T,$

conditional on initial values $\{y_{1-p}, \ldots, y_0\}$, in which $\nu \in \mathbb{R}^k$ denotes an intercept vector, $\Phi^{(1)}, \ldots, \Phi^{(p)} \in \mathbb{R}^{k \times k}$ denote lag coefficient matrices, and $\{u_t\}$ denotes a mean zero white noise (serially uncorrelated) vector time series with an unspecified nonsingular contemporaneous covariance matrix $\Sigma_u$.

In the classical low-dimensional setting in which $kp$ is small relative to $T$, one may perform least squares to fit the model, minimizing

(2.2)  $\sum_{t=1}^{T} \Big\| y_t - \nu - \sum_{\ell=1}^{p} \Phi^{(\ell)} y_{t-\ell} \Big\|_2^2$

over $\nu$ and $\Phi^{(1)}, \ldots, \Phi^{(p)}$, where $\|v\|_2$ denotes the Euclidean norm of a vector $v$. We will find it convenient to express the VAR using compact matrix notation: let $Y = [y_1 \cdots y_T] \in \mathbb{R}^{k \times T}$, $U = [u_1 \cdots u_T] \in \mathbb{R}^{k \times T}$, $\Phi = [\Phi^{(1)} \cdots \Phi^{(p)}] \in \mathbb{R}^{k \times kp}$, and $Z = [z_1 \cdots z_T] \in \mathbb{R}^{kp \times T}$, where $z_t = (y_{t-1}^\top, \ldots, y_{t-p}^\top)^\top$.

Equation (2.1) is then simply

$Y = \nu 1_T^\top + \Phi Z + U,$

and the least squares procedure (2.2) can be expressed as minimizing

$\| Y - \nu 1_T^\top - \Phi Z \|_F^2$

over $\nu$ and $\Phi$, where $\|A\|_F$ denotes the Frobenius norm of the matrix $A$, that is, the Euclidean norm of $\mathrm{vec}(A)$ (not to be mistaken for the operator norm, which does not appear in this paper).
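For concreteness, the compact matrices can be assembled directly from the data. The following sketch (hypothetical helper name `build_var_matrices`; it assumes the series is stored as a $k \times (T+p)$ array whose first $p$ columns hold the initial values) illustrates the bookkeeping:

```python
import numpy as np

def build_var_matrices(y, p):
    """Arrange a k-dimensional series into the compact VAR matrices.

    y : array of shape (k, T + p), columns y_{1-p}, ..., y_T
        (the first p columns serve as initial values).
    Returns Y (k x T) and Z (kp x T), where column t of Z stacks
    the lagged values y_{t-1}, ..., y_{t-p}.
    """
    k, n = y.shape
    T = n - p
    Y = y[:, p:]                       # columns y_1, ..., y_T
    # Stack the p lags: block ell of Z holds y_{t-ell} in column t.
    Z = np.vstack([y[:, p - ell : n - ell] for ell in range(1, p + 1)])
    return Y, Z
```

Least squares then amounts to regressing the rows of $Y$ on $Z$ after centering both.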
Estimating the parameters of this model is challenging unless $T$ is sufficiently large relative to $kp$. We therefore seek to incorporate reasonable structural assumptions on the parameter space to make estimation tractable for moderate to small $T$. Multiple authors have considered using the lasso penalty, building in the assumption that the lagged coefficient matrices are sparse (BickelSong, Davis, Hsu); theoretical work has elucidated how such structural assumptions can lead to better estimation performance even when the number of parameters is large (basu2013estimation). In what follows, we define a class of sparsity patterns, which we call hierarchical lag structures, that arises in the context of multivariate time series.
2.1 Hierarchical Lag Structures
In Equation (2.1), the parameter $\Phi^{(\ell)}_{ij}$ controls the dynamic dependence of the $i$th component of $y_t$ on the $j$th component of $y_{t-\ell}$. In describing hierarchical lag structures, we will use the following notational convention: for $\ell \le \ell'$, let

$\Phi^{(\ell:\ell')} = \big( \Phi^{(\ell)}, \ldots, \Phi^{(\ell')} \big),$

with $\Phi^{(\ell:\ell')}$ empty when $\ell > \ell'$. Consider the matrix $L \in \mathbb{R}^{k \times k}$ of elementwise coefficient lags defined by

$L_{ij} = \max \big\{ \ell : \Phi^{(\ell)}_{ij} \neq 0 \big\},$

in which we define $L_{ij} = 0$ if $\Phi^{(\ell)}_{ij} = 0$ for all $\ell$. Therefore, each $L_{ij}$ denotes the maximal coefficient lag (maxlag) for component $j$ in the regression model for component $i$. In particular, $L_{ij}$ is the smallest $\ell$ such that $\Phi^{([\ell+1]:p)}_{ij} = 0$. Note that the maxlag matrix $L$ is not symmetric, in general. There are numerous hierarchical lag structures that one can consider within the context of the model. The simplest such structure is that $L_{ij} = L$ for all $i$ and $j$, meaning that there is a universal (U) maxlag $L$ that is shared by every pair of components. Expressed in terms of Equation (2.1), this would say that $\Phi^{(L)} \neq 0$ and that $\Phi^{(\ell)} = 0$ for all $\ell > L$. While the methodology we introduce can be easily extended to this and many other potential hierarchical lag structures, in this paper we focus on the following three fundamental structures.
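As an illustration, the maxlag matrix can be computed from a true (or estimated) coefficient array. A minimal sketch, assuming the lag matrices are stacked in an array `Phi` of shape `(p, k, k)` (the name and layout are ours, not the paper's):

```python
import numpy as np

def maxlag_matrix(Phi):
    """Elementwise maxlag matrix L from stacked lag coefficients.

    Phi : array of shape (p, k, k); Phi[ell - 1] is the lag-ell
          coefficient matrix.
    L[i, j] is the largest ell with Phi[ell - 1, i, j] != 0,
    or 0 if the (i, j) coefficient is zero at every lag.
    """
    p = Phi.shape[0]
    nonzero = Phi != 0                        # (p, k, k) boolean
    lags = np.arange(1, p + 1)[:, None, None]  # lag index per slice
    return (nonzero * lags).max(axis=0)        # elementwise max over lags
```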

Componentwise (C). A componentwise hierarchical lag structure allows each of the $k$ marginal equations from (2.1) to have its own maxlag, but all components within each equation must share the same maximal lag:

$L_{i1} = \cdots = L_{ik} \quad \text{for each } i = 1, \ldots, k.$

Hence in Equation (2.1), writing $L_i$ for this common row maxlag, this implies $\Phi^{(L_i)}_{i \cdot} \neq 0$ and $\Phi^{(\ell)}_{i \cdot} = 0$ for all $\ell > L_i$ and all $i$. This componentwise hierarchical lag structure is illustrated in Figure 3.

Own-Other (O). The own-other hierarchical lag structure is similar to the componentwise structure, but with an added within-lag hierarchy that imposes the mild assumption that a series’ own lags ($\Phi^{(\ell)}_{ii}$) are more informative than other lags ($\Phi^{(\ell)}_{ij}$, $j \neq i$). Thus, diagonal elements are prioritized before off-diagonal elements within each lag, componentwise (i.e., row-wise). In particular,

$L_{ij} = L_{ij'} \ \text{ for all } j, j' \neq i, \quad \text{and} \quad L_{ii} \in \{ L_{ij}, L_{ij} + 1 \} \ \text{ for } j \neq i.$

This hierarchical lag structure allows each component of $y_t$ to have longer-range lagged self-dependence than lagged cross-dependencies. This own-other hierarchical lag structure is illustrated in Figure 3.

Elementwise (E). Finally, we consider a completely flexible hierarchical lag structure in which the elements of $L$ have no stipulated relationships. This elementwise hierarchical lag structure is illustrated in Figure 3.
In the next section, we introduce the proposed class of HVAR estimators, which shrink the elements of $L$ toward zero by incorporating the three hierarchical lag structures described above.
2.2 HVAR: Hierarchical Group Lasso for Lag Structured VAR Modeling
In this section, we introduce convex penalties specifically tailored for attaining the three lag structures presented in the previous section. Our primary modeling tool is the hierarchical group lasso (Zhao09), which is a group lasso (yuan) with a nested group structure. The group lasso is a sum of (unsquared) Euclidean norms and is used in statistical modeling as a penalty to encourage groups of parameters to be set to zero simultaneously. Using nested groups leads to hierarchical sparsity constraints in which one set of parameters being zero implies that another set is also zero. This penalty has been applied to multiple statistical problems including regression models with interactions (Zhao09, Jenatton10, Radchenko10, Bach12, Bien13, lim2013learning, haris2014convex, She2014), covariance estimation (bien2014convex), additive modeling (lou2014sparse), and time series (suo2014ordered). This last work focuses on transfer function estimation, in this case, scalar regression with multiple time-lagged covariates whose coefficients decay with lag.
For each hierarchical lag structure presented above, we propose an estimator based on a convex optimization problem.

The componentwise HVAR objective targets the componentwise hierarchical lag structure and is defined by

(2.3)  $\min_{\nu, \Phi} \; \tfrac{1}{2} \| Y - \nu 1_T^\top - \Phi Z \|_F^2 + \lambda \sum_{i=1}^{k} \sum_{\ell=1}^{p} \| \Phi^{(\ell:p)}_{i} \|_2,$

in which $\Phi^{(\ell:p)}_{i}$ denotes the $i$th row of $\Phi^{(\ell:p)}$ and $\|A\|_2$ denotes the Euclidean norm of $\mathrm{vec}(A)$, for a matrix $A$. As the penalty parameter $\lambda \geq 0$ is increased, we have $\hat\Phi^{(\ell:p)}_{i} = 0$ for more $i$, and for smaller $\ell$. This componentwise hierarchical group structure builds in the condition that if $\hat\Phi^{(\ell)}_{i} = 0$, then $\hat\Phi^{(\ell')}_{i} = 0$ for all $\ell' > \ell$, for each $i$. This structure favors lower maxlag models componentwise, rather than simply giving sparse estimates with no particular structure.

The own-other HVAR objective aims for an own-other hierarchical lag structure and is defined by

(2.4)  $\min_{\nu, \Phi} \; \tfrac{1}{2} \| Y - \nu 1_T^\top - \Phi Z \|_F^2 + \lambda \sum_{i=1}^{k} \sum_{\ell=1}^{p} \Big[ \big\| \Phi^{(\ell:p)}_{i} \big\|_2 + \big\| \big( \Phi^{(\ell)}_{i,-i}, \Phi^{([\ell+1]:p)}_{i} \big) \big\|_2 \Big],$

in which $\Phi^{(\ell)}_{i,-i} = \{ \Phi^{(\ell)}_{ij} : j \neq i \}$, and where we adopt the convention that $\Phi^{([p+1]:p)}_{i}$ is empty. The first term in this penalty is identical to that of (2.3). The difference is the addition of the second penalty term, which is just like the first except that it omits $\Phi^{(\ell)}_{ii}$. This penalty allows sparsity patterns in which the influence of component $i$ on itself may be nonzero at lag $\ell$ even though the influence of other components is thought to be zero at that lag. This model ensures that, for all $\ell$, $\hat\Phi^{(\ell)}_{i,-i} \neq 0$ implies $\hat\Phi^{(\ell)}_{ii} \neq 0$, and $\hat\Phi^{(\ell+1)}_{ii} \neq 0$ implies $\hat\Phi^{(\ell)}_{i,-i} \neq 0$. This accomplishes the desired own-other hierarchical lag structure such that $\hat L_{ij} \le \hat L_{ii} \le \hat L_{ij} + 1$ for all $j \neq i$, componentwise.

The elementwise HVAR objective aims for an elementwise hierarchical lag structure and is defined by

(2.5)  $\min_{\nu, \Phi} \; \tfrac{1}{2} \| Y - \nu 1_T^\top - \Phi Z \|_F^2 + \lambda \sum_{i=1}^{k} \sum_{j=1}^{k} \sum_{\ell=1}^{p} \| \Phi^{(\ell:p)}_{ij} \|_2.$

Here, each of the $k^2$ pairs of components can have its own maxlag, such that $\hat\Phi^{(\ell)}_{ij} = 0$ may occur at different values of $\ell$ for each pair $(i, j)$. While this model is the most flexible of the three, it also borrows the least strength across the different components. When the $L_{ij}$ differ for all $i$ and $j$, we expect this method to do well, whereas when, for example, $L_{i1} = \cdots = L_{ik}$ for each $i$, we expect it to be inefficient relative to (2.3).
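To make the three group structures concrete, the penalties in (2.3)–(2.5) can be evaluated directly on a stacked coefficient array. A sketch under the same `(p, k, k)` layout as before (function names hypothetical; the factor $\lambda$ is omitted):

```python
import numpy as np

def penalty_C(Phi):
    """Componentwise: sum over rows i and lags ell of ||Phi_i^(ell:p)||_2."""
    p, k, _ = Phi.shape
    return sum(np.linalg.norm(Phi[ell:, i, :])
               for i in range(k) for ell in range(p))

def penalty_O(Phi):
    """Own-other: componentwise groups plus groups omitting the own term."""
    p, k, _ = Phi.shape
    total = penalty_C(Phi)
    for i in range(k):
        for ell in range(p):
            others = np.delete(Phi[ell, i, :], i)  # Phi^(ell)_{i,-i}
            rest = Phi[ell + 1:, i, :]             # Phi_i^([ell+1]:p)
            total += np.sqrt(np.sum(others**2) + np.sum(rest**2))
    return total

def penalty_E(Phi):
    """Elementwise: sum over (i, j) pairs and lags of ||Phi_ij^(ell:p)||_2."""
    p, k, _ = Phi.shape
    return sum(np.linalg.norm(Phi[ell:, i, j])
               for i in range(k) for j in range(k) for ell in range(p))
```

Each Euclidean norm here is over one nested group; zeroing a group at lag $\ell$ zeroes everything deeper in the hierarchy.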
Since all three objectives are based on hierarchical group lasso penalties, a unified computational approach to solve each is detailed in the next section.
3 Optimization Algorithm
We begin by noting that since the intercept $\nu$ does not appear in the penalty terms, it can be removed if we replace $Y$ by its row-centered version $Y - \bar{y} 1_T^\top$ and $Z$ by $Z - \bar{z} 1_T^\top$. All three optimization problems are of the form

(3.1)  $\min_{\Phi} \; \tfrac{1}{2} \| Y - \Phi Z \|_F^2 + \lambda \, \Omega(\Phi),$

and (2.3), (2.4), and (2.5) only differ in the form of the penalty $\Omega$. A key simplification is possible by observing that the objective above decouples across the rows of $\Phi$:

$\tfrac{1}{2} \| Y - \Phi Z \|_F^2 + \lambda \, \Omega(\Phi) = \sum_{i=1}^{k} \Big[ \tfrac{1}{2} \| Y_i - \Phi_i Z \|_2^2 + \lambda \, \Omega_i(\Phi_i) \Big],$

in which $Y_i$ and $\Phi_i$ denote the $i$th rows of $Y$ and $\Phi$, and $\Omega_i$ collects the penalty terms involving row $i$. Hence, Equation (3.1) can be solved in parallel by solving the “one-row” subproblem

(3.2)  $\min_{\Phi_i} \; \tfrac{1}{2} \| Y_i - \Phi_i Z \|_2^2 + \lambda \, \Omega_i(\Phi_i).$
jenatton show that hierarchical group lasso problems can be efficiently solved via the proximal gradient method. This procedure can be viewed as an extension of traditional gradient descent methods to nonsmooth objective functions. Given a convex objective function of the form $f(\beta) + \lambda \Omega(\beta)$, where $f$ is differentiable with a Lipschitz continuous gradient, the proximal gradient method produces a sequence $\{\beta^{(m)}\}$ whose suboptimality after $m$ iterations is $O(1/m)$ (cf. beck). Its update is given by

$\beta^{(m+1)} = \mathrm{prox}_{s \lambda \Omega} \big( \beta^{(m)} - s \nabla f(\beta^{(m)}) \big),$

where $s$ is an appropriately chosen step size and $\mathrm{prox}_{s \lambda \Omega}$ is the proximal operator of the function $s \lambda \Omega$, which is evaluated at the gradient step we would take if we were minimizing $f$ alone. The proximal operator is defined as the unique solution of a convex optimization problem involving $\Omega$ but not $f$:

(3.3)  $\mathrm{prox}_{s \lambda \Omega}(w) = \arg\min_{\beta} \; \tfrac{1}{2} \| w - \beta \|_2^2 + s \lambda \, \Omega(\beta).$

The proximal gradient method is particularly effective when the proximal operator can be evaluated efficiently. In our case, $\Omega_i$ is a sum of hierarchically nested Euclidean norms. jenatton show that for such penalties, the proximal operator has essentially a closed form solution, making it extremely efficient. It remains to note that $f(\Phi_i) = \tfrac{1}{2} \| Y_i - \Phi_i Z \|_2^2$ has gradient $\nabla f(\Phi_i) = -(Y_i - \Phi_i Z) Z^\top$, and that the step size can be determined adaptively through a backtracking procedure or set to the reciprocal of the Lipschitz constant of $\nabla f$, which in this case is $\sigma_{\max}(Z)^2$ (where $\sigma_{\max}(Z)$ denotes the largest singular value of $Z$).
beck develop an accelerated version of the proximal gradient method, which they call the Fast Iterative Soft-Thresholding Algorithm (FISTA). This leads to a faster convergence rate and improved empirical performance with minimal additional overhead. Our particular implementation is based on Algorithm 2 of tseng2008accelerated. It repeats, for $m = 1, 2, \ldots$ until convergence, a proximal gradient step taken at an extrapolated point,

$\beta^{(m)} = \mathrm{prox}_{s \lambda \Omega} \big( \theta^{(m)} - s \nabla f(\theta^{(m)}) \big), \qquad \theta^{(m+1)} = \beta^{(m)} + \tfrac{m-1}{m+2} \big( \beta^{(m)} - \beta^{(m-1)} \big),$

and converges at an $O(1/m^2)$ rate (compared to the unaccelerated proximal gradient method’s $O(1/m)$ rate). The full procedure is detailed in Algorithm 1 and is applicable to all three HVAR estimators.
The algorithms for these methods differ only in the evaluation of their proximal operators (since each method has a different penalty $\Omega_i$). However, all three choices of $\Omega_i$ correspond to hierarchical group lasso penalties, allowing us to use the result of jenatton, which shows that the proximal operator has a remarkably simple form. We write these three problems generically as

(3.4)  $\min_{\beta} \; \tfrac{1}{2} \| w - \beta \|_2^2 + s \lambda \sum_{h=1}^{H} \| \beta_{g_h} \|_2,$

where $g_1 \supseteq g_2 \supseteq \cdots \supseteq g_H$ are nested groups of coefficients. The key observation in jenatton is that the dual of the proximal problem (3.3) can be solved exactly in a single pass of blockwise coordinate descent. By strong duality, this solution to the dual provides us with a solution to problem (3.3). Furthermore, the updates of each block are extremely simple, corresponding to a groupwise soft-thresholding operation. Algorithm 2 shows the solution to (3.4), which includes all three of our penalties as special cases.
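The single-pass structure can be illustrated on one nested chain of suffix groups, as in the elementwise penalty restricted to a single $(i, j)$ pair. The following sketch (hypothetical name `prox_nested`, with `tau` playing the role of $s\lambda$) applies groupwise soft-thresholding from the innermost group outward:

```python
import numpy as np

def prox_nested(v, tau):
    """Proximal operator of tau * sum_{ell} ||beta[ell:]||_2 for the
    nested (suffix) groups {p}, {p-1, p}, ..., {1, ..., p}, computed
    in one pass from the innermost group outward, one groupwise
    soft-threshold per group (the single-pass property of jenatton).
    """
    beta = np.array(v, dtype=float)
    p = beta.size
    for ell in range(p - 1, -1, -1):        # innermost group first
        nrm = np.linalg.norm(beta[ell:])
        scale = max(0.0, 1.0 - tau / nrm) if nrm > 0 else 0.0
        beta[ell:] *= scale                 # groupwise soft-threshold
    return beta
```

Note that the output's support is always a prefix of the lags: once a coefficient is zeroed, all deeper lags are zeroed as well, which is exactly the hierarchical constraint.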
4 Simulation Study
In this section we compare the proposed HVAR methods with competing VAR modeling approaches. After detailing these comparison methods below, we describe three simulation scenarios and discuss the forecast and lag order selection performance of each estimator. Finally, we examine the performance of the proposed HVAR methods in a low-dimensional simulation setting while allowing the maximal lag order to vary.
4.1 Comparison Methods
A standard method in lower dimensional settings is to fit a VAR$_k(\ell)$ with least squares for $\ell = 0, 1, \ldots, p$ and then to select a universal lag order using AIC or BIC. Per Lutk, the AIC and BIC of a VAR$_k(\ell)$ are defined as

(4.1)  $\mathrm{AIC}(\ell) = \log \det \tilde\Sigma^{(\ell)}_u + \frac{2}{T} \ell k^2,$

(4.2)  $\mathrm{BIC}(\ell) = \log \det \tilde\Sigma^{(\ell)}_u + \frac{\log T}{T} \ell k^2,$

in which $\tilde\Sigma^{(\ell)}_u$ is the residual sample covariance matrix having used least squares to fit the VAR$_k(\ell)$. The lag order that minimizes AIC or BIC is selected. This method of lag order selection is only possible when $k\ell < T$ since otherwise least squares is not well-defined. In the first set of simulations that follow, least squares is not feasible across the full range of candidate lag orders, thus for a simple benchmark we instead estimate a VAR$_k(1)$ by least squares, minimizing

$\sum_{t=1}^{T} \big\| y_t - \nu - \Phi^{(1)} y_{t-1} \big\|_2^2$

over $\nu$ and $\Phi^{(1)}$. We also include two well-known regularization approaches. The lasso estimates the VAR using an $\ell_1$ penalty:

$\min_{\nu, \Phi} \; \tfrac{1}{2} \| Y - \nu 1_T^\top - \Phi Z \|_F^2 + \lambda \| \Phi \|_1,$

where $\| \Phi \|_1$ denotes $\sum_{i,j,\ell} | \Phi^{(\ell)}_{ij} |$. The lasso does not intrinsically consider lag order, hence BickelSong propose a lag-weighted lasso penalty in which a weighted $\ell_1$ penalty is used with weights that increase geometrically with lag order:

$\min_{\nu, \Phi} \; \tfrac{1}{2} \| Y - \nu 1_T^\top - \Phi Z \|_F^2 + \lambda \sum_{\ell=1}^{p} w_\ell(\gamma) \| \Phi^{(\ell)} \|_1.$

The tuning parameter $\gamma$ determines how fast the weights $w_\ell(\gamma)$ increase with lag. While this form of penalty applies greater regularization to higher order lags, it is less structured than our HVAR penalties in that it does not necessarily produce sparsity patterns in which all coefficients beyond a certain lag order are zero.
Finally, we compare against two naive approaches to serve as simple baselines: the unconditional sample mean corresponds to the intercept-only model,

$y_t = \nu + u_t,$

which makes one-step-ahead forecasts of the form $\hat{y}_{t+1} = \bar{y}$; and the vector random walk model, which corresponds to

$y_t = y_{t-1} + u_t$

and makes one-step-ahead forecasts of the form $\hat{y}_{t+1} = y_t$.
4.2 Simulation Settings
In order to demonstrate the efficacy of the HVAR methods in applications with various lag structures, we evaluate forecasts produced by the proposed methods under several simulation scenarios. In these scenarios, we have $k = 60$ components, a maximal lag order of $p$, and a series length of $T$; the error covariance is taken to be $\Sigma_u$. All simulations are generated from stationary coefficient matrices. The steps taken to ensure the stationarity of the simulation structures are described in detail in Section LABEL:SimScenGen of the appendix. Simulation results are based on 100 iterations.
The penalty parameters were selected using the rolling cross-validation approach utilized by BickelSong and BGR, with the middle third of the data used for penalty parameter selection and the last third for forecast evaluation. In the case of the lag-weighted lasso, $\lambda$ and $\gamma$ were jointly selected. Given an evaluation period $\{T_1, \ldots, T_2\}$, we use mean-squared one-step-ahead forecast error (MSFE) as a measure of forecast performance:

$\mathrm{MSFE} = \frac{1}{k (T_2 - T_1 + 1)} \sum_{i=1}^{k} \sum_{t=T_1}^{T_2} \big( \hat{y}_{i,t} - y_{i,t} \big)^2,$

where $\hat{y}_{i,t}$ denotes the forecast of a method for time $t$ and component $i$ based on observing the series up to time $t - 1$.
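The rolling evaluation can be sketched as follows, assuming the forecaster is supplied as a callable mapping the observed history to a one-step-ahead forecast (names and interface hypothetical):

```python
import numpy as np

def rolling_msfe(y, fit, t0):
    """Rolling one-step-ahead mean-squared forecast error.

    y   : array of shape (k, n); columns are time points.
    fit : callable mapping the history y[:, :t] to a forecast of y[:, t].
    t0  : first forecast origin; times t0, ..., n-1 are evaluated.
    """
    k, n = y.shape
    errs = []
    for t in range(t0, n):
        yhat = fit(y[:, :t])            # forecast y_t from y_1, ..., y_{t-1}
        errs.append(np.sum((yhat - y[:, t]) ** 2))
    return np.sum(errs) / (k * (n - t0))
```

For instance, passing `lambda h: h.mean(axis=1)` as `fit` reproduces the sample-mean baseline, and `lambda h: h[:, -1]` the random walk.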
We evaluate the methods under three lag structures.
Simulation Scenario 1: Componentwise Lag Structure. In this scenario, we simulate according to a componentwise HVAR structure. In particular, we choose a maxlag matrix $L$ that is row-wise constant, meaning that all components within a row have the same maxlag; we partition the rows into 5 groups of size 12, with each group taking on a distinct maxlag. A coefficient matrix with maxlag matrix $L$ is used in Scenario 1’s simulations and its magnitudes are depicted in Figure 8. The prediction performance of the methods under study is shown in Table 1.
Table 1: Out-of-sample MSFE in Scenario 1 (standard errors in parentheses).

Class   Method              MSFE
HVAR    Componentwise       0.0464 (0.0005)
HVAR    Own-other           0.0470 (0.0005)
HVAR    Elementwise         0.0546 (0.0005)
VAR     Lasso               0.0598 (0.0006)
VAR     Lag-weighted lasso  0.0547 (0.0006)
VAR     Least squares       0.1599 (0.0020)
Other   Sample mean         0.1295 (0.0034)
Other   Random walk         0.3054 (0.0107)
The componentwise and own-other HVAR methods perform best, which is to be expected since both methods are geared explicitly toward a lag structure as in Scenario 1. The lag-weighted lasso and elementwise HVAR perform similarly, and both of them do better than the regular lasso. With $k^2 p$ coefficient parameters to estimate, the methods that assume an ordering are greatly advantaged over a method like the lasso that does not exploit this knowledge. One exception is the VAR$_k(1)$ model fit using least squares: despite this method’s explicit orientation toward modeling recent behavior, it suffers both because it misses important longer range lag coefficients and because it is an unregularized estimator of $\Phi^{(1)}$ and therefore has high variance.
Simulation Scenario 2: Own-Other Lag Structure. In this scenario, we create the coefficient matrix in such a manner that it differentiates between own and other coefficients. The coefficients on a series’ “own lags” (i.e., $\Phi^{(\ell)}_{ii}$) are larger in magnitude than those on “other lags” (i.e., $\Phi^{(\ell)}_{ij}$ with $j \neq i$). The magnitude of coefficients decreases as the lag order increases. The own-other HVAR model we simulate is depicted in Figure 5. The first 20 rows can be viewed as univariate autoregressive models in which only the own term is nonzero; in the next 20 rows, at lower lags the coefficient on a series’ own lags is larger than those on “other lags,” while at higher lags only own coefficients are nonzero; the final 20 rows have nonzeros throughout the lower lags, with own coefficients dominating other coefficients in magnitude.
The results from this scenario are shown in Table 2. Here, all three HVAR methods lead the competing methods. As one would expect, the own-other HVAR procedure achieves the best forecasting performance, with the componentwise HVAR performing only slightly worse. We find that the lag-weighted lasso performs worse than the elementwise HVAR, but much better than the lasso without weights. As with the previous scenario, the least-squares approach is not competitive.
Table 2: Out-of-sample MSFE in Scenario 2 (standard errors in parentheses).

Class   Method              MSFE
HVAR    Componentwise       0.0193 (0.0001)
HVAR    Own-other           0.0183 (0.0001)
HVAR    Elementwise         0.0210 (0.0002)
VAR     Lasso               0.0270 (0.0003)
VAR     Lag-weighted lasso  0.0228 (0.0003)
VAR     Least squares       0.0544 (0.0007)
Other   Sample mean         0.2948 (0.0178)
Other   Random walk         0.1381 (0.0086)
Simulation Scenario 3: Elementwise Lag Structure. In this final scenario, we simulate under an elementwise HVAR model, meaning that the maxlag is allowed to vary not just across rows of the maxlag matrix $L$ but also within rows. A coefficient matrix corresponding to this lag structure is depicted in Figure 6, and the results are shown in Table 3.
Table 3: Out-of-sample MSFE in Scenario 3 (standard errors in parentheses).

Class   Method              MSFE
HVAR    Componentwise       0.0323 (0.0007)
HVAR    Own-other           0.0335 (0.0007)
HVAR    Elementwise         0.0295 (0.0004)
VAR     Lasso               0.0379 (0.0007)
VAR     Lag-weighted lasso  0.0380 (0.0007)
VAR     Least squares       0.1071 (0.0016)
Other   Sample mean         0.0915 (0.0038)
Other   Random walk         0.1253 (0.0039)
As expected, the elementwise HVAR method outperforms all others. The chosen maxlag matrix $L$ violates the own-other lag structure in 45 of the 60 rows (and it violates the componentwise lag structure in every row). Even so, the componentwise and own-other HVAR methods outperform the lasso and lag-weighted lasso methods.
4.3 Lag Order Selection
While our primary intent in introducing the HVAR framework is better forecast performance and improved interpretability, one can also view HVAR as an approach for selecting lag order. Below, we examine the performance of the proposed methods in estimating the maxlag matrix $L$ defined in Section 2.1. Based on an estimate $\hat\Phi$ of the autoregressive coefficients, we can likewise define a matrix of estimated lag orders:

$\hat{L}_{ij} = \max \big\{ \ell : \hat\Phi^{(\ell)}_{ij} \neq 0 \big\},$

where we define $\hat{L}_{ij} = 0$ if $\hat\Phi^{(\ell)}_{ij} = 0$ for all $\ell$.
It is well known in the regularized regression literature (cf. leng2006note) that the optimal tuning parameter for prediction differs from that for support recovery. Nonetheless, in this section we proceed with the rolling cross-validation procedure used previously, with only two minor modifications intended to ameliorate the tendency of cross-validation to select a value of $\lambda$ that is smaller than optimal for support recovery. First, we cross-validate a relaxed version of the regularized methods in which the estimated nonzero coefficients are refit using ridge regression. This refitting procedure is described in detail in Section LABEL:RVAR in the appendix. This modification makes the MSFE more sensitive to $\hat{L}$ being larger than necessary. Second, we use the “one-standard-error rule” heuristic discussed in elements, in which we select the largest value of $\lambda$ whose MSFE is no more than one standard error above that of the best performing model (since we favor the most parsimonious model that does approximately as well as any other).
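The one-standard-error heuristic itself is a one-liner over the cross-validation path; a sketch (hypothetical name `one_se_rule`, taking the candidate penalty values with their cross-validated MSFEs and standard errors):

```python
import numpy as np

def one_se_rule(lambdas, msfes, ses):
    """Pick the largest (most regularizing) lambda whose cross-validated
    MSFE is within one standard error of the best model's MSFE."""
    lambdas = np.asarray(lambdas, dtype=float)
    msfes = np.asarray(msfes, dtype=float)
    best = np.argmin(msfes)                         # best-performing model
    cutoff = msfes[best] + np.asarray(ses, dtype=float)[best]
    return lambdas[msfes <= cutoff].max()           # most parsimonious fit
```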
We measure a procedure’s lag order selection performance based on the sum of absolute differences between $\hat{L}$ and $L$:

$\sum_{i=1}^{k} \sum_{j=1}^{k} \big| \hat{L}_{ij} - L_{ij} \big|.$

In particular, Tables 4, 5, and 6 report this value for various methods relative to that of the sample mean (which chooses $\hat{L}_{ij} = 0$ for all $i$ and $j$).
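The reported quantity can be sketched directly from the two maxlag matrices (function name hypothetical; the sample-mean baseline's error is simply the sum of the true maxlags):

```python
import numpy as np

def lag_selection_error(L_hat, L_true, relative_to_mean=True):
    """Sum of absolute differences between estimated and true maxlag
    matrices; optionally reported relative to the sample-mean baseline,
    which sets every estimated maxlag to zero."""
    err = np.abs(np.asarray(L_hat) - np.asarray(L_true)).sum()
    if relative_to_mean:
        err = err / np.abs(np.asarray(L_true)).sum()  # baseline error
    return err
```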
Table 4: Lag order selection error in Scenario 1, relative to the sample mean (standard errors in parentheses).

Class   Method              Relative error
HVAR    Componentwise       0.5898 (0.0495)
HVAR    Own-other           0.5792 (0.0447)
HVAR    Elementwise         0.9684 (0.0034)
VAR     Lasso               0.9968 (0.0013)
VAR     Lag-weighted lasso  0.9936 (0.0023)
Other   Sample mean         1.0000 (0.0000)
In Scenario 1, the own-other and componentwise HVARs achieve the best performance; every other approach scarcely outperforms the benchmark. The fact that the own-other and componentwise HVARs perform best is no surprise given that they both are geared toward a lag structure as in Scenario 1.
Table 5: Lag order selection error in Scenario 2, relative to the sample mean (standard errors in parentheses).

Class   Method              Relative error
HVAR    Componentwise       0.5376 (0.0060)
HVAR    Own-other           0.4329 (0.0061)
HVAR    Elementwise         0.9561 (0.0009)
VAR     Lasso               1.0395 (0.0023)
VAR     Lag-weighted lasso  1.0551 (0.0031)
Other   Sample mean         1.0000 (0.0000)
In Scenario 2, the own-other HVAR achieves the best performance followed by the componentwise HVAR; the elementwise HVAR performs much worse than these but still better than all other methods.
Table 6: Lag order selection error in Scenario 3, relative to the sample mean (standard errors in parentheses).

Class   Method              Relative error
HVAR    Componentwise       1.3260 (0.0382)
HVAR    Own-other           1.2573 (0.0365)
HVAR    Elementwise         0.8782 (0.0035)
VAR     Lasso               1.0121 (0.0020)
VAR     Lag-weighted lasso  1.0144 (0.0043)
Other   Sample mean         1.0000 (0.0000)
Interestingly, in Scenario 3, the elementwise HVAR approach is the only method to perform better than the sample-mean baseline. We see that the HVAR methods that incorrectly assume that the maxlag should be constant (or near constant) within a row pay a price in lag order selection, making them even worse than the lasso methods.
4.4 Simulation Scenario 4: Robustness of HVAR as $p$ increases

We additionally examine the impact of the upper bound $p$ for the maximal lag order on HVAR’s performance. Ideally, provided that $p$ is large enough to capture the dynamics of the system, its choice should have little impact on forecast performance. However, we should expect regularizers that treat each coefficient democratically, such as the lasso, to experience degraded forecast performance as $p$ increases.
As an experiment, we simulate from a sparse VAR with $k = 10$ components while increasing the upper bound $p$ on the maximal lag order to substantially exceed the true maxlags. All series in the first 4 rows share one maxlag, the next 3 rows share a second, and the final 3 rows share a third. Figure 7 depicts the coefficient matrix and its magnitudes.

We consider varying $p \in \{1, 5, 12, 25, 50\}$. As $p$ increases, we should expect the performance of the HVAR procedures to remain relatively constant whereas the lasso and information-criterion based methods should return worse forecasts.
Table 7: Out-of-sample MSFE in Scenario 4 as the upper bound p on the maximal lag order varies (standard errors in parentheses).

Class   Method              p=1              p=5              p=12             p=25             p=50
HVAR    Componentwise       0.0141 (0.0010)  0.0119 (0.0077)  0.0120 (0.0008)  0.0122 (0.0009)  0.0125 (0.0010)
HVAR    Own-other           0.0142 (0.0010)  0.0120 (0.0008)  0.0121 (0.0008)  0.0123 (0.0008)  0.0126 (0.0010)
HVAR    Elementwise         0.0142 (0.0011)  0.0120 (0.0008)  0.0121 (0.0008)  0.0123 (0.0008)  0.0126 (0.0009)
VAR     Lasso               0.0142 (0.0011)  0.0125 (0.0008)  0.0138 (0.0010)  0.0156 (0.0013)  0.0190 (0.0016)
VAR     Lag-weighted lasso  0.0143 (0.0011)  0.0121 (0.0008)  0.0122 (0.0008)  0.0123 (0.0015)  0.0129 (0.0012)
VAR     AIC                 0.0141 (0.0010)  0.0117 (0.0080)  0.0535 (0.0070)  0.0781 (0.0121)  0.0855 (0.0130)
VAR     BIC                 0.0141 (0.0010)  0.0140 (0.0011)  0.0141 (0.0011)  0.0142 (0.0012)  0.0144 (0.0013)
At $p = 1$, all models are misspecified. Since no method is capable of capturing the true dynamics of the longer-lag series in Figure 8, all perform poorly. As expected, setting aside $p = 1$, the componentwise HVAR achieves the best performance across all remaining choices of $p$, although the own-other and elementwise HVAR performances are within one standard error. Among the information-criterion based methods, AIC performs substantially worse than BIC as $p$ increases. This is likely the result of BIC assigning a larger penalty on the number of coefficients than AIC. The lasso’s performance degrades substantially as the upper bound on the lag order increases, while the lag-weighted lasso is somewhat more robust to the lag order, but it still achieves worse forecasts than every HVAR procedure for all choices of $p$ except $p = 1$, where it performs comparably.
5 Data Analysis
5.1 Macroeconomic Application
We now apply the proposed HVAR estimation methods to a collection of US macroeconomic time series compiled by stockdataset and augmented by koop. The full dataset contains 168 quarterly macroeconomic indicators over 45 years, representing information about many aspects of the US economy, including income, employment, stock prices, exchange rates, etc. The full list of variables considered is available in koop, who defines various nested groups of components:

Small ($k = 3$): Federal Funds Rate, CPI, and GDP growth rate; this core group is commonly considered in basic Dynamic Stochastic General Equilibrium modeling;

Medium ($k = 20$): Small group plus 17 additional variables, including indices for consumption, labor, and housing, as well as exchange rates;

Medium-Large ($k = 40$): Medium group plus 20 additional aggregate variables;

Large ($k = 168$): Medium-Large group plus 128 additional variables, consisting primarily of the components that make up the aggregated variables (e.g., subsets of Gross Domestic Product, Bond Interest Rates, Industrial Production, etc.).
Forecast Comparisons. We initially focus on forecasting the Medium-Large ($k = 40$) and Large ($k = 168$) groups. We apply the transformation codes provided by stockdataset to make the data approximately stationary, then we standardize each series to have sample mean zero and variance one. Quarter 3 of 1977 to Quarter 3 of 1992 is used for penalty parameter selection while Quarter 4 of 1992 to Quarter 4 of 2007 is used for forecast performance comparisons. Following the convention from BGR, we set the maximal lag order to $p = 13$. In the Large group, least squares VAR with lag order selected by AIC or BIC is overparameterized and not included.
The rolling out-of-sample one-step-ahead mean-squared forecast errors (MSFE) for the Medium-Large and Large groups are summarized in Table 5.1. The proposed HVAR methods outperformed the lasso, least squares, and both information-criterion based models for the Medium-Large group over this evaluation period. Among the HVAR methods, the more flexible own-other and elementwise structures performed similarly, and better than the componentwise structure. The sample mean and random walk forecast results are provided for comparison.