Low Rank and Structured Modeling of Highdimensional Vector Autoregressions
Abstract
Network modeling of highdimensional time series data is a key learning task due to its widespread use in a number of application areas, including macroeconomics, finance and neuroscience. While the problem of sparse modeling based on vector autoregressive models (VAR) has been investigated in depth in the literature, more complex network structures that involve low rank and group sparse components have received considerably less attention, despite their presence in data. Failure to account for lowrank structures results in spurious connectivity among the observed time series, which may lead practitioners to draw incorrect conclusions about pertinent scientific or policy questions. In order to accurately estimate a network of Granger causal interactions after accounting for latent effects, we introduce a novel approach for estimating lowrank and structured sparse highdimensional VAR models. We introduce a regularized framework involving a combination of nuclear norm and lasso (or group lasso) penalty. Further, and subsequently establish nonasymptotic upper bounds on the estimation error rates of the lowrank and the structured sparse components. We also introduce a fast estimation algorithm and finally demonstrate the performance of the proposed modeling framework over standard sparse VAR estimates through numerical experiments on synthetic and real data.
Keywords: lasso, group lasso, nuclear norm, low rank, vector autoregression, probabilistic bounds, identifiability, fast algorithm.
1 Introduction
The problem of learning the network structure among a large set of time series arises in many signal processing, economic, finance and biomedical applications. Examples include processing signals obtained from radars [1, 2], macroeconomic policy making and forecasting [3], assessing connectivity among financial firms [4], reconstructing gene regulatory interactions from timecourse genomic data [5] and understanding connectivity between brain regions from fMRI measurements [6]. Vector Autoregressive (VAR) models provide a principled framework for these tasks.
Formally, a VAR model for dimensional time series is defined in its simplest possible form involving a single timelag as
(1) 
where is a transition matrix specifying the leadlag cross dependencies among the time series and is a zero mean error process. VAR models for small number of time series (lowdimensional) have been thoroughly studied in the literature [7]. However, the above mentioned applications, where dozens to hundreds of time series are involved, created the need for the study of VAR models under high dimensional scaling and the assumption that their interactions are sparse to compensate for the possible lack of adequate number of time points (samples; see [8] and references therein). There has been a growing body of literature on sparse estimation of large scale VAR models [39], including alternative penalties beyond the popular penalty (lasso), such as the Berhu regularization introduced in [9], group lasso type penalties employed in [22, 21], as well as nonconvex penalties akin to a squarelasso one investigated in [11]. Further, [10] examine estimation of the transition matrix and the inverse covariance matrix of the error process through a joint sparse penalty. Note that the problem of sparse estimation of these two model parameters separately from a least squares and maximum likelihood viewpoints is addressed in [8, 3], respectively, where in addition probabilistic finite sample error bounds for the obtained estimates are obtained.
Nevertheless, there are occasions where the sparsity assumption may not be sufficient. For example, during financial crisis periods, returns on assets move together in a more concerted manner [4, 12], while transcription factors regulate a large number of genes that may lead to hubnode network structures [13]. Similarly, in brain connectivity networks, particular tasks activate a number of regions that cross talk in a collaborative manner [14]. Hence, it is of interest to study VAR models under high dimensional scaling where the transition matrix governing the temporal dynamics exhibits a more complex structure; e.g. it is low rank and/or (group) sparse.
In a lowdimensional regime, where the number of time points scales to infinity, but the number of time series under study remains fixed, [15] examined asymptotic properties of VAR models, where the parameters exhibit reduced rank structure and also discussed connections with canonical correlation analysis of such models presented in [16]. Specifically, the transition matrix in (2) can be written as the product of two rank matrices , i.e. , so that in the resulting model specification of the original time series is expressed as linear combinations of the original ones, and specifies the dependence between and ; namely . Hence, to obtain and [15] suggest to estimate the parameters of the original model in (1) under the constraint that and that rank. Other works include low rank approximations of Hankel matrices that represent the inputoutput structure of a linear time invariant systems and were studied in [17, 18]. Finally, a brief mention to the possibility that the VAR transition matrix may exhibit such a structure appeared as a motivating example in [19].
On the other hand, there is a mature literature on imposing low rank plus sparse, or pure group sparse structure for many learning tasks for independent and identically distributed (i.i.d.) data. Examples include group sparsity in regression and graphical modeling [20], low rank and sparse matrix approximations for dimension reduction [18], etc. However, as shown in [8], the presence of temporal dependence across observations induces intricate dependencies between both rows and columns of the design matrix of the corresponding least squares estimation problem, as well as between the design matrix and the error term, that require careful handling to establish consistency properties for the model parameters under sparsity and high dimensional scaling. These issues are further compounded when more complex regularizing norms are involved, as discussed in [21]. In this paper, the authors model grouping structures within each column of , but do not consider a lowrank component. In contrast, we focus on groups potentially spanning across different columns and allow a lowrank component in .
More recently, [37] and [38] extended the framework of [18] beyond decomposition of an observable matrix to Gaussian process identification by assuming a lowrank plus sparse structure on the inverse spectral density and the transfer function of a general VAR() system, respectively. Our work is complementary to this recent developments. We directly model the transition matrix of a VAR(1) process that enables us to identify a directed network of (group) sparse Granger causal relationships that are of interest in a number of applications; e.g. in financial economics where firms with higher outdegree are of particular interest in measuring systemic risk [12, 4]. Further, our approach explicitly address identifiability issues for extracting the respective lowrank and sparse components, which in turn are leveraged to obtain probabilistic error bounds that characterize the quality of their estimates. The latter provide insights to the practitioner on sample size requirements and tuning parameter selection for real data applications. Finally, note that our approach to the issue of identifiability builds on [19], wherein we characterize the degree of unidentifiability which guides in an explicit manner the selection of the tuning parameters used in the proposed optimization algorithm.
Further, to estimate the posited model in (1) with being lowrank and structured sparse (henceforth indicating that it could be either pure sparse or group sparse or both), we also introduce a fast accelerated proximal gradient algorithm, inspired by [30, 31], for the corresponding optimization problems. The key idea is that instead of searching for the local Lipschitz constant of the gradient of the smooth component of the objective function, the proposed algorithm utilizes a safeguarded BarzilaiBorwein (BB) initial stepsize [25] and employs relaxed line search conditions to achieve better performance in practice. The latter enables the selection of more “aggressive” stepsizes, while preserving the accelerated convergence rate of , where denotes the number of iterations required until convergence. Finally, the performance of the model parameters under different structures together with the associated estimation procedure based on the accelerated proximal gradient algorithm are calibrated on synthetic data, and illustrated on three data sets examining realized volatilities of stock prices of large financial firms before, during and after the US financial crisis.
Notation: Throughout the paper, we employ the following notation: , and denote the norm of a vector, the spectral norm and the Frobenius norm of a matrix, respectively. For a matrix , the symbol is used to denote the nuclear norm, i.e. , the sum of the singular values of a matrix, while denotes the conjugate transpose of a matrix . For any matrix , we use to denote , for and to denote . Further, if denote a partition of into nonoverlapping groups, then we use to denote , for , while denotes the number of nonzero groups in . Here, with a little abuse of notation, we use to denote . In addition, , denote the maximum and minimum eigenvalues of a symmetric or Hermitian matrix. For any integer , we use to denote the unit ball . We also use generically to denote unit vectors in , when is clear from the context. Finally, for positive real numbers , we write if there exists an absolute positive constant , independent of the model parameters, such that .
2 Model Formulation and Estimation Procedure
Consider a VAR(1) model where the transition matrix is lowrank plus structured sparse given by
(2)  
(3) 
where corresponds to the low rank component and represents either a sparse , or groupsparse component . It is further assumed that the number of nonzero elements in the sparse case is , while in the group sparse case the number of nonzero groups is , with and . The matrix captures persistence structure across all time series and enables the model to be applicable in settings where there are strong crossautocorrelations, a feature that the standard sparse VAR model is not equipped to handle. The sparse or group sparse component captures additional crosssectional autocorrelation structure among the time series. Finally, it is assumed that the error terms are serially uncorrelated. Our objective is to estimate and accurately based on a relatively small number of samples .
Stability. In order to ensure consistent estimation, we assume that the posited VAR model in (2) is stable; i.e. its characteristic polynomial satisfies on the unit circle of the complex plane . This is a common assumption in the literature of multivariate time series [7], required for consistency and asymptotic normality of lowdimensional VAR models. This assumption also ensures that the spectral density of the VAR model
(4) 
is bounded above in spectral norm.
It was shown in [8] that this condition is sufficient to establish consistency of some regularized VAR estimates of a sparse transition matrix. Further, the following quantities play a central role in the error bounds of the regularized estimates:
(5) 
As shown in [8], and together capture the narrowness of the spectral density of a time series. Processes with stronger temporal and crosssectional dependence have narrower spectra that in turn lead to slower convergence rates for the regularized estimates. For VAR models, and are related to and as follows:
(6) 
Proposition 2.2 in [8] provides a lower bound on . For the special structure of the models considered here, we can get an improved upper bound on , as shown in the following lemma:
Lemma 2.1.
For a stable VAR(1) model of the class (2), we have
(7) 
where is the largest singular value of , and .
Proof.
for any with . The result follows from the fact that . ∎
2.1 Estimation Procedure
The estimation of VAR model parameters is based on the following regression formulation (see [7]). Given consecutive observations from the VAR model, we work with the autoregressive design as follows:
(8) 
This is a standard regression problem with samples and variables. Our goal is to estimate , with high accuracy when .
There is an inherent identifiability issue in the estimation of the components and . Suppose the lowrank component itself is sparse or group sparse and the sparse or groupsparse component is of rank . In that scenario, we can not hope for any method to estimate and separately without imposing any further constraints. So, a minimal condition for lowrank and sparse or groupsparse recovery is that the low rank part should not be too sparse and the sparse or groupsparse part should not be lowrank.
This issue has been rigorously addressed in the literature (e.g. [18]) for independent and identically distributed data and resolved by imposing an incoherence condition. Such a condition is sufficient for exact recovery of the low rank and the sparse or groupsparse component by solving a convex program. In a recent paper, [19] showed that in a noisy setting where exact recovery of the two components is impossible, it is still possible to achieve good estimation error under a comparatively mild assumption. In particular, they formulated a general measure for the radius of nonidentifiability of the problem under consideration and established a nonasymptotic upper bound on the estimation error , which depend on this radius. The key idea is to allow for simultaneously sparse (or groupsparse) and lowrank matrices in the model, and control for the error introduced. We refer the readers to the above paper for a more detailed discussion on this notion of nonidentifiability. In this work, the lowrank plus sparse or groupsparse decomposition problem under restrictions on the radius of nonidentifiability takes the form
(9) 
Here (for sparse) or (for group sparse), represents or depending on sparsity or group sparsity of , and and are nonnegative tuning parameters controlling the regularizations of lowrank and sparse/groupsparse parts. The parameters and control for the degree of nonidentifiability of the matrices allowed in the model class. For instance, larger values of provide sparser estimates of and allow simultaneously sparse and lowrank components to be absorbed in . A smaller value of , on the other hand, tends to produce a matrix with smaller and pushes the simultaneously lowrank and sparse components to be absorbed in . In practice, we recommend choosing and in the range and , respectively. The issue of selecting them robustly in practice is discussed in Section 5.
Remark. On certain occasions, it may be useful to have both sparse m and groupsparse structures in the model, in addition to the low rank structure. We then have in (9) with . However, to guarantee the simultaneous identifiability of the sparse and groupsparse components from the lowrank component, stronger conditions need to be imposed on ; namely, .
Remark. Note that the estimated VAR model is not guaranteed to be stable, although the error bound analysis in section 3 ensures its stability with high probability as long as the sample size is large enough and the true generative model is stable. For network reconstruction and visualization purposes, stability of the estimated VAR is not strictly required. However, enforcing stability is essential for forecasting purposes. When there is a small deviation of the estimated model from stability (e.g the spectral radius of the estimated is a little over ), stability can be ensured through a postprocessing step of shrinking the moduli of eigenvalues of below while keeping its eigenvectors unchanged. This type of projection argument is common in covariance and correlation matrix estimation with missing data for ensuring positive definiteness of the estimates [42]. However, in case of moderate to large deviation from stability, a closer look at the individual time series is recommended to reassess the validity of the VAR formulation. For example, in macroeconomics, it is customary to use suitable transformations of the component time series to ensure that each of the individual time series and the resulting VAR model is stable, as opposed to modeling the individual and the joint time series (without transformation) as unit root and cointegrating processes. For instance, see [41, 40] for specific recommendations on useful transformations for macroeconomic time series.
3 Theoretical Properties
Next, we derive nonasymptotic upper bounds on the estimation errors of the lowrank plus structured sparse components of the transition matrix . The main result shows that consistent estimation is possible with a sample size of the order , as long as the process is stable and the radius of nonidentifiability, as measured by and/or , is small in an appropriate sense detailed next.
To establish the results, we first consider fixed realizations of and and impose the
following assumptions:
1) Restricted Strong Convexity (RSC): There exist and such that
where ,
and
2) Deviation Conditions: There exist a deterministic function of the model parameters and such that
where is the size of the largest group .
Later on, we show that assumptions 1) and 2) are indeed satisfied with high probability when the data are generated from the model (2).
Next, we present the nonasymptotic upper bounds on the estimation errors of the lowrank plus structured sparse components, respectively.
Proposition 1.
Remark. It should be noted that if each group in has only one element, then we have and nonzero entries. For such cases, part (b) of Proposition 1 becomes identical to part (a).
As a byproduct, we also give the estimation error bound of the transition matrix which can be characterized by the sparse plus groupsparse and the lowrank plus sparse and groupsparse components, respectively, under the assumption that the strength of the connections in the groupsparse component is weak; i.e. with , where .
Proposition 2.
(a) Suppose that the matrix has at most nonzero entries, while the matrix has at most nonzero groups. Then, for any and , any solution of (9) satisfies
(b) Suppose that the matrix has rank at most , while the matrix has at most nonzero entries and the matrix has at most nonzero groups. Then, for any , , and , any solution of (9) satisfies
Note that the objective in Proposition 2 is not the accurate recovery of the and components separately. The latter can be in principle achieved, if one sets to a very small value.
In order to obtain meaningful results in the context of our problem, we need upper bounds on , and and a lower bound on that hold with high probability. For the case of independent and identically distributed data, such highprobability deviation bounds are established in [19]. However, for time series data all entries of the matrix are dependent on each other, and hence it is a nontrivial technical task to establish such deviation bounds. A key technical contribution of this work is to derive these deviation bounds, which lead to meaningful analysis for VAR models. The results rely on the measure of stability defined in (2) and an analysis of the joint spectrum of and undertaken next.
Proposition 3.
See Appendix B for the detailed proof of Proposition 3.
Using the above deviation bounds in the nonasymptotic errors of Propositions 1, we obtain the final result for approximate recovery of the lowrank and the structured sparse components using nuclear and norm relaxations, as we show next.
Proposition 4.
Remark: The error bound presented in the above proposition consists of two key terms. The first term is the error of estimation emanating from randomness in the data and limited sample capacity. For a given model, this error goes to zero as the sample size increases. The second term represents the error due to the unidentifiability of the problem. This is more fundamental to the structure of the true lowrank and structured sparse components, and depends only on the model parameters and does not vanish, even with infinite sample size.
Further, the estimation error is a product of two terms  the second term involves the dimensionality parameters and matches the parametric convergence rate for independent observations. The effect of dependence in the data is captured through the first part of the term: . As discussed in [8], this term is larger when the spectral density is more spiky, indicating a stronger temporal and crosssectional dependence in the data.
Proposition 5.
Remark. Based on Proposition 5, similar conclusions can be obtained as that for the low rank plus sparse case.
4 Computational Algorithm and its Convergence Properties
Next, we introduce a fast algorithm for estimating the transition matrix from data. For ease of presentation and to convey the key ideas clearly, we first present the algorithm for representing a single structure (e.g. only low rank, or only group sparse, or only sparse), and in addition establish its convergence properties. Subsequently, we modify the algorithm to handle the composite structures considered in this paper and also establish its convergence.
The fast network structure learning (FNSL) Algorithm 1 is described next. A safeguarded BB initial value is selected, as the initial choice of the nominal step , i.e.
(12) 
For notational convenience, the penalty term for estimating the transition matrix is denoted by , where represents the tuning parameter.

// Backtracking

Set , where is from (12). Solve from for . Compute

If , then replace by and return to step 1.

// Updating iterates

Compute
The specific update depends on the employed penalty term; for an penalty inducing sparsity, it corresponds to softthresholding [27], for a group sparse penalty to group softthresholding [20], while for a nuclear norm penalty to singular value thresholding [28].
It can also been seen in Algorithm 1, that for for , then and , which leads to the traditional gradient descent algorithm. Indeed, Algorithm 1 is obtained by incorporating an efficient backtracking strategy into the accelerated multistep scheme by [29, 30]. It provides a different way to look for a larger stepsize by employing a relaxed line search condition, instead of searching for the gradient Lipschitz constant of the data fidelity term. Steps 1 and 2 constitute the backtracking ones. Both and are linear combinations of all past iterations of , but based on different weights as can be seen from their updates, i.e. and . Here ‘ag’ simply denotes ‘aggregate’. The data fidelity term in the cost function is linearized at . Since it is used after we have obtained and before we obtain , we use ‘md’ to denote a ‘middle’ update. Further, and are the parameters most closely related to our line search conditions. Intuitively, if we set , we are certainly able to guarantee the accelerated rate of convergence. However, this will render the stepsize smaller. Fortunately, our convergence analysis enables us to relax this condition by utilizing the summing up procedure, with corresponding to the part of the sum of . Thus, we can impose a relaxed termination condition on (see step 2 of Algorithm 1) without impacting the rate of convergence while being able to obtain a more aggressive stepsize. In fact, parameter in Step 2 plays an important role, i.e. the number of the trial steps can be reduced significantly when a relatively larger is selected. However, the value of can not be too large either, since it might impair the convergence rate in terms of the objective function value.
The convergence rate of the proposed algorithm 1 is established next.
Proposition 6.
Let be generated by Algorithm 1. Then, for any
(13) 
where is a finite positive number independent of .
See Appendix C for the detailed proof. It should be noted that the convergence rate of is still an open problem, which needs to be addressed further in future, considering its good performance for some cases.
Next, we enhance the algorithm for solving (9) in the general case. The accelerated convergence rate can be obtained by following the proof for Proposition 6. Similarly to the case in Algorithm 1, the initial trial step of is a safeguarded BB choice
(14) 
The update of the component is based on singular value thresholding, while that of the component on (group) softthresholding. Note that the most expensive computational operation corresponds to the singular value decomposition (SVD) when updating . As mentioned earlier, the proposed algorithm is able to look for larger magnitude step sizes by conducting fewer number of line searches, due to employing more relaxed line search conditions. Actually, this is an important improvement considering the computational cost of SVD. Indeed, the efficiency of the proposed algorithm can be enhanced further if we employ the truncated SVD [32] instead of the full SVD.
The convergence rate of the proposed algorithm 2 (given in supplement due to limited space) is established next.
Proposition 7.
Let be a sequence of updates generated by Algorithm 2. Then, for any
(15) 
where is a finite positive number independent of .
5 Performance Evaluation
Next, we present experimental results on both synthetic and real data. Specifically, the first two experiments focus on largescale network learning with single penalty term to show the efficiency and effectiveness of the proposed algorithms, while the remaining ones assess the accuracy of recovering low rank plus structured sparse transition matrices .
5.1 Performance metrics and experimental settings
We introduce the performance metrices used in the numerical work. For network estimation, we use the true positive rate (TPR) and false alarm rate (FAR) defined as:
where and are the correspnding elemnets in and , respectively. The estimation error (EE) and outofsample prediction error (PE) are defined as
To select the optimal value of the tuning parameters, we combine the three (or two or one, respectively) dimensional grid search method with the AIC/BIC/forward crossvalidation criteria. We will specify the criterion on a case by case basis for the following experiments. In examples B and C, the tuning parameter is selected by the AIC criterion. A grid of 100 values in the interval is used for . In examples D, E and F, we utilize a two/threedimensional grid search to select the optimal values of , and/or as that for . For the experiments employing synthetic data, the tuning parameters are selected by assuming the rank of the true lowrank transition matrix and/or the nonzero groupsparse components of the true groupsparse transition matrix are known. We will specify the forward crossvalidation procedure for the real data case in example G.
For all the experiments, the parameters used in the proposed algorithms are depicted in table 1. Also we set and . We rescale the entries of to ensure stability of the process (the spectral radius . All the results are based on 50 replications. Finally, all algorithms are run in the MATLAB R2015a environment on a PC equipped with 12GB memory.
Parameters  

Values  2  100  1/k 
5.2 Largescale sparse network learning
We start by comparing the performance of the proposed algorithm 1 with FISTA with line search [33] to solve problem (9) with a sparse transition matrix.
We consider three different VAR(1) models with and variables. For each of these models, we generate , and observations from a Gaussian VAR(1) process (2). The transition matrix with sparsity is generated in the following way. First, the topology is generated from a directed random graph , where the edge from one node to another node occurs independently with probability . Then, the strength of the edges is generated independently from a Gaussian distribution. This process is repeated until we obtain a transition matrix with a desired spectral radius . We compare TPR, FAR, EE, and computational time (denoted by ).
p  N  method  (TPR, FAR)(%)  EE  T 

800  1000  FISTA  (88.5, 14.4)  0.22  11.9 
FNSL  (88.6, 13.9)  0.22  
1500  FISTA  (90.8, 12.3)  0.17  14.7  
FNSL  (90.7, 12.0)  0.17  
2000  FISTA  (91.3, 11.0)  0.14  17.6  
FNSL  (91.3, 11.3)  0.14  
900  1000  FISTA  (87.8, 15.0)  0.24  13.1 
FNSL  (87.8, 14.7)  0.24  
1500  FISTA  (89.1, 10.7)  0.19  16.3  
FNSL  (89.1, 10.5)  0.19  
2000  FISTA  (90.9, 12.0)  0.16  20.7  
FNSL  (90.9, 11.8)  0.16  
1000  1000  FISTA  (88.5, 15.2)  0.25  18.6 
FNSL  (88.4, 14.8)  0.25  
1500  FISTA  (90.3, 14.1)  0.20  21.5  
FNSL  (90.2, 13.8)  0.20  
2000  FISTA  (91.2, 12.1)  0.16  24.7  
FNSL  (91.2, 12.4)  0.16 
Table 2 shows the experimental results for sparse network structure with different network size and sample size . It can be seen that the proposed algorithm performs similarly to FISTA in terms of TPR, FAR, and estimation error. To show the efficiency of the proposed algorithm, we also compare the computational time in seconds in terms of the convergence of the objective function value. Clearly, the proposed algorithm outperforms FISTA in efficiency, especially when the network and sample size become larger. This is mainly due to the relaxed line search scheme, as previously discussed. To further support our claim, we also show the graphs of the decreasing objective function value vs. CPU, see Figure 1 when and .
Table 3 shows comparisons between a variant of FISTA and FNSL in terms of objective function value, CPU time in seconds, and the number of matrix products for a network of size with sample size 1000, 1500 and 2000, respectively. For each data set, FISTA needs around 30 iterations to reach convergence and the total number of line searches is 3 for all iterations, while FNSL needs no more than 20 iterations and the total number of line search is no more than 4 for all iterations. This illustrates the computational savings of the proposed algorithm.
Algorithms  Objective value  CPU  AX 
FISTA  4.140e+5  18.6  66 
FNSL  4.089e+5  11.2  44 
FISTA  6.137e+5  21.5  62 
FNSL  6.071e+5  12.8  38 
FISTA  8.239e+5  24.7  66 
FNSL  8.169e+5  15.1  41 
5.3 Network learning with a lowrank transition matrix
To further show the efficiency of the proposed algorithms, we compare the performance of a variant of FISTA [34] and FNSL on estimating lowrank transition matrices.
We consider three different VAR(1) models with and variables. For each of these models, we generate , and observations from a Gaussian VAR(1) process (2). The lowrank transition matrix is generated with rank . Subsequently, we rescale the entries of to ensure the spectral radius lies in . We compare the rank of the estimated transition matrix, denoted by , EE, and computational time .
p  N  method  EE  T  

200  400  FISTA  8  0.80  4.9 
FNSL  8  0.80  
1200  FISTA  8  0.63  9.7  
FNSL  8  0.63  
2000  FISTA  8  0.57  14.2  
FNSL  8  0.57  
300  400  FISTA  12  0.84  7.8 
FNSL  12  0.84  
1200  FISTA  12  0.72  16.1  
FNSL  12  0.72  
2000  FISTA  12  0.68  18.2  
FNSL  12  0.68  
400  400  FISTA  16  0.87  8.5 
FNSL  16  0.87  
1200  FISTA  16  0.82  20.5  
FNSL  16  0.82  
2000  FISTA  16  0.75  31.4  
FNSL  16  0.75 
Table 4 shows the experimental results for lowrank network structure with different network and sample size. Both FISTA and the proposed algorithm achieve good recovery of the transition matrix with the correct rank and they have similar performance in terms of estimation error. Clearly, the proposed algorithm outperforms FISTA in efficiency for this case as well. The graphs of the decreasing objective function value vs. CPU are depicted in Figure 2.
The first two experiments focused on computational efficiency of the proposed algorithms, while retaining good network estimation properties. Next, we demonstrate their accuracy for learning structured sparse networks.
5.4 Sparse plus lowrank network learning
Next, we investigate estimation of sparse plus lowrank transition matrices and compare it to ordinary least square (OLS) and lasso estimates.
We consider three different VAR(1) models with and variables. For each of these models, we generate and observations from the model defined in (2) where can be decomposed into a lowrank matrix of rank and a sparse matrix with nonzero entries. We rescale the entries of to ensure stability of the process (the spectral radius is set to ). We compare the estimation and outofsample prediction errors. The number of out of samples is set to 10.
(TPR, FAR)(%)  

p/8  (84.5, 17.4) 
p/4  (82.4, 17.2) 
p/2  (80.4, 17.5) 
p  (73.2, 17.1) 
2p  (54.7, 17.0) 
4p  (40.2, 17.8) 
8p  (22.7, 17.4) 
First, we study the influence of in (9) on this learning problem with and . From Table 5, that a smaller parameter leads to markedly improved identification of all the true nonzero entries in the sparse component, which consequently leads to better separating the sparse component from the lowrank component .
p  N  model  (TPR, FAR)(%)  EE  PE 

50  100  OLS  (, )  0.84  0.72 
Lasso  (73.2, 30.0)  0.69  0.53  
L+S  (, )  
200  OLS  (, )  0.52  0.41  
Lasso  (77.3, 35.0)  0.57  0.45  
L+S  (, )  
75  100  OLS  (, )  0.75  0.37 
Lasso  (71.0, 24.7)  0.75  0.37  
L+S  (, )  
200  OLS  (, )  0.53  0.18  
Lasso  (77.0, 28.6)  0.67  0.22  
L+S  (, )  
100  100  OLS  (, )  3.7  4.0 
Lasso  (57.3, 29.0)  1.06  1.05  
L+S  (, )  
200  OLS  (, )  2.07  1.73  
Lasso  (59.4, 25.5)  0.86  0.95  
L+S  (, ) 
The corresponding estimation errors are reported in Table 6. In all three settings, we find that the lowrank plus sparse VAR estimates outperform the estimates using ordinary leastsquares (OLS) and lasso, as expected. We observe that as the ratio of increases, OLS may produce lower estimation error than lasso, even though the OLS model is not interpretable for this case. Also, we observe that the estimation errors of all three methods decrease with increasing sample sizes as expected and predicted by theory. Further, we illustrate how the squared Frobenius norm error in Proposition 4 of the VAR model with low rank plus sparse transition matrix scales with the sample size and dimension , when the rank of is fixed. The network size is set to and , respectively, while the rank is fixed to be 2 for all . Sparsity and are defined similarly as above, while the sample size is set to . The squared Frobenius norm error of estimation given by is depicted in the left panel of Figure 3, which displays the errors for different values of , plotted against the sample size . As predicted by our theoretical result, the error is larger for larger . The right panel of Figure 3 displays the error against the rescaled sample size . it can be seen that the corresponding error curves for different values of align well, which is consistent with the estimation error obtained in Proposition 4.
In addition to its improved estimation and prediction performance, the lowrank plus sparse modeling strategy aids in recovering the underlying Granger causal network after accounting for the latent structure. In Figures 4, we demonstrate this using a VAR(1) model with and . The top panel of the Figures 4 displays the true transition matrix , its lowrank component and the structure of its sparse component . The bottom panel of the Figures 4 displays the structure of the Granger causal networks estimated by the method of Lasso and the lowrank plus sparse modeling strategy. As predicted by the theory, it can be that the lasso estimate of the Granger causal network selects many false positives due to its failure to account for the latent structure. On the other hand, the sparse component provides an estimate exhibiting significantly fewer false positives entries.
It is interesting to note that the estimation performance of the regularized estimates in lowrank plus sparse VAR models is worse than the performance of lasso in sparse VAR models of similar dimension [8], even for the same sample sizes. This is in line with the error bounds presented in Proposition 4. The estimation error in lowrank plus sparse models is of the order of , while the error of lasso in sparse VAR models scales at a faster rate of . Further note that a sparse VAR requires estimating parameters in , while the presence of factors introduces an additional parameters in the loading matrix .
5.5 Sparse plus groupsparse plus lowrank network learning
Finally, we conduct numerical experiments to assess the performance of lowrank plus sparse plus groupsparse modeling in VAR analysis and compare it to the performance of sparse plus groupsparse and lowrank plus sparse estimates.
We consider three different VAR(1) models with and variables. For each of these models, we generate and observations from the Gaussian VAR(1) process defined in (2), where can be decomposed into a lowrank matrix of rank , a sparse matrix with nonzero entries, and a groupsparse matrix with each column corresponding to a different group for a total of groups. We rescale the entries of to ensure stability of the process (the spectral radius is set to ) and compare the estimation and outofsample prediction errors, with the number of outsamples set to 10.
p  N  method  (TPR, FAR)(%)  EE  PE 

50  200  S+G  (85.5, 34.1)  0.46  0.53 
L+S  (82.6, 26.4)  0.41  0.51  
L+S+G  (, )  
300  S+G  (91.7, 47.0)  0.37  0.58  
L+S  (88.6, 24.4)  0.31  0.56  
L+S+G  (, )  
100  200  S+G  (92.3, 49.9)  0.48  0.73 
L+S  (84.3, 28.4)  0.44  0.72  
L+S+G  (, )  
300  S+G  (94.8, 49.0)  0.44  0.72  
L+S  (89.6, 25.4)  0.37  0.70  
L+S+G  (, )  
150  200  S+G  (92.0, 50.5)  0.64  0.73 
L+S  (83.3, 28.8)  0.57  0.71  
L+S+G  (, )  
300  S+G  (93.6, 50.2)  0.55  0.71  
L+S  (85.6, 27.4)  0.46  0.68  
L+S+G  (, ) 
The corresponding estimation errors are reported in Table 7. In those three settings, we find that the lowrank plus sparse plus groupsparse VAR estimates performs only slightly better than lowrank plus sparse VAR estimates. One of the reasons lie in that the ability of the identification will degrade as more structures are involved. The other one is that multipletimes shrinkage for the multiple structures lead to severe bias estimation. Even though the group structures can be recovered completely, some nonzero elements in sparse component vanished. An adhoc way to improve the performance for this case is to combine these two methods together. However, both methods outperform the estimates using sparse plus groupsparse VAR. We also observe that, as the ratio of increases, the estimation errors of all three methods decrease with increasing sample sizes as expected and predicted by theory.
5.6 Structured network learning of asset pricing data
Finally, we employ the proposed framework to learn Granger causal networks of asset pricing data obtained from the University of Chicago’s Center for Research in Security Prices (CRSP) and retrieved from the Wharton Research Data Service (WRDS). Specifically, we examine the network structure of realized volatilities of financial institutions representing banks (BA), primary broker/dealers (PB) and insurance companies (INS). The analysis was performed across the following time periods: September 2002  August 2005, September 2006  August 2008 and September 2010  August 2012 that correspond to instances before the financial crisis of 2008 (precrisis period), the buildup and apex of the crisis and the postcrisis period, respectively. For each period, we collected data on 75 firms with 25 companies in each of the three categories  BA, PB and INS based on the size of the average market capitalization of each firm, but dropped a few due to duplicate/missing observations. The final form of the variables used are based on the transformation of the difference between the highest and lowest stock price during a day that acts as a proxy for realized volatility and subsequently detrending it.
To select the tuning parameters, we employ the following forward crossvalidation procedure: (1) We use a time window of length , the available number of time points in the data. Then, starting from time , we use the most recent observations to estimate and denote the transition matrix estimate by . We use the next observations (right after observations) to validate . (2) We select the optimal tuning parameters so that
where and
In our analysis, we set and . Further, to separate the sparse component from the lowrank component as much as possible, we set . The learned Granger causal network structures (the sparse component ) estimated by a sparse plus lowrank model are depicted in Figure 5. It can be seen that even in the presence of a lowrank component, the sparse component exhibits a certain density (about 5% in the precrisis period, rising to 10% during the crisis and dropping down to about 6% in the postcrisis period). This increased connectivity during the crisis period has been observed in Granger causal networks for logreturns as well [4]. In the Supplement, we also provide the Granger causal network structures estimated by only assuming sparsity of the transition matrix (see Supplement Figure 5). A similar increased connectivity pattern is observed during the crisis period. Also note that after accounting for the lowrank component, the estimated sparse component is significantly more sparse than that estimated by a lasso approach, thus enabling us to better examine specific firms that are key drivers in the volatility network.
6 Discussion
Our modeling and technical developments were based on a VAR(1) model. However, it can be generalized to VAR(d) models in different ways, depending on the context of the application. One possible formulation where the low rank component stays the same across lags can be expressed as
This model can be posed as a 1order (L+S) model on the concatenated process