Low Rank and Structured Modeling of High-dimensional Vector Autoregressions

# Low Rank and Structured Modeling of High-dimensional Vector Autoregressions

Sumanta Basu 111These authors contributed equally Email: sumbose@cornell.edu Department of Statistical Science, Cornell University Xianqi Li Email: xianqili@ufl.edu Department of Mathematics and the Informatics Institute, University of Florida George Michailidis Email: gmichail@ufl.edu Departments of Statistics and Computer Science and the Informatics Institute, University of Florida
###### Abstract

Network modeling of high-dimensional time series data is a key learning task due to its widespread use in a number of application areas, including macroeconomics, finance and neuroscience. While the problem of sparse modeling based on vector autoregressive models (VAR) has been investigated in depth in the literature, more complex network structures that involve low rank and group sparse components have received considerably less attention, despite their presence in data. Failure to account for low-rank structures results in spurious connectivity among the observed time series, which may lead practitioners to draw incorrect conclusions about pertinent scientific or policy questions. In order to accurately estimate a network of Granger causal interactions after accounting for latent effects, we introduce a novel approach for estimating low-rank and structured sparse high-dimensional VAR models. We introduce a regularized framework involving a combination of nuclear norm and lasso (or group lasso) penalty. Further, and subsequently establish non-asymptotic upper bounds on the estimation error rates of the low-rank and the structured sparse components. We also introduce a fast estimation algorithm and finally demonstrate the performance of the proposed modeling framework over standard sparse VAR estimates through numerical experiments on synthetic and real data.

Keywords: lasso, group lasso, nuclear norm, low rank, vector autoregression, probabilistic bounds, identifiability, fast algorithm.

## 1 Introduction

The problem of learning the network structure among a large set of time series arises in many signal processing, economic, finance and biomedical applications. Examples include processing signals obtained from radars [1, 2], macroeconomic policy making and forecasting [3], assessing connectivity among financial firms [4], reconstructing gene regulatory interactions from time-course genomic data [5] and understanding connectivity between brain regions from fMRI measurements [6]. Vector Autoregressive (VAR) models provide a principled framework for these tasks.

Formally, a VAR model for -dimensional time series is defined in its simplest possible form involving a single time-lag as

 Xt=B′Xt−1+ϵt,   t=1,⋯,T, (1)

where is a transition matrix specifying the lead-lag cross dependencies among the time series and is a zero mean error process. VAR models for small number of time series (low-dimensional) have been thoroughly studied in the literature [7]. However, the above mentioned applications, where dozens to hundreds of time series are involved, created the need for the study of VAR models under high dimensional scaling and the assumption that their interactions are sparse to compensate for the possible lack of adequate number of time points (samples; see [8] and references therein). There has been a growing body of literature on sparse estimation of large scale VAR models [39], including alternative penalties beyond the popular penalty (lasso), such as the Berhu regularization introduced in [9], group lasso type penalties employed in [22, 21], as well as non-convex penalties akin to a square-lasso one investigated in [11]. Further, [10] examine estimation of the transition matrix and the inverse covariance matrix of the error process through a joint sparse penalty. Note that the problem of sparse estimation of these two model parameters separately from a least squares and maximum likelihood viewpoints is addressed in [8, 3], respectively, where in addition probabilistic finite sample error bounds for the obtained estimates are obtained.

Nevertheless, there are occasions where the sparsity assumption may not be sufficient. For example, during financial crisis periods, returns on assets move together in a more concerted manner [4, 12], while transcription factors regulate a large number of genes that may lead to hub-node network structures [13]. Similarly, in brain connectivity networks, particular tasks activate a number of regions that cross talk in a collaborative manner [14]. Hence, it is of interest to study VAR models under high dimensional scaling where the transition matrix governing the temporal dynamics exhibits a more complex structure; e.g. it is low rank and/or (group) sparse.

In a low-dimensional regime, where the number of time points scales to infinity, but the number of time series under study remains fixed, [15] examined asymptotic properties of VAR models, where the parameters exhibit reduced rank structure and also discussed connections with canonical correlation analysis of such models presented in [16]. Specifically, the transition matrix in (2) can be written as the product of two rank- matrices , i.e. , so that in the resulting model specification of the original time series is expressed as linear combinations of the original ones, and specifies the dependence between and ; namely . Hence, to obtain and [15] suggest to estimate the parameters of the original model in (1) under the constraint that and that rank. Other works include low rank approximations of Hankel matrices that represent the input-output structure of a linear time invariant systems and were studied in [17, 18]. Finally, a brief mention to the possibility that the VAR transition matrix may exhibit such a structure appeared as a motivating example in [19].

On the other hand, there is a mature literature on imposing low rank plus sparse, or pure group sparse structure for many learning tasks for independent and identically distributed (i.i.d.) data. Examples include group sparsity in regression and graphical modeling [20], low rank and sparse matrix approximations for dimension reduction [18], etc. However, as shown in [8], the presence of temporal dependence across observations induces intricate dependencies between both rows and columns of the design matrix of the corresponding least squares estimation problem, as well as between the design matrix and the error term, that require careful handling to establish consistency properties for the model parameters under sparsity and high dimensional scaling. These issues are further compounded when more complex regularizing norms are involved, as discussed in [21]. In this paper, the authors model grouping structures within each column of , but do not consider a low-rank component. In contrast, we focus on groups potentially spanning across different columns and allow a low-rank component in .

More recently, [37] and [38] extended the framework of [18] beyond decomposition of an observable matrix to Gaussian process identification by assuming a low-rank plus sparse structure on the inverse spectral density and the transfer function of a general VAR() system, respectively. Our work is complementary to this recent developments. We directly model the transition matrix of a VAR(1) process that enables us to identify a directed network of (group) sparse Granger causal relationships that are of interest in a number of applications; e.g. in financial economics where firms with higher out-degree are of particular interest in measuring systemic risk [12, 4]. Further, our approach explicitly address identifiability issues for extracting the respective low-rank and sparse components, which in turn are leveraged to obtain probabilistic error bounds that characterize the quality of their estimates. The latter provide insights to the practitioner on sample size requirements and tuning parameter selection for real data applications. Finally, note that our approach to the issue of identifiability builds on [19], wherein we characterize the degree of unidentifiability which guides in an explicit manner the selection of the tuning parameters used in the proposed optimization algorithm.

Further, to estimate the posited model in (1) with being low-rank and structured sparse (henceforth indicating that it could be either pure sparse or group sparse or both), we also introduce a fast accelerated proximal gradient algorithm, inspired by [30, 31], for the corresponding optimization problems. The key idea is that instead of searching for the local Lipschitz constant of the gradient of the smooth component of the objective function, the proposed algorithm utilizes a safeguarded Barzilai-Borwein (BB) initial stepsize [25] and employs relaxed line search conditions to achieve better performance in practice. The latter enables the selection of more “aggressive” stepsizes, while preserving the accelerated convergence rate of , where denotes the number of iterations required until convergence. Finally, the performance of the model parameters under different structures together with the associated estimation procedure based on the accelerated proximal gradient algorithm are calibrated on synthetic data, and illustrated on three data sets examining realized volatilities of stock prices of large financial firms before, during and after the US financial crisis.

Notation: Throughout the paper, we employ the following notation: , and denote the -norm of a vector, the spectral norm and the Frobenius norm of a matrix, respectively. For a matrix , the symbol is used to denote the nuclear norm, i.e. , the sum of the singular values of a matrix, while denotes the conjugate transpose of a matrix . For any matrix , we use to denote , for and to denote . Further, if denote a partition of into non-overlapping groups, then we use to denote , for , while denotes the number of nonzero groups in . Here, with a little abuse of notation, we use to denote . In addition, , denote the maximum and minimum eigenvalues of a symmetric or Hermitian matrix. For any integer , we use to denote the unit ball . We also use generically to denote unit vectors in , when is clear from the context. Finally, for positive real numbers , we write if there exists an absolute positive constant , independent of the model parameters, such that .

## 2 Model Formulation and Estimation Procedure

Consider a VAR(1) model where the transition matrix is low-rank plus structured sparse given by

 Xt=B′Xt−1+ϵt,      ϵt\lx@stackreli.i.d.∼N(0,Σϵ), (2) B=L∗+R∗, rank(L∗)=r, (3)

where corresponds to the low rank component and represents either a sparse , or group-sparse component . It is further assumed that the number of non-zero elements in the sparse case is , while in the group sparse case the number of non-zero groups is , with and . The matrix captures persistence structure across all time series and enables the model to be applicable in settings where there are strong cross-autocorrelations, a feature that the standard sparse VAR model is not equipped to handle. The sparse or group sparse component captures additional cross-sectional autocorrelation structure among the time series. Finally, it is assumed that the error terms are serially uncorrelated. Our objective is to estimate and accurately based on a relatively small number of samples .

Stability. In order to ensure consistent estimation, we assume that the posited VAR model in (2) is stable; i.e. its characteristic polynomial satisfies on the unit circle of the complex plane . This is a common assumption in the literature of multivariate time series [7], required for consistency and asymptotic normality of low-dimensional VAR models. This assumption also ensures that the spectral density of the VAR model

 fX(θ)=12π(B−1(eiθ))Σϵ(B−1(eiθ))†,   θ∈[−π,π], (4)

is bounded above in spectral norm.

It was shown in [8] that this condition is sufficient to establish consistency of some regularized VAR estimates of a sparse transition matrix. Further, the following quantities play a central role in the error bounds of the regularized estimates:

 M(fX)=supθ∈[−π,π]Λmax(fX(θ)), \EuFrakm(fX)=supθ∈[−π,π]Λmin(fX(θ)), μmax(B)=max|z|=1Λmax(B†(z)B(z)), μmin(B)=min|z|=1Λmin(B†(z)B(z)). (5)

As shown in [8], and together capture the narrowness of the spectral density of a time series. Processes with stronger temporal and cross-sectional dependence have narrower spectra that in turn lead to slower convergence rates for the regularized estimates. For VAR models, and are related to and as follows:

 \EuFrakm(fX)≥12πΛmin(Σϵ)μmax(B),   M(fX)≤12πΛmax(Σϵ)μmin(B). (6)

Proposition 2.2 in [8] provides a lower bound on . For the special structure of the models considered here, we can get an improved upper bound on , as shown in the following lemma:

###### Lemma 2.1.

For a stable VAR(1) model of the class (2), we have

 μmax(B)≤[1+l+(vin+vout)/2]2 (7)

where is the largest singular value of , and .

###### Proof.

for any with . The result follows from the fact that . ∎

### 2.1 Estimation Procedure

The estimation of VAR model parameters is based on the following regression formulation (see [7]). Given consecutive observations from the VAR model, we work with the autoregressive design as follows:

 ⎡⎢ ⎢⎣(XT)′⋮(X1)′⎤⎥ ⎥⎦Y = ⎡⎢ ⎢⎣(XT−1)′⋮(X0)⎤⎥ ⎥⎦XB+⎡⎢ ⎢⎣(ϵT)′⋮(ϵ1)′⎤⎥ ⎥⎦E. (8)

This is a standard regression problem with samples and variables. Our goal is to estimate , with high accuracy when .

There is an inherent identifiability issue in the estimation of the components and . Suppose the low-rank component itself is -sparse or -group sparse and the sparse or group-sparse component is of rank . In that scenario, we can not hope for any method to estimate and separately without imposing any further constraints. So, a minimal condition for low-rank and sparse or group-sparse recovery is that the low rank part should not be too sparse and the sparse or group-sparse part should not be low-rank.

This issue has been rigorously addressed in the literature (e.g. [18]) for independent and identically distributed data and resolved by imposing an incoherence condition. Such a condition is sufficient for exact recovery of the low rank and the sparse or group-sparse component by solving a convex program. In a recent paper, [19] showed that in a noisy setting where exact recovery of the two components is impossible, it is still possible to achieve good estimation error under a comparatively mild assumption. In particular, they formulated a general measure for the radius of nonidentifiability of the problem under consideration and established a non-asymptotic upper bound on the estimation error , which depend on this radius. The key idea is to allow for simultaneously sparse (or group-sparse) and low-rank matrices in the model, and control for the error introduced. We refer the readers to the above paper for a more detailed discussion on this notion of nonidentifiability. In this work, the low-rank plus sparse or group-sparse decomposition problem under restrictions on the radius of nonidentifiability takes the form

 (^L,^R)=\operatornamewithlimitsargminL,R∈Rp×pL∈Ωl(L,R),
 l(L,R):=12∥Y−X(L+R)∥2F+λN∥L∥∗+μN∥R∥⋄. (9)

Here (for sparse) or (for group sparse), represents or depending on sparsity or group sparsity of , and and are non-negative tuning parameters controlling the regularizations of low-rank and sparse/group-sparse parts. The parameters and control for the degree of nonidentifiability of the matrices allowed in the model class. For instance, larger values of provide sparser estimates of and allow simultaneously sparse and low-rank components to be absorbed in . A smaller value of , on the other hand, tends to produce a matrix with smaller and pushes the simultaneously low-rank and sparse components to be absorbed in . In practice, we recommend choosing and in the range and , respectively. The issue of selecting them robustly in practice is discussed in Section 5.

Remark. On certain occasions, it may be useful to have both sparse m and group-sparse structures in the model, in addition to the low rank structure. We then have in (9) with . However, to guarantee the simultaneous identifiability of the sparse and group-sparse components from the low-rank component, stronger conditions need to be imposed on ; namely, .

Remark. Note that the estimated VAR model is not guaranteed to be stable, although the error bound analysis in section 3 ensures its stability with high probability as long as the sample size is large enough and the true generative model is stable. For network reconstruction and visualization purposes, stability of the estimated VAR is not strictly required. However, enforcing stability is essential for forecasting purposes. When there is a small deviation of the estimated model from stability (e.g the spectral radius of the estimated is a little over ), stability can be ensured through a post-processing step of shrinking the moduli of eigenvalues of below while keeping its eigenvectors unchanged. This type of projection argument is common in covariance and correlation matrix estimation with missing data for ensuring positive definiteness of the estimates [42]. However, in case of moderate to large deviation from stability, a closer look at the individual time series is recommended to re-assess the validity of the VAR formulation. For example, in macroeconomics, it is customary to use suitable transformations of the component time series to ensure that each of the individual time series and the resulting VAR model is stable, as opposed to modeling the individual and the joint time series (without transformation) as unit root and co-integrating processes. For instance, see [41, 40] for specific recommendations on useful transformations for macroeconomic time series.

## 3 Theoretical Properties

Next, we derive non-asymptotic upper bounds on the estimation errors of the low-rank plus structured sparse components of the transition matrix . The main result shows that consistent estimation is possible with a sample size of the order , as long as the process is stable and the radius of nonidentifiability, as measured by and/or , is small in an appropriate sense detailed next.

To establish the results, we first consider fixed realizations of and and impose the following assumptions:
1) Restricted Strong Convexity (RSC): There exist and such that

 12∥XΔ∥2F≥ζ2∥Δ∥2F−τNΦ2(Δ),  for all  Δ∈Rp×p

where , and
2) Deviation Conditions: There exist a deterministic function of the model parameters and such that

 ∥X′E/N∥2≤ϕ(B,Σϵ)√p/N,  and
 ∥X′E/N∥max≤ϕ(B,Σϵ)√2logpN,  and
 ∥X′E/N∥2,max≤ϕ(B,Σϵ)√mlogK√N,

where is the size of the largest group .

Later on, we show that assumptions 1) and 2) are indeed satisfied with high probability when the data are generated from the model (2).

Next, we present the non-asymptotic upper bounds on the estimation errors of the low-rank plus structured sparse components, respectively.

###### Proposition 1.

(a) Suppose that the matrix has rank at most , while the matrix has at most nonzero entries. Then, for any and , any solution of (9) satisfies

 ∥^L−L∗∥2F+∥^S−S∗∥2F≤4ζ2(92λ2Nr+4μ2Ns).

(b) Suppose that the matrix has rank at most , while the matrix has at most non-zero groups. Then, for any and , any solution of (9) satisfies

 ∥^L−L∗∥2F+∥^G−G∗∥2F≤4ζ2(92λ2Nr+4μ2Ng)

Remark. It should be noted that if each group in has only one element, then we have and non-zero entries. For such cases, part (b) of Proposition 1 becomes identical to part (a).

As a byproduct, we also give the estimation error bound of the transition matrix which can be characterized by the sparse plus group-sparse and the low-rank plus sparse and group-sparse components, respectively, under the assumption that the strength of the connections in the group-sparse component is weak; i.e. with , where .

###### Proposition 2.

(a) Suppose that the matrix has at most nonzero entries, while the matrix has at most non-zero groups. Then, for any and , any solution of (9) satisfies

 ∥^S+^G−S∗−G∗∥2F≤4ζ2(8μ2Ns+9ν2Ng).

(b) Suppose that the matrix has rank at most , while the matrix has at most nonzero entries and the matrix has at most non-zero groups. Then, for any , , and , any solution of (9) satisfies

 ∥^L−L∗∥2F+∥^S+^G−S∗−G∗∥2F≤4ζ2(9λ2Nr+252μ2Ns+8ν2Ng).

See Appendix A for the detailed proof of Propositions 1 and 2.

Note that the objective in Proposition 2 is not the accurate recovery of the and components separately. The latter can be in principle achieved, if one sets to a very small value.

In order to obtain meaningful results in the context of our problem, we need upper bounds on , and and a lower bound on that hold with high probability. For the case of independent and identically distributed data, such high-probability deviation bounds are established in [19]. However, for time series data all entries of the matrix are dependent on each other, and hence it is a non-trivial technical task to establish such deviation bounds. A key technical contribution of this work is to derive these deviation bounds, which lead to meaningful analysis for VAR models. The results rely on the measure of stability defined in (2) and an analysis of the joint spectrum of and undertaken next.

###### Proposition 3.

Consider a random realization of generated according to a stable VAR(1) process (2) and form the autoregressive design (8). Define

 ϕ(B,Σϵ)=Λmax(Σϵ)[1+1+μmax(B)μmin(B)]

Then, there exist universal positive constants such that

1. for ,

 P[∥X′EN∥2>c0ϕ(B,Σϵ)√pN]≤c1exp[−c2logp]

and for any ,

 P⎡⎣∥X′EN∥max>c0ϕ(B,Σϵ)√logpN⎤⎦≤c1exp[−c2logp]

and for ,

 P[∥X′EN∥2,max>c0ϕ(B,Σϵ)√mlogp√N]≤c1exp[−c2logp]
2. for ,

 P[Λmin(X′XN)>Λmin(Σϵ)2μmax(B)]≤c1exp[−c2logp]

See Appendix B for the detailed proof of Proposition 3.

Using the above deviation bounds in the non-asymptotic errors of Propositions 1, we obtain the final result for approximate recovery of the low-rank and the structured sparse components using nuclear and norm relaxations, as we show next.

###### Proposition 4.

Consider the setup of Proposition 3. There exist universal positive constants such that for , and , any solution of the program (9) satisfies, with probability at least ,

 ∥^S−S∗∥2F+∥^L−L∗∥2F≤c0ϕ2(B,Σϵ)μ2max(B)Λ2min(Σϵ)(rp+slogp)N+32Λ2min(Σϵ)μ2max(B)sα2p2. (10)

Remark: The error bound presented in the above proposition consists of two key terms. The first term is the error of estimation emanating from randomness in the data and limited sample capacity. For a given model, this error goes to zero as the sample size increases. The second term represents the error due to the unidentifiability of the problem. This is more fundamental to the structure of the true low-rank and structured sparse components, and depends only on the model parameters and does not vanish, even with infinite sample size.

Further, the estimation error is a product of two terms - the second term involves the dimensionality parameters and matches the parametric convergence rate for independent observations. The effect of dependence in the data is captured through the first part of the term: . As discussed in [8], this term is larger when the spectral density is more spiky, indicating a stronger temporal and cross-sectional dependence in the data.

###### Proposition 5.

Consider the setup of Proposition 3. There exist universal positive constants such that for , for any with , any solution of the program (9) satisfies, with probability at least ,

 ∥^G−G∗∥2F+∥^L−L∗∥2F≤c0ϕ2(B,Σϵ)μ2max(B)Λ2min(Σϵ)(rp+g(mlogp))N+32Λ2min(Σϵ)μ2max(B)gβ2K. (11)

See Appendix B for the detailed proof of Proposition 4 and 5.

Remark. Based on Proposition 5, similar conclusions can be obtained as that for the low rank plus sparse case.

## 4 Computational Algorithm and its Convergence Properties

Next, we introduce a fast algorithm for estimating the transition matrix from data. For ease of presentation and to convey the key ideas clearly, we first present the algorithm for representing a single structure (e.g. only low rank, or only group sparse, or only sparse), and in addition establish its convergence properties. Subsequently, we modify the algorithm to handle the composite structures considered in this paper and also establish its convergence.

The fast network structure learning (FNSL) Algorithm 1 is described next. A safeguarded BB initial value is selected, as the initial choice of the nominal step , i.e.

 η0,i=max{ηmin,∥X(Bi−Bi−1)∥2F∥Bi−Bi−1∥2F}  for  i>1. (12)

For notational convenience, the penalty term for estimating the transition matrix is denoted by , where represents the tuning parameter.

The specific update depends on the employed penalty term; for an penalty inducing sparsity, it corresponds to soft-thresholding [27], for a group sparse penalty to group soft-thresholding [20], while for a nuclear norm penalty to singular value thresholding [28].

It can also been seen in Algorithm 1, that for for , then and , which leads to the traditional gradient descent algorithm. Indeed, Algorithm 1 is obtained by incorporating an efficient backtracking strategy into the accelerated multi-step scheme by [29, 30]. It provides a different way to look for a larger stepsize by employing a relaxed line search condition, instead of searching for the gradient Lipschitz constant of the data fidelity term. Steps 1 and 2 constitute the backtracking ones. Both and are linear combinations of all past iterations of , but based on different weights as can be seen from their updates, i.e. and . Here ‘ag’ simply denotes ‘aggregate’. The data fidelity term in the cost function is linearized at . Since it is used after we have obtained and before we obtain , we use ‘md’ to denote a ‘middle’ update. Further, and are the parameters most closely related to our line search conditions. Intuitively, if we set , we are certainly able to guarantee the accelerated rate of convergence. However, this will render the stepsize smaller. Fortunately, our convergence analysis enables us to relax this condition by utilizing the summing up procedure, with corresponding to the part of the sum of . Thus, we can impose a relaxed termination condition on (see step 2 of Algorithm 1) without impacting the rate of convergence while being able to obtain a more aggressive stepsize. In fact, parameter in Step 2 plays an important role, i.e. the number of the trial steps can be reduced significantly when a relatively larger is selected. However, the value of can not be too large either, since it might impair the convergence rate in terms of the objective function value.

The convergence rate of the proposed algorithm 1 is established next.

###### Proposition 6.

Let be generated by Algorithm 1. Then, for any

 l(Bagk+1)−l(^B)≤2σ∥X∥22∥B0−^B∥2F+~C(k+1)2 (13)

where is a finite positive number independent of .

See Appendix C for the detailed proof. It should be noted that the convergence rate of is still an open problem, which needs to be addressed further in future, considering its good performance for some cases.

Next, we enhance the algorithm for solving (9) in the general case. The accelerated convergence rate can be obtained by following the proof for Proposition 6. Similarly to the case in Algorithm 1, the initial trial step of is a safeguarded BB choice

 η0,i=max{ηmin,∥X(Li+Ri−Li−1−Ri−1)∥2F∥Li+Ri−Li−1−Ri−1∥2F} (14)

The update of the component is based on singular value thresholding, while that of the component on (group) soft-thresholding. Note that the most expensive computational operation corresponds to the singular value decomposition (SVD) when updating . As mentioned earlier, the proposed algorithm is able to look for larger magnitude step sizes by conducting fewer number of line searches, due to employing more relaxed line search conditions. Actually, this is an important improvement considering the computational cost of SVD. Indeed, the efficiency of the proposed algorithm can be enhanced further if we employ the truncated SVD [32] instead of the full SVD.

The convergence rate of the proposed algorithm 2 (given in supplement due to limited space) is established next.

###### Proposition 7.

Let be a sequence of updates generated by Algorithm 2. Then, for any

 l(Lagk+1,Ragk+1)−l(^L,^R)≤2σ∥X∥22(∥L0−^L∥2F+∥R0−^R∥2F)+~C(k+1)2 (15)

where is a finite positive number independent of .

Proposition 7 is a direct extension of Proposition 6 and can be obtained by following the roadmap of the proof for the former result.

## 5 Performance Evaluation

Next, we present experimental results on both synthetic and real data. Specifically, the first two experiments focus on large-scale network learning with single penalty term to show the efficiency and effectiveness of the proposed algorithms, while the remaining ones assess the accuracy of recovering low rank plus structured sparse transition matrices .

### 5.1 Performance metrics and experimental settings

We introduce the performance metrices used in the numerical work. For network estimation, we use the true positive rate (TPR) and false alarm rate (FAR) defined as:

where and are the correspnding elemnets in and , respectively. The estimation error (EE) and out-of-sample prediction error (PE) are defined as

To select the optimal value of the tuning parameters, we combine the three (or two or one, respectively) -dimensional grid search method with the AIC/BIC/forward cross-validation criteria. We will specify the criterion on a case by case basis for the following experiments. In examples B and C, the tuning parameter is selected by the AIC criterion. A grid of 100 values in the interval is used for . In examples D, E and F, we utilize a two/three-dimensional grid search to select the optimal values of , and/or as that for . For the experiments employing synthetic data, the tuning parameters are selected by assuming the rank of the true low-rank transition matrix and/or the non-zero group-sparse components of the true group-sparse transition matrix are known. We will specify the forward cross-validation procedure for the real data case in example G.

For all the experiments, the parameters used in the proposed algorithms are depicted in table 1. Also we set and . We rescale the entries of to ensure stability of the process (the spectral radius . All the results are based on 50 replications. Finally, all algorithms are run in the MATLAB R2015a environment on a PC equipped with 12GB memory.

### 5.2 Large-scale sparse network learning

We start by comparing the performance of the proposed algorithm 1 with FISTA with line search [33] to solve problem (9) with a sparse transition matrix.

We consider three different VAR(1) models with and variables. For each of these models, we generate , and observations from a Gaussian VAR(1) process (2). The transition matrix with sparsity is generated in the following way. First, the topology is generated from a directed random graph , where the edge from one node to another node occurs independently with probability . Then, the strength of the edges is generated independently from a Gaussian distribution. This process is repeated until we obtain a transition matrix with a desired spectral radius . We compare TPR, FAR, EE, and computational time (denoted by ).

Table 2 shows the experimental results for sparse network structure with different network size and sample size . It can be seen that the proposed algorithm performs similarly to FISTA in terms of TPR, FAR, and estimation error. To show the efficiency of the proposed algorithm, we also compare the computational time in seconds in terms of the convergence of the objective function value. Clearly, the proposed algorithm outperforms FISTA in efficiency, especially when the network and sample size become larger. This is mainly due to the relaxed line search scheme, as previously discussed. To further support our claim, we also show the graphs of the decreasing objective function value vs. CPU, see Figure 1 when and .

Table 3 shows comparisons between a variant of FISTA and FNSL in terms of objective function value, CPU time in seconds, and the number of matrix products for a network of size with sample size 1000, 1500 and 2000, respectively. For each data set, FISTA needs around 30 iterations to reach convergence and the total number of line searches is 3 for all iterations, while FNSL needs no more than 20 iterations and the total number of line search is no more than 4 for all iterations. This illustrates the computational savings of the proposed algorithm.

### 5.3 Network learning with a low-rank transition matrix

To further show the efficiency of the proposed algorithms, we compare the performance of a variant of FISTA [34] and FNSL on estimating low-rank transition matrices.

We consider three different VAR(1) models with and variables. For each of these models, we generate , and observations from a Gaussian VAR(1) process (2). The low-rank transition matrix is generated with rank . Subsequently, we rescale the entries of to ensure the spectral radius lies in . We compare the rank of the estimated transition matrix, denoted by , EE, and computational time .

Table 4 shows the experimental results for low-rank network structure with different network and sample size. Both FISTA and the proposed algorithm achieve good recovery of the transition matrix with the correct rank and they have similar performance in terms of estimation error. Clearly, the proposed algorithm outperforms FISTA in efficiency for this case as well. The graphs of the decreasing objective function value vs. CPU are depicted in Figure 2.

The first two experiments focused on computational efficiency of the proposed algorithms, while retaining good network estimation properties. Next, we demonstrate their accuracy for learning structured sparse networks.

### 5.4 Sparse plus low-rank network learning

Next, we investigate estimation of sparse plus low-rank transition matrices and compare it to ordinary least square (OLS) and lasso estimates.

We consider three different VAR(1) models with and variables. For each of these models, we generate and observations from the model defined in (2) where can be decomposed into a low-rank matrix of rank and a sparse matrix with non-zero entries. We rescale the entries of to ensure stability of the process (the spectral radius is set to ). We compare the estimation and out-of-sample prediction errors. The number of out of samples is set to 10.

First, we study the influence of in (9) on this learning problem with and . From Table 5, that a smaller parameter leads to markedly improved identification of all the true nonzero entries in the sparse component, which consequently leads to better separating the sparse component from the low-rank component .

The corresponding estimation errors are reported in Table 6. In all three settings, we find that the low-rank plus sparse VAR estimates outperform the estimates using ordinary least-squares (OLS) and lasso, as expected. We observe that as the ratio of increases, OLS may produce lower estimation error than lasso, even though the OLS model is not interpretable for this case. Also, we observe that the estimation errors of all three methods decrease with increasing sample sizes as expected and predicted by theory. Further, we illustrate how the squared Frobenius norm error in Proposition 4 of the VAR model with low rank plus sparse transition matrix scales with the sample size and dimension , when the rank of is fixed. The network size is set to and , respectively, while the rank is fixed to be 2 for all . Sparsity and are defined similarly as above, while the sample size is set to . The squared Frobenius norm error of estimation given by is depicted in the left panel of Figure 3, which displays the errors for different values of , plotted against the sample size . As predicted by our theoretical result, the error is larger for larger . The right panel of Figure 3 displays the error against the rescaled sample size . it can be seen that the corresponding error curves for different values of align well, which is consistent with the estimation error obtained in Proposition 4.

In addition to its improved estimation and prediction performance, the low-rank plus sparse modeling strategy aids in recovering the underlying Granger causal network after accounting for the latent structure. In Figures 4, we demonstrate this using a VAR(1) model with and . The top panel of the Figures 4 displays the true transition matrix , its low-rank component and the structure of its sparse component . The bottom panel of the Figures 4 displays the structure of the Granger causal networks estimated by the method of Lasso and the low-rank plus sparse modeling strategy. As predicted by the theory, it can be that the lasso estimate of the Granger causal network selects many false positives due to its failure to account for the latent structure. On the other hand, the sparse component provides an estimate exhibiting significantly fewer false positives entries.

It is interesting to note that the estimation performance of the regularized estimates in low-rank plus sparse VAR models is worse than the performance of lasso in sparse VAR models of similar dimension [8], even for the same sample sizes. This is in line with the error bounds presented in Proposition 4. The estimation error in low-rank plus sparse models is of the order of , while the error of lasso in sparse VAR models scales at a faster rate of . Further note that a -sparse VAR requires estimating parameters in , while the presence of factors introduces an additional parameters in the loading matrix .

### 5.5 Sparse plus group-sparse plus low-rank network learning

Finally, we conduct numerical experiments to assess the performance of low-rank plus sparse plus group-sparse modeling in VAR analysis and compare it to the performance of sparse plus group-sparse and low-rank plus sparse estimates.

We consider three different VAR(1) models with and variables. For each of these models, we generate and observations from the Gaussian VAR(1) process defined in (2), where can be decomposed into a low-rank matrix of rank , a sparse matrix with non-zero entries, and a group-sparse matrix with each column corresponding to a different group for a total of groups. We rescale the entries of to ensure stability of the process (the spectral radius is set to ) and compare the estimation and out-of-sample prediction errors, with the number of out-samples set to 10.

The corresponding estimation errors are reported in Table 7. In those three settings, we find that the low-rank plus sparse plus group-sparse VAR estimates performs only slightly better than low-rank plus sparse VAR estimates. One of the reasons lie in that the ability of the identification will degrade as more structures are involved. The other one is that multiple-times shrinkage for the multiple structures lead to severe bias estimation. Even though the group structures can be recovered completely, some non-zero elements in sparse component vanished. An ad-hoc way to improve the performance for this case is to combine these two methods together. However, both methods outperform the estimates using sparse plus group-sparse VAR. We also observe that, as the ratio of increases, the estimation errors of all three methods decrease with increasing sample sizes as expected and predicted by theory.

### 5.6 Structured network learning of asset pricing data

Finally, we employ the proposed framework to learn Granger causal networks of asset pricing data obtained from the University of Chicago’s Center for Research in Security Prices (CRSP) and retrieved from the Wharton Research Data Service (WRDS). Specifically, we examine the network structure of realized volatilities of financial institutions representing banks (BA), primary broker/dealers (PB) and insurance companies (INS). The analysis was performed across the following time periods: September 2002 - August 2005, September 2006 - August 2008 and September 2010 - August 2012 that correspond to instances before the financial crisis of 2008 (pre-crisis period), the build-up and apex of the crisis and the post-crisis period, respectively. For each period, we collected data on 75 firms with 25 companies in each of the three categories - BA, PB and INS based on the size of the average market capitalization of each firm, but dropped a few due to duplicate/missing observations. The final form of the variables used are based on the transformation of the difference between the highest and lowest stock price during a day that acts as a proxy for realized volatility and subsequently detrending it.

To select the tuning parameters, we employ the following forward cross-validation procedure: (1) We use a time window of length , the available number of time points in the data. Then, starting from time , we use the most recent observations to estimate and denote the transition matrix estimate by . We use the next observations (right after observations) to validate . (2) We select the optimal tuning parameters so that

 (λ′N,μ′N)=argmin⎧⎨⎩1⌊m−50025⌋m∑t=500+25∗iErr(^Bt)⎫⎬⎭

where and

In our analysis, we set and . Further, to separate the sparse component from the low-rank component as much as possible, we set . The learned Granger causal network structures (the sparse component ) estimated by a sparse plus low-rank model are depicted in Figure 5. It can be seen that even in the presence of a low-rank component, the sparse component exhibits a certain density (about 5% in the pre-crisis period, rising to 10% during the crisis and dropping down to about 6% in the post-crisis period). This increased connectivity during the crisis period has been observed in Granger causal networks for log-returns as well [4]. In the Supplement, we also provide the Granger causal network structures estimated by only assuming sparsity of the transition matrix (see Supplement Figure 5). A similar increased connectivity pattern is observed during the crisis period. Also note that after accounting for the low-rank component, the estimated sparse component is significantly more sparse than that estimated by a lasso approach, thus enabling us to better examine specific firms that are key drivers in the volatility network.

## 6 Discussion

Our modeling and technical developments were based on a VAR(1) model. However, it can be generalized to VAR(d) models in different ways, depending on the context of the application. One possible formulation where the low rank component stays the same across lags can be expressed as

 Xt=d∑ℓ=1(L+Sℓ)Xt−ℓ+vt.

This model can be posed as a 1-order (L+S) model on the concatenated process