Sparse transition matrix estimation for high-dimensional and locally stationary vector autoregressive models
Abstract
We consider the estimation of the transition matrix in high-dimensional time-varying vector autoregression (TVVAR) models. Our model builds on a general class of locally stationary VAR processes that evolve smoothly in time. We propose a hybrid kernel-smoothing and regularization method to directly estimate the sequence of time-varying transition matrices. Under a sparsity assumption on the transition matrix, we establish the rate of convergence of the proposed estimator and show that the convergence rate depends on the smoothness of the locally stationary VAR processes only through the smoothness of the transition matrix function. In addition, for our estimator followed by thresholding, we prove that the false positive rate (type I error) and the false negative rate (type II error) in the pattern recovery can asymptotically vanish in the presence of weak signals, without assuming a minimum nonzero signal strength condition. Favorable finite-sample performance over the penalized least-squares estimator and the unstructured maximum likelihood estimator is shown on simulated data. We also provide two real examples on estimating the dependence structures of financial stock price and economic exchange rate datasets.
High-dimensional VAR
Research partially supported by NSF grant DMS-1404891 and UIUC Research Board Award RB15004.
MSC subject classifications: Primary 62M10, 62H12; secondary 91B84.
Keywords: vector autoregression, time-varying parameters, locally stationary processes, kernel smoothing, high dimension, sparsity.
Contents:
 1 Introduction
 2 Method and algorithm
 3 Asymptotic properties
 4 Simulation studies
 5 Real data analysis
 6 Proofs
 A Detailed setup of the simulation studies
 B Examples for four graph structures
 C ROC curves based on the true thresholding parameters
 D Selected stocks and currencies in real data analysis
1 Introduction
Vector autoregression (VAR) is a basic tool in multivariate time series analysis, and it has been extensively used to model the cross-sectional and serial dependence in various applications from economics and finance [[31], [3], [4], [10], [32], [14], [34], [13], [1]]. There are two fundamental limitations of vector autoregressive models. First, conventional methods for estimating the transition matrix of a VAR model are based on the least squares (LS) estimator and the maximum likelihood estimator (MLE), whose parameter estimates are consistent when the sample size increases and the model size is fixed [[8], [25]]. Since the number of parameters grows quadratically in the number of time series variables, the VAR model typically includes no more than ten variables in many real applications [[18]]. However, due to the recent explosive data enrichment, analysis of panel data with a few hundred variables is often encountered, in which case the LS and MLE are not suitable even for a moderate problem size. Second, stationarity plays a major role in the VAR model. Therefore, the stationary VAR does not capture time-varying underlying data generation structures, which have been observed in a broad regime of applications in economics and finance [[2], [27]].
Motivated by the limitations of the VAR, this paper studies the estimation problem of the time-varying VAR (TVVAR) model for high-dimensional time series data. Let $X_1, \dots, X_n$ be a sequence of $d$-dimensional observations generated by a mean-zero TVVAR of order 1 (TVVAR(1))

(1.1) $X_i = A_i X_{i-1} + e_i, \qquad i = 1, \dots, n,$

where $A_i = A(i/n)$, $A(\cdot)$ is a matrix-valued deterministic function consisting of the transition matrices at evenly spaced time points, and $e_1, \dots, e_n$ are independent and identically distributed (iid) mean-zero random errors, i.e. innovations. In this paper, our main focus is to estimate the transition matrices for the TVVAR(1); the extension to higher-order VAR is straightforward. Indeed, a general TVVAR of order $p$,

$X_i = A_{i,1} X_{i-1} + \cdots + A_{i,p} X_{i-p} + e_i,$

can be rewritten at time $i$ as a TVVAR(1) in the augmented space $Y_i = (X_i^\top, \dots, X_{i-p+1}^\top)^\top$:

$Y_i = B_i Y_{i-1} + \varepsilon_i, \qquad B_i = \begin{pmatrix} A_{i,1} & A_{i,2} & \cdots & A_{i,p} \\ I & 0 & \cdots & 0 \\ \vdots & \ddots & & \vdots \\ 0 & \cdots & I & 0 \end{pmatrix}, \qquad \varepsilon_i = (e_i^\top, 0, \dots, 0)^\top,$

where $I$ is the $d \times d$ identity matrix and $0$ is the $d \times d$ zero matrix. Then, we only need to estimate the first $d$ rows of $B_i$.
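As a concrete sketch of this companion-form embedding, the following Python snippet (function names are our own, not from the paper) stacks the lag matrices of a VAR(p) into the transition matrix of the equivalent VAR(1):

```python
import numpy as np

def companion_matrix(A_list):
    """Stack the lag matrices A_1, ..., A_p of a VAR(p) into the
    (d*p) x (d*p) companion matrix of the equivalent VAR(1)."""
    p = len(A_list)
    d = A_list[0].shape[0]
    top = np.hstack(A_list)              # first d rows: [A_1 ... A_p]
    eye = np.eye(d * (p - 1))            # shifts the lagged blocks down
    zero = np.zeros((d * (p - 1), d))
    return np.vstack([top, np.hstack([eye, zero])])

# A VAR(2) in d = 2 dimensions
A1 = np.array([[0.5, 0.1], [0.0, 0.4]])
A2 = np.array([[0.2, 0.0], [0.1, 0.1]])
C = companion_matrix([A1, A2])           # 4 x 4 companion matrix
```

Only the first $d$ rows of the companion matrix carry unknown coefficients; the remaining rows are the fixed identity and zero blocks.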
To make the estimation problem feasible for high-dimensional time series when $d$ is large, it is crucial to carefully regularize the coefficient matrices $A_i$. The main idea is to exploit certain low-dimensional structures in $A_i$, so that the degrees of freedom of (1.1) can be dramatically reduced. In our problem, we assume two key structures. First, since our goal is to estimate a sequence of transition matrices, which can be viewed as the discrete version of a matrix-valued function $A(\cdot)$, it is natural to impose a smoothness condition on $A(\cdot)$. In this case, (1.1) is closely related to the locally stationary processes, a general class of nonstationary processes proposed by [[15]]. Examples of other linear and nonlinear locally stationary processes can be found in [[11]]. In particular, let $u \in (0,1)$ be any time point. Then, under suitable regularity conditions [[16]], there exists a stationary process $\{\tilde X_i(u)\}$ such that for all $1 \le j \le d$,

$\|X_{i,j} - \tilde X_{i,j}(u)\|_2 = O\!\left(\left|\tfrac{i}{n} - u\right| + \tfrac{1}{n}\right),$

where $X_{i,j}$ and $\tilde X_{i,j}(u)$ are the $j$th elements of $X_i$ and $\tilde X_i(u)$, respectively. Therefore, $(X_i)$ is an approximately stationary VAR process in a small neighborhood of $u$.
Second, at each time point $u$, we need to estimate a $d \times d$ matrix. There have been different structural options in the literature, such as the sparse [[21], [35], [33], [30]], banded [[19]], and low-rank [[22]] transition matrix, all of which consider only the stationary VAR model. In this paper, we consider a sequence of sparse transition matrices, and we allow the sparsity patterns to change over time. Note that our problem is also different from the emerging literature on estimating the high-dimensional covariance matrix and its related functionals for time series data [[11], [12], [37], [5]], since our goal is to directly estimate the data generation mechanism specified by the transition matrices.
To simultaneously address the two issues, we propose a hybrid method combining nonparametric smoothing and regularization to estimate the sparse transition matrix of the locally stationary VAR. The proposed method is equivalent to solving a large number of linear programs, and therefore the estimates can be efficiently obtained using high-performance (i.e. parallel) computing; see Section 2 for details. In Section 3, we establish the rate of convergence under suitable assumptions on the smoothness of the covariance function and the sparsity of the transition matrix function. Specifically, the dimension is permitted to increase sub-exponentially fast in the sample size while consistent estimation is maintained. In addition, we also prove that when our estimator is followed by thresholding, the type I and type II errors in the pattern recovery asymptotically vanish in the presence of weak signals. In contrast with the existing literature on consistent model selection, we do not require the minimum nonzero signal strength to be bounded away from zero. Simulation studies in Section 4 and two real data examples in Section 5 demonstrate favorable performance and more interpretable results of the proposed TVVAR model. Technical proofs are deferred to Section 6.
We fix the notation that will be used in the rest of the paper. Let $M = (m_{jk})$ be a generic matrix and $S, T$ be index subsets; $M_{S,T}$ denotes the submatrix of $M$ with rows and columns indexed by $S$ and $T$, respectively. Let $\rho(M)$ be the spectral norm of $M$; $|M|_\infty = \max_{j,k} |m_{jk}|$ and $|M|_F$ are the entrywise maximum and Frobenius norms of $M$, respectively; $|M|_{L_1} = \max_k \sum_j |m_{jk}|$ and $|M|_{L_\infty} = \max_j \sum_k |m_{jk}|$ are the matrix $\ell_1$ and $\ell_\infty$ norms of $M$, respectively. $\mathrm{supp}(M) = \{(j,k) : m_{jk} \ne 0\}$ is the support of $M$, and $|\mathrm{supp}(M)|$ is the number of nonzeros in the support set. For $q > 0$ and a random variable (r.v.) $X$, $\|X\|_q = (\mathbb{E}|X|^q)^{1/q}$, and we say that $X \in \mathcal{L}^q$ iff $\|X\|_q < \infty$. Write $a_n \lesssim b_n$ if $a_n \le C b_n$ for some constant $C > 0$. Denote $a \vee b = \max(a, b)$ and $a \wedge b = \min(a, b)$. If $a$ and $b$ are vectors, then the maximum and minimum are interpreted elementwise.
2 Method and algorithm
For $u \in [0,1]$, let $\Sigma_0(u) = \mathbb{E}[\tilde X_i(u)\, \tilde X_i(u)^\top]$ and $\Sigma_1(u) = \mathbb{E}[\tilde X_i(u)\, \tilde X_{i-1}(u)^\top]$. Then

(2.1) $\Sigma_1(u) = A(u)\, \Sigma_0(u).$

Therefore, for any estimator, say $\check A(u)$, that is reasonably close to the true coefficient matrix $A(u)$, we must have $\check A(u)\, \Sigma_0(u) \approx \Sigma_1(u)$. A naive estimator for $A(u)$ could be constructed by inverting the sample version of (2.1). Estimators of this kind do not have good statistical properties in high dimensions because of ill-conditioning, and the dependence information in the data is not directly used in estimation. If $A(u)$ is known to be sparse a priori, we may consider the following constrained minimization program

(2.2) $\min_{M \in \mathbb{R}^{d \times d}} |M|_1$ subject to $|\Sigma_0(u)\, M - \Sigma_1(u)^\top|_\infty \le \lambda.$

Because $\Sigma_0(u)$ and $\Sigma_1(u)$ are unknown, the solution of (2.2) is an oracle estimator and therefore it is not implementable in practice.
Let $A(u)$, $u \in [0,1]$, be the continuous version of $A_i$, $i = 1, \dots, n$, in (1.1), and fix a time point $u$. To estimate $A(u)$, we first estimate $\Sigma_0(u)$ and $\Sigma_1(u)$ in (2.1) by their empirical versions. Let

(2.3) $\hat\Sigma_0(u) = \sum_{i=2}^n w_i(u)\, X_{i-1} X_{i-1}^\top \quad \text{and} \quad \hat\Sigma_1(u) = \sum_{i=2}^n w_i(u)\, X_i X_{i-1}^\top$

be the kernel-smoothed estimators of $\Sigma_0(u)$ and $\Sigma_1(u)$, where $w_i(u)$ are nonnegative weights. Here, we consider the Nadaraya-Watson smoothed estimator

(2.4) $w_i(u) = K\!\left(\frac{i/n - u}{b_n}\right) \Big/ \sum_{i'=2}^n K\!\left(\frac{i'/n - u}{b_n}\right),$

where $K(\cdot)$ is a nonnegative bounded kernel function with support in $[-1, 1]$ such that $\int_{-1}^{1} K(x)\, dx = 1$, and $b_n$ is the bandwidth parameter. We assume throughout the paper that the bandwidth satisfies the natural conditions $b_n \to 0$ and $n b_n \to \infty$. Then, our estimator $\hat A(u)$ is defined as the transpose of the solution of

(2.5) $\min_{M \in \mathbb{R}^{d \times d}} |M|_1$ subject to $|\hat\Sigma_0(u)\, M - \hat\Sigma_1(u)^\top|_\infty \le \lambda.$
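A minimal sketch of the kernel-smoothed covariance estimates in (2.3)-(2.4), assuming an Epanechnikov kernel; the function names are ours, not the paper's:

```python
import numpy as np

def epanechnikov(x):
    """Epanechnikov kernel, supported on [-1, 1]."""
    return np.where(np.abs(x) <= 1, 0.75 * (1 - x**2), 0.0)

def smoothed_covariances(X, u, bn):
    """Nadaraya-Watson weighted estimates of Sigma_0(u) and Sigma_1(u)
    from an n x d data array X, at rescaled time u with bandwidth bn."""
    n, d = X.shape
    i = np.arange(1, n)                  # lagged pairs (X_i, X_{i-1})
    w = epanechnikov((i / n - u) / bn)
    w = w / w.sum()                      # normalized weights as in (2.4)
    S0 = np.einsum('i,ij,ik->jk', w, X[i - 1], X[i - 1])
    S1 = np.einsum('i,ij,ik->jk', w, X[i], X[i - 1])
    return S0, S1

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))        # iid noise: Sigma_0(u) is the identity
S0, S1 = smoothed_covariances(X, u=0.5, bn=0.2)
```

For locally stationary data, only observations within the kernel window around the rescaled time $u$ receive nonzero weight, so the estimates track the time-varying covariances.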
Observe that (2.5) is equivalent to the following $d$ optimization subproblems

(2.6) $\min_{\beta \in \mathbb{R}^d} |\beta|_1$ subject to $|\hat\Sigma_0(u)\, \beta - (\hat\Sigma_1(u)^\top)_{\cdot j}|_\infty \le \lambda, \qquad j = 1, \dots, d,$

in that the solution of (2.5) is obtained by assembling the columnwise solutions of (2.6). Since the subproblems (2.6) can be solved independently, we can efficiently compute the solution of (2.5) by parallelizing the optimizations in (2.6). In addition, each subproblem in (2.6) can be recast as a linear program (LP) in the nonnegative variables $(\beta^+, \beta^-)$,

$\min \mathbf{1}^\top (\beta^+ + \beta^-)$ subject to $-\lambda \mathbf{1} \le \hat\Sigma_0(u)\, (\beta^+ - \beta^-) - (\hat\Sigma_1(u)^\top)_{\cdot j} \le \lambda \mathbf{1}, \quad \beta^+, \beta^- \ge 0,$

so that $\hat\beta = \hat\beta^+ - \hat\beta^-$, where $(\hat\beta^+, \hat\beta^-)$ is the nonnegative solution of the LP. Note that (2.6) can be efficiently solved by the simplex algorithm [[26]]. Compared with the estimation of a sparse transition matrix for the stationary VAR model [[21]], our estimator requires the sample versions of both sides of the Yule-Walker equations (2.1) in (2.5) due to the nonstationarity, and therefore two-sided constraints on the estimator are needed in (2.5).
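Each column subproblem (2.6) is a small Dantzig-selector-type LP. The following sketch solves it with scipy.optimize.linprog (using the HiGHS solver rather than the plain simplex method cited above; names and interface are ours):

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_column(S0, s1j, lam):
    """Solve one column subproblem of (2.6):
        min |b|_1  subject to  |S0 b - s1j|_inf <= lam,
    recast as an LP in the nonnegative split b = b_plus - b_minus."""
    d = S0.shape[0]
    c = np.ones(2 * d)                   # objective: sum(b_plus) + sum(b_minus)
    M = np.hstack([S0, -S0])
    A_ub = np.vstack([M, -M])            # encodes the two-sided constraint
    b_ub = np.concatenate([lam + s1j, lam - s1j])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    x = res.x
    return x[:d] - x[d:]

S0 = np.eye(3)                           # well-conditioned toy input
s1j = np.array([0.5, 0.0, 0.0])          # one column of Sigma_1(u)^T
beta = dantzig_column(S0, s1j, lam=0.1)
```

Assembling the $d$ column solutions and transposing gives the estimate of $A(u)$; with $\hat\Sigma_0 = I$, the LP reduces to soft thresholding of $s_{1j}$ at level $\lambda$.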
3 Asymptotic properties
3.1 Locally stationary VAR process in $\mathcal{L}^2$
To establish an asymptotic theory for estimating the continuous matrix-valued transition function of the nonstationary VAR, it is more convenient to model the time series as realizations from a continuous $d$-dimensional process. Following the framework of [[17]], $X_1, \dots, X_n$ are viewed as realizations of a continuous $d$-dimensional process on the discrete grid $\{i/n\}_{i=1}^n$, and the temporal dependence of the process is carried by the rescaled time indices $i/n$.
Definition 3.1 (Locally stationary VAR process in $\mathcal{L}^2$).
A mean-zero nonstationary VAR process of the form (1.1) is said to be locally stationary in $\mathcal{L}^2$, or weakly locally stationary, if for every $u \in [0,1]$ and $i = 1, \dots, n$, there exists a mean-zero stationary VAR process $\{\tilde X_i(u)\}$ given by

(3.1) $\tilde X_i(u) = A(u)\, \tilde X_{i-1}(u) + e_i$

such that for all $1 \le j \le d$,

(3.2) $\|X_{i,j} - \tilde X_{i,j}(u)\|_2 \le C \left( \left| \frac{i}{n} - u \right| + \frac{1}{n} \right),$

and the constant $C$ does not depend on $i$ and $u$.
Note that the approximating stationary VAR process $\{\tilde X_i(u)\}$ in Definition 3.1 generally depends on $u$. As suggested by the Yule-Walker equations (2.1), given a fixed time of interest $u$, estimation of $A(u)$ relies on an estimate of $\Sigma_0(u)$. Consider the values $\Sigma_0(i/n) = \mathbb{E}(X_i X_i^\top)$ and let $\Sigma_0(\cdot)$ be the matrix-valued covariance function at lag zero. With a finite number of observations $X_1, \dots, X_n$, it is unclear how to define the covariance function off the grid points $i/n$. Nevertheless, the weakly locally stationary VAR processes provide a natural framework for extending $\Sigma_0(\cdot)$ to all $u \in [0,1]$.

Since $A(\cdot)$ is continuous on $[0,1]$, the stationary VAR process $\tilde X_i(u)$ in (3.1) is defined for all $u \in [0,1]$. Let $\tilde\Sigma_0(u) = \mathbb{E}[\tilde X_i(u)\, \tilde X_i(u)^\top]$. By (3.2) and the Cauchy-Schwarz inequality, we have for each $i$

$\left| \Sigma_0(i/n) - \tilde\Sigma_0(u) \right|_\infty \le C \left( \left| \frac{i}{n} - u \right| + \frac{1}{n} \right),$

where the constant $C$ here is uniform in $i$ and $u$. Therefore, letting $i/n \to u$ and by the continuity of $\tilde\Sigma_0(\cdot)$, we can extend $\Sigma_0(u)$ to all $u \in [0,1]$. A similar extension can be done for the lag-one covariance function $\Sigma_1(u)$. In Section 3.2, it will be shown that the asymptotic theory of estimating $A(u)$ depends on the smoothness of the weakly locally stationary VAR processes only through the smoothness of $\Sigma_0(\cdot)$ and $\Sigma_1(\cdot)$, and therefore of $A(\cdot)$.
3.2 Rate of convergence
In this section, we characterize the rate of convergence of our estimator (2.5) under various matrix norms. To study the asymptotic properties of the proposed estimator, we make the following assumptions on model (1.1).
Assumption 1. The coefficient matrices are sparse: for some $q \in [0, 1)$ and each $u \in [0,1]$,

(3.3) $A(u) \in \mathcal{M}(q, s_n)$, where $\mathcal{M}(q, s_n) = \Big\{ M \in \mathbb{R}^{d \times d} : \max_{j} \sum_{k} |m_{jk}|^q \le s_n \ \text{and} \ \max_{k} \sum_{j} |m_{jk}|^q \le s_n \Big\}.$

Assumption 2. The marginal and lag-one covariance matrix processes $\Sigma_0(u)$ and $\Sigma_1(u)$ are smooth functions in $u$: $\Sigma_0, \Sigma_1 \in \mathcal{C}^2([0,1])$, where $\mathcal{C}^2([0,1])$ is the class of functions defined on $[0,1]$ that are twice differentiable with bounded derivatives uniformly in $u$.

Assumption 3. The random innovations $e_i$ have iid components and sub-Gaussian tails: there exists a constant $C > 0$ such that $\mathbb{P}(|e_{i,j}| \ge t) \le 2 \exp(-t^2 / C)$ for all $t \ge 0$.
Before proceeding, we discuss the above assumptions. Condition (3.3) requires that the transition matrices be sparse in both columns and rows at all time points. A similar matrix class defined by (3.3) was first proposed in [[7]] for symmetric matrices and has been widely used for estimating high-dimensional covariance and precision matrices; see e.g. [[9], [11]]. If $q = 0$, then the maximum number of nonzeros in the columns and rows of $A(u)$ is at most $s_n$. Assumption 2 requires that the marginal and lag-one covariance matrices evolve smoothly in time. The smoothness is not defined directly on $A(\cdot)$ for the ease of theorem statements. In view of (3.2), Assumption 2 is implied by the smoothness of $A(\cdot)$ under extra regularity conditions. For a generic matrix $M(u)$ parameterized by $u \in [0,1]$, we let $\dot M(u)$ and $\ddot M(u)$ be the first two elementwise derivatives of $M(u)$ w.r.t. $u$.
Lemma 3.1.
Assume that . Then we have
(3.4) 
and
(3.5)  
By Lemma 3.1, if $A(\cdot)$ is twice differentiable with uniformly bounded derivatives and the same holds for $\Sigma_0(\cdot)$, then $\Sigma_1(\cdot)$ is also a $\mathcal{C}^2$ function and therefore Assumption 2 is fulfilled.
Assumption 3 specifies the tail probability of the innovations $e_i$. In [[21]], the $e_i$'s are iid Gaussian with some error covariance matrix $\Sigma_e$. A simple transformation by $\Sigma_e^{-1/2}$ reduces this to the case where $e_i$ has iid components with the standard normal distribution, a special case of Assumption 3, which covers general sub-Gaussian innovations.
Theorem 3.2.
Fix a time point $u$. Suppose that Assumptions 1-3 are satisfied. Then, with high probability, the estimator with a suitably chosen tuning parameter $\lambda$ obeys
(3.6)  
(3.7)  
(3.8) 
From Theorem 3.2, the bandwidth $b_n \asymp (\log d / n)^{1/5}$ gives the optimal rate of convergence, with the resulting tuning parameter $\lambda \asymp (\log d / n)^{2/5}$. When $d$ is fixed, it is known that the optimal bandwidth in nonparametric kernel density estimation is of order $n^{-1/5}$ for twice continuously differentiable densities under the mean integrated squared error. So in the high-dimensional context, the dimension only has a logarithmic impact on the choice of the optimal bandwidth.
3.3 Pattern recovery
We also study the recovery of the time-varying sparsity patterns using the estimator (2.5). Let $S(u) = \{(j,k) : a_{jk}(u) \ne 0\}$ be the set of nonzero positions of $A(u)$. If the nonzero entries of $A(u)$ are small enough, then it is impossible to accurately distinguish the small nonzeros from the zeros. Therefore, the best we can hope for is that the nonzero entries of $A(u)$ with large magnitudes can be well separated from the zeros. Let

(3.9) $\tau \asymp \lambda,$

where $\lambda$ is the tuning parameter determined in Theorem 3.2. We use the thresholded version of (2.5) as an estimator of $S(u)$:

(3.10) $\tilde A(u) = \big( \hat a_{jk}(u)\, \mathbf{1}\{ |\hat a_{jk}(u)| \ge \tau \} \big)_{1 \le j, k \le d}.$

By Theorem 3.2, the maximal elementwise fluctuation $|\hat A(u) - A(u)|_\infty$ is controlled by the threshold level with high probability. So if we apply a thresholding procedure to $\hat A(u)$ at the level $\tau$, then we expect that the zeros, and the nonzeros with magnitudes larger than $2\tau$, in $A(u)$ can be identified by the thresholded support of $\tilde A(u)$. Precisely, we have the following instantaneous partial recovery consistency.
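The thresholding step (3.10) is a simple entrywise hard threshold; a sketch (names ours):

```python
import numpy as np

def threshold_estimate(A_hat, tau):
    """Entrywise hard thresholding of the estimated transition matrix:
    keep entries with magnitude at least tau, set the rest to zero."""
    return np.where(np.abs(A_hat) >= tau, A_hat, 0.0)

A_hat = np.array([[0.50, 0.05], [-0.20, 0.01]])
A_tilde = threshold_estimate(A_hat, tau=0.1)   # small entries are zeroed
```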
Theorem 3.3.
Assume that Assumptions 1-3 are satisfied. Then, for any fixed time point $u \in (0,1)$, with probability tending to one, $\mathrm{supp}(\tilde A(u)) \subseteq S(u)$ and every entry of $A(u)$ with magnitude above the threshold level belongs to $\mathrm{supp}(\tilde A(u))$.
Theorem 3.3 states that, with high probability, the zeros in $A(u)$ can be identified and the nonzero entries in $A(u)$ with signal strength above the threshold can be recovered by $\tilde A(u)$. Therefore, the false positives (type I errors) of the estimator (3.10) are asymptotically controlled; see Theorem 3.5 for a precise statement. However, Theorem 3.3 does not provide much information regarding the false negatives, since there is no characterization of the signal strength in $A(u)$.
Let and . We introduce the following matrix class
(3.11) 
For matrices in this class, the two parameters control the proportion of small entries in the support. If the signal-strength parameter is large and the other grows slowly, then the fraction of weak signals is small and therefore the false negatives (type II error) can also be well controlled. Below, we give such an example.
Example 3.1 (A spatial design).
Consider a symmetric matrix that is generated by the covariance function of a spatial process, i.e. a random vector observed at $d$ sites in some spatial domain. Assume that the covariance between the observations at two sites satisfies
(3.12) 
where is the distance between the sites and
(3.13) 
and is a real-valued covariance function. Here, we consider the rational quadratic covariance function [[11], [36]]
(3.14) 
In this example, the covariance vanishes once the distance between the sites exceeds a cutoff, so the matrix is banded.
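The construction in Example 3.1 can be sketched as follows; the parameterization of the rational quadratic covariance and the banding cutoff below are plausible choices of ours, since the exact constants in (3.12)-(3.14) are not reproduced here:

```python
import numpy as np

def rational_quadratic(h, rho):
    """Rational quadratic covariance function (one common parameterization)."""
    return 1.0 / (1.0 + (h / rho) ** 2)

def spatial_covariance(d, rho, band):
    """Covariance matrix of a process observed at sites 1..d on a line,
    with entries zeroed beyond |j - k| > band, so the matrix is banded
    as in Example 3.1."""
    idx = np.arange(d)
    H = np.abs(idx[:, None] - idx[None, :])   # pairwise site distances
    S = rational_quadratic(H, rho)
    S[H > band] = 0.0
    return S

S = spatial_covariance(d=6, rho=2.0, band=2)
```

Entries decay polynomially with the site distance inside the band, which produces exactly the mix of strong diagonal signals and weak off-diagonal signals discussed below.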
Lemma 3.4.
For the matrix in Example 3.1, there exists a large enough constant, depending only on the covariance parameters, such that the matrix belongs to the class (3.11), where
(3.15) 
(3.16) 
and .
For any fixed distance parameter, the weak signal parameter has a natural dependence on the decay of the covariance function. If the decay exponent is smaller, then the covariance function decays to zero more slowly and there is a smaller fraction of weak signals in the support. This allows the weak-signal parameter to grow slowly. Note that the class (3.11) is much less stringent than the widely used condition for support recovery and model selection in the literature, which requires that the minimal nonzero signal strength be uniformly bounded away from zero [[28]]. To quantify the error in the pattern recovery, we use the following two error rate measures.
Definition 3.2.
The false positive rate (FPR) and false negative rate (FNR) of $\tilde A = \tilde A(u)$ are defined as

(3.17) $\mathrm{FPR} = \frac{\#\{(j,k) : \tilde a_{jk} \ne 0 \text{ and } a_{jk} = 0\}}{\#\{(j,k) : a_{jk} = 0\}}, \qquad \mathrm{FNR} = \frac{\#\{(j,k) : \tilde a_{jk} = 0 \text{ and } a_{jk} \ne 0\}}{\#\{(j,k) : a_{jk} \ne 0\}}.$

By convention, if $\#\{(j,k) : a_{jk} = 0\} = 0$, then $\mathrm{FPR} = 0$; if $\#\{(j,k) : a_{jk} \ne 0\} = 0$, then $\mathrm{FNR} = 0$.
If $\mathrm{supp}(\tilde A(u)) = \mathrm{supp}(A(u))$ with probability tending to one, which is a very strong requirement, then we have the exact pattern recovery consistency $\mathrm{FPR} = \mathrm{FNR} = 0$. Below, we show that both the FPR and FNR are asymptotically controlled in the presence of weak signals.
Theorem 3.5.
Assume that Assumptions 1-3 are satisfied and fix a time point. If the sparsity condition holds, then $\mathrm{FPR} \to 0$ in probability; if, in addition, the fraction of weak signals is suitably controlled, then we also have $\mathrm{FNR} \to 0$ in probability.
Since , the FNR vanishes with probability tending to one if .
4 Simulation studies
In this section, we present numerical results on simulated datasets. We compare the following five methods:

(i) our TVVAR estimator (2.5);
(ii) the stationary vector autoregressive model (Stat. VAR) [[21]];
(iii) the time-varying lasso method [[23]];
(iv) the time-varying ridge method [[20]];
(v) the time-varying maximum likelihood estimator (MLE).
Methods (iii), (iv) and (v) are extensions of the lasso [[23]], ridge regression [[20]] and the MLE to the time-varying setting by kernel smoothing. We include (ii), i.e. the stationary VAR model, as a baseline method that ignores the dynamic features in the transition matrices. Method (iii) is a direct competitor of (i). We solve (iii) with FISTA [[6]], and we solve (i) and (ii) with the simplex algorithm [[26]].
4.1 Data generation
We consider different setups of the sample size, the dimension and the graph structure. For each setup, the data are generated by the following procedure. First, the baseline coefficient matrices $A(0)$ and $A(1)$ are generated using sugm.generator() in the flare R package [[24]]. We consider four graph structures defined in flare: hub, cluster, band and random. Examples of these four structures are shown in Appendix B. Then we normalize $A(0)$ and $A(1)$, and smoothly interpolate the intermediate values.
Following [[21]], we specify the innovation covariance matrix as in that paper; [[21]] showed that its particular choice does not affect the numerical performance significantly. In our simulation studies, we use the Epanechnikov kernel $K(x) = \frac{3}{4}(1 - x^2)_+$ and fix the model order $p = 1$. The bandwidth is set to a common value for (i), (iii) and (iv).
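A sketch of the interpolation and normalization steps; the linear path between $A(0)$ and $A(1)$ and the spectral-radius target are our own assumptions, since the exact scheme is not reproduced here:

```python
import numpy as np

def normalize_spectral(A, target=0.8):
    """Rescale A so its spectral radius equals `target` (< 1), keeping the
    interpolated VAR process stable."""
    r = np.max(np.abs(np.linalg.eigvals(A)))
    return A * (target / r)

def interpolate_transition(A0, A1, u):
    """Smoothly interpolate between the baseline matrices A(0) and A(1);
    a linear path is one simple choice consistent with Assumption 2."""
    return (1 - u) * A0 + u * A1

A_start = normalize_spectral(np.array([[2.0, 0.0], [0.0, 1.0]]))
A_end   = normalize_spectral(np.array([[1.0, 0.5], [0.5, 1.0]]))
A_mid   = interpolate_transition(A_start, A_end, 0.5)
```

Any interpolation path that is twice differentiable in $u$ would serve equally well for the smoothness assumption.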
4.2 Tuning parameter selection
For the tuning parameter selection in (i)-(iv), we propose a data-driven procedure that minimizes the one-step-ahead prediction errors as follows.

1. Choose a grid of candidate tuning parameters and the number of training data points.
2. For each candidate value, perform one-step-ahead prediction on the testing set: estimate the transition matrix from the data up to the current time point, predict the next observation, and record the squared prediction error at that time.
3. Average the prediction errors over the last testing time points.
4. Select the tuning parameter that minimizes the average error.
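The steps above can be sketched as follows; the fit callback stands in for any of the estimators (i)-(iv), and its interface is our assumption:

```python
import numpy as np

def select_lambda(X, lambdas, n_test, fit):
    """Select the tuning parameter minimizing the average one-step-ahead
    prediction error over the last n_test time points.  `fit(X_past, u, lam)`
    is any routine returning the estimated transition matrix at rescaled
    time u (e.g. the solution of (2.5))."""
    n = X.shape[0]
    avg_err = []
    for lam in lambdas:
        err = 0.0
        for t in range(n - n_test, n):
            A_hat = fit(X[:t], t / n, lam)   # estimate from data before time t
            err += np.sum((X[t] - A_hat @ X[t - 1]) ** 2)
        avg_err.append(err / n_test)
    return lambdas[int(np.argmin(avg_err))]

# Toy check: data following X_t = 0.5 * X_{t-1} with a mock fit routine
X = np.ones((20, 2))
for t in range(1, 20):
    X[t] = 0.5 * X[t - 1]
best = select_lambda(X, [0.1, 0.5, 0.9], n_test=5,
                     fit=lambda Xp, u, lam: lam * np.eye(2))
```

Rolling the estimation forward through the testing window lets the criterion reward estimators that track the time-varying transition matrix.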
4.3 Comparison results
Estimation errors
For each setup, we run 100 simulations, and in each simulation we estimate the transition matrices at a set of fixed time points. We then calculate the estimation errors under the spectral, entrywise maximum and Frobenius norms, average the errors over the evaluated time points, and finally report the averages (with standard errors) over the 100 simulations. The results are shown in Tables 1, 2, 3 and 4.
From Tables 1 to 4, a larger $d$ often results in larger errors. In general, the unstructured MLE performs the worst under almost all matrix norms. Although the ridge method shrinks the coefficients of the transition matrix toward zero, those coefficients are not exactly zero. Thus, the transition matrices estimated by the ridge method are not sparse, which leads to higher estimation errors compared with the TVVAR, the time-varying lasso and the stationary VAR. The stationary VAR always performs worse than the TVVAR and the time-varying lasso, since it cannot capture the dynamic structure. The time-varying lasso performs better than all other methods except the TVVAR. The proposed TVVAR performs the best under almost all matrix norms.
Pattern recovery
For the TVVAR, we also report the FPR and FNR in (3.17) over a grid of 30 possible tuning parameters ranging from 0.001 to 0.45, and we calculate the averaged FPR and FNR. Following [[9]], we treat entries with magnitude above a small numerical tolerance as nonzero. The ROC curves for all possible values of the sparsity control parameter are plotted in Figures (a)-(d). Based on the ROC curves, the TVVAR method has better discrimination power for band or random patterns than for hub or cluster patterns.
We also calculate the error rates using the true value of the threshold in (3.9); the resulting ROC curves, shown in Appendix C, are similar to those based on the data-driven choice.
d=20  

TVVAR  Stat. VAR  Lasso  Ridge  MLE  
0.407  1.102  0.453  1.530  2.770  
(0.083)  (0.248)  (0.068)  (0.075)  (0.205)  
0.395  1.747  0.446  1.532  2.762  
(0.088)  (0.516)  (0.076)  (0.068)  (0.255)  
0.329  0.778  0.348  0.604  1.157  
(0.038)  (0.147)  (0.032)  (0.022)  (0.08)  
0.239  0.292  0.228  0.342  0.568  
(0.011)  (0.032)  (0.011)  (0.008)  (0.02)  
d=30  
TVVAR  Stat. VAR  Lasso  Ridge  MLE  
0.643  1.154  0.695  2.116  4.350  
(0.125)  (0.295)  (0.111)  (0.087)  (0.302)  
0.694  2.197  0.732  2.102  4.302  
(0.135)  (0.652)  (0.109)  (0.073)  (0.302)  
0.430  0.846  0.443  0.710  1.607  
(0.046)  (0.141)  (0.04)  (0.024)  (0.112)  
0.245  0.288  0.244  0.393  0.729  
(0.013)  (0.026)  (0.014)  (0.006)  (0.02)  
d=40  
TVVAR  Stat. VAR  Lasso  Ridge  MLE  
0.711  1.221  0.800  2.594  6.183  
(0.103)  (0.256)  (0.095)  (0.079)  (0.426)  
0.812  2.868  0.872  2.586  6.198  
(0.144)  (0.775)  (0.126)  (0.082)  (0.406)  
0.460  0.958  0.486  0.793  2.168  
(0.041)  (0.124)  (0.035)  (0.024)  (0.142)  
0.253  0.296  0.259  0.429  0.907  
(0.011)  (0.02)  (0.011)  (0.006)  (0.021)  
d=50  
TVVAR  Stat. VAR  Lasso  Ridge  MLE  
0.827  1.158  0.944  3.043  8.478  
(0.124)  (0.219)  (0.102)  (0.095)  (0.633)  
1.019  3.574  1.072  3.055  8.558  
(0.19)  (1.021)  (0.135)  (0.098)  (0.65)  
0.504  1.052  0.531  0.863  2.874  
(0.045)  (0.157)  (0.032)  (0.021)  (0.224)  
0.266  0.297  0.278  0.454  1.108  
(0.014)  (0.019)  (0.012)  (0.006)  (0.027) 
d=20  

TVVAR  Stat. VAR  Lasso  Ridge  MLE  
0.400  1.093  0.465  1.527  2.736  
(0.073)  (0.300)  (0.084)  (0.081)  (0.202)  
0.383  1.695  0.448  1.527  2.731  
(0.074)  (0.506)  (0.077)  (0.083)  (0.251)  
0.324  0.774  0.352  0.601  1.133  
(0.032)  (0.143)  (0.033)  (0.025)  (0.084)  
0.239  0.291  0.228  0.341  0.562  
(0.011)  (0.03)  (0.011)  (0.008)  (0.019)  
d=30  
TVVAR  Stat. VAR  Lasso  Ridge  MLE  
0.557  1.127  0.631  2.095  4.305  
(0.096)  (0.285)  (0.097)  (0.083)  (0.308)  
0.589  2.236  0.664  2.120  4.343  
(0.132)  (0.659)  (0.110)  (0.093)  (0.297)  
0.399  0.848  0.424  0.710  1.607  
(0.045)  (0.135)  (0.038)  (0.024)  (0.112)  