A Detailed setup of the simulation studies

# Sparse transition matrix estimation for high-dimensional and locally stationary vector autoregressive models

## Abstract

We consider the estimation of the transition matrix in the high-dimensional time-varying vector autoregression (TV-VAR) models. Our model builds on a general class of locally stationary VAR processes that evolve smoothly in time. We propose a hybridized kernel smoothing and -regularized method to directly estimate the sequence of time-varying transition matrices. Under the sparsity assumption on the transition matrix, we establish the rate of convergence of the proposed estimator and show that the convergence rate depends on the smoothness of the locally stationary VAR processes only through the smoothness of the transition matrix function. In addition, for our estimator followed by thresholding, we prove that the false positive rate (type I error) and false negative rate (type II error) in the pattern recovery can asymptotically vanish in the presence of weak signals without assuming the minimum nonzero signal strength condition. Favorable finite sample performances over the -penalized least-squares estimator and the unstructured maximum likelihood estimator are shown on simulated data. We also provide two real examples on estimating the dependence structures on financial stock prices and economic exchange rates datasets.

[
\kwd

2006 \volume0 \issue0 \firstpage1 \lastpage8 \startlocaldefs \endlocaldefs

\runtitle

High-dimensional VAR

{aug}\thankstext

t1Research partially supported by NSF grant DMS-1404891 and UIUC Research Board Award RB15004.

class=MSC] \kwd[Primary ]62M10 \kwd62H12 \kwd[; secondary ]91B84

vector autoregression \kwdtime-varying parameters \kwdlocally stationary processes \kwdkernel smoothing \kwdhigh-dimension \kwdsparsity

## 1 Introduction

Vector autoregression (VAR) is a basic tool in multivariate time series analysis and it has been extensively used to model the cross-sectional and serial dependence in various applications from economics and finance [[31], [3], [4], [10], [32], [14], [34], [13], [1]]. There are two fundamental limitations of the vector-autoregressive models. First, conventional methods to estimate the transition matrix of a VAR model are based on the least squares (LS) estimator and the maximum likelihood estimator (MLE), in which the parameter estimation is consistent when the sample size increases and the model size is fixed [[8], [25]]. Since the number of parameters grows quadratically in the number of time series variables, the VAR model typically includes no more than ten variables in many real applications [[18]]. However, due to the recent explosive data enrichment, analysis of panel data with a few hundred variables is often encountered, in which the LS and MLE are not suitable even for a moderate problem size. Second, stationarity plays a major role in the VAR model. Therefore, the stationary VAR does not capture the time-varying underlying data generation structures, which have been observed in a broad regime of applications in economics and finance [[2], [27]].

Motivated by the limitations of the VAR, this paper studies the estimation problems of the time-varying VAR (TV-VAR) model for high-dimensional time series data. Let be a sequence of -dimensional observations generated by a mean-zero TV-VAR of order 1 (TV-VAR(1))

 xi=A(i/n)xi−1+ei, (1.1)

where , is a matrix-valued deterministic function consisting of the transition matrices at evenly spaced time points and are independent and identically distributed (iid) mean-zero random errors, i.e. innovations. In this paper, our main focus is to estimate the transition matrices for the TV-VAR(1) and the extension to higher-order VAR is straightforward. Indeed, for a general TV-VAR of order

 xi=Ai,1xi−1+Ai,2xi−2+⋯+Ai,kxi−k+ei,

we can rewrite at time , as a TV-VAR(1) in the augmented space

 zi=Aizi−1+~ei,

where

 Ai=⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝Ai,1Ai,2⋯Ai,k−1Ai,kId×d0d×d⋯0d×d0d×d0d×dId×d⋯0d×d0d×d⋮⋮⋮⋮⋮0d×d0d×d⋯Id×d0d×d⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠,~ei=⎛⎜ ⎜ ⎜ ⎜⎝ei0d×1⋮0d×1⎞⎟ ⎟ ⎟ ⎟⎠,

is the identity matrix and is the zero matrix. Then, we need only to estimate the first rows in .

To make the estimation problem feasible for high-dimensional time series when is large, it is crucial to carefully regularize the coefficient matrices . The main idea is to use certain low-dimensional structures in and the degree of freedom of (1.1) can be dramatically reduced. In our problem, we assume two key structures in . First, since our goal is to estimate a sequence of transition matrices, which can be viewed as the discrete version of a matrix-valued function, it is natural to impose the smoothness condition to . In this case, (1.1) is closely related to the locally stationary processes, a general class of non-stationary processes proposed by [[15]]. Examples of other linear and nonlinear locally stationary processes can be found in [[11]]. In particular, let , be any time point and . Then, under suitable regularity conditions [[16]], there exists a stationary process such that for all ,

 |Xij−~Xij(t0)|=OP(|i/n−t0|+1/n),

where and are the -th element of and respectively, and

 ~xi(t0)=A(t0)~xi−1(t0)+ei.

Therefore, is an approximately stationary VAR process in a small neighborhood of .

Second, at each time point , we need to estimate a matrix. There have been different structural options in literature, such as the sparse [[21], [35], [33], [30]], banded [[19]], and low-rank [[22]] transition matrix, all of which only considered the stationary VAR model. In this paper, we consider a sequence of sparse transition matrices and we allow the sparsity patterns to change over time. Note that our problem is also different from the emerging literature for estimating the high-dimensional covariance matrix and its related functionals of time series data [[11], [12], [37], [5]], since our goal is to directly estimate the data generation mechanism specified by .

To simultaneously address the two issues, we propose a hybridized method of the nonparametric smoothing technique and -regularization to estimate the sparse transition matrix of the locally stationary VAR. The proposed method is equivalent to solving a sequence of large numbers of linear programs and therefore the estimates can be efficiently obtained by using the high-performance (i.e. parallel) computing technology; see Section 2 for details. In Section 3, we establish the rate of convergence under suitable assumptions on the smoothness of the covariance function and the sparsity of the transition matrix function . Specifically, the dimension is permitted to increase sub-exponentially fast in the sample size to obtain consistent estimation, i.e. . In addition, we also prove that when our estimator is followed by thresholding, type I and type II errors in the pattern recovery asymptotically vanish in the presence of weak signals. In contrast with the existing literature on consistent model selection, we do not require the minimum nonzero signal strength to be bounded away from zero. Simulation studies in Section 4 and two real data examples in Section 5 demonstrate favorable performances and more interpretable results of the proposed TV-VAR model. Technical proofs are deferred to Section 6.

We fix the notations that will be used in the rest of the paper. Let be a generic matrix, is an index subset, and be the submatrix of with columns and rows indexed by , resp. Denote . Let be the spectral norm of . and are the entry , and Frobenius norm of , resp. and are the matrix and norm of , resp. is the support of and is the number of nonzeros in the support set . For and a random variable (r.v.) , and we say that iff . Write . Denote and . If and are vectors, then the maximum and minimum are interpreted element-wise.

## 2 Method and algorithm

For , let . Then

 Σi−1,1=Σi−1,0A⊤iandΣi,−1=AiΣi−1,0. (2.1)

Therefore, for any estimator, say , that is reasonably close to the true coefficient matrices , we must have and . A naive estimator for would be constructed to invert the sample versions of (2.1). Estimators of this kind do not have good statistical properties in high-dimensions because of the ill-conditioning and the dependence information in is not directly used in estimation. If is known to be sparse as a priori, we may consider the following constrained minimization program

 minimizeΛ∈Rd×d |Λ|1 (2.2) subject to |Σi−1,1−Σi−1,0Λ|∞≤τ |Σi,−1−Λ⊤Σi−1,0|∞≤τ.

Because and are unknown, the solution of (2.2) is an oracle estimator and therefore it is not implementable in practice.

Let be the continuous version of , in (1.1), and fix a . To estimate , we first estimate and in (2.1) by their empirical versions. Let

 ^Σt,ℓ=n∑m=1w(t,m)xmx⊤m+ℓ,ℓ=−1,0,1, (2.3)

be the kernel smoothed estimator of and , where are the nonnegative weights. Here, we consider the Nadaraya-Watson smoothed estimator

 w(t,m)=K(t−m/nbn)∑nℓ=1K(t−ℓ/nbn), (2.4)

where is a nonnegative bounded kernel function with support in such that and is the bandwidth parameter. We assume throughout the paper that the bandwidth satisfies the natural conditions: and . Then, our estimator is defined as the transpose of the solution of

 minimizeΛ∈Rd×d |Λ|1 (2.5) subject to |^Σi−1,1−^Σi−1,0Λ|∞≤τ |^Σi,−1−Λ⊤^Σi−1,0|∞≤τ.

Observe that (2.5) is equivalent to the following optimization sub-problems

 ^\boldmathβj(t) = argminu∈Rd|u|1 (2.6) subject to |[^Σi−1,1]∗j−^Σi−1,0u|∞≤τ |[^Σi,−1]j∗−u⊤^Σi−1,0|∞≤τ

in that . Since the sub-problems (2.6) can be independently solved, we can efficiently compute the solution of (2.5) by parallelizing the optimizations of (2.6). In addition, each sub-problem in (2.6) can be recasted as a linear program (LP)

 minimizev,u∈Rd+ |v|1+|w|1 subject to −^Σi−1,0v+^Σi−1,0w≤τ−max{[^Σi−1,1]∗j,[^Σi−1,−1]⊤∗j} ^Σi,0v−^Σi−1,0w≤τ+min{[^Σi,1]∗j,[^Σi,−1]⊤∗j}

so that , where is the nonnegative solution of the LP. Note that (2.6) can be efficiently solved by the simplex algorithm [[26]]. Compared with the estimation of sparse transition matrix for the stationary VAR model [[21]], our estimator requires to solve two sample versions of Yule-Walker equations in (2.1) and (2.5) due to the non-stationarity and therefore two-sided constraints of the estimator are needed in (2.5).

## 3 Asymptotic properties

### 3.1 Locally stationary VAR process in L2

To establish an asymptotic theory for estimating the continuous matrix-valued transition function of the non-stationary VAR, it is more convenient to model the time series as realizations from a continuous -dimensional process. Following the framework of [[17]], are viewed as realizations of the continuous -dimensional process on the discrete grid , i.e. and the temporal dependence of is carried out on the rescaled time indices .

###### Definition 3.1 (Locally stationary VAR process in L2).

A mean-zero non-stationary VAR process in of the form (1.1) is said to be locally stationary in , or weakly locally stationary, if for every and , there exists a mean-zero stationary VAR process given by

 ~xm(t)=A(t)~xm−1(t)+em (3.1)

such that for all

 max1≤j≤d∥~Xmj(t)−Xmj∥≤C(∣∣mn−t∣∣+1n) (3.2)

and the constant does not depend on and .

Note that the approximating stationary VAR process in Definition 3.1 generally depends on . As suggested by the Yule-Walker equations (2.1), given a fixed time of interest, estimation of relies on an estimate of . Consider and let be the matrix-valued covariance function at lag-zero. With a finite number of observations , it is unclear how to define the covariance function off the points . Nevertheless, the weakly locally stationary VAR processes provide a natural framework for extending to for all .

Since is continuous on , the stationary VAR process in (3.1) is defined for all . Let . By (3.2) and the Cauchy-Schwarz inequality, we have for each

 |σjk(ti)−~σjk(ti)| = |E(XijXik)−E(~Xij(ti)~Xik(ti))| ≤ ∥Xij∥2⋅∥Xik−~Xik(ti)∥+∥~Xik(ti)∥⋅∥Xij−~Xij(ti)∥ ≤ (∥Xij∥2+∥~Xik(ti)∥)max1≤j≤d∥Xij−~Xij(ti)∥ ≤ Cn−1,

where the constant here is uniform in and . Therefore, we get

 Σ(ti)=~Σ(ti)+O(n−1).

Letting and by the continuity of , we can extend for all . Similar extension can be done for for . In Section 3.2, it will be shown that the asymptotic theory of estimating depends on the smoothness of the weakly locally stationary VAR processes only through the smoothness of and therefore .

### 3.2 Rate of convergence

In this section, we characterize the rate of convergence of our estimator (2.5) under various matrix norms. We assume . To study the asymptotic properties of the proposed estimator, we make the following assumptions on model (1.1).

1. The coefficient matrices are sparse: let , for each

 Ai∈Gα(s,Md)=G|α(s,Md)∩G−α(s,Md), (3.3)

where

 G|α(s,Md) = {Θ∈Rd×d:maxj≤dd∑ℓ=1|Θℓj|α≤s,|Θ|ℓ1≤Md}, G−α(s,Md) = Missing or unrecognized delimiter for \Big
2. The marginal and lag-one covariance matrix processes and are smooth functions in : , where is the class of functions defined on that are twice differentiable with bounded derivatives uniformly in .

3. The random innovations have iid components and sub-Gaussian tail: for all .

Before proceeding, we discuss the above assumptions. (3.3) requires that the transition matrices are sparse in both columns and rows at all time points. A similar matrix class defined by (3.3) was first proposed in [[7]] for symmetric matrices and it has been widely used for estimating high-dimensional covariance and precision matrix; see e.g. [[9], [11]]. If , then the maximum number of nonzeros in columns and rows of is at most . Assumption 2 requires that the marginal and lag-one covariance matrices evolve smoothly in time. The smoothness is not defined directly on for the ease of theorem statements. In view of (3.2), Assumption 2 is implied by the smoothness on under extra regularity conditions. For a generic matrix parameterized by , we let and be the first two element-wise derivatives of w.r.t. .

###### Lemma 3.1.

Assume that . Then we have

 |˙Σ(t)|∞≤2|˙A(t)|∞⋅|A(t)|ℓ1⋅|Σ(t)|ℓ11−|A(t)|2ℓ1 (3.4)

and

 |¨Σ(t)|∞ ≤ 8|˙A(t)|∞⋅|A(t)|2ℓ1⋅|˙A(t)|ℓ1⋅|Σ(t)|ℓ1(1−|A(t)|2ℓ1)2 (3.5) +4max{|˙A(t)|∞,|¨A(t)|∞}⋅max{|A(t)|ℓ1,|˙A(t)|ℓ1}⋅|Σ(t)|ℓ11−|A(t)|2ℓ1.

By Lemma 3.1, if and such that for some constant , then is also a function and therefore Assumption 2 is fulfilled.

Assumption 3 specifies the tail probability of the innovations . In [[21]], ’s follow iid for some error covariance matrix . A simple transformation by will reduce to the case that has iid components with the standard normal distribution, a special case of Assumption 3, which covers the sub-Gaussian innovations.

###### Theorem 3.2.

Let . Fix an . Suppose that Assumption 1,2,3 are satisfied, and . Then, with probability at least , we have the estimator with tuning parameter obeys

 |^Ai−Ai|∞ ≤ 2τ|Σ−1i−1,0|ℓ1, (3.6) ρ(^Ai−Ai) ≤ C(α)s(τ|Σ−1i−1,0|ℓ1)1−α, (3.7) d−1|^Ai−Ai|2F ≤ C(α)s(τ|Σ−1i−1,0|ℓ1)2−α. (3.8)

From Theorem 3.2, the bandwidth gives the optimal rate of convergence and the resulting tuning parameter . When is fixed, it is known that the optimal bandwidth in the nonparametric kernel density estimation is for twice continuously differentiable functions under the mean integrated square error. So in the high-dimensional context, the dimension only has a logarithm impact on the choice of optimal bandwidth.

### 3.3 Pattern recovery

We also study the recovery of time-varying patterns using the estimator (2.5). Let be the nonzero positions of . If the nonzero entries of are small enough, then it is impossible to accurately distinguish the small nonzeros and the zeros. Therefore, the best hope we can expect is that the nonzero entries of with large magnitudes can be well separated from the zeros in . Let

 u♯=2τ|Σ−1i−1,0|ℓ1, (3.9)

where is determined in Theorem 3.2. We use the thresholded version of (2.5) as an estimator of

 ^Si=Tu♯(^Ai)={1(|^Ai,jk|>u♯)}dj,k=1. (3.10)

By Theorem 3.2, the maximal fluctuation is controlled by with probability at least . So if we apply a thresholding procedure for at the level , then we expect that the zeros and non-zeros with magnitudes larger than in can be identified by the thresholded support of . Precisely, we have the instantaneous partial recovery consistency.

###### Theorem 3.3.

Assume that Assumption 1,2,3 are satisfied. Then, for any fixed time points , and .

Theorem 3.3 states that, with high probability, the zeros in can be identified and the nonzero entries in with strong signal strength above can be recovered by . Therefore, the false positives (type I error) of the estimator (3.10) are asymptotically controlled; see Theorem 3.5 for precise statement. However, Theorem 3.3 does not provide too much information regarding the false negatives since there is no characterization of the signal strength in .

Let and . We introduce the following matrix class

 Gα,β(s,Md,Ld) = Gα(s,Md)∩{Θ:|{(m,k):|Θmk|∈(0,u)}||supp(Θ)|≤Lduβ,∀0

For , the parameters and control the proportion of small entries in the support of . If is large and grows slowly, then the fraction of weak signals in is small and therefore the false negatives (type II error) can also be well controlled. Below, we shall give such an example.

###### Example 3.1 (A spatial design).

Let . Consider a symmetric matrix that is generated by the covariance function of a spatial process , which is a random vector that is observed at sites in some spatial domain . Assume that the covariance between and satisfies

 Amk:=cov(Zm,Zk)=⎧⎨⎩f(π(h∘m,h∘k))π(h∘m,h∘k)

where is the distance between the sites and

 π(h∘m,h∘k)=|m−k|dr (3.13)

and is a real-value covariance function. Here, we consider the rational quadratic covariance function [[11], [36]]

 f(π(h∘m,h∘k))=1(1+π(h∘m,h∘k)2)γ,γ>1. (3.14)

In this example, if , so is a banded matrix.

###### Lemma 3.4.

For in Example 3.1, then there exists a large enough constant depending only on such that , where

 s=⎧⎪⎨⎪⎩Cdrlogdif 2γα=1Cdrif 2γα>1Cdrd(1−r)(1−2γα)if 0<2γα<1, (3.15)
 Md=⎧⎪⎨⎪⎩Cdrlogdif γ=1/2Cdrif γ>1/2Cdrd(1−r)(1−2γ)if 0<γ<1/2, (3.16)

and .

For any fixed distance parameter , the weak signal parameter has a natural dependence on . If is smaller, then the covariance function decays to zero slower and there is a less fraction of weak signals in . This can allow to grow slowly in . Note that class is much less stringent than the widely used condition for support recovery and model selection in literature, which requires that the minimal nonzero signal strength is uniformly bounded away from zero [[28]]. To quantify the error in the pattern recovery, we use the following two error rate measures.

###### Definition 3.2.

The false positive rate (FPR) and false negative rate (FNR) of are defined as

 FPRi=|^Si∩Sci||Sci|,FNRi=|^Sci∩Si||Si|. (3.17)

By convention, if , then ; if , then .

If with probability tending to one, which is a very strong requirement, then we have the pattern recovery consistency . Below, we show that both FPR and FNR are asymptotically controlled in the presence of weak signals.

###### Theorem 3.5.

Assume that Assumption 1,2,3 are satisfied. Fix an . If , then we have . If in addtiion , then we also have .

Since , the FNR vanishes with probability tending to one if .

## 4 Simulation studies

In this section, we present some numerical results on simulated datasets. We compare the following five methods:

[label=()]
1. Our TV-VAR estimator (2.5).

2. The stationary vector autoregressive model (Stat. VAR) [[21]].

3. The time-varying lasso method [[23]].

4. The time-varying ridge method [[20]].

5. The time-varying maximum likelihood estimator (MLE).

Methods (iii), (iv) and (v) are extensions of the lasso [[23]], ridge regression [[20]] and MLE to the time-varying setting by kernel smoothing. We include (ii), i.e. the stationary VAR model, as the baseline method that ignores the dynamic features in the transition matrices. (iii) is a competitor of (i). We solve (iii) with the FISTA [[6]] and we solve (i) and (ii) with the simplex algorithm [[26]].

### 4.1 Data generation

We consider different setups of and and . For each setup , the data are generated by the following procedure. First, the baseline coefficient matrices and are generated by using the sugm.generator() in flare R package [[24]]. We consider four graph structures defined in flare: hub, cluster, band and random for and . Examples of these four structures are shown in Appendix B. Then we normalize , and smoothly interpolate on the intermediate values

 Ai=(1−ti)4A01+t2iA02,ti=i/n for i=1,⋯,n.

Following [[21]], we specify the innovation covariance matrix where . [[21]] showed that the choice of does not affect the numeric performance significantly. In our simulation studies, we use the Epanechnikov kernel and fix the model order . The bandwidth is set to be for (i), (iii) and (iv).

### 4.2 Tuning parameter selection

For the tuning parameter selection in (i)–(iv), we propose a data-driven procedure that minimizes the one-step-ahead prediction errors as follows.

1. Choose a grid for the tuning parameter (say ) and the number of training data points.

2. For each , perform the one-step-ahead prediction on the testing set by estimating with and then predict by , where . Then calculate the prediction error at time as .

3. Calculate the average errors over the last time points

 ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯Err(τ)=1n−n1n∑t=n1+1Errt(τ)
4. Select that minimizes .

### 4.3 Comparison results

#### Estimation errors

For each setup, we run 100 simulations and for each simulation indexed by we estimate the transition matrix at each time point indexed by , where . Then, we calculate the error for and . We also report errors under , and norms. Next, the averaged errors over different time points are aggregated as where and . Finally, the averaged errors are reported over 100 simulations as . The results are shown in Table 1, 2, 3 and 4.

From Table 1 to Table 4, larger often results in larger errors. In general, the unstructured MLE performs the worst almost under all matrix norms. Although the ridge method shrinks the coefficients in the transition matrix to zeros, the values of those coefficients are not exactly zeros. Thus, the estimated transition matrices by the ridge method are not sparse which lead to higher estimation errors compared with the TV-VAR, time-varying lasso and stationary VAR. Stationary VAR always perform worse than TV-VAR and time-varying lasso, since it cannot capture the dynamic structure. The time-varying lasso performs better than any other methods except TV-VAR. The proposed TV-VAR performs the best almost under all matrix norms.

#### Pattern recovery

For the TV-VAR, we also report the and in (3.17), where which are the indexes of 30 possible tuning parameters from 0.001 to 0.45. Then, we calculate the averaged and . Following [[9]], we set the to , which is considered to be numerical nonzero. The ROC curves for all possible values of the sparsity control parameter are plotted in Fig a, Fig b, Fig c and Fig d. Based on the ROC curves, the TV-VAR method has better discrimination power for band or random patterns than hub or cluster patterns.

We also calculate by using the true value of in (3.9) and the resulting ROC curves are shown in Appendix C. Those ROC curves are similar to those based on .