Variable Screening for High Dimensional Time Series

# Variable Screening for High Dimensional Time Series

## Abstract

Variable selection is a widely studied problem in high dimensional statistics, primarily since estimating the precise relationship between the covariates and the response is of great importance in many scientific disciplines. However, most of theory and methods developed towards this goal for the linear model invoke the assumption of iid sub-Gaussian covariates and errors. This paper analyzes the theoretical properties of Sure Independence Screening (SIS) (Fan and Lv [19]) for high dimensional linear models with dependent and/or heavy tailed covariates and errors. We also introduce a generalized least squares screening (GLSS) procedure which utilizes the serial correlation present in the data. By utilizing this serial correlation when estimating our marginal effects, GLSS is shown to outperform SIS in many cases. For both procedures we prove sure screening properties, which depend on the moment conditions, and the strength of dependence in the error and covariate processes, amongst other factors. Additionally, combining these screening procedures with the adaptive Lasso is analyzed. Dependence is quantified by functional dependence measures (Wu [47]), and the results rely on the use of Nagaev-type and exponential inequalities for dependent random variables. We also conduct simulations to demonstrate the finite sample performance of these procedures, and include a real data application of forecasting the US inflation rate.

[
\kwd
\setattribute

journalname

\runtitle

Screening for Time Series

{aug}

,

class=AMS] \kwd[Primary ]62F07 \kwd[; secondary ]62J07

High-dimensional Statistics \kwdsparsity \kwdLasso \kwdTime Series \kwdfunctional dependence measure \kwdVariable Selection \kwdNagaev inequality \kwdSure Independence Screening

## 1 Introduction

With the advancement of data acquisition technology, high dimensionality is a characteristic of data being collected in fields as diverse as health sciences, genomics, neuroscience, astronomy, finance, and macroeconomics. Applications where we have a large number of predictors for a relatively few number of observations are becoming increasingly common. For example, in disease classification we usually have thousands of variables, such as expression of genes, which are collected, while the sample size is usually in the tens. Other examples include fMRI data, where the number of voxels can number in the thousands and far outnumber the observations. For an overview of high dimensionality in economics and finance, see [21]. For the biological sciences, see [23, 5] and references therein.

The main goals in these situations according to [4] are:

• To construct as effective a method as possible to predict future observations.

• To gain insight into the relationship between features and response for scientific purposes, as well as hopefully, to construct an improved prediction method.

More formally we are dealing with the case where

 y=Xβ+ϵ (1)

with being an -vector of responses, being an random design matrix, and is an - random vector of errors. In addition, when the dimensionality of the predictors () is large we usually make the assumption that the underlying coefficient vector is sparse. Sparsity is a characteristic that is frequently found in many scientific applications [19],[29]. For example, in disease classification it is usually the case that only a small amount of genes are relevant to predicting the outcome.

Indeed, there are a wealth of theoretical results and methods that are devoted to this issue. Our primary focus is on screening procedures. Sure Independence Screening (SIS) as originally introduced in [19], was applicable to the linear model, and is based on a ranking of the absolute values of the marginal correlations of the predictors with the outcome. This method allows one to deal with the situation in which the number of predictors is of an exponential order of the number of observations, which they termed as ultrahigh dimensionality. Further work on the topic has expanded the procedure to cover the case of generalized linear models [24], non-parametric additive models [18], Cox proportional hazards model [17], single index hazard rate models [26], and varying coefficient models [22]. Model-free screening methods have also been developed. For example; screening using distance correlation was analyzed in [33], a martingale difference correlation approach was introduced in [40], additional works include [55], [28] among others. For an overview of works related to screening procedures, one can consult [35]. The main result introduced with these methods is that, under the appropriate conditions, we can reduce the predictor dimension from size , for some , to a size , while retaining all the relevant predictors with probability approaching 1.

Another widely used class of methods is based on the penalized least squares approach. An overview of these methods is provided in [20] and [5]. Examples of methods in this class are the Lasso [43], and the adaptive Lasso [56]. Various theoretical results have been discovered for these class of methods. They broadly fall into analyzing the prediction error , parameter estimation error , model selection consistency, as well as limiting distributions of the estimated parameters (see [9] for a comprehensive summary). Using screening procedures in conjunction with penalized least squares methods, such as the adaptive Lasso, presents a powerful tool for variable selection. Variable screening can allow us to quickly reduce the parameter dimension significantly, which weakens the the assumptions needed for the sign consistency of the adaptive Lasso [27, 37].

A key limitation of the results obtained for screening methods, is the assumption of independent observations. In addition, it is usually assumed that the covariates and the errors are sub-Gaussian (or sub-exponential). However, there are many examples of real world data where these assumptions are violated. Data which is observed over time and/or space such as meteorological data, longitudinal data, economic and financial time series frequently exhibit covariates and/or errors which are serially correlated. One specific example is the case of fMRI time series, where there can exist a complicated spatial-temporal dependence structure in the errors and the covariates (see [46]). Another example is in forecasting macroeconomic indicators such as GDP or inflation rate, where we have large number of macroeconomic and financial time series and their lags as possible covariates. Examples of heavy tailed and dependent errors and covariates can be found most prominently in financial, insurance and macroeconomic data.

These examples stress why it is extremely important for variable selection methods to be capable of handling scenarios where the assumption of independent sub-Gaussian (or sub-exponential) observations is violated. Some works related to this goal for the Lasso include [44], which extended the Lasso to jointly model the autoregressive structure of the errors as well as the covariates. However, their method is applicable only to the case where , and they assume an autoregressive structure where the order of the process is known. Whereas [51], studied the theoretical properties of the Lasso assuming a fixed design in the case of heavy tailed and dependent errors. Additionally [3], and [31] investigated theoretical properties of the Lasso for high-dimensional Gaussian processes. Most recently [37] analyzed the adaptive Lasso for high dimensional time series while allowing for both heavy tailed covariate and errors processes, with the additional assumption that the error process is a martingale difference sequence.

Some works related to this goal for screening methods include [34], which allows for heavy tailed errors and covariates. Additionally [10], [52], and [55] also relax the Gaussian assumption, with the first two requiring the tails of the covariates and the response to be exponentially light, while the latter allows for heavy tailed errors provided the covariates are sub-exponential. Although these works relax the moment and distributional assumptions on the covariates and the response, they still remain in the framework of independent observations. A few works have dealt with correlated observations in the context of longitudinal data, see [13], and [54]. However, the dependence structure of longitudinal data is too restrictive to cover the type of dependence present in most time series. Most recently [12] proposed a non-parametric kernel smoothing screening method applicable to time series data. In their work they assume a sub-exponential response, covariates that are bounded and have a density, as well as assuming the sequence is strong mixing, with the additional assumption that the strong mixing coefficients decay geometrically. These assumptions can be quite restrictive; they exclude, for example, heavy tailed time series, and discrete valued time series which are common in fields such as macroeconomics, finance, neuroscience, amongst others [15].

In this work, we study the theoretical properties of SIS for the linear model with dependent and/or heavy tailed covariates and errors. This allows us to substantially increase the number of situations in which SIS can be applied. However, one of the drawbacks to using SIS in a time series setting is that the temporal dependence structure between observations is ignored. In an attempt to correct this, we introduce a generalized least squares screening (GLSS) procedure, which utilizes this additional information when estimating the marginal effect of each covariate. By using GLS to estimate the marginal regression coefficient for each covariate, as opposed to OLS used in SIS, we correct for the effects of serial correlation. Our simulation results show the effectiveness of GLSS over SIS, is most pronounced when we have strong levels of serial correlation and weak signals. Using the adaptive Lasso as a second stage estimator after applying the above screening procedures is also analyzed. Probability bounds for our combined two stage estimator being sign consistent are provided, along with comparisons between our two stage estimator and the adaptive Lasso as a stand alone procedure.

Compared to previous work, we place no restrictions on the distribution of the covariate and error processes besides existence of a certain number of finite moments. In order to quantify dependence, we rely on the functional dependence measure framework introduced by [47], rather than the usual strong mixing coefficients. Comparisons between functional dependence measures and strong mixing assumptions are discussed in section 2. For both GLSS and SIS, we present the sure screening properties and show the range of can vary from the high dimensional case, where is a power of , to the ultrahigh dimensional case discussed in [19]. We detail how the range of and the sure screening properties are affected by the the strength of dependence and the moment conditions of the errors and covariates, the strength of the underlying signal, and the sparsity level , amongst other factors.

The rest of the paper is organized as follows: Section 2 reviews the functional and predictive dependence measures which will allow us to characterize the dependence in the covariate () and error processes. We also discuss the assumptions placed on structure of the covariate and error processes; these assumptions are very mild, allowing us to represent a wide variety of stochastic processes which arise in practice. Section 3 presents the sure screening properties of SIS under a range of settings. Section 4 introduces the GLSS procedure and presents its sure screening properties. Combining these screening procedures with the adaptive Lasso will discussed in Section 5. Section 6 covers simulation results, while section 7 discusses an application to forecasting the US inflation rate. Lastly, concluding remarks are in Section 8, and the proofs for all the results follow in the appendix.

## 2 Preliminaries

We shall assume the error sequence is a strictly stationary, ergodic process with the following form:

 ϵi=g(…,ei−1,ei) (2)

Where is a real valued measurable function, and are iid random variables. This representation includes a very wide range of stochastic processes such as linear processes, their non-linear transforms, Volterra processes, Markov chain models, non-linear autoregressive models such as threshold auto-regressive (TAR), bilinear, GARCH models, among others (for more details see [48],[47]). This representation allows us to use the functional and predictive dependence measures introduced in [47]. The functional dependence measure for the error process is defined as the following:

 δq(ϵi)=||ϵi−g(F∗i)||q=(E|ϵi−g(F∗i)|q)1/q (3)

where with being iid. Since we are replacing by , we can think of this as measuring the dependency of on as we are keeping all other inputs the same. The cumulative functional dependence measure is defined as . We assume weak dependence of the form:

 Δ0,q(ϵ)=∞∑i=0δq(ϵi)<∞ (4)

The predictive dependence measure is related to the functional dependence measure, and is defined as the following:

 θq(ϵl)=||E(ϵl|F0)−E(ϵl|F−1)||q=||P0ϵl||q (5)

where with being iid. The cumulative predictive dependence measure is defined as , and by theorem 1 in [47] we obtain .

Similarly the covariate process is of the form:

 xi=h(…,ηi−1,ηi) (6)

Where , are iid random vectors, , and . Let . As before the functional dependence measure is and the cumulative dependence measure for the covariate process is defined as :

 Φm,q(x)=∞∑i=mmaxj≤pnδq(Xij)<∞ (7)

The representations (2), and (6), along with the functional and predictive dependence measures have been used in various works including [50],[53], and [51] amongst others. Compared to strong mixing conditions, which are often difficult to verify, the above dependence measures are easier to interpret and compute since they are related to the data generating mechanism of the underlying process [48]. In many cases using the functional dependence measure also requires less stringent assumptions. For example, consider the case of a linear process, , with iid. Sufficient conditions for a linear process to be strong mixing involve: the density function of the innovations () being of bounded variation, restrictive assumptions on the decay rate of the coefficients (), and invertibility of the process (see theorem 14.9 in [14] for details). Additional conditions are needed to ensure strong mixing if the innovations for the linear process are dependent [16].

As a result many simple processes can be shown to be non-strong mixing. A prominent example involves an AR(1) model with iid Bernoulli (1/2) innovations: is non-strong mixing if [2]. These cases can be handled quite easily in our framework, since we are not placing distributional assumptions on the innovations, , such as the existence of a density. For linear processes with iid innovations, representation (2) clearly holds and (4) is satisfied if . For dependent innovations, suppose we have : , where is a real valued measurable function and , are iid. Then , has a causal representation, and satisfies (4) if: , and (see [49]).

## 3 SIS with Dependent Observations

Sure Independence Screening, as introduced by Fan and Lv [19], is a method of variable selection based on ranking the magnitudes of the marginal regression estimates. Under appropriate conditions, this simple procedure is shown to possess the sure screening property. The method is as follows, let :

 ^ρ=(^ρ1,…,^ρpn) , where ^ρj=(n∑t=1X2tj)−1(n∑t=1XtjYt) (8)

Therefore, is the OLS estimate of the linear projection of onto . Now let

 M∗={1≤i≤pn:βi≠0} (9)

and let be the size of the true sparse model. We then sort the elements of by their magnitudes. For any given , define a sub-model

 ^Mγn={1≤i≤pn:|^ρi|≥γn} (10)

and let be the size of the selected model. The sure screening property states that for an appropriate choice of , we have .

Throughout this paper let: , , , and be column of . In addition, we assume , . Note that can contain lagged values of . Additionally, let , and . For a vector , denotes its sign vector, with the convention that , and . For a square matrix , let and , denote the minimum eigenvalue, and maximum eigenvalue respectively. For any matrix , let , and denote the maximum absolute row sum of , and the spectral norm of respectively. Lastly we will use to denote generic positive constants which can change between instances.

### 3.1 SIS with dependent, heavy tailed covariates and errors

To establish sure screening properties, we need the following conditions:

Condition A: for

Condition B: .

Condition C: Assume the error and the covariate processes have representations (2), and (6) respectively. Additionally, we assume the following decay rates , for some , and .

Condition A is standard in screening procedures, and it assumes the marginal signals of the active predictors cannot be too small. Condition B assumes the covariates and the errors are contemporaneously uncorrelated. This is significantly weaker than independence between the error sequence and the covariates usually assumed. Condition C presents the structure, dependence and moment conditions on the covariate and error processes. Notice that higher values of indicate weaker temporal dependence.

Examples of error and covariate processes which satisfy condition C are : If is a linear process, with iid and then . If for we have and . We have a geometric decay rate in the cumulative functional dependence measure, if satisfies the geometric moment contraction (GMC) condition, see [39]. Conditions needed for a process to satisfy the GMC condition are given in theorem 5.1 of [39]. Examples of processes satisfying the GMC condition include stationary, causal finite order ARMA , GARCH, ARMA-GARCH, bilinear, and threshold autoregressive processes, amongst others (see [48] for details).

For the covariate process, if we assume is a vector linear process: . Where are coefficient matrices and are iid random vectors with . For simplicity, assume are identically distributed, then

 δq(Xij)=||Ai,jη0−Ai,jη∗0||q≤2|Ai,j|||η0,1||q (11)

where is the column of . If for , then . In particular, for stable VARMA processes, we have where . For simplicity, we show the derivation for stable VAR() processes, , the result for stable VARMA processes follows similarly. We can rewrite a VAR() process as a VAR(1) process, , with:

 ~xt =⎡⎢ ⎢ ⎢ ⎢ ⎢⎣ηt0⋮0⎤⎥ ⎥ ⎥ ⎥ ⎥⎦kp×1 (12)

By section 11.3.2 in [36], the process is stable if and only if is stable. Therefore, we have , and by [11], . Additional examples of error and covariate processes which satisfy condition C are given in [50] and [51] respectively.

Define and let , if , otherwise . Let if , otherwise . Additionally let and . Given condition C, it follows that . The following theorem gives the sure screening properties, and provides a bound on the size of the selected model:

###### Theorem 1.

Suppose conditions A,B,C hold.

1. For any , we have:

 P(maxj≤pn|^ρj−E(ρj)|>c2n−κ) ≤O(snpn[nπKrx,r(n/sn)r/2−rκ/2+exp(−n1−2κs2nK4x,r)]) +O(pn[nιKτx,rKτϵ,qnτ−τκ+exp(−n1−2κK2x,rK2ϵ,q)])
2. For with , we have :

 P(M∗⊂^Mγn) ≥1−O(s2n[nπKrx,r(n/sn)r/2−rκ/2+exp(−n1−2κs2nK4x,r)]) −O(sn[nιKτx,rKτϵ,qnτ−τκ+exp(−n1−2κK2x,rK2ϵ,q)])
3. For with , we have:

 P(|^Mγn|≤O(n2κλmax(Σ))) ≥1−O(snpn[nπKrx,r(n/sn)r/2−rκ/2+exp(−n1−2κs2nK4x,r)]) −O(pn[nιKτx,rKτϵ,qnτ−τκ+exp(−n1−2κK2x,rK2ϵ,q)])

In theorem 1, we have two types of bounds, for large the polynomial terms dominate, whereas for small values of the exponential terms dominate. The covariate dimension () can be as large as . The range of depends on the dependence in both the covariate and the error processes, the strength of the signal (), the moment condition, and the sparsity level (). If we assume , and then . For the case of iid errors and covariates, we would replace in theorem 1 with and respectively. Therefore for the case of weaker dependence in the covariate and error processes (i.e. and ), our range for is reduced only by a constant factor. However, our range for is significantly reduced in the case of stronger dependence in the error or covariate processes (i.e. either or ). For instance if and , our range for is reduced by a factor of in the case of stronger dependence.

In the iid setting, to achieve sure screening in the ultrahigh dimensional case, [19] assumed the covariates and errors are jointly normally distributed. Future works applicable to the linear model, such as [24],[18] among others, relaxed this Gaussian assumption, but generally assumed the tails of the covariates and errors are exponentially light. Compared to the existing results for iid observations, our moment conditions preclude us from dealing with the ultrahigh dimensional case. However, our setting is far more general in that it allows for dependent and heavy tailed covariates and errors. In addition, we allow for the covariates and error processes to be dependent on each other, with the mild restriction that .

### 3.2 Ultrahigh Dimensionality under dependence

It is possible to achieve the sure screening property in the ultrahigh dimensional setting with dependent errors and covariates. However, we need to make stronger assumptions on the moments of both the error and covariate processes. Until now we have assumed the existence of a finite moment, which restricted the range of to a power of . If the error and covariate processes are assumed to follow a stronger moment condition, such as and for arbitrary , we can achieve a much larger range of which will cover the ultrahigh dimensional case discussed in [19]. More formally, we have :

Condition D: Assume the error and the covariate processes have representations (2), and (6) respectively. Additionally assume and , for some

By theorem 3 in [51], condition D implies the tails of the covariate and error processes are exponentially light. There are a wide range of processes which satisfy the above condition. For example, if is a linear process : with iid and then . If we assume is sub-Gaussian, then , since . Similarly if is sub-exponential we have . More generally, for , if is sub-exponential, we have . Similar results hold for vector linear processes discussed previously.

Condition D is primarily a restriction on the rate at which increase as . We remark that, for any fixed , we are not placing additional assumptions on the temporal decay rate of the covariate and error processes besides requiring ,. In comparison, in the ultrahigh dimensional setting, [12] requires geometrically decaying strong mixing coefficients, in addition to requiring sub-exponential tails for the response. As an example, if we assume , geometrically decaying strong mixing coefficients would require the coefficients, , to decay geometrically. Whereas in condition D, the only restrictions we place on the coefficients, , is absolute summability.

###### Theorem 2.

Suppose conditions A,B,D hold. Define , and .

1. For any we have:

 P(maxj≤pn|^ρj−ρj|>c2n−κ)≤ O(snpnexp(−n1/2−κυ2xsn)~α) +O(pnexp(−n1/2−κυxυϵ)~α′)
2. For with , we have :

 P(M∗⊂^Mγn)≥1 −O(s2nexp(−n1/2−κυ2xsn)~α) −O(snexp(−n1/2−κυxυϵ)~α′)
3. For with , we have:

 P(|^Mγn|≤O(n2κλmax(Σ)))≥1 −O(snpnexp(−n1/2−κυ2xsn)~α) −O(pnexp(−n1/2−κυxυϵ)~α′)

From theorem 2, we infer the covariate dimension () can be as large as
. As in theorem 1, the range of depends on the dependence in both the covariate and the error processes, the strength of the signal (), the moment condition, and the sparsity level (). For the case of iid covariates and errors, we would replace and with and respectively. In contrast to theorem 1, temporal dependence affects our range of only by a constant factor.

If we assume , and both the covariate and error processes are sub-Gaussian we obtain , while for sub-exponential distributions we obtain . In contrast, Fan and Lv [19], assuming independent observations, allow for a larger range . However, their work relied critically on the Gaussian assumption. Fan and Song [24], relax the Gaussian assumption by allowing for sub-exponential covariates and errors, and our rates are similar to theirs up to a constant factor. Additionally, in our work we relax the sub-exponential assumption, provided the tails of the covariates and errors are exponentially light.

## 4 Generalized Least Sqaures Screening (GLSS)

Consider the marginal model :

 Yt=Xtkρk+ϵt,k (13)

where is the linear projection of onto . In SIS, we rank the magnitudes of the OLS estimates of this projection. In a time series setting, if we are considering the marginal model (13) it is likely the case that the marginal errors () will be serially correlated. This holds even if we assume that the errors () in the full model (1) are serially uncorrelated. A procedure which accounts for this serial correlation, such as Generalized Least Squares (GLS), will provide a more efficient estimate of .

We first motivate our method by considering a simple univariate model. Assume and the errors follow an AR(1) process, , where , and are iid standard Gaussian. We set , , and estimate the model using both OLS and GLS for values of ranging from .5 to .95. The mean absolute errors for both procedures is plotted in figure 1. We observe that the performance of OLS steadily deteriorates for increasing values of , while the performance of GLS stays constant. This suggests that a screening procedure based on GLS estimates will be most useful in situations where we have weak signals and high levels of serial correlation.

The GLS estimate for is :

 ~βMk=(XTkΣ−1kXk)−1XTkΣ−1ky (14)

Where is the column of , and is the auto-covariance matrix of . Under appropriate conditions, even while using an estimate of to form , our GLS estimates will be asymptotically more efficient relative to OLS estimates [1, 32]. Our estimate of will be the banded autocovariance matrix estimator introduced in [50], which is defined as:

 ^Σk,ln=(^γi−j,k\mathbbm1|i−j|≤ln)1≤i,j≤n (15)

Where is our band length, , with , and is the OLS estimate of . Our GLS estimator is now:

 ^βMk=(XTk^Σ−1k,lnXk)−1XTk^Σ−1k,lny (16)

For positive definite , the banded estimate for is not guaranteed to be positive definite, however it is asymptotically positive definite (see lemma 1). For small samples, we can preserve positive definiteness by using the tapered estimate: , where is a positive definite kernel matrix, and denotes coordinate-wise multiplication. For example, we can choose . We need the following conditions for the sure screening property to hold:

Condition E: Assume the marginal error process, , is a stationary AR() process, . Where , .

Condition F: For : .

Condition G: Assume

Condition H: Assume , are of the form (2), and the covariate process is of the form (6). Additionally we assume the following decay rates , , for some , , and .

Condition I: Assume , are of the form (2), and the covariate process is of the form (6). Additionally assume
, for some , and .

In condition E, we can let the band length diverge to infinity at a slow rate, e.g , for simplicity we set to be a constant. Assuming a finite order AR model for the marginal error process is reasonable in most practical situations, since any stationary process with a continuous spectral density function can be approximated arbitrarily closely by a finite order linear AR process (see corollary 4.4.2 in [7]). For further details on linear AR approximations to stationary processes, see [1] and [8]. We remark that compared to previous works [1, 32], knowledge about the structure of the marginal errors is not necessary in estimating , since we use a non-parametric estimate of . Therefore condition E is assumed strictly for technical reasons.

For condition F, from (13), we have , iff . When , recall that:

 βMk=E(yt−Lk∑i=1αiyt−i)(Xt,k−Lk∑i=1αiXt−i,k)/(E(Xt,k−Lk∑i=1αiXt−i,k)2) (17)

If we assume the cross covariance, , for
, then whenever , even if . And for , it is likely the case that if we assume .

For condition H, since , we have , where is a measurable function and . If we assume for , then are iid. We then have:

 δq′(ϵt,k)= ||∑i∈M∗Xtiβi+ϵt−XtkβMk−(∑i∈M∗X∗tiβi+ϵ∗t−X∗tkβMk)||q′ ≤∑i∈M∗|βi|δq′(Xti)+δq′(ϵt)+|βMk|δq′(Xtk)

Therefore, , if we assume , , and .

For GLSS; define . Define ,
. Let if , otherwise . Let