
# Lasso Guarantees for β-Mixing Heavy Tailed Time Series

Wong, Kam Chung
kamwong@umich.edu
Tewari, Ambuj
tewaria@umich.edu
###### Abstract

Most existing theoretical results for the lasso require the samples to be iid. Recent work has provided guarantees for the lasso assuming that the time series is generated by a sparse Vector Auto-Regressive (VAR) model with Gaussian innovations. Proofs of these results rely critically on the fact that the true data generating mechanism (DGM) is a finite-order Gaussian VAR. This assumption is quite brittle: linear transformations, including selecting a subset of variables, can lead to its violation. In order to break free from such assumptions, we derive non-asymptotic inequalities for estimation error and prediction error of the lasso estimate of the best linear predictor without assuming any special parametric form of the DGM. Instead, we rely only on (strict) stationarity and geometrically decaying β-mixing coefficients to establish error bounds for the lasso for subweibull random vectors. The class of subweibull random variables that we introduce includes subgaussian and subexponential random variables but also includes random variables with tails heavier than an exponential. We also show that, for Gaussian processes, the β-mixing condition can be relaxed to summability of the α-mixing coefficients. Our work provides an alternative proof of the consistency of the lasso for sparse Gaussian VAR models. But the applicability of our results extends to non-Gaussian and non-linear time series models, as the examples we provide demonstrate.

## 1 Introduction

High dimensional statistics is a vibrant area of research in modern statistics and machine learning (buhlmann2011statistics; hastie2015statistical). The interplay between computational and statistical aspects of estimation in high dimensions has led to a variety of efficient algorithms with statistical guarantees including methods based on convex relaxation (see, e.g., chandrasekaran2012convex; negahban2012unified) and methods using iterative optimization techniques (see, e.g., beck2009fast; agarwal2012fast; donoho2009message). However, the bulk of existing theoretical work focuses on iid samples. The extension of theory and algorithms in high dimensional statistics to time series data, where dependence is the norm rather than the exception, is just beginning to occur. We briefly summarize some recent work in Section 1.1 below.

Our focus in this paper is to give guarantees for ℓ1-regularized least squares estimation, or lasso (hastie2015statistical), that hold even when there is temporal dependence in the data. We build upon the recent work of basu2015regularized (), who took a major step forward in providing guarantees for the lasso in the time series setting. They considered Gaussian VAR models with finite lag (see Example 1) and defined a measure of stability using the spectral density, which is the Fourier transform of the autocovariance function of the time series. They then showed that one can derive error bounds for the lasso in terms of their measure of stability. Their bounds are an improvement over previous work (negahban2011estimation; loh2012high; han2013transition) that assumed operator norm bounds on the transition matrix. These operator norm conditions are restrictive even for VAR models with a lag of 1 and never hold if the lag is strictly larger than 1! Therefore, the results of basu2015regularized () hold in greater generality than previous work. But they do have limitations.

A key limitation is that basu2015regularized () assume that the VAR model is the true data generating mechanism (DGM). Their proof techniques rely heavily on having the VAR representation of the stationary process available. The VAR model assumption, while popular in many areas, can be restrictive since the VAR family is not closed under linear transformations: if (Z_t) is a VAR process, then a linear transformation of it, such as a subset of its coordinates, may not be expressible as a finite lag VAR (lutkepohl2005new). We later provide examples (Examples 2 and 4) of VAR processes where leaving out a single variable breaks the VAR assumption. What if we do not assume that the process is a finite lag VAR but simply that it is stationary? Under stationarity (and finite second moment conditions), the best linear predictor of Y_t in terms of X_t is well defined even if the pair does not come from a finite lag VAR. If we assume that this best linear predictor involves sparse coefficient matrices, can we still guarantee consistent parameter estimation? Our paper provides an affirmative answer to this important question.

We provide finite sample parameter estimation and prediction error bounds for the lasso in two cases: (a) for stationary processes with subweibull marginals and geometrically decaying β-mixing coefficients (Section 4), and (b) for stationary Gaussian processes with suitably decaying α-mixing coefficients (Section 3). It is well known that guarantees for the lasso follow if one can establish restricted eigenvalue (RE) conditions and provide deviation bounds (DB) for the correlation of noise with the regressors (see the Master Theorem in Section 2.3 below for a precise statement). Therefore, the bulk of the technical work in this paper boils down to establishing, with high probability, that the RE and DB conditions hold under the Gaussian α-mixing and the subweibull β-mixing assumptions, respectively (Propositions 2, 3, 7 and 8). Note that RE conditions were previously shown to hold under the iid assumption by raskutti2010restricted () for Gaussian random vectors and by rudelson2013reconstruction () for sub-Gaussian random vectors.

### 1.1 Summary of Recent Work on High Dimensional Time Series.

While we discussed the work of basu2015regularized () at length – since ours is closely related to theirs – we wish to emphasize that several other researchers have recently published work on statistical analysis of high dimensional time series. song2011large (), wu2015high () and alquier2011sparsity () give theoretical guarantees assuming that RE conditions hold. As basu2015regularized () pointed out, it takes a fair bit of work to actually establish RE conditions in the presence of dependence. chudik2011infinite (); chudik2013econometric (); chudik2014theory () use high dimensional time series for global macroeconomic modeling. Alternatives to lasso that have been explored include the Dantzig selector (han2013transition), quantile based methods for heavy-tailed data (qiu2015robust), quasi-likelihood approaches (uematsu2015penalized), and two-stage estimation techniques (davis2012sparse). fan2016penalized () considered the case of multiple sequences of univariate heavy-tailed dependent data. Under a stringent condition on the auto-covariance structure (please refer to Appendix D for details), that paper established finite sample consistency for penalized least squares estimators. In addition, under a mutual incoherence type assumption, it provided sign and ℓ2 consistency. An AR(1) example was given as an illustration. Both uematsu2015penalized () and kock2015oracle () establish oracle inequalities for the lasso applied to time series prediction. uematsu2015penalized () provided results not just for the lasso but also for estimators using penalties such as the SCAD penalty. Also, instead of assuming Gaussian errors, it is only assumed that fourth moments of the errors exist. kock2015oracle () provided non-asymptotic lasso estimation and prediction error bounds for stable Gaussian VARs. Both sivakumar2015beyond () and medeiros2016 () considered subexponential designs.
sivakumar2015beyond () studied the lasso on iid subexponential designs and provided finite sample bounds. medeiros2016 () studied the adaptive lasso for linear time series models and provided sign consistency results. wang2007regression () provided theoretical guarantees for the lasso in linear regression models with autoregressive errors. Other structured penalties beyond the ℓ1 penalty have also been considered (nicholson2014hierarchical; nicholson2015varx; guo2015high; ngueyep2014large). zhang2015gaussian (), mcmurry2015high (), wang2013sparse () and chen2013covariance () consider estimation of the covariance (or precision) matrix of high dimensional time series. mcmurry2015high () and nardi2011autoregressive () both highlight that autoregressive (AR) estimation, even in univariate time series, leads to high dimensional parameter estimation problems if the lag is allowed to be unbounded.

### 1.2 Organization of the Paper

Section 2 introduces our notation, presents the key assumptions and also records some useful facts needed later. We then present two sets of high probability guarantees for the lower restricted eigenvalue and deviation bound conditions: Section 3 covers α-mixing Gaussian time series; note that α-mixing is a weaker notion than β-mixing, and all the parameter dependences are explicit. It is followed by Section 4, which covers β-mixing time series with subweibull observations, where we make the dependence on the subweibull norm explicit.

We present five examples: two involving α-mixing Gaussian processes and three involving β-mixing subweibull random vectors. They are presented along with the corresponding theoretical results to illustrate potential applications of the theory. Examples 1 and 2 concern applications of the results in Section 3: we consider VAR models with Gaussian innovations when the model is correctly or incorrectly specified. In Examples 3, 4, and 5, we focus on the case of subweibull random vectors. We consider VAR models with subweibull innovations when the model is correctly or incorrectly specified (Examples 3 and 4). In addition, we go beyond linear models and introduce non-linearity in the DGM in Example 5.

These examples serve to illustrate that our theoretical results for the lasso on high dimensional dependent data extend beyond the classical linear Gaussian setting and provide guarantees in the potential presence of one or more of the following: model mis-specification, heavy tailed non-Gaussian innovations, and nonlinearity in the DGM.

## 2 Preliminaries

Consider a stochastic process of pairs ((X_t, Y_t)) where X_t ∈ R^p and Y_t ∈ R^q. We will be interested in time series prediction where, given a time series (Z_t), we might want to predict Z_t using Z_{t−1}; this fits our setting with Y_t = Z_t and X_t = Z_{t−1}. As such, we cannot, and will not, assume that the pairs (X_t, Y_t) are iid. We will assume that the sequence is strictly stationary, and we will be interested in estimating the best linear predictor of Y_t in terms of X_t. That is, our parameter matrix of interest, Θ⋆ ∈ R^{p×q}, is

 Θ⋆ = argmin_{Θ∈R^{p×q}} E[‖Y_t − Θ′X_t‖₂²]. (2.1)

Note that Θ⋆ does not depend on t because of stationarity. Given T pairs of observations, define the matrices that collect the observations together:

 Y = (Y₁, Y₂, …, Y_T)′ ∈ R^{T×q},  X = (X₁, X₂, …, X_T)′ ∈ R^{T×p}. (2.2)

A matrix referred to in our analysis, but not available to an estimator, is the matrix of residuals:

 W :=Y−XΘ⋆. (2.3)

Let the ℓ1-penalized least squares, or lasso, estimator be defined as

 Θ̂ = argmin_{Θ∈R^{p×q}} (1/T)‖vec(Y − XΘ)‖₂² + λ_T‖vec(Θ)‖₁. (2.4)
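For concreteness, here is a minimal numerical sketch of the estimator (2.4). The solver choice (plain proximal gradient, i.e. ISTA) and the function names are ours, not from the paper; they merely illustrate that (2.4) is a straightforward convex program.

```python
import numpy as np

def soft_threshold(A, t):
    """Entrywise soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def lasso_estimate(X, Y, lam, n_iter=2000):
    """Proximal-gradient (ISTA) solver for the objective in (2.4):
    (1/T) ||vec(Y - X Theta)||_2^2 + lam * ||vec(Theta)||_1."""
    T = X.shape[0]
    # Lipschitz constant of the gradient of the smooth part.
    L = 2.0 * np.linalg.eigvalsh(X.T @ X / T).max()
    step = 1.0 / L
    Theta = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        grad = (2.0 / T) * X.T @ (X @ Theta - Y)
        Theta = soft_threshold(Theta - step * grad, step * lam)
    return Theta
```

Each iteration costs O(Tpq); with a suitably chosen λ_T, the returned matrix is entrywise sparse, matching the sparsity assumption on Θ⋆ below.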

### 2.1 Notation

For scalars a and b, define the shorthands a∨b := max{a, b} and a∧b := min{a, b}. For a symmetric matrix M, let λmax(M) and λmin(M) denote its maximum and minimum eigenvalues respectively. For any matrix M, let ρ(M), |||M|||, ‖M‖∞, and |||M|||_F denote its spectral radius, operator norm, entry-wise ℓ∞ norm (i.e. max_{i,j}|M_{ij}|), and Frobenius norm respectively. For any vector v, ‖v‖_r denotes its ℓ_r norm. Unless otherwise specified, we shall use ‖·‖ to denote the ℓ2 norm. For any vector v, we use ‖v‖₀ and ‖v‖∞ to denote Σᵢ 1{vᵢ ≠ 0} and maxᵢ|vᵢ| respectively. Similarly, for any matrix M, ‖M‖₀ := ‖vec(M)‖₀, where vec(M) is the vector obtained from M by concatenating the rows of M. We say that a matrix M (resp. vector v) is s-sparse if ‖M‖₀ = s (resp. ‖v‖₀ = s). We use v′ and M′ to denote the transposes of v and M respectively. When we index a matrix, we adopt the following conventions. For any matrix M, for 1 ≤ i ≤ p and 1 ≤ j ≤ q, we define M_{i,:} := e′ᵢM and M_{:,j} := M e_j, where eᵢ is the vector with all 0s except for a 1 in the i-th coordinate. The set of integers is denoted by ℤ.

For a lag l ∈ ℤ, we define the auto-covariance matrix w.r.t. (X_t) as Γ_X(l) := E[X_t X′_{t+l}]. Note that Γ_X(−l) = Γ_X(l)′. Similarly, the auto-covariance matrix of lag l w.r.t. (Y_t) is Γ_Y(l) := E[Y_t Y′_{t+l}], and w.r.t. ((X_t, Y_t)) it is Γ_{X,Y}(l) := E[(X′_t, Y′_t)′(X′_{t+l}, Y′_{t+l})]. The cross-covariance matrix at lag l is C_{X,Y}(l) := E[X_t Y′_{t+l}]. Note the difference between Γ_{X,Y}(l) and C_{X,Y}(l): the former is a (p+q)×(p+q) matrix, the latter is a p×q matrix. Thus, Γ_{X,Y}(l) is a matrix consisting of four sub-matrices. Using Matlab-like notation, Γ_{X,Y}(l) = [Γ_X(l), C_{X,Y}(l); C_{X,Y}(−l)′, Γ_Y(l)]. As per our convention, at lag l = 0, we omit the lag argument; for example, Γ_X denotes Γ_X(0). Finally, let Γ̂ := X′X/T be the empirical covariance matrix.
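As a small illustration of this notation, the lag-l auto-covariance and the empirical covariance matrix Γ̂ = X′X/T can be estimated from a sample as follows (a sketch; the function names are ours):

```python
import numpy as np

def empirical_autocov(Z, lag):
    """Empirical lag-`lag` autocovariance of a centered series.
    Z has shape (T, p); returns (1/(T-lag)) * sum_t Z_t Z_{t+lag}'."""
    T = Z.shape[0]
    return Z[: T - lag].T @ Z[lag:] / (T - lag)

def empirical_gram(X):
    """The empirical covariance matrix Gamma_hat = X'X / T."""
    return X.T @ X / X.shape[0]
```

The identity Γ(−l) = Γ(l)′ is mirrored empirically by `empirical_autocov(Z, lag).T`.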

### 2.2 Sparsity, Stationarity and Zero Mean Assumptions

The following assumptions are maintained throughout; we will make additional assumptions specific to each of the subweibull and Gaussian scenarios. Our goal is to provide finite sample bounds on the error Θ̂ − Θ⋆. We shall present theoretical guarantees on the parameter estimation error ‖vec(Θ̂ − Θ⋆)‖ and also on the associated (in-sample) prediction error.

###### Assumption 1.

The matrix Θ⋆ ∈ R^{p×q} is s-sparse, i.e., ‖vec(Θ⋆)‖₀ = s.

###### Assumption 2.

The process ((X_t, Y_t)) is strictly stationary: i.e., for every n, τ ≥ 0,

 ((X₁, Y₁), ⋯, (X_n, Y_n)) =_d ((X_{1+τ}, Y_{1+τ}), ⋯, (X_{n+τ}, Y_{n+τ})),

where “=_d” denotes equality in distribution.

###### Assumption 3.

The process is centered; i.e., E[X_t] = 0 and E[Y_t] = 0 for all t.

### 2.3 A Master Theorem

We shall start with what we call a “master theorem" that provides non-asymptotic guarantees for lasso estimation and prediction errors under two well-known conditions, viz. the restricted eigenvalue (RE) and the deviation bound (DB) conditions. Note that in the classical linear model setting (see, e.g., (hayashi2000econometrics, , Ch 2.3)) where the sample size T is larger than the dimensions, the conditions for consistency of the ordinary least squares (OLS) estimator are as follows: (a) the empirical covariance matrix X′X/T converges to an invertible matrix, and (b) the regressors and the noise are asymptotically uncorrelated, i.e., X′W/T → 0.

In high-dimensional regimes, bickel2009simultaneous (), loh2012high () and negahban2012restricted () have established similar consistency conditions for the lasso. The first one is the restricted eigenvalue (RE) condition on X′X/T (which is a special case, when the loss function is the squared loss, of the restricted strong convexity (RSC) condition). The second is the deviation bound (DB) condition on X′W/T. The following lower RE and DB definitions are modified from those given by loh2012high ().

###### Definition 1 (Lower Restricted Eigenvalue).

A symmetric matrix Γ ∈ R^{p×p} satisfies a lower restricted eigenvalue condition with curvature α > 0 and tolerance τ > 0 if, for every v ∈ R^p,

 v′Γv ≥ α‖v‖₂² − τ‖v‖₁².

###### Definition 2 (Deviation Bound).

Consider the random matrices X and W defined in (2.2) and (2.3) above. They are said to satisfy the deviation bound condition if there exist a deterministic multiplier function Q(X, W, Θ⋆) and a rate of decay function R(p, q, T) such that, with high probability,

 ‖X′W/T‖∞ ≤ Q(X, W, Θ⋆) R(p, q, T).

We now present a master theorem that provides guarantees for the parameter estimation error and for the (in-sample) prediction error. The proof, given in Appendix A, builds on existing results of the same kind (bickel2009simultaneous; loh2012high; negahban2012restricted) and we make no claims of originality for either the result or the proof.

###### Theorem 1 (Estimation and Prediction Errors).

Consider the lasso estimator defined in (2.4). Suppose Assumption 1 holds. Further, suppose that Γ̂ = X′X/T satisfies the lower RE condition with curvature α and tolerance τ, and that (X, W) satisfies the deviation bound condition. Then, for any λ_T ≥ 4Q(X, W, Θ⋆)R(p, q, T), we have the following guarantees:

 ‖vec(Θ̂ − Θ⋆)‖ ≤ 4√s λ_T/α, (2.5)
 |||(Θ̂ − Θ⋆)′Γ̂(Θ̂ − Θ⋆)|||_F ≤ 32λ_T²s/α. (2.6)

With this master theorem at our disposal, we just need to establish the validity of the restricted eigenvalue (RE) and deviation bound (DB) conditions for stationary time series by making appropriate assumptions. We shall do that without assuming any parametric form of the data generating mechanism. Instead, we will impose appropriate tail conditions on the random vectors and also assume that they satisfy some type of mixing condition. Specifically, in Section 3, we will consider α-mixing Gaussian random vectors and, in Section 4, β-mixing subweibull random vectors (we define these below in Section 4.1). Classically, mixing conditions were introduced to generalize classic limit theorems in probability beyond the case of iid random variables (rosenblatt1956central). Recent work on high dimensional statistics has established the validity of RE conditions in the iid Gaussian (raskutti2010restricted) and iid sub-Gaussian cases (rudelson2013reconstruction). One of the main contributions of our work is to extend these results in high dimensional statistics from the iid to the mixing case.

### 2.4 A Brief Overview of Mixing Conditions

Mixing conditions (bradley2005basic) are well established in the stochastic processes literature as a way to allow for dependence in extending results from the iid case. The general idea is to first define a measure of dependence between two random variables X and Y (that can be vector-valued or even take values in a Banach space) with associated sigma algebras σ(X) and σ(Y). For example,

 α(X,Y)=sup{|P(A∩B)−P(A)P(B)|:A∈σ(X),B∈σ(Y)}.

Then for a stationary stochastic process (X_t), one defines the mixing coefficients, for l ≥ 1,

 α(l) = α(X_{−∞:t}, X_{t+l:∞}).

We say that the process is mixing, in the sense just defined, when α(l) → 0 as l → ∞. The particular notion we get using the measure of dependence above is called “α-mixing". It was first used by rosenblatt1956central () to extend the central limit theorem to dependent random variables. There are other, stronger notions of mixing, such as ρ-mixing and β-mixing, that are defined using the dependence measures:

 ρ(X, Y) = sup{Cov(f(X), g(Y)) : E f = E g = 0, E f² = E g² = 1},
 β(X, Y) = sup (1/2) Σ_{i=1}^{I} Σ_{j=1}^{J} |P(S_i ∩ T_j) − P(S_i)P(T_j)|,

where the last supremum is over all pairs of partitions {S₁, …, S_I} and {T₁, …, T_J} of the sample space such that S_i ∈ σ(X) and T_j ∈ σ(Y) for all i, j. The ρ-mixing and β-mixing conditions do not imply each other but each, by itself, implies α-mixing (bradley2005basic). For stationary Gaussian processes, α-mixing is equivalent to ρ-mixing (see Fact 3 below).
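For intuition, these dependence measures can be computed exactly when X and Y are finite discrete variables, by brute force over events. The following toy computation (our own illustration, not from the paper) also confirms the bounds in Fact 1 below and the standard ordering 2α(X, Y) ≤ β(X, Y):

```python
import numpy as np
from itertools import chain, combinations

def subsets(n):
    """All subsets of {0, ..., n-1}."""
    return chain.from_iterable(combinations(range(n), r) for r in range(n + 1))

def alpha_beta(P):
    """alpha- and beta-dependence between two finite discrete random
    variables with joint pmf matrix P (rows: values of X, cols: of Y)."""
    P = np.asarray(P, dtype=float)
    px, py = P.sum(axis=1), P.sum(axis=0)
    # alpha: sup over events A in sigma(X), B in sigma(Y).
    alpha = max(
        abs(P[np.ix_(list(A), list(B))].sum() - px[list(A)].sum() * py[list(B)].sum())
        for A in subsets(P.shape[0]) for B in subsets(P.shape[1])
    )
    # beta: half the total variation between the joint and the product of marginals.
    beta = 0.5 * np.abs(P - np.outer(px, py)).sum()
    return alpha, beta
```

For independent variables both measures vanish; for X = Y uniform on two values, α attains its maximal value 1/4 while β equals 1/2.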

The β-mixing condition has been of interest in statistical learning theory for obtaining finite sample generalization error bounds for empirical risk minimization (vidyasagar2003learning, , Sec. 3.4) and boosting (kulkarni2005convergence) with dependent samples. There is also work on estimating β-mixing coefficients from data (mcdonald2011estimating). The usefulness of β-mixing lies in the fact that, by using a simple blocking technique that goes back to the work of yu1994rates (), one can often reduce the situation to the iid setting. At the same time, many interesting processes such as Markov and hidden Markov processes satisfy a β-mixing condition (vidyasagar2003learning, , Sec. 3.5). To the best of our knowledge, however, there are no results showing that the RE and DB conditions hold under mixing conditions. Next we fill this gap in the literature. Before we continue, we note some elementary but useful facts about mixing conditions; in particular, they persist under arbitrary measurable transformations of the original stochastic process.
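To make the blocking idea concrete, the sketch below (our construction, following the idea in yu1994rates ()) splits a length-T index set into consecutive blocks of length a. Within the family of odd-numbered blocks, consecutive blocks are separated by a gap of length a, so under β-mixing they can be replaced by independent copies at a total-variation cost of roughly (T/2a)·β(a):

```python
import numpy as np

def alternating_blocks(T, a):
    """Split indices 0..T-1 into consecutive blocks of length a and
    return the odd-numbered and even-numbered blocks separately.
    Within each returned family, consecutive blocks are >= a apart,
    which is what makes the beta-mixing coupling argument work."""
    blocks = [np.arange(i, min(i + a, T)) for i in range(0, T, a)]
    return blocks[0::2], blocks[1::2]
```

One then proves concentration for the (nearly independent) odd blocks and again for the even blocks, paying the β(a) coupling cost twice.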

###### Fact 1.

The ranges of values that the α-, β- and ρ-mixing coefficients can take are bounded (see e.g. bradley2005basic ()): for any two sigma fields A and B on a common probability space (Ω, F, P), we have

 0≤α(A,B)≤1/4, 0≤β(A,B)≤1, 0≤ρ(A,B)≤1
###### Fact 2.

Suppose a stationary process (X_t) is α-, ρ-, or β-mixing. Then the stationary sequence (f(X_t)), for any measurable function f, is also mixing in the same sense, with its mixing coefficients bounded by those of the original sequence.

## 3 Gaussian Processes under α-Mixing

Here we will study Gaussian processes under the α-mixing condition, which is weaker than β-mixing. We make the following additional assumption.

###### Assumption 4 (Gaussianity).

The process is a Gaussian process.

Assume ((X_t, Y_t)) satisfies Assumptions 2, 3, and 4. Note that X_t and Y_t are then centered Gaussian random vectors with covariance matrices Σ_X and Σ_Y respectively. To control dependence over time, we will assume α-mixing, the weakest notion among α-, ρ- and β-mixing.

###### Assumption 5 (α-Mixing).

The process ((X_t, Y_t)) is an α-mixing process with mixing coefficients α(l). Let S_α(T) := Σ_{l=0}^{T} α(l). If Σ_{l=0}^{∞} α(l) converges, we let S_α := Σ_{l=0}^{∞} α(l).

We will use the following useful fact (ibragimov1978gaussian, , p. 111) in our analysis.

###### Fact 3.

For any stationary Gaussian process, the α- and ρ-mixing coefficients are related as follows:

 ∀l≥1, α(l)≤ρ(l)≤2πα(l).
###### Proposition 2 (Deviation Bound, Gaussian Case).

Suppose Assumptions 2, 3, 4 and 5 hold. Then, there exist a deterministic positive constant c̃ and a free parameter b > 0 such that, for sample size T exceeding a threshold proportional to log(pq) (made explicit in Corollary 4), we have

 P[ ‖X′W/T‖∞ ≤ Q(X, W, Θ⋆) R(p, q, T) ] ≥ 1 − 8exp(−b log(pq))

where

 Q(X, W, Θ⋆) = 8π√(b+1) c̃ ( |||Σ_X|||(1 + max_{1≤i≤p}‖Θ⋆_{:,i}‖₂²) + |||Σ_Y||| ),  R(p, q, T) = S_α(T)√(log(pq)/T).
###### Remark 1.

Note that the free parameter b serves to trade off the success probability on the one hand against the sample size threshold and multiplier function on the other. A larger b increases the success probability but worsens the sample size threshold and the multiplier function.

###### Proposition 3 (RE, Gaussian Case).

Suppose Assumptions 2, 3, 4 and 5 hold. Then, for some universal constant c > 0, when the sample size T exceeds a threshold proportional to log(p) (made explicit in Corollary 4), we have, with high probability, that for every vector v ∈ R^p,

 |v′Γ̂v| > α₁‖v‖₂² − τ₁(T, p)‖v‖₁², (3.1)

where

 α₁ = (1/2)λmin(Σ_X),  τ₁(T, p) = α₁/⌈cT min{1, η²}/(4 log(p))⌉,  and  η = λmin(Σ_X)/(108π S_α(T) λmax(Σ_X)).
###### Remark 2.

Note that, in Theorem 1, it is advantageous to have a large curvature α and a small tolerance τ so that the convergence rate is fast and the initial sample threshold for the result to hold is small. The result above, therefore, clearly shows that it is advantageous to have a well-conditioned Σ_X.

### 3.1 Estimation and Prediction Errors

Substituting the RE and DB constants from Propositions 2 and 3 into Theorem 1 immediately yields the following guarantee.

###### Corollary 4 (Lasso Guarantee for Gaussian Vectors under α-Mixing).

Suppose Assumptions 2-5 hold. Let c and c̃ be the fixed constants from Propositions 3 and 2, and let b > 0 be a free parameter. Then, for sample size

 T ≥ max{ (max{42e, 128s} log(p))/(c min{1, η²}),  √(b+1) c̃ log(pq) },  where η = λmin(Σ_X)/(108π S_α(T) λmax(Σ_X)),

we have, with high probability, that the lasso error bounds (2.5) and (2.6) hold with

 α = (1/2)λmin(Σ_X),  λ_T = 4Q(X, W, Θ⋆)R(p, q, T)

where

 Q(X, W, Θ⋆) = 8π√(b+1) c̃ ( |||Σ_X|||(1 + max_{1≤i≤p}‖Θ⋆_{:,i}‖₂²) + |||Σ_Y||| ),  R(p, q, T) = S_α(T)√(log(pq)/T).
###### Remark 3.

If the α-mixing coefficients are summable, i.e., S_α := lim_{T→∞} S_α(T) < ∞, then we get the usual convergence rate of √(log(pq)/T). Also, the threshold sample size scales as s log(p) up to constants. This is in agreement with what happens in the iid Gaussian case. When the α(l) are not summable, both the initial sample threshold required for the guarantee to be valid and the rate of error decay deteriorate; the latter becomes S_α(T)√(log(pq)/T). We see that as long as S_α(T)√(log(pq)/T) → 0, we still have consistency. In the finite order stable VAR case considered by basu2015regularized (), the α-mixing coefficients are geometrically decaying and hence summable (see Example 1 for details).

### 3.2 Examples

We illustrate applicability of our theory in Section 3 using the examples below.

###### Example 1 (Gaussian VAR).

Transition matrix estimation in sparse stable VAR models has been considered by several authors in recent years (davis2015sparse, ; han2013transition, ; song2011large, ). The lasso estimator is a natural choice for the problem.

Formally, a finite order Gaussian VAR(d) process is defined as follows. Consider a sequence of serially ordered random vectors (Z_t), Z_t ∈ R^p, that admits the following auto-regressive representation:

 Z_t = A₁Z_{t−1} + ⋯ + A_d Z_{t−d} + E_t,

where each A_i is a sparse non-stochastic coefficient matrix in R^{p×p} and the innovations E_t are iid p-dimensional random vectors from N(0, Σ_ε). Assume λmin(Σ_ε) > 0 and λmax(Σ_ε) < ∞.

Assume that the VAR(d) process is stable; i.e., det(I_p − A₁z − ⋯ − A_d z^d) ≠ 0 for all |z| ≤ 1. Now, we identify Y_t = Z_t and X_t = (Z′_{t−1}, …, Z′_{t−d})′ for each t.

We can verify (see Appendix E.1 for details) that Assumptions 1-5 hold. Note that stable Gaussian VAR processes are geometrically α-mixing, so S_α(T) remains bounded. As a result, Propositions 2 and 3, and thus Corollary 4, follow, and hence we have all the high probability guarantees for the lasso in Example 1. This shows that our theory covers the stable Gaussian VAR models for which basu2015regularized () provided lasso error bounds.
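As a quick numerical sanity check of Example 1 (our own simulation; the specific transition matrix is hypothetical), we can simulate a sparse stable Gaussian VAR(1), confirm stability via the spectral radius, and check that the sample covariance approximately solves the Lyapunov equation Σ = AΣA′ + Σ_ε implied by stationarity:

```python
import numpy as np

def spectral_radius(M):
    return np.abs(np.linalg.eigvals(M)).max()

def simulate_var1(A, T, rng, burn=500):
    """Draw X_t = A X_{t-1} + eps_t with iid N(0, I) innovations;
    a burn-in lets the chain reach (approximate) stationarity."""
    p = A.shape[0]
    x = np.zeros(p)
    out = np.empty((T, p))
    for t in range(-burn, T):
        x = A @ x + rng.standard_normal(p)
        if t >= 0:
            out[t] = x
    return out

# A hypothetical 4 x 4 sparse transition matrix (lower triangular here,
# so its eigenvalues are simply the diagonal entries).
A = np.zeros((4, 4))
A[0, 0], A[1, 0], A[2, 2], A[3, 1] = 0.5, 0.3, -0.4, 0.2
assert spectral_radius(A) < 1  # stability of the VAR(1)

rng = np.random.default_rng(1)
X = simulate_var1(A, T=20000, rng=rng)
# Stationarity with identity innovation covariance implies Sigma = A Sigma A' + I.
S_hat = X.T @ X / len(X)
assert np.allclose(S_hat, A @ S_hat @ A.T + np.eye(4), atol=0.1)
```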

We state the following convenient fact because it allows us to study any finite order VAR model by considering its equivalent VAR(1) representation. See Appendix E.1 for details.

###### Fact 4.

Every VAR(d) process can be written in VAR(1) form (see e.g. (lutkepohl2005new, , Ch 2.1)).

Therefore, without loss of generality, we can consider VAR(1) models in the ensuing examples.
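The companion-form construction behind Fact 4 is mechanical; here is a sketch (the helper name is ours). Given VAR(d) coefficients A₁, …, A_d, the stacked state (Z′_t, …, Z′_{t−d+1})′ follows a VAR(1) whose transition matrix has the A_i in its first block row and identities on the sub-diagonal, and the VAR(d) is stable exactly when this companion matrix has spectral radius below one:

```python
import numpy as np

def companion(As):
    """Companion matrix of a VAR(d): As = [A1, ..., Ad], each p x p.
    Returns the (p*d) x (p*d) transition matrix of the equivalent VAR(1)."""
    d, p = len(As), As[0].shape[0]
    top = np.hstack(As)                              # [A1 ... Ad]
    shift = np.hstack([np.eye(p * (d - 1)), np.zeros((p * (d - 1), p))])
    return np.vstack([top, shift])
```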

###### Example 2 (Gaussian VAR with Omitted Variable).

We study the OLS estimator of a VAR(1) process when some endogenous variables are omitted. This arises naturally when the underlying DGM is high-dimensional but not all variables are available to the researcher for estimation and prediction. It also happens when the researcher mis-specifies the scope of the model.

Notice that the system restricted to the retained set of variables is no longer a finite order VAR (and thus non-Markovian). There is model mis-specification, and this example serves to illustrate that our theory is applicable to models beyond the finite order VAR setting.

Consider a VAR(1) process such that each vector in the sequence is generated by the recursion below:

 (Zt;Ξt)=A(Zt−1;Ξt−1)+(EZ,t−1;EΞ,t−1)

where Z_t and E_{Z,t} collect the first p coordinates, and Ξ_t and E_{Ξ,t} the remaining coordinates, of the state and innovation vectors respectively. Also,

 A:=[AZZAZΞAΞZAΞΞ]

is the coefficient matrix of the VAR(1) process, with the blocks A_{ZZ} and A_{ZΞ} sparse and the spectral radius ρ(A) < 1. The innovations (E′_{Z,t}, E′_{Ξ,t})′ for t ∈ ℤ are iid draws from a Gaussian white noise process.

We are interested in the OLS 1-lag estimator of the system restricted to the set of variables in Z_t. Recall that

 Θ⋆ := argmin_{B∈R^{p×p}} E[‖Z_t − B′Z_{t−1}‖₂²].

Now, set Y_t = Z_t and X_t = Z_{t−1} for each t. We can verify that Assumptions 1-5 hold; see Appendix E.2 for details. As a result, Propositions 2 and 3, and thus Corollary 4, follow, and hence we have all the high probability guarantees for the lasso in this non-Markovian example.

## 4 Subweibull Random Vectors under β-Mixing

Existing analyses of lasso mostly assume subgaussian or subexponential tails. These assumptions ensure that the moment generating function exists, at least for some values of the free parameter. Non-existence of the moment generating function is often taken as the definition of having a heavy tail (foss2011introduction, ). We now introduce a family of random variables that includes subgaussian and subexponential random variables. In addition, it includes some heavy tailed distributions.

### 4.1 Subweibull Random Variables and Vectors

The subgaussian and subexponential random variables are characterized by the behavior of their tails. Among the several equivalent definitions of these random variables, we recall the ones based on the growth behavior of moments. Recall that a subgaussian (resp. subexponential) random variable X can be defined as one for which (E|X|^p)^{1/p} ≤ K√p (resp. (E|X|^p)^{1/p} ≤ Kp) for all p ≥ 1 and some constant K. A natural generalization of these definitions that allows for heavier tails is as follows. Fix some γ > 0 and require

 (E|X|^p)^{1/p} ≤ K p^{1/γ},  ∀p ≥ 1.

There are a few different equivalent ways of imposing the condition above, including a tail condition that says that the tail of X is no heavier than that of a Weibull random variable with parameter γ. That is the reason why we call this family “subweibull".
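The moment growth condition can be checked in closed form for an exact Weibull variable: if P(W > x) = exp(−x^γ), then E W^p = Γ(1 + p/γ), and (E W^p)^{1/p} grows at the rate p^{1/γ}. The following numerical sketch (ours) verifies this:

```python
import math

def weibull_moment_norm(p, gamma):
    """(E W^p)^(1/p) for W with survival function exp(-x^gamma);
    uses E W^p = Gamma(1 + p/gamma), computed via lgamma for stability."""
    return math.exp(math.lgamma(1.0 + p / gamma) / p)

# (E W^p)^(1/p) / p^(1/gamma) settles down to a constant as p grows,
# so W satisfies the subweibull(gamma) moment condition, and no better.
for gamma in (0.5, 1.0, 2.0):
    ratios = [weibull_moment_norm(p, gamma) / p ** (1.0 / gamma)
              for p in (10, 20, 40, 80)]
    assert max(ratios) / min(ratios) < 1.3
```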

###### Lemma 5.

(Subweibull properties) Let X be a random variable. Then the following statements are equivalent for every γ > 0. The constants K₁, K₂, K₃ appearing in them differ from each other by at most a constant factor depending only on γ.

1. The tails of X satisfy P(|X| > x) ≤ 2exp(−(x/K₁)^γ) for all x ≥ 0.

2. The moments of X satisfy (E|X|^p)^{1/p} ≤ K₂ p^{1/γ} for all p ≥ 1.

3. The moment generating function of |X|^γ is finite at some point; namely

 E[exp((|X|/K₃)^γ)] ≤ 2.
###### Remark 4.

A similar tail condition is called “Condition C0” by tao2013random (). However, to the best of our knowledge, this family has not been systematically introduced. The equivalence above is related to the theory of Orlicz spaces (see, for example, Lemma 3.1 in the lecture notes of pisier2016subgaussian ()).

###### Definition 3.

(Subweibull(γ) Random Variable and Norm). A random variable X that satisfies any property in Lemma 5 is called a subweibull(γ) random variable. The subweibull(γ) norm associated with X, denoted ‖X‖_{ψγ}, is defined to be the smallest K₂ such that the moment condition in Lemma 5 holds. In other words,

 ‖X‖_{ψγ} := sup_{p≥1} p^{−1/γ} (E|X|^p)^{1/p}.

It is easy to see that ‖·‖_{ψγ}, being a pointwise supremum of norms, is indeed a norm on the space of subweibull(γ) random variables.

###### Remark 5.

It is common in the literature (see, for example, foss2011introduction ()) to call a random variable heavy-tailed if its tail decays slower than that of an exponential random variable. This way of distinguishing between light and heavy tails is natural because the moment generating function of a heavy-tailed random variable thus defined fails to exist at any point. Note that, under such a definition, the subweibull(γ) family with γ < 1 includes heavy-tailed random variables.

In our theoretical analysis, we will often be dealing with squares of random variables. The next lemma tells us what happens to the subweibull parameter and the associated constant, under squaring.

###### Lemma 6.

For any γ > 0, if a random variable X is subweibull(2γ), then X² is subweibull(γ). Moreover,

 ‖X²‖_{ψγ} ≤ 2^{1/γ}‖X‖²_{ψ_{2γ}}.
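A short verification of this bound, using the moment characterization of the subweibull norm (a sketch of the standard argument):

```latex
\begin{align*}
\|X^2\|_{\psi_\gamma}
  &= \sup_{p \ge 1} p^{-1/\gamma}\left(\mathbb{E}|X|^{2p}\right)^{1/p}
   = \sup_{p \ge 1} p^{-1/\gamma}
     \left[\left(\mathbb{E}|X|^{2p}\right)^{1/(2p)}\right]^{2} \\
  &\le \sup_{p \ge 1} p^{-1/\gamma}
     \left[(2p)^{1/(2\gamma)}\,\|X\|_{\psi_{2\gamma}}\right]^{2}
   = \sup_{p \ge 1} p^{-1/\gamma}\,(2p)^{1/\gamma}\,\|X\|_{\psi_{2\gamma}}^{2}
   = 2^{1/\gamma}\,\|X\|_{\psi_{2\gamma}}^{2},
\end{align*}
```

where the inequality applies the subweibull(2γ) moment bound (E|X|^q)^{1/q} ≤ q^{1/(2γ)}‖X‖_{ψ_{2γ}} with q = 2p.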

We now define the subweibull norm of a random vector to capture dependence among its coordinates. It is defined using one dimensional projections of the random vector in the same way as we define subgaussian and subexponential norms of random vectors.

###### Definition 4.

Let γ > 0. A random vector X ∈ R^p is said to be a subweibull(γ) random vector if all of its one dimensional projections are subweibull(γ) random variables. We define the subweibull(γ) norm of a random vector as,

 ∥X∥ψγ:=supv∈Sp−1∥v′X∥ψγ

where S^{p−1} is the unit sphere in R^p.

To prove our results, we need measures that control the amount of dependence in the observations across time as well as within a given time period.

###### Assumption 6.

The process ((X_t, Y_t)) is geometrically β-mixing; i.e., there exist constants c > 0 and γ₁ > 0 such that

 β(n)≤2exp(−c⋅nγ1),∀n∈N.
###### Assumption 7.

Each random vector in the sequences (X_t) and (Y_t) follows a subweibull(γ₂) distribution with ‖X_t‖_{ψγ₂} ≤ K_X and ‖Y_t‖_{ψγ₂} ≤ K_Y, for some γ₂ > 0.

Finally, we make a joint assumption on the allowed pairs (γ₁, γ₂).

###### Assumption 8.

Assume γ < 1, where

 γ := (1/γ₁ + 2/γ₂)^{−1}.
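A tiny helper (ours) makes the trade-off in Assumption 8 concrete: heavier tails (smaller γ₂) or slower mixing (smaller γ₁) both shrink the effective exponent γ.

```python
def overall_gamma(gamma1, gamma2):
    """Effective exponent from Assumption 8: 1/gamma = 1/gamma1 + 2/gamma2,
    where gamma1 governs the beta-mixing decay (Assumption 6) and gamma2
    is the subweibull tail parameter (Assumption 7)."""
    return 1.0 / (1.0 / gamma1 + 2.0 / gamma2)

# e.g. geometric mixing (gamma1 = 1) with subgaussian tails (gamma2 = 2)
# gives gamma = 1/2; heavier tails with gamma2 = 1/2 give gamma = 1/5.
```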
###### Remark 6.

Note that the parameters γ₁ and γ₂ define a difficulty landscape. The “easy case", where the mixing is fast and the tails light enough that γ ≥ 1, is already addressed in the literature (e.g., in wong2017lasso ()). This paper provides theoretical guarantees for the difficult scenario where the tail probability decays slowly (small γ₂) and/or the data exhibit strong temporal dependence (small γ₁), and hence extends the literature to the whole spectrum of positive γ₁ and γ₂.

Now, we are ready to provide high probability guarantees for the deviation bound and restricted eigenvalue conditions.

###### Proposition 7 (Deviation Bound, β-Mixing Subweibull Case).

Suppose Assumptions 1-3 and 6-8 hold. Let c′ > 0 be a constant and let the multiplier and rate functions be defined as

 Q(X, W, Θ⋆) := C₂K,  R(p, q, T) := √(log(pq)/T),

where K := 2^{2/γ₂}K_X² as in Proposition 8 below. Then, for sample size T ≥ C₁(log(pq))^{2/γ−1}, we have

 P[ ‖X′W/T‖∞ > Q(X, W, Θ⋆)R(p, q, T) ] ≤ 2exp(−c′ log(pq)),

where the constants C₁ and C₂ depend only on c′ and the parameters appearing in Assumptions 6 and 7.

###### Proposition 8 (RE, β-Mixing Subweibull Case).

Suppose Assumptions 1-3 and 6-8 hold. Let

 K := 2^{2/γ₂} K_X².

Then for sample size

 T ≥ max{ 54K(2C₁ log(p))^{1/γ}/λmin(Σ_X(0)),  (54K/λmin(Σ_X(0)))^{(2−γ)/(1−γ)}(C₂/C₁)^{1/(1−γ)} }

we have with probability at least

 1 − 2T exp{−c̃T^γ},  where c̃ = (λmin(Σ_X(0)))^γ/((54K)^γ 2C₁),

that for all v ∈ R^p,

 (1/T)‖Xv‖₂² ≥ α‖v‖₂² − τ‖v‖₁²,

where α = (1/2)λmin(Σ_X(0)) and τ = τ(T, p) is a tolerance that decreases in T. Note that the constants C₁ and C₂ depend only on the parameters appearing in Assumptions 6 and 7.

### 4.2 Estimation and Prediction Errors

Substituting the RE and DB constants from Propositions 7 and 8 into Theorem 1 immediately yields the following guarantee.

###### Corollary 9 (Lasso Guarantee for Subweibull Vectors under β-Mixing).

Suppose Assumptions 1-3 and 6-8 hold. Let C₁, C₂, c′, c̃ and K be the constants defined in Propositions 7 and 8. Then, for sample size

 T ≥ max{ C₁(log(pq))^{2/γ−1},  54K(2 max{8s/c̃, C₁} log(p))^{1/γ}/λmin(Σ_X(0)),  (54K/λmin(Σ_X(0)))^{(2−γ)/(1−γ)}(C₂/C₁)^{1/(1−γ)} }

we have with probability at least

 1 − 2T exp{−c̃T^γ} − 2exp(−c′ log(pq))

that the lasso error bounds (2.5) and (2.6) hold with

 α = (1/2)λmin(Σ_X),  λ_T = 4Q(X, W, Θ⋆)R(p, q, T)

where

 Q(X, W, Θ⋆) = C₂K,  R(p, q, T) = √(log(pq)/T).

### 4.3 Examples

We explore the applicability of our theory from Section 4 beyond linear Gaussian processes using the examples below. Together, they show that the high probability guarantees for the lasso extend to data generated from DGMs involving one or more of the following: heavy tailed subweibull innovations, model mis-specification, and nonlinearity.

###### Example 3 (Subweibull VAR).

We study a generalization of the VAR model, one with subweibull(γ₂) realizations. Consider a VAR model defined as in Example 1, except that we replace the Gaussian white noise innovations with iid random vectors drawn from some subweibull(γ₂) distribution with a non-singular covariance matrix. Now, consider a sequence generated according to this model. Then each observation will be a mean zero subweibull random vector.

Next, we cast the model in regression form. Assuming that the transition matrices are sparse and the VAR process is stable, we can verify (see Appendix E.1 for details) that Assumptions 1-3 and 6-8 hold. As a result, Propositions 7 and 8 apply, and hence we have all the high probability guarantees for the lasso in Example 3. This shows that our theory covers DGMs beyond stable Gaussian processes.
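As a sketch of Example 3 in code, the snippet below simulates a sparse VAR(1) with symmetrized Weibull(1/2) innovations (which are subweibull(1/2), so their tails are heavier than exponential) and fits the lasso row by row via proximal gradient descent. The transition matrix, tuning constant, and solver are our illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(2)

# Sparse, stable VAR(1): X_{t+1} = A X_t + W_{t+1}.
p, T = 10, 3000
A = 0.4 * np.eye(p)  # sparse transition matrix, spectral radius 0.4

def subweibull_noise(size):
    # Symmetrized Weibull(shape=1/2): P(|W| > t) = exp(-t^{1/2}),
    # i.e. subweibull(1/2) tails, heavier than exponential.
    return rng.weibull(0.5, size) * rng.choice([-1.0, 1.0], size=size)

X = np.zeros((T + 1, p))
for t in range(T):
    X[t + 1] = A @ X[t] + subweibull_noise(p)

def lasso_ista(Z, y, lam, n_iter=500):
    """Plain ISTA (proximal gradient) for (1/2n)||y - Zb||_2^2 + lam*||b||_1."""
    n, d = Z.shape
    L = np.linalg.norm(Z, 2) ** 2 / n  # Lipschitz constant of the gradient
    b = np.zeros(d)
    for _ in range(n_iter):
        b = b - (Z.T @ (Z @ b - y) / n) / L                    # gradient step
        b = np.sign(b) * np.maximum(np.abs(b) - lam / L, 0.0)  # soft-threshold
    return b

# Tuning parameter of the order sqrt(log(pq)/T), as in Corollary 9.
lam = 2.0 * np.sqrt(np.log(p * p) / T)
# Row j of A is the coefficient vector for predicting coordinate j.
A_hat = np.vstack([lasso_ista(X[:-1], X[1:, j], lam) for j in range(p)])
print(np.linalg.norm(A_hat - A))  # Frobenius error, small relative to ||A||_F
```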

###### Example 4 (VAR with Subweibull Innovations and Omitted Variable).

We use the same setup as in Example 2, except that we replace the Gaussian white noise innovations with iid random vectors drawn from some subweibull(γ₂) distribution with a non-singular covariance matrix. Now, consider a sequence generated according to this model. Then each observation will be a mean zero subweibull random vector.

As before, we cast the model in regression form. It can be shown that Assumptions 1-3 and 6-7 hold; see Appendix E.2 for details. Therefore, Propositions 7 and 8, and thus Corollary 9, apply, and hence we have all the high probability guarantees for the lasso for subweibull random vectors from a non-Markovian model.

###### Example 5 (Multivariate ARCH).

We explore the generality of our theory by considering a multivariate nonlinear time series model with subweibull innovations. A popular nonlinear multivariate time series model in econometrics and finance is the vector autoregressive conditionally heteroscedastic (ARCH) model. We chose the specific ARCH model below for convenient validation of the geometric β-mixing property of the process; our argument may potentially extend to a larger class of multivariate ARCH models.

Consider random vectors defined by the following recursion, for appropriate constants and a sparse coefficient matrix A:

 (4.1)

where the innovations are iid random vectors from some subweibull(γ₂) distribution with a non-singular covariance matrix, and the clipping function truncates its argument to stay in a fixed interval. Consequently, each observation will be a mean zero subweibull random vector. Note that Θ⋆ = A⊤, the transpose of the coefficient matrix A here.

As before, we cast the model in regression form. We can verify (see Appendix E.3 for details) that Assumptions 1-3 and 6-7 hold. Therefore, Propositions 7 and 8, and thus Corollary 9, apply, and hence we have all the high probability guarantees for the lasso on nonlinear models with subweibull innovations.
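Since the display (4.1) did not survive extraction here, the sketch below simulates a purely hypothetical stand-in: one clipped-volatility ARCH recursion of the general kind described, with a sparse A, a conditional scale clipped to a fixed interval, and Gaussian innovations used in place of a general subweibull draw. The functional form and all constants are our illustrative choices, not necessarily those of (4.1).

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical clipped multivariate ARCH recursion (stand-in for (4.1)):
#   X_{t+1} = A X_t + clip(a + b*|X_t|, m, M) * W_{t+1}   (elementwise scale)
p, T = 5, 2000
A = 0.3 * np.eye(p)        # sparse coefficient matrix, spectral radius 0.3
a, b, m, M = 0.5, 0.2, 0.5, 2.0

X = np.zeros((T + 1, p))
for t in range(T):
    scale = np.clip(a + b * np.abs(X[t]), m, M)  # volatility, clipped to [m, M]
    X[t + 1] = A @ X[t] + scale * rng.standard_normal(p)

# Clipping the conditional scale keeps the trajectory stable and bounded
# in probability, in line with the geometric beta-mixing discussion above.
print(np.abs(X).max())
```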

### Acknowledgments

We thank Sumanta Basu and George Michailidis for helpful discussions, and Roman Vershynin for pointers to the literature. We acknowledge the support of NSF via a regular (DMS-1612549) and a CAREER grant (IIS-1452099).

## References

• [1] Alekh Agarwal, Sahand Negahban, and Martin J Wainwright. Fast global convergence of gradient methods for high-dimensional statistical recovery. The Annals of Statistics, 40(5):2452–2482, 2012.
• [2] Pierre Alquier, Paul Doukhan, et al. Sparsity considerations for dependent variables. Electronic journal of statistics, 5:750–774, 2011.
• [3] Sumanta Basu and George Michailidis. Regularized estimation in sparse high-dimensional time series models. The Annals of Statistics, 43(4):1535–1567, 2015.
• [4] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009.
• [5] Peter J Bickel, Ya’acov Ritov, and Alexandre B Tsybakov. Simultaneous analysis of lasso and dantzig selector. The Annals of Statistics, pages 1705–1732, 2009.
• [6] Richard C Bradley. Basic properties of strong mixing conditions. A survey and some open questions. Probability Surveys, 2(2):107–144, 2005.
• [7] Peter Bühlmann and Sara Van De Geer. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.
• [8] Venkat Chandrasekaran, Benjamin Recht, Pablo A Parrilo, and Alan S Willsky. The convex geometry of linear inverse problems. Foundations of Computational mathematics, 12(6):805–849, 2012.
• [9] Xiaohui Chen, Mengyu Xu, and Wei Biao Wu. Covariance and precision matrix estimation for high-dimensional time series. The Annals of Statistics, 41(6):2994–3021, 2013.
• [10] Alexander Chudik and M Hashem Pesaran. Infinite-dimensional VARs and factor models. Journal of Econometrics, 163(1):4–22, 2011.
• [11] Alexander Chudik and M Hashem Pesaran. Econometric analysis of high dimensional VARs featuring a dominant unit. Econometric Reviews, 32(5-6):592–649, 2013.
• [12] Alexander Chudik and M Hashem Pesaran. Theory and practice of GVAR modelling. Journal of Economic Surveys, 2014.
• [13] Richard A Davis, Pengfei Zang, and Tian Zheng. Sparse vector autoregressive modeling. arXiv preprint arXiv:1207.0520, 2012.
• [14] Richard A Davis, Pengfei Zang, and Tian Zheng. Sparse vector autoregressive modeling. Journal of Computational and Graphical Statistics, (just-accepted):1–53, 2015.
• [15] David L Donoho, Arian Maleki, and Andrea Montanari. Message-passing algorithms for compressed sensing. Proceedings of the National Academy of Sciences, 106(45):18914–18919, 2009.
• [16] JianQing Fan, Lei Qi, and Xin Tong. Penalized least squares estimation with weakly dependent data. Science China Mathematics, 59(12):2335–2354, 2016.
• [17] Sergey Foss, Dmitry Korshunov, Stan Zachary, et al. An introduction to heavy-tailed and subexponential distributions, volume 6. Springer, 2011.
• [18] Shaojun Guo, Yazhen Wang, and Qiwei Yao. High dimensional and banded vector autoregressions. arXiv preprint arXiv:1502.07831, 2015.
• [19] Fang Han and Han Liu. Transition matrix estimation in high dimensional time series. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 172–180, 2013.
• [20] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical learning with sparsity: the lasso and generalizations. CRC Press, 2015.
• [21] Fumio Hayashi. Econometrics. Princeton University Press, 2000.
• [22] Il'dar Abdulovich Ibragimov and Yurii Antolevich Rozanov. Gaussian Random Processes. Springer, 1978.
• [23] Anders Bredahl Kock and Laurent Callot. Oracle inequalities for high dimensional vector autoregressions. Journal of Econometrics, 186(2):325–344, 2015.
• [24] Sanjeev Kulkarni, Aurelie C Lozano, and Robert E Schapire. Convergence and consistency of regularized boosting algorithms with stationary β-mixing observations. In Advances in Neural Information Processing Systems, pages 819–826, 2005.
• [25] Eckhard Liebscher. Towards a unified approach for proving geometric ergodicity and mixing properties of nonlinear autoregressive processes. Journal of Time Series Analysis, 26(5):669–689, 2005.
• [26] Po-Ling Loh and Martin J Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. The Annals of Statistics, 40(3):1637–1664, 2012.
• [27] Helmut Lütkepohl. New introduction to multiple time series analysis. Springer Science & Business Media, 2005.
• [28] Daniel J Mcdonald, Cosma R Shalizi, and Mark J Schervish. Estimating beta-mixing coefficients. In International Conference on Artificial Intelligence and Statistics, pages 516–524, 2011.
• [29] Timothy L McMurry and Dimitris N Politis. High-dimensional autocovariance matrices and optimal linear prediction. Electronic Journal of Statistics, 9:753–788, 2015.
• [30] Marcelo C Medeiros and Eduardo F Mendes. ℓ1-regularization of high-dimensional time-series models with non-Gaussian and heteroskedastic errors. Journal of Econometrics, 191(1):255–271, 2016.
• [31] Florence Merlevède, Magda Peligrad, and Emmanuel Rio. A bernstein type inequality and moderate deviations for weakly dependent sequences. Probability Theory and Related Fields, 151(3-4):435–474, 2011.
• [32] Yuval Nardi and Alessandro Rinaldo. Autoregressive process modeling via the lasso procedure. Journal of Multivariate Analysis, 102(3):528–549, 2011.
• [33] Sahand Negahban and Martin J Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics, pages 1069–1097, 2011.
• [34] Sahand Negahban and Martin J Wainwright. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. The Journal of Machine Learning Research, 13(1):1665–1697, 2012.
• [35] Sahand N Negahban, Pradeep Ravikumar, Martin J Wainwright, and Bin Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.
• [36] Rodrigue Ngueyep and Nicoleta Serban. Large vector auto regression for multi-layer spatially correlated time series. Technometrics, 2014.
• [37] William Nicholson, David Matteson, and Jacob Bien. VARX-L: Structured regularization for large vector autoregressions with exogenous variables. arXiv preprint arXiv:1508.07497, 2015.
• [38] William B Nicholson, Jacob Bien, and David S Matteson. Hierarchical vector autoregression. arXiv preprint arXiv:1412.5250, 2014.
• [39] Gilles Pisier. Subgaussian sequences in probability and fourier analysis, 2016. arXiv preprint arXiv:1607.01053v3.
• [40] Huitong Qiu, Sheng Xu, Fang Han, Han Liu, and Brian Caffo. Robust estimation of transition matrices in high dimensional heavy-tailed vector autoregressive processes. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1843–1851, 2015.
• [41] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Restricted eigenvalue properties for correlated gaussian designs. The Journal of Machine Learning Research, 11:2241–2259, 2010.
• [42] Murray Rosenblatt. A central limit theorem and a strong mixing condition. Proceedings of the National Academy of Sciences of the United States of America, 42(1):43, 1956.
• [43] Mark Rudelson and Roman Vershynin. Hanson-wright inequality and sub-gaussian concentration. Electronic Communications in Probability, 18(82):1–9, 2013.
• [44] Mark Rudelson and Shuheng Zhou. Reconstruction from anisotropic random measurements. Information Theory, IEEE Transactions on, 59(6):3434–3447, 2013.
• [45] Vidyashankar Sivakumar, Arindam Banerjee, and Pradeep K Ravikumar. Beyond sub-gaussian measurements: High-dimensional structured estimation with sub-exponential designs. In Advances in Neural Information Processing Systems, pages 2206–2214, 2015.
• [46] Song Song and Peter J Bickel. Large vector auto regressions. arXiv preprint arXiv:1106.3915, 2011.
• [47] Terence Tao and Van Vu. Random matrices: Sharp concentration of eigenvalues. Random Matrices: Theory and Applications, 2(03):1350007, 2013.
• [48] Dag Tjøstheim. Non-linear time series and markov chains. Advances in Applied Probability, pages 587–611, 1990.
• [49] Yoshimasa Uematsu. Penalized likelihood estimation in high-dimensional time series models and its application. arXiv preprint arXiv:1504.06706, 2015.
• [50] Mathukumalli Vidyasagar. Learning and generalisation: with applications to neural networks. Springer Science & Business Media, second edition, 2003.
• [51] Gabrielle Viennet. Inequalities for absolutely regular sequences: application to density estimation. Probability theory and related fields, 107(4):467–492, 1997.
• [52] Hansheng Wang, Guodong Li, and Chih-Ling Tsai. Regression coefficient and autoregressive order shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(1):63–78, 2007.
• [53] Zhaoran Wang, Fang Han, and Han Liu. Sparse principal component analysis for high dimensional vector autoregressive models. arXiv preprint arXiv:1307.0164, 2013.
• [54] Kam Chung Wong, Zifan Li, and Ambuj Tewari. Lasso guarantees for time series estimation under subgaussian tails and β-mixing. arXiv preprint arXiv:1602.04265 [stat.ML], 2017.
• [55] W. B. Wu and Y. N. Wu. High-dimensional linear models with dependent observations. Electronic Journal of Statistics, 10(1):352–379, 2016.
• [56] Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, pages 94–116, 1994.
• [57] Danna Zhang and Wei Biao Wu. Gaussian approximation for high dimensional time series. arXiv preprint arXiv:1508.07036, 2015.

## Appendix A Proof of Master Theorem

###### Proof of Theorem 1.

We will break down the proof in steps.

1. Since Θ̂ is optimal for (2.4) and Θ⋆ is feasible,

$$ \frac{1}{T}\big|\big|\big|Y - X\hat{\Theta}\big|\big|\big|_F^2 + \lambda_T \big\|\mathrm{vec}(\hat{\Theta})\big\|_1 \;\le\; \frac{1}{T}\big|\big|\big|Y - X\Theta^{\star}\big|\big|\big|_F^2 + \lambda_T \big\|\mathrm{vec}(\Theta^{\star})\big\|_1 $$
2. Let Δ̂ := Θ̂ − Θ⋆.

Note that

$$ \Big\{\big\|\mathrm{vec}(\Theta^{\star}_S)\big\|_1 - \big\|\mathrm{vec}(\hat{\Delta}_S)\big\|_1\Big\} + \big\|\mathrm{vec}(\hat{\Delta}_{S^c})\big\|_1 - \big\|\mathrm{vec}(\Theta^{\star})\big\|_1 \;=\; \big\|\mathrm{vec}(\hat{\Delta}_{S^c})\big\|_1 - \big\|\mathrm{vec}(\hat{\Delta}_S)\big\|_1 $$

where S denotes the support of Θ⋆.

3. With RE constant α and tolerance τ, deviation bound constant Q(X,W,Θ⋆) and