Sparse Multivariate ARCH Models: Finite Sample Properties

# Sparse Multivariate ARCH Models: Finite Sample Properties

## Abstract

We provide finite sample properties of sparse multivariate ARCH processes, where the linear representation of ARCH models allows for an ordinary least squares estimation. Under the restricted strong convexity of the unpenalized loss function, regularity conditions on the penalty function, strict stationary and -mixing process, we prove non-asymptotic error bounds on the regularized ARCH estimator. Based on the primal-dual witness method of Loh and Wainwright (2017), we establish variable selection consistency, including the case when the penalty function is non-convex. These theoretical results are supported by empirical studies.

Keywords: Multivariate ARCH; non-convex regularizer; statistical consistency; support recovery.

## 1 Introduction

Modelling the joint behavior of a high-dimensional random vector has become a key challenge for academics and practitioners. In the financial area, it is arduous to build a realistic model that is statistically relevant and consistent with some well-known stylized features of asset returns such as fat tails, volatility clustering, cross-sectional dependency, and the like. In such discrete time multivariate framework, the usual key quantities are the covariance matrices of the current asset returns, given their past values.

In the literature, many specifications for discrete-time multivariate dynamic models have been proposed. Most of them basically belong to the multivariate GARCH (MGARCH) family or to the multivariate stochastic volatility family: see the surveys of Bauwens, Laurent and Rombouts (2006) and Asai, McAleer and Yu (2006), respectively. These models can be simulated by specifying the dynamics of the first two conditional moments of the underlying distributions on one side, and the law of the innovations on the other side. However, a classical drawback is the so-called “curse of dimensionality” as the specification of a general multivariate dynamic model induces an explosion of the number of parameters to be estimated. This implies practical problems of inference and possibly over-fitting.

The inference for -dimensional multivariate GARCH models is usually led by the quasi-maximum likelihood method (see Francq and Zakoïan (2010), e.g.), where the order of the free parameters is for the most general MGARCH model called vectorial GARCH (VEC-GARCH). The corresponding objective function is highly non-linear (say a multivariate Gaussian QML) and thus fosters the use of reduced versions of such multivariate models: for instance the scalar BEKK of Engle and Kroner (1995). However, when is large, capturing heterogeneous patterns will be difficult within scalar models.

Another approach is given by factor modeling, where a small number of factors aims at capturing the cross-sectional information. Among others, Fan, Fan and Lv (2008) emphasized the relevance of factor models for high-dimensional precision matrix estimation. However, this modeling requires the identification of the corresponding factors. An “expert” analysis is based on some priors regarding the leading underlying factors. Otherwise, latent unobserved factors induce particular estimation issues and their number is questionable.

In this paper, to tackle the curse of dimensionality inherent to MGARCH modeling, we propose to use the multivariate ARCH (MARCH) process derived from the VEC-GARCH model and to consider a regularized - or penalized - ordinary least squares (OLS) statistical criterion. MARCH models actually admit a linear representation with respect to the parameters, contrary to GARCH ones. Besides, any “invertible” GARCH process may be written as an infinite order ARCH model, under some conditions on its coefficients. Therefore, we argue that highly parameterized ARCH models (with numerous lags) should behave at least as well as more GARCH models that entail an autoregressive component, in terms of realism and flexibility. Nonetheless, we believe that not all the covariance/variance components have a significant influence on the other components. Based on this sparsity assumption, we consider a regularization procedure of the multivariate ARCH model to recover the true set of non-zero parameters. This regularized MARCH approach is a major contribution compared to previous studies in the GARCH literature that proposed methods for high-dimensional vectors: for instance Engle, Ledoit and Wolf (2017) propose a composite likelihood approach to estimate the covariance dynamic on pairs of variables, instead of the whole vector at once; Bauwens, Grigoreyva and Ortega (2016) propose an algorithmic method to estimate non-scalar version of large-dimensional GARCH models. In these studies, sparsity is not explicitly considered and no penalization methods have been proposed so far for MGARCH models.

The theoretical properties of the penalized OLS loss function have widely been investigated in numerous studies, which rely on the key assumption that the samples are independent and identically distributed: e.g. Knight and Fu (2000), Fan and Li (2001), Zhang and Zou (2009) for the asymptotic properties; e.g. van de Geer (2008), van de Geer and Bühlmann (2009) for finite sample studies. Recent works have extended these results to the dependent case: asymptotic properties of the Lasso for high-dimensional time series were considered by Loh and Wainwright (2012) or Wu and Wu (2014); Basu and Michailidis (2015) derive finite sample error bounds for Lasso regularized Vector Autoregression models. The convexity of the statistical criterion is a key point to derive such results.

In this paper, we are interested in regularized ARCH models, where the regularizer can potentially be non-convex such as the SCAD or the MCP. We thus rely on the general framework developed by Loh and Wainwright (2014, 2017), which covers a broad range of non-convex objective functions such as the corrected Lasso for error in variables linear models, or simply the scad-penalized OLS loss function. Under the assumption of restricted strong convexity (Negahban, Ravikumar, Wainwright and Yu, 2012) of the unpenalized loss function and suitable regularity conditions on the regularizer, they establish that any stationary points of the penalized function, including both local and global optimum, lies within statistical precision of the true sparse parameter and provide conditions for variable selection consistency, a property also called support recovery. Our study uses this setting to derive finite sample properties of the regularized MARCH model under stationary and -mixing process with sub-Gaussian innovations. To do so, we show that the unpenalized loss function satisfies a certain type of restricted strong convexity called restricted eigenvalue condition. Then our main contribution is to quantify the statistical accuracy of the sparse MARCH estimator by deriving bounds on the Frobenius and entrywise max norm errors between the penalized estimator and the true parameter. Furthermore, we provide sufficient conditions for the sparse MARCH estimator to satisfy the support recovery property. Interestingly, the usual incoherence condition can be dropped under suitable conditions on the regularizers.

The remainder of the paper is organized as follows. In Section 2, we describe the multivariate ARCH framework and the penalized ordinary least squares criterion. Section 3 is devoted to the model assumptions and a preliminary concentration result. In Section 4, we demonstrate that the unpenalized loss function satisfies the restricted strong convexity property and we derive non-asymptotic upper bounds on the estimation error together with the conditions for which the support recovery property is satisfied. Section 5 illustrates these theoretical results through a simulated experiment and the relevance of the proposed regularization method is carried out on real a portfolio analysis.

Notations. Throughout this paper, we denote the cardinality of a set by . For a vector , we denote the norm as for , and . Let the subset , then it the vector restricted to . For a matrix , we denote , and the spectral, infimum and Frobenius norms, respectively, and write the coordinate-wise maximum (in absolute value). We write to denote the vectorization operator that stacks the columns of on top of one another into a vector. For a -square and symmetric matrix , we denote by the vector that stacks the columns of the lower triangular part of its argument. We denote by (resp ) the positive definiteness (resp semi-definiteness) of . The notation refers to the Kronecker operator. For a function , we denote by the gradient or subgradient of and the Hessian of . We denote by the Hessian of restricted to the block .

## 2 Sparse VEC-ARCH process

### 2.1 VEC-ARCH model

The vector GARCH model is a direct generalization of univariate GARCH models, where the conditional covariance matrix is a function of lagged conditional variances and lagged cross-products of all components. Denoting by a sequence of -dimensional random vectors, whose dynamics is specified by , a finite-dimensional parameter, the vector GARCH model corresponds to

 {ϵt=H1/2tηt,withHt:=E[ϵtϵ′t|Ft−1]≻0so thatvech(Ht)=ω+∑k=1qAkvech(ϵt−kϵ′t−k)+∑i=1rBivech(Ht−i), (2.1)

where is a sequence of i.i.d. variables with distribution independent of the natural filtration, and are square matrices and .

This VEC-GARCH(r,q) approach definitely hampers any high-dimensional modelling as everything is explained by everything. Since the estimation procedure is based on a quasi-maximum likelihood method, it is challenging to apply a penalization procedure to obtain a parsimonious estimator due to the non-linear objective loss function. Moreover, the theoretical properties of the sparse estimator, such as asymptotic probability bounds or finite sample error bounds, are hard to derive since the penalized objective function would not be convex. In this paper, we thus omit the autoregressive part in (2.1), that is , and using the vectorial specification, we consider the VEC-ARCH(q) process

 vech(Ht)=ω+q∑k=1Akvech(ϵt−kϵ′t−k)=AqXq,t−1, (2.2)

where , with , so that with and . This specification includes the diagonal VEC-ARCH process, where

 Ht=Ω+q∑k=1diag(ϵt−k)¯Akdiag(ϵt−k), (2.3)

so that and . The diagonal VEC-ARCH will be of interest in our applications due to the shrinking number of parameters it allows for.

Importantly, the conditional process given in (2.2) must be parameterized so that is positive-definite. To fix the ideas, we write and denote by the entry of located in the same row as (where ) and belonging to the same column as the element of of . Hence, we have

 hij,t=ωij+q∑k=1p∑1≤j′≤i′akij,i′j′ϵi′,t−kϵj′,t−k⇔hij,t=ωij+q∑k=1ϵ′t−kAij,kϵt−k, (2.4)

where and symmetric with th entries as for and on the diagonal. Thus, following the notations of Chrétien and Ortega (2014), we introduce the matrix operator defined as

where , with denotes the map that yields the position of component , of any symmetric matrix in its equivalent representation. Then relationship (2.4), and thus (2.2), is equivalent to

 Ht=Ω+q∑k=1(Ip⊗ϵ′t−k)Σ(Ak)(Ip⊗ϵt−k),

with a symmetric matrix so that . Following Gouriéroux (1997), we have the sufficient condition

 (Ht)is positive definite ifΣ(Ak)⪰0,Ω≻0. (2.5)

There is a one-to-one mapping between the components of and and the positivity condition on forms a convex set, which will be taken into account in the parameter constraint set.

### 2.2 Regularized statistical criterion

Interestingly, (2.2) can be written as a linear model

 vech(ϵtϵ′t)=AqXq,t−1+ζt,E[ζt|Ft−1]=0, (2.6)

so that with for every couple . Note that is a martingale difference. The previous linear model (2.6) will be estimated by a penalized least squares procedure. In terms of inference, this is a dramatic advantage w.r.t. the usual quasi-maximum likelihood estimation procedure of GARCH models. Therefore, in practical terms, it will be easier to estimate ARCH-type models with a lot of variables and lags (,) than a GARCH model with the same and .

In this parameterization, the total number of parameters is . As for the diagonal ARCH, the total number of parameters is . Our main assumption concerns the sparsity of the true support, which corresponds to the total number of non-zero elements supposed to be strictly inferior to (or for the diagonal case). Our objective is to estimate a sparse under the suitable positive definiteness constraint. To do so, we consider the OLS loss function

 GT(Aq)=12TT∑t=1∥Yt−AqXq,t−1∥22, (2.7)

where . The true parameter is denoted by and is supposed to uniquely minimize . The sparsity assumption concerns both parameters and . Actually, the proposed framework could also accommodate a variance targeting procedure with respect to to reduce the number of parameters1.

To recover the sparse support, we consider the regularized OLS estimator

 ^Aq=argmin{Aq:g(vec(Aq))≤R;Ω≻0,∀l,Σ(Al)⪰0}{GT(Aq)+\boldmathp(λT,vec(Aq))},

is a regularizer applied to each component of ; is the regularization parameter, which depends on the sample size, and enforce a particular type of sparse structure in the solution; is a supplementary regularization parameter. Due to the potential non-convexity of this penalty and following Loh and Wainwright (2014), we include the side condition with a convex function and to ensure the existence of local/global optima. We also impose , where is the true parameter vector of interest. is supposed to be sparse so that , with . Using the one-to-one mapping between and , must belong to the convex set of positive semi-definite matrices, which also contains .

Our proposed framework applies to any estimator that belongs to the interior of the constrained parameter set and satisfies the first order othogonality conditions, that is

 ∇vec(Aq)GT(^Aq)+∂vec(Aq)\boldmathp(λT,Aq)=0,

where is the sub-gradient of the regularizer. Such parameter is called a stationary point.

## 3 Preliminaries and assumptions

Establishing theoretical properties for penalized multivariate ARCH models requires some probabilistic structure on the process. In particular, Boussama (2006) provides conditions for the existence of a unique stationary and non-anticipative solution of the general vector GARCH model. This solution is unique, geometrically ergodic and -mixing, which implies that the solution is geometrically -mixing. Importantly, the -mixing coefficients of a process are maintained when applying a measurable mapping. For the sequence , we define the algebras . The -mixing (also called completely regular) coefficient of the sequence is defined as

 β(ϵ–)n=12sup{∑i=1I∑j=1J|P(Ai∩Bj)−P(Ai)P(Bj)|:Aiany finite partition inσl,Bjany finite % partition inσ′l+n,l≥1}.

For a measurable function , the stationary sequence has its -mixing rate bounded by the corresponding rate of .

We now provide the conditions of Boussama (2006) adapted to the VEC-ARCH(q) model (2.2).

###### Theorem 3.1.

(Boussama, 2006)
There exists a strictly stationary and non-anticipative solution of the VEC-ARCH(q) model (2.2) if

• the positivity condition on and given in (2.5) is satisfied.

• the distribution of has a density, positive on a neighborhood of , with respect to the Lebesgue measure on .

• .

This solution is unique, -mixing and ergodic.

In this paper, the conditions obtained by Boussama (2006) are assumed so that is a striclty stationary and -mixing process.

###### Assumption 1.

Under the conditions of Theorem 3.1, the process is strictly stationary, ergodic, non-anticipative and is geometrically -mixing so that it satisfies .

The next assumption requires the definition of the sub-Gaussian norm for a random variable , which is

 ∥U∥ψ2=supp≥11p1/2(E[|U|p])1/p,

so that has a sub-Gaussian constant if .

###### Assumption 2.

The components of are sub-Gaussian such that , where and does not depend on .

Assumption 1 is key to derive concentration inequalities. Under -mixing processes, the concentration inequalities for dependent variables will be close to the concentration inequalities for i.i.d. variables, the mixing coefficients behaving like a “penalty” in the probability. The sub-Gaussian property on in assumption 2 is another key ingredient to derive the concentration inequality in Lemma 3.3. This assumption implies that is also a sub-Gaussian process since is a measurable process and thus entirely depends on past observations.

###### Assumption 3.

Sparsity assumption: .

###### Assumption 4.

We consider regularization functions that are supposed to be amenable regularizers defined as follows. We denote the penalty function - or regularizer -, which is supposed to be coordinate-separable, id est . Furthermore, let , and is called -amenable if

• is symmetric around zero and .

• is non-decreasing on .

• is non-increasing on .

• is differentiable for any .

• .

• is convex for some .
The regularizer is -amenable if in addition

• There exists such that for .

We denote by the function so that the function is convex.

Assumption 3 implies that the true support (unknown) is sparse, that is the matrix contains zero components. The regularization - or penalization - procedure provides an estimate of by discarding past volatilities and covariances. To derive our theoretical properties, assumption 4 provides regularity conditions that potentially encompass non-convex functions. These regularity conditions are the same than Loh and Wainwright (2014, 2017) or Loh (2017). In this paper, we focus on the Lasso, the SCAD due to Fan and Li (2001) and the MCP due to Zhang (2010), given by

where and are fixed parameters for the SCAD and MCP respectively. The Lasso is a -amenable regularizer, whereas the SCAD and the MCP regularizers are -amenable. More precisely, (resp. , resp. ) for the Lasso (resp. SCAD, resp. MCP): see the derivations of Loh and Wainaright (2014).

Contrary to the large sample approach such as the study of Hafner and Preminger (2009), who provide consistency and asymptotic normality results based on the Gaussian quasi-maximum likelihood estimator for the VEC-GARCH process, we are interested in the finite sample analysis. To derive non-asymptotic results that hold with a certain probability, the key tool is the concentration inequality. In this paper and for some reasons that will appear hereafter, we aim at obtaining a concentration inequality for a sum of polynomials of degree two and four of dependent sub-Gaussian variables. To derive such bound, we rely on Theorem 1.4 of Adamczak and Wolff (2015) obtained for the i.i.d. setting. We first need to introduce the following notations. Let and be integers and . For and , let . We denote by the set of partitions of nonempty and pairwise disjoint sets. Given a partition , we define the norm

 ∥\boldmathC∥J=sup{∑i∈[p]daik∏l=1x(l)iJl:∥x(l)iJl∥F≤1,1≤l≤k}, (3.1)

where is a indexed matrix. We note that taking the union of subsets in the partition increases the norm.

###### Theorem 3.2.

Let be a polynomial of variables of total degrees smaller or equal to . For any integer , let denote the th derivative of . Let be a dimensional sub-Gaussian vector so that . Let a positive constant. Then for any

 P(|f(X)−E[f(X)]|>η)≤2exp(−1CDγf(η)),

with

 γf(η)=min1≤d≤DminJ∈Pd{ηLd∥E[\boldmathDdf(X)]∥J}2|J|.

The -mixing condition allows us to extend this concentration inequality for dependent variables.

###### Lemma 3.3.

Let be a strictly stationary and geometrically -mixing process, centered with and sub-Gaussian so that . Let a polynomial function of degree 4 such that , with . Let a strictly positive block length and . Then for any , we have the concentration bound

 P(|1Tf(ξ–)−E[f(ξ–)]|>η)≤6exp(−μTγf(η))+2c(μT−1)exp(−kβsT),

with

 γf(x)=x2∥\boldmathC∥2F∧minJ∈P4{∧k=2,3,4(x2∥% \boldmathC∥2J)1/k}.
###### Proof.

To carry out the proof, we use the independent block sequence technique to derive the concentration inequality for dependent variables, developed e.g. in the proof of Theorem 3.1 in Yu (1994). To do so, we construct an independent block sequence based on the original dependent sequence so that this new sequence is close in terms of distribution to the mixing sequence. And by Corollary 2.7 of Yu (1994), the expected value of a random variable defined over the dependent blocks is close to the one based on these independent blocks under the -mixing assumption. The idea is to transfer the dependent problem to the independent block and use standard concentration techniques of the independent case. Following Yu (1994), we divide the stationary process into independent blocks of equal size , where and the remainder block is of size . The partition has blocks so that and a residual block. We denote the indices in these independent blocks by (resp. ) for the odd (resp. even) blocks and the indices in the remainder part. Generally, for , we have

 Hj={i:2(j−1)sT+1≤i≤(2j−1)sT},Nj={i:(2j−1)sT+1≤i≤(2j)sT},Re={2μTsT+1,⋯,T},

so that and . We thus denote the random variable set (resp. , resp. ) the collection of random vectors belonging in the odd blocks (resp. even blocks, resp. remainder block). We take a sequence of identically distributed independent blocks so that is independent of and each block has the same distribution as a block from the original sequence . We do the same for the even and remainder blocks, and denote these sequences as and . We finally denote (resp. ) the union of odd (resp. even) block indices. From this block construction, we have

 P(|1Tf(ξ–)−E[f(ξ–)]|>η)≤P(2T|μT∑j=1∑i∈Hjf(ξi)−μT∑i∈HjE[f(ξi)]|>η) + P(2T|μT∑j=1∑i∈Njf(ξi)−μT∑i∈NjE[f(ξi)]|>η)+P(1T|∑i∈Ref(ξi)−|Re|∑i∈ReE[f(ξi)]|>η).

We first consider a collection of random variables in the even block. Let and a positive constant, we have

 P(2T|f(ξN)−E[f(ξN)]|>η)=E[12T|f(ξN)−E[f(ξN)]|>η] ≤ E[12T|f(¯ξN)−E[f(¯ξN)]|>η]+c(μT−1)β(sT) ≤ P(2T|f(¯ξN)−E[f(¯ξN)]|>η)+c(μT−1)β(sT) ≤ P(∀j=1,⋯,μT:1sT|f(¯ξNj)−E[f(¯ξNj)]|>η)+c(μT−1)β(sT) ≤ 2exp(−μTγf(ηsT))+c(μT−1)β(sT),

where we used Lemma 4.1 of Yu (1994) to relate and using the mixing condition to obtain the first inequality and Theorem 1.4 of Adamczak and Wolff (2015) to obtain the last inequality. More precisely, corresponds to a linear combination of sub-Gaussian polynomials of degree 4 of the blocks of the sub-Gaussian variable with respect to . Since is the sum of an homogeneous degree 4 polynomial, we only have to consider the derivatives of order 4 in (3.2) since the terms of order 1, 2 and 3 are null. For each of the even block, using the four indexed matrix , the functional form is

 ∀Nj,f(¯ξNj)=sT∑i1,⋯,i4=1p∑k1,⋯,k4=1\boldmathC(i1,k1),(i2,k2),(i3,k3),(i4,k4)¯ξi1,k1¯ξi2,k2¯ξi3,k3¯ξi4,k4,

with the real symmetric matrix . The expectation of is obtained by a symmetrization of . For any index

 E[\boldmathD4f(¯ξNj)]j1,j2,j3,j4=∑π\boldmathCjπ(1),⋯,jπ(4),

where the sum runs over all permutations of the index set . We shall obtain an upper bound on to upper bound . For any partition , we hence need to upper bound . Starting with , we have . If we consider a partition that has a cardinality larger than , we observed in the previous subsection . Moreover, the sub-Gaussian property implies that . Hence

 minJ∈P4{ηsTsd/2TLd∥E[\boldmathD4f(¯ξNj)]∥J}2|J|≥(ηsT)2s2TL4∥\boldmathC∥2F∧{∧k=2,3,4((ηsT)2s2TL2∥\boldmathC∥2J)1/k}.

We thus obtain

 γf(η)≥η2L4∥\boldmathC∥2F∧minJ∈P4{∧k=2,3,4(η2L2∥\boldmathC∥2J)1/k}.

Furthermore, we have . This implies

 P(|2Tf(ξN)−μTE[f(ξN)]|>η)≤2exp(−μTγf(η)+c(μT−1)exp(−kβsT).

The same reasoning holds for the collection in the odd block. Taking the union of the collection of odd and even blocks, we obtain

 P(1μT|μT∑i,j=1f(ξHi∪Nj)−μTμT∑i,j=1E[f(ξH∪N)]|>η)≤4exp(−μTγf(η))+2c(μT−1)exp(−kβsT).

Let , we obtain

 P(1μT|μT∑i,j=1f(ξHi∪Nj)−μTμT∑i,j=1E[f(ξH∪N)]|>η)≤4exp(−μTγf(η))+2c(μT−1)exp(−kβsT).

The size of the remainder block is . Using Theorem 1.4 of Adamczak and Wolff (2015), we have

 P(1T|f(ξRe)−|Re|E[f(ξRe)]|>η)≤2exp(−μTγf(η)).

We thus obtain by taking the union on all blocks

 P(1T|f(ξ–)−E[f(ξ–)]|>η)≤6exp(−μTγf(η))+2c(μT−1)exp(−kβsT),

with

 γf(η)=η2L4∥\boldmathC∥2F∧minJ∈P4{∧k=2,3,4(η2L2∥\boldmathC∥2J)1/k}.

• This concentration inequality holds for strictly stationary and -mixing sub-Gaussian processes. In this paper, we would also need a concentration inequality for second degree sub-Gaussian polynomials for -mixing and strictly stationary processes. It is given as

 missingP

(— 1T g(ξ) - missingE[g(ξ)]— ¿ η) ≤6 exp(-μ_T γ_g(η)) + 2 c (μ_T - 1) exp(-k_β s_T),

where is a polynomial function of degree 2 such that , with , and The proof follows the exact same steps as in the proof of Lemma 3.3.

By assumption 3, for any , conditionally on , the random vector is actually sub-Gaussian. Indeed, the process is -measurable. Then for with each any compact set, we have , where . Thus dep