High dimensional VAR with low rank transition

Pierre Alquier*, Karine Bertin, Paul Doukhan†, Rémy Garnier

*This author gratefully acknowledges financial support from the research programme New Challenges for New Data from LCL and GENES, hosted by the Fondation du Risque, and from Labex ECODEC (ANR-11-LABEX-0047).
†The work of the second and the third authors has been developed within the MME-DII center of excellence (ANR-11-LABEX-0023-01) and with the help of PAI-CONICYT MEC N 80170072. The authors have been supported by Fondecyt project 1171335 and Mathamsud 18-MATH-07.

(1) CREST, ENSAE
(2) CIMFAV, Universidad de Valparaiso
(3) Laboratoire AGM UMR 8088, Université Paris Seine & CIMFAV, Universidad de Valparaiso
(4) CDiscount & Laboratoire AGM UMR 8088, Université Paris Seine
Abstract

We propose a vector auto-regressive (VAR) model with a low-rank constraint on the transition matrix. This new model is well suited to predicting high-dimensional series that are highly correlated, or that are driven by a small number of hidden factors. We study estimation, prediction, and rank selection for this model in a very general setting. Our method shows excellent performance on a wide variety of simulated datasets. On the macro-economic data from Giannone et al. (2015), our method is competitive with state-of-the-art methods in small dimension, and even improves on them in high dimension.

1 Introduction

Machine learning has become omnipresent in time series forecasting. With the growing ability to collect and store data, one often has to predict series whose dimension ranges from a few hundred (economic time series, Jochmann et al. (2013)) to millions (as may be the case for sales forecasts over a huge set of items, see for example Kumar and Patel (2010)). The application fields include health data analysis Lipton et al. (2016), electricity consumption forecasting Gaillard et al. (2016); De Castro et al. (2017), economics and GDP forecasting Cornec (2014), traffic forecasting Lippi et al. (2013) and public transport attendance (Chapter 5 in Carel (2019)), social media analysis Saha and Sindhwani (2012), ecology Purser et al. (2009), sensor data analysis Basu and Meckesheimer (2007)… In many of these applications, accurate predictions are of the utmost importance. For online retail, a bad prediction of the sales before Christmas might lead to inadequate storage and huge financial losses. Poor electricity forecasts might lead to power outages. When it comes to health data, a good prognosis may even be vital…

In many cases, algorithms that were primarily designed for i.i.d. observations are used on time series with good results. For example, the additive model of Buja et al. (1989) is used in Carel (2019) to predict bus line attendance, with promising results. While generalization bounds were first developed for i.i.d. data (see e.g. Vapnik (1998) for an overview), some authors actually proved similar results when the same algorithms are applied to time series. We refer the reader to the pioneering work of Meir (2000) and to Steinwart and various co-authors Steinwart and Christmann (2009); Steinwart et al. (2009); Hang and Steinwart (2014) for an early overview. Model selection tools are investigated in Alquier and Wintenberger (2012); McDonald et al. (2017), while aggregation methods are studied via PAC-Bayesian bounds in Alquier et al. (2013); Shalizi and Kontorovich (2013); London et al. (2014); Alquier and Guedj (2018). While these results hold for stationary series, some authors also tackled non-stationary series. De-trending techniques are considered in Chen and Wu (2018), while Kuznetsov and Mohri (2015); Alquier et al. (2018) directly prove generalization bounds for non-stationary series. An original approach was also developed in Giraud et al. (2015) for time-varying AR, relying on sequential prediction methods Cesa-Bianchi and Lugosi (2006).

However, we believe that there is a need to develop machine learning methods that aim at capturing stylized facts in time series, in the same spirit as stochastic volatility models were developed to capture stylized facts in finance Engle (1995); Francq and Zakoian (2019). In this paper, we propose a vector auto-regressive (VAR) model that is suitable for predicting high-dimensional series that are strongly correlated, or that are driven by a reasonable number of hidden factors. These features are captured by imposing a low-rank constraint on the transition matrix. The coefficients can be efficiently computed by convex optimization techniques.

Let us briefly describe the motivation for this model. Assume we deal with an $\mathbb{R}^d$-valued process $(X_t)_{t \in \mathbb{Z}}$ with

$X_t = A X_{t-1} + \varepsilon_t$    (1)

for a very large dimension $d$. For example, think of daily sales of items on Amazon, where an item might be iPhone 7 128Go black, Lord of the Rings - Harper Collins Box Set, 1991 or Estimation of Dependences Based on Empirical Data: Second Edition, by Vladimir Vapnik, Springer. Then it is obvious that, even with a few years of observations, we observe the series for at most a few thousand days, while the dimension $d$ is probably far larger, of the order of hundreds of thousands or even millions of items. Thus the unconstrained estimation of the coefficients of the matrix $A$ is impossible. Some constraints are necessary to reduce the dimension of the problem. We believe that the sparsity of $A$, studied by Davis et al. (2016) in another context, does not make sense here. On the other hand, it is clear that a few factors, like the current economic conditions or the period of the year, have a strong influence on the series. Assuming these $k$ factors are linear functions of $X_{t-1}$, we can write them as $W X_{t-1}$, where $W$ is a $k \times d$ matrix with $k \ll d$. Then, assuming that $X_t$ can be linearly predicted by these factors, we can predict $X_t$ by $B W X_{t-1}$ for some $d \times k$ matrix $B$. At the end of the day, we indeed predict $X_t$ by $A X_{t-1}$ with $A = BW$, but the rank of $A$ is at most $k \ll d$.
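To fix ideas, here is a minimal NumPy sketch of this dimension-reduction argument; the sizes and the matrices $W$ and $B$ below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 200, 5                              # many series, few hidden factors

# The k factors are linear functions of X_{t-1} (rows of W), and the
# prediction of X_t is a linear function of these factors (columns of B).
W = rng.normal(size=(k, d)) / np.sqrt(d)   # illustrative factor loadings
B = rng.normal(size=(d, k)) / np.sqrt(k)

A = B @ W                                  # transition matrix, rank at most k
assert np.linalg.matrix_rank(A) <= k

x_prev = rng.normal(size=d)
factors = W @ x_prev                       # k numbers summarize the past
x_pred = B @ factors                       # same prediction as A @ x_prev,
                                           # but with only 2*d*k parameters
print(np.allclose(x_pred, A @ x_prev))     # True
```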

Note that the assumption that the coefficient matrix is low-rank in a multivariate regression model was studied in econometric theory as early as the 1950's Anderson (1951); Izenman (1975). It was referred to as RRR (reduced-rank regression). We refer the reader to Koltchinskii et al. (2011); Suzuki (2015); Alquier et al. (2017); Klopp et al. (2017a, b); Moridomi et al. (2018) for state-of-the-art results. Low-rank matrices were actually used to model high-dimensional time series by De Castro et al. (2017) and Alquier and Marie (2019); however, the models described in these papers cannot be straightforwardly used for prediction purposes. Here, we study estimation and prediction for the model (1).

The paper is organized as follows. At the end of the introduction, we provide notations that will be used throughout the paper. We describe our low-rank contractive VAR(1) model in Section 2; contraction yields the existence of a stationary solution as well as standard weak dependence properties. The estimation is developed in Section 3: we first tackle the case where the rank is known, and we then prove an oracle inequality for rank selection based on penalization. Simulations comparing full-rank estimators, oracle estimators and low-rank estimators are given in Section 4. The comparison is also carried out on the macro-economic dataset from Giannone et al. (2015) in Section 5. Finally, the proofs of the results of Section 3 are given in Section 7.

Notations

We introduce here notations that are used in the whole paper. For a vector , will denote the Euclidean norm of . We will denote by the singular values of an matrix , where . Note that the singular value decomposition (SVD) of can be written

(2)

for some semi-unitary matrices and : , where is the identity matrix. This means that, by the implicit function theorem, the set of such matrices is a subspace of a Riemannian manifold with dimension .
For we let denote the Schatten--norm of defined by

if , and . For and , we define:

2 Vector auto-regressive process with low-rank constraint

We consider the $\mathbb{R}^d$-valued VAR process $(X_t)_{t \in \mathbb{Z}}$ defined from (1), where the innovations $\varepsilon_t$ are i.i.d. and centered random variables and where the transition matrix $A$ satisfies the following assumption:

Assumption 1.

We assume that $\mathrm{rank}(A) \le k$ and $\|A\|_{\mathrm{op}} \le 1 - \delta$ for some integer $k \le d$ and some $\delta \in (0,1)$, where $\|A\|_{\mathrm{op}}$ denotes the operator norm (largest singular value) of $A$.

For now, the rank assumption is not restrictive, as the largest possible value $k = d$ is not forbidden. However, we will see that a smaller $k$ leads to better rates of convergence in the estimation.

The assumption on the norm of $A$ is more restrictive, but it is important in what follows. Indeed, it implies the following contraction property:

which yields, whenever the noise admits a moment of some order, the corresponding moment for the stationary solution. This implies a weak dependence property (with geometric decay) and the ergodicity of this model Dedecker et al. (2007). The assumption also implies that the matrix $I_d - A$ is invertible and that its inverse can be written as the Neumann series $\sum_{j \ge 0} A^j$.

The stationary distribution of this model is the distribution of the series $\sum_{j \ge 0} A^j \varepsilon_{-j}$. From now on, we will assume that $X_0$ is drawn from this distribution, that is, the process is stationary. In the special case of Gaussian noise, the stationary distribution is itself Gaussian.
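For concreteness, here is the standard derivation of this stationary solution, written with the illustrative notation $X_t = A X_{t-1} + \varepsilon_t$ of (1); it is only a sketch, and the norms and constants below are the usual ones rather than those of the original statements.

```latex
% Iterating the recursion m times:
X_t = A^m X_{t-m} + \sum_{j=0}^{m-1} A^j \varepsilon_{t-j},
\qquad
\|A^m X_{t-m}\| \le \|A\|_{\mathrm{op}}^m \, \|X_{t-m}\|
\xrightarrow[m\to\infty]{\mathbb{P}} 0
\quad \text{since } \|A\|_{\mathrm{op}} < 1 .
% Hence the stationary solution is the (almost surely convergent) series
X_t = \sum_{j=0}^{\infty} A^j \varepsilon_{t-j},
\qquad \text{and, if } \varepsilon_t \sim \mathcal{N}(0,\Sigma), \quad
X_t \sim \mathcal{N}\Big(0,\ \sum_{j=0}^{\infty} A^j \Sigma (A^j)^{\top}\Big).
```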

We also introduce the following assumption.

Assumption 2.

There exist some positive constants and such that the noise process satisfies for all

This assumption is for example satisfied for a Gaussian noise , but it also holds in any case where the components of are independent and follow any sub-Gaussian distribution, such as a bounded distribution. Assumption 2 is not required for the process to be well defined. However, we will use it to derive our theoretical results on estimation and prediction.

3 Estimation, prediction and rank selection

Following the approach described in Vapnik (1998), we measure the quality of an estimator by its generalization ability, that is, by its out-of-sample prediction performance. The quality of a prediction will be assessed by a loss function $\ell$: $\ell(\hat{y}, y)$ stands for the cost of having predicted $\hat{y}$ when the truth appears to be $y$.

Assumption 3.

The loss function is -Lipschitz with respect to the Euclidean norm for some . That is, for all , .

We impose this condition for the sake of simplicity, but note that it is not restrictive as we can always rescale the loss. It includes, for example, the loss given by the Euclidean norm or by the max norm. However, the squared Euclidean norm only satisfies Assumption 3 when the process and the predictions are bounded.
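As a small illustration (our own sketch, not code from the paper), the first two losses below satisfy the Lipschitz requirement, while the third one does so only on bounded sets.

```python
import numpy as np

def euclidean_loss(y_pred, y):
    # ||y_pred - y||_2 : 1-Lipschitz in y_pred for the Euclidean norm
    return np.linalg.norm(y_pred - y)

def max_loss(y_pred, y):
    # ||y_pred - y||_inf : also 1-Lipschitz for the Euclidean norm
    return np.max(np.abs(y_pred - y))

def squared_loss(y_pred, y):
    # ||y_pred - y||_2^2 : its gradient 2*(y_pred - y) is unbounded, so it is
    # Lipschitz only when the process and the predictions are bounded
    return np.sum((y_pred - y) ** 2)
```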

We are now in position to define the generalization error.

Definition 3.1.

The generalization error of an matrix is given by

Note that the generalization error does not depend on the time index, as the process is stationary. Given a sample, we can obviously estimate this error by its empirical counterpart, and define an estimator based on empirical risk minimization.

Definition 3.2.

The empirical error of an matrix is given by

The rank empirical risk minimizer, , is defined, for any , by

(3)

Note that in Assumption 3, we only required the loss to be Lipschitz. However, in practice, loss functions that are also convex will be preferred in order to ensure the feasibility of the minimization of the empirical risk.

Remark 3.1.

Algorithms are known to compute such a low-rank factorization in practice. The most popular method is to write $A = BW$ for matrices $B$ and $W$ of respective sizes $d \times r$ and $r \times d$, and to optimize with respect to $B$ and $W$ (the constraint on the norm of $A$ can be ensured, for example, by imposing suitable constraints on $B$ and $W$). The popular ADMM method boils down to alternate minimization with respect to $B$ and $W$, see Section 9 in Boyd et al. (2011). This strategy works well in practice, despite its lack of theoretical support in this situation: even when the loss is convex, the minimization problem is usually non-convex with respect to the pair $(B, W)$. Still, recent work Ge et al. (2017) shows that this non-convexity is not “severe”, in the sense that it is still possible to find a global minimum in a reasonable time.
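To make this remark concrete, here is a minimal alternating least squares sketch for the squared loss. The initialization, the number of iterations and the variable names are our own choices, not the authors' implementation, and no convergence guarantee is claimed.

```python
import numpy as np

def low_rank_var_als(X, k, n_iter=200, seed=0):
    """Fit X[t] ~ (B @ W) @ X[t-1], with B of size (d, k) and W of size (k, d),
    by alternating least squares on the squared prediction error."""
    P, Y = X[:-1], X[1:]                      # past / future, shape (T-1, d)
    d = X.shape[1]
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(k, d)) / np.sqrt(d)  # random starting point
    for _ in range(n_iter):
        # W fixed: the factor scores P @ W.T are the regressors, B is plain OLS.
        F = P @ W.T                                           # (T-1, k)
        B = np.linalg.lstsq(F, Y, rcond=None)[0].T            # (d, k)
        # B fixed: minimize ||Y - P @ W.T @ B.T||_F^2 in W via normal equations.
        G = np.linalg.lstsq(P, Y @ B, rcond=None)[0]          # (P'P)^+ P'Y B, (d, k)
        W = (G @ np.linalg.pinv(B.T @ B)).T                   # (k, d)
    return B @ W                                              # estimate of rank <= k
```

Each half-step is an ordinary least squares problem, which is one reason why this non-convex problem is easy to attack in practice.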

The following condition is purely technical and will only be used to make precise the statement “for a large enough sample size”.

Assumption 4.

We have where and .

Theorem 3.1.

Let Assumptions 1, 2, 3 and 4 hold. Then for any , with probability larger than , we have

(4)
Corollary 3.2.

Let Assumptions 1, 2, 3 and 4 hold. For any fixed , we have with probability larger than

where denotes a positive constant, only depending on , , and .

Corollary 3.3.

Let Assumptions 1, 2, 3 and 4 hold. For , we have for some suitable positive constant depending on , and that

with probability larger than .

Remark 3.2.

[Variations on low-rank VAR(1)] Instead of considering low-rank matrices, an alternative is to consider a low-rank perturbation around another matrix. As an example, the diagonal-plus-low-rank model consists in writing the transition matrix as the sum of a diagonal matrix and a low-rank matrix. The diagonal part is then estimated as in the so-called AR case, with decoupled (independent) one-dimensional AR(1) models. The parameter set is now a subset of a Riemannian manifold with dimension . Recall that, in order for the diagonal part to be estimable, the AR(1) coordinate models must be stable, thus their coefficients must belong to $(-1,1)$. Indeed and imply with rank that .

Coming back to the low-rank model, note that if the true rank of $A$ is known, then choosing this rank in the results above leads to

which means that the estimator predicts asymptotically as well as the true matrix $A$. However, the true rank is of course usually unknown. For this reason, we now describe a model selection procedure for the rank.

Definition 3.3.

Let Assumption 4 hold for at least one value . Define as the largest integer such that . Put where

In practice, in order to compute this penalized estimator, we would need to know the constants involved in the penalty, which is still not realistic. However, for such model selection by penalized risk minimization, it is known anyway that the theoretical constants in front of the penalty are usually too large. The slope heuristic leads to better results in practice Baudry et al. (2012).
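For concreteness, here is one common way to implement the slope heuristic; this generic sketch (regressing the empirical risk of the most complex models on a penalty shape of our choosing) follows the spirit of Baudry et al. (2012) but is not necessarily the exact procedure used in the experiments below.

```python
import numpy as np

def slope_heuristic_rank(emp_risk, pen_shape, n_top=None):
    """emp_risk[r-1] and pen_shape[r-1] for candidate ranks r = 1..R.
    The slope of the empirical risk of the most complex models against the
    penalty shape estimates the unknown constant kappa; the selected rank
    minimizes emp_risk + 2 * kappa * pen_shape."""
    emp_risk = np.asarray(emp_risk, dtype=float)
    pen_shape = np.asarray(pen_shape, dtype=float)
    if n_top is None:
        n_top = max(3, len(emp_risk) // 2)     # use the most complex half
    slope = np.polyfit(pen_shape[-n_top:], emp_risk[-n_top:], 1)[0]
    kappa = max(-slope, 0.0)                   # the fitted slope should be negative
    return int(np.argmin(emp_risk + 2.0 * kappa * pen_shape)) + 1
```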

Theorem 3.4.

Let Assumptions 1, 2, 3 and 4 hold. Let . With probability at least , we have

A more precise analysis of the involved factors yields the following corollaries.

Corollary 3.5.

Under Assumptions 1, 2, 3 and 4, for any fixed , with probability larger than

where and are positive constants depending on , , and .

Corollary 3.6.

Let Assumptions 1, 2, 3 and 4 hold. For , we have, for some suitable positive constant depending on , and , that

with probability larger than .

When is large enough so that , we obtain

so the estimator is adaptive: it does not depend on .

4 Simulations

4.1 Simulation design

In order to evaluate the performance of the low-rank reconstruction, we run simulations for different values of the parameters. We set the dimension so as to keep the computation time reasonable. The simulations are done in Python and are available online (Garnier (2019)).

More precisely, for fixed,

  • we generate a matrix of rank $k$, as in (2), by setting

    where $U$ and $V$ are semi-unitary matrices and the middle factor is a diagonal matrix whose diagonal coordinates follow a Beta distribution. The matrices $U$ and $V$ are obtained by orthonormalizing two matrices whose coefficients are i.i.d. uniform. We consider in the following three possible values for the first Beta parameter. When this parameter equals 1, the distribution of the singular values of the matrix is uniform on $[0,1]$; when it is smaller, the singular values tend to be smaller and the matrix is close to a matrix of lower rank. (A Python sketch of this construction is given at the end of this subsection.)

  • we generate a sample of a given length from an $\mathbb{R}^d$-valued VAR(1) process as in (1),

    where the innovations are centered truncated Gaussian variables whose i.i.d. coordinates have a bounded support.

For each triplet , such a data set is simulated times.
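The following Python sketch reproduces this simulation design up to the unrecoverable constants: the Beta parameters, the noise variance and the truncation level are passed as illustrative arguments, and the dimensions used in the last two lines are ours.

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(0)

def random_low_rank_transition(d, k, beta_a=1.0, beta_b=1.0):
    """A = U diag(s) V.T with U, V semi-unitary (d x k), obtained by QR of
    uniform matrices, and singular values s drawn from a Beta distribution."""
    U, _ = np.linalg.qr(rng.uniform(-1.0, 1.0, size=(d, k)))
    V, _ = np.linalg.qr(rng.uniform(-1.0, 1.0, size=(d, k)))
    s = rng.beta(beta_a, beta_b, size=k)       # singular values in (0, 1)
    return U @ np.diag(s) @ V.T

def simulate_var(A, T, noise_scale=1.0, trunc=3.0, burn_in=200):
    """Simulate X[t] = A @ X[t-1] + eps[t] with truncated Gaussian noise."""
    d = A.shape[0]
    X = np.zeros((T + burn_in, d))
    for t in range(1, T + burn_in):
        eps = truncnorm.rvs(-trunc, trunc, scale=noise_scale, size=d)
        X[t] = A @ X[t - 1] + eps
    return X[burn_in:]                         # drop burn-in: ~stationary sample

A = random_low_rank_transition(d=50, k=5, beta_a=1.0)   # beta_a = 1: uniform singular values
X = simulate_var(A, T=1000)
```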

4.2 Estimators and quality criteria

Our estimation procedures use the quadratic loss, that is

where we denote by the Euclidean norm on .

For each simulation, we compute three estimators:

  • The standard full-rank VAR(1) estimator for the quadratic loss, i.e.:

  • The oracle VAR(1), where we suppose that we know the underlying rank :

  • The rank-penalized estimator, where the constant in front of the penalty is calibrated using the slope heuristic Baudry et al. (2012):

The use of the quadratic loss allows one to obtain exact expressions for some of the minimizations, which speeds up the computations.
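The following sketch shows how the three estimators can be computed with such exact expressions: the rank-constrained step is the classical reduced-rank regression solution of the rank-constrained least squares problem, while the penalty shape used for the slope heuristic is a plausible choice of ours rather than the exact one used in the experiments.

```python
import numpy as np

def var_estimators(X, k_oracle):
    """Full-rank OLS, oracle reduced-rank, and slope-heuristic-penalized
    VAR(1) estimators under the quadratic loss."""
    P, Y = X[:-1], X[1:]                              # regress X[t] on X[t-1]
    T, d = P.shape
    C_ols = np.linalg.lstsq(P, Y, rcond=None)[0]      # Y ~ P @ C_ols, shape (d, d)
    A_full = C_ols.T

    # Exact rank-r minimizer for the quadratic loss (reduced-rank regression):
    # project the OLS fit onto its top-r right singular directions.
    _, _, Vt = np.linalg.svd(P @ C_ols, full_matrices=False)

    def rank_r(r):
        Pi = Vt[:r].T @ Vt[:r]                        # projector, shape (d, d)
        return (C_ols @ Pi).T                         # transition matrix of rank <= r

    A_oracle = rank_r(k_oracle)

    # Rank selection by the slope heuristic (same idea as the sketch in Section 3).
    risks = np.array([np.mean(np.sum((Y - P @ rank_r(r).T) ** 2, axis=1))
                      for r in range(1, d + 1)])
    pen_shape = np.arange(1, d + 1) * 2.0 * d / T     # ~ number of parameters / T
    half = d // 2                                     # most complex half of the models
    kappa = max(-np.polyfit(pen_shape[half:], risks[half:], 1)[0], 0.0)
    r_hat = int(np.argmin(risks + 2.0 * kappa * pen_shape)) + 1
    return A_full, A_oracle, rank_r(r_hat)
```

The excess risk of each estimator can then be approximated by comparing its average loss to that of the true matrix on an independent sample simulated with the same transition matrix.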
To assess the quality of an estimator , we calculate the excess risk computed on a new sample generated with , that is

where we set, for a matrix :

4.3 Results

We perform simulations with different values for the triplets . The results are given in the following tables and figures. Tables 1 and 2 contain the mean of the excess risk over the simulations for the three estimators and different values of , and . Figures 1 to 3 show the dispersion of the excess risk over the simulations for and .

5 7 10 15 20 25 50 75 100
full rank 3.23 3.27 3.28 3.23 3.26 3.23 3.25 3.23 3.25
oracle 0.57 0.72 1.01 1.39 1.73 2.00 2.90 3.19 3.25
penalized 0.38 0.55 0.84 1.25 1.50 1.54 2.52 3.27 3.76
full rank 3.25 3.21 3.24 3.24 3.24 3.25 3.25 3.28 3.22
oracle 0.59 0.78 1.04 1.44 1.76 2.05 2.92 3.24 3.22
penalized 0.24 0.31 0.45 0.80 1.02 1.33 1.94 2.63 3.13
full rank 3.25 3.25 3.23 3.24 3.22 3.24 3.28 3.28 3.26
oracle 0.60 0.80 1.07 1.46 1.78 2.07 2.97 3.24 3.26
penalized 0.15 0.16 0.17 0.20 0.26 0.31 0.57 0.91 1.26
Table 1: Mean excess risk for and .
5 7 10 15 20 25 50 75 100
full rank 8.39 8.32 8.34 8.35 8.35 8.35 8.28 8.37 8.36
oracle 1.77 2.25 3.06 4.04 4.91 5.59 7.57 8.29 8.36
penalized 6.62 6.65 6.71 6.81 6.92 7.00 7.36 7.78 8.11
full rank 3.25 3.21 3.24 3.24 3.24 3.25 3.25 3.28 3.22
oracle 0.59 0.78 1.04 1.44 1.76 2.05 2.92 3.24 3.22
penalized 0.24 0.31 0.45 0.80 1.02 1.33 1.94 2.63 3.13
full rank 1.74 1.74 1.75 1.75 1.74 1.74 1.75 1.75 1.75
oracle 0.29 0.39 0.53 0.72 0.90 1.05 1.54 1.72 1.75
penalized 0.18 0.30 0.48 0.80 1.09 1.37 1.58 1.70 2.05
Table 2: Mean excess risk for and
Figure 1: Mean excess risk for and .
Figure 2: Mean excess risk for and .
Figure 3: Mean excess risk for and .

We remark that the behavior of the full-rank estimator does not seem to be affected by the underlying rank or by the distribution of the singular values. As expected, the risks of the three estimators decrease as the number of observations increases (see Table 2). Moreover, the dispersions of the excess risks (see Figures 1, 2 and 3) are comparable for the three estimators.
Besides, the oracle and the full-rank estimators have similar performance when the rank is high, whereas the oracle estimator outperforms the full-rank one when the rank is smaller, provided we have enough observations.
The penalized estimator performs well. In the setting of Table 1, it outperforms the two other estimators, and its performance is even better when the Beta parameter is smaller. In fact, in this case, the singular values of the matrix tend to be closer to zero and the matrix is close to a matrix of smaller rank; the penalized estimator then tends to under-estimate the rank and obtains a better performance than the oracle one.

5 Application to a macro-economic data-set

We want to evaluate the performance of our estimators on real-world datasets. We use the macro-economic data used in Giannone et al. (2015), which consist of various US economic indicators. We used three different datasets:

  • A small dataset containing quarterly time series between 1959 and 2008.

  • A medium dataset containing quarterly time series between 1959 and 2008.

  • A large dataset containing quarterly time series between 1959 and 2006.

The datasets are described in more detail in Giannone et al. (2015). We were not able to obtain the complete datasets; hence our results are not directly comparable with those of Giannone et al. (2015).
In order to test the models, we estimate them using the data between 1959 and 2001, and perform predictions on the period 2001-2006 for the GDP (gross domestic product) and the GDP deflator (also called the implicit GDP deflator, a measure of the level of prices of all new, domestically produced, final goods and services in an economy in a year). We evaluate these predictions using the Mean Squared Error (MSE).
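A minimal sketch of this evaluation protocol is given below; the file name, the column name and the exact split date are placeholders, since we do not have the exact layout of the data, and var_estimators refers to the sketch of Section 4.2.

```python
import numpy as np
import pandas as pd

# Placeholder file: quarterly series indexed by date, one column per indicator.
df = pd.read_csv("macro_quarterly.csv", index_col=0, parse_dates=True)

train = df.loc[:"2000"].to_numpy()               # estimation period
test = df.loc["2001":].to_numpy()                # evaluation period

# Penalized low-rank VAR(1) fitted on the estimation period.
A_hat = var_estimators(train, k_oracle=3)[2]

# One-step-ahead forecasts over the evaluation period.
history = np.vstack([train[-1:], test])          # last training point, then test points
preds = history[:-1] @ A_hat.T                   # predict each test point from its predecessor

gdp = df.columns.get_loc("GDP")                  # placeholder column name
mse = np.mean((preds[:, gdp] - test[:, gdp]) ** 2)
print("GDP one-step-ahead MSE:", mse)
```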

Method Small Medium Large
Constant Trend 4.72 4.72 4.72
Independent AR(1) 4.56 4.56 4.56
VAR(1) 3.58 2.70 2.90
penalized VAR(1) 3.57 2.78 2.84
Diagonal + low-rank 3.87 2.58 2.83
Table 3: MSE of the GDP forecast.
Method Small Medium Large
Constant Trend 1.23 1.23 1.23
Independent AR(1) 1.19 1.19 1.19
VAR(1) 1.18 1.08 1.11
penalized VAR(1) 1.18 1.08 1.09
Diagonal + low-rank 1.17 1.10 1.15
Table 4: MSE of the GDP deflator forecast.

We compare the performance of five predictors.

  • First, we use a naive predictor which assumes a constant trend at each date.

  • The second prediction is obtained by fitting independent AR(1) models to each series.

  • For the third, we use a full-rank VAR(1) estimator.

  • The fourth predictor is based on our penalized VAR(1) estimator.

  • The fifth predictor is the diagonal-plus-low-rank model introduced in Remark 3.2.

We sum up the results in Tables 3 and 4. The penalized VAR estimator obtains better results than the naive or independent AR(1) predictors on all the datasets. The full-rank VAR estimator remains competitive with it on the small and medium datasets. However, when the number of series increases, the performance of the penalized VAR is less affected than that of the full-rank VAR estimator. We suggest that it performs a form of feature selection, which reduces over-fitting.

The diagonal-plus-low-rank model gives the best GDP forecasts on the medium and large datasets, but the VAR models are better for the GDP deflator forecasts. This suggests that the low-rank hypothesis may be useful in contexts other than pure VAR(1) regression.

6 Conclusion

We considered a rank-constrained VAR model to predict high-dimensional time series. We proposed a theoretical study of our estimator and proved that it leads to consistent predictions. However, many questions remain open. Faster rates should be possible for strongly convex losses, in the spirit of Alquier et al. (2013). It would also be necessary to study the performance of the slope heuristic from a theoretical point of view.
The properties of the diagonal-plus-low-rank models introduced in Remark 3.2 are illustrated in Section 5; the use of such models deserves special attention. Finally, other heteroskedastic models may be investigated, e.g.:

  • (V)ARCH models may be rewritten from a more elementary rewriting of Bauwens et al. (2006)

for some symmetric positive-definite matrix-valued quadratic form. Recall that the non-negative square root of a positive definite matrix exists and is unique. A simple example of this situation is a diagonal quadratic form

with non-negative matrices. Poignard (2019) introduces analogous models with sparsity considerations instead of the rank restriction we suggest. The rank condition then translates into the fact that all the matrices admit a common decomposition (2) with respect to the rank.

  • (V)INARCH models

for a process defined by i.i.d. arrays of Poisson processes and some matrix with non-negative coefficients. The low-rank model is obtained here by imposing a low-rank constraint on this matrix;

  • (V)INAR models (defined through thinning operators)

for some matrix with non-negative coefficients, where $\circ$ stands for the thinning operator. The low-rank model is again obtained by imposing a low-rank constraint on this matrix.

More examples could be inspired by the models detailed in Doukhan (2018): one may think of multiple Volterra processes, bilinear models, switching VAR models, etc. The heteroskedasticity of these models will require new techniques. Indeed, it will be clear in the proof section below that our results rely on a Bernstein inequality proven in Dedecker and Fan (2015), which cannot be used for (V)ARCH models. New concentration tools, such as those developed in Dedecker et al. (2019), will be used in forthcoming works on suitable datasets.

The simulations and the real-data study demonstrate that such low-rank VAR models are meaningful. Nevertheless, additional features such as heteroskedasticity need to be taken into consideration. Hence, as stressed in the introduction, we believe that there is an urgent need for more models involving low-rank assumptions for high-dimensional time series.

7 Proofs

7.1 Preliminaries

Let us start with a few additional notations.

Definition 7.1.

When Assumption 2 is satisfied, we put

where and and where and are the constants defined in Assumption 2.

Note that

where and , and and are as in Section 3 above.

In Alquier et al. (2018) the authors propose to use an exponential inequality to control the deviations between the generalization error and the empirical risk in the case of a non-stationary Markov chain. We follow the same approach here. The version of Bernstein inequality from Dedecker and Fan (2015) gives directly the following result.

Corollary 7.1.

Under Assumptions 1 and 2 we have, for any matrix ,

and .

Proof of Corollary 7.1.

This corollary is a consequence of the Bernstein inequality in Dedecker and Fan (2015) which requires the following assumption.

Assumption 5.

There exist some constants , and such that

where and are defined by

Note that for .

(5)

Assumption 2 implies that , for each . With the Euclidean norm thus from Hölder inequality we obtain with

Consider now some odd number such that then

since from Stirling inequality and thus ; we also use the relation .
Note that the independence of coordinates is not required.
Since , Assumption 5 is satisfied with

Moreover, has been chosen as an upper bound of . Then we choose

7.2 Proof of the theorems of Section 3

Proof of Theorem 3.1.

A first step is to derive some generally useful bounds from empirical process techniques. Recall that the Frobenius norm of a matrix satisfies

Define, for a set of matrices, its covering number with respect to the Frobenius norm.
In their Lemma 3.1, Candès and Plan (2011) prove a key combinatorial property about the sphere , which is the set of matrices with rank and with Frobenius norm : for , it may be covered by an -net (in this metric) with cardinality


Now denote and respectively the sphere and the ball of such matrices with rank and with either or .
A homogeneity argument first allows one to derive the existence of an -net of with cardinality .
Now for , then an net of is provided by with . Then

Still using we can simplify further inequalities: so . As in addition , one has , which leads to

The above inequalities relating the various norms imply that and so

(6)

as soon as .

Note that the above covering numbers are considered with respect to the Frobenius norm; thus the above inequality also applies to the covering numbers with respect to .

Remark 7.1.

For the model of Remark 3.2, note that the bound in (6) admits a right-hand side transformed as

which does not change the structure of the excess risk behaviour.

Fix and (their values will be given later). Using both Corollary 7.1 and inequality (6), we have

Now, for any define by

Obviously

and as a consequence,

and

Applying Bernstein Inequality (3.3) in Proposition 3.1 of Dedecker and Fan (2015) with and we have, for any ,

Now let us consider the “favorable” event

The previous inequalities show that

(7)

On , we have: