Abstract


Generative moment matching networks (GMMNs) are introduced as dependence models for the joint innovation distribution of multivariate time series (MTS). Following the popular copula–GARCH approach for modeling dependent MTS data, a framework allowing us to take an alternative GMMN–GARCH approach is presented. First, ARMA–GARCH models are utilized to capture the serial dependence within each univariate marginal time series. Second, if the number of marginal time series is large, principal component analysis (PCA) is used as a dimension-reduction step. Last, the remaining cross-sectional dependence is modeled via a GMMN, our main contribution. GMMNs are highly flexible and easy to simulate from, which is a major advantage over the copula–GARCH approach. Applications involving yield curve modeling and the analysis of foreign exchange rate returns are presented to demonstrate the utility of our approach, especially in terms of producing better empirical predictive distributions and making better probabilistic forecasts. All results are reproducible with the demo GMMN_MTS_paper of the R package gnn.




Multivariate time-series modeling with generative neural networks

Marius Hofert, Avinash Prasad, Mu Zhu



Keywords: Generative moment matching networks, learning distributions, copulas, probabilistic forecasts, ARMA–GARCH time series, yield curves, exchange rate dependence.

MSC2010: 62H99, 65C60, 60E05, 00A72, 65C10, 62M10.

1 Introduction

The task of modeling multivariate time series (MTS) data arises in a variety of applications in finance, economics and quantitative risk management. In many situations, a suitable model arises from breaking down this task into two key components: the modeling of serial dependence within each univariate time series and the modeling of cross-sectional dependence between the individual time series. There is a plethora of literature on univariate time series modeling with a wide range of models that are tailor-made for capturing various types of serial patterns such as seasonality, volatility clustering or regime switching. In the realm of financial econometrics, the class of generalized auto-regressive conditional heteroscedasticity (GARCH) models (see [bollerslev1986]) is a popular choice. GARCH-type models are designed to account for stylized facts (such as volatility clustering) that are often present in financial return series data; see [mcneilfreyembrechts2015, Chapter 3].

There have been numerous approaches proposed for extending univariate time series modeling approaches to the multivariate case. Within the broad GARCH framework, [bollerslev1990] initially introduced a multivariate model characterized by the distributional assumption of multivariate normality with a constant conditional correlation structure. Dynamic conditional correlation (DCC)-GARCH models were then introduced by [engle2002] and [tse2002]. DCC-GARCH models relax the conditional correlation assumption but still utilize multivariate normal distributions to model the cross-sectional dependence between the univariate time series. Leveraging Sklar’s theorem ([sklar1959]), [jondeaurockinger2006] and [patton2006] presented a flexible family of multivariate GARCH models where the assumption of multivariate normality has been relaxed to allow for any copula of the joint innovation distribution. This popular modeling approach for MTS data is known as the copula–GARCH approach; see [patton2012] for a brief overview in the context of finance and econometrics. It allows us to flexibly model joint innovation distributions with copulas, thereby decomposing the MTS modeling task into modeling of the marginal (univariate) time series and their cross-sectional dependence. There have been various research papers investigating the calibration of copula–GARCH models, for example the more recent work of [oh2017], [almeida2016] or [aas2016].

While there is a growing collection of copula models used to characterize complex dependence structures, most models are rather limited already in moderately large dimensions and often do not provide an adequate fit to given data (see for example [hofertoldford2018]) or require sophisticated, model-specific algorithms for parameter estimation and model selection. In this paper, we propose a framework for MTS modeling in which a classical copula model to account for cross-sectional dependence is replaced by a type of generative neural network known as the generative moment matching network (GMMN). In comparison to classical copulas, GMMNs can capture a large variety of complex dependence structures. For high-dimensional time series data, we incorporate principal component analysis (PCA) as an intermediate step to reduce the dimensionality. Our primary goal is to construct empirical predictive distributions, also known as probabilistic forecasts, rather than point forecasts. Additionally, these empirical predictive distributions can be utilized to further forecast various quantities of interest (e.g., quantiles) via simulation.

The paper is organized as follows. In Section 2, we outline our framework for modeling MTS data. In particular, we focus on the novel integration of GMMNs within this framework. In Section 3, we showcase our GMMN-based multivariate time series models in applications to yield curve and exchange rate data. Section 4 provides concluding remarks. All results in this paper can be reproduced with the demo GMMN_MTS_paper in the R package gnn.

2 Framework for multivariate time series modeling

Let $(\mathbf{X}_t)_{t\in\mathbb{Z}}$ denote a $d$-dimensional time series of interest, where each $\mathbf{X}_t=(X_{t,1},\dots,X_{t,d})$. Furthermore, consider a stretch of $n$ realizations from $(\mathbf{X}_t)_{t\in\mathbb{Z}}$ denoted by $\mathbf{X}_1,\dots,\mathbf{X}_n$. In applications in finance (risk management), these are often log-returns (negative log-returns) of asset prices; see Section 3 for more details and the pre-processing steps applied to each empirical dataset we consider.

Our suggested framework for modeling consists of three primary components:

  1. marginal time series modeling—while many possibilities can be considered, we focus on ARMA–GARCH models;

  2. dimension reduction—again, many tools are available, but we simply utilize PCA; and

  3. dependence modeling—here, the typical approach is to choose a parametric copula, but we introduce the use of GMMNs, the main contribution of this paper.

While Step 1 and Step 3 are essential, the dimension reduction component in Step 2 is optional and typically only used for high-dimensional time series which are amenable to good approximations by lower-dimensional representations.

2.1 Marginal time series modeling

The ARMA–GARCH models in Step 1 are ARMA models with GARCH errors; see [mcneilfreyembrechts2015, Section 4.2.3]. An ARMA($p_1$,$q_1$)–GARCH($p_2$,$q_2$) model has the form
\begin{align*}
X_{t,j} &= \mu_{t,j} + \sigma_{t,j} Z_{t,j},\\
\mu_{t,j} &= \mu_j + \sum_{k=1}^{p_1} \phi_{j,k}(X_{t-k,j}-\mu_j) + \sum_{k=1}^{q_1} \theta_{j,k}(X_{t-k,j}-\mu_{t-k,j}),\\
\sigma_{t,j}^2 &= \alpha_{j,0} + \sum_{k=1}^{p_2} \alpha_{j,k}(X_{t-k,j}-\mu_{t-k,j})^2 + \sum_{k=1}^{q_2} \beta_{j,k}\,\sigma_{t-k,j}^2,
\end{align*}
where, for each component $j=1,\dots,d$, one has $\alpha_{j,0}>0$, $\alpha_{j,k}\ge 0$ and $\beta_{j,k}\ge 0$ for all $k$. Some additional conditions on the coefficients—namely, the $\phi_{j,k}$’s, $\theta_{j,k}$’s, $\alpha_{j,k}$’s and $\beta_{j,k}$’s—are necessary to ensure that all ARMA- and GARCH-processes are respectively causal and covariance stationary \parencite[see, e.g.,][Sections 4.1.2–4.2.2]mcneilfreyembrechts2015, but we do not go into detail here.

For each $j$, the innovations $(Z_{t,j})_{t\in\mathbb{Z}}$ in the definition of the ARMA–GARCH model are independent and identically distributed (iid) random variables with $\mathbb{E}(Z_{t,j})=0$ and $\operatorname{var}(Z_{t,j})=1$; their realizations after fitting marginal models are known as standardized residuals and denoted by $\hat{Z}_{t,j}$, $t=1,\dots,n$, $j=1,\dots,d$. In financial time series applications, common choices of innovation distributions include the standard normal, the scaled $t$ and the skewed $t$ distributions.

Fitting the marginal time series models is typically done by fitting low-order models with likelihood-based methods and selecting the most adequate fit among the candidate models using the AIC/BIC model selection criteria. A popular broad-brush approach for financial return series is to fit one fixed low-order model—specifically, an ARMA(1,1)–GARCH(1,1) model in our context—and continue the modeling based on the standardized residuals $\hat{Z}_{t,j}$; see [mcneilfreyembrechts2015, Chapter 4] or [hofertkojadinovicmaechleryan2018, Section 6.2.3]. This procedure is also referred to as deGARCHing. With the help of model diagnostic tools—for example, plots of the autocorrelation function (ACF) of the standardized residuals and of their squared values, Ljung–Box tests or assessment of the innovation distribution through Q–Q plots—one can then assess the adequacy of each marginal time series model. In what follows we use $\hat{\mu}_{t,j}$ and $\hat{\sigma}_{t,j}^2$ to denote the estimated conditional mean and variance models for the $j$th marginal time series with corresponding chosen orders and fitted parameters.
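To make the deGARCHing step concrete, here is a minimal Python/NumPy sketch (the paper itself uses R's rugarch and qrmtools); it simulates one ARMA(1,1)–GARCH(1,1) margin and then recovers the standardized residuals from the observed series. For illustration the true parameters are reused in place of fitted ones, and all function and variable names are ours, not from the paper.

```python
import numpy as np

def simulate_arma_garch(n, mu, phi, theta, alpha0, alpha1, beta1, rng):
    """Simulate an ARMA(1,1)-GARCH(1,1) path X_t = mu_t + sigma_t * Z_t."""
    Z = rng.standard_normal(n)                 # iid innovations
    X = np.empty(n); eps = np.empty(n); sig2 = np.empty(n)
    sig2[0] = alpha0 / (1 - alpha1 - beta1)    # unconditional variance
    eps[0] = np.sqrt(sig2[0]) * Z[0]
    X[0] = mu + eps[0]
    for t in range(1, n):
        sig2[t] = alpha0 + alpha1 * eps[t-1]**2 + beta1 * sig2[t-1]
        eps[t] = np.sqrt(sig2[t]) * Z[t]
        X[t] = mu + phi * (X[t-1] - mu) + theta * eps[t-1] + eps[t]
    return X, Z

def standardized_residuals(X, mu, phi, theta, alpha0, alpha1, beta1):
    """deGARCHing: recover Z_t from X_t given (here: known) parameters."""
    n = len(X)
    eps = np.empty(n); sig2 = np.empty(n)
    sig2[0] = alpha0 / (1 - alpha1 - beta1)
    eps[0] = X[0] - mu
    for t in range(1, n):
        sig2[t] = alpha0 + alpha1 * eps[t-1]**2 + beta1 * sig2[t-1]
        eps[t] = X[t] - (mu + phi * (X[t-1] - mu) + theta * eps[t-1])
    return eps / np.sqrt(sig2)

rng = np.random.default_rng(1)
X, Z = simulate_arma_garch(1000, 0.0, 0.5, 0.2, 0.1, 0.1, 0.85, rng)
Zhat = standardized_residuals(X, 0.0, 0.5, 0.2, 0.1, 0.1, 0.85)
```

With fitted rather than true parameters, `Zhat` would only approximately recover the innovations; the point is that the same conditional mean and variance recursions drive both simulation and deGARCHing.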

Having accounted for the marginal serial dependence in this way, the subsequent analysis in our modeling framework will operate on the standardized residuals $\hat{Z}_{t,j}$, $t=1,\dots,n$, $j=1,\dots,d$, which are themselves realizations of the innovation random variables $Z_{t,j}$, assumed to be iid in the copula–GARCH approach.

Before we continue, we emphasize once again that any other adequate marginal time series modeling approach can be applied in our framework, as long as the model’s residuals can be considered to be iid from continuous marginal distributions. Our choice of ARMA–GARCH models is motivated only by the fact that these are the most popular marginal time series models used in practice.

2.2 Dimension reduction

Two popular dimension-reduction techniques for multivariate financial time series are factor modeling and PCA; see [mcneilfreyembrechts2015, Chapter 6] and the references therein for a brief summary. An approach that is perhaps less discussed in the financial econometrics literature involves using autoencoder neural networks for dimension reduction, in which two separate neural network mappings are learned to and from the lower-dimensional space; see [hinton2006]. As dimension reduction is not our main contribution in this work, we simply utilize PCA in what follows.

Note that PCA is often applied to the original MTS data in the literature; see, e.g., [alexander2000] for an investigation of the so-called orthogonal GARCH model. Apart from reducing the burden of marginal time series modeling, there is no strong reason why PCA should be applied to potentially non-stationary data. If dimension reduction is necessary, we find it statistically more sound to apply PCA to the standardized residuals after first accounting for any serial dependence in the marginal time series.

Let $\Sigma$ denote the sample covariance matrix of the standardized residuals $\hat{\mathbf{Z}}_t$, $t=1,\dots,n$. The result from PCA is the orthogonal matrix $\Gamma$ whose columns consist of the eigenvectors of $\Sigma$, sorted according to decreasing eigenvalues $\lambda_1\ge\lambda_2\ge\dots\ge\lambda_d$. For the purposes of dimension reduction, $\hat{\mathbf{Z}}_t$, $t=1,\dots,n$, are transformed to $\mathbf{Y}_t=\Gamma_k^\top\hat{\mathbf{Z}}_t$, where $\Gamma_k$ represents the first $k$ columns of $\Gamma$ for some $k\le d$. As a result, the sample covariance matrix of $\mathbf{Y}_1,\dots,\mathbf{Y}_n$ is (approximately) diagonal, and the components of $\mathbf{Y}_t$ are (approximately) uncorrelated. The $j$th component series $(Y_{t,j})_{t=1}^{n}$, $j=1,\dots,k$, forms realizations of the $j$th principal component, and the first $k$ principal component series account for $\sum_{j=1}^{k}\lambda_j/\sum_{j=1}^{d}\lambda_j$ of the total variance.
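As an illustration of this PCA step, the following Python/NumPy sketch (the eigendecomposition of the sample covariance is standard; all names and toy sizes are ours) reduces a 5-dimensional sample of standardized residuals to its first two principal components:

```python
import numpy as np

def pca_reduce(Z, k):
    """PCA dimension reduction of standardized residuals Z (n x d).

    Returns the n x k principal-component series, the d x k matrix of
    leading eigenvectors of the sample covariance (sorted by decreasing
    eigenvalue), and the fraction of total variance explained.
    """
    Zc = Z - Z.mean(axis=0)
    S = np.cov(Zc, rowvar=False)          # sample covariance matrix
    evals, evecs = np.linalg.eigh(S)      # ascending eigenvalues
    order = np.argsort(evals)[::-1]       # reorder decreasingly
    evals, evecs = evals[order], evecs[:, order]
    Gk = evecs[:, :k]                     # first k eigenvectors
    Y = Zc @ Gk                           # principal-component series
    explained = evals[:k].sum() / evals.sum()
    return Y, Gk, explained

rng = np.random.default_rng(2)
Z = rng.standard_normal((500, 5)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1])
Y, Gk, expl = pca_reduce(Z, 2)
```

The components of `Y` are uncorrelated in-sample by construction, and `expl` reports the captured share of total variance.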

As dimension reduction is an optional component in our modeling framework, the next step involves dependence modeling of either the standardized residuals or the principal components. To unify the notation going forward, we define a $k$-dimensional time series $\mathbf{Y}_t = \Gamma_k^\top \hat{\mathbf{Z}}_t$, where $\Gamma_k$ consists of the first $k<d$ columns of $\Gamma$ if dimension reduction is employed and $\Gamma_k = I_d$ (the identity matrix in $\mathbb{R}^{d\times d}$) otherwise; consequently, $k<d$ in the former case and $k=d$ in the latter. Furthermore, we treat $\mathbf{Y}_1,\dots,\mathbf{Y}_n$ as realizations from $(\mathbf{Y}_t)$, where, naturally, $\mathbf{Y}_t = \Gamma_k^\top \mathbf{Z}_t$ for the innovation process $(\mathbf{Z}_t)$.

2.3 Dependence modeling

The final task in our framework involves the modeling of the iid series $\mathbf{Y}_1,\dots,\mathbf{Y}_n$. To account for cross-sectional dependence, we model the joint distribution function $F_{\mathbf{Y}}$ of $\mathbf{Y}_t$ using Sklar’s theorem as
$$F_{\mathbf{Y}}(\mathbf{y}) = C\big(F_1(y_1),\dots,F_k(y_k)\big),\quad \mathbf{y}\in\mathbb{R}^k,$$
where $F_j$, $j=1,\dots,k$, are the margins of $F_{\mathbf{Y}}$ and $C$ is the copula of $\mathbf{Y}_t$.

Following a classical copula modeling approach, one first builds the pseudo-observations $U_{t,j} = R_{t,j}/(n+1)$, $t=1,\dots,n$, $j=1,\dots,k$, where $R_{t,j}$ denotes the rank of $Y_{t,j}$ among $Y_{1,j},\dots,Y_{n,j}$. The pseudo-observations $\mathbf{U}_1,\dots,\mathbf{U}_n$ are viewed as realizations from $C$ based on which one would fit candidate copula models; see, for example, [mcneilfreyembrechts2015, Section 7.5.1] or [hofertkojadinovicmaechleryan2018, Section 4.1.2]. Note that by considering (non-parametric) pseudo-observations (even in the case when we do not apply a dimension reduction technique and thus know the (fitted) marginal innovation distributions), we reduce the risk that misspecifying one of the margins affects the estimation of the copula $C$; see [genest2010] for a theoretical justification of this approach. Therefore, going forward, we will use the pseudo-observations $\mathbf{U}_1,\dots,\mathbf{U}_n$ to model the cross-sectional dependence structure of $\mathbf{Y}_t$.
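A minimal sketch of the rank transform just described, in Python/NumPy (the paper's implementation is in R); with continuous data there are no ties, so a double argsort yields the componentwise ranks:

```python
import numpy as np

def pseudo_observations(Y):
    """Componentwise rank transform: U[t, j] = rank(Y[t, j]) / (n + 1).

    Maps each column of the n x k sample Y to (0, 1) so that only the
    cross-sectional dependence (the empirical copula) is retained.
    """
    n = Y.shape[0]
    ranks = Y.argsort(axis=0).argsort(axis=0) + 1  # ranks 1..n per column
    return ranks / (n + 1.0)

rng = np.random.default_rng(3)
Y = rng.standard_normal((100, 3))
U = pseudo_observations(Y)
```

Dividing by $n+1$ rather than $n$ keeps all values strictly inside the open unit interval.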

Dependence modeling with parametric copulas

A traditional approach for modeling the cross-sectional dependence described by involves the fitting of parametric copula models, their goodness-of-fit assessment and finally, model selection. There are numerous families of copula models to consider depending on prominent features of the dependence structure present in such as (a)symmetries or a concentration of points in the lower/upper tail of the joint distribution (or pairs of such) which hints at an adequate model possessing tail dependence.

A well-known problem with this approach is that it is often hard to find an adequate copula model for given real-life data, especially in higher dimensions where typically some pairwise dependencies contradict the corresponding model-implied marginal copulas; see, for example, [hofertoldford2018]. Another problem is that certain copula models are computationally expensive to fit and test for goodness-of-fit. In Section 3, we investigate whether (the much more flexible) GMMNs can outperform prominent elliptical and Archimedean copulas in the context of our framework. To this end, in what follows we shall denote by a (generic) parametric copula model fitted to the pseudo-observations .

Dependence modeling with GMMNs

We propose to utilize generative neural networks (in particular, GMMNs) for modeling the cross-sectional dependence structure of the pseudo-observations . In our framework, a generative neural network with parameters learns the distribution of the pseudo-observations. Let denote the empirical copula based on a sample generated from a trained GMMN , that is, a GMMN with fitted parameter vector .

GMMNs, also known as maximum mean discrepancy (MMD) nets, were introduced simultaneously by [li2015] and [dziugaite2015]. A GMMN utilizes a kernel maximum mean discrepancy statistic as the loss function to learn the distribution of the pseudo-observations. Conceptually, a GMMN can be thought of as a parametric map from a random vector with (known) prior distribution to the target distribution. As is standard in the literature, we assume the prior samples to be iid. Typical choices of the prior are the standard uniform and the standard normal distributions; we utilize the latter. Based on the fitted GMMN we can then generate samples whose copula approximates the target copula of $\mathbf{Y}_t$. As demonstrated in [hofert2018], GMMNs provide a flexible class of models capable of learning a variety of complex dependence structures.

Next, we briefly discuss three key aspects when applying GMMNs: the architecture of the neural network , the loss function and the training procedure for the estimation of .

Feedforward neural networks

We now introduce the neural networks we work with in this paper. The feedforward neural network (also known as the multi-layer perceptron) is the quintessential deep neural network, which we simply refer to as neural network (NN) in what follows. Let $L$ be the number of hidden layers in the NN and, for each $l=0,1,\dots,L+1$, let $d_l$ be the dimension of layer $l$, that is, the number of neurons in layer $l$. Layer $0$ refers to the input layer, which consists of the input $\mathbf{a}^{(0)}=\mathbf{x}$, and layer $L+1$ refers to the output layer, which consists of the output $\mathbf{a}^{(L+1)}$. Layers $l=1,\dots,L+1$ can be described in terms of the output $\mathbf{a}^{(l-1)}$ of layer $l-1$ via
$$\mathbf{a}^{(l)} = f_l\big(W_l\,\mathbf{a}^{(l-1)} + \mathbf{b}_l\big),$$
with weight matrices $W_l\in\mathbb{R}^{d_l\times d_{l-1}}$, bias vectors $\mathbf{b}_l\in\mathbb{R}^{d_l}$ and activation functions $f_l$; the latter are understood to be applied componentwise for vector inputs. Some commonly used activation functions include the sigmoid activation function $f(x)=1/(1+e^{-x})$ and the rectified linear unit (ReLU) activation function $f(x)=\max\{0,x\}$.

Figure 1: Structure of a NN with an input layer, a hidden layer and an output layer.

The NN can then be written as the composition
$$f_\theta(\mathbf{x}) = f_{L+1}\big(W_{L+1}\,f_L(\cdots f_1(W_1\mathbf{x}+\mathbf{b}_1)\cdots)+\mathbf{b}_{L+1}\big),$$
with its flattened parameter vector $\theta=(W_1,\mathbf{b}_1,\dots,W_{L+1},\mathbf{b}_{L+1})$. Figure 1 visualizes this construction and the notation we use. GMMNs are NNs of this type which, for training (i.e., the fitting of $\theta$), utilize a specific loss function introduced next.
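The layer recursion and its composition can be sketched in a few lines of Python/NumPy; the toy layer sizes and random weights below are ours and purely illustrative, with a sigmoid output layer so that values land in $(0,1)$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def nn_forward(x, weights, biases, activations):
    """Forward pass a^(l) = f_l(W_l a^(l-1) + b_l) through all layers."""
    a = x
    for W, b, f in zip(weights, biases, activations):
        a = f(W @ a + b)
    return a

# A toy network mapping a 3-dimensional input to a 2-dimensional output:
rng = np.random.default_rng(4)
weights = [rng.standard_normal((5, 3)), rng.standard_normal((2, 5))]
biases = [np.zeros(5), np.zeros(2)]
out = nn_forward(rng.standard_normal(3), weights, biases, [relu, sigmoid])
```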

Loss function

To learn $\theta$, we work with the $n$ training data points consisting of the pseudo-observations $\mathbf{U}_1,\dots,\mathbf{U}_n$. Given an input sample $\mathbf{N}_1,\dots,\mathbf{N}_{n_{\mathrm{gen}}}$ from the prior distribution, the GMMN generates an output sample $\mathbf{V}_i=f_\theta(\mathbf{N}_i)$, $i=1,\dots,n_{\mathrm{gen}}$. In selecting an appropriate loss function, we are naturally interested in measuring whether the two samples $\mathbf{U}_1,\dots,\mathbf{U}_n$ and $\mathbf{V}_1,\dots,\mathbf{V}_{n_{\mathrm{gen}}}$ can be deemed to come from the same distribution.

To do so, GMMNs use the maximum mean discrepancy (MMD) as loss function, which was introduced as a two-sample test statistic by [gretton2007]. For a given embedding function $\phi$, the MMD measures the distance between two sample statistics, $\frac{1}{n}\sum_{t=1}^{n}\phi(\mathbf{U}_t)$ and $\frac{1}{n_{\mathrm{gen}}}\sum_{i=1}^{n_{\mathrm{gen}}}\phi(\mathbf{V}_i)$, in the embedded space via
$$\mathrm{MMD}(\phi)=\Big\|\frac{1}{n}\sum_{t=1}^{n}\phi(\mathbf{U}_t)-\frac{1}{n_{\mathrm{gen}}}\sum_{i=1}^{n_{\mathrm{gen}}}\phi(\mathbf{V}_i)\Big\|.$$

If we can choose $\phi$ to be a kind of “distributional embedding”, for example, in the sense that the two statistics contain all empirical moments of $\mathbf{U}_t$ and $\mathbf{V}_i$, respectively, then the MMD criterion will have achieved our desired purpose (of measuring whether the two samples have the same distribution). Amazingly, such an embedding does exist.

By the so-called “kernel trick”, known as early as [mercer1909] but not in wide use until support vector machines became popular almost a century later, the inner product $\langle\phi(\mathbf{x}),\phi(\mathbf{y})\rangle$ can be computed in a reproducing kernel Hilbert space by $\langle\phi(\mathbf{x}),\phi(\mathbf{y})\rangle=K(\mathbf{x},\mathbf{y})$, where $K$ denotes a kernel similarity function. Hence, for a given kernel function $K$, the squared statistic above is equivalent to
$$\mathrm{MMD}^2(K)=\frac{1}{n^2}\sum_{t=1}^{n}\sum_{s=1}^{n}K(\mathbf{U}_t,\mathbf{U}_s)-\frac{2}{n\,n_{\mathrm{gen}}}\sum_{t=1}^{n}\sum_{i=1}^{n_{\mathrm{gen}}}K(\mathbf{U}_t,\mathbf{V}_i)+\frac{1}{n_{\mathrm{gen}}^2}\sum_{i=1}^{n_{\mathrm{gen}}}\sum_{j=1}^{n_{\mathrm{gen}}}K(\mathbf{V}_i,\mathbf{V}_j).\quad (1)$$

If $K$ is chosen to be a so-called universal kernel function, such as a Gaussian or Laplace kernel, then the associated implicit embedding $\phi$ is indeed a “distributional embedding” in the sense described above, and one can show that the sample MMD converges in probability to $0$ as the sample sizes tend to infinity if and only if the two samples come from the same distribution \parencitegretton2007,gretton2012.

As suggested by [li2015], we opt to work with a mixture of Gaussian kernels (rather than a single Gaussian kernel) with different bandwidth parameters,
$$K(\mathbf{x},\mathbf{y})=\sum_{k=1}^{n_{\mathrm{ker}}}K_{\sigma_k}(\mathbf{x},\mathbf{y}),$$
where $n_{\mathrm{ker}}$ denotes the number of mixture components and $K_{\sigma}(\mathbf{x},\mathbf{y})=\exp(-\|\mathbf{x}-\mathbf{y}\|^2/(2\sigma^2))$ is the Gaussian kernel with bandwidth parameter $\sigma$. Specific details on the choice of the number of mixture components and the bandwidth parameters $\sigma_1,\dots,\sigma_{n_{\mathrm{ker}}}$ will be provided in Section 3.
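The resulting loss can be computed directly from the two samples. Below is a Python/NumPy sketch of the squared MMD with a sum of Gaussian kernels $\exp(-\|x-y\|^2/(2\sigma^2))$; the sample sizes and bandwidths are illustrative choices, not the ones used in Section 3:

```python
import numpy as np

def mmd2_mixture(X, Y, bandwidths):
    """Squared MMD between samples X (n x d) and Y (m x d) under a kernel
    that is a sum of Gaussian kernels with the given bandwidths."""
    def K(A, B):
        D2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return sum(np.exp(-D2 / (2.0 * s ** 2)) for s in bandwidths)
    n, m = len(X), len(Y)
    return (K(X, X).sum() / n**2
            - 2.0 * K(X, Y).sum() / (n * m)
            + K(Y, Y).sum() / m**2)

rng = np.random.default_rng(5)
same = mmd2_mixture(rng.uniform(size=(200, 2)), rng.uniform(size=(200, 2)),
                    bandwidths=[0.1, 0.5, 1.0])
diff = mmd2_mixture(rng.uniform(size=(200, 2)),
                    0.5 + 0.1 * rng.standard_normal((200, 2)),
                    bandwidths=[0.1, 0.5, 1.0])
```

Two samples from the same distribution yield a value near zero, while samples from clearly different distributions yield a substantially larger value.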

Thus, to train the GMMN $f_\theta$, we perform the optimization
$$\hat{\theta}=\operatorname*{argmin}_{\theta}\,\mathrm{MMD}^2\big(\mathbf{U};\,f_\theta(\mathbf{N})\big),$$
where $\mathbf{U}$ denotes the matrix of pseudo-observations, $\mathbf{N}$ the matrix of prior samples and the NN transform $f_\theta$ is understood to be applied row-wise.

Training GMMNs

We now discuss how we can train the GMMN, that is, how we can estimate the parameter vector $\theta$. For the sake of convenience, we always simply set $n_{\mathrm{gen}}=n$ while training the GMMN. (However, after training we can still generate an arbitrary number of samples from the fitted GMMN.)

Directly optimizing the loss function in (1), also known as batch optimization, would involve all pairs of observations, which is memory-prohibitive for even moderately large $n$. Hence, we adopt a mini-batch optimization procedure, where we partition the training dataset into batches of size $n_{\mathrm{bat}}$ and use the batches sequentially to update $\theta$. After all the training data are exhausted, i.e., after roughly $n/n_{\mathrm{bat}}$-many gradient steps, one epoch of the training of the GMMN is completed. Batch optimization arises as a special case of this mini-batch optimization procedure when we set $n_{\mathrm{bat}}=n$; it can be used with relatively small data sets. To update the parameters $\theta$, we utilize the Adam optimizer of [kingma2014b], which uses a weighted combination of the current gradient and past gradients from earlier iterations. The trade-off in utilizing mini-batches, particularly with a smaller batch size $n_{\mathrm{bat}}$, is that each gradient step in the optimization is based only on a partial loss function.
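For completeness, a single Adam update can be sketched as follows (Python/NumPy, following the moment-estimate recursions of [kingma2014b, Algorithm 1]; the toy quadratic objective used for the check is ours):

```python
import numpy as np

def adam_step(theta, grad, state, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: exponentially weighted moving averages of the
    gradient (m) and its elementwise square (v), with bias correction."""
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1.0 - b1) * grad
    state["v"] = b2 * state["v"] + (1.0 - b2) * grad ** 2
    m_hat = state["m"] / (1.0 - b1 ** state["t"])
    v_hat = state["v"] / (1.0 - b2 ** state["t"])
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)

# Toy check: minimize f(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([1.0, -2.0])
state = {"t": 0, "m": np.zeros(2), "v": np.zeros(2)}
for _ in range(5000):
    theta = adam_step(theta, 2.0 * theta, state)
```

In GMMN training, `grad` would be the gradient of the (mini-batch) MMD loss with respect to $\theta$, obtained via automatic differentiation.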

Algorithm 2.3 summarizes the training of the GMMN with a mini-batch optimization procedure.


[Training GMMNs]

  1. Fix the number of epochs $n_{\mathrm{epo}}$ and the sample size per batch (the so-called batch size) $n_{\mathrm{bat}}$, where $n_{\mathrm{bat}}$ is assumed to divide $n$. Initialize the epoch counter and the GMMN’s parameter vector $\theta$; we follow [glorot2010] and initialize the components of the weight matrices $W_l$ as draws from $\mathrm{U}\big(-\sqrt{6/(d_{l-1}+d_l)},\,\sqrt{6/(d_{l-1}+d_l)}\big)$ and the bias vectors as $\mathbf{b}_l=\mathbf{0}$ for $l=1,\dots,L+1$.

  2. For epoch , do:

    1. Randomly partition the training sample and the prior distribution sample into corresponding non-overlapping batches and , , of size each.

    2. For batch , do:

      1. Compute the GMMN output , .

      2. Compute the gradient from the samples and via automatic differentiation.

      3. Take a gradient step to update to according to Adam; see [kingma2014b, Algorithm 1].

  3. Return ; the fitted GMMN is then .

2.4 Simulating paths of dependent multivariate time series

After utilizing our framework for modeling multivariate time series, a typical next step is to simulate paths from the fitted/trained multivariate model. With these simulated paths we immediately obtain empirical predictive distributions at future time points. Additionally, we can forecast quantities of interest such as (confidence) intervals or risk-measures (for example value-at-risk or expected shortfall) based on the simulated paths. Some of these quantities will be discussed further in Section 3. In this section, we focus on how to simulate the required paths in our framework.

To fix ideas, suppose we are interested in a set of future time points. Furthermore, let $h$ denote the simulation horizon. Then, for each such time point $t$, once all realizations up to and including time $t$—namely, the entire sequence $\mathbf{X}_1,\dots,\mathbf{X}_t$—become available, we can simulate multiple paths $\mathbf{X}_{t+1},\dots,\mathbf{X}_{t+h}$ going forward for a total of $h$ time periods.

A key component for simulating these paths is the generation of samples from the estimated dependence model. For fitted parametric copulas, one typically uses a model-specific stochastic representation to sample; see, for example, [hofertkojadinovicmaechleryan2018, Chapter 3]. Sampling from the fitted GMMN (with corresponding empirical copula) can be done as follows.

[GMMN sampling]

  1. Fix the number of samples to generate from .

  2. Draw from the prior distribution.

  3. Return , .

Since copulas have standard uniform univariate margins, we typically equip Algorithm 2.4 with a post-processing step, returning the pseudo-observations of the generated sample to remove any residual marginal non-uniformity from the GMMN samples.
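Algorithm 2.4 together with this post-processing step can be sketched as follows in Python/NumPy; the single-hidden-layer network here is untrained with random weights, and the standard normal prior and tanh/sigmoid activations are illustrative assumptions, not prescriptions from the paper:

```python
import numpy as np

def sample_gmmn(n_gen, weights, biases, rng):
    """Draw prior samples, push them through the network, and
    rank-transform each output component so that every margin is a
    discretely uniform sample on (0, 1) (pseudo-observations)."""
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))
    p = weights[0].shape[1]
    N = rng.standard_normal((n_gen, p))        # prior sample, N(0, I_p)
    H = np.tanh(N @ weights[0].T + biases[0])  # hidden layer
    V = sigmoid(H @ weights[1].T + biases[1])  # raw GMMN output in (0,1)^d
    ranks = V.argsort(axis=0).argsort(axis=0) + 1
    return ranks / (n_gen + 1.0)               # pseudo-observations

rng = np.random.default_rng(6)
weights = [rng.standard_normal((10, 3)), rng.standard_normal((2, 10))]
biases = [np.zeros(10), np.zeros(2)]
V = sample_gmmn(500, weights, biases, rng)
```

After the rank transform, every component of the output has an exactly uniform empirical margin, so only the learned cross-sectional dependence remains.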

For any given , we can now utilize Algorithm 2.4 along with the fitted marginal time series models in our framework in order to simulate multiple paths with a fixed simulation horizon , as outlined in Algorithm 2.4 below.


[Simulating paths of dependent multivariate time series via GMMNs]

  1. Fix the number of sample paths and the simulation horizon .

  2. For do:

    1. Generate , , , from the fitted GMMN via Algorithm 2.4.

    2. For every in Step 2a, construct . If no dimension reduction is utilized, the marginals , , are the fitted parametric innovation distributions selected as part of the ARMA–GARCH model setup; otherwise, they are the empirical distribution functions of , .

    3. For every in Step 2b, construct samples from the fitted innovation distributions via the transform . (Note that whereas .)

    4. For each , compute , and , for and , via

      where, for , set , , and for all .

    5. Return , , .

Note that, if one wishes, Step 2a in Algorithm 2.4 can be replaced by sampling from the fitted parametric copula to obtain the classically applied approach for sampling paths.

While Algorithm 2.4 describes how to simulate paths of multivariate time series for any simulation horizon, we will focus on one-period-ahead (horizon 1) empirical predictive distributions henceforth.
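In the one-period-ahead case, the simulation step for a single ARMA(1,1)–GARCH(1,1) margin reduces to one deterministic conditional mean/variance update plus one draw of dependent innovations. A Python/NumPy sketch, with all names and numerical values illustrative:

```python
import numpy as np

def one_step_ahead(X_last, eps_last, sig2_last, Z_new,
                   mu, phi, theta, alpha0, alpha1, beta1):
    """One-period-ahead simulated values for one marginal
    ARMA(1,1)-GARCH(1,1) series: the conditional mean and variance at
    time t+1 are fully determined by the past, so each dependent
    innovation Z_new[m] yields one simulated path value."""
    sig2_next = alpha0 + alpha1 * eps_last**2 + beta1 * sig2_last
    mu_next = mu + phi * (X_last - mu) + theta * eps_last
    return mu_next + np.sqrt(sig2_next) * Z_new, mu_next, sig2_next

rng = np.random.default_rng(7)
Z_new = rng.standard_normal(1000)   # stand-in for dependence-model draws
X_paths, mu_next, sig2_next = one_step_ahead(
    X_last=0.5, eps_last=0.3, sig2_last=1.2, Z_new=Z_new,
    mu=0.0, phi=0.5, theta=0.2, alpha0=0.1, alpha1=0.1, beta1=0.85)
```

In the full framework, `Z_new` would come from the GMMN (via the fitted innovation quantile functions) rather than from independent standard normals, so that the cross-sectional dependence carries over to the simulated paths.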

2.5 Assessing the quality of predictions of dependent multivariate time series models

In this section, we discuss the metrics we will use in all numerical investigations in this paper to assess and compare various MTS models. Of particular interest is the comparison of GMMN–GARCH and copula–GARCH models. In practice, to assess the out-of-sample performance of our models, realizations of time series will naturally be divided into separate training and test data sets. To that end, suppose that we have realizations , , that have been set aside (i.e., not used for training) as a separate test set; we will also refer to as the test period.

Assessing the quality of dependence models in the test period

We can use the MMD statistic to measure how closely the empirical distributions of a fitted GMMN and a fitted parametric copula match the cross-sectional dependence structure of the test set. This cross-sectional dependence structure can be extracted using the fitted (marginal) ARMA–GARCH models and the fitted PCA models (if dimension reduction is applied), as described in Algorithm 2.5.1 below.


[Extracting underlying dependence structure of the test data set]

  1. Compute , and for and via

  2. Obtain a sample from the underlying empirical stationary distribution via the transform , . (Note that whereas .)

  3. Return the pseudo-observations of , for .

Let denote the pseudo-observations obtained from the test data set via Algorithm 2.5.1. Furthermore, let denote a sample generated from either or . We can then compute one realization of the MMD statistic as in (1). In our analysis in Section 3, we use an average statistic based on repeated samples , given by


To compute the AMMD metric above, we simply set . Furthermore, we use a mixture of Gaussian kernels with bandwidth parameters . These bandwidth parameters are purposefully chosen to be different from the bandwidth parameters used in Section 3 below for the GMMN training procedure, to allow for a fairer out-of-sample assessment.

Assessing the quality of an empirical predictive distribution

While there exist numerous metrics to assess univariate or multivariate point forecasts, there are only a handful of metrics that can be utilized to evaluate the quality of dependent multivariate empirical predictive distributions. We now present two such metrics we will use across all numerical examples.

Firstly, we use a version of the mean squared error (MSE) metric defined via the Euclidean norm to assess how well the empirical predictive distribution concentrates around each true value in the test set. To obtain a single numerical value, we work with an average MSE metric computed over the entire test period , defined by


Secondly, we use the variogram score introduced by [scheuerer2015], which, in our context, assesses if the empirical predictive distribution is biased for the distance between any two dimensions. For a single numeric summary, we work with an average variogram score (of order ) over the entire test period ,


As numerically demonstrated by [scheuerer2015], by focusing on pairwise distances between dimensions, this metric discriminates well between various dependence structures. [scheuerer2015] stated that a typical choice of the variogram order is $p=0.5$, but they also noted in their concluding remarks that smaller values of $p$ could potentially yield more discriminative metrics when dealing with non-Gaussian data.
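A direct Python/NumPy sketch of the (unweighted) variogram score for an ensemble approximating a one-period-ahead predictive distribution; the unit weights and toy sample sizes are our simplifications:

```python
import numpy as np

def variogram_score(samples, y, p=0.5):
    """Variogram score of order p (unit weights) of an m x d ensemble
    `samples`, representing the predictive distribution, against the
    realized d-vector y:
        sum_{i<j} ( |y_i - y_j|^p - mean_m |X_i^(m) - X_j^(m)|^p )^2
    """
    d = len(y)
    vy = np.abs(y[:, None] - y[None, :]) ** p
    vf = (np.abs(samples[:, :, None] - samples[:, None, :]) ** p).mean(axis=0)
    iu = np.triu_indices(d, k=1)                # pairs i < j only
    return (((vy - vf)[iu]) ** 2).sum()

rng = np.random.default_rng(8)
samples = rng.standard_normal((5000, 3))        # predictive ensemble
good = variogram_score(samples, rng.standard_normal(3))

# A degenerate ensemble equal to the realization scores exactly 0:
y = np.array([0.0, 1.0, 3.0])
zero = variogram_score(np.tile(y, (10, 1)), y)
```

Note that the score depends on the realization only through pairwise distances, which is precisely why it is sensitive to the dependence structure rather than to marginal location.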

3 Applications

In this section, we demonstrate the added flexibility of our GMMN–GARCH models when compared to copula–GARCH models. To that end, we focus on modeling multivariate yield curve and exchange rate time series.

Before delving into the two financial econometric applications, we will first detail the selection and setup of component models within our framework that will be utilized for all examples in this section. Specifically, we will describe the choice of marginal time series models, the implementation details for GMMN models, and the choice of parametric copula models used for comparison.

3.1 Multivariate time series modeling: setup and implementation details

Marginal models

For modeling the marginal time series, we take the broad-brush approach and fit ARMA(1,1)–GARCH(1,1) models with scaled $t$ innovation distributions for each component series. As mentioned earlier, ARMA(1,1)–GARCH(1,1) models are popular choices for modeling univariate financial time series. To fit them, we use the fit_ARMA_GARCH(, solver="hybrid") function from the R package qrmtools, which relies on the ugarchfit() function from the R package rugarch (see [ghalanos2019]).

Dependence models: GMMN architecture and training setup

Taking into consideration that we are working with a relatively small number of realizations of time series data in both applications, we find that a single-hidden-layer architecture ($L=1$) provides sufficient flexibility. Given the single hidden layer, we experiment with three NN architectures with (GMMN model 1),