Abstract
Generative moment matching networks (GMMNs) are introduced as dependence models for the joint innovation distribution of multivariate time series (MTS). Following the popular copula–GARCH approach for modeling dependent MTS data, a framework allowing us to take an alternative GMMN–GARCH approach is presented. First, ARMA–GARCH models are utilized to capture the serial dependence within each univariate marginal time series. Second, if the number of marginal time series is large, principal component analysis (PCA) is used as a dimension-reduction step. Last, the remaining cross-sectional dependence is modeled via a GMMN, our main contribution. GMMNs are highly flexible and easy to simulate from, which is a major advantage over the copula–GARCH approach. Applications involving yield curve modeling and the analysis of foreign exchange rate returns are presented to demonstrate the utility of our approach, especially in terms of producing better empirical predictive distributions and making better probabilistic forecasts. All results are reproducible with the demo GMMN_MTS_paper of the R package gnn.
Multivariate timeseries modeling with generative neural networks
Marius Hofert
2020-02-26
Keywords: Generative moment matching networks, learning distributions, copulas, probabilistic forecasts, ARMA–GARCH time series, yield curves, exchange rate dependence.
MSC2010: 62H99, 65C60, 60E05, 00A72, 65C10, 62M10.
1 Introduction
The task of modeling multivariate time series (MTS) data arises in a variety of applications in finance, economics and quantitative risk management. In many situations, a suitable model arises from breaking down this task into two key components: the modeling of serial dependence within each univariate time series and the modeling of cross-sectional dependence between the individual time series. There is a plethora of literature on univariate time series modeling with a wide range of models that are tailor-made for capturing various types of serial patterns such as seasonality, volatility clustering or regime switching. In the realm of financial econometrics, the class of generalized autoregressive conditional heteroscedasticity (GARCH) models (see [bollerslev1986]) is a popular choice. GARCH-type models are designed to account for stylized facts (such as volatility clustering) that are often present in financial return series data; see [mcneilfreyembrechts2015, Chapter 3].
There have been numerous approaches proposed for extending univariate time series modeling approaches to the multivariate case. Within the broad GARCH framework, [bollerslev1990] initially introduced a multivariate model characterized by the distributional assumption of multivariate normality with a constant conditional correlation structure. Dynamic conditional correlation (DCC) GARCH models were then introduced by [engle2002] and [tse2002]. DCC–GARCH models relax the conditional correlation assumption but still utilize multivariate normal distributions to model the cross-sectional dependence between the univariate time series. Leveraging Sklar's theorem ([sklar1959]), [jondeaurockinger2006] and [patton2006] presented a flexible family of multivariate GARCH models where the assumption of multivariate normality has been relaxed to allow for any copula of the joint innovation distribution. This popular modeling approach for MTS data is known as the copula–GARCH approach; see [patton2012] for a brief overview in the context of finance and econometrics. It allows us to flexibly model joint innovation distributions with copulas, thereby decomposing the MTS modeling task into modeling of the marginal (univariate) time series and their cross-sectional dependence. There have been various research papers investigating the calibration of copula–GARCH models, for example the more recent work of [oh2017], [almeida2016] or [aas2016].
While there is a growing collection of copula models used to characterize complex dependence structures, most models are rather limited already in moderately large dimensions and often do not provide an adequate fit to given data (see for example [hofertoldford2018]) or require sophisticated, model-specific algorithms for parameter estimation and model selection. In this paper, we propose a framework for MTS modeling in which a classical copula model to account for cross-sectional dependence is replaced by a type of generative neural network known as the generative moment matching network (GMMN). In comparison to classical copulas, GMMNs can capture a large variety of complex dependence structures. For high-dimensional time series data, we incorporate principal component analysis (PCA) as an intermediate step to reduce the dimensionality. Our primary goal is to construct empirical predictive distributions, also known as probabilistic forecasts, rather than point forecasts. Additionally, these empirical predictive distributions can be utilized to further forecast various quantities of interest (e.g., quantiles) via simulation.
The paper is organized as follows. In Section 2, we outline our framework for modeling MTS data. In particular, we focus on the novel integration of GMMNs within this framework. In Section 3, we showcase our GMMN-based multivariate time series models in applications to yield curve and exchange rate data. Section 4 provides concluding remarks. All results in this paper can be reproduced with the demo GMMN_MTS_paper in the R package gnn.
2 Framework for multivariate time series modeling
Let $\mathbf{X}_t = (X_{t,1},\dots,X_{t,d})$, $t \in \mathbb{Z}$, denote a $d$-dimensional time series of interest. Furthermore, consider a stretch of realizations from $(\mathbf{X}_t)$ denoted by $\mathbf{x}_1,\dots,\mathbf{x}_n$. In applications in finance (risk management), these are often log-returns (negative log-returns) of asset prices; see Section 3 for more details and the preprocessing steps applied to each empirical dataset we consider.
Our suggested framework for modeling consists of three primary components:

marginal time series modeling—while many possibilities can be considered, we focus on ARMA–GARCH models;

dimension reduction—again, many tools are available, but we simply utilize PCA; and

dependence modeling—here, the typical approach is to choose a parametric copula, but we introduce the use of GMMNs, the main contribution of this paper.
While Step 1 and Step 3 are essential, the dimension reduction component in Step 2 is optional and typically only used for high-dimensional time series which are amenable to good approximations by lower-dimensional representations.
2.1 Marginal time series modeling
The ARMA–GARCH models in Step 1 are ARMA models with GARCH errors; see [mcneilfreyembrechts2015, Section 4.2.3]. An ARMA$(p_j, q_j)$–GARCH$(r_j, s_j)$ model has the form
$$X_{t,j} = \mu_{t,j} + \varepsilon_{t,j}, \qquad \mu_{t,j} = \mu_j + \sum_{i=1}^{p_j} \phi_{i,j} (X_{t-i,j} - \mu_j) + \sum_{i=1}^{q_j} \gamma_{i,j} \varepsilon_{t-i,j},$$
$$\varepsilon_{t,j} = \sigma_{t,j} Z_{t,j}, \qquad \sigma_{t,j}^2 = \omega_j + \sum_{i=1}^{r_j} \alpha_{i,j} \varepsilon_{t-i,j}^2 + \sum_{i=1}^{s_j} \beta_{i,j} \sigma_{t-i,j}^2,$$
where, for each component $j = 1,\dots,d$, one has $\omega_j > 0$, $\alpha_{i,j} \ge 0$, $\beta_{i,j} \ge 0$ and $\sigma_{t,j} > 0$ for all $t$. Some additional conditions on the coefficients (namely, the $\phi_{i,j}$'s, $\gamma_{i,j}$'s, $\alpha_{i,j}$'s and $\beta_{i,j}$'s) are necessary to ensure that all ARMA and GARCH processes are respectively causal and covariance stationary (see, e.g., [mcneilfreyembrechts2015, Sections 4.1.2–4.2.2]), but we do not go into detail here.
For each $j$, the innovations $Z_{t,j}$ in the definition of the ARMA–GARCH model are independent and identically distributed (iid) random variables with $\mathbb{E}(Z_{t,j}) = 0$ and $\operatorname{var}(Z_{t,j}) = 1$; their realizations after fitting marginal ARMA$(p_j, q_j)$–GARCH$(r_j, s_j)$ models are known as standardized residuals and denoted by $\hat{z}_{t,j}$, $t = 1,\dots,n$, $j = 1,\dots,d$. In financial time series applications, common choices of innovation distributions include the standard normal, the scaled $t$ and the skewed $t$ distribution.
Fitting the marginal time series models is typically done by fitting low-order models with likelihood-based methods and selecting the most adequate fit among the candidate models using the AIC/BIC model selection criteria. A popular broad-brush approach for financial return series is to fit an ARMA(1,1)–GARCH(1,1) model and continue the modeling based on the standardized residuals $\hat{z}_{t,j}$; see [mcneilfreyembrechts2015, Chapter 4] or [hofertkojadinovicmaechleryan2018, Section 6.2.3]. This procedure is also referred to as deGARCHing. With the help of model diagnostic tools (for example, plots of the autocorrelation function (ACF) of the standardized residuals and of their squared values, Ljung–Box tests or assessment of the innovation distribution through Q–Q plots) one can then assess the adequacy of each marginal time series model. In what follows we use $\hat{\mu}_{t,j}$ and $\hat{\sigma}_{t,j}$ to denote the estimated conditional mean and volatility for the $j$th marginal time series with corresponding chosen orders and fitted parameters.
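In practice this step is carried out with dedicated R packages (see Section 3). Purely as a language-neutral illustration, the following NumPy sketch shows the de-GARCHing recursion for the simplest case of a constant-mean GARCH(1,1) margin with hypothetical, already-fitted parameters; the function name `degarch` is ours, not part of any package.

```python
import numpy as np

def degarch(x, mu, omega, alpha, beta):
    """Standardize a series under a fitted constant-mean GARCH(1,1):
    sigma2_t = omega + alpha * eps_{t-1}^2 + beta * sigma2_{t-1}."""
    eps = x - mu                                  # mean-corrected residuals
    sigma2 = np.empty_like(eps)
    sigma2[0] = omega / (1.0 - alpha - beta)      # start at unconditional variance
    for t in range(1, len(eps)):
        sigma2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
    return eps / np.sqrt(sigma2)                  # standardized residuals z_t

rng = np.random.default_rng(1)
x = 0.1 + rng.standard_normal(500)                # toy "return" series
z = degarch(x, mu=0.1, omega=0.05, alpha=0.1, beta=0.85)
```

An ARMA(1,1) conditional mean would add autoregressive and moving-average terms to the recursion, but the filtering idea is the same.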
Having accounted for the marginal serial dependence in this way, the subsequent analysis in our modeling framework operates on the standardized residuals $\hat{z}_{t,j}$, which are themselves realizations of the innovations $Z_{t,j}$, assumed to be iid in the copula–GARCH approach.
Before we continue, we emphasize once again that any other adequate marginal time series modeling approach can be applied in our framework, as long as the model's residuals can be considered iid from continuous marginal distributions. Our choice of ARMA–GARCH models is motivated only by the fact that these are the most popular marginal time series models used in practice.
2.2 Dimension reduction
Two popular dimension-reduction techniques for multivariate financial time series are factor modeling and PCA; see [mcneilfreyembrechts2015, Chapter 6] and the references therein for a brief summary. An approach that is perhaps less discussed in the financial econometrics literature involves using autoencoder neural networks for dimension reduction, in which two separate neural network mappings are learned to and from the lower-dimensional space; see [hinton2006]. As dimension reduction is not our main contribution in this work, we simply utilize PCA in what follows.
Note that PCA is often applied to the original MTS data in the literature; see, e.g., [alexander2000] for an investigation of the so-called orthogonal GARCH model. Apart from reducing the burden of marginal time series modeling, there is no strong reason why PCA should be applied to potentially non-stationary data. If dimension reduction is necessary, we find it statistically more sound to apply PCA to the standardized residuals after first accounting for any serial dependence in the marginal time series.
Let $\hat{\Sigma}$ denote the sample covariance matrix of the standardized residuals $\hat{\mathbf{z}}_t$, $t = 1,\dots,n$. The result from PCA is the orthogonal matrix $\Gamma$ whose columns consist of the eigenvectors of $\hat{\Sigma}$, sorted according to decreasing eigenvalues $\lambda_1 \ge \dots \ge \lambda_d$. For the purposes of dimension reduction, the $\hat{\mathbf{z}}_t$, $t = 1,\dots,n$, are transformed to $\hat{\mathbf{y}}_t = \Gamma_k^\top \hat{\mathbf{z}}_t$, where $\Gamma_k$ represents the first $k$ columns of $\Gamma$ for some $k \le d$. As a result, the sample covariance matrix of $\hat{\mathbf{y}}_1,\dots,\hat{\mathbf{y}}_n$ is (approximately) diagonal, and the components of $\hat{\mathbf{y}}_t$ are (approximately) uncorrelated. The $j$th component series $\hat{y}_{t,j}$, $t = 1,\dots,n$, forms realizations of the $j$th principal component, and the first $k$ principal component series account for the fraction $\sum_{j=1}^{k} \lambda_j / \sum_{j=1}^{d} \lambda_j$ of the total variance.
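The PCA step just described can be sketched in a few lines of NumPy (the helper name `pca_reduce` is ours); note that `numpy.linalg.eigh` returns eigenvalues in ascending order, so they must be re-sorted decreasingly.

```python
import numpy as np

def pca_reduce(Z, k):
    """Project standardized residuals Z (n x d) onto the first k principal axes."""
    Sigma = np.cov(Z, rowvar=False)                # sample covariance matrix
    lam, Gamma = np.linalg.eigh(Sigma)             # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]                  # sort decreasingly
    lam, Gamma = lam[order], Gamma[:, order]
    Y = Z @ Gamma[:, :k]                           # principal component series
    explained = lam[:k].sum() / lam.sum()          # fraction of total variance
    return Y, Gamma[:, :k], explained

rng = np.random.default_rng(7)
Z = rng.standard_normal((1000, 5)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1])
Y, Gk, expl = pca_reduce(Z, k=2)                   # first two components dominate
```

In this toy example the first two components carry most of the variance by construction, mimicking the situation in which dimension reduction is worthwhile.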
As dimension reduction is an optional component in our modeling framework, the next step involves dependence modeling of either the standardized residuals or the principal components. To unify the notation going forward, we work with the series $\hat{\mathbf{y}}_t = \Gamma_k^\top \hat{\mathbf{z}}_t$, $t = 1,\dots,n$, where $\Gamma_k$ is as above if dimension reduction is employed and $\Gamma_k = I_d$ (the identity matrix in $\mathbb{R}^{d \times d}$) otherwise; consequently, the dimension of $\hat{\mathbf{y}}_t$ is $k$ in the former case and $d$ in the latter. Furthermore, we treat $\hat{\mathbf{y}}_1,\dots,\hat{\mathbf{y}}_n$ as realizations from an underlying random vector $\mathbf{Y}_t$ of that dimension.
2.3 Dependence modeling
The final task in our framework involves the modeling of the iid series $\mathbf{Y}_t$. To account for cross-sectional dependence, we model the joint distribution function $H$ of $\mathbf{Y}_t$ using Sklar's theorem as
$$H(\mathbf{y}) = C(F_1(y_1),\dots,F_k(y_k)),$$
where $F_1,\dots,F_k$ are the margins of $H$ and $C$ is the copula of $\mathbf{Y}_t$.
Following a classical copula modeling approach, one first builds the pseudo-observations $\hat{u}_{t,j} = r_{t,j} / (n + 1)$, $t = 1,\dots,n$, $j = 1,\dots,k$, where $r_{t,j}$ denotes the rank of $\hat{y}_{t,j}$ among $\hat{y}_{1,j},\dots,\hat{y}_{n,j}$. The pseudo-observations are viewed as realizations from $C$ based on which one would fit candidate copula models; see, for example, [mcneilfreyembrechts2015, Section 7.5.1] or [hofertkojadinovicmaechleryan2018, Section 4.1.2]. Note that by considering (nonparametric) pseudo-observations (even in the case when we do not apply a dimension reduction technique and thus know the (fitted) marginal innovation distributions), we reduce the risk that misspecifying one of the margins affects the estimation of the copula $C$; see [genest2010] for a theoretical justification of this approach. Therefore, going forward, we will use the pseudo-observations $\hat{\mathbf{u}}_t$, $t = 1,\dots,n$, to model the cross-sectional dependence structure of $\mathbf{Y}_t$.
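The rank transform to pseudo-observations is straightforward to implement; the following NumPy sketch (helper name ours) uses a double `argsort` to obtain componentwise ranks, which is unambiguous for continuous data where ties are negligible.

```python
import numpy as np

def pseudo_obs(Z):
    """Componentwise scaled ranks: u_{t,j} = rank(z_{t,j}) / (n + 1) in (0, 1)."""
    n = Z.shape[0]
    ranks = np.argsort(np.argsort(Z, axis=0), axis=0) + 1  # ranks 1..n per column
    return ranks / (n + 1.0)

Z = np.array([[3.0, 10.0],
              [1.0, 30.0],
              [2.0, 20.0]])
U = pseudo_obs(Z)   # each column is a permutation of {1/4, 2/4, 3/4}
```

Dividing by n + 1 rather than n keeps the pseudo-observations strictly inside the open unit hypercube, which matters for copula log-likelihoods evaluated at the boundary.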
Dependence modeling with parametric copulas
A traditional approach for modeling the cross-sectional dependence involves the fitting of parametric copula models, their goodness-of-fit assessment and, finally, model selection. There are numerous families of copula models to consider depending on prominent features of the dependence structure present in the data, such as (a)symmetries or a concentration of points in the lower/upper tail of the joint distribution (or pairs of such), which hints at an adequate model possessing tail dependence.
A well-known problem with this approach is that it is often hard to find an adequate copula model for given real-life data, especially in higher dimensions where typically some pairwise dependencies contradict the corresponding model-implied marginal copulas; see, for example, [hofertoldford2018]. Another problem is that certain copula models are computationally expensive to fit and test for goodness-of-fit. In Section 3, we investigate whether (the much more flexible) GMMNs can outperform prominent elliptical and Archimedean copulas in the context of our framework. To this end, in what follows we work with a (generic) parametric copula model fitted to the pseudo-observations.
Dependence modeling with GMMNs
We propose to utilize generative neural networks (in particular, GMMNs) for modeling the cross-sectional dependence structure of the pseudo-observations. In our framework, a generative neural network $f_{\theta}$ with parameter vector $\theta$ learns the distribution of the pseudo-observations. Let $\hat{C}$ denote the empirical copula based on a sample generated from a trained GMMN $f_{\hat{\theta}}$, that is, a GMMN with fitted parameter vector $\hat{\theta}$.
GMMNs, also known as maximum mean discrepancy (MMD) nets, were introduced independently by [li2015] and [dziugaite2015]. A GMMN utilizes a kernel maximum mean discrepancy statistic as the loss function to learn the distribution of the pseudo-observations. Conceptually, the GMMN can be thought of as a parametric map $f_{\theta}$ from a random vector $\mathbf{V}$ with (known) prior distribution to the target random vector. As is standard in the literature, we assume that the prior samples are iid. Typical choices of the prior distribution are the multivariate standard normal $N(\mathbf{0}, I_p)$ or the standard uniform $U(0,1)^p$; we utilize the latter. Based on the fitted GMMN we can then generate samples whose empirical copula serves as an approximation to the target copula $C$ of $\mathbf{Y}_t$. As demonstrated in [hofert2018], GMMNs provide a flexible class of models capable of learning a variety of complex dependence structures.
Next, we briefly discuss three key aspects when applying GMMNs: the architecture of the neural network, the loss function and the training procedure for the estimation of the parameter vector $\theta$.
Feedforward neural networks
We now introduce the neural networks we work with in this paper. The feedforward neural network (also known as the multilayer perceptron) is the quintessential deep neural network, which we simply refer to as neural network (NN) in what follows. Let $L$ be the number of hidden layers in the NN and, for $l = 0, 1, \dots, L+1$, let $d_l$ be the dimension of layer $l$, that is, the number of neurons in layer $l$. Layer $0$ refers to the input layer, which consists of the input to the network, and layer $L+1$ refers to the output layer, which consists of the network's output. Layer $l$, $l = 1, \dots, L+1$, can be described in terms of the output of layer $l-1$ via
$$\mathbf{a}_l = \varphi_l(W_l \mathbf{a}_{l-1} + \mathbf{b}_l),$$
with weight matrices $W_l \in \mathbb{R}^{d_l \times d_{l-1}}$, bias vectors $\mathbf{b}_l \in \mathbb{R}^{d_l}$ and activation functions $\varphi_l$; the latter are understood to be applied componentwise for vector inputs. Some commonly used activation functions include the sigmoid activation function $\varphi(x) = 1 / (1 + e^{-x})$ and the rectified linear unit (ReLU) activation function $\varphi(x) = \max\{0, x\}$.
The NN can then be written as the composition
$$f_{\theta} = \varphi_{L+1}(W_{L+1} \cdot {} + \mathbf{b}_{L+1}) \circ \dots \circ \varphi_1(W_1 \cdot {} + \mathbf{b}_1),$$
with its flattened parameter vector $\theta = (W_1, \mathbf{b}_1, \dots, W_{L+1}, \mathbf{b}_{L+1})$. Figure 1 visualizes this construction and the notation we use. GMMNs are NNs of this type which, for training (i.e., the fitting of $\theta$), utilize a specific loss function introduced next.
Loss function
To learn $\theta$, we work with training data consisting of the pseudo-observations $\hat{\mathbf{u}}_1,\dots,\hat{\mathbf{u}}_n$. Given an input sample $\mathbf{v}_1,\dots,\mathbf{v}_n$ from the prior distribution, the GMMN generates an output sample $f_{\theta}(\mathbf{v}_1),\dots,f_{\theta}(\mathbf{v}_n)$. In selecting an appropriate loss function, we are naturally interested in measuring whether the two samples can be deemed to come from the same distribution.
To do so, GMMNs use the maximum mean discrepancy (MMD) as loss function, which was introduced as a two-sample test statistic by [gretton2007]. For a given embedding function $\Phi$, the MMD measures the distance between two sample statistics in the embedded space via
$$\mathrm{MMD} = \Big\| \frac{1}{n} \sum_{t=1}^{n} \Phi(\hat{\mathbf{u}}_t) - \frac{1}{n} \sum_{t=1}^{n} \Phi(f_{\theta}(\mathbf{v}_t)) \Big\|.$$
If we can choose $\Phi$ to be a kind of "distributional embedding", for example, in the sense that the two statistics contain all empirical moments of the two samples, then the MMD criterion will have achieved our desired purpose (of measuring whether the two samples have the same distribution). Remarkably, such an embedding does exist.
By the so-called "kernel trick", known as early as [mercer1909] but not widely used until support vector machines became popular almost a century later, the inner product $\Phi(\mathbf{x})^\top \Phi(\mathbf{y})$ can be computed in a reproducing kernel Hilbert space by $K(\mathbf{x}, \mathbf{y})$, where $K$ denotes a kernel similarity function. Hence, for a given kernel function $K$, the squared statistic above is equivalent to
$$\mathrm{MMD}^2 = \frac{1}{n^2} \sum_{s,t=1}^{n} K(\hat{\mathbf{u}}_s, \hat{\mathbf{u}}_t) - \frac{2}{n^2} \sum_{s,t=1}^{n} K(\hat{\mathbf{u}}_s, f_{\theta}(\mathbf{v}_t)) + \frac{1}{n^2} \sum_{s,t=1}^{n} K(f_{\theta}(\mathbf{v}_s), f_{\theta}(\mathbf{v}_t)). \qquad (1)$$
If $K$ is chosen to be a so-called universal kernel function, such as a Gaussian or Laplace kernel, then the associated implicit embedding $\Phi$ is indeed a "distributional embedding" in the sense described above, and one can show that the MMD statistic converges in probability to $0$ as $n \to \infty$ if and only if the two samples come from the same distribution; see [gretton2007] and [gretton2012].
As suggested by [li2015], we opt to work with a mixture of Gaussian kernels (rather than a single Gaussian kernel) with different bandwidth parameters,
$$K(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n_k} K_{\sigma_i}(\mathbf{x}, \mathbf{y}), \qquad (2)$$
where $n_k$ denotes the number of mixture components and $K_{\sigma}(\mathbf{x}, \mathbf{y}) = \exp(-\|\mathbf{x} - \mathbf{y}\|^2 / (2\sigma^2))$ is the Gaussian kernel with bandwidth parameter $\sigma > 0$. Specific details on the choice of the number of mixture components and the bandwidth parameters $\sigma_1,\dots,\sigma_{n_k}$ will be provided in Section 3.
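The (biased) sample version of the squared MMD in (1), with a mixture-of-Gaussians kernel as in (2), can be computed directly from pairwise squared distances. Below is a compact NumPy sketch (function name and bandwidths illustrative, not the ones used in Section 3); the statistic is exactly zero when both samples coincide and grows as the samples' distributions diverge.

```python
import numpy as np

def mmd2(X, Y, sigmas=(0.1, 0.5, 1.0)):
    """Biased estimator of squared MMD with kernel
    K(x, y) = sum_i exp(-||x - y||^2 / (2 * sigma_i^2))."""
    def K(A, B):
        D2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
        return sum(np.exp(-D2 / (2.0 * s ** 2)) for s in sigmas)
    n, m = len(X), len(Y)
    return K(X, X).sum() / n**2 - 2.0 * K(X, Y).sum() / (n * m) + K(Y, Y).sum() / m**2

rng = np.random.default_rng(3)
U = rng.uniform(size=(200, 2))               # sample from the uniform target
V = rng.uniform(size=(200, 2))               # an independent uniform sample
W = rng.beta(5.0, 1.0, size=(200, 2))        # a clearly non-uniform sample
```

Comparing `mmd2(U, V)` (two samples from the same distribution) with `mmd2(U, W)` illustrates the two-sample-test interpretation: the latter is markedly larger.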
Thus, to train the GMMN $f_{\theta}$, we perform the optimization
$$\hat{\theta} = \operatorname*{argmin}_{\theta} \mathrm{MMD}^2\big((\hat{\mathbf{u}}_t)_{t=1}^n, (f_{\theta}(\mathbf{v}_t))_{t=1}^n\big),$$
where $\mathbf{v}_1,\dots,\mathbf{v}_n$ are drawn from the prior distribution and the NN transform $f_{\theta}$ is understood to be applied row-wise.
Training GMMNs
We now discuss how we can train the GMMN $f_{\theta}$, that is, how we can estimate the parameter vector $\theta$. For the sake of convenience, we always set the size of the generated sample in the loss equal to the training sample size $n$ while training the GMMN. (However, after training we can still generate an arbitrary number of samples from $f_{\hat{\theta}}$.)
Directly optimizing the loss function in (1), also known as batch optimization, involves all pairs of observations, which is memory-prohibitive for even moderately large $n$. Hence, we adopt a minibatch optimization procedure, where we partition the training dataset into batches of size $n_{\text{bat}}$ and use the batches sequentially to update $\theta$. After all the training data are exhausted, i.e., after roughly $n / n_{\text{bat}}$ gradient steps, one epoch of the training of the GMMN is completed. Batch optimization results as a special case of this minibatch optimization procedure when we set $n_{\text{bat}} = n$; it can be used with relatively small datasets. To update the parameters $\theta$, we utilize the Adam optimizer of [kingma2014b], which combines the current gradient with exponentially weighted averages of past gradients from earlier iterations. The trade-off in utilizing minibatches, particularly with a smaller batch size $n_{\text{bat}}$, is that only a partial loss function is used when computing each gradient step in the optimization.
Algorithm 2.3 summarizes the training of the GMMN with a minibatch optimization procedure.
[Training GMMNs]

Fix the number of epochs and the sample size per batch (the so-called batch size), where the batch size is assumed to divide $n$. Initialize the epoch counter and the GMMN's parameter vector $\theta$; we follow [glorot2010] and initialize the weight matrices with draws from a suitably scaled distribution and the bias vectors with zeros.

For each epoch, do:

Randomly partition the training sample and the prior distribution sample into corresponding non-overlapping batches of the chosen batch size.

For each batch, do:

Compute the GMMN output $f_{\theta}(\mathbf{v}_t)$ for each prior sample $\mathbf{v}_t$ in the current batch.

Compute the gradient of the batch loss (1) with respect to $\theta$ via automatic differentiation.

Take a gradient step to update $\theta$ according to Adam; see [kingma2014b, Algorithm 1].



Return the final parameter estimate $\hat{\theta}$; the fitted GMMN is then $f_{\hat{\theta}}$.
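The random batch-partitioning step of Algorithm 2.3 can be sketched as follows (a minimal NumPy sketch with an illustrative function name; real GMMN training would pair it with an automatic-differentiation framework and the Adam update):

```python
import numpy as np

def minibatches(n, n_bat, rng):
    """Randomly partition indices {0, ..., n-1} into non-overlapping batches
    of size n_bat (n_bat is assumed to divide n)."""
    idx = rng.permutation(n)                       # random shuffle of indices
    return [idx[i:i + n_bat] for i in range(0, n, n_bat)]

batches = minibatches(n=100, n_bat=25, rng=np.random.default_rng(0))
```

Each epoch re-shuffles and re-partitions, so a given training point is paired with different batch mates across epochs; the same index batches are applied to the training sample and the prior sample so the two batches stay aligned.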
2.4 Simulating paths of dependent multivariate time series
After utilizing our framework for modeling multivariate time series, a typical next step is to simulate paths from the fitted/trained multivariate model. With these simulated paths we immediately obtain empirical predictive distributions at future time points. Additionally, we can forecast quantities of interest such as (confidence) intervals or risk measures (for example, value-at-risk or expected shortfall) based on the simulated paths. Some of these quantities will be discussed further in Section 3. In this section, we focus on how to simulate the required paths in our framework.
To fix ideas, suppose we are interested in $m$ future time points and let $h$ denote the simulation horizon. Then, for each such time point, once all realizations up to and including the current time become available, we can simulate multiple paths going forward for a total of $h$ time periods.
A key component for simulating these paths is the generation of samples from the estimated dependence model. For fitted parametric copulas, one typically uses a model-specific stochastic representation to sample; see, for example, [hofertkojadinovicmaechleryan2018, Chapter 3]. Sampling from the fitted GMMN $f_{\hat{\theta}}$ (with corresponding empirical copula) can be done as follows.
[GMMN sampling]

Fix the number of samples $n_{\text{gen}}$ to generate from $f_{\hat{\theta}}$.

Draw $\mathbf{v}_1,\dots,\mathbf{v}_{n_{\text{gen}}}$ from the prior distribution.

Return $f_{\hat{\theta}}(\mathbf{v}_i)$, $i = 1,\dots,n_{\text{gen}}$.
Since copulas have standard uniform margins, we typically equip Algorithm 2.4 with a post-processing step that returns the pseudo-observations of the generated sample, to remove any residual marginal non-uniformity from the GMMN samples.
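The sampling-plus-post-processing recipe can be sketched as follows; since we do not reproduce an actual trained GMMN here, a fixed linear map that induces cross-sectional dependence stands in for the generator (all names are ours).

```python
import numpy as np

def sample_gmmn(f, p, n, rng):
    """Draw n prior samples, push them through a generator f, then
    rank-transform the output so the sample has uniform margins."""
    V = rng.uniform(size=(n, p))                 # U(0,1)^p prior samples
    X = f(V)                                     # generator output, shape (n, d)
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1
    return ranks / (n + 1.0)                     # empirical-copula sample

# toy stand-in for a trained GMMN: a linear map creating dependence
A = np.array([[1.0, 0.8],
              [0.0, 0.6]])
U = sample_gmmn(lambda v: v @ A, p=2, n=500, rng=np.random.default_rng(0))
```

After the rank transform, each column of `U` is exactly {1, ..., n}/(n + 1) in some order, so only the dependence structure of the generator output survives.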
For any given future time point, we can now utilize Algorithm 2.4 along with the fitted marginal time series models in our framework in order to simulate multiple paths with a fixed simulation horizon $h$, as outlined in Algorithm 2.4 below.
[Simulating paths of dependent multivariate time series via GMMNs]

Fix the number of sample paths and the simulation horizon .

For each path, do:

Generate , , , from the fitted GMMN via Algorithm 2.4.

For every in Step 2a, construct . If no dimension reduction is utilized, the marginals , , are the fitted parametric innovation distributions selected as part of the ARMA–GARCH model setup; otherwise, they are the empirical distribution functions of , .

For every in Step 2b, construct samples from the fitted innovation distributions via the transform . (Note that whereas .)

For each , compute , and , for and , via
where, for , set , , and for all .

Return , , .

Note that, if one wishes, Step 2a in Algorithm 2.4 can be replaced by sampling from the fitted parametric copula to obtain the classically applied approach for sampling paths.
While Algorithm 2.4 describes how to simulate paths of multivariate time series for any simulation horizon $h$, we will focus on one-period-ahead ($h = 1$) empirical predictive distributions henceforth.
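For the one-period-ahead case, the marginal part of the path simulation reduces to a single volatility update per margin. The following NumPy sketch illustrates this for constant-mean GARCH(1,1) margins (the paper's algorithm additionally carries ARMA mean terms); all names and parameter values are illustrative, and in practice the innovation matrix would come from the fitted GMMN rather than from independent normals.

```python
import numpy as np

def one_step_paths(z_next, eps_last, sigma2_last, mu, omega, alpha, beta):
    """One-period-ahead paths for d constant-mean GARCH(1,1) margins driven
    by (possibly dependent) innovations z_next of shape (n_paths, d)."""
    sigma2_next = omega + alpha * eps_last ** 2 + beta * sigma2_last  # shape (d,)
    return mu + np.sqrt(sigma2_next) * z_next                         # X_{t+1} paths

d, n_paths = 3, 1000
rng = np.random.default_rng(2)
z_next = rng.standard_normal((n_paths, d))   # placeholder innovations; the GMMN
                                             # would supply dependent ones
x_next = one_step_paths(z_next,
                        eps_last=np.zeros(d), sigma2_last=np.ones(d),
                        mu=np.zeros(d), omega=0.05 * np.ones(d),
                        alpha=0.1 * np.ones(d), beta=0.85 * np.ones(d))
```

The rows of `x_next` form the empirical one-period-ahead predictive distribution from which forecasts of quantiles or risk measures can be read off.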
2.5 Assessing the quality of predictions of dependent multivariate time series models
In this section, we discuss the metrics we will use in all numerical investigations in this paper to assess and compare various MTS models. Of particular interest is the comparison of GMMN–GARCH and copula–GARCH models. In practice, to assess the out-of-sample performance of our models, realizations of time series will naturally be divided into separate training and test datasets. To that end, suppose that we have $m$ realizations that have been set aside (i.e., not used for training) as a separate test set; we will also refer to this stretch as the test period.
Assessing the quality of dependence models in the test period
We can use the MMD statistic (1) to measure how closely the empirical distributions of a fitted GMMN and a fitted parametric copula match the cross-sectional dependence structure of the test set. This cross-sectional dependence structure can be extracted using the fitted (marginal) ARMA–GARCH models and the fitted PCA transformation (if dimension reduction is applied), as described in Algorithm 2.5.1 below.
[Extracting underlying dependence structure of the test data set]

Compute , and for and via

Obtain a sample from the underlying empirical stationary distribution via the transform , . (Note that whereas .)

Return the pseudo-observations of the resulting sample.
Let $\hat{\mathbf{u}}_1^{\text{test}},\dots,\hat{\mathbf{u}}_m^{\text{test}}$ denote the pseudo-observations obtained from the test dataset via Algorithm 2.5.1, and let a sample of the same size be generated from either the fitted GMMN or the fitted parametric copula. We can then compute one realization of the MMD statistic as in (1). In our analysis in Section 3, we use an average statistic based on $n_{\text{rep}}$ repeated generated samples, given by
$$\mathrm{AMMD} = \frac{1}{n_{\text{rep}}} \sum_{r=1}^{n_{\text{rep}}} \mathrm{MMD}_r, \qquad (3)$$
where $\mathrm{MMD}_r$ denotes the statistic (1) computed between the test pseudo-observations and the $r$th generated sample. To compute the AMMD metric above, we set the generated sample size equal to the test sample size $m$. Furthermore, we use a mixture of Gaussian kernels whose bandwidth parameters are purposefully chosen to be different from those used in Section 3 below for the GMMN training procedure, to allow for a fairer out-of-sample assessment.
Assessing the quality of an empirical predictive distribution
While there exist numerous metrics to assess univariate or multivariate point forecasts, there are only a handful of metrics that can be utilized to evaluate the quality of dependent multivariate empirical predictive distributions. We now present two such metrics we will use across all numerical examples.
Firstly, we use a version of the mean squared error (MSE) metric defined via the Euclidean norm to assess how well the empirical predictive distribution concentrates around each true value in the test set. To obtain a single numerical value, we work with an average MSE metric computed over the entire test period, defined by
$$\mathrm{AMSE} = \frac{1}{m} \sum_{t=1}^{m} \frac{1}{n_{\text{gen}}} \sum_{i=1}^{n_{\text{gen}}} \big\| \hat{\mathbf{X}}_t^{(i)} - \mathbf{x}_t^{\text{test}} \big\|_2^2, \qquad (4)$$
where $\hat{\mathbf{X}}_t^{(1)},\dots,\hat{\mathbf{X}}_t^{(n_{\text{gen}})}$ denote the simulated one-period-ahead values at test time $t$ and $\mathbf{x}_t^{\text{test}}$ the corresponding realized value.
Secondly, we use the variogram score introduced by [scheuerer2015], which, in our context, assesses whether the empirical predictive distribution is biased for the distance between any two dimensions. For a single numeric summary, we work with an average variogram score (of order $p$) over the entire test period,
$$\mathrm{AVS}_p = \frac{1}{m} \sum_{t=1}^{m} \sum_{i=1}^{d} \sum_{j=i+1}^{d} \Big( \big| x_{t,i}^{\text{test}} - x_{t,j}^{\text{test}} \big|^p - \frac{1}{n_{\text{gen}}} \sum_{l=1}^{n_{\text{gen}}} \big| \hat{X}_{t,i}^{(l)} - \hat{X}_{t,j}^{(l)} \big|^p \Big)^2. \qquad (5)$$
As numerically demonstrated by [scheuerer2015], by focusing on pairwise distances between dimensions, this metric discriminates well between various dependence structures. [scheuerer2015] stated that a typical choice of the variogram order might be $p = 0.5$, but they also noted in their concluding remarks that smaller values of $p$ could potentially yield more discriminative metrics when dealing with non-Gaussian data.
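For a single test time point, the variogram score can be computed directly from the simulated sample (a NumPy sketch; the function name is ours): the observed pairwise distance in each dimension pair is compared with its forecast average under the empirical predictive distribution.

```python
import numpy as np

def variogram_score(samples, y, p=0.5):
    """Variogram score of order p of an empirical predictive distribution
    `samples` (n x d) against an observation y (length d); cf. [scheuerer2015]."""
    d = len(y)
    score = 0.0
    for i in range(d):
        for j in range(i + 1, d):
            obs = np.abs(y[i] - y[j]) ** p                          # observed distance
            fc = np.mean(np.abs(samples[:, i] - samples[:, j]) ** p)  # forecast mean
            score += (obs - fc) ** 2
    return score

y = np.array([1.0, 2.0, 4.0])
```

A predictive sample whose pairwise distances match the observation exactly scores zero, and any mismatch in the implied dependence structure increases the score; the average over the test period then gives (5).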
3 Applications
In this section, we demonstrate the added flexibility of our GMMN–GARCH models when compared to copula–GARCH models. To that end, we focus on modeling multivariate yield curve and exchange rate time series.
Before delving into the two financial econometric applications, we will first detail the selection and setup of component models within our framework that will be utilized for all examples in this section. Specifically, we will describe the choice of marginal time series models, the implementation details for GMMN models, and the choice of parametric copula models used for comparison.
3.1 Multivariate time series modeling: setup and implementation details
Marginal models
For modeling the marginal time series, we take the broad-brush approach and choose to fit ARMA(1,1)–GARCH(1,1) models with scaled $t$ innovation distributions for each component series. As mentioned earlier, ARMA(1,1)–GARCH(1,1) models are popular choices for modeling univariate financial time series. To fit these models, we use the fit_ARMA_GARCH(, solver="hybrid") function from the R package qrmtools, which relies on the ugarchfit() function from the R package rugarch (see [ghalanos2019]).
Dependence models: GMMN architecture and training setup
Taking into consideration that we are working with a relatively small number of realizations of time series data in both applications, we find that a single hidden layer architecture provides sufficient flexibility. Given the single hidden layer, we experiment with three NN architectures with (GMMN model 1),