Efficient goodness-of-fit tests in multi-dimensional vine copula models
We introduce a new goodness-of-fit test for regular vine (R-vine) copula models, a flexible class of multivariate copulas based on a pair-copula construction (PCC). The test arises from the information matrix ratio. The corresponding test statistic is derived and its asymptotic normality is proven. The test’s power is investigated and compared to 14 other goodness-of-fit tests, adapted from the bivariate copula case, in a high dimensional setting. The extensive simulation study shows the excellent performance with respect to size and power as well as the superiority of the information matrix ratio based test against most other goodness-of-fit tests. The best performing tests are applied to a portfolio of stock indices and their related volatility indices validating different R-vine specifications.
Keywords: copula, goodness-of-fit tests, information matrix ratio test, power comparison, R-vine
Analyzing complex correlated data has received considerable attention in the current statistical literature. Among the many approaches to modeling correlation structures, copula-based models offer a powerful and flexible toolbox for characterizing dependence among variables and have been studied extensively. However, there has been little progress in the theory and methods of goodness-of-fit (GOF) testing, an important aspect of statistical model diagnostics. In fact, most of the published work focuses on bivariate copula models only (see for example Genest et al., 2009).
Copulas join the marginal distributions $F_1,\ldots,F_d$ of a (continuous) random vector with their dependency structure through the joint cumulative distribution function (cdf) $F(x_1,\ldots,x_d) = C(F_1(x_1),\ldots,F_d(x_d))$. Here $C$ is the unique cdf with uniform margins on the unit hypercube $[0,1]^d$ (Sklar, 1959). Classical copula classes such as the elliptical or Archimedean copulas are very limited with respect to flexibility in higher dimensions, but they are very powerful and well understood in the bivariate case. Thus Joe (1996) and later Bedford and Cooke (2001, 2002) independently constructed multivariate densities using bivariate copulas. These constructions make it flexible and feasible to build and compute copula models of relatively large dimension. In Aas et al. (2009) this process is termed a pair-copula construction (PCC) and the statistical inference is developed for it. Since then the theory of vine copulas arising from the PCC has been studied in the literature (see for example Czado, 2010; Stöber and Schepsmeier, 2013; Czado et al., 2012; Dißmann et al., 2013).
Along with the breakthrough of vine copula constructions, model diagnosis becomes imperative in the application of multi-dimensional vine copulas. Developing efficient GOF tests is now a timely task, as already noted in Fermanian (2012), and an important addition to the current literature on vine copulas. In addition, comprehensive comparisons of the relative merits of many classical GOF tests are lacking when they are applied to multi-dimensional copulas. So far, model verification methods for vine copulas are usually based on the likelihood, or on the Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC) as classical comparison measures, which take the model complexity into account.
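In our goodness-of-fit (GOF) tests we would like to test
$$H_0: C \in \mathcal{C}_0 = \{C_\theta : \theta \in \Theta\} \quad \text{against} \quad H_1: C \notin \mathcal{C}_0, \qquad (1)$$
where $C$ denotes the (vine) copula distribution function and $\mathcal{C}_0$ is a class of parametric (vine) copulas with $\Theta \subseteq \mathbb{R}^p$ being the parameter space of dimension $p$.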
For the elliptical and parametric Archimedean copulas many GOF tests have been studied in the literature (Genest et al., 2006, 2009; Berg, 2009; Huang and Prokhorov, 2011). However, a GOF test for vine copula models that verifies the chosen pair-copula families has, to our knowledge, only been treated in Schepsmeier (2013). Aas et al. (2009) already suggested a GOF test for vine copulas based on the multivariate probability integral transformation (PIT) of Rosenblatt (1952), given in the appendix, but never investigated its small sample performance. We will show that this test and many other copula GOF tests have little to no power in the high dimensional setting of a vine and are thus not appropriate there.
The main contribution of this paper is a new GOF test to perform model verification of vine copula models using hypothesis tests. As in Schepsmeier (2013) it is based on the Bartlett identity ($\mathcal{H}(\theta) + \mathcal{C}(\theta) = 0$) as generally suggested by White (1982). Here $\mathcal{H}(\theta)$ is the expected Hessian or variability matrix, and $\mathcal{C}(\theta)$ is the expected outer product of the gradient or sensitivity matrix. In contrast to the White test, which relies on the difference between $-\mathcal{H}(\theta)$ and $\mathcal{C}(\theta)$, our new test is based on the information matrix ratio (IMR), $\Psi(\theta) := -\mathcal{H}(\theta)^{-1}\mathcal{C}(\theta)$ (Zhou et al., 2012).
First, the IMR based test statistic for vine models will be derived and its asymptotic normality under the Bartlett identity will be proven. Secondly, the small sample performance for size and power will be investigated and compared to 14 other GOF tests for vines in a high dimensional setting ( and ). In particular, we will compare to GOF tests based on the
difference of Bartlett identity or
empirical copula process , with ,
and being the copula with estimated parameter(s) , and/or
For the tests based on the multivariate PIT, aggregation to univariate test data is carried out using different aggregation functions. Standard univariate GOF test statistics such as Anderson-Darling (AD), Cramér-von Mises (CvM) and Kolmogorov-Smirnov (KS) are then applied to the univariate test data. In contrast, the empirical copula process (ECP) based tests use the multivariate Cramér-von Mises (mCvM) and multivariate Kolmogorov-Smirnov (mKS) test statistics. The different GOF tests are given in the appendix for the convenience of the reader.
The power study will show that the information based GOF tests, namely the information matrix difference approach of Schepsmeier (2013) and in particular our new IMR based test, outperform the other GOF tests in terms of size and power. The PIT based GOF tests reveal little to no power against the considered alternatives. Applying the PIT transformed data to the empirical copula process, as first suggested by Genest et al. (2009), is more promising. Here the parametric copula with estimated parameters is replaced by the independence copula in the ECP.
The remainder of this paper is structured as follows: Section 2 gives an introduction to vine copula models. The newly proposed IR test is introduced and its test statistic is derived in Section 3, where the asymptotic normality of the test statistic is also proven. Further GOF tests extended from known copula GOF tests are given in Section 4. Section 5 contains the extensive power comparison study investigating their size and power. An application to an 8-dimensional portfolio of stock indices and their related volatility indices is performed in Section 6, comparing different vine specifications and the proposed GOF tests. The final Section 7 summarizes and points out areas of further research.
2 Regular vine copula model
Pair-copula constructions (PCC) are a very flexible way to model multivariate distributions with copulas. The model is based on the decomposition of the $d$-dimensional density into (conditional) bivariate copula densities. Bedford and Cooke (2001, 2002) introduced linked trees $T_i = (N_i, E_i)$, $i = 1,\ldots,d-1$, where $N_i$ denotes the set of nodes while $E_i$ represents the set of edges, which help to organize the vine construction. The following conditions have to be fulfilled to call a sequence of trees a vine:
$T_1$ is a tree with nodes $N_1 = \{1,\ldots,d\}$ and edges $E_1$.
For $i \geq 2$, $T_i$ is a tree with nodes $N_i = E_{i-1}$ and edges $E_i$.
If two nodes in tree $T_{i+1}$ are joined by an edge, the corresponding edges in $T_i$ must share a common node (proximity condition).
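The three conditions above can be checked mechanically. The following is a minimal sketch, not taken from the paper: the function names `is_tree` and `is_regular_vine` and the representation of edges as nested `frozenset` objects are assumptions made only for this illustration.

```python
def is_tree(nodes, edges):
    """True if (nodes, edges) is connected and acyclic, checked via union-find."""
    if len(edges) != len(nodes) - 1:
        return False
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra == rb:          # joining two already connected nodes closes a cycle
            return False
        parent[ra] = rb
    return True


def is_regular_vine(d, trees):
    """trees[i] lists the edges of tree T_{i+1}; each edge is a frozenset of two nodes.

    Nodes of T_1 are 1..d; nodes of every later tree are the edges of the
    previous tree (hypothetical encoding chosen for this sketch).
    """
    if len(trees) != d - 1:
        return False
    nodes = set(range(1, d + 1))
    for level, edges in enumerate(trees):
        pairs = [tuple(e) for e in edges]
        if any(len(p) != 2 for p in pairs):
            return False
        if not all(a in nodes and b in nodes for a, b in pairs):
            return False
        if not is_tree(nodes, pairs):
            return False
        # Proximity condition: two joined nodes (edges of the previous tree)
        # must share exactly one node of the previous tree.
        if level > 0 and any(len(a & b) != 1 for a, b in pairs):
            return False
        nodes = set(edges)
    return True


# 4-dimensional D-vine: 1-2-3-4 in the first tree.
e12, e23, e34 = frozenset({1, 2}), frozenset({2, 3}), frozenset({3, 4})
t1 = [e12, e23, e34]
t2 = [frozenset({e12, e23}), frozenset({e23, e34})]
t3 = [frozenset({t2[0], t2[1]})]
valid = is_regular_vine(4, [t1, t2, t3])

# Joining the edges {1,2} and {3,4} violates the proximity condition.
t2_bad = [frozenset({e12, e34}), frozenset({e12, e23})]
invalid = is_regular_vine(4, [t1, t2_bad, t3])
```

In this encoding the proximity condition reduces to a set intersection, which is why edges are stored as sets rather than ordered pairs.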
An example of a vine tree sequence is given in Figure 1. Here and forming the unconditional pair-copulas. The conditional pair-copulas are the edges of tree 2-4. The complete construction of the joint density is given in Example 1.
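Following the notation of Czado (2010) we define a set of bivariate copula densities $\mathcal{B}(\mathcal{V}) = \{c_{j(e),k(e);D(e)} : e \in E_i,\ 1 \leq i \leq d-1\}$, where $\mathcal{V}$ denotes the vine tree sequence, corresponding to edges $e = j(e),k(e)|D(e)$ in $E_i$, for $1 \leq i \leq d-1$. Here $u_{D(e)}$ denotes the subvector of $u = (u_1,\ldots,u_d)$ determined by the set of indices in $D(e)$. The set $D(e)$ is called the conditioning set while the indices $j(e)$ and $k(e)$ form the conditioned set. Then a $d$-dimensional regular vine copula density can be constructed as
$$c_{1,\ldots,d}(u) = \prod_{i=1}^{d-1} \prod_{e \in E_i} c_{j(e),k(e);D(e)}\big(C_{j(e)|D(e)}(u_{j(e)}|u_{D(e)}),\ C_{k(e)|D(e)}(u_{k(e)}|u_{D(e)})\big). \qquad (3)$$
For the copula arguments, the conditional cdfs $C_{j(e)|D(e)}(u_{j(e)}|u_{D(e)})$ and $C_{k(e)|D(e)}(u_{k(e)}|u_{D(e)})$, Joe (1996) developed a formula derived from the first derivative of the corresponding cdf with respect to the second copula argument, i.e.
$$C_{j(e)|D(e)}(u_{j(e)}|u_{D(e)}) = \frac{\partial\, C_{j(e),j'(e);D(e)\setminus j'(e)}\big(C(u_{j(e)}|u_{D(e)\setminus j'(e)}),\ C(u_{j'(e)}|u_{D(e)\setminus j'(e)})\big)}{\partial\, C(u_{j'(e)}|u_{D(e)\setminus j'(e)})}. \qquad (4)$$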
Here $j'(e)$ is an index chosen from the conditioning set $D(e)$, such that $c_{j(e),j'(e);D(e)\setminus j'(e)}$ is in $\mathcal{B}(\mathcal{V})$. In the literature Equation (4) is often called an h-function. It is a recursive function which simplifies the calculation of the density or log-likelihood considerably. See for example Dißmann et al. (2013) for an algorithmic presentation of the log-likelihood of an R-vine. Denoting the pair-copula parameters of $\mathcal{B}(\mathcal{V})$ with $\theta(\mathcal{B}(\mathcal{V}))$, a vine copula model with density given in (3) is abbreviated as $RV(\mathcal{V}, \mathcal{B}(\mathcal{V}), \theta(\mathcal{B}(\mathcal{V})))$. We assume that the copula $C_{j(e),j'(e);D(e)\setminus j'(e)}$ does not depend on the values $u_{D(e)\setminus j'(e)}$, i.e. on the conditioning set without the chosen variable $u_{j'(e)}$. This is called the simplifying assumption.
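To make the role of the h-function concrete, the sketch below evaluates the log-density of a three-dimensional vine with edges 1,2 and 2,3 in the first tree and 1,3;2 in the second. It is an illustration under stated assumptions: the Clayton family is chosen only because its cdf, density and h-function have closed forms, and all function names and parameter values are hypothetical.

```python
import math


def clayton_cdf(u, v, theta):
    """Clayton copula cdf C(u, v) for theta > 0."""
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)


def clayton_density(u, v, theta):
    """Clayton copula density c(u, v), the mixed second derivative of the cdf."""
    s = u ** -theta + v ** -theta - 1.0
    return (1.0 + theta) * (u * v) ** (-theta - 1.0) * s ** (-2.0 - 1.0 / theta)


def clayton_h(u, v, theta):
    """h-function: conditional cdf of U given V = v, i.e. dC(u, v)/dv."""
    s = u ** -theta + v ** -theta - 1.0
    return v ** (-theta - 1.0) * s ** (-1.0 - 1.0 / theta)


def vine3_log_density(u1, u2, u3, th12, th23, th13_2):
    """Log-density of the vine with pair-copulas c_12, c_23 and c_13;2.

    The arguments of the conditional pair-copula c_13;2 are the h-function
    values C(u1 | u2) and C(u3 | u2), as in the recursive formula.
    """
    a = clayton_h(u1, u2, th12)
    b = clayton_h(u3, u2, th23)
    return (math.log(clayton_density(u1, u2, th12))
            + math.log(clayton_density(u2, u3, th23))
            + math.log(clayton_density(a, b, th13_2)))


# Illustrative evaluation point and parameters (not from the paper).
log_dens = vine3_log_density(0.2, 0.5, 0.8, 2.0, 1.5, 0.5)

# Numerical check that clayton_h really is the partial derivative of the cdf.
eps = 1e-6
numeric_h = (clayton_cdf(0.3, 0.7 + eps, 2.0) - clayton_cdf(0.3, 0.7 - eps, 2.0)) / (2 * eps)
```

In higher dimensions the same pattern repeats: each tree consumes h-function values computed in the previous tree, which is what makes the evaluation recursive.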
Example 1 (5-dim pair-copula construction)
The corresponding copula density to the vine tree sequence given in Figure 1 can be expressed as
There are two special cases of an R-vine tree structure $\mathcal{V}$. A line-like structure of the trees, in which each node has a maximum degree of 2, is called a D-vine, while a star structure is a canonical vine (C-vine), in which each tree $T_i$ has a root node of degree $d-i$ and all other nodes have degree 1. Statistical inference methods for D-vines are discussed in Aas et al. (2009). A model selection algorithm as well as the maximum likelihood parameter estimation for C-vines is developed in Czado et al. (2012).
3 Information matrix ratio test
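A new approach for a GOF test for vine copulas is the information ratio (IR) test. It is inspired by the paper of Zhou et al. (2012), who propose an IR test for general model misspecification of the variance or covariance structures. Their test is related to the “in-and-out-sample” (IOS) test of Presnell and Boos (2004), which is a likelihood ratio test. Additionally, Presnell and Boos (2004) showed that the IOS test statistic can be expressed as a ratio of the expected Hessian and the expected outer product of the gradient. In particular, let $U = (U_1,\ldots,U_d)^T \in [0,1]^d$ be a random vector with copula distribution function $C_\theta(u)$. Further let
$$\mathcal{H}(\theta) := E\big[\partial^2_\theta\, \ell(\theta|U)\big] \quad \text{and} \quad \mathcal{C}(\theta) := E\big[\partial_\theta \ell(\theta|U)\, (\partial_\theta \ell(\theta|U))^T\big] \qquad (6)$$
be the expected Hessian matrix of the random (vine) copula log-likelihood function $\ell(\theta|U)$ and the expected outer product of the corresponding score function, respectively. Here $\partial_\theta$ denotes the derivative with respect to the copula parameter $\theta$. Now the information matrix ratio (IMR) is defined as
$$\Psi(\theta) := -\mathcal{H}(\theta)^{-1}\, \mathcal{C}(\theta).$$
Our test problem is the reformulated general test problem of White (1982):
$$H_0: \Psi(\theta) = I_p \quad \text{against} \quad H_1: \Psi(\theta) \neq I_p,$$
where $I_p$ is the $p$-dimensional identity matrix. To calculate the corresponding test statistic we follow Schepsmeier (2013) and define the random matrices
$$\mathbb{H}(\theta|U) := \partial^2_\theta\, \ell(\theta|U) \quad \text{and} \quad \mathbb{C}(\theta|U) := \partial_\theta \ell(\theta|U)\, (\partial_\theta \ell(\theta|U))^T \qquad (8)$$
using the log-likelihood function $\ell(\theta|U)$ of the model with specified vine tree sequence and pair-copulas but unknown parameter $\theta$. Given an i.i.d. sample $u_t = (u_{1t},\ldots,u_{dt})^T$ from $U$ for $t = 1,\ldots,n$ and the corresponding maximum likelihood estimate $\hat\theta_n$ based on $u_1,\ldots,u_n$, the sample counterparts are
$$\bar{\mathbb{H}}(\hat\theta_n) := \frac{1}{n}\sum_{t=1}^{n} \mathbb{H}(\hat\theta_n|u_t) \quad \text{and} \quad \bar{\mathbb{C}}(\hat\theta_n) := \frac{1}{n}\sum_{t=1}^{n} \mathbb{C}(\hat\theta_n|u_t), \qquad (9)$$
respectively. The sample equivalent of $\Psi(\theta)$ and the resulting information ratio statistic are then
$$\hat\Psi(\hat\theta_n) := -\bar{\mathbb{H}}(\hat\theta_n)^{-1}\, \bar{\mathbb{C}}(\hat\theta_n) \quad \text{and} \quad IR_n := \mathrm{tr}\big(\hat\Psi(\hat\theta_n)\big)/p,$$
where $\mathrm{tr}(A)$ denotes the trace of matrix $A$. To derive the asymptotic normality of the test statistic some conditions have to be set. The first two conditions guarantee the existence of the gradient and the Hessian matrix.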
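The density function (3) is twice continuously differentiable with respect to $\theta$.
The negative expected Hessian $-\mathcal{H}(\theta)$ and the expected outer product of the score $\mathcal{C}(\theta)$ are positive definite.
The remaining conditions are more technical and are the same as in Presnell and Boos (2004).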
There exist such that as .
The estimator has an approximating influence curve function such that
where as , , and is finite.
The real-valued function possesses second order partial derivatives with respect to , and
and are finite.
There exists a function such that for all in a neighborhood of and all , where .
In the following represents the vectorization of the symmetric matrix . Let , then Presnell and Boos (2004) showed that
where is the mean vector and is the asymptotic covariance matrix of . Here is the -dimensional zero vector and is the -dimensional identity matrix. Furthermore, let define the partial derivatives of taken with respect to the components of , i.e.
Let satisfy the conditions . Further, let hold for the maximum likelihood estimator with . Additionally, the condition has to be satisfied for both and for each . Then the IR test statistic
where is the standard error of the IR test statistic, defined as
Here is the asymptotic covariance matrix arising from the joint asymptotic normality of and defined above. By we denote the -dimensional vector of partial derivatives of taken with respect to the components of and evaluated at their limits in probability, i.e. .
Since the theoretical asymptotic variance is quite difficult to compute, an empirical version is used in practice. To evaluate the standard error numerically, Zhou et al. (2012) suggest a perturbation resampling approach. Furthermore, Presnell and Boos (2004) state that the convergence to normality is slow and thus they suggest obtaining p-values using a parametric bootstrap under the null hypothesis.
The condition for implies, that the copula density function (3) is four times differentiable with respect to . Furthermore, the first and second moment of the second derivative has to be finite. The vine copula density is four times differentiable if all selected pair-copulas are four times differentiable. These assumptions are satisfied for the elliptical Gauss and Student’s t-copula as well as for the parametric Archimedean copulas in all dimensions.
Let and as in Theorem 1. Then the test
is an asymptotic -level test. Here denotes the quantile of a -distribution.
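As a numerical illustration of the sample statistic of this section, the following sketch computes the averaged Hessian, the averaged squared score and their negative ratio for a single Gaussian pair-copula with one parameter, so that the trace divided by the number of parameters reduces to a scalar ratio. It is a minimal sketch under several assumptions that are not part of the paper: the data are simulated directly as normal scores (equivalent to copula data with known margins), derivatives are taken by central finite differences, the maximum likelihood estimate is found by a simple ternary search, and all names (`log_c`, `fit_rho`, `information_ratio`) are hypothetical. It does not estimate the standard error, for which the text refers to resampling or a parametric bootstrap.

```python
import math
import random


def log_c(rho, x, y):
    """Gaussian pair-copula log-density, written in the normal scores x and y."""
    r2 = rho * rho
    return (-0.5 * math.log(1.0 - r2)
            - (r2 * (x * x + y * y) - 2.0 * rho * x * y) / (2.0 * (1.0 - r2)))


def fit_rho(data, lo=-0.95, hi=0.95, iters=50):
    """Maximize the log-likelihood over rho by ternary search (assumes one peak)."""
    def loglik(rho):
        return sum(log_c(rho, x, y) for x, y in data)

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if loglik(m1) < loglik(m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)


def information_ratio(rho_hat, data, eps=1e-4):
    """Return -C_bar / H_bar, the one-parameter case of tr(-H_bar^{-1} C_bar) / p."""
    n = len(data)
    h_bar = 0.0   # sample mean of the second derivative of the log-density
    c_bar = 0.0   # sample mean of the squared score
    for x, y in data:
        up = log_c(rho_hat + eps, x, y)
        mid = log_c(rho_hat, x, y)
        down = log_c(rho_hat - eps, x, y)
        score = (up - down) / (2.0 * eps)
        hess = (up - 2.0 * mid + down) / (eps * eps)
        h_bar += hess / n
        c_bar += score * score / n
    return -c_bar / h_bar


# Simulate from the model that is also fitted, so the ratio should be near 1.
rng = random.Random(2024)
true_rho = 0.5
data = []
for _ in range(4000):
    x = rng.gauss(0.0, 1.0)
    y = true_rho * x + math.sqrt(1.0 - true_rho ** 2) * rng.gauss(0.0, 1.0)
    data.append((x, y))

rho_hat = fit_rho(data)
ir_n = information_ratio(rho_hat, data)
```

With several parameters the two averages become matrices and the ratio is replaced by the trace of the negative inverse averaged Hessian times the averaged outer product, divided by the number of parameters.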
4 Further goodness-of-fit tests for vine copulas
In recent years many GOF tests have been suggested for copulas. The most promising ones were investigated in Genest et al. (2009) and Berg (2009). However, only the size and power for the elliptical and one-parametric Archimedean copulas were analyzed. The multivariate case is therefore poorly addressed. For vine copulas little has been done. A first test for vine copulas was suggested but not investigated in Aas et al. (2009). Their GOF test is based on the multivariate PIT and an aggregation introduced by Breymann et al. (2003). After aggregation, standard univariate GOF tests such as the Anderson-Darling (AD), the Cramér-von Mises (CvM) or the Kolmogorov-Smirnov (KS) tests are applied. They are described in more detail in Appendix B. We will denote the resulting tests as Breymann.
Similar approaches based on the multivariate PIT are proposed by Berg and Bakken (2007). Besides new aggregation functions forming univariate test data, they perform the aggregation step on the ordered PIT output data rather than on the unordered data. Again standard univariate GOF tests are applied. These approaches will be called Berg and Berg2, respectively.
Berg and Aas (2009) applied a test of the null hypothesis (1) based on the empirical copula process (ECP) to a 4-dimensional vine copula. As with the Breymann test, their GOF test is neither described in detail nor investigated with respect to its power. We will denote this test as ECP. An extension of the ECP test is the combination of the multivariate PIT approach with the ECP. The general idea is that the transformed data of a multivariate PIT should be “close” to the independence copula (Genest et al., 2009). Thus a distance of CvM or KS type between them is considered. This approach is called ECP2.
Schepsmeier (2013) was the first to analyze the power of a GOF test for vine copulas in detail. His approach is, like our new IR GOF test, based on the information matrix equality and specification test introduced by White (1982). His power studies show that the convergence to the asymptotic distribution function of the test statistic is very slow. Further, given copula data with sample size smaller than 10000, the test does not reach its nominal level based on asymptotic p-values. With bootstrapped p-values, however, the test shows very good power behavior. We denote this approach as White.
In the forthcoming sections we will introduce the vine copula test of Schepsmeier (2013), the multivariate PIT based GOF such as the ones of Breymann et al. (2003) and Berg and Bakken (2007), and the two ECP based GOF tests. A first overview of the considered GOF tests is given in Figure 2.
4.1 White’s information matrix test
The GOF test of Schepsmeier (2013) uses White’s information matrix equality and specification test. It is a rank-based test which is asymptotically pivotal, i.e. the asymptotic distribution is independent of model parameters.
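Let $U = (U_1,\ldots,U_d)^T$ be a random vector with vine copula log-likelihood $\ell(\theta|U)$. Further let $\mathcal{H}(\theta)$ and $\mathcal{C}(\theta)$ be defined as in Equation (6), i.e. the expected Hessian matrix and the expected outer product of the score function, respectively. Considering the Bartlett identity we can formulate the vine copula misspecification test problem as
$$H_0: \mathcal{H}(\theta_0) + \mathcal{C}(\theta_0) = 0 \quad \text{against} \quad H_1: \mathcal{H}(\theta_0) + \mathcal{C}(\theta_0) \neq 0. \qquad (11)$$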
Here, denotes the true value of the vine copula parameter vector. Following the notation of Schepsmeier (2013) we denote by the vectorized sum of and defined in (8). Its empirical version is denoted by , where and are defined in (9). Further, we define the expected gradient matrix of the random vector as
Now, under suitable regularity conditions (A1-A10 in White, 1982), assuring that is a continuous measurable function and its derivatives exist, the following is shown. Given a copula model (Huang and Prokhorov, 2011) or a vine copula model (Schepsmeier, 2013), the asymptotic covariance matrix of is given by
Here, is again the maximum likelihood estimate of given i.i.d. samples. For details on the estimation of and we refer to Schepsmeier (2013).
Thus, the test statistic of the White test is
where is the estimated asymptotic variance matrix given observation.
The test statistic asymptotically follows a $\chi^2$ distribution with $p(p+1)/2$ degrees of freedom, where $p$ is the number of vine copula parameters. But Schepsmeier (2013) showed that even for small dimensions the asymptotic theory does not hold for relatively large data sets, for example in 5 dimensions. However, for bootstrapped p-values the power is quite satisfying, i.e. the test has power against false vine copula model specifications. The White test rejects the null hypothesis (11) at level $\alpha$ if the test statistic exceeds the $(1-\alpha)$ quantile of the $\chi^2_{p(p+1)/2}$ distribution. In the bootstrapped case the $\chi^2$ distribution is replaced by the empirical distribution function of the bootstrapped test statistics.
4.2 Rosenblatt’s transform tests
The vine copula GOF test suggested by Aas et al. (2009) is based on the multivariate probability integral transform (PIT) of Rosenblatt (1952) applied to copula data and a given estimated vine copula model. The general multivariate PIT definition and the explicit algorithm for the R-vine copula model are given in Appendix A. The PIT output data $y = (y_1,\ldots,y_d)$ is assumed to be i.i.d. with $y_i \sim U[0,1]$ for $i = 1,\ldots,d$. Now, a common approach in multivariate GOF testing is dimension reduction. Here the aggregation is performed by
$$s_t := \sum_{i=1}^{d} \Gamma(y_{ti}), \quad t = 1,\ldots,n, \qquad (13)$$
with a weighting function $\Gamma(\cdot)$.
Breymann et al. (2003) suggest as weight function the squared quantile of the standard normal distribution, i.e. $\Gamma(y_{ti}) = \Phi^{-1}(y_{ti})^2$, with $\Phi(\cdot)$ denoting the $N(0,1)$ cdf. Finally, they apply a univariate Anderson-Darling test to the univariate test data $s_t$. The three step procedure is summarized in Figure 3.
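The three steps (PIT, aggregation with squared normal quantiles, univariate Anderson-Darling statistic) can be sketched for the simplest possible case of a single Clayton pair-copula in two dimensions. Everything specific here is an assumption made for illustration: the Clayton family and its parameter, the sample size, the use of the true parameter instead of an estimate, and the function names. The closed-form chi-square cdf used below holds only for two degrees of freedom.

```python
import math
import random
from statistics import NormalDist

THETA = 3.0  # Clayton parameter, chosen only for this illustration


def clayton_h(u, v, theta):
    """Conditional cdf C(u | v) of the Clayton copula."""
    s = u ** -theta + v ** -theta - 1.0
    return v ** (-theta - 1.0) * s ** (-1.0 - 1.0 / theta)


def sample_clayton(n, theta, rng):
    """Draw n pairs by inverting clayton_h in its first argument."""
    pairs = []
    for _ in range(n):
        u1 = 1.0 - rng.random()   # value in (0, 1], avoids a zero base below
        w = 1.0 - rng.random()
        u2 = (u1 ** -theta * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)
        pairs.append((u1, u2))
    return pairs


def clip(p):
    """Keep probabilities strictly inside (0, 1) so quantiles and logs are finite."""
    return min(max(p, 1e-12), 1.0 - 1e-12)


def breymann_ad(pit_rows):
    """Aggregate two-dimensional PIT rows and return the Anderson-Darling statistic."""
    nd = NormalDist()
    s = [sum(nd.inv_cdf(clip(y)) ** 2 for y in row) for row in pit_rows]
    # Under the null the sum of two squared N(0,1) quantiles is chi-square
    # with 2 degrees of freedom, whose cdf is 1 - exp(-s / 2).
    f = sorted(clip(1.0 - math.exp(-0.5 * v)) for v in s)
    n = len(f)
    total = sum((2 * i + 1) * (math.log(f[i]) + math.log(1.0 - f[n - 1 - i]))
                for i in range(n))
    return -n - total / n


rng = random.Random(1)
data = sample_clayton(2000, THETA, rng)

# Rosenblatt PIT under the correct model: y1 = u1, y2 = C(u2 | u1).
pit_true = [(u1, clayton_h(u2, u1, THETA)) for u1, u2 in data]
# PIT under a wrongly assumed independence copula: the data are left unchanged.
pit_indep = [(u1, u2) for u1, u2 in data]

ad_true = breymann_ad(pit_true)
ad_indep = breymann_ad(pit_indep)
```

In this deliberately extreme comparison (strong dependence against independence) the statistic under the wrong model should be much larger; the alternatives in the power study are far closer to the null.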
Berg and Bakken (2007) point out that the approach of Breymann et al. (2003) has some weaknesses and limitations. The weighting function strongly weights data along the boundaries of the -dimensional unit hypercube. They suggest a generalization and extension of the PIT approach. First, they propose two new weighting functions for the aggregation in (13):
Further, they use the order statistics of the random vector , denoted by with observed values . The calculation of the order statistics PIT can be simplified by using the fact that are i.i.d. random variables and is a Markov chain (David, 1981, Theorem 2.7). Now Theorem 1 of Deheuvels (1984) can be applied and the calculation of the PIT ease to
Now, Berg and Bakken (2007) construct the aggregation as the sum of a product of two weighting functions applied to and , respectively, i.e.
Here and are chosen from the suggested weighting functions including the one of Breymann et al. (2003). Let be the corresponding random aggregation of . If and or vise versa, the asymptotic distribution of follows a distributed random variable (Breymann et al., 2003). In all other cases the asymptotic distribution of is unknown.
The combinations with and for performed very poorly in the simulation setup considered later. Thus we will not include them in the forthcoming power study. Only the weighting functions listed in Table 1 will be investigated. As final test statistics to the test data we apply the univariate Cramér-von Mises (CvM) or Kolmogorov-Smirnov (KS) test, as well as the mentioned univariate Anderson-Darling (AD) test. All three test statistics are given in B for the convenience of the reader.
Let and denote the quantile of the univariate AD, CvM or KS test statistic, respectively. Then the test rejects the null hypothesis (1) if or , respectively.
4.3 Empirical copula process tests
A rather different approach is suggested by Genest and Rémillard (2008) for copula GOF testing. They propose to use the difference of the copula distribution function with estimated parameter and the empirical copula (see Equation (2)) given the copula data . This stochastic process is known as the empirical copula process (ECP) and will be used to test (1). For a vine copula model the copula distribution function is not given in closed form. Thus a bootstrapped version has to be used.
Now, the ECP is utilized in a multivariate Cramér-von Mises (mCvM) or multivariate Kolmogorov-Smirnov (mKS) based test statistic. The multivariate distribution functions and in Equation (18) and (19) of B.1 are replaced by their (vine) copula equivalents and , respectively. Thus we consider
To avoid the calculation or approximation of the fitted copula distribution function, Genest et al. (2009) and other authors propose to use the transformed data of the PIT approach and plug them into the ECP. The idea is to calculate the distance between the empirical copula of the transformed data and the independence copula. Thus, the considered multivariate CvM and KS test statistics are
respectively. Since neither the mCvM nor the mKS test statistic has a known asymptotic distribution function a parametric bootstrap procedure has to be applied to estimate p-values. Thus a computer intensive double bootstrap procedure has to be implemented. As before the test rejects the null hypothesis (1) if or , respectively. Here and are the quantiles of the mCvM and mKS test statistic’s empirical distribution function, respectively. Similar rejection regions are defined for the ECP2 test statistics.
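The ECP2 idea can be sketched as follows: evaluate the empirical copula of the PIT output at the sample points, subtract the independence copula (the product of the coordinates), and summarize the differences by a sum of squares and a scaled maximum. The normalizations used here (division by n + 1 in the empirical copula, the factor sqrt(n) in the KS-type statistic, evaluation at the sample points only) are assumptions for this sketch, as are the function names; the p-values would still have to come from a bootstrap.

```python
import math


def empirical_copula(point, sample):
    """Share of sample rows that are componentwise <= point, with divisor n + 1."""
    n = len(sample)
    count = sum(1 for row in sample if all(r <= p for r, p in zip(row, point)))
    return count / (n + 1.0)


def ecp2_statistics(pit_sample):
    """CvM-type and KS-type distances between the empirical and independence copula."""
    n = len(pit_sample)
    diffs = [empirical_copula(y, pit_sample) - math.prod(y) for y in pit_sample]
    m_cvm = sum(d * d for d in diffs)
    m_ks = math.sqrt(n) * max(abs(d) for d in diffs)
    return m_cvm, m_ks


# Tiny hand-checkable example with two PIT observations in two dimensions.
toy = [(0.2, 0.3), (0.6, 0.9)]
m_cvm, m_ks = ecp2_statistics(toy)
```

For the toy sample the two differences are 1/3 - 0.06 and 2/3 - 0.54, so both statistics can be verified by hand.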
5 Power study
To investigate the power behavior of the proposed GOF tests and to compare them to each other we conduct several Monte Carlo studies of different dimension. The second property of interest is the ability of the test to maintain the nominal level or size, usually chosen at 5%.
If a test has the probability of rejection less than or equal to a small number , called the level of significance, for the hypothesis , then such a test is called a -level test. We speak of rejecting at level . Common values for are 0.05 and 0.01. Since a test of level is also a test of level , the smallest such is called size of the test and is the maximum probability of type I error (Bickel and Doksum, 2007, p.217). The power of a test against the alternative is the probability of rejecting when is true. It is often denoted as .
Given an observed test statistic of the corresponding p-value is defined as
For a given model consider the random statistic based on an i.i.d. sample of size from model with observed value . Define the random variable which takes on values in . Let denote the distribution function of , then is the actual size of the test at level (nominal size). A test maintains its nominal level if . As estimates of the p-value and the distribution function we use their empirical versions. Therefore generate bootstrap realizations of the test statistic , denoted as , when observations are drawn from model .Then the estimate of is given as
Further, the estimated size at level is defined as Generating i.i.d. data sets of an alternative model in to estimate by we get the power of the test when the alternative holds.
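The empirical quantities used in the power study can be sketched in a few lines: a bootstrap p-value is the share of bootstrapped statistics at least as large as the observed one, and the estimated size or power at a given level is the share of replications whose p-value does not exceed that level. The function names and the toy numbers are hypothetical, and large values of the statistic are assumed to speak against the null hypothesis.

```python
def bootstrap_p_value(t_obs, t_boot):
    """Share of bootstrap statistics that are at least as large as the observed one."""
    return sum(1 for t in t_boot if t >= t_obs) / len(t_boot)


def rejection_rate(p_values, alpha=0.05):
    """Share of replications rejected at level alpha.

    This estimates the size if the data come from the null model and the
    power if they come from an alternative model.
    """
    return sum(1 for p in p_values if p <= alpha) / len(p_values)


# Toy numbers for illustration only.
p_hat = bootstrap_p_value(2.0, [1.0, 2.0, 3.0, 4.0])
rate = rejection_rate([0.01, 0.20, 0.05, 0.50], alpha=0.05)
```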
5.1 General simulation setup
For the general simulation setup we follow the procedure of Schepsmeier (2013). Given a vine copula model we test for each proposed GOF test if it has suitable power against an alternative vine copula model , where , as follows: