Bayesian estimation of a bivariate copula using the Jeffreys prior

Bayesian estimation of a bivariate copula using the Jeffreys prior


A bivariate distribution with continuous margins can be uniquely decomposed via a copula and its marginal distributions. We consider the problem of estimating the copula function and adopt a Bayesian approach. On the space of copula functions, we construct a finite-dimensional approximation subspace that is parametrized by a doubly stochastic matrix. A major problem here is the selection of a prior distribution on the space of doubly stochastic matrices also known as the Birkhoff polytope. The main contributions of this paper are the derivation of a simple formula for the Jeffreys prior and showing that it is proper. It is known in the literature that for a complex problem like the one treated here, the above results are difficult to obtain. The Bayes estimator resulting from the Jeffreys prior is then evaluated numerically via Markov chain Monte Carlo methodology. A rather extensive simulation experiment is carried out. In many cases, the results favour the Bayes estimator over frequentist estimators such as the standard kernel estimator and Deheuvels’ estimator in terms of mean integrated squared error.


0 \volume18 \issue2 2012 \firstpage496 \lastpage519 \doi10.3150/10-BEJ345


Bayesian estimation of a bivariate copula


1]\fnmsSimon \snmGuillotte\thanksref1label=e1] and 2]\fnmsFrançois \snmPerron\corref\thanksref2label=e2]

Birkhoff polytope \kwdcopula \kwddoubly stochastic matrices \kwdfinite mixtures \kwdJeffreys prior \kwdMarkov chain Monte Carlo \kwdMetropolis-within-Gibbs sampling \kwdnonparametric \kwdobjective Bayes

1 Introduction

Copulas have received considerable attention recently because of their increasing use in multiple fields such as environmental studies, genetics and data networks. They are also currently very popular in quantitative finance and insurance; see Genest et al. Genest3 (). Since it is precisely the copula that describes the dependence structure among various random quantities, estimating a copula is part of many techniques employed in these fields. For instance, in risk measurement, the value at risk (VaR) is computed by simulating asset log returns from a fitted joint distribution, for which the dependence structure between the assets is modelled by a copula. Further financial examples in which copulas are estimated are provided in Embrechts et al. Embrechts03 () and the books written by Cherubini et al. Cherubini1 (), McNeil et al. McNeil05 () and Trivedi and Zimmer Trivedi1 (). In this paper, we provide new generic methodology for estimating copulas within a Bayesian framework.

Let us first recall that a bivariate copula is a cumulative distribution function on with uniform margins. In this paper, we denote the space of all copulas by . Every is Lipschitz continuous, with a common Lipschitz constant equal to one:


The space is bounded above and below by the so-called Fréchet–Hoeffding copulas, that is, for every ,

Sklar’s theorem states that a bivariate cumulative distribution function is completely characterized by its marginal cumulative distribution functions and its copula . More precisely, we have the representation


where is well defined on ; see Nelsen Nelsen1 (). In particular, the copula is unique if and are continuous and, in this case, we have the following expression for the copula:


Let be a sample where every is a realization of the random couple , , with joint cumulative distribution function , and continuous marginal cumulative distribution functions and . We consider the problem of estimating the unknown copula of by a copula , where depends on the sample. In this problem, the individual marginal distributions are treated as nuisance parameters. The literature presents three generic approaches for estimating , namely the fully parametric, the semi-parametric and the nonparametric approaches. Below, we briefly describe each approach and focus on two nonparametric estimators, since we will subsequently compare our estimator with these.

The fully parametric approach. In this framework, parametric models are assumed for both the marginal distribution functions and and for the copula . See Joe Joe1 (); Cherubini et al. Cherubini1 (), Joe Joe2 (), and, in a Bayesian setup, Silva and Lopes Silva1 ().

The semi-parametric approach. Here, a parametric model is assumed only for the copula function , not for the margins. In this setup, Genest et al. Genest2 () have proposed the use of rescaled empirical distribution functions, such as the estimates and and a pseudo-likelihood estimator for . The authors show that the resulting estimator is consistent and asymptotically normal. In Kim et al. Kim1 (), comparisons are made between the fully parametric approach and the semi-parametric approach proposed by Genest et al. Genest2 (). More recently, in a Bayesian setup, Hoff Hoff1 () proposes a general estimation procedure, via a likelihood based on ranks, that does not depend on any parameters describing the marginal distributions. The latter methodology can accommodate both continuous and discrete data.

The nonparametric approach. This approach exploits equation (3). Here, we describe Deheuvels’ estimator and the kernel estimator. Let be the empirical cumulative distribution function, and let and be the generalized inverses of its marginal cumulative distribution functions. Any copula is said to satisfy the Deheuvels constraint associated with provided that for all ,

In Deheuvels Deheuvels1 (), the asymptotic behaviour of the class of copulas satisfying the Deheuvels constraint associated with is described. Note that, in the literature, the so-called empirical copula , for all is a function that satisfies the Deheuvels constraint and is often used as an estimator for even though it is not a genuine copula. In Lemma 3, we propose an estimator that satisfies Deheuvels’ constraint and which, unlike , is itself a copula so that . This estimator, which we call Deheuvels’ estimator henceforth, is based on ranks. One nice property of rank-based estimators is their invariance under strictly increasing transformations of the margins. Therefore, if and are two strictly increasing functions, then the Deheuvels estimator based on the original sample and the one based on the sample are identical. This is a desirable property for a copula estimator since it is inherent to copulas themselves.

Moreover, if is a smooth kernel estimator of ( and are continuous say), then


is called a kernel estimator for , and we have . Asymptotic properties of such estimators are discussed in Fermanian and Scaillet Fermanian1 (), and the reader is referred to Charpentier et al. Charpentier1 () for a recent review. In particular, the so-called Gaussian kernel estimator is given by (4) using

where denotes the standard univariate Gaussian cumulative distribution function and is the value of the bandwidth.

Both of the nonparametric estimators discussed above have good asymptotic properties. On the other hand, they may not be optimal for small sample sizes. This could be an inconvenience when working with small samples, and we think practitioners should be aware of this. We illustrate some of these situations by a simulation study in Section 5.

Our aim is to develop a Bayesian alternative for the estimation of that circumvents this problem. Following Genest et al. Genest2 (), when the marginal distributions are unknown, we use rescaled empirical distribution functions as their estimates. In view of this, our methodology can be called empirical Bayes. When the marginal distributions are known, the sample is replaced by , which is a sample from the uniform distribution . In this case, our procedure is purely Bayesian. In both cases, our estimator has the property of being invariant under monotone transformations of the margins, just like Deheuvels’ estimator.

Our model is obtained as follows. First, in Section 2 we construct an approximation subspace . This is achieved by considering the sup-norm and setting a precision so that for every copula there exists a copula such that . Moreover, is finite dimensional; it is parametrized by a doubly stochastic matrix . Then, our estimator is obtained by concentrating a prior on and computing the posterior mean, that is, the Bayes estimator under squared error loss. Now two problems arise, the first one is the prior selection on and the second one concerns the numerical evaluation of the Bayes estimator. These are the topics of Sections 3 and 4, respectively. While the problem of evaluating the Bayes estimator is solved using a Metropolis-within-Gibbs algorithm, the choice of the prior distribution is a much more delicate problem. A copula from our model can be written as a finite mixture of cumulative distribution functions. The mixing weights form a matrix that is proportional to a doubly stochastic matrix. Therefore, specifying a prior on boils down to specifying a prior for the mixing weights. We assume that we do not have any information that we could use for the construction of a subjective prior. It is not our intention to obtain a Bayes estimator better than some other given estimator. For these reasons we shall rely on an objective prior, and a natural candidate is the Jeffreys prior. The main contributions of our paper are the derivation of a simple expression for the Jeffreys prior and showing that it is proper. The fact that these results are generally difficult to come up with, for finite mixture problems, has been raised before in the literature; see Titterington et al. Titterington85 () and Bernardo and Girón Bernardo88 (). Moreover, here we face the additional difficulty that the mixing weights are further constrained, since their sum is fixed along the rows and the columns of . To the best of our knowledge, nothing has yet been published on this problem. In Section 5, we report the results of an extensive simulation study in which we compare our estimator with Deheuvels’ estimator and the Gaussian kernel estimator. Finally, a discussion is provided in Section 6 to conclude the paper.

2 The model for the copula function

For every , we construct a finite-dimensional approximation subspace . The construction of uses a basis that forms a partition of unity. A partition of unity is a set of nonnegative functions such that is a probability density function on for all and

Particular examples are given by indicator functions


and Bernstein polynomials



See Li et al. Li1 () for more examples of partitions of unity. In the following, let , where , for all , , and let


where is an doubly stochastic matrix. The following lemma is straightforward to prove.

Lemma 1

For every doubly stochastic matrix , is an absolutely continuous copula.

For a fixed partition of unity, we now define the approximation space as

The approximation order of is now discussed. It depends on the choice of the basis . Let be a uniformly spaced grid on the unit square . For a given copula , let be the restriction of on . Let

Then is a doubly stochastic matrix. Upper bounds for are given in the following lemma.

Lemma 2

Let be a copula and let , where is obtained by (7). {longlist}

For a model using indicator functions basis (5), we have and .

For a model using the Bernstein basis (6), we have .


(a) A direct evaluation shows that . From the Lipschitz condition (1), if two copulas and satisfy the constraint , then .

(b) First, it is well known that . For any , consider two independent random variables, and , where follows a distribution and follows a distribution. We have


In Lemma 6 of the Appendix, we give the exact value of . However, a simple expression for an upper bound is given by Hölder’s inequality


Bernstein copulas have appeared in the past literature and their properties have been extensively studied in Sancetta and Satchell Sancetta1 () and Sancetta and Satchell Sanchetta2 (). However, in view of Lemma 2 and of the simplicity of indicator functions, we subsequently use the indicator functions basis given in (5) for in our model given by equation (7), and is the family of copulas generated by this model. Since for any doubly stochastic matrix , our model is rich in the sense that we have , which is the set of doubly stochastic matrices.

Notice that in a data reduction perspective, if is a sample from our model , and if represents the indicator functions in (5), then is a sample from the distribution. As in a multinomial experiment with probabilities given by , the vector of cell count statistics follows a distribution and is sufficient for .

The following lemma is used to define what we call Deheuvels’ estimator. The estimator corresponds to the so-called bilinear extension of the empirical copula and has been considered by Deheuvels Deheuvels2 (), Nelsen Nelsen1 (), Lemma 2.3.5, Genest and Nešlehová GN07 () and Nešlehová Nes07 (), Section 5.

Lemma 3

Let be a sample, and let be the matrix given by

If we use the indicator basis (5) with for , then the copula

satisfies Deheuvels’ constraint.

3 The prior distribution

The choice of a prior concentrated on the approximation space is delicate. The prior distribution is specified on , the set of doubly stochastic matrices of order , . Here, we adopt an objective point of view and derive the Jeffreys prior. We also discuss two representations of doubly stochastic matrices that can be useful for the specification of other prior distributions on .

The set is a convex polytope of dimension . It is known in the literature as the Birkhoff polytope and has been the object of much research. For instance, computing the exact value of its volume is an outstanding problem in mathematics; it is known only for (see Beck and Pixton Beck1 ()).

The Fisher information matrix is obtained as follows. For , let , and let . The copula (7) is a mixture of bivariate distribution functions

where , and , for all , with . The probability density function of the copula is thus

The last equality expresses the fact that there are free parameters in the model. Recall that we are considering the indicator functions basis (5) in our model. It follows that for all ,

Although the information matrix is of order , the following result shows how to reduce the computation of its determinant to that of a matrix of order . The important reduction provided by (8) is greatly appreciated when running an MCMC algorithm, which computes the determinant at every iteration. Most important, this expression enables us to derive the main result of this paper, that is, Theorem 1. The proofs of these two results are quite technical, so we have put them in the Appendix.

Lemma 4

The Fisher information for is given by




Theorem 1

The Jeffreys prior is proper.

Now, in order to specify different priors, we can consider the two following representations.

The Hilbert space representation. Let and . Consider the Frobenius inner product on . Thus, is an -dimensional Hilbert space and an orthonormal basis is given by with

For every , there exists a unique matrix such that


where is the matrix given by . In this representation . Therefore, if we let , then we have a bijection between and . The set is a bounded convex subset of with positive Lebesgue measure. From this, priors on can be induced by priors on , and later on, we shall refer to the uniform prior on the polytope as the uniform distribution on . The above representation is also particularly useful to construct a Gibbs sampler for distributions on the polytope.

The Birkhoff–von Neumann representation. Another decomposition is obtained by making use of the Birkhoff–von Neumann theorem. Doubly stochastic matrices can be decomposed via convex combinations of permutation matrices. In fact, is the convex hull of the permutation matrices and these are precisely the extreme points (or vertices) of . Furthermore, every doubly stochastic matrix is a convex combination of, at most, permutation matrices; see Mirsky Mirsky1 (). In other words, if is the set of permutation matrices and if , then there exists such that for some weight vector lying in the -dimensional simplex . A prior distribution over the polytope can be selected using a discrete distribution over the set and a continuous distribution over the simplex , such as a Dirichlet distribution. See Melilli and Petris Melilli95 () for work in this direction.

4 The MCMC algorithm

Let be a sample, where each is a realization of the random couple , , with dependence structure given by a copula , and with continuous marginal distributions and . If the marginal distributions are known, then the transformed observations and , , are both samples from a uniform distribution on . If the marginal distributions are unknown, then we follow Genest et al. Genest2 () and consider the pseudo-observations and , where and are the empirical distributions. The algorithm below describes the transition kernel for the Markov chain used to numerically evaluate the Bayesian estimator associated to the Jeffreys prior . The type of algorithm is called Metropolis-within-Gibbs; see Gamerman and Lopes Gamerman2006 (). An individual estimate is approximated by the sampling mean of the chain.

Let be the length of the chain, and at each iteration , , let be the current doubly stochastic matrix. From representation (9) in the previous section with ,

Repeat for :

  1. [1.]

  2. Select direction and compute the interval as follows:

    1. [1.1]

    2. For every , find the largest interval such that

    3. Take

  3. Draw from the uniform distribution on and set and , for every . The proposed doubly stochastic matrix is given by

  4. Accept with probability


    where is the likelihood derived from expression (7).

Note that the above algorithm could also be used with any prior specified via the Hilbert space representation described in the previous section, including the uniform prior on the polytope described in the previous section. In particular, it could be adapted to draw random doubly stochastic matrices according to such priors by replacing the acceptance probability (10) with

In order to further describe the Jeffreys prior, we use the algorithm to approximate the probability of the largest ball contained in with respect to the Euclidean distance on . This distance may be computed using the Frobenius inner product described in the previous section. The largest ball has radius , where is the size of the doubly stochastic matrix. Although this probability can be obtained exactly for the uniform distribution, we nevertheless approximate it using our algorithm, meanwhile providing some validation of the MCMC algorithm. Figure 1 shows the results we get for .

(a) (b)
Figure 1: Convergence of 1000 parallel MCMC runs for the probability of the largest ball contained in the polytope with . Shaded region represents the range of the entire set of approximations at each iteration. Figure (a) is the convergence for the probability in the case of the uniform distribution. The flat line, in this case, corresponds to the true probability . Figure (b) is the same for the Jeffreys prior.

Notice that this probability is much smaller for the Jeffreys prior, because it distributes more mass towards the extremities of the polytope than the uniform prior does. This may also be observed by plotting the density estimates of the radius of the doubly stochastic matrix, that is, the Euclidean distance of the doubly stochastic matrix from the center of the polytope . These are shown in Figure 2.

(a) (b)
(c) (d)
Figure 2: Plots of samples and density estimates of the radius (Euclidean distance of the doubly stochastic matrix from the center of the polytope ), on the interval , where is the 95th quantile of its distribution. Figures (a) and (b) are results when sampling from the uniform prior and figures (c) and (d) are those of the Jeffreys prior. Here .

5 Simulation experiments

The goal of the experiment is to study the performance of our estimator on artificial data sets generated from various bivariate distributions. We provide evidence that the estimators derived from our model give good results in general, and most important, that the Jeffreys prior is a reasonable choice.

Six parametric families of copulas are considered: {longlist}

Clayton family:

Gumbel family:

Frank family:

Gaussian family: where is the standard bivariate Gaussian cumulative distribution function with correlation coefficient , and  is the inverse of the univariate standard normal cumulative distribution function.

Gaussian cross family: , where  belongs to the Gaussian family.

Gaussian diamond family:

where and belongs to the Gaussian cross family.

(a) (b)
Figure 3: Densities of the Gaussian cross copula and the Gaussian diamond copula with .

For the Clayton, Frank and Gaussian families, values of the parameter away from 0 indicate departure from independence, while a parameter away from 1 indicates departure from independence for the Gumbel family. These four families are among the popular ones in the literature, but the last two families are not, and so we now describe them in more detail. The Gaussian cross family is obtained by the following: Let be a random vector with uniform margins and with the Gaussian copula as its joint distribution. Let  be an independent uniformly distributed random variable, and consider the random vector

The distribution of is given by the Gaussian cross copula. Here, the superscript is to highlight the “cross-like” dependence structure; see Figure 3(a) for a plot of its density when . The Gaussian diamond family corresponds to the distributions of the random vectors

See Figure 3(b) for an illustration of its density when .

An extensive simulation experiment is carried out in two parts. In the first part, we consider the case of known marginal distributions and use bivariate data sampled from the copula families above. For each family, we consider 11 models corresponding to equally spaced parameter values in some interval. For the first four families, the interval is determined so that the Kendall’s values associated to the particular models range between 0 and ; see the simulation in Silva and Lopes Silva1 (). Kendall’s associated with a copula  is the dependence measure defined by

The values of Kendall’s for the first four families are respectively given by , , , where is the Debye function and . For families 5 and 6, we consider the models corresponding to 11 values of ranging between 0 and 1. In the second part of the experiment, we simulate an unknown margins situation. We focus on families 4, 5 and 6 and consider 11 equally spaced values of ranging between 0 and 1 for the copula models. Here, a Student with seven degrees of freedom and a chi-square with four degrees of freedom are considered as the first and second margins, respectively.

In the experiment, 1000 samples of both sizes and are generated from each model. For every data set, the copula function is estimated using five estimators. The first two are the Bayes estimators associated to the Jeffreys and the uniform priors, respectively. For the uniform prior, we mean the uniform distribution on defined in Section 3, we use the bijection given by expression (9). The third estimator is the maximum likelihood estimator (MLE) from our model , where  maximizes the likelihood derived from expression (7). This estimator is evaluated numerically. For the above three estimators, we take as the order of the doubly stochastic matrix in our model. Finally, we consider the two frequentist estimators, that is, Deheuvels’ estimator given in Theorem 3 and the Gaussian kernel estimator described in the Introduction. Values of the bandwidth for the latter estimator are based on the commonly used rule of thumb: , where , , is the sample standard deviation of the th margin; see Fermanian and Scaillet Fermanian1 () and Sheather Sheather04 (). Figures 45 and 6 report the values of the mean integrated squared errors,

for the five estimators as a function of the parameter .

(a) Family 1 (b) Family 2
(c) Family 3 (d) Family 4
(e) Family 5 (f) Family 6
Figure 4: Plots of MISE against in the known margins case. The MISE is approximated using 1000 samples of size . Thick solid line is the MISE of the Bayes estimator using the Jeffreys prior, dashed line is that of the Bayes estimator using the uniform prior, dashed–dotted line is the MISE of the MLE, while dotted and thin solid line is the MISE of Deheuvels’ and the Gaussian kernel estimators, respectively.
(a) Family 1 (b) Family 2
(c) Family 3 (d) Family 4
(e) Family 5 (f) Family 6
Figure 5: Plots of MISE against in the known margins case. The MISE is approximated using 1000 samples of size . Thick solid line is the MISE of the Bayes estimator using the Jeffreys prior, dashed line is that of the Bayes estimator using the uniform prior, dashed–dotted line is the MISE of the MLE, while dotted and thin solid line is the MISE of Deheuvels’ and the Gaussian kernel estimator, respectively.
(a) Model 4, (b) Model 4,
(c) Model 5, (d) Model 5,
(e) Model 6, (f) Model 6,
Figure 6: Plots of MISE against in the unknown margins case. The MISE is approximated using 1000 samples each of sizes and . Thick solid line is the MISE of the Bayes estimator using the Jeffreys prior, dashed line is that of the Bayes estimator using the uniform prior, dashed–dotted line is the MISE of the MLE, while dotted and thin solid line is the MISE of Deheuvels’ and the Gaussian kernel estimators, respectively.

As the results indicate, the Bayesian approach outperforms Deheuvels’ estimator and the kernel estimator near independence for the Clayton, Gumbel, Frank and Gaussian families. Unfortunately, this is not necessarily the case when the value of the parameter increases, that is, when the true copula approaches the Fréchet–Hoeffding upper bound, also called the comonotone copula, corresponding to (almost sure) perfect positive linear dependence. For families 5 and 6, the Bayes estimators outperform both frequentist estimators when the sample size is small (). One remarkable feature that appears when comparing the results obtained in the known margins case with the results obtained in the unknown margins case is the decrease in performance of the kernel estimator. Recall that the latter estimator is the only one for which the invariance property mentioned in the Introduction does not hold. The other estimators seem to behave similarly when comparing their resulting MISE in the known margins case with their MISE in the unknown margins case. Notice the resemblance in shape of the MISE for Deheuvels’ estimator and the kernel estimator in the unknown margins cases. Finally, the performance of the MLE is worth mentioning, since in many cases it has the smallest MISE, especially for large values of . This is because the MLE will go on the boundary of the parameter space easily, while the Bayes estimator will always stay away from the boundary with the types of priors that we have selected. However, if such an extreme case is to happen in a real life problem, it is probable that the practitioner has some insight on the phenomenon beforehand, and may choose to work with a more appropriate (subjective) prior.

6 Discussion

Two points need to be further discussed. First, our methodology is purely Bayesian only when the marginal distributions are known. When these are unknown, our methodology is empirical Bayes. In fact, in this case we propose a two-step procedure by first estimating the marginsvia the empirical marginal distributions and then plugging them in as the true distributions thereafter. We have chosen to do this because it is common practice to do so (see Genest et al. Genest2 ()), it is simple to implement, it is robust against outliers and our estimator is consequently invariant under increasing transformations of the margins. One way to propose a purely Bayesian estimator by using our model for the copula is to use finite mixtures for the margins. This way, if the densities used in the latter mixtures have disjoint supports, then the Jeffreys prior for the mixing weights has a simple form and is proper; see Bernardo and Girón Bernardo88 (). Now by selecting independent Jeffreys priors for the margins and for the copula, the resulting prior is proper as well.

Finally, our models given by the approximation spaces , , are called sieves by some authors; see Grenander Grenander81 (). In the present paper, we have chosen to work with a fixed sieve, so this makes our model finite dimensional. In this case, the methodology falls in the semi-parametric approach described in the Introduction. Here, the rather subjective choice of the sieve to work with can be viewed as a weakness of the proposed methodology. On the other hand, by using the entire set of sieves, we can construct a nonparametric model for the copula that can, in some sense, respect the infinite-dimensional nature of the copula functions. In fact, if we take , then is dense in the space of copulas. Our Bayesian methodology can be easily adapted here. This can be achieved by selecting an infinite support prior for the model index and using our methodology inside each model. The Bayesian estimator becomes an infinite mixture of the estimators proposed in this paper (one for each model ), where the mixing weights are given by the posterior probabilities of the models.



Proof of Lemma 4 Here we show how to compute efficiently. First, notice that can be written as where


Thus, . If we let , then since , for all , and for all ,

By elementary row and column operations, we get

so that