# Semiparametric Copula Quantile Regression for Complete or Censored Data

Mickaël De Backer, Anouar El Ghouch, Ingrid Van Keilegom

Université catholique de Louvain, Institut de Statistique, Biostatistique et Sciences Actuarielles, Voie du Roman Pays 20, B-1348 Louvain-la-Neuve, Belgium. E-mail: mickael.debacker@uclouvain.be.

###### Abstract

When facing multivariate covariates, general semiparametric regression techniques come in handy to propose flexible models that are not exposed to the curse of dimensionality. In this work a semiparametric copula-based estimator for conditional quantiles is investigated for complete or right-censored data. In spirit, the methodology extends the recent work of Noh et al. (2013) and Noh et al. (2015), as the main idea consists in appropriately defining the quantile regression in terms of a multivariate copula and marginal distributions. Prior estimation of the latter and simple plug-in lead to an easily implementable estimator expressed, for both contexts with or without censoring, as a weighted quantile of the observed response variable. In addition, and contrary to the initial suggestion in the literature, a semiparametric estimation scheme for the multivariate copula density is studied, motivated by the possible shortcomings of a purely parametric approach and driven by the regression context. The resulting quantile regression estimator has the valuable property of being automatically monotonic across quantile levels, and asymptotic normality for both complete and censored data is obtained under classical regularity conditions. Finally, numerical examples as well as a real data application are used to illustrate the validity and finite sample performance of the proposed procedure.

Key words: Semiparametric regression, censored quantile regression, multidimensional copula modelling, semiparametric vine copulas, kernel smoothing, polynomial local-likelihood, probit transformation.

## 1 Introduction

Quantile regression is a prevailing method when it comes to investigating the possible relationships between a $d$-dimensional covariate $X$ and a response variable $T$. Since the seminal work of Koenker and Bassett (1978), quantile regression has received notable interest in the literature on theoretical and applied statistics as a very attractive alternative to the classical mean regression model based on quadratic loss. As the latter only captures the central tendency of the data, there are many cases where mean regression is uninformative about the conditional upper or lower quantiles. For an interesting application, see for example Elsner et al. (2008). A comprehensive review of quantile regression as a robust (to outliers) and flexible (to error distribution) method can be found in Koenker (2005).

A wide literature on the estimation of the quantile regression function is devoted to the case where the response variable is completely observed. This is not necessarily the case in survival analysis, where right censoring of $T$ may arise, i.e. instead of fully observing the variable of interest, one only observes the minimum of it and a censoring variable. For instance, in clinical studies, censoring may occur because of the withdrawal of patients from the study, the end of the follow-up period, etc. In this context, quantile regression becomes attractive as an alternative to popular regression techniques like the Cox proportional hazards model or the accelerated failure time model, as is argued in Koenker and Bilias (2001), Koenker and Geling (2001) and Portnoy (2003). Additional appealing properties of the method include the fact that it allows for modelling heterogeneity of the variance and that, as opposed to the popular Cox model, it does not necessarily impose a proportional effect of the covariates on the hazard over the duration time.

As is the case for the uncensored situation, existing literature on censored quantile regression includes fully parametric, semiparametric and nonparametric methodologies. When several covariates are to be taken into account, fully parametric methodologies are known to be highly sensitive to model misspecification and may lack the flexibility needed for adequate modelling. On the other hand, in spite of their great flexibility, fully nonparametric methods such as the local linear smoothing proposed by El Ghouch and Van Keilegom (2009) are typically affected by the curse of dimensionality. In light of these restrictions, semiparametric estimation procedures such as the single-index model suggested by Bücher et al. (2014) come in handy when the dimension of the covariate is high. It is the object of this paper to extend the rather sparse literature on flexible multidimensional methodologies in the context of censored quantile regression.

Censored quantile regression was first introduced by Powell (1986) for linear models and fixed censoring, that is, presuming that the censoring times are the same for all observations. For random censoring, Ying et al. (1995) proposed a semiparametric linear median regression model, assuming that the survival time and the censoring variable are unconditionally independent. Despite this important contribution, the proposed procedure involves solving non-monotone discontinuous equations, thereby introducing practical and computational difficulties. Furthermore, the unconditional independence assumption may be restrictive, as conditional independence, given the covariates, seems more natural in certain applications.

Based on conditional independence between the survival time and the censoring variable, Portnoy (2003) developed a novel estimating procedure based on the idea of redistribution-of-mass introduced by Efron (1967). A major shortcoming of it is the need for a global linear assumption, that is, in order to estimate the $\tau$-th conditional quantile, one needs to assume the linearity of all the conditional functionals at lower quantiles. Extending Portnoy’s work, Wang and Wang (2009) relaxed this constraint by only assuming linearity at one pre-specified quantile level of interest. Recently, Leng and Tong (2013) proposed an alternative to Wang and Wang’s methodology by extending the work of Ying et al. in order to relax the constraining unconditional independence assumption and provide an efficient algorithm for estimation. Still, a linear approach may be too restrictive for real data applications.

Finally, an interesting alternative approach to the above-mentioned literature was proposed by Bücher et al. (2014), where a single-index model for the conditional quantile function is studied under the assumption of independence between the covariates and the censoring variable. The single-index structure assumes that the objective function depends linearly on the covariates through an unknown link function, making the proposed model (i.e. under the aforementioned assumption) insensitive to the curse of dimensionality since the nonparametric part is of dimension one.

In this paper, we aim to extend the literature on multivariate quantile regression estimation in the possible presence of censored data by providing a rich, flexible and robust alternative based on the copula function that defines the dependence structure between the variables of interest. By taking advantage of copula modelling, we intend to provide a new class of estimators that would allow practitioners to analyse, in a flexible way, multidimensional survival data. Our methodology is, in essence, an extension of the recent work of Noh et al. (2013) and Noh et al. (2015), as the central idea is to express the conditional quantile function in terms of an appropriate copula density and marginal distributions. In their original paper, Noh et al. (2013) suggested leaving the marginal distributions unspecified while assuming a parametric model for the copula. Overall, their suggested approach results in a semiparametric regression estimator that is not exposed to the curse of dimensionality. However, in order to avoid possible shortcomings highlighted by Dette et al. (2014) that are induced by the misspecification of the parametric copula, we propose in this work, both for complete and censored data, an alternative semiparametric estimation strategy for the copula itself. Our resulting methodology is flexible for multidimensional data with or without censoring, easy to implement and, in contrast to existing semiparametric alternatives, does not require any iterative procedure.

The rest of this paper is organised as follows. Developing the copula-based estimation procedure for the quantile regression is the topic of Section 2. The asymptotic properties of the proposed estimator are obtained in Section 3 and the finite sample performance is illustrated by means of Monte Carlo simulations in Section 4, where both the semiparametric copula estimation strategy and the overall performance of our estimator are investigated. Section 5 provides a brief application to real data. Lastly, the proofs of our asymptotic properties are deferred to the Appendix.

## 2 Methodology and Estimation

### 2.1 Copula-based estimator for complete data

Let $X=(X_1,\dots,X_d)$ be a covariate vector of dimension $d$ and $T$ be a (time-to-event) response variable with marginal continuous cumulative distribution functions (c.d.f.) $F_1,\dots,F_d$ and $F_T$, respectively. Throughout this paper, we denote by $f_T$ and $f_X$ the density of $T$ and $X$, respectively. From the pioneering work of Sklar (1959), for a given $x=(x_1,\dots,x_d)$, the c.d.f. of $(T,X)$ evaluated at $(t,x)$ can be expressed as $C_{TX}(F_T(t),F(x))$, where $F(x)=(F_1(x_1),\dots,F_d(x_d))$ and $C_{TX}$ is the unique copula distribution of $(T,X)$ defined by $C_{TX}(u_0,u_1,\dots,u_d)=P(U_0\leq u_0,U_1\leq u_1,\dots,U_d\leq u_d)$, with $U_0=F_T(T)$ and $U_j=F_j(X_j)$, $j=1,\dots,d$. From Sklar’s theorem, it is clear that the copula separates the marginal behaviours of $T$ and $X$ from their dependence structure, hence allowing great modelling flexibility. For a book-length treatment of copulas, see Nelsen (2006) and Joe (2014).

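The object of interest of this paper, the $\tau$-th conditional quantile function of the dependent variable $T$ given $X=x$, denoted by $m_\tau(x)$, is defined for any $\tau\in(0,1)$ as $m_\tau(x)=\inf\{t: F_{T|X}(t|x)\geq\tau\}$, where $F_{T|X}(\cdot|x)$ is the conditional c.d.f. of $T$ given $X=x$, or, equivalently,

$$m_\tau(x)=\arg\min_a E\big(\rho_\tau(T-a)\mid X=x\big), \tag{2.1}$$

where $\rho_\tau(u)=u\,(\tau-\mathbb{1}(u<0))$ is the so-called “check” function, and $\mathbb{1}(\cdot)$ is the indicator function.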
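As a small illustration of (2.1), the following Python sketch evaluates the standard check function $\rho_\tau(u)=u(\tau-\mathbb{1}(u<0))$ and minimises its empirical average over a toy sample. The function names and the toy data are hypothetical and only serve the example; the empirical minimiser is simply a sample quantile.

```python
def check_loss(u, tau):
    """Check function rho_tau(u) = u * (tau - 1(u < 0))."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def empirical_check_risk(a, sample, tau):
    """Average check loss of the residuals sample_i - a."""
    return sum(check_loss(t - a, tau) for t in sample) / len(sample)


# Toy data (hypothetical): the empirical risk is piecewise linear in a,
# so it is enough to search over the observed values.
sample = [1.0, 2.0, 3.0, 4.0, 5.0]
best = min(sample, key=lambda a: empirical_check_risk(a, sample, 0.5))
print(best)
```

For $\tau=0.5$ the minimiser is the sample median, which is the sense in which (2.1) generalises median regression to other quantile levels.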

To motivate our approach, let us suppose, for the moment, that there is no censoring and that we observe an i.i.d. sample $(T_i,X_i)$, $i=1,\dots,n$, from $(T,X)$. In this context, following the definition of a copula function, Noh et al. (2015) noted that the conditional quantile function of $T$ given $X=x$ may be expressed as

$$m_\tau(x)=\arg\min_a E\big[\rho_\tau(T-a)\,c_{TX}(F_T(T),F(x))\big], \tag{2.2}$$

where $c_{TX}$ is the copula density corresponding to $C_{TX}$. Consequently, any given estimators $\hat c_{TX}$, $\hat F_T$ and $\hat F_j$ of $c_{TX}$, $F_T$ and $F_j$, $j=1,\dots,d$, respectively, automatically yield an estimator of $m_\tau(x)$ given by

$$\hat m_\tau(x)=\arg\min_a\sum_{i=1}^n\rho_\tau(T_i-a)\,\hat c_{TX}(\hat F_T(T_i),\hat F(x)), \tag{2.3}$$

with $\hat F(x)=(\hat F_1(x_1),\dots,\hat F_d(x_d))$. As indicated earlier, Noh et al. suggest estimating the marginals nonparametrically and considering a parametrization of the copula density, that is, assuming that the latter belongs to a certain parametric family of copula densities.
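Because the copula density values in (2.3) are non-negative and do not depend on the argument of the minimisation, the estimator is a weighted sample quantile of the responses. The sketch below is a minimal illustration of that computation, assuming the weights have already been obtained; `weighted_quantile` and the toy weights are hypothetical names and values, not the authors' implementation.

```python
def weighted_quantile(values, weights, tau):
    """Smallest minimiser of sum_i weights_i * rho_tau(values_i - a).

    Assumes non-negative weights with a positive sum. In (2.3) the weight
    of observation i would be the estimated copula density evaluated at
    (F_T(T_i), F(x)); here the weights are simply passed in.
    """
    pairs = sorted((v, w) for v, w in zip(values, weights) if w > 0)
    total = sum(w for _, w in pairs)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        # First point at which the cumulated weight reaches tau * total.
        if cumulative >= tau * total:
            return value
    return pairs[-1][0]


# Toy example (hypothetical weights): most of the mass sits on the last point.
responses = [1.0, 2.0, 3.0, 4.0]
copula_weights = [1, 1, 1, 7]
median_at_x = weighted_quantile(responses, copula_weights, 0.5)
print(median_at_x)
```

With equal weights the function returns an ordinary sample quantile, so the copula density acts purely as a reweighting of the observed responses around the point $x$.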

In this paper, however, we propose a novel semiparametric strategy for the estimation of the copula density, motivated by the issues related to the possible misspecification of the parametric approach. To highlight this shortcoming and illustrate how one may circumvent it, we consider the simplistic example reported by Dette et al. (2014) with a single covariate, where are i.i.d. random variables with , , and are i.i.d. standard normal random variables. In this situation, where the true quantile regression function is non-monotonic in the covariate, it is found that most of the common parametric copula families still yield a monotone estimation of the regression function, thereby providing a rather poor fit of the latter. This is illustrated in Figure (a), where the estimation is carried out for , and using three common parametric copulas.

As the roots of the above-mentioned limitation are not intrinsic to a copula-based approach, but are rather to be attributed to the limited set of parametric copula families existing in the literature, a natural alternative, for low dimensional covariates only, would be to consider a fully nonparametric estimation of the copula density itself. The resulting, and adequate, quantile regression estimation is depicted in Figure (b).

Recalling that we intend to handle multivariate covariates in this paper, we will not adopt a purely nonparametric approach, but rather prefer a copula estimation strategy that provides sufficient flexibility to the multidimensional estimator while dodging dimension related constraints. More specifically, we note that any multivariate copula density can be decomposed into two parts as follows:

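$$\begin{aligned} c_{TX}(F_T(t),F(x)) ={}& c_{TX_1}(F_T(t),F_1(x_1))\times\dots\times c_{TX_d}(F_T(t),F_d(x_d))\\ &\times c_{X_1\dots X_d|T}\big(F_{1|T}(x_1|t),\dots,F_{d|T}(x_d|t)\,\big|\,t\big), \end{aligned} \tag{2.5}$$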

where $F_{j|T}(\cdot|t)$, $j=1,\dots,d$, denotes the conditional c.d.f. of $X_j$ given $T=t$. The first part of the decomposition contains the product of bivariate copula densities related to the dependence of $T$ with every covariate, whereas the second part captures the conditional dependence of $X$ given $T$. In the general regression context, the first part of (2.5) may then be interpreted as the dependence of actual interest since it focuses on the relationship between the response variable and every covariate. On the contrary, the second part of (2.5) may be viewed in such a framework as a ‘noisy’ dependence, or, more precisely, a correction parameter for possible (conditional) dependence among covariates. Consequently, it is natural to provide as much flexibility as possible to the modelling of the first part of (2.5), while keeping the estimation of the second part uncomplicated. We therefore advocate estimating the bivariate copulas of interest nonparametrically and, subsequently, exploiting standard parametric techniques for the second part of the multivariate copula density.

Concerning the nonparametric estimation of bivariate copula densities, several methodologies have been proposed in the literature. To estimate for instance $c_{TX_1}$, a tempting, yet naive, approach would be to opt for standard multivariate kernel techniques. That is, given a bivariate sample $(U_{0i},U_{1i})$, $i=1,\dots,n$, from $C_{TX_1}$, one would estimate the copula density at $(u_0,u_1)$ by

$$\frac{1}{n|H|^{1/2}}\sum_{i=1}^n K\left(H^{-1/2}\begin{pmatrix}u_0-U_{0i}\\ u_1-U_{1i}\end{pmatrix}\right),$$

where $H$ is a symmetric positive-definite bandwidth matrix and $K$ is a bivariate kernel function. This technique is called naive in this context since it completely ignores the fact that the density of interest is only supported on the unit square. This boundedness property is at the origin of distinct schemes designed to correct for well-known bias issues at the boundaries, see for example the mirror reflection method (Gijbels and Mielniczuk (1990)) or the boundary kernel method (Chen and Huang (2007)). An appealing alternative was furthermore proposed by Charpentier et al. (2006) and Geenens et al. (2014), where the main idea is to project the initial data on an unbounded support and then estimate the resulting bivariate transformed density by means of standard techniques (standard kernel (Charpentier et al.) or polynomial local-likelihood (Geenens et al.)). Using the invariance property of copulas to increasing transformations of their margins, the estimation of the copula density is then obtained by back-transformation on the unit square. That is, using the example of a probit transformation, one may estimate the copula density at $(u_0,u_1)$ by

$$\frac{\hat f_{01}\big(\Phi^{-1}(u_0),\Phi^{-1}(u_1)\big)}{\phi\big(\Phi^{-1}(u_0)\big)\,\phi\big(\Phi^{-1}(u_1)\big)},$$

where $\phi$ and $\Phi$ stand for the standard normal density and c.d.f., respectively, and where $\hat f_{01}$ is a bivariate density estimator of the projected data $(\Phi^{-1}(U_{0i}),\Phi^{-1}(U_{1i}))$, $i=1,\dots,n$. This transformation technique, coupled with polynomial local-likelihood estimation for $\hat f_{01}$ in order to allow for possibly unbounded copula density estimates, is shown to outperform its competitors in most scenarios in a detailed simulation study in Geenens et al. (2014). An exhaustive comparison of existing methodologies for bivariate estimation may be found in Nagler (2014). Furthermore, fully nonparametric multidimensional copulas are studied in Hobæk Haff and Segers (2012) and Nagler and Czado (2015).
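The probit back-transformation can be sketched in a few lines of Python. This is a simplified illustration only: it uses a plain product Gaussian kernel with a single scalar bandwidth `h` on the probit scale (closer in spirit to the standard-kernel variant than to the local-likelihood estimator of Geenens et al.), and the function name, bandwidth value and grid of pseudo-observations are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: pdf, cdf and inv_cdf


def probit_copula_density(u0, u1, pseudo_obs, h):
    """Kernel estimate of a bivariate copula density at (u0, u1).

    pseudo_obs: pairs lying strictly inside the unit square.
    h: scalar bandwidth on the probit scale (a simplifying assumption).
    """
    s0, s1 = STD_NORMAL.inv_cdf(u0), STD_NORMAL.inv_cdf(u1)
    total = 0.0
    for a, b in pseudo_obs:
        z0 = (s0 - STD_NORMAL.inv_cdf(a)) / h
        z1 = (s1 - STD_NORMAL.inv_cdf(b)) / h
        total += STD_NORMAL.pdf(z0) * STD_NORMAL.pdf(z1)
    density_on_probit_scale = total / (len(pseudo_obs) * h * h)
    # Back-transformation to the unit square.
    return density_on_probit_scale / (STD_NORMAL.pdf(s0) * STD_NORMAL.pdf(s1))


# Toy check: a regular grid mimics independent uniform margins, for which
# the true copula density equals 1 everywhere on the unit square.
m = 20
grid = [(i / (m + 1), j / (m + 1)) for i in range(1, m + 1) for j in range(1, m + 1)]
centre_value = probit_copula_density(0.5, 0.5, grid, h=0.3)
print(round(centre_value, 3))
```

The kernel smoothing takes place on an unbounded support, so no boundary correction is needed; the division by the two normal densities is what maps the estimate back to the unit square.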

Regarding the second part of the $(d+1)$-dimensional copula density in (2.5), standard high-dimensional parametric techniques involve, among others, nested Archimedean copulas (see e.g. Hofert and Pham (2013) and Joe (2014)), factor copulas (see e.g. Oh and Patton (2012)) and, arguably the most popular, vine copulas (see e.g. Czado (2010), Joe (2014) and references therein). Note, however, that for the estimation of this second part in practice, one is first advised to adopt the so-called simplifying assumption, which stipulates that the conditioning on $T$ is fully captured by the conditional marginals. In other words, in (2.5), the conditional copula itself is not affected by the conditioning on $T$. This assumption turns out to be the cornerstone of vine copula models as it keeps them tractable for inference and model selection. For more details about this and its implications, see Hobæk Haff et al. (2010) and Stöber et al. (2013).

To summarise, in this article we propose to adopt the following detailed procedure for the modelling and estimation of the multivariate copula density:

• Based on the original observations $(T_i,X_i)$, $i=1,\dots,n$, construct ‘pseudo-observations’, needed for the estimation of the bivariate copula densities in the first part of the decomposition, using rescaled versions of the empirical distributions:

$$\hat U_{0i}=\frac{1}{n+1}\sum_{k=1}^n\mathbb{1}(T_k\leq T_i),\qquad \hat U_{ji}=\frac{1}{n+1}\sum_{k=1}^n\mathbb{1}(X_{jk}\leq X_{ji}),\qquad i=1,\dots,n,\ j=1,\dots,d,$$

where the normalisation by $n+1$ rather than $n$, commonly adopted in the copula literature, aims at keeping the constructed observations in the interior of $[0,1]$.

• Based on $(\hat U_{0i},\hat U_{ji})$, $i=1,\dots,n$, estimate each bivariate copula density $c_{TX_j}$, $j=1,\dots,d$, in (2.5) using a bivariate kernel density estimator. This can be achieved via the local-polynomial probit methodology of Geenens et al., or any other estimator satisfying assumption 3 given below.

• Compute the pseudo-observations needed for the estimation of the second part of (2.5) as:

$$\hat F_{j|T}(X_{ji}|T_i)=\int_0^{\hat U_{ji}}\hat c_{TX_j}(\hat U_{0i},s)\,ds.$$

This relationship is at the origin of the sequential nature of the vine copula estimation scheme (see e.g. Czado (2010)).

• Lastly, for the estimation of $c_{X_1\dots X_d|T}$, adopt the simplifying assumption and use standard parametric vine techniques on the dataset $(\hat F_{1|T}(X_{1i}|T_i),\dots,\hat F_{d|T}(X_{di}|T_i))$, $i=1,\dots,n$.
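The first and third steps of the procedure above can be sketched as follows. This is a minimal illustration, assuming a bivariate copula density estimate is available as a Python callable; the function names, the midpoint integration rule and the grid size are hypothetical choices, not part of the original procedure.

```python
def pseudo_observations(sample):
    """Rescaled empirical c.d.f. values: rank-type counts divided by n + 1."""
    n = len(sample)
    return [sum(1 for other in sample if other <= value) / (n + 1) for value in sample]


def conditional_cdf(u0, uj, copula_density, grid_size=200):
    """Midpoint-rule approximation of the integral of c(u0, s) for s in (0, uj).

    copula_density: callable (u0, s) -> density value; it stands in for an
    estimated bivariate copula density between the response and covariate j.
    """
    step = uj / grid_size
    return step * sum(copula_density(u0, (k + 0.5) * step) for k in range(grid_size))


# Step 1 on toy data (hypothetical values).
u_hat = pseudo_observations([3.0, 1.0, 2.0])
print(u_hat)

# Step 3 with the independence copula (density 1): the conditional c.d.f.
# then reduces to the pseudo-observation itself.
print(conditional_cdf(0.3, 0.6, lambda a, b: 1.0))
```

The midpoint rule never evaluates the density on the boundary of the unit square, which is convenient because kernel copula estimates may be unbounded there.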

### 2.2 Copula-based estimator for censored data

In the presence of censoring, the estimation equation (2.3) becomes inappropriate as we do not fully observe the response variables $T_i$. Instead, we only observe a sequence of i.i.d. triplets $(Y_i,\Delta_i,X_i)$, $i=1,\dots,n$, from $(Y,\Delta,X)$, where $Y=\min(T,C)$, $\Delta=\mathbb{1}(T\leq C)$ and $C$ denotes the censoring variable, assumed to be independent of $T$ given $X$. In order to take censoring into account in the estimation procedure, the first step is to note that, for any measurable function $\varphi$,

$$E\big(\varphi(T)\mid X=x\big)=E\left(\frac{\varphi(Y)\,\Delta}{1-G_C(Y^-|x)}\,\Big|\,X=x\right), \tag{2.6}$$

where $G_C(\cdot|x)$ denotes the conditional distribution of $C$ given $X=x$. This, along with (2.1), suggests a natural way to handle censoring for quantile regression by replacing the function $\varphi$ with the check function.
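Identity (2.6) can be checked numerically in a toy setting. The sketch below is a Monte Carlo illustration under strong simplifying assumptions that are not taken from the paper: no covariates, exponential survival and censoring times with hypothetical rates, a known censoring distribution, and the test function $\varphi(t)=\mathbb{1}(t\leq 1)$.

```python
import math
import random

random.seed(0)
n = 100_000
rate_T, rate_C = 1.0, 0.5  # hypothetical rates of the survival and censoring times

weighted_sum = 0.0
for _ in range(n):
    t = random.expovariate(rate_T)
    c = random.expovariate(rate_C)
    y = min(t, c)                       # observed time
    delta = 1.0 if t <= c else 0.0      # censoring indicator
    survival_C = math.exp(-rate_C * y)  # 1 - G_C(y-) for exponential censoring
    phi = 1.0 if y <= 1.0 else 0.0      # phi(t) = 1(t <= 1)
    weighted_sum += phi * delta / survival_C

estimate = weighted_sum / n               # right-hand side of (2.6), using (Y, Delta) only
target = 1.0 - math.exp(-rate_T * 1.0)    # left-hand side: P(T <= 1)
print(round(estimate, 3), round(target, 3))
```

Only the observed pairs $(Y,\Delta)$ enter the weighted average, yet it recovers a functional of the unobserved $T$; in practice the censoring distribution is unknown and has to be estimated.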

At this stage, a naive way of trying to take advantage of copula modelling would be to consider introducing copulas in the obtained conditional expectation. Note, however, that the conditional expectation in (2.6) is in fact the joint conditional expectation of $(Y,\Delta)$ given $X=x$. Adopting at this point a reasoning analogous to the one presented by Noh et al. for the uncensored case would therefore result in the insertion of the joint copula of $(Y,\Delta,X)$, hence exposing the estimation procedure to the lack of uniqueness of the copula given that $\Delta$ is a discrete (binary) variable. In this situation, to quote Embrechts (2009), “everything that can go wrong, will go wrong”. Details about copulas for discrete variables may be found in Genest and Nešlehová (2007).

Instead, the idea is to work on the joint conditional expectation so as to bypass the issues related to the copula of $(Y,\Delta,X)$. In short, our intention is to discard the problem by obtaining the copula of $(Y,X)$ conditionally on $\Delta=1$, for which no specific technical difficulties are involved. To that end, using the notations $F^u$ (resp. $f^u$) as a shorthand for a given distribution (resp. density) conditionally on $\Delta=1$, first note that

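$$E\left(\frac{\rho_\tau(Y-a)\,\Delta}{1-G_C(Y^-|x)}\,\Big|\,X=x\right)=\int_{\mathbb{R}^+}\rho_\tau(y-a)\,\frac{1}{1-G_C(y^-|x)}\,dF_{Y,\Delta|X}(y,1|x), \tag{2.7}$$

where $dF_{Y,\Delta|X}(y,1|x)=p(x)\,\dfrac{f^u_{YX}(y,x)}{f^u_X(x)}\,dy$, with $p(x)=P(\Delta=1|X=x)$ and where $f^u_{YX}$ and $f^u_X$ denote the conditional densities of $(Y,X)$ and $X$ given $\Delta=1$, respectively. Hence, using the definition of a copula function in a similar spirit as Noh et al., one obtains

$$dF_{Y,\Delta|X}(y,1|x)=p(x)\,\frac{c^u_{YX}\big(F^u_Y(y),F^u(x)\big)}{c^u_X\big(F^u(x)\big)}\,f^u_Y(y)\,dy,$$

where $c^u_{YX}$ is the copula density corresponding to the copula of $(Y,X)$ given $\Delta=1$, $c^u_X$ is the copula density of $X$ given $\Delta=1$, and $F^u(x)=(F^u_1(x_1),\dots,F^u_d(x_d))$. Inserting this last expression in (2.7), we may write

$$E\big[\rho_\tau(T-a)\mid X=x\big]=E\left[\frac{\rho_\tau(Y-a)\,\Delta}{1-G_C(Y^-|x)}\,\frac{p(x)}{P(\Delta=1)}\,\frac{c^u_{YX}\big(F^u_Y(Y),F^u(x)\big)}{c^u_X\big(F^u(x)\big)}\right].$$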

Applying this equality in the context of quantile regression, interestingly, one eventually retrieves an expression analogous to (2.2):

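$$m_\tau(x)=\arg\min_a E\big[\rho_\tau(Y-a)\,W(x)\,c^u_{YX}\big(F^u_Y(Y),F^u(x)\big)\big], \tag{2.8}$$

where $W(x)=\Delta/(1-G_C(Y^-|x))$; the factors $p(x)$, $P(\Delta=1)$ and $c^u_X(F^u(x))$ do not depend on $a$ and therefore drop out of the minimisation. Note that the copula in question in (2.8) is determined by strictly fully observed data. Hence, standard literature on copulas can be applied without any censoring related constraints. Given estimators $\hat G_C$, $\hat F^u_Y$ and $\hat F^u_j$ of $G_C$, $F^u_Y$ and $F^u_j$, $j=1,\dots,d$, satisfying certain high-level conditions which will be given in Section 3, this suggests estimating the quantile regression in the presence of censoring by the empirical analogue of (2.8), that is

$$\hat m_\tau(x)=\arg\min_a\sum_{i=1}^n\Big[\rho_\tau(Y_i-a)\,\hat W_i(x)\,\hat c^u_{YX}\big(\hat F^u_Y(Y_i),\hat F^u(x)\big)\Big], \tag{2.9}$$

where $\hat W_i(x)=\Delta_i/(1-\hat G_C(Y_i^-|x))$, and where $\hat c^u_{YX}$ denotes an estimator of $c^u_{YX}$ based on the four-step procedure described in Section 2.1. Explicitly,

$$\begin{aligned}\hat c^u_{YX}\big(\hat F^u_Y(y),\hat F^u(x)\big) ={}& \hat c^u_{YX_1}\big(\hat F^u_Y(y),\hat F^u_1(x_1)\big)\times\dots\times\hat c^u_{YX_d}\big(\hat F^u_Y(y),\hat F^u_d(x_d)\big)\\ &\times\hat c^u_{X_1\dots X_d|Y}\big(\hat F^u_{1|Y}(x_1|y),\dots,\hat F^u_{d|Y}(x_d|y)\big),\end{aligned}$$

where any two-dimensional kernel density estimator may be used for each bivariate copula density $c^u_{YX_j}$, $j=1,\dots,d$, such that condition 3 of Section 3 holds, and where $c^u_{X_1\dots X_d|Y}$ is estimated by standard parametric vine procedures.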

The resulting quantile regression estimator in (2.9) may then be viewed as a simple weighted quantile of the observed response variable, and is therefore easy to implement in practice using the efficient quantile regression code developed by Portnoy and Koenker (1997) and Koenker (2005). Nonetheless, in the context of multivariate covariates, the estimation of $G_C(\cdot|x)$ requires further assumptions to overcome dimension related issues. Popular choices in the literature include, among others, independence between $C$ and $X$, the Cox model or the single-index model for the conditional distribution of the censoring variable. Illustrations of such assumptions are treated in our simulation study.
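Under the simplest of these options, independence between the censoring variable and the covariates, the censoring weights can be obtained from a Kaplan-Meier estimate of the censoring distribution. The sketch below illustrates this under that assumption only; the function name is hypothetical, and tied observations are handled with a simple convention (all observations sharing a time receive the same left limit) that is an assumption of this example.

```python
def censoring_survival_left_limit(times, deltas):
    """Kaplan-Meier estimate of 1 - G_C(y-) at each observed time.

    Censored observations (delta == 0) play the role of 'events' here,
    since the target is the distribution of the censoring variable.
    """
    n = len(times)
    order = sorted(range(n), key=lambda i: times[i])
    left_limits = [0.0] * n
    survival = 1.0
    i = 0
    while i < n:
        j = i
        while j < n and times[order[j]] == times[order[i]]:
            j += 1
        at_risk = n - i
        censored_here = sum(1 - deltas[order[k]] for k in range(i, j))
        for k in range(i, j):
            left_limits[order[k]] = survival  # value just before this time
        survival *= 1.0 - censored_here / at_risk
        i = j
    return left_limits


# Toy data (hypothetical): observed times and censoring indicators.
times = [1.0, 2.0, 3.0, 4.0]
deltas = [1, 0, 1, 0]
left = censoring_survival_left_limit(times, deltas)
weights = [d / s if d == 1 else 0.0 for d, s in zip(deltas, left)]
print(weights)
```

Censored observations receive zero weight, and uncensored observations are up-weighted by the inverse of the estimated probability of remaining uncensored; multiplying these weights by the estimated copula density values gives the weights of the weighted quantile in (2.9).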

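As an interesting property, and similarly to the case without censoring, we note that the obtained regression function estimator is automatically monotonic across quantile levels. Applying arguments analogous to the ones adopted in the proof of Theorem 2.5 of Koenker (2005), one can indeed determine that

$$(\tau_2-\tau_1)\big(\hat m_{\tau_2}(x)-\hat m_{\tau_1}(x)\big)\sum_{i=1}^n\hat W_i(x)\,\hat c^u_{YX}\big(\hat F^u_Y(Y_i),\hat F^u(x)\big)\geq0. \tag{2.10}$$

Given that $\hat W_i(x)\,\hat c^u_{YX}(\hat F^u_Y(Y_i),\hat F^u(x))\geq0$ for all $i$, this signifies that $\hat m_{\tau_2}(x)\geq\hat m_{\tau_1}(x)$ for $\tau_2\geq\tau_1$.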
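For completeness, the argument behind (2.10) can be spelled out in a few lines. The derivation below is a sketch using the standard check function $\rho_\tau(u)=u(\tau-\mathbb{1}(u<0))$; the shorthand $\omega_i$ for the non-negative weights and $S_\tau$ for the weighted objective are notations introduced here for convenience and do not appear in the original text.

```latex
% omega_i and S_tau are shorthand introduced for this sketch only.
\begin{aligned}
&\omega_i=\hat W_i(x)\,\hat c^u_{YX}\big(\hat F^u_Y(Y_i),\hat F^u(x)\big)\geq 0,
\qquad S_\tau(a)=\sum_{i=1}^n\omega_i\,\rho_\tau(Y_i-a),\\[4pt]
&\rho_{\tau_2}(u)-\rho_{\tau_1}(u)=(\tau_2-\tau_1)\,u
\;\Longrightarrow\;
S_{\tau_2}(a)-S_{\tau_1}(a)=(\tau_2-\tau_1)\sum_{i=1}^n\omega_i\,(Y_i-a),\\[4pt]
&S_{\tau_1}(\hat m_{\tau_1})\leq S_{\tau_1}(\hat m_{\tau_2}),\quad
S_{\tau_2}(\hat m_{\tau_2})\leq S_{\tau_2}(\hat m_{\tau_1})
\;\Longrightarrow\;
S_{\tau_2}(\hat m_{\tau_2})-S_{\tau_1}(\hat m_{\tau_2})
\leq S_{\tau_2}(\hat m_{\tau_1})-S_{\tau_1}(\hat m_{\tau_1}),\\[4pt]
&(\tau_2-\tau_1)\sum_{i=1}^n\omega_i\,(Y_i-\hat m_{\tau_2})
\leq(\tau_2-\tau_1)\sum_{i=1}^n\omega_i\,(Y_i-\hat m_{\tau_1})
\;\Longleftrightarrow\;
(\tau_2-\tau_1)\big(\hat m_{\tau_2}-\hat m_{\tau_1}\big)\sum_{i=1}^n\omega_i\geq 0.
\end{aligned}
```

The two inequalities in the third line simply express that each estimator minimises its own objective; the rest is algebra, and no assumption on the copula estimator is needed beyond the non-negativity of the weights.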

In conclusion, in parallel to what has been stated for the uncensored case, the resulting estimator defines a rich class of estimators built on the many different existing methods available in the literature for estimating copula densities and marginal distributions of both complete and censored data.

## 3 Asymptotic Properties

We establish in this section the asymptotic normality of the proposed estimator $\hat m_\tau(x)$. To that end, we first report the set of regularity conditions as well as the required high-level conditions on all estimators involved in the expression of $\hat m_\tau(x)$. We then develop an asymptotic representation of our estimator for a general $d$-variate covariate. As the latter results in a somewhat unwieldy expression for the asymptotic bias and variance for a general multivariate covariate, and given that the analytical reasoning is similar in spirit, we eventually restrict ourselves to the detailed asymptotic expression for the case $d=2$.

For a fixed but arbitrary point of interest in the support of , denoted by supp(), let us suppose that there exists a neighborhood of such that the following regularity conditions hold:

• The conditional distribution of given admits a conditional density that is continuous, strictly positive and bounded uniformly on .

• The point of interest is such that and . Furthermore, , and .

• The point satisfies , for some .

• Denote and define . Then,

• .

• .

Concerning the high-level conditions, it is assumed that the multivariate copula density is estimated using the proposed four-step strategy of Section 2.1, and that, for simplicity, the bivariate kernel copula estimators of step 2.1 are based on the same bandwidth for a certain . The following conditions are then assumed to hold:

• The marginal c.d.f. estimators are such that:

• .

• , where and is an estimator of .

• , and , where and .

• The multivariate copula estimator is such that:

• , , where is the -th coordinate of .

• , , where denotes the partial derivative with respect to the -th argument.

Assumption 3 is standard in the context of quantile regression estimation. As for condition 3, this is similar to assumption (C3)-(i) in Noh et al. (2015) for the simplified case with no censoring, with an additional requirement on the conditional censoring probability that is resulting from the initial transformation of synthetic observations. Assumption 3 is likewise emanating from the handling of censoring through these observations, and is rather usual in survival analysis. Note that, in the quantile regression framework, the latter assumption amounts to defining a natural upper bound for the quantile of interest that can be studied. Assumption 3 reports a set of technical conditions to be met.

As regards conditions 3-3, 3 is routinely made in the copula framework. For instance, it is readily satisfied for the empirical distributions when only uncensored observations are taken into account, and their rescaled versions which are prominent in the copula literature. Assumption 3 imposes restrictions on the estimator one may consider for the conditional distribution of the censoring variable and is, for instance, fulfilled for a simple Kaplan-Meier estimator for (see e.g. Theorem 2.1 in Chen and Lo (1997) for sufficient and necessary conditions for 3). Lastly, the uniform consistency of the kernel density estimator required by assumption 3 is, for instance, alluded to in Geenens et al. (2014) for the probit-transformed copula estimator.

We now state the main result of this section, which holds for a general $d$-dimensional covariate vector and for all bivariate kernel copula estimators based on the same bandwidth $h$. In practice, however, it may be recommended to adopt an unconstrained and non-diagonal bandwidth matrix, as is detailed in Section 4 of Geenens et al. (2014). Nevertheless, when considering this general situation, the theoretical results become less tractable while remaining equivalent in nature to the simplified situation considered here.

###### Theorem 3.1.

Let be the common bandwidth of the bivariate kernel copula density estimators. For satisfying as , and under assumptions 3-3, we have

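$$(nh^2)^{1/2}\big(\hat m_\tau(x)-m_\tau(x)\big)=\frac{w(x)}{f_{T|X}(m_\tau(x)|x)}\,\frac{(nh^2)^{1/2}}{n}\sum_{i=1}^n\psi_\tau(\epsilon_i)\,W_i(x)\Big[\hat c^u_{YX}\big(F^u_Y(Y_i),F^u(x)\big)-c^u_{YX}\big(F^u_Y(Y_i),F^u(x)\big)\Big]+o_p(1),$$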

where is the conditional density of given and .

Theorem 3.1 implies, quite naturally, that the asymptotic behaviour of $\hat m_\tau(x)$ will be characterized by the properties of the copula estimator, specifically through its nonparametric feature, provided that the estimation of $G_C$ is ‘reasonable’ when confronted with a multidimensional covariate vector (assumption 3). In particular, this suggests that the detailed discussion of Geenens et al. about the asymptotic bias and variance of their distinctive bivariate copula estimators may be transcribed in our context.

Additionally, Theorem 3.1 also covers an asymptotic representation of the copula-based quantile regression estimator when all responses are fully observed. In this situation, one would indeed obtain a similar result for the proposed semiparametric procedure, with the removal of all censoring related terms, that is , , and the superfluous conditioning on for the copula densities and marginal distributions.

We now consider a detailed asymptotic representation of our estimator for the simplified case where $d=2$, and for a general nonparametric estimator of the bivariate copula densities. For convenience, we use the notation $c^u_k$ as a shorthand for $c^u_{YX_k}$, and similarly for other functions depending on $(Y,X_k)$, $k=1,2$.

###### Corollary 3.2.

Suppose that the assumptions of Theorem 3.1 hold for the case $d=2$. Furthermore, suppose that the bivariate nonparametric copula estimators of $c^u_1$ and $c^u_2$ are such that

$$(nh^2)^{1/2}\big(\hat c^u_k(u_0,u_k)-c^u_k(u_0,u_k)-h^2b_k(u_0,u_k)\big)=\frac{1}{\sqrt n}\sum_{j=1}^nZ_{njk}(u_0,u_k)+o_p(1),\qquad\forall u_k\in(0,1),$$

uniformly in $u_0\in(0,1)$, for $k=1,2$, for some deterministic function $b_k$, and for some function $Z_{njk}$ depending on and possibly on , satisfying , for all .

Define

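$$\begin{aligned}\tilde Z_{nj}(u_0,u) &=\big[Z_{nj1}(u_0,u_1)\,c^u_2(u_0,u_2)+Z_{nj2}(u_0,u_2)\,c^u_1(u_0,u_1)\big]\,c^u_{X_1X_2|Y}(u_1,u_2|u_0),\\ b_{YX}(u_0,u) &=\big[b_1(u_0,u_1)\,c^u_2(u_0,u_2)+b_2(u_0,u_2)\,c^u_1(u_0,u_1)\big]\,c^u_{X_1X_2|Y}(u_1,u_2|u_0),\\ \lambda_n(Y_i,\Delta_i,X_i,x) &=E\big[\psi_\tau(\epsilon)\,W(x)\,\tilde Z_{ni}\big(F^u(Y),F^u(x)\big)\,\big|\,Y_i,\Delta_i,X_i\big],\qquad i=1,\dots,n.\end{aligned}$$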

Suppose furthermore that the following technical conditions hold:

• .

• for all , where the expectation is taken with respect to and .

Then, the copula-based quantile regression estimator at any point of interest satisfying 3-3.2 is such that

$$(nh^2)^{1/2}\big(\hat m_\tau(x)-m_\tau(x)-h^2B(x)\big)\xrightarrow{\ \mathcal{L}\ }N\big(0,\sigma^2(x)\big),$$

where

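$$\begin{aligned}B(x) &=\frac{w(x)}{f_{T|X}(m_\tau(x)|x)}\,E\big(\psi_\tau(\epsilon)\,W(x)\,b_{YX}\big(F^u(Y),F^u(x)\big)\big)\quad\text{and}\\ \sigma^2(x) &=\frac{w^2(x)}{f^2_{T|X}(m_\tau(x)|x)}\,\lim_{n\to\infty}E\big(\lambda_n(Y,\Delta,X,x)^2\big).\end{aligned}$$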

Corollary 3.2 reports the asymptotic normality of our estimator at the expected convergence rate, implied by the nonparametric estimation of the bivariate copula densities. Depending on the choice in step 2.1 of the kernel density estimator fulfilling the conditions of Corollary 3.2, simple plug-in of the expression of , , in all quantities built upon the latter may then lead to the detailed, although arduous, expressions of the asymptotic bias and variance of the proposed estimator.

Furthermore, as this has not yet been covered, it is worth stressing that Corollary 3.2 also encompasses the asymptotic normality of the suggested estimator based on semiparametric vine copulas with strictly complete data. Similarly to what has been stated for Theorem 3.1, one is indeed only required to withdraw all censoring related elements from Corollary 3.2 to obtain the expressions of the asymptotic bias and variance of the proposed semiparametric quantile regression estimator for complete observations.

## 4 Simulation Study

In this section, we assess the practical finite-sample performance of the proposed methodology by means of Monte Carlo simulations. For this purpose, we first present a brief numerical study to further motivate the semiparametric copula strategy we intend to adopt for multivariate problems for both complete and censored observations. Secondly, focussing on survival data, we illustrate the flexibility of our estimator based on the proposed copula modelling by showing promising results with respect to competitors, including when the generated scenario is to the advantage of the latter. All the simulations are carried out using the statistical computing environment R (R Core Team (2014)) and its freely accessible packages.

### 4.1 Assessing the semiparametric copula estimation

This first section aims at numerically illustrating the choice of our semiparametric copula estimation strategy. For this purpose, we consider two distinctive data generating processes (DGP) and compare our methodology with fully parametric and nonparametric procedures one might consider for the estimation of a multivariate copula density. For the general simulation settings, we consider repetitions of each DGP; three (average) levels of censoring (0%, 30% and 50%), three sample sizes () and the quantile level of interest . As the object of interest here is the copula modelling, when censoring is introduced, we only consider the simple case of independence between the censoring variable and the covariate vector in order to keep the estimation of needed for (2.9) uncomplicated, that is, using the Kaplan-Meier estimator. The detailed DGPs are as follows:

1. DGP A: Gaussian copula with parameters . Given standard uniform marginal distributions for all three variables, the resulting true quantile regression may be calculated as (see Noh et al. (2015)). To include censoring, we introduce the variable , where the parameter is computed in order to obtain the desired average censoring proportion ( for 30% and for 50%).

2. DGP B: Gaussian copula with parameters , . The resulting true quantile regression for standard uniform marginal distributions is determined as . The censoring variable is ( for 30% and for 50% censoring).
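For readers who want to see how a true quantile regression of this kind can be obtained, the following is a minimal sketch for one response and two covariates with standard uniform margins. It relies on the standard conditional distribution of a multivariate normal vector. The function name and the correlation arguments `rho_y1`, `rho_y2`, `rho_12` are hypothetical and are not the parameter values used in DGP A or DGP B.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def gaussian_copula_quantile(tau, x1, x2, rho_y1, rho_y2, rho_12):
    """Conditional tau-quantile of Y given (X1, X2) = (x1, x2).

    Assumes (Y, X1, X2) has a Gaussian copula with the given pairwise
    correlations and standard uniform margins (illustrative values only).
    """
    # Move the covariates to the Gaussian scale.
    z1 = _STD_NORMAL.inv_cdf(x1)
    z2 = _STD_NORMAL.inv_cdf(x2)

    # Coefficients Sigma_yx * Sigma_xx^{-1} for a 2x2 covariate block.
    det = 1.0 - rho_12 ** 2
    b1 = (rho_y1 - rho_12 * rho_y2) / det
    b2 = (rho_y2 - rho_12 * rho_y1) / det

    cond_mean = b1 * z1 + b2 * z2
    cond_var = 1.0 - (b1 * rho_y1 + b2 * rho_y2)
    if cond_var <= 0.0:
        raise ValueError("correlations do not define a valid Gaussian copula")

    # Conditional quantile on the Gaussian scale, mapped back to (0, 1).
    z_tau = cond_mean + math.sqrt(cond_var) * _STD_NORMAL.inv_cdf(tau)
    return _STD_NORMAL.cdf(z_tau)
```

With all correlations set to zero the function returns `tau` itself, which is a convenient sanity check.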

For any general copula-based regression estimator, the marginal distribution estimations are performed, as suggested in Section 2.1, using rescaled versions of the empirical distributions:

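$$
\hat{F}^{u}_{Y}(y) = \frac{1}{n_u + 1} \sum_{i=1}^{n} \Delta_i \, \mathds{1}(Y_i \le y),
\qquad
\hat{F}^{u}_{j}(x_j) = \frac{1}{n_u + 1} \sum_{i=1}^{n} \Delta_i \, \mathds{1}(X_{ij} \le x_j), \quad j = 1, \ldots, d,
$$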

where $n_u$ is the number of uncensored observations.
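As an illustration, the rescaled empirical distribution above could be computed along the following lines. This is only a sketch: the function name `rescaled_ecdf` and the convention that `deltas[i] == 1` marks an uncensored observation are assumptions made for the example.

```python
import bisect


def rescaled_ecdf(values, deltas):
    """Return t -> (1 / (n_u + 1)) * sum_i deltas[i] * 1(values[i] <= t).

    Only uncensored observations (deltas[i] == 1) contribute, and n_u is
    their count, so the estimate never reaches 1.
    """
    uncensored = sorted(v for v, d in zip(values, deltas) if d == 1)
    n_u = len(uncensored)

    def cdf(t):
        # bisect_right counts the uncensored values that are <= t.
        return bisect.bisect_right(uncensored, t) / (n_u + 1)

    return cdf
```

Dividing by the number of uncensored observations plus one keeps the pseudo-observations strictly inside the unit interval, which matters when they are later passed to a copula density.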

For the different copula-based estimators, we consider the following procedures:

1. : semiparametric estimation strategy detailed in Section 2.1. That is, we first estimate the bivariate copulas of interest employing the probit transformation technique of Geenens et al. (2014) coupled with local likelihood estimation based on quadratic polynomials. To that end, we follow their proposed nearest-neighbor bandwidth selection procedure. Concerning the estimation of the -dimensional ‘noisy’ copula density (2.5), as mentioned above, we apply standard vine techniques using the R package VineCopula. Specifically, we adopt one automatically selected tree structure for the simplified decomposition of the copula density among many R-vine candidate structures (see Dißmann et al. (2013)), and subsequently determine the appropriate pair-copula family to be selected and parametrically estimated. The selection criterion for bivariate copulas is chosen to be the Akaike information criterion (AIC), which proved adequate in the R-vine context (see Brechmann (2010), chap. 5), and ten potential family candidates, together with their rotations, are considered: eight of them are Archimedean (Clayton, Gumbel, Frank, Joe, Clayton-Gumbel, Joe-Gumbel, Joe-Clayton and Joe-Frank), and the last two are elliptical (Gaussian and Student $t$).

2. : fully nonparametric estimation of the -dimensional copula using vine techniques. Specifically, while the vine structure is kept identical, here all bivariate building blocks are estimated using the local likelihood technique based on probit-projected data with, here again, the bandwidth selection procedure of Geenens et al. (as is studied in Nagler and Czado (2015)). Given its fully nonparametric nature, it should be mentioned that this estimator is not covered by the theoretical results of Section 3.

3. : fully parametric estimation of the -dimensional copula density, where all bivariate copulas are estimated using the previously mentioned candidate families and selection criteria. However, unlike the above-mentioned estimators, we do not impose here any structure on the vine decomposition. As a consequence, no explicit distinction is made between dependence of interest and noisy dependence. Instead, one data-driven selected structure is adopted, regardless of the arguments of Section 2.1. This will allow us to analyse the impact of such a dependence distinction in our regression context, as is discussed below. Finally, as is the case for , this estimator is not covered by the asymptotic theory of Section 3.
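To give a feel for the probit transformation idea used in the nonparametric building blocks, here is a simplified sketch for complete data. It is not the local likelihood estimator with quadratic polynomials described above: it swaps in a plain Gaussian product-kernel density estimate on the probit scale, with a fixed, user-supplied `bandwidth` instead of the nearest-neighbor selection. All function names are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def pseudo_observations(values):
    """Map a sample to ranks / (n + 1), so every value lies strictly in (0, 1)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return [r / (n + 1) for r in ranks]


def probit_copula_density(u_obs, v_obs, u, v, bandwidth):
    """Estimate the copula density c(u, v) via the probit transformation.

    Step 1: send the pseudo-observations to the real line with Phi^{-1}.
    Step 2: estimate the density f of the transformed pairs (plain KDE here).
    Step 3: back-transform, c(u, v) = f(s, t) / (phi(s) * phi(t)),
            where s = Phi^{-1}(u) and t = Phi^{-1}(v).
    """
    s_obs = [_STD_NORMAL.inv_cdf(p) for p in u_obs]
    t_obs = [_STD_NORMAL.inv_cdf(p) for p in v_obs]
    s = _STD_NORMAL.inv_cdf(u)
    t = _STD_NORMAL.inv_cdf(v)

    h = bandwidth
    kernel_sum = sum(
        math.exp(-0.5 * (((s - si) / h) ** 2 + ((t - ti) / h) ** 2))
        for si, ti in zip(s_obs, t_obs)
    )
    f_hat = kernel_sum / (len(s_obs) * 2.0 * math.pi * h * h)

    return f_hat / (_STD_NORMAL.pdf(s) * _STD_NORMAL.pdf(t))
```

Working on the probit scale avoids the boundary problems of kernel estimation on the unit square, since the transformed data have unbounded support.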

Both DGPs concentrate on the situation where the dependence structure between the response variable and the covariate vector is characterized by a parametric copula. In such circumstances, will have a critical advantage, and may serve in order to evaluate the impact of the nonparametric part of the estimation scheme, especially when the dimension of the covariate vector increases. As a performance criterion, we consider here the empirical integrated mean squared error (IMSE), defined as

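$$
\mathrm{IMSE}(\hat{m}_{\tau}(x)) = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{1}{B} \sum_{b=1}^{B} \left( \hat{m}^{(b)}_{\tau}(x_i) - m_{\tau}(x_i) \right)^2 \right),
$$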

where is a generated random sample of size serving as an evaluation set spread on the domain of , and denotes the regression estimation for the -th simulated sample.
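The criterion can be computed directly from a table of estimates, as in the following minimal sketch. The layout `estimates[b][i]` (estimate at the i-th evaluation point from the b-th simulated sample) and the function name are assumptions made for the example.

```python
def empirical_imse(estimates, truth):
    """Empirical IMSE over B simulated samples and N evaluation points.

    estimates[b][i]: estimated quantile at evaluation point i for sample b.
    truth[i]:        true quantile at evaluation point i.
    """
    n_samples = len(estimates)
    n_points = len(truth)

    total = 0.0
    for i in range(n_points):
        # Mean squared error at point i, averaged over the B samples.
        mse_i = sum((estimates[b][i] - truth[i]) ** 2 for b in range(n_samples))
        total += mse_i / n_samples

    # Average of the pointwise errors over the N evaluation points.
    return total / n_points
```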

The results of our simulation study are summarized in Table 1. Based on these, we detail our analysis in two parts, as the outcomes of our study offer relevant information on both the copula decomposition choice and the type of bivariate estimators one may adopt in the multivariate setting. Note that, for both DGPs, as the dependence structure is specified by a Gaussian copula, the simplifying assumption intrinsic to the vine decomposition is here applicable (see Theorem 4 in Stöber et al. (2013)). In our context, this means that any observed difference between copula strategies is not to be attributed to a possible violation of the underlying simplifying assumption.

Focussing first on our decomposition strategy, we note that, as expected, globally outperforms and . However, strikingly enough, this is not observed for DGP A for different censoring proportions, where yields better results. We interpret this as evidence for the validity of our arguments regarding the decomposition choice: as the censoring proportion grows, fewer observations actually enter the copula estimation, which has two opposite effects in this context. First, estimation errors propagate more strongly: the further we decompose, the more sensitive the estimation of the involved bivariate copulas becomes, as these depend on the quality of previously estimated bivariate blocks. Using a purely data-driven decomposition may then result in a poor fit of the (conditional) copula of the response variable with one of the covariates, as nothing requires the latter to be treated first. We interpret this as the reason why is able to outperform , admittedly by a small amount, when censoring increases for a fixed sample size , even though the simulated scenario stems from a purely parametric copula. On the other hand, when observations are scarcer, it is well known that nonparametric estimators become more sensitive than their parametric counterparts. This explains why the estimation results for are not superior to those of for with 50% censoring, as the former requires the nonparametric estimation of two bivariate copulas, whose complexity compared to seems to override the positive effects of our decomposition choice. Overall, these noteworthy results for DGP A illustrate the effectiveness of our proposed copula decomposition in the regression context.

When the dimension of the covariate vector increases, the price of now estimating three nonparametric bivariate copulas quite logically exceeds the potential gain of concentrating efforts on the dependence of interest. This is what we observe for DGP B.

Concentrating now on the modelling choice for the noisy dependence, the comparison between and