# Non-Smooth Backfitting for Excess Risk Additive Regression Model with Two Survival Time-Scales

###### Abstract

We present a new backfitting algorithm estimating the complex structured non-parametric survival model of Scheike (2001) without having to use smoothing. The considered model is a non-parametric survival model with two time-scales that are equivalent up to a constant that varies over the subjects. Covariate effects are modelled linearly on each time scale by additive Aalen models. Estimators of the cumulative intensities on the two time-scales are suggested by solving local estimating equations jointly on the two time-scales. We are able to estimate the cumulative intensities solving backfitting estimating equations without using smoothing methods and we provide large sample properties and simultaneous confidence bands. The model is applied to data on myocardial infarction providing a separation of the two effects stemming from time since diagnosis and age.

Key words: Aalen model, counting process, disability model, illness-death model, generalized additive models, multiple time-scales, non-parametric estimation, varying-coefficient models.

s O m \IfBooleanTF#1∥#3∥#2∥#3#2∥_L_2(Ω)

## 1 Introduction

In many bio-medical applications in survival analysis it is of interest and needed to use multiple time-scales. A medical study will often have a follow-up time (for example time since diagnosis) for patients of different ages, and here both time-scales will contain important but different information about how the risk of, for example, dying is changing. We therefore consider the situation with two time-scales that are equivalent up to a constant for each individual, such as for example follow-up time and age. One may see this as arising from the the illness-death model, or the disability model, where the additional time-scale may be duration in the illness state of the model; see Keiding (1991) for a general discussion of these models. There is rather limited work on how to deal with multiple time-scales in a biomedical context, see for example Oakes (1995); Iacobelli & Carstensen (2013) and Duchesne & Lawless (2000) and references therein. We present a non-parametric regression approach with two time-scales where each time-scale contribute additively to the mortality. The regression setting models the effect of covariates by additive Aalen models on each time-scale (Aalen, 1989; Huffer & McKeague, 1991; Andersen et al., 1993; Martinussen & Scheike, 2006). This allows covariates to have effects that vary on two different time-scales. In a motivating example we consider patients that experience myocardial infarction, and aim at predicting the intensity considering the two time-scales age and time since myocardial infarction. As a consequence, we can make survival predictions for patients given their age at diagnosis. This model was considered previously by Scheike (2001) where estimation was based on smoothing for one of the time-scales. A study closely related to ours is Kauermann & Khomski (2006) who studied the two most common time scales: age and duration. The underlying technical setting of Kauermann & Khomski (2006) was a multiplicative hazard model without covariates that is estimated via splines. In contrast our approach is an additive hazard model including covariates and estimating without smoothing. Alternative smoothing methodologies to multiplicative hazard estimation includes Linton et al. (2003); Huang (1999); Hastie & Tibshirani (1986); Lin et al. (2016). None of the known multiplicative hazard approaches including the ones mentioned above are able to estimate without smoothing, include time varying covariate-effects, or are able to provide simultaneous confidence bands as the additive approach of this paper does provide. We do know that smoothing improves efficiencies of cumulatively estimated quantities, see Guillen et al. (2007) for the simplest possible case. However, smoothing is also a complexity and experts applying survival analysis have developed a practical way of smoothing by eye the underlying rough non-parametric estimators of Kaplan & Meier (1958); Nelson (1972). The advantage of providing estimators without smoothing is that there can be no confusion from the complicated process of picking the smoothing procedure first and the amount of smoothing after that. Even if a smoothing approach is eventually used, then the smoothing free procedure would always count as a benchmark approach to check whether something went wrong during the smoothing. Our backfitting approach is different from standard backfitting in regression, see for example the smooth additive backtiffing approach of Mammen et al. (1999), where data is projected down via a smoothing kernel onto an additive subspace. In the backfitting approach of this paper, the non-parametric dynamics is only taking place in the two time directions, and the end result is therefore closer to the classical approach of Nelson (1972) with a non-smooth estimator of the dynamics in the one-dimensional time axis. What is obtained through Aalen’s additive hazard regression model on two time axis is that the dynamics of the two time effects are adjusted for covariaties in a way that keep the one-dimensional structure of the non-parametric dynamics. The expert user of survival methodology can therefore use the well developed intuition from looking at Nelson-Aalen estimators and Kaplan-Meier estimators when interpreting the empirical results based on the new methodology of this paper. Another advantage of estimating directly the cumulative hazards is that we are able to obtain a simple uniform asymptotic description of our estimators. We are thus able to construct confidence bands and intervals, that are based on bootstrapping the underlying martingales.

The paper is organised as follows. Section 2 presents the model via counting processes. Section 3 gives some least squares based local estimating equations that are solved to give simple explicit estimators of the non-parametric effects of the model. Based on these explicit estimators we are able to derive asymptotic results and provide the estimators with asymptotic standard errors. Sections 4-6 discusses how to solve the equations and compute the estimator practically and how deal with identifiability issues. Section 7 shows how the large sample properties may be derived and in Section 8 we construct confidence bands. Section 9 demonstrates the finite sample properties supporting Section 10 where we use our proposed methods in a worked example. Finally, Section 10 discusses some possible extensions.

## 2 Aalen’s Additive Hazard Model for Two Time-Scales

Let be independent counting processes that do not have common jumps and are adapted to a filtration that satisfy the usual conditions (Andersen et al., 1993). We assume that the counting processes have intensities given by

(1) |

where and are tupels of one dimensional deterministic functions, and are predictable cadlag covariate vectors with and having almost surely full rank, and is a real-valued random variable observed at time . If for all , does not need to be observed.

The model is the sum of two Additive Alalen Models running on two different time scales, see also Scheike(2001). The two time-scales are and where the latter time-scale is specific to each individual and is some lower-limit that depends on the observed range of the second time-scale. Note, that no indicator variables are introduced but are absorbed in the covariates. In the illness-death model, say, might be time since diagnosis (duration) among subjects that have entered the illness stage of the model and could be the age when the transition to the illness stage occurred, such that is the age of the subject.

After introducing some notation we present an estimation procedure that leads to explicit estimators of and . The cumulative effects have the advantage compared to and that they may be used for inferential purposes since a more satisfactory simultaneous convergence can be established for these processes. We derive the asymptotic distribution for these estimators and a bootstrapping procedure quantifying the estimation uncertainty. Based on the cumulative intensity one may estimate the intensity by smoothing techniques.

2.1 Notation Let such that are martingales. Let further be the n-dimensional counting process, is its compensator, such that is an n-dimensional martingale, and define matrices and with dimensions and , respectively. The individual entry times are summarised in one vector . A superscript denotes a shift in the argument, i.e, for a generic function , . For a generic matrix , with rows , and a n-dimensional vector , is defined through shifting the rows: . For a generic matrix , a minus superscript, , denotes the Moore-Penrose inverse. An integral, , with no limits denotes integration over the whole range.

## 3 Identification of the entering nonparametric parameters

In many cases some covariates will enter both the and the design. If this is the case, then the functions and are not identified in model (2) – constants can be shifted for the components that share the same covariate without altering the intensity. Without loss of generality we assume that and share the first columns, i.e., for all ,

We formulate the problem using group-theoretic arguments, see also Carstensen (2007); Kuang et al. (2008). Fix constants and define as valued function having all entries but the and the equal zero:

We define the group by

The identification problem can be rephrased as that the intensity defined in (2) is a function of , which is invariant to transformations . In the sequel we circumvent the identification issue by adding the following constraint

(2) |

noting that for any solution of model (2), there exists a unique solution that fulfills (2). Clearly other choices are also possible.

## 4 Least squares minimisation ignoring the identificating of the nonparametric parameters

We split the identification challenge in two. First we estimate ignoring identification of the parameters, and then we show in next section how to identify the estimated parameters. In this section we therefore ignore the identification problem keeping in mind that the solutions below are hence not unique. We motivate our estimator via the following least squares criteria.

where the integrals can be understood as Stieltjes integrals, noting that and are left continuous. Minimisation runs over all possible integrators. One can already see that the minimiser, if it exists, will be a step-function, since is a step function. To simplify notation we will generally work in matrix notation so that above minimisation criteria can also be written as

Straight forward computations utilzing calculus of variations lead to solving the following first order conditions for all , :

Rearranging yields

The last set of equations can be further rewritten to the backfitting equations

(3) | ||||

(4) |

where

###### Remark 1

In the case with no covariates, i.e.,

with , the risk indicators are

## 5 Establishing existence, identification and uniqueness of the estimator

In section 3 we outlined the identification problem but ignored it when establishing the estimator in the previous section. In this section we provide a fully identified estimator of our problem. When aiming to solve equations (3) and (4) the identification problem can no longer be ignored. In order to get a better grip of the situation we will now rewrite the backfitting equations as a linear operator equation. We can compress equations (3) and (4) into one matrix equation:

where with some miss-use of notation . Or even simpler

(5) |

with obvious notation, and linear operator :

Note that is composed of the marginal Aalen estimators of the two time scales, and . Additionally, the operator is compact because it is the composition of an integral operator, which is compact, and a derivative operator, which is bounded. The operator being compact means that it can be arbitrarily close approximated by a finite dimensional matrix which simplifies both the numerical and theoretical considerations. If the eigenvalues of are bounded away from one, then, is invertible and we have

Hence existence and uniqueness of our proposed estimator can be translated to properties of the eigenvalues of . One can for instance easily verify that if some covariates are both in the and the design, then will have an eigenvalue equal to one - as discussed in the following remark.

###### Remark 2

Consider the most simple case , i.e., . Given a constant , consider the pair of linear function with , as defined in Section 3. Assuming that and are bounded away from zero on the whole range , one can easily verify that

To see this, e.g., for the second equation, note

Hence, we have

So that one is clearly an eigenvalue of with corresponding eigenfunction . In other words the identification issue of the model carries over to the estimator. With analogue arguments one can show that in the more general case the eigenspace corresponding to eigenvalue equal one includes the functions in . Functions are defined in Section 3.

We now utilize constraint (2) and incorporate it into new backfitting equations:

(6) | ||||

(7) |

where is the q-dimensional vector . This translates to the new operator equation

(8) |

where . The next proposition states that the solutions of include all relevant solutions of (5) and that every solution of is a solution of .

###### Proposition 1

The proof can be found in the appendix.

With Proposition 1 at hand it is justified to define our estimator as the solution of (8). We will now discuss existence and uniqueness of the solution of (8).

Note that is known and hence one can calculate a numerical approximation of its eigenvalues by working on a grid. Consider the sub-space

It holds that , where is a projection into . We have . We can check whether equals . This can be done by calculating the dimension of the eigenspace of corresponding to an eigenvalue equal one. The dimension will be at least . If it is exactly , then .

The next proposition states that if , and , then both and are bijective.

###### Proposition 2

Assume that has Eigenvalue 1 with multiplicity . Then, will be bijective. If furthermore E has Eigenvalue 1 with multiplicity , then is bijective and hence invertible. In particular a solution of equations (8) exists and it is unique.

The proof can be found in the Appendix.

## 6 Calculating the estimator

There are two major ways of calculating the proposed estimator. Either one directly calculates and applies it on or something closer to an iterative procedure. For the latter, by iterative application of (8) we derive that

(10) |

If the absolute values of the eigenvalues of are bounded from above by a constant strictly smaller than 1, then (10) is well defined with , and the converging series

so that the iterative algorithm

(11) |

converges from any starting point. Note that (11) is the usual way the backfiting equations (6),(7) or equivalently (8) are solved. Another way is to calculate the finite sum

with some stopping criteria . We conclude that the proposed estimator can be calculated in a straight forward manner from the compound Aalen estimator and the operator

We now briefly discuss how can be calculated in the simple case . Here, can be approximated by a matrix where are the number of grid points in and , respectively. This is done by first calculating the values and for every grid point; see Remark 1 for the definitions of the the functions. We call the resulting matrices and . Afterwards, is derived from , via

The matrices are then transformed to the desired operator via

Finally,

So that given a function , one calculates its values on the grid and summarises it in a vector . The function is then approximated via where the latter is a simple matrix multiplication.

## 7 Asymptotics

Note that we have

(12) |

where arises from by replacing by . It is hereby quite remarkable that is the observable operator from the previous sections and not some asymptotic limit. We further conclude that the least square solution (6) and (7) is a plug-in estimator of (12). The estimation error is then given as

(13) |

As in the last section, If has eigenvalues all bounded away from one, then

So the asymptotic behaviour of can be deduced from the asymptotic behaviour of and , with the latter being the compound estimation error of two additive Aalen models on different time-scales.

###### Theorem 1

Under assumptions (A)–(G), the estimator exists. Furthermore the estimator is consistent:

in Skorohod space . Here, is treated as one stochastic process defined on by setting for and , . And similarly, for and , . The process is a dimensional mean-zero Gaussian process with covariation matrix described in the Appendix, and is the limit of .

The proof can be found in the Appendix.

## 8 Confidence Bands

While we could use the central limit theorem of the previous section to construct confidence bands, it has been suggested that better small sample performance can be achieved by directly bootstrapping the estimation error. We propose a wild bootstrap approach based on the relationship

Since ( is known, it is enough to to only approximate . We do this via the wild bootstrap version

or

where is a mean zero random variable with unit variance. The random variable is generated such that for fixed , it is independent to all other variables. It is straight forward to confirm that is a mean zero process that has the same covariance as (The covariance of is given in the appendix). Hence, we directly derive the following proposition.

###### Proposition 3

Under assumptions (A)–(G), the bootstrapped estimation error is uniformly consistent, i.e., for

in Skorohod space , where is is described in Theorem 1.

The proof can be found in the Appendix.

One useful consequence of this is that we can estimate standard errors of our estimator based on the approximation from the bootstrap. We denote these estimators as for the two components .

###### Corollary 1

Under assumptions (A)–(G), the bootstrapped errors lead to confidence bands for over providing an asymptotic coverage probability of , where

and

We explore the performance of the estimator of the standard error and the uniform bands in the next section.

## 9 Simulations

We generated data from the simple two-time scale model with age and duration that resemble the data we consider in worked example in the next section. Thus assuming that the hazard for those under risk is given as , where and the entry ages where drawn uniformly from but making sure that 10 % of the data started in to (to avoid difficulties with left truncation in the estimation). The component was piecewise constant with rate in the time-interval , then in and then finally to satisfy our constraint in , so that . All subjects were censored after years of follow up.

In all simulations we used a discrete approximation based on a time-grid of either 100 points in both the age direction and on the duration time-scale .

### 9.1 Bias of backfitting

We considered sample sizes 100, 200 and 400 and show the bias for the two-components in Table 1 based on 1000 realizations.

age | n=100 | n=200 | n=400 |
---|---|---|---|

time | n=100 | n=200 | n=400 |

We note that the the backfitting algorithm is almost unbiased across all sample size and improves as the sample size increases. This is despite the fact that the simulated component in the time-direction really is quite wild.

### 9.2 Bootstrap uncertainty

Secondly, we demonstrate that our bootstrap seems to work well to describe the uncertainty of the estimates. We simulated data as before and based on 1000 realisations with 100 bootstrap’s based on we estimated: a) the point-wise standard error for the two-components; b) computed the pointwise coverage baed on these; c) and constructed uniform confidence bands, as described in Corollary 1, for the the two components and its coverage.

Table 2 around here

n | age | mean se | sd | cov | time | mean se | sd | cov |
---|---|---|---|---|---|---|---|---|

We note that the standard error is well estimated by the bootstrapped standard deviation across all sample sizes and for both components. In addition the pointwise coverage is close to the nominal 95 % level for the larger sample sizes. But even for the coverage is reasonable for most time-points for the two components.

Finally, we also considered the performance of the confidence bands based on our bootstrap approach.

Table 3 around here

n | coverage (age) | coverage (time) |
---|---|---|

When gets larger these bands are quite close to the nominal 95 % level, but for the asymptotics have not quite set in to make the entire band work well.

## 10 Application to the TRACE study

The TRACE study group (see e.g. Jensen et al. (1997) ) has collected information on more than 4000 consecutive patients with acute myocardial infarction (AMI) with the aim of studying the prognostic importance of various risk factors on mortality. We here consider a subset of 1878 of these patients that are available in the timereg R package. At the age of entry (age of diagnosis) the patients had various risk factors recorded, but we here just show the simple model with the effects of the two-time-scales age and duration. It is expected that the duration time-scale has a strong initial effect of dying that then disappears when patients survive the first period right after their AMI.

We then estimated the two-time-scale model under the identifiability condition that . Restricting attention to patients more than 40 years of age, and within the first 5 duration years after the diagnosis.

First we estimate the mortality on the two time-scales separately, the two marginal estimates, see Figure 1. Panel (a) shows the cumulative hazard on the age time-scale with the marginal estimate (full line) and the one with adjustment for duration effects (broken line), and panel (b) the mortality on the duration time-scale with the marginal estimate (full line) and with adjustment for age effects (broken line). We note that on the duration time-scale the cumulative hazard is quite steep. In addition we show 95 % confidence bands based on our bootstrap (regions), and the pointwise confidence intervals (dotted line).

Figure 1 about here

Taking out the duration effect slightly alters the estimate of the age-effect. In contrast the duration effect is strongly confounded by age effect estimates, and here the two-time scale model more clearly demonstrates what is going on on the duration time-scale. The duration effect is strong initially and then after surviving the first 220 days we see a protective effect (dotted vertical line).

We stress that the interpretation of the hazards on the two-time scales are difficult, due to, for example, the constraint that needs to be imposed to identify a specific solution. Nevertheless, it very useful to see the components from the two time-scales that jointly make up the hazard for an individual, and can be used for the prediction purposes as we demonstrate further below. Note also that due to the additive structure the duration effect can be interpreted as giving relative survival due to the duration time-scale.

Figure 2 about here

In Figure 2 we show the survival predictions for subjects that are 60, 70, or 80, respectively, using the two-time scale model. Thus computing and constructing the confidence bands using the bootstrap approach for for . These curves are a direct consequence of having the two-components and are directly interpretable.

## 11 Discussion

By utilising the additive structure we have demonstrated that one can estimate the effect of two time-scales directly by a backfitting algorithm that does not involve smoothing. By working on the cumulative this also lead to uniform asymptotic description and a simple bootstrap procedure for getting estimates of the uncertainty and for constructing for example confidence intervals. These cumulative may form the basis for smoothing based estimates when the hazard are of interest, but often the cumulative are the quantities of key interest for example when interest is on survival predictions.

Clearly, the model could also be fitted by a more standard backfitting approach working on the hazard scale as in … for multiplicative hazard models.

Our backfitting approach can be extended for example the age-period-cohort model but here identifiability conditions are more complex to build into the estimation.

## Appendix A Proofs

### a.1 Proof of Proposition 1

With , the proposition directly follows from , and the fact that is a projection into .

### a.2 Proof of Proposition 2

Since the eigenspace of corresponding to the eigenvalue equal 1 has dimension , we know its exact form:

see also Remark 2. One can then verify that . This is because linear functions cannot be constructed as sum of a linear and non-linear functions. Noting that is a compact operator, we conclude that is an isomorphism from to . We introduce the operator , where is the projection onto . The condition is equivalent to and . Since the eigenspace of corresponding to an eigenvalue of 1 has dimension , is not true for non-linear . This is because when restricted on non-linear functions . When considering a linear , then . We conclude that the solution of is trivial. Hence the kern of is trivial. Since is compact this means is bijective, in particular invertible.

### a.3 Assumptions

We first define a few quantities.

For every in, , we define the following matrices

as well as