# Maximum likelihood and pseudo score approaches for parametric time-to-event analysis with informative entry times\thanksrefT1

###### Abstract

We develop a maximum likelihood estimating approach for time-to-event Weibull regression models with outcome-dependent sampling, where sampling of subjects is dependent on the residual fraction of the time left to developing the event of interest. Additionally, we propose a two-stage approach which proceeds by iteratively estimating, through a pseudo score, the Weibull parameters of interest (i.e., the regression parameters) conditional on the inverse probability of sampling weights; and then re-estimating these weights (given the updated Weibull parameter estimates) through the profiled full likelihood. With these two new methods, both the estimated sampling mechanism parameters and the Weibull parameters are consistently estimated under correct specification of the conditional referral distribution. Standard errors for the regression parameters are obtained directly from inverting the observed information matrix in the full likelihood specification and by either calculating bootstrap or robust standard errors for the hybrid pseudo score/profiled likelihood approach. Loss of efficiency with the latter approach is considered. Robustness of the proposed methods to misspecification of the referral mechanism and the time-to-event distribution is also briefly examined. Further, we show how to extend our methods to the family of parametric time-to-event distributions characterized by the generalized gamma distribution. The motivation for these two approaches came from data on time to cirrhosis from hepatitis C viral infection in patients referred to the Edinburgh liver clinic. We analyze these data here.

10.1214/14-AOAS725 \volume8 \issue2 2014 \firstpage726 \lastpage746 \setattributefrontmatterskip0plus 3minus 3 \setattributefrontmattercmd \setattributeabstractwidth280pt \newproclaimassumptionAssumption

Time-to-event analysis with informative entry times

, and \thankstextT1Supported in part by the Medical Research Council (Unit Programme numbers U105261167, U105260794).

Biased data \kwdgeneralized gamma distribution \kwdoutcome-dependent sampling \kwdpseudo score \kwdrobust standard error \kwdsurvival analysis \kwdWeibull distribution

## 1 Introduction

The modeling of the time from disease onset or infection (i.e., initiating event) to an outcome of relevance is of considerable importance in studies of the natural history of a disease and in projection of disease burden. Prospective studies which recruit and follow an appropriate cohort of subjects from disease onset to the event of interest are ideal for this purpose. However, these studies are inefficient in terms of resources if the event of interest tends to occur well after disease onset, as is the case for hepatitis C virus (HCV) studies of progression to cirrhosis from initial infection. The alternative is to follow a prevalent cohort of cross-sectionally sampled subjects who, prior to recruitment, have already experienced the initiating event (e.g., HCV infection) but not yet the event of interest (e.g., cirrhosis). The left truncated time-to-event data obtained from such a study provide a length-biased sample of the incident population, if sampling is such that an assumption of stationarity over calendar time for the occurrence of the initiating event can be made. Methods for handling both incidence data and such length-biased prevalence data have been well described in the (bio)statistics literature [Andersen et al. (1993), Wang, Brookmeyer and Jewell (1993), Kalbfleisch and Prentice (2002), Brookmeyer (2005), Keiding (2005), Wang (2005), Tsai (2009), Qin and Shen (2010)].

A less explored situation is the analysis of prevalence data arising from a referral cohort where entry into the cohort is dependent on a subject’s residual fraction of time remaining to the event of interest, and inference on the incident population is required. Such data are believed to occur in HCV studies conducted in tertiary care settings, where HCV patients are more likely to be referred to specialist clinics at later stages of disease [Fu et al. (2007)]. The conventional truncation likelihood approach which simply conditions on the time of entry into the cohort does not work here, as the referral time and the time to the event are correlated. The ignoring of this referral bias has led to higher rates of progression to cirrhosis being reported in studies in specialist clinics compared to those in community-based settings [Freeman et al. (2001)]. As cirrhosis linked to HCV infection is a major epidemic of the 21st century, it is extremely important to get an accurate picture of the present and future disease burden facing affected regions in order to inform public health decisions and actions.

The aforementioned type of referral or outcome-dependent sampling bias is particularly difficult to deal with unless a full specification (up to unknown parameters) of the probability sampling generating mechanism is provided. In practice, this mechanism will rarely be known and, instead, an approximate formulation of the sampling distribution, which is reasonably robust to misspecification, would be sought.

Previously, Fu, Tom and Bird (2009) proposed a weighted pseudo score [Lawless (1997), Cook and Lawless (2007)] or inverse probability weighted method for estimating the parameters of a Weibull regression model for the incubation period from infection to cirrhosis for the community of hepatitis C virus-infected individuals, when there is cirrhosis-related referral bias to the studied prevalent cohort. The method assumed that everyone in the community would come to clinical attention at or before cirrhosis, so that cirrhosis events are not missed. Therefore, the target community population was assumed “immortal” (in the sense of no competing events), and individuals observed in the study sample to have experienced a cirrhotic event were associated with a weight of one in the estimation procedure. However, for other individuals, Fu, Tom and Bird (2009) used approximate weights and, therefore, consistency of these estimated weights, and, consequently, the regression parameter estimates of interest, was, in general, not guaranteed.

Here we outline a full likelihood approach to this outcome-dependent referral problem in which the likelihood for the joint distribution of the time to referral and the time to outcome of interest, both from the initiating event, is fully specified. In practice, depending on the dimensionality of the joint parameter space, the full likelihood may be difficult to maximize over both the regression parameters of interest and the parameters associated with the time-to-entry process. Therefore, we also investigate another strategy based on a hybrid two-stage approach that iteratively alternates between estimating the parameters associated with the time-to-outcome distribution (i.e., regression and shape parameters) from a pseudo score with fixed weights and then estimating the parameters associated with the time-to-entry/referral process from the profiled full likelihood assuming the regression and shape parameters are known. We retain the assumption of an immortal cohort, although this can be relaxed [Copas and Farewell (2001)]. Primarily, we describe the approaches where the time-to-event distribution is assumed Weibull. However, we show how the methods can be extended to the family of parametric time-to-event distributions characterized by the generalized gamma distribution [Stacy (1962), Stacy and Mihram (1965), Prentice (1974), Farewell and Prentice (1977), Lawless (1980), Cox et al. (2007)], for which the Weibull is an important special case.

## 2 Notation, framework and assumptions

For individuals in the target/incident population, let the calendar time of the initiating event be and the calendar period of interest for inference on this population be between calendar times and . Therefore, . Clinical observation of an individual will be left truncated at their time of referral to the clinic which is the time of entry into the cohort for those referred before . Let the time intervals from to potential referral and to the event of interest be and , respectively, and denote by the vector of explanatory variables. We assume that the time-to-event from in the incident population comes from a Weibull distribution with support on the positive real line and with positive shape and scale parameters, and , respectively, where for given and is a vector of regression parameters associated with . More explicitly, the density and distribution functions of from an initiating event calendar time , and given the vector of explanatory variables , are and , respectively. As there is no dependence on the actual value of in these functions, we simplify the notation for the density and distribution functions of to and , respectively. Additionally, we assume, as is done for length-biased sampling problems, that within the calendar period , the rate of occurrence of the initiating event remains constant. The independence of the distribution of from when its initiating event occurred and the stationarity of the initiating event process within the calendar period of interest are together referred to as the steady state or equilibrium condition [Wang (2005)].

An individual is assumed to be included in the studied prevalent cohort if , with the indicator variable denoting selection/inclusion. In addition to the assumption that selected patients will experience the event of interest and be referred prior to the time of the event, we assume the following for the individuals in the target population.

[(Truncation before outcome)] The truncation (or potential referral or entry) time of an individual is always less than the time to outcome and so .

[(Conditional truncation time)] For a known vector , with , and , and unknown mixture probability vector with , the distribution of given (for ) is a mixture of independent uniform random variables with support in the interval , density function

and cumulative distribution function

The form chosen for this conditional density reflects the belief that the residual fraction, , of time remaining to the event of interest (or, alternatively, the fraction, , of event time elapsed) drives whether a subject is referred [Fu, Tom and Bird (2009)]. It is constructed as a mixture of uniforms so as to allow flexibility in the shape of distribution that can be captured. A notable feature of the random variable (for ), corresponding to the fraction of time elapsed to the event of interest, is its independence from (see theorem in the supplementary material [Tom, Farewell and Bird (2014)]). We will subsequently investigate the impact of misspecifying the partitioning of on results obtained.

For selected subjects (), denote by the censoring time from entry into the cohort, and let be the observed follow-up time until the outcome event or censoring, with the “right censoring” indicator taking the value when uncensored. As the calendar period of interest for inference on this population is between and , then for selected subjects, . That is, follow-up beyond is not planned. Additionally, we assume that is independent of [conditional on either or ] and that the parameters governing the distribution of are distinct from those governing the joint distribution of . That is, the censoring process is ignorable.

To proceed with estimation, we make the following further simplifying assumption:

[(Known initiation time)] The calendar time of the initiating event can be determined for those subjects selected for inclusion in the cohort.

## 3 Estimation methods

### 3.1 Maximum likelihood approach

Let be the number of individuals who have been selected into the cohort. For an included individual , let the observed data be , which are assumed to be independent realizations of . Under Assumptions 2 and 2, the ignorability of the censoring process and conditional on and , the full likelihood for , where , can be written (with, for conciseness, some abuse of notation for continuous variables) as

The first term in the product is the likelihood contribution if corresponds to the true time-to-event (i.e., ) and the second when a right censored event time is observed (i.e., ).

When and setting , it can be shown that

In the situation where (i.e., the hazard rate of increases over time), and defining , the denominator, , can be analytically evaluated and is found to be

(2) | |||

with the cumulative distribution function of a gamma random variable with rate and shape , evaluated at (), where and denote the lower and upper incomplete gamma functions and the ordinary gamma function. Details of the derivation are provided in the supplementary material [Tom, Farewell and Bird (2014)] for the family of parametric time-to-event distributions characterized by the generalized gamma distribution with either monotonically increasing or arc shaped (upside-down bathtub) hazards [Glaser (1980), Cox et al. (2007)].

For selected individuals with , the likelihood contribution in (3.1), , can be written as

where it can be shown (see the supplementary material [Tom, Farewell and Bird (2014)]) that when , the numerator, , takes the closed form

(3) | |||

For the case where (i.e., the hazard rate of is monotonically decreasing over time), similar closed-form expressions for and can be obtained but with the upper incomplete gamma function of the form replaced with in (3.1) and (3.1), where denotes the generalized exponential integral with and . However, for this present paper, we consider only , as it is difficult to envisage in our context a situation where an initially decreasing hazard rate over time would arise.

The maximum likelihood estimates, , for can now be obtained by substituting these various expressions for the terms in (3.1) into and then maximizing over . Estimates of the standard errors for are obtained from inverting the observed information matrix, , evaluated at .

### 3.2 Hybrid pseudo score/profile likelihood approach

As an alternative to the full likelihood approach, a pseudo score method based on inverse probability weights can be developed [Cook and Lawless (2007)]. We assume that the incident population has individuals with initiating event times occurring in the period to . The weighted pseudo score, , with , is constructed by weighting the Weibull score contributions, for selected subjects by (, where is the selection probability for subject . This weighted pseudo Weibull score, which has expectation zero, takes the form

For a selected study subject (i.e., ), is either if or if , with . The former probability expression evaluates to , as a subject who is observed to have experienced the event of interest would have and since (by Assumption 2), then, with probability , . The latter probability expression is shown in the supplementary material [Tom, Farewell and Bird (2014)] to be , which is a function of . These expressions are derived under the supposition that no further follow-up information on referred individuals beyond , the close of the study, is available. This reflects the situation in our application. However, these expressions can be easily modified to take account of further follow-up information beyond the close of study, as shown in the supplementary material [Tom, Farewell and Bird (2014)] for selected individuals with and . The former probability expression for an uncensored selected individual is trivially , where can now be greater than .

Estimation of under this second approach proceeds in two stages. First, , the vector of Weibull shape and regression parameters, is estimated by setting the pseudo score, , to zero and solving for with given to get the maximum weighted pseudo score estimates of . Next, the inclusion probabilities for selected subjects with are reevaluated at these maximum weighted pseudo score estimates and at the maximum profile likelihood estimate of obtained after maximizing over with set in (3.1) to its maximum weighted pseudo score estimates. These two steps are iterated until convergence of the estimates for to . Initially the inclusion probabilities are all assumed to take the value and, therefore, the initial estimate of is from the standard (unweighted) Weibull regression model. This iterative estimation procedure is similar to that used by Hardin and Hilbe (2003) for longitudinal data, although, to minimize efficiency loss, we do not adopt their assumption of orthogonality of the estimating equations.

Estimated standard errors based on this approach can be obtained either through a standard bootstrap procedure or determined based on Taylor series expansion arguments applied to the set of unbiased estimating equations and . Under appropriate regularity conditions, the asymptotic joint distribution of is Gaussian with expectation zero and variance–covariance matrix consistently estimated by the robust sandwich matrix evaluated at , where is

and , where for , with the dependency of on explicitly shown. With this extra notation, it is easily seen that and , and

### 3.3 Simulation study: Consistency, efficiency and robustness considerations

To illustrate the performance of the proposed methods, in particular, with regard to efficiency, bias and robustness, we conducted a small-scale simulation with a design similar to that in Fu et al. (2007), Fu, Tom and Bird (2009). We performed 500 simulation runs and generated, in each of the runs, a community sample size of . We considered three different time-to-event distributions from which to simulate our data. These were (i) the Weibull, (ii) the gamma and (iii) the log-normal. The parameter configurations for these three distributions were (i) , (ii) , and (iii) , corresponding to the shape parameters, and , scale parameter, , and the regression parameters associated with the covariate vector comprising of an intercept, a continuous variable, , generated from a log-normal distribution with location and scale parameters taking the values 3 and 0.3, respectively, and a binary variable, , generated from a Bernoulli distribution with success probability of . These covariates are included through a log-linear regression model on the Weibull’s and gamma’s scale parameters and a linear model on the log-normal’s location parameter. The parameters , , and were chosen to make the log baseline means from the regression models corresponding to the three distribution all equal to . The truncation times (in years), which are entry times for those selected, are generated from the conditional distribution proposed earlier with and . For simplicity in interpretation of the various simulation results to be presented, we assume everyone in the community experienced the initiating event at the same calendar date and those whose truncation time was less than years entered the referral cohort. Administrative right censoring of sampled subjects occurred at years from the calendar date of the initiating event. This was the only type of censoring considered here. The parameters of the Weibull distribution were informed by the data that arose from the Edinburgh Royal Infirmary’s hepatitis C virus liver clinic, which are analyzed later.

#### Correct specification of the time-to-event distribution

Table 1 presents the findings from the aforementioned simulation. Both approaches produce consistent estimates of the parameters from the Weibull model and the sampling mechanism when the time-to-event distribution was Weibull. As expected, more efficient estimates of the shape and regression parameters were obtained from the full likelihood approach than the hybrid pseudo score/profile likelihood approach. Similar estimated standard errors were obtained from both approaches for the corresponding estimates of . This perhaps reflects the near optimality of the hybrid approach when estimating since the relevant part of the full likelihood is being used.

Hybrid pseudo score/ | ||||||||
---|---|---|---|---|---|---|---|---|

Full likelihood | profile likelihood | |||||||

True distribution | Parameters | |||||||

Mean | ESE | Mean | ESE | RE | ||||

Weibull | ||||||||

Gamma | ||||||||

Log-normal | ||||||||

[]Mean, average of the estimate; , average of the estimated standard error; ESE, empirical standard error; , average of the estimated robust standard error; RE, the empirical variance of the maximum likelihood estimator divided by the empirical variance of the maximum hybrid pseudo score/profile likelihood estimator.

#### Misspecification of the time-to-event distribution

Table 1 also shows the results when the true time-to-event distributions are gamma and log-normal but the full likelihood and hybrid pseudo score/profile likelihood approaches were fitted assuming the time-to-event distribution was Weibull. Here we see that the impact of this incorrect assumption for the time-to-event distribution is negligible for the estimation of the regression parameters and and minor for the estimation of . The estimates of the log baseline mean under misspecification of the true gamma and log-normal distributions by the Weibull [i.e., ] were approximately and , respectively. As earlier mentioned, the true log baseline mean is . Therefore, for these particular cases of misspecification there is underestimation of the log baseline mean. This underestimation of the log baseline mean would result in an underestimation of the tail probabilities of the marginal population time-to-event distribution, which would lead to underestimation of the population size.

Hybrid pseudo score/ | ||||||
---|---|---|---|---|---|---|

Full likelihood | profile likelihood | |||||

Parameters | Mean | ESE | Mean | ESE | ||

[]Mean, average of the estimate; , average of the estimated standard error; ESE, empirical standard error; , average of the estimated robust standard error.

#### Misspecification of the referral mechanism

To investigate the relative robustness of the proposed mixture of uniforms for the conditional distribution of the truncation times given the time to event, we began by rerunning our simulation study as before, except with the conditional distribution of the truncation time now generated from a single uniform distribution in the interval zero to the true time to event instead of the five-component mixture of uniforms. However, the less parsimonious five-component mixture of uniforms, with , was assumed as the working conditional distribution of the truncation times when fitting the full likelihood and hybrid approaches to the simulated data at each simulation run. The results are shown in Table 2. The less parsimonious working conditional distribution for the truncation times has no apparent impact, as would be expected, on consistent estimation of the regression and shape parameters of the Weibull distribution. This suggests that finer partitions of than needed do not impact on consistency of estimated regression and shape parameters, although may inflate the standard errors of these estimates and the mixture probability estimates.

Hybrid pseudo score/ | ||||||
---|---|---|---|---|---|---|

Full likelihood | profile likelihood | |||||

Parameters | Mean | ESE | Mean | ESE | ||

[]Mean, average of the estimate; , average of the estimated standard error; ESE, empirical standard error; , average of the estimated robust standard error.

To explore the impact of misspecification due to the incorrect partitioning of , we again repeated the simulation study but now allowing the truncation times to be generated from an eight-component mixture of uniform conditional distribution, with and , mimicking a strong preference for referrals to occur in the last half of individuals’ incubation period. Additionally, we considered three scenarios for recruitment and administrative censoring, which reflected an increasing number of individuals referred and observed experiencing the event of interest and thus providing more information to the analysis: (i) ; (ii) ; and (iii) . We fitted the simulated data sets assuming the working five-component conditional truncation time distribution mentioned earlier, which is based on a coarser partitioning of than the true generating mechanism. The results are shown in Table 3. Here, we see a noticeable negative impact of misspecification, from a cruder partitioning of , on the estimates of the regression parameters and shape parameter (and mixture probabilities) when using the full likelihood approach (i.e., bias), which diminishes as the amount of information from the sample increases (as reflected by the diminishing standard errors). There is, however, no noticeable bias observed in the estimates of the regression parameters and shape parameter obtained using the hybrid approach. Moreover, there is no apparent bias, under this hybrid approach, in the mixture probabilities, except for , where the bias decreases as the information content increases. This result suggests that the hybrid approach is significantly more robust to misspecification than the full likelihood approach under various data scenarios, but at the cost of being less efficient in general. This perhaps is due to the hybrid approach being a two-stage method.

### 3.4 Application to Edinburgh Royal Infirmary’s hepatitis C virus liver clinic

The hepatitis C virus epidemic is a major public health concern in the UK and across the world. To project national hepatitis C virus burden, unbiased estimation of the progression rate from infection to liver cirrhosis is required for the whole community of hepatitis C viral infected individuals. Often, however, the available data on progression to cirrhosis are from a biased sample of the population of interest. In the application we consider here, the data on individuals infected with the hepatitis C virus prior to 2000 (i.e., within the calendar period 1950 to 2000) arose from the Edinburgh Royal Infirmary’s hepatitis C virus liver clinic, a tertiary referral hospital clinic whereby patients with more rapid disease progression, or symptomatic disease, would be preferentially referred, with referral increasingly likely to be closer to onset of cirrhosis. Thus, it is important to account for this outcome-dependent recruitment when analyzing these data so as to provide realistic estimates of the progression rates and the effects of risk factors on time to cirrhosis from infection for the Edinburgh’s community (of unknown size) of hepatitis C virus-infected individuals.

To investigate the pattern of referral over patients’ cirrhosis incubation period, we model the referral time given the cirrhosis time as coming from the probability density function

where are the unknown mixture probabilities, summing to , that are required to be estimated. This distribution is chosen because of the clinical belief that patients are more likely to be referred in the last half of their cirrhosis incubation period and we therefore decided to model this half in more detail.

Furthermore, we assume that, for the th hepatitis C virus patient in the community, the time to cirrhosis from known infection time follows a Weibull distribution with unknown shape and scale parameters, and , respectively. The scale parameter, , is related to the th patient’s continuous and binary explanatory variables, age at hepatitis C viral infection () and excessive alcohol consumption (), through the relationship , where is the vector of regression parameters.

Hybrid pseudo score/ | |||||
---|---|---|---|---|---|

Full likelihood | profile likelihood | ||||

Parameters | Estimate | s.e. | Estimate | Robust s.e. | Bootstrap s.e. |

Log-likelihood | 1259.78 | ||||

N (bootstrap IQR) | 4196 | (3414–5139) |

[]s.e., standard error; N, estimated size of Edinburgh’s hepatitis C virus community prior to 2000; IQR, inter-quartile range; and , regression coefficients corresponding to age at HCV infection and excessive alcohol consumption status.

Table 4 shows the results obtained on fitting the Edinburgh Royal Infirmary data using both the full likelihood and hybrid pseudo score/profile likelihood approaches. The bootstrap standard errors are obtained from a bootstrap sample of . Relatively similar regression parameter estimates and corresponding estimated standard errors are obtained from the two approaches. The belief by clinicians that referral was more likely in the last half of the cirrhosis period is borne out with about of infected individuals estimated to be referred then. Strikingly, about of infected individuals are estimated to have been referred in the last one eighth of their cirrhosis period. On repeating the analysis with a cruder representation of the referral mechanism based on partitioning only into halves produced fairly similar regression estimates under both approaches (data not shown) to those in Table 4. However, for the more variable Weibull shape parameter, there are noticeable differences in the estimates from this cruder referral mechanism to those previously obtained, in particular, for the full likelihood approach. The estimates (with standard errors) of the shape parameter are now 3.833 (0.360) and 4.755 (0.400) for the full likelihood and hybrid approaches, respectively. Additionally, the estimates of the probability of being referred in the last half of the incubation period, obtained assuming the cruder referral mechanism, are roughly equal under the two approaches but now calculated to be approximately as opposed to the previously estimated.

From the hybrid method, we obtain an estimate (bootstrap inter-quartile range) of 4196 (3414–5139) infected individuals in Edinburgh’s hepatitis C virus community prior to 2000, through the summation of the inverse probability weights. Both older age at onset of infection and excessive alcohol consumption are found statistically significantly to increase the rate of progression to cirrhosis. The corresponding relative risk estimates (with 95% confidence intervals) for age at infection onset and for excessive alcohol consumption are 1.13 (1.09, 1.18) and 13.4 (5.1, 35.3), respectively. The (inverse probability weighted) estimates of the mean age at HCV infection (with standard deviation) and the proportion consuming excessive alcohol in the Edinburgh HCV community prior to 2000 are 20.3 (6.3) years and 6.7%, respectively. For comparison, the mean age at HCV infection (with standard deviation) and the proportion consuming excessive alcohol from the Edinburgh Royal Infirmary clinic data are 22.4 (9.8) years and 30%, respectively. There is a striking difference in the community’s and clinic’s estimates of the proportion consuming excessive alcohol.

An estimated marginal 30-year progression rate (with sampling uncertainty) to cirrhosis from infection in the Edinburgh HCV community can also be calculated through a fast parametric bootstrap-like approach [Aalen et al. (1997)]. Here we repeatedly sample from the asymptotic distribution of , specified by a multivariate normal distribution with mean vector and variance–covariance matrix given by and the robust sandwich matrix, evaluated at . For each of these sampled parameter vectors, we define a corresponding hypothetical Edinburgh HCV community (prior to 2000) that can be entirely followed up to cirrhosis. The size, mean and standard deviation of the age at HCV infection and the excessive alcohol consumption proportion for each of these hypothetical communities are calculated by applying the inverse probability weights, calculated using the sampled ’s, to the Edinburgh Royal Infirmary data. The communities’ cirrhosis data can be constructed by generating the times to cirrhosis from the proposed Weibull model using the sampled regression and shape parameters, after first simulating the explanatory variables, age at HCV infection and alcohol consumption status, for the created communities. We assume that the age at HCV infection and alcohol consumption status distributions are independent from one another and are log-normal and Bernoulli, respectively, with mean and standard deviation (for the log normal) and excessive alcohol consumption proportion parameters arising from the application of the sampled , through the generated inverse probability weights, to the Edinburgh Royal Infirmary data. Our assumption of marginal independence is based on an estimated Pearson’s correlation between age at HCV infection and excessive alcohol consumption status of 0.012 in the actual collected data, which we do not anticipate to change dramatically when translated to the community.

The application of this fast parametric bootstrap-like approach, over 500 runs, gave a mean 30-year progression rate to cirrhosis in the hypothetical communities of 14% with a standard deviation of 6% and a 95% range of 6% to 29%. Figure 2 provides an example of a marginal Kaplan–Meier curve for one hypothetical Edinburgh community generated at the maximum weighted pseudo score estimates.

The estimated 30-year Kaplan–Meier progression rate (with 95% confidence interval) to cirrhosis based on the actual Edinburgh Royal Infirmary data, ignoring the outcome-dependent referral and left truncation, is 42% (31%, 52%). The corresponding conditional Kaplan–Meier estimate (conditioning on not experiencing cirrhosis at least roughly 1 year after infection), assuming that the Edinburgh Royal Infirmary data is a length-biased sample of the Edinburgh HCV community, is 86% with a 95% confidence interval of (75%, 92%). Both of these standard estimates dramatically overestimate the 30-year progression rate, as they do not account for the correlation between the referral time and the time to cirrhosis of a referred patient.

To check the robustness of our findings for the Edinburgh data, we implemented our two approaches replacing the assumption of a Weibull time-to-event distribution with the generalized gamma distribution (see the supplementary material [Tom, Farewell and Bird (2014)] for its formulation), which has one extra parameter and includes the Weibull, gamma and log-normal all as special cases. Although a likelihood ratio test on 1 degree of freedom () rejected the Weibull in favor of the generalized gamma, the maximum likelihood estimates for the regression parameters of interest were similar to those previously obtained ( and ) and the estimate of the proportion of infected individuals referred to in the last one eighth of their cirrhosis period was again . Furthermore, the estimated mean 30-year progression rate to cirrhosis was similar. On closer inspection, we found that the differences between the assumption of a generalized gamma and that of a Weibull for the time-to-event distribution was noticeable only in the upper tails of the estimated marginal time-to-event distributions of the Edinburgh HCV community, past the actual largest observed time to cirrhosis (i.e., an uncensored event of years) seen from the Edinburgh Royal Infirmary’s liver clinic. Similar to the marginal Kaplan–Meier curve presented under the assumption that the time-to-event distribution is Weibull, Figure 2 also displays the equivalent Kaplan–Meier curve under the assumption of the generalized gamma, and thus provides an illustration of the discrepancy between curves being evident in the upper tail beyond years.

## 4 Discussion

A weighted pseudo score method is commonly suggested for handling response-biased observations, where specifying the full likelihood is difficult. Provided that the inverse probability weights can be consistently estimated, then consistency of regression parameter estimates will be achieved using this approach. However, if the full likelihood is available, then it is generally preferable to use it to estimate the parameters of interest, as these estimates will be more efficient than those obtained from the weighted pseudo score method. This preference for the full likelihood over the weighted pseudo score method also holds when the time-to-event distribution is misspecified. In the context of misspecification, we would advocate fitting the more flexible generalized gamma time-to-event distribution (or an alternative such as a semi-parametric piecewise exponential-type distribution), instead of the Weibull, in order to get better estimates of the marginal progression rates. Nevertheless, it is worth noting that a by-product of the weighted pseudo score approach, which makes it appealing, is the straightforward estimation of the total incident population size. This is not directly available (although calculable) from the full likelihood approach. In public health terms, estimation of the total number of HCV carriers and the “true” impact of covariates on HCV progression are key.

In the informative entry time problem addressed here, we were able to develop both a full likelihood approach and a hybrid two-stage pseudo score/profile likelihood approach for outcome-dependent referral where sampling is dependent on the residual fraction of time remaining to develop the event. Under correct specification of the referral mechanism, we found that the full likelihood approach was indeed more efficient than the hybrid approach in the estimation of the regression parameters of interest and the shape parameter. The former approach, however, appeared to be more susceptible to bias if the outcome-dependent referral mechanism was misspecified through a coarser representation and the “information content” of the data (in terms of number of referrals, number of events and length of follow-up) was low. In the situation where the “information content” is considered to be relatively high, it perhaps would be more appealing to adopt the hybrid method over the full likelihood approach, as it could be significantly more robust and the decrease in the resulting efficiency may still be acceptable. In general, we would recommend that when using either approach, and, in particular, the full likelihood one, analysts should begin by specifying a reasonably fine partitioning of the which can then be refined to obtain a more parsimonious representation of the referral mechanism. This strategy would allow checking for sensitivity/robustness to misspecification of the referral mechanism. However, convergence issues may arise if the partitioning is too fine or if the selection probability for a subject is too small. We have not investigated these convergence issues here.

The application of these methods to data from the Edinburgh Royal Infirmary’s hepatitis C virus liver clinic allowed us to characterize realistically the extent of Edinburgh’s HCV epidemic prior to 2000 in terms of progression rate to cirrhosis and the impact of alcohol consumption and age at HCV infection on this progression. Standard survival analysis methods severely overestimated the 30-year progression rate and underestimated the relative risks for the explanatory variables.

In our present analysis of the Edinburgh HCV data, we assumed that the time of infection was known. This simplifying assumption was thought reasonable in our case, since even when the times of HCV infection in the Edinburgh liver clinic were uncertain, this uncertainty tended to be only in the determination of the exact date of infection within a calendar year or two. As the mean incubation period to cirrhosis is several orders of magnitude greater than the size of this interval, we expect that, for the analysis of our data set, the added uncertainty in estimation due to this imprecision in timing of infection will be inconsequential. However, in other applications where the timing of the initiation event (e.g., cancer or HIV infection onset) is known only to within an interval, which may be quite large, and where either the mean time to the event of interest (e.g., death or AIDS) or the mean follow-up time are not of an appreciably long enough length compared to the mean width of these intervals, the implications for analyses of assuming-known initiation time (e.g., by choosing the mid-point of the interval) can be major. Struthers and Farewell (1989) discuss an approach to account for unknown onset times, when the time of onset is known only to be in an interval, say, . This approach requires the specification of a density, say, , for the time of infection (e.g., a uniform distribution) over the interval . The likelihood to be optimized over the parameters then takes the form , where is the likelihood contribution from the th subject, given known infection time, and the density, , for this subject’s time of infection may be specified up to an unknown parameter vector, . Therefore, it can be seen that this approach can be adapted to our situation where the sampling is dependent on the residual fraction of time left to developing the event of interest and the onset time is known only to within an interval. However, careful thought is required on the most appropriate form for . For example, in HCV studies where the majority of subjects are injecting drug users and when time of HCV infection is unknown, there is evidence to suggest that infection occurs earlier in a subject’s injecting career [Hutchinson, Bird and Goldberg (2005), Hagan et al. (2008), De Angelis et al. (2009)].

Future application of these methods to the HCV epidemic in Scotland, more generally, is planned with Health Protection Scotland. Health Protection Scotland has developed a clinical database on referrals of HCV patients to liver clinics across all regions of Scotland. Application of our methodology should provide regional estimates for the number of HCV carriers in Scotland and will allow us to examine if the “true” impact of covariates (such as age at HCV infection and heavy alcohol use) are stable across regions although the covariate distribution may differ between regions.

## Acknowledgments

We would like to acknowledge the anonymous referees and Editor for their insightful comments and suggestions which have helped improve the paper.

Appendix: Derivations of the expressions based on the generalized gamma and mixture of uniforms \slink[doi]10.1214/14-AOAS725SUPP\sdatatype.pdf \sfilenameAOAS725_supp.pdf \sdescriptionProofs of the various expressions required in the constructing of the likelihood and pseudo score based on the assumption that the time-to-event distribution is from a generalized gamma distribution and the conditional referral distribution is a mixture of independent uniforms.

## References

- Aalen et al. (1997) {barticle}[pbm] \bauthor\bsnmAalen, \bfnmO. O.\binitsO. O., \bauthor\bsnmFarewell, \bfnmV. T.\binitsV. T., \bauthor\bsnmAngelis, \bfnmD. De\binitsD. D., \bauthor\bsnmDay, \bfnmN. E.\binitsN. E. \AND\bauthor\bsnmGill, \bfnmO. N.\binitsO. N. (\byear1997). \btitleA Markov model for HIV disease progression including the effect of HIV diagnosis and treatment: Application to AIDS prediction in England and Wales. \bjournalStat. Med. \bvolume16 \bpages2191–2210. \bidissn=0277-6715, pii=10.1002/(SICI)1097-0258(19971015)16:19¡2191::AID-SIM645¿3.0.CO;2-5, pmid=9330428 \bptokimsref\endbibitem
- Andersen et al. (1993) {bbook}[mr] \bauthor\bsnmAndersen, \bfnmPer Kragh\binitsP. K., \bauthor\bsnmBorgan, \bfnmØrnulf\binitsØ., \bauthor\bsnmGill, \bfnmRichard D.\binitsR. D. \AND\bauthor\bsnmKeiding, \bfnmNiels\binitsN. (\byear1993). \btitleStatistical Models Based on Counting Processes. \bpublisherSpringer, \blocationNew York. \biddoi=10.1007/978-1-4612-4348-9, mr=1198884 \bptokimsref\endbibitem
- Brookmeyer (2005) {bincollection}[author] \bauthor\bsnmBrookmeyer, \bfnmR.\binitsR. (\byear2005). \btitleBiased sampling of cohorts. In \bbooktitleEncyclopedia of Biostatistics, \bedition2nd ed. (\beditor\bfnmP.\binitsP. \bsnmArmitage \AND\beditor\bfnmT.\binitsT. \bsnmColton, eds.) \bpages427–439. \bpublisherWiley, \blocationNew York. \bptokimsref\endbibitem
- Cook and Lawless (2007) {bbook}[author] \bauthor\bsnmCook, \bfnmR. J.\binitsR. J. \AND\bauthor\bsnmLawless, \bfnmJ. F.\binitsJ. F. (\byear2007). \btitleThe Statistical Analysis of Recurrent Events. \bpublisherSpringer, \blocationBerlin. \bptokimsref\endbibitem
- Copas and Farewell (2001) {barticle}[pbm] \bauthor\bsnmCopas, \bfnmA. J.\binitsA. J. \AND\bauthor\bsnmFarewell, \bfnmV. T.\binitsV. T. (\byear2001). \btitleIncorporating retrospective data into an analysis of time to illness. \bjournalBiostatistics \bvolume2 \bpages1–12. \biddoi=10.1093/biostatistics/2.1.1, issn=1468-4357, pii=2/1/1, pmid=12933553 \bptokimsref\endbibitem
- Cox et al. (2007) {barticle}[mr] \bauthor\bsnmCox, \bfnmChristopher\binitsC., \bauthor\bsnmChu, \bfnmHaitao\binitsH., \bauthor\bsnmSchneider, \bfnmMichael F.\binitsM. F. \AND\bauthor\bsnmMuñoz, \bfnmAlvaro\binitsA. (\byear2007). \btitleParametric survival analysis and taxonomy of hazard functions for the generalized gamma distribution. \bjournalStat. Med. \bvolume26 \bpages4352–4374. \biddoi=10.1002/sim.2836, issn=0277-6715, mr=2405358 \bptokimsref\endbibitem
- De Angelis et al. (2009) {barticle}[mr] \bauthor\bsnmDe Angelis, \bfnmD.\binitsD., \bauthor\bsnmSweeting, \bfnmM.\binitsM., \bauthor\bsnmAdes, \bfnmA. E.\binitsA. E., \bauthor\bsnmHickman, \bfnmM.\binitsM., \bauthor\bsnmHope, \bfnmV.\binitsV. \AND\bauthor\bsnmRamsay, \bfnmM.\binitsM. (\byear2009). \btitleAn evidence synthesis approach to estimating hepatitis C prevalence in England and Wales. \bjournalStat. Methods Med. Res. \bvolume18 \bpages361–379. \biddoi=10.1177/0962280208094691, issn=0962-2802, mr=2750101 \bptokimsref\endbibitem
- Farewell and Prentice (1977) {barticle}[author] \bauthor\bsnmFarewell, \bfnmV. T.\binitsV. T. \AND\bauthor\bsnmPrentice, \bfnmR. L.\binitsR. L. (\byear1977). \btitleA study of distributional shape in life testing. \bjournalTechnometrics \bvolume19 \bpages69–75. \bptokimsref\endbibitem
- Freeman et al. (2001) {barticle}[pbm] \bauthor\bsnmFreeman, \bfnmA. J.\binitsA. J., \bauthor\bsnmDore, \bfnmG. J.\binitsG. J., \bauthor\bsnmLaw, \bfnmM. G.\binitsM. G., \bauthor\bsnmThorpe, \bfnmM.\binitsM., \bauthor\bsnmOverbeck, \bfnmJ. Von\binitsJ. V., \bauthor\bsnmLloyd, \bfnmA. R.\binitsA. R., \bauthor\bsnmMarinos, \bfnmG.\binitsG. \AND\bauthor\bsnmKaldor, \bfnmJ. M.\binitsJ. M. (\byear2001). \btitleEstimating progression to cirrhosis in chronic hepatitis C virus infection. \bjournalHepatology \bvolume34 \bpages809–816. \biddoi=10.1053/jhep.2001.27831, issn=0270-9139, pii=S0270-9139(01)33319-0, pmid=11584380 \bptokimsref\endbibitem
- Fu, Tom and Bird (2009) {barticle}[mr] \bauthor\bsnmFu, \bfnmBo\binitsB., \bauthor\bsnmTom, \bfnmBrian D. M.\binitsB. D. M. \AND\bauthor\bsnmBird, \bfnmSheila M.\binitsS. M. (\byear2009). \btitleRe-weighted inference about hepatitis C virus-infected communities when analysing diagnosed patients referred to liver clinics. \bjournalStat. Methods Med. Res. \bvolume18 \bpages303–320. \biddoi=10.1177/0962280208094688, issn=0962-2802, mr=2750070 \bptokimsref\endbibitem
- Fu et al. (2007) {barticle}[author] \bauthor\bsnmFu, \bfnmB.\binitsB., \bauthor\bsnmTom, \bfnmB. D. M.\binitsB. D. M., \bauthor\bsnmDelahooke, \bfnmT.\binitsT., \bauthor\bsnmAlexander, \bfnmG. J. M.\binitsG. J. M. \AND\bauthor\bsnmBird, \bfnmS. M.\binitsS. M. (\byear2007). \btitleEvent-biased referral can distort estimation of hepatitis C virus progression rate to cirrhosis, and of prognostic influences. \bjournalJ. Clin. Epidemiol. \bvolume60 \bpages1140–1148. \bptokimsref\endbibitem
- Glaser (1980) {barticle}[mr] \bauthor\bsnmGlaser, \bfnmRonald E.\binitsR. E. (\byear1980). \btitleBathtub and related failure rate characterizations. \bjournalJ. Amer. Statist. Assoc. \bvolume75 \bpages667–672. \bidissn=0162-1459, mr=0590699 \bptokimsref\endbibitem
- Hagan et al. (2008) {barticle}[author] \bauthor\bsnmHagan, \bfnmH.\binitsH., \bauthor\bsnmPouget, \bfnmE. R.\binitsE. R., \bauthor\bsnmDes Jarais, \bfnmD. C.\binitsD. C. \AND\bauthor\bsnmLelutiu-Weinberger, \bfnmC.\binitsC. (\byear2008). \btitleMeta-regression of hepatitis C virus infection in relation to time since onset of illicit drug injection: The influence of time and place. \bjournalAm. J. Epidemiol. \bvolume168 \bpages1099–1109. \bptokimsref\endbibitem
- Hardin and Hilbe (2003) {bbook}[mr] \bauthor\bsnmHardin, \bfnmJames W.\binitsJ. W. \AND\bauthor\bsnmHilbe, \bfnmJoseph M.\binitsJ. M. (\byear2003). \btitleGeneralized Estimating Equations. \bpublisherChapman & Hall, \blocationLondon. \bidmr=2000388 \bptokimsref\endbibitem
- Hutchinson, Bird and Goldberg (2005) {barticle}[author] \bauthor\bsnmHutchinson, \bfnmS. J.\binitsS. J., \bauthor\bsnmBird, \bfnmS. M.\binitsS. M. \AND\bauthor\bsnmGoldberg, \bfnmD. J.\binitsD. J. (\byear2005). \btitleModelling the current and future disease burden of hepatitis C among injecting drug users in Scotland. \bjournalHepatology \bvolume42 \bpages711–723. \bptokimsref\endbibitem
- Kalbfleisch and Prentice (2002) {bbook}[mr] \bauthor\bsnmKalbfleisch, \bfnmJohn D.\binitsJ. D. \AND\bauthor\bsnmPrentice, \bfnmRoss L.\binitsR. L. (\byear2002). \btitleThe Statistical Analysis of Failure Time Data, \bedition2nd ed. \bpublisherWiley, \blocationNew York. \biddoi=10.1002/9781118032985, mr=1924807 \bptokimsref\endbibitem
- Keiding (2005) {bincollection}[author] \bauthor\bsnmKeiding, \bfnmN.\binitsN. (\byear2005). \btitleDelayed entry. In \bbooktitleEncyclopedia of Biostatistics, \bedition2nd ed. (\beditor\bfnmP.\binitsP. \bsnmArmitage \AND\beditor\bfnmT.\binitsT. \bsnmColton, eds.) \bpages1404–1409. \bpublisherWiley, \blocationNew York. \bptokimsref\endbibitem
- Lawless (1980) {barticle}[mr] \bauthor\bsnmLawless, \bfnmJ. F.\binitsJ. F. (\byear1980). \btitleInference in the generalized gamma and log gamma distributions. \bjournalTechnometrics \bvolume22 \bpages409–419. \biddoi=10.2307/1268326, issn=0040-1706, mr=0585636 \bptokimsref\endbibitem
- Lawless (1997) {bincollection}[mr] \bauthor\bsnmLawless, \bfnmJ. F.\binitsJ. F. (\byear1997). \btitleLikelihood and pseudo likelihood estimation based on response-biased observation. In \bbooktitleSelected Proceedings of the Symposium on Estimating Functions (Athens, GA, 1996) (\beditor\bfnmV. B.\binitsV. B. \bsnmIshwar, \beditor\bfnmV. P.\binitsV. P. \bsnmGodambe \AND\beditor\bfnmR. L.\binitsR. L. \bsnmTaylor, eds.). \bseriesInstitute of Mathematical Statistics Lecture Notes—Monograph Series \bvolume32 \bpages43–55. \bpublisherIMS, \blocationHayward, CA. \biddoi=10.1214/lnms/1215455037, mr=1837796 \bptokimsref\endbibitem
- Prentice (1974) {barticle}[mr] \bauthor\bsnmPrentice, \bfnmR. L.\binitsR. L. (\byear1974). \btitleA log gamma model and its maximum likelihood estimation. \bjournalBiometrika \bvolume61 \bpages539–544. \bidissn=0006-3444, mr=0378212 \bptokimsref\endbibitem
- Qin and Shen (2010) {barticle}[mr] \bauthor\bsnmQin, \bfnmJing\binitsJ. \AND\bauthor\bsnmShen, \bfnmYu\binitsY. (\byear2010). \btitleStatistical methods for analyzing right-censored length-biased data under Cox model. \bjournalBiometrics \bvolume66 \bpages382–392. \biddoi=10.1111/j.1541-0420.2009.01287.x, issn=0006-341X, mr=2758818 \bptokimsref\endbibitem
- Stacy (1962) {barticle}[mr] \bauthor\bsnmStacy, \bfnmE. W.\binitsE. W. (\byear1962). \btitleA generalization of the gamma distribution. \bjournalAnn. Math. Statist. \bvolume33 \bpages1187–1192. \bidissn=0003-4851, mr=0143277 \bptokimsref\endbibitem
- Stacy and Mihram (1965) {barticle}[mr] \bauthor\bsnmStacy, \bfnmE. W.\binitsE. W. \AND\bauthor\bsnmMihram, \bfnmG. A.\binitsG. A. (\byear1965). \btitleParameter estimation for a generalized gamma distribution. \bjournalTechnometrics \bvolume7 \bpages349–358. \bidissn=0040-1706, mr=0192586 \bptokimsref\endbibitem
- Struthers and Farewell (1989) {barticle}[author] \bauthor\bsnmStruthers, \bfnmC. A.\binitsC. A. \AND\bauthor\bsnmFarewell, \bfnmV. T.\binitsV. T. (\byear1989). \btitleA mixture model for time to AIDS data with left truncation and an uncertain origin. \bjournalBiometrika \bvolume76 \bpages814–817. \bptokimsref\endbibitem
- Tom, Farewell and Bird (2014) {bmisc}[author] \bauthor\bsnmTom, \bfnmB. D. M.\binitsB. D. M., \bauthor\bsnmFarewell, \bfnmV. T.\binitsV. T. \AND\bauthor\bsnmBird, \bfnmS. M.\binitsS. M. (\byear2014). \bhowpublishedSupplement to “Maximum likelihood and pseudo score approaches for parametric time-to-event analysis with informative entry times.” DOI:\doiurl10.1214/14-AOAS725SUPP. \bptokimsref \endbibitem
- Tsai (2009) {barticle}[mr] \bauthor\bsnmTsai, \bfnmWei Yann\binitsW. Y. (\byear2009). \btitlePseudo-partial likelihood for proportional hazards models with biased-sampling data. \bjournalBiometrika \bvolume96 \bpages601–615. \biddoi=10.1093/biomet/asp026, issn=0006-3444, mr=2538760 \bptokimsref\endbibitem
- Wang (2005) {bincollection}[author] \bauthor\bsnmWang, \bfnmM.-C.\binitsM.-C. (\byear2005). \btitleLength bias. In \bbooktitleEncyclopedia of Biostatistics, \bedition2nd ed. (\beditor\bfnmP.\binitsP. \bsnmArmitage \AND\beditor\bfnmT.\binitsT. \bsnmColton, eds.) \bpages2756–2759. \bpublisherWiley, \blocationNew York. \bptokimsref\endbibitem
- Wang, Brookmeyer and Jewell (1993) {barticle}[mr] \bauthor\bsnmWang, \bfnmMei-Cheng\binitsM.-C., \bauthor\bsnmBrookmeyer, \bfnmRon\binitsR. \AND\bauthor\bsnmJewell, \bfnmNicholas P.\binitsN. P. (\byear1993). \btitleStatistical models for prevalent cohort data. \bjournalBiometrics \bvolume49 \bpages1–11. \biddoi=10.2307/2532597, issn=0006-341X, mr=1221402 \bptokimsref\endbibitem