###### Abstract

A case-control family study is a study where individuals with a disease of interest (case probands) and individuals without the disease (control probands)
are randomly sampled from a well-defined population. Possibly right-censored age at onset and disease status are observed for both probands and their
relatives. Correlation among the outcomes within a family is induced by factors such as inherited genetic susceptibility, shared environment, and common
behavior patterns. For this setting, we present a nonparametric estimator of the marginal survival function, based on local linear estimation of
conditional survival functions. Asymptotic theory for the estimator is provided, and simulation results are presented showing that the method performs
well. The method is illustrated on data from a prostate cancer study.

Keywords: case-control; family study; multivariate survival; nonparametric estimator; local linear

An improved fully nonparametric estimator of the marginal survival function based on case-control clustered data

David M. Zucker

Department of Statistics and Data Science,

The Hebrew University of Jerusalem

Mount Scopus, Jerusalem, Israel

email: david.zucker@mail.huji.ac.il

and

Malka Gorfine

Department of Statistics and Operations Research

Tel Aviv University, Ramat Gan, Israel

email: gorfinem@post.tau.ac.il

December 29, 2018

## 1 Introduction

Many epidemiological and medical studies focus on disease events that are rare, so that a random sample from the population provides very few observations where failure has occurred during the monitoring time. In such situations, it is a common practice to consider a case-control strategy. Separate samples of individuals in whom the event has already occurred (case probands) and individuals in whom the event has not yet occurred (control probands) are obtained. Age at onset or age at censoring and disease status of each proband and of one or more of his/her relatives are recorded. For example, in a prostate cancer study, case probands are men diagnosed with prostate cancer, control probands are men without prostate cancer, and each proband is interviewed to obtain detailed disease history information of his relatives. The goals of such case-control family studies are to evaluate the effect of genetic and environmental factors on disease risk, and to estimate the distribution of the age at onset. The analysis of such data is complicated by the case-control selection scheme used to ascertain the families and by the within-family dependence. It is of interest to estimate the marginal survival function under dependent failure times of family members by a fully nonparametric estimator which avoids specific assumptions about the form of the distribution or the dependence structure among failure times with a family.

The estimator that first comes to mind is the Kaplan-Meier survival curve estimator based on the survival data on the relatives. This estimator, however, is biased because it does not take the case-control sampling and the within-family dependence into account. Gorfine et al. (2017) demonstrated the serious bias that can arise with this naive Kaplan-Meier estimator.

Accordingly, Gorfine et al. proposed a nonparametric estimator for this problem, based on a kernel smoothing approach. They presented a simulation study showing that their estimator performs well in terms of bias. The estimator of Gorfine et al. is based on the median of random variables with a complicated dependence structure. Consequently, the asymptotic properties of the estimator were not derived. Also, their use of the median causes some efficiency loss. In addition, their bandwidth selection procedure was not specifically targeted to the estimand of interest.

In the present paper we develop a new nonparametric estimator which performs well in terms of bias and much better in terms of variance than the estimator of Gorfine et al. In some scenarios the new estimator also outperforms Gorfine et al. in terms of bias. We provide asymptotic theory for the estimator. We also present a bandwidth selection procedure that is specifically targeted to the estimand of interest.

## 2 Preliminaries

The setup is as in Gorfine et al. (2017). We denote the maximum observation time among the probands by and the maximum observation time among the relatives by . We write . We let denote the number of relatives in family and we let denote the maximum number of relatives for a given proband. We view as a random variable. We will write some of the formulas as if each proband has exactly relatives, with the extra relatives taken to be censored at time 0. For family , , let denote the failure time of the proband and the failure times of the relatives. Let and denote the corresponding censoring times, and the corresponding observed times, and and the corresponding event indicators. We assume that the censoring is independent of the survival times. The sample includes case probands, with each case proband frequency matched with control probands, so that the total number of control probands is . Thus, the data consist of independent and identically distributed matched sets comprising one case family and control families, and the observed data on family consists of . We seek to estimate the marginal survival function

which we are assuming is the same for probands and relatives, a common assumption in case-control family studies (Shih & Chatterjee, 2002; Chatterjee et al., 2006). We further assume that the bivariate survival function for the proband and a given relative is the same for all relatives. The marginal survival distribution is assumed to be continuous with density . In the case where and have different marginal distributions, our procedure yields a consistent estimate of the marginal distribution of .

Denote and . As in Gorfine et al., we have and .

We now develop an expression for in terms of and . Let and denote the hazard and cumulative hazard functions corresponding to , and define , . Also define and , . We can write . We then have the following:

(1) |

Now define

(2) |

Note that the bracketed integral is nonzero provided that for every there exists a set of values of positive measure for which . Integrating both sides of (1) and rearranging gives

(3) | |||

(4) |

We use (4) to construct our estimator.

## 3 Estimation Procedure

In Gorfine et al., and were estimated using a generalized version of the kernel-smoothed Kaplan-Meier estimator proposed in an unpublished 1981 University of California at Berkeley technical report by R. Beran and examined in Dabrowska (1987), and the resulting estimators were used to construct an estimator of . Here, in light of (4), we work not only with and but also the derivative . Accordingly, we take a local linear estimation approach. Choose a symmetric kernel function and a bandwidth . Let and , and write . Let . Define (with )

We can write

with . A first-order Taylor approximation gives the local linear representation

We now carry out weighted linear least squares with response variable , explanatory variable , and weights . This leads to the local linear estimators

that is, for a given ,

(5) | ||||

(6) |

with the second equation leading to . We can now substitute , , and into (3) and (4) to obtain estimators and for and . We then take . When we want to emphasize the dependence on the bandwidth , we will write .

## 4 Asymptotic Theory

We can write with

We will provide a detailed asymptotic analysis of . Similar arguments can be used to show that converges in probability to zero at faster rate than , and is negligible in comparison with the other two terms.

In this section and in the appendix, we will write and assume that the indices have been arranged so that the control probands appear first in the list of probands, meaning that a sum over probands 1 to is a sum over the control probands.

Defining

we have

Now, the process is a martingale with respect to the filtration , . Accordingly, for any process that is predictable with respect to , the process

is a mean-zero martingale. It follows, even though the process does not have any martingale properties, that for any function we have

(7) |

We now write , where

In the appendix it is shown, under the conditions listed in the appended supplementary document that in probability and that

with

where is the limiting value of . Define . We then obtain the following theorem.

###### Theorem 1.

For each , converges to a limit and converges in distribution to .

The proof of this theorem appears in the appendix. Details are given in the appended supplementary document.

## 5 Practical Implementation Details

In preliminary work, we found that the performance of the estimator of can be improved dramatically by introducing a time transformation that makes the proband observation times approximately uniformly distributed. Along the lines of Doksum et al. (2017), we propose transforming according to an estimate of the distribution function of the proband observation times, which leads to a modified form of nearest neighbor regression. In the appended supplementary document, we show that the consistency and asymptotic normality is maintained under the time transformation if a smooth estimate of the distribution function is used. We believe that this result holds as well when the empirical distribution function is used. In our numerical work, we used the empirical distribution function.

In the context of family survival data, it is usually reasonable to assume that for all and , i.e., for all and . This condition implies that

(8) |

If we let and denote the Kaplan-Meier survival curve estimator based on the case relatives’ survival data and the control relatives’ survival data, respectively, the foregoing inequalities motivate modifying the estimator to the estimator resulting from replacing with if and by if . We implemented this modification in our numerical work. The modification comes into play mainly when the event rate is extremely low or extremely high. Given the consistency of , if the inequalities in (8) are strict, then for large sample sizes the modification no longer comes into play. The inequalities in (8) are strict if the following mild condition holds: for each there exists a set such that and

For bandwidth selection, we propose a bootstrap procedure. Let denote the Kaplan-Meier estimate of the survival function of the time to censoring among the relatives (which is the same for case relatives and control relatives). In each bootstrap replication , for each family we generate event times for proband ’s relatives according to the survival function and censoring times for relatives according to the survival function . We then run our estimation procedure for a given on the resulting data, obtaining the estimate . Denote

We evaluate over a grid of values in the range and choose the values with the minimum .

To construct confidence intervals in finite samples with bandwidth selection, we use the percentile bootstrap method.

## 6 Simulation Study

We carried out a simulation study to evaluate the finite sample properties of the proposed estimator. Data were generated under frailty models in which the within-family dependence is expressed in terms of a shared frailty variate , conditional on which the failure times of the family members are independent with hazard function . We manipulated five design factors, as follows: (1) frailty distribution: gamma or positive stable, (2) cumulative end-of-study event rate: high (60%) or low (15%), (3) number of case probands: 500 or 1000 (with 1:1 matching of control probands to case probands), (4) number of relatives per family: 1 or 4, and (5) strength of within-family dependence: low (Kendall tau of 1/3 between the failure of times of two members of the same family) and moderate (Kendall tau of 1/2). We took with , , and chosen so as to obtain the desired cumulative end-of-study event rate. The end of study age was taken to be 110 years. The overall censoring rate, including both interim and end-of-study censoring, was about 60% in the high event rate case and 90% in the low event rate case. The number of case probands was taken to be 500, with 1:1 matching of control probands to case probands. The data generation was carried out in the same manner as in Gorfine et al. We carried out 1024 simulation replications for each of the 32 combinations of the design. For each replication, we carried out 30 inner replications for the bootstrap bandwidth selection procedure and 100 outer replications for the percentile bootstrap confidence interval procedure. The initial bandwidth was 0.5 and the bandwidth search was done in two stages. In the first stage, we searched over in steps of 0.1 and identified the value with the lowest . In the second stage, we searched over , and and chose the value with the lowest to be the final value. The kernel used was the triweight kernel .

The results for 500 case probands are summarized in Fig. 1 and 2. The left two columns of each figure show the true survival curve, along with Gorfine et al.’s estimator and the new estimator. The finite-sample bias of the new estimator tends to be smaller, and in some settings, such as the gamma frailty model with very low event rates, its finite-sample bias is dramatically smaller. The right two columns of each figure summarize the point-wise coverage rates of the percentile-bootstrap confidence interval of the proposed estimator, along with the standard errors of the Gorfine et al. estimator and the proposed estimators. Clearly, the proposed estimator substantially outperforms the old estimator in terms of efficiency. In general, the coverage rates are reasonably close to , except at very early ages with small number of observed events. Similar results were obtained with 1000 case probands.

## 7 Example

In this section we illustrate our method by re-analysing the data presented in Gorfine et al. (2017), population-based case-control family study of early onset prostate cancer (Stanford et al., 1999). Briefly, case participants were identified from the Seattle-Puget Sound Surveillance, Epidemiology, and End Results (SEER) cancer registry. Cases were those with age at diagnosis between 40 and 64 years. Controls were identified by use of random-digit dialing and they were frequency matched to case participants by age. The information collected on the relatives is the age at diagnosis for prostate cancer if the relative had prostate cancer or age at the last observation if the relative did not have prostate cancer. Here we use the information about age at onset or age at censoring and disease status that was observed for the probands and their fathers, brothers, and uncles. The following analysis is based on 730 prostate-cancer case probands, 693 control probands, and a total of 7316 relatives. Out of the 3793 case-probandsâ relatives, 211 had prostate cancer, and out of the 3523 control-probandsâ relatives, 102 had prostate cancer. The age range of the relatives with prostate cancer was 40–93. Given that frequency matching was used rather than exact age matching, and that the number of relatives per proband varied across the probands, we carried out the time transformation based on the empirical distribution of the proband observation times across all 7316 relatives in the data set. For bandwidth selection we used the same two-stage procedure as in the simulations.

Figure 3 and Table 1 present the estimates of prostate-cancer marginal survival function using the naive Kaplan-Meier estimator based on the relatives’ data, Gorfine et al.’s estimator with nearest-neighbor smoothing and the median operator, the SEER survival curve based on the SEER Cancer Statistics Review 1975–2012, and the proposed estimator. In this dataset, Gorfine et al.’s estimator is closer to the SEER survival curve, but with very large point-wise standard errors compared to the proposed estimator. The standard errors reported in Table 1 are much larger than those reported in Gorfine et al. due to an error in the bootstrap code applied back then.

t | The proposed Estimator | Gorfine et al. | Naive KM | SEER |
---|---|---|---|---|

50 | 0.9997 (0.0007) | 0.9918 (0.0311) | 0.9991 (0.0003) | 0.9958 |

52 | 0.9997 (0.0010) | 0.9801 (0.0340) | 0.9986 (0.0005) | 0.9930 |

54 | 0.9993 (0.0012) | 0.9784 (0.0413) | 0.9981 (0.0006) | 0.9902 |

56 | 0.9990 (0.0023) | 0.9784 (0.0451) | 0.9963 (0.0008) | 0.9843 |

58 | 0.9910 (0.0037) | 0.9784 (0.0451) | 0.9945 (0.0010) | 0.9783 |

60 | 0.9934 (0.0054) | 0.9784 (0.0461) | 0.9881 (0.0015) | 0.9703 |

62 | 0.9908 (0.0063) | 0.9678 (0.0501) | 0.9848 (0.0017) | 0.9603 |

64 | 0.9895 (0.0085) | 0.9423 (0.0577) | 0.9813 (0.0019) | 0.9504 |

## Acknowledgements

The authors would like to thank Dr. Stanford for her generosity in sharing the prostate cancer dataset that was used for illustrating the method. Malka Gorfine gratefully acknowledges support from the U.S.-Israel Binational Science Foundation in carrying out this work.

## Supplementary material

We append a document with details of the proof of the asymptotic properties of the estimator. R code used to carry out the simulations and R code for applying the method to a data set may be found at https://github.com/david-zucker/marginal-survival/.

## Appendix: Asymptotic Theory

A. Preliminaries

We first present some definitions. Let denote the density of . Define and . In addition, define

Note that, by symmetry of , for all odd .

We now present two lemmas, whose proofs appear in the online Supplemental Materials. The notations and , and similarly and , should be understood as being uniform in the relevant arguments.

###### Lemma 1.

For even we have

and for odd we have

###### Lemma 2.

For we have

and for we have

B. Analysis of

We can write as

with

By Taylor expansion, we can write

in which , with , where and denote, respectively, the second and third partial derivatives of with respect to We then have

The term in square brackets does not depend on . Since

we get

We can then write , where

We have

We now consider separately the case of and . For , the results of Lemmas 1 and 2 imply that and , so that and

For , the results of Lemmas A1 and A2 imply that and , so that and

(recalling that the length of is 2). We thus obtain , so that ,
since .

C. Analysis of

Let and define

Define further

We can then write , where

Our claim is that is asymptotically mean-zero normal, and that and are both .

By (7), . Thus, is the sum of i.i.d. mean-zero terms. Accordingly, to show that is asymptotically mean-zero normal we need to show that converges to a limit and that satisfies the Lindeberg condition

(9) |

where

The appended supplementary document provides a proof of the above two assertions, along with a proof that is . The proof of the corresponding result for is similar.

## References

- Chatterjee et al. (2006) Chatterjee, N., Kalaylioglu, Z., Shih, J. H. & Gail, M. (2006). Caseâcontrol and case-only designs with genotype and family history data: estimating relative risk, residual familial aggregation, and cumulative risk. Biometrics 62, 36–48.
- Dabrowska (1987) Dabrowska, D. M. (1987). Non-parametric regression with censored survival time data. Scandinavian Journal of Statistics 14, 181–197.
- Doksum et al. (2017) Doksum, K. A., Jiang, J., Sun, B. & Wang, S. (2017). Nearest neighbor estimates of regression. Computational Statistics and Data Analysis 110, 64–74.
- Gorfine et al. (2017) Gorfine, M., Bordo, N. & Hsu, L. (2017). A fully nonparametric estimator of the marginal survival function based on caseâcontrol clustered age-at-onset data. Biostatistics 18, 76–90.
- Shih & Chatterjee (2002) Shih, J. H. & Chatterjee, N. (2002). Analysis of survival data from caseâcontrol family studies. Biometrics 58, 502–509.
- Stanford et al. (1999) Stanford, J. L., Wicklund, K. G., McKnight, B., Daling, J. R. & Brawer, M. K. (1999). Vasectomy and risk of prostate cancer. Cancer Epidemiology Biomarkers and Prevention 8, 881–886.

Supplemental Web Materials

An improved fully nonparametric estimator of the marginal survival function based on case-control clustered data

David M. Zucker

Department of Statistics and Data Science,

The Hebrew University of Jerusalem,

Mount Scopus, Jerusalem, Israel

email: david.zucker@mail.huji.ac.il

and

Malka Gorfine

Department of Statistics and Operations Research

Tel Aviv University, Ramat Gan, Israel

email: gorfinem@post.tau.ac.il

December 29, 2018

A. Introduction and Technical Assumptions for Asymptotic Results

This document provides details of the proof of the asymptotic properties of the estimator. Notation given in the main paper (including the Appendix) will be used throughout without repeating the definitions; additional notation will be defined as needed. In the development below, the notations and , and similarly and , should be understood as being uniform in the relevant arguments.

Below are the technical conditions assumed in deriving the asymptotic results.

1. The kernel is symmetric, equal to zero outside of , and equal to a polynomial inside . In addition, is twice differentiable with bounded derivatives over the entire real line, including the points and 1.

2. The bandwidth is given by , where and .

3. We have and .

4. The first and second partial derivatives and of with respect to exist and are bounded uniformly over and . Note that Assumptions 3 and 4 imply that .

5. The first and second partial derivatives of with respect to exist and are bounded uniformly over and .

6. The first three partial derivatives of with respect to exist and are bounded uniformly over and .

B. Proofs of Lemmas A1 and A2

We begin with an expanded statement of Lemma A1, continue with the proof of this lemma, and then present Lemma A2 and its proof.

Lemma A1: For even we have

and for odd we have

In addition, for any ,

In general, we have

When is even, the first term dominates for while the second term dominates for , so that we obtain

When is odd, we get

Proof: The analysis of involves a combination of a conditioning argument with the usual change of variable Taylor expansion argument. We have

By Taylor’s theorem, we have

where . So we get

and the claimed result follows.

We now turn to the analysis of . We use Corollary 2.2 of Giné and Guillou (2002), a result concerning empirical processes. For , define and

We can then write

Since is assumed polynomial on , the function satisfies Giné and Guillou’s Condition (K). Hence, by the arguments in Giné and Guillou, the class

is a bounded, measurable VapnikâChervonenkis (VC) class of functions on . Any set of the form can be expressed as the result of Boolean operations on half-spaces, and hence the class of sets is a VC class (see Dudley, 1999, p. 141) (this is well known). Further, any set of the form with can be expressed as

and any set of this form with can be expressed as

Recalling that the Cartesian product preserves the VC property, we can conclude that the class of functions is a bounded VC class. Moreover, since the map