Replica analysis of overfitting in regression models for time-to-event data

ACC Coolen, JE Barrett, P Paga, CJ Perez-Vicente

Department of Mathematics, King’s College London, The Strand, London WC2R 2LS, UK
Institute for Mathematical and Molecular Biomedicine, King’s College London, Hodgkin Building, Guy’s Campus, London SE1 1UL, UK
Department of Primary Care and Public Health Sciences, King’s College London, Addison House, Guy’s Campus, London SE1 1UL, UK
Departament de Física Fonamental, Universitat de Barcelona, 08028 Barcelona, Spain

ton.coolen@kcl.ac.uk, james.barrett@kcl.ac.uk, pierre.paga@kcl.ac.uk, conrad@ffn.ub.es
Abstract

Overfitting, which happens when the number of parameters in a model is too large compared to the number of data points available for determining these parameters, is a serious and growing problem in survival analysis. While modern medicine presents us with data of unprecedented dimensionality, these data cannot yet be used effectively for clinical outcome prediction. Standard error measures in maximum likelihood regression, such as p-values and z-scores, are blind to overfitting, and even for Cox’s proportional hazards model (the main tool of medical statisticians), one finds in the literature only rules of thumb on the number of samples required to avoid overfitting. In this paper we present a mathematical theory of overfitting in regression models for time-to-event data, which aims to increase our quantitative understanding of the problem and provide practical tools with which to correct regression outcomes for the impact of overfitting. It is based on the replica method, a statistical mechanical technique for the analysis of heterogeneous many-variable systems that has been used successfully for several decades in physics, biology, and computer science, but not yet in medical statistics. We develop the theory initially for arbitrary regression models for time-to-event data, and verify its predictions in detail for the popular Cox model.

PACS numbers: 05.70.Fh, 02.50.-r

Contents

1 Introduction
2 Overfitting in Maximum Likelihood models for survival analysis
2.1 Definitions
2.2 An information-theoretic measure of under- and overfitting
2.3 Analytical evaluation of the average over data sets
2.4 Application to Cox regression
3 Asymptotic analysis of overfitting in the Cox model
3.1 Conversion to a saddle-point problem
3.2 Replica symmetric extrema
3.3 Physical interpretation of order parameters
3.4 Derivation of RS saddle point equations
4 Analysis of the RS equations for the Cox model
4.1 RS equations in the limit
4.2 Numerical and asymptotic solution of RS equations
4.3 Variational approximation
5 Tests and applications
6 Discussion
References
Appendix A: Covariate correlations in Cox regression
Appendix B: Derivation of the replica symmetric equations
Appendix C: The limits and
Appendix D: Asymptotic form of the event time distribution

1 Introduction

In the simplest possible scenario, survival analysis is concerned with data of the following form. We consider a cohort of individuals, each of whom is at risk of a specified irreversible event, such as the onset of a given disease or death. For each individual in this cohort we are given specific measurements (the covariates), taken at a baseline time, as well as the time at which we either observed the irreversible event for that individual, or ceased our observation without having observed the event (the latter case is called ‘censoring’). More complex scenarios could involve e.g. multiple distinct risk types, such as distinct causes of death, or interval censoring, where rather than the event time itself one is given an interval that contains it. The theory developed in this paper can be generalised without serious difficulty to include such extensions, but in the interest of transparency we will focus for now strictly on the simplest case.


The aim of survival analysis is regression, i.e. to use our data for detecting and quantifying probabilistic patterns (if any) that relate an individual’s failure time to their covariates. Such patterns may allow us to predict individual patients’ clinical outcomes, distinguish between high-risk and low-risk patients, reveal general disease mechanisms, or design new data-driven therapeutic interventions (by changing the values of modifiable covariates). For general reviews of the considerable survival analysis literature we refer to textbooks such as [1, 2, 3, 4] (non-medical applications of survival analysis include e.g. the study of the time to component failure in manufacturing, or of the duration of unemployment in economics). Being able to use the extracted patterns to predict clinical outcomes for unseen patients is the only reliable test of whether our regression results represent true knowledge. Accurate prediction requires that we use as much of the available covariate information as possible, so our focus must be on multivariate regression methods.

Most multivariate survival analysis methods are based on postulating a suitable and plausible parametrisation of the covariate-conditioned event time distribution, whose parameters are estimated from the data via either the maximum likelihood protocol (ML), or (following Bayesian reasoning) via maximum a posteriori probability (MAP). The most popular parametrisation is undoubtedly the proportional hazards model of Cox [5], which uses ML inference, and assumes the event time distribution to be of the so-called proportional hazards form, in which each individual’s hazard rate is a base hazard rate (a function of time only, common to all individuals) multiplied by the exponential of a linear combination of that individual’s covariates. MAP versions of [5] are the so-called ‘penalised Cox’ or ‘ridge’ regression models (with Gaussian parameter priors), see e.g. [6, 7]. More complex parametrisation proposals, such as frailty or random effects models [8, 9, 10, 11] or latent class models [12], still tend to have proportional hazards type formulae as their building blocks. In all such models the number of parameters is always larger than or equal to the number of covariates. Hence, to avoid overfitting they can be used safely only when the number of samples is sufficiently large compared with the number of covariates. This limitation was harmless in the 1970s and 1980s, when many of the currently used models were devised, and when one would typically have data sets with only modest numbers of covariates. For the data of post-genome medicine, however, where the number of covariates regularly rivals or exceeds the number of samples, it poses a serious problem which has for instance prevented us from using genomic covariates in rigorous multivariate regression protocols, forcing us instead to work with ‘gene signatures’.
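To make the generative content of this parametrisation concrete, the following minimal Python sketch draws event times from a proportional hazards model by inverse-transform sampling of the integrated base hazard; the two integrated hazards, the sizes, and the helper name below are illustrative choices of ours, not those used elsewhere in this paper.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_event_times(Z, beta, inv_cum_hazard):
        # Proportional hazards: S(t|z) = exp(-Lambda0(t) * exp(beta.z)), so inverting
        # the survival function at a uniform random number u gives
        # T = Lambda0^{-1}( -log(u) * exp(-beta.z) )
        u = rng.uniform(size=len(Z))
        return inv_cum_hazard(-np.log(u) * np.exp(-(Z @ beta)))

    N, p = 500, 20
    Z = rng.standard_normal((N, p))               # covariates
    beta = rng.standard_normal(p) / np.sqrt(p)    # regression coefficients of order 1/sqrt(p)

    T_const = sample_event_times(Z, beta, lambda x: x)            # Lambda0(t) = t, constant base hazard
    T_rising = sample_event_times(Z, beta, lambda x: np.sqrt(x))  # Lambda0(t) = t^2, increasing base hazard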

Figure 1: Illustration of overfitting in Cox-type regression (axes: number of active covariates versus fraction of correct predictions). A breast cancer data set [13] of 309 samples (129 with recorded events, 180 censored), with clinical and immunological covariates and disease relapse chosen as event time, was randomly divided into training and validation sets of roughly equal sizes. L2-regularised Cox regression was used to infer regression coefficients and base hazard rates from the training set (via Breslow’s formula [16]), upon which the model was used to predict survival at a given cutoff time, for the samples in the training set and for those in the validation set, and the fractions of correct predictions in the two sets were recorded. This was repeated multiple times, initially with all covariates, and following repeated iterative removal of the least relevant covariate after each regression. The resulting curves exhibit the standard fingerprints of overfitting: initially the validation performance improves as the number of retained covariates increases, up to a critical point (see arrows), followed by deterioration as the number of covariates increases further.

Overfitting in survival analysis models [14, 15] can be visualized effectively by combining regression with cross-validation. For the Cox model, for instance, one can use the inferred association parameters of [5] in combination with Breslow’s [16] estimator for the base hazard rate (which is the canonical estimator for [5]), to predict whether an event will have happened by a given cutoff time, and compare the fraction of correct predictions in the training set (the data used for regression) to that in a validation set (the unseen data). When drawn as functions of the number of covariates used, the resulting curves typically exhibit the standard fingerprints of overfitting [17, 18]; see Figure 1. Simulations with synthetic data [19] showed that the optimal number of covariates in Cox regression (see arrows in Figure 1) tends to be roughly proportional to the number of samples. Given this observed phenomenology, it seems vital before doing multivariate regression to have a tool for estimating the minimum number of samples or events needed to avoid the overfitting regime. To our knowledge, there is no theory in the literature yet for predicting this number, not even for the Cox model [5]. One finds only rules of thumb – e.g. the number of failure events must exceed ten times the number of independent covariates – and empirical bootstrapping protocols, often based on relatively small-scale simulation data [19, 20, 21]. This situation is not satisfactory.
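The Python sketch below mimics this train/validation protocol on synthetic, uncensored data; the gradient-ascent routine is only a crude stand-in for a proper L2-regularised Cox solver, and all function names, sizes and parameter values are illustrative assumptions rather than those used for the data behind Figure 1.

    import numpy as np

    rng = np.random.default_rng(1)

    def fit_cox_ridge(Z, T, events, ridge=0.01, steps=5000, lr=0.5):
        # Crude gradient-ascent maximisation of the ridge-penalised Cox partial
        # likelihood (Breslow handling of risk sets, assuming no tied event times).
        order = np.argsort(-T)                        # decreasing time: risk sets become prefixes
        Zs, Es = Z[order], events[order]
        beta = np.zeros(Z.shape[1])
        for _ in range(steps):
            w = np.exp(Zs @ beta)
            S0 = np.cumsum(w)                         # sum of exp(beta.z) over each risk set
            S1 = np.cumsum(w[:, None] * Zs, axis=0)   # sum of z*exp(beta.z) over each risk set
            grad = (Zs - S1 / S0[:, None])[Es == 1].sum(axis=0) / len(T) - 2.0 * ridge * beta
            beta += lr * grad
        return beta

    def breslow_cumulative_hazard(beta, Z, T, events, t_star):
        # Breslow estimate of the integrated base hazard rate at time t_star
        order = np.argsort(-T)
        S0 = np.cumsum(np.exp(Z[order] @ beta))
        keep = (events[order] == 1) & (T[order] <= t_star)
        return np.sum(1.0 / S0[keep])

    def fraction_correct(beta, Lambda0, Z, T, t_star):
        predicted = Lambda0 * np.exp(Z @ beta) > np.log(2.0)   # predicted P(event by t_star) > 1/2
        return np.mean(predicted == (T <= t_star))

    # synthetic, uncensored stand-in for the data behind Figure 1
    N, p, t_star = 300, 60, 1.0
    beta_true = rng.standard_normal(p) / np.sqrt(p)
    Z = rng.standard_normal((N, p))
    T = rng.exponential(np.exp(-Z @ beta_true))       # constant base hazard, no censoring
    events = np.ones(N, dtype=int)

    train = np.arange(N) < N // 2
    beta_hat = fit_cox_ridge(Z[train], T[train], events[train])
    Lambda0_hat = breslow_cumulative_hazard(beta_hat, Z[train], T[train], events[train], t_star)
    print("training   accuracy:", fraction_correct(beta_hat, Lambda0_hat, Z[train], T[train], t_star))
    print("validation accuracy:", fraction_correct(beta_hat, Lambda0_hat, Z[~train], T[~train], t_star))

On such synthetic data the reported training accuracy typically exceeds the validation accuracy, with the gap widening as the number of covariates grows relative to the number of training samples.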

Figure 2: Inferred association parameters (vertical axis) versus true association parameters (horizontal axis) for synthetic survival data generated according to the Cox model, and subsequently analysed with the Cox model. Covariates and true association parameters were drawn randomly from zero-average Gaussian distributions, with identical variances of the true association parameters in all cases; different panels correspond to different ratios of covariates to samples, and experiments were repeated such that the total number of points in each panel is identical. Top row: a time-independent base hazard rate. Bottom row: a time-dependent base hazard rate (dashed), with its scale chosen to fix the average event time. The errors in the association parameters induced by overfitting are more dangerous than finite sample size errors, since they mainly take the form of a consistent bias and therefore cannot be ‘averaged out’. Moreover, they appear to be independent of the true base hazard rate.
Figure 3: Inferred integrated base hazard rates (solid curves, averaged over multiple experiments) for synthetic survival data, generated and subsequently analysed with the Cox model. Covariates and true association parameters were drawn randomly from zero-average Gaussian distributions, with identical variances of the true association parameters in all cases; results are shown for several values of the ratio of covariates to samples (lower to upper solid curves). Left: data generated with a time-independent base hazard rate (dashed). Right: data generated with a time-dependent base hazard rate (dashed), with its scale chosen to fix the average event time. The errors induced by overfitting again take the form of a consistent bias: for short times the base hazard rate is always under-estimated, whereas for large times it is always over-estimated.

To increase our intuition for the problem, we first explore via simple simulation studies the relation between inferred and true parameters in Cox’s model [5]. The parameters of [5] are a vector of regression coefficients (one per covariate) and the base hazard rate. We generated association parameters and covariates randomly from zero-average Gaussian distributions, and corresponding synthetic survival data using Cox’s model without censoring (so all samples correspond to failure events), for different base hazard rates. To understand the nature of the overfitting-induced regression errors we plotted, for each covariate, the true and inferred association parameters against each other in the plane, with the inferred values calculated via the recipes of [5]. This resulted in scatterplots as shown in Figure 2. Simulations were done for different values of the ratio of covariates to samples, with multiple independent runs such that the number of points in each panel is identical. The true association parameters were drawn independently from a zero-average Gaussian distribution with the same variance for each covariate. Perfect regression would imply finding all points to lie on the diagonal. Rather than a widening of the variance (as with finite sample size regression errors), overfitting-induced errors are somewhat surprisingly seen to manifest themselves mainly as a reproducible tilt of the data cloud, which increases with the ratio of covariates to samples, and implies a consistent over-estimation of associations: both positive and negative associations will always be reported as more extreme than their true values. These observed errors in association parameters appear to be independent of the form of the true base hazard rate. Similarly, we show in Figure 3 the inferred integrated base hazard rates versus time (solid lines), together with the true values (dashed), which again shows consistent and reproducible overfitting errors. A quantitative theory of overfitting that can predict both the observed tilt and width of the data clouds of Figure 2 and the deformed inferred hazard rates of Figure 3 would enable us to correct the inferred parameters of the Cox model for overfitting, and thereby enable reliable regression up to hitherto forbidden ratios of covariates to samples.
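A minimal version of this simulation, with a constant base hazard, no censoring, and an off-the-shelf optimiser for the (unpenalised) Cox partial likelihood, already reproduces the tilt qualitatively; the sizes chosen below are illustrative.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(2)
    N, p = 200, 100                                    # ratio of covariates to samples = 0.5

    beta_true = rng.standard_normal(p) / np.sqrt(p)
    Z = rng.standard_normal((N, p))
    T = rng.exponential(np.exp(-Z @ beta_true))        # constant base hazard, no censoring

    order = np.argsort(-T)                             # decreasing time, so risk sets are prefixes
    Zs = Z[order]

    def neg_partial_loglik_and_grad(beta):
        eta = Zs @ beta
        w = np.exp(eta)
        S0 = np.cumsum(w)                              # sum of exp(beta.z) over each risk set
        S1 = np.cumsum(w[:, None] * Zs, axis=0)
        return -np.sum(eta - np.log(S0)), -(Zs - S1 / S0[:, None]).sum(axis=0)

    beta_hat = minimize(neg_partial_loglik_and_grad, np.zeros(p), jac=True, method='BFGS').x

    # the 'tilt' of the data cloud in Figure 2: least-squares slope of inferred on true coefficients
    slope = (beta_true @ beta_hat) / (beta_true @ beta_true)
    print("slope of inferred versus true association parameters:", slope)   # noticeably larger than 1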

There are mathematical obstacles to the development of a theory of overfitting in survival analysis, which probably explain why it has so far remained an open problem. First, unlike discriminant analysis, it is not immediately clear which error measure to study when the outcomes to be predicted are event times. Second, in most survival analysis models (including Cox regression) the estimated parameters are to be solved from coupled transcendental equations, and cannot therefore be written in explicit form. Third, in the overfitting regime one will by definition find, even for large sample sizes, that the inferred parameters depend on the realisation of the data set, while at the more macroscopic level of prediction accuracy there is no such dependence. It is thus not a priori clear which quantities to focus on in analytical studies of the regression process, and at which stage in the calculation (if any) averages over possible realisations of the data set may be performed safely.

Our present approach to the problem consists of distinct stages, each removing a specific obstacle, and this is reflected in the structure of our paper. We adapt to time-to-event regression the strategy proposed and executed several decades ago for binary classifiers in the groundbreaking paper by Gardner [22]. We first translate the problem of modelling overfitting into the calculation of a specific information-theoretic generating function, from which we can extract the information we need. Next we use Laplace’s argument to eliminate the maximisation over model parameters that comes with all ML methods, which is equivalent to writing the ground state energy of a statistical mechanical system as the zero temperature limit of the free energy. The third stage is devoted to making the resulting calculation of the generating function feasible, using the so-called replica method. This method has an impressive track record of several decades in the analysis of complex heterogeneous many-variable systems in physics [23, 24, 25, 26, 27], computer science [22, 28], biology [29, 30, 31], and economics [32, 33], and enables us to carry out analytically the average of the generating function over all possible realisations of the data set. Finally we exploit steepest descent integration in the limit of large data sets, leading to the identification of the ‘natural’ macroscopic order parameters of the problem, for which we derive closed equations within the replica symmetric (RS) ansatz. Some technical arguments are placed in appendices, to improve the flow of the paper. We develop our methods initially for generic time-to-event regression models, and then specialise to the Cox model. The final RS equations obtained for the Cox model involve a small number of scalar order parameters, from which we can compute the link between true and inferred regression parameters, and the inferred base hazard rate. The functional saddle point equation for the base hazard rate is rather nontrivial; while we can calculate the asymptotic form of its solution analytically, we limit ourselves mostly to a variational approximation, which already turns out to be quite accurate. We close with a discussion of our results, their implications and applications, and avenues for future work.

2 Overfitting in Maximum Likelihood models for survival analysis

2.1 Definitions

We assume we have simple time-to-event data of the standard type, consisting of independently drawn samples, with just one active risk and no censoring. Each sample consists of a covariate vector, drawn independently from a common covariate distribution, and an associated time to event, drawn from a covariate-conditioned event time distribution:

(1)

Here the covariate-conditioned event time distribution describes a parametrised time-generating model, with unknown real-valued parameters collected in a vector that we seek to estimate from the data. We are not interested in estimating the covariate distribution itself, so we take the covariate vectors as given. The data probability for each parameter choice is

(2)

We next define the empirical distribution of covariates and event times, given the observed data:

(3)

This allows us to write

(4)

with the conditional differential Shannon entropy of the event time distribution, and the Kullback-Leibler distance [34] between the empirical distribution and the parametrised form:

(5)
(6)

The parameters estimated via the ML recipe are those that maximise the data likelihood. According to (4) they minimise the Kullback-Leibler distance between the empirical covariate-conditioned event time distribution and the parametrised event time distribution:

(7)

If the number of samples becomes large for a fixed number of covariates and parameters, the law of large numbers guarantees that the empirical distribution converges to the true one (in a distributional sense), and hence ML regression will indeed estimate the parameters asymptotically correctly, provided the chosen parametrisation is unambiguous:

(8)

In this paper, however, we focus on the regime of large data sets with high-dimensional covariate and parameter vectors in which overfitting occurs, namely that where the number of samples and the number of covariates are both large and of comparable size. Here the empirical distribution no longer converges to the true covariate-conditioned event time distribution in any mathematical sense, the identity (8) is therefore violated, and minimising the Kullback-Leibler distance as per the ML prescription is no longer appropriate. This is the information-theoretic description of the overfitting phenomenon in survival analysis.
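Since the empirical measure is a sum of delta peaks, the parameter-dependent part of the Kullback-Leibler distance is simply the negative mean log-likelihood over the observed samples. The toy sketch below (a one-parameter exponential event time model of our own choosing, i.e. the classical regime of few parameters and many samples) illustrates the consistency statement (8): the minimiser of the empirical Kullback-Leibler distance coincides with the ML estimator and approaches the true parameter.

    import numpy as np

    rng = np.random.default_rng(3)

    # toy model p(t|theta) = theta * exp(-theta * t): one parameter, many samples
    theta_true = 1.5
    t = rng.exponential(1.0 / theta_true, size=20000)

    # up to a theta-independent constant (the empirical entropy term), the
    # Kullback-Leibler distance to the empirical measure is the negative mean log-likelihood
    theta_grid = np.linspace(0.5, 3.0, 2501)
    neg_mean_loglik = -(np.log(theta_grid) - theta_grid * t.mean())
    theta_ml = theta_grid[np.argmin(neg_mean_loglik)]
    print(theta_ml, 1.0 / t.mean(), theta_true)   # grid minimiser = ML estimate, both close to theta_true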

2.2 An information-theoretic measure of under- and overfitting

Maximum likelihood regression algorithms report those parameters for which the parametrised distribution is as similar as possible to the empirical distribution, as opposed to the true distribution from which the data were generated. The optimal outcome of regression is for the inferred parameters to be identical to the true ones. We therefore define

(9)

This allows us to interpret its value as a measure of ML regression performance:

(10)
(11)
(12)

Optimal regression algorithms would reduce the Kullback-Leibler distance until it reaches the value achieved by the true parameters, and then stop. Maximum likelihood regression will not do this; if it can reduce the Kullback-Leibler distance further it will do so, and thereby cause overfitting. For large data sets we expect the outcome to depend on the data only via the covariates and the true parameters; this is the fundamental assumption behind any regression. It allows us to focus on the average over all realisations of the data, given the covariates and the true parameters:

in which

(14)

Evaluating this average analytically in the regime of large data sets is the focus of this paper. Clearly, if the relevant minimum over the model parameters corresponds to the true values for every realisation of the data, then the average vanishes.
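Assuming, as the preceding discussion suggests, that the measure compares the Kullback-Leibler distance achieved by the inferred parameters with that achieved by the true ones, its data average can also be estimated by brute-force Monte Carlo for small toy models. The sketch below does this for a fully parametric exponential time-to-event model (not the semi-parametric Cox model), with illustrative sizes, so its numbers are only indicative of the trend: close to zero when parameters are few, clearly negative in the overfitting regime.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(4)

    def neg_loglik_and_grad(params, Z, T):
        # fully parametric exponential model p(t|z) = lam0 * exp(b.z) * exp(-lam0 * exp(b.z) * t),
        # parametrised by theta = (log lam0, b)
        eta = params[0] + Z @ params[1:]
        r = 1.0 - np.exp(eta) * T
        return -np.sum(eta - np.exp(eta) * T), -np.concatenate(([r.sum()], Z.T @ r))

    def average_measure(N, p, n_sets=10):
        # Monte Carlo average of the difference between the KL distances at the inferred
        # and at the true parameters; the empirical entropy terms cancel, leaving a
        # mean log-likelihood ratio per sample.
        vals = []
        for _ in range(n_sets):
            beta0 = rng.standard_normal(p) / np.sqrt(p)
            Z = rng.standard_normal((N, p))
            T = rng.exponential(np.exp(-Z @ beta0))          # lam0 = 1, no censoring
            theta0 = np.concatenate(([0.0], beta0))
            theta_hat = minimize(neg_loglik_and_grad, np.zeros(p + 1),
                                 args=(Z, T), jac=True, method='BFGS').x
            d_hat = neg_loglik_and_grad(theta_hat, Z, T)[0]
            d_true = neg_loglik_and_grad(theta0, Z, T)[0]
            vals.append((d_hat - d_true) / N)
        return np.mean(vals)

    print(average_measure(N=200, p=2))     # close to zero: hardly any overfitting
    print(average_measure(N=200, p=50))    # clearly negative: overfitting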

2.3 Analytical evaluation of the average over data sets

Working out (2.2) analytically for large sample sizes requires first that we deal with the minimisation over the model parameters. This can be done by converting the problem into the calculation of the ground state energy of a statistical mechanical system whose degrees of freedom are the model parameters, with a suitably rescaled Hamiltonian (the rescaling is done in anticipation of subsequent limits):

(15)
(16)

For finite values of the inverse temperature, this quantity can be interpreted as the average result of a stochastic minimisation, based on carrying out gradient descent on the energy function, supplemented by a Gaussian white noise with variance proportional to the temperature.
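A generic one-dimensional sketch of such a stochastic minimisation is given below (a toy quadratic energy, not the Hamiltonian (16), whose gradient would involve the full data set): gradient descent plus Gaussian white noise with variance proportional to the temperature samples, at long times, the corresponding Gibbs measure.

    import numpy as np

    rng = np.random.default_rng(8)

    def noisy_gradient_descent(grad_H, theta0, gamma, lr=2e-3, steps=100_000):
        # Gradient descent on H plus Gaussian white noise with variance proportional to
        # the temperature 1/gamma; at long times this samples from exp(-gamma * H),
        # and gamma -> infinity recovers plain minimisation of H.
        theta = np.array(theta0, dtype=float)
        trace = []
        for k in range(steps):
            noise = np.sqrt(2.0 * lr / gamma) * rng.standard_normal(theta.shape)
            theta += -lr * grad_H(theta) + noise
            if k > steps // 2:
                trace.append(theta.copy())
        return np.array(trace)

    # toy energy H(theta) = theta^2 / 2, for which the Gibbs measure is Gaussian with variance 1/gamma
    for gamma in (1.0, 10.0, 100.0):
        trace = noisy_gradient_descent(lambda th: th, [2.0], gamma)
        print(gamma, trace.var(), 1.0 / gamma)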

The remaining obstacle is the logarithm in (16), which prevents the average over all data sets from factorising over the samples. This we handle using the so-called replica method, which is based on the identity ⟨log Z⟩ = lim_{n→0} n^{-1} log⟨Z^n⟩ for averages of logarithms, and to our knowledge has not yet been applied in survival analysis. In the replica method the average is carried out for integer n, and the limit n → 0 is taken at the end of the calculation via analytical continuation. Application to (16) leads us after some simple manipulations to a new expression in which the average over data sets does factorise over samples:

(17)

The average over data sets has now been done, and we are left with a completely general explicit expression, in terms of the covariate statistics and the assumed parametrised data-generating model, for the average that we set out to compute. We will now work out and study this expression for Cox’s proportional hazards model [5] with statistically independent zero-average Gaussian covariates.
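The replica identity used above is easily checked numerically for a toy positive random variable; the sketch below verifies only the small-n limit, not the analytical continuation from integer n that is used in the actual calculation.

    import numpy as np

    rng = np.random.default_rng(5)

    # log-normal toy variable Z = exp(2 + g), g standard Gaussian, so <log Z> = 2 exactly
    Z = np.exp(2.0 + rng.standard_normal(1_000_000))

    direct = np.mean(np.log(Z))
    for n in (1.0, 0.1, 0.01):
        replica = np.log(np.mean(Z ** n)) / n
        print(n, replica, "   direct average of log Z:", direct)   # replica value approaches the direct one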

2.4 Application to Cox regression

In Cox’s method [5] the model parameters are a base hazard rate (a non-negative function of time) and a vector of regression coefficients. The assumed event time statistics are then of the following form:

(18)

The rescaling factors only induce an irrelevant change of scale, but will make it easier to take the limit of a large number of covariates. In fact, for a large number of covariates it is inevitable that the typical association parameters in the Cox model scale as the inverse square root of that number, since otherwise one would not find finite nonzero event times.

For simplicity we assume that the covariates are independent zero-average Gaussian random variables. This restriction of our analysis to uncorrelated covariates is no limitation, since for the Cox model one can always obtain, via a simple mapping, the regression results for data with correlated covariates from those obtained for uncorrelated covariates. This is demonstrated in Appendix A.
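A small numpy sketch of the mapping behind this statement (our own construction; Appendix A gives the precise version): because the model sees the data only through the linear predictors, whitening the covariates and transforming the regression coefficients back with the inverse of the whitening matrix is equivalent to regressing directly on the correlated covariates.

    import numpy as np

    rng = np.random.default_rng(6)
    N, p = 500, 10

    A = 0.5 * np.ones((p, p)) + 0.5 * np.eye(p)   # covariate covariance matrix (equicorrelated example)
    L = np.linalg.cholesky(A)                     # A = L @ L.T

    X = rng.standard_normal((N, p))               # uncorrelated unit-variance covariates
    Z = X @ L.T                                   # correlated covariates with covariance A

    beta = rng.standard_normal(p) / np.sqrt(p)

    # The likelihood depends on the data only through the linear predictors beta.z,
    # and beta.z = (L^T beta).x, so a regression on the whitened covariates X with
    # coefficients b corresponds to coefficients L^{-T} b on the original covariates Z.
    assert np.allclose(Z @ beta, X @ (L.T @ beta))
    b_on_X = L.T @ beta
    assert np.allclose(np.linalg.solve(L.T, b_on_X), beta)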

For the Cox model our general result (17) takes the following form, involving ordinary integration over the replicated regression coefficient vectors and functional integration over the replicated base hazard rates:

To enable efficient further analysis we define the short-hands

(20)
(21)

and the vector whose components are these replicated quantities. In addition we rename the true association parameters as a zeroth replica, so that

All of these quantities are linear combinations of Gaussian random variables, so they will be jointly Gaussian (even for most non-Gaussian covariates this would still hold for a large number of covariates, due to the central limit theorem), giving

(23)

in which the entries of the covariance matrix are

(24)
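This covariance structure can be checked numerically. The sketch below assumes, as in the text, independent zero-average unit-variance Gaussian covariates and coefficient vectors of order one over the square root of the covariate dimension, in which case the covariance of the replicated linear predictors reduces to the matrix of inner products between the coefficient vectors.

    import numpy as np

    rng = np.random.default_rng(7)
    p, n_rep, n_samples = 100, 3, 50_000

    betas = rng.standard_normal((n_rep + 1, p)) / np.sqrt(p)   # replica 0 stands in for the true parameters
    z = rng.standard_normal((n_samples, p))                    # independent zero-average unit-variance covariates

    x = z @ betas.T                                            # linear predictors, one column per replica
    C_theory = betas @ betas.T                                 # inner products between coefficient vectors
    C_empirical = np.cov(x, rowvar=False)

    print(np.abs(C_empirical - C_theory).max())                # small: x is Gaussian with this covariance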

We introduce integrals over delta-distributions to transport variables to more convenient places, by substituting, for each replica pair:

We then obtain, after some simple manipulations,

(26)

For finite sample sizes, expressions such as (26) are of course not easy to use, but as with all statistical theories we will be able to progress upon assuming the sample size to be large. (Note that the standard use of Cox regression away from the overfitting regime, including its formulae for confidence intervals and p-values, which require Gaussian approximations built on large-sample expansions around the most probable parameter values and assume that uncertainty in base hazard rates can be neglected, is similarly valid only when the sample size is sufficiently large.) We therefore focus on the asymptotic behaviour of (26) for large sample sizes, but with a fixed ratio of covariates to samples, and will confirm a posteriori the extent to which the resulting theory describes what is observed for large but finite sample sizes.

3 Asymptotic analysis of overfitting in the Cox model

3.1 Conversion to a saddle-point problem

Following extensive experience with the replica method in other disciplines, with similar definitions, we assume that the replica limit and the limit of large sample sizes commute. The invariance of the right-hand side of (26) under all permutations of the sample indices implies that the result can depend on the true association parameters only via their distribution. With a modest amount of foresight we introduce a convenient shorthand, and obtain

(27)

Writing the ratio of covariates to samples as a parameter that is kept fixed in the limit of large sample sizes, we may take this limit and obtain an integral that can be evaluated using steepest descent:

(28)

in which the function to be extremised is

(29)
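The steepest-descent step can be illustrated with a generic one-dimensional example (unrelated to the specific integrand above): for a large parameter multiplying the exponent, the scaled logarithm of the integral converges to the maximum of the exponent.

    import numpy as np

    # Laplace / steepest descent: for large N the integral of exp(N * f(x)) is dominated
    # by the maximum of f, so (1/N) * log(integral) converges to max_x f(x).
    f = lambda x: x - x**2                       # toy exponent with maximum f(1/2) = 1/4
    x, dx = np.linspace(0.0, 1.0, 200_001, retstep=True)
    for N in (10, 100, 1000):
        integral = np.sum(np.exp(N * f(x))) * dx
        print(N, np.log(integral) / N)           # approaches 0.25 as N grows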

Differentiation with respect to the conjugate integration variables immediately gives one set of saddle-point conditions. Moreover, for various integrals to be well-defined, the relevant saddle-point must (after contour deformation in the complex plane) be of a form where

(30)

for suitable values of the new variables, and where the matrix involved is positive definite. Thus at the relevant saddle-point we will have

(31)