# The Brier Score under Administrative Censoring: Problems and Solutions

## Abstract

The Brier score is commonly used for evaluating probability predictions. In survival analysis, with right-censored observations of the event times, this score can be weighted by the inverse probability of censoring (IPCW) to retain its original interpretation. It is common practice to estimate the censoring distribution with the Kaplan-Meier estimator, even though it assumes that the censoring distribution is independent of the covariates. This paper discusses the general impact of the censoring estimates on the Brier score and shows that the estimation of the censoring distribution can be problematic. In particular, when the censoring times can be identified from the covariates, the IPCW score is no longer valid. For administratively censored data, where the potential censoring times are known for all individuals, we propose an alternative version of the Brier score. This administrative Brier score does not require estimation of the censoring distribution and is valid even if the censoring times can be identified from the covariates.

The Brier Score under Administrative CensoringKvamme and Borgan \firstpageno1

survival analysis, time-to-event-prediction, customer churn, inverse probability weighting, progressive type I censoring

## 1 Introduction

Recently, there has been an increasing interest in combining machine learning methodology with survival analysis for improved time-to-event prediction. Some methods extend the well known Cox regression with neural networks (Katzman et al., 2018; Ching et al., 2018; Yousefi et al., 2017; Luck et al., 2017; Kvamme et al., 2019), while others consider a more direct approach for optimizing the likelihood for right-censored time-to-event data (Biganzoli et al., 1998; Lee et al., 2018; Fotso, 2018; Gensheimer and Narasimhan, 2019; Kvamme and Borgan, 2019). Also worth mentioning is the Random Survival Forest (Ishwaran et al., 2008) which makes decision trees based on the log-rank test and estimates the cumulative hazards with the Nelson-Aalen estimator.

Although these methods are available for right-censored event times, a substantial part of the machine learning community is not familiar with survival analysis and might find it reasonable to instead apply binary classifiers for time-to-event prediction. In short, a binary classifier estimates the probability that an individual experience the event by time , and can be fitted by disregarding individuals censored before that time. As empirical evaluation of predictions is central to machine learning, the best way to convince this audience to use survival methodology would likely be to show that the survival methods give improved evaluation scores.

Arguably, the two most common evaluation criteria for survival predictions are the inverse probability of censoring weighted (IPCW) Brier score (Graf et al., 1999; Gerds and Schumacher, 2006) and different versions of the concordance index (Harrell Jr et al., 1982; Antolini et al., 2005; Uno et al., 2011; Gerds et al., 2013). We will, in this paper, show that the IPCW Brier score can be biased under administrative censoring, i.e., when the right-censoring times are known for all individuals. In fact, we will show that this bias benefits the binary classifiers, meaning that the binary classifiers can get better scores than corresponding survival methods, even though the estimates of the binary classifiers are, in reality, much worse than those of the survival methods. Furthermore, we will show that the IPCW Brier score is heavily dependent on the estimates of the censoring distribution, making it unattractive as an evaluation metrics as one can question the validity of the results. We, therefore, propose the administrative Brier score, which is created for handling both these issues. In particular, it does not have the potential bias of the IPCW score under administrative censoring, and it does not require estimation of the censoring distribution.

To give the reader an understanding of the potential problems of the IPCW Brier score, we will start with an illustrative example in Section 2. Then, in Section 3, we will introduce the Brier score in detail and discuss more carefully the issues of the IPCW scheme. We present our proposed alternative, the administrative Brier score, in Section 3.2. The binary classifiers are investigated in Section 4, where we will show their relationship to the potential bias of the IPCW Brier score under administrative censoring. A simulation study is conducted in Section 5 to empirically illustrate our findings, and in Section 6, we investigate a real data set with administrative censoring. A summary and concluding remarks are made in Section 7.

The code for the evaluation metrics, the survival methods, the simulations, and the data sets is available at github.com/havakv/pycox.

## 2 A Real-World Example

To illustrate the potential issues with the IPCW Brier score, we consider an example encountered while researching the KKBox Churn data set in Kvamme et al. (2019). The task is to predict whether or not customers continue to subscribe to the KKBox music streaming service days after their first subscription. If customers leave their subscription they have churned, and these are the events we want to model. The follow-up times of the customers depend on when they first subscribed, as we only follow customers up to January 29, 2017. A substantial part of the customers do not experience the event before this date and are instead right-censored. The type of right-censoring encountered here is called administrative censoring or progressive type I censoring, meaning that we actually know the censoring times of all the customers. So even for customers for whom we have observed the churn time, we know the time they would have been censored.

We approach the modeling of the event-time distribution in two ways. The first is with the Logistic-Hazard method (Brown, 1975; Biganzoli et al., 1998; Gensheimer and Narasimhan, 2019; Kvamme and Borgan, 2019) which accounts for censored observations by considering the likelihood for right-censored event times. We use the version of the method described by Kvamme and Borgan (2019), meaning that we parameterize the discrete hazards with a neural network in the form of a multilayer perceptron (MLP).

The second approach is to fit a binary classifier for each time and remove all customers censored before this time. The responses (labels) given to the classifiers are indicators of whether or not each customer has churned. We denote this as the BCE method, as it minimizes the binary cross-entropy of the survival estimates, where survival means that a customer has not yet churned. The BCE method is an MLP with equivalent network structure to that of the Logistic-Hazard, with each output node corresponding to a binary classifier at time . The method will be described in detail in Section 4. As censored individuals are removed from the data set at their time of censoring, the BCE method has a bias towards higher churn probabilities. We would, therefore, expect the Logistic-Hazard to perform better for the KKBox data set.

The IPCW Brier scores with Kaplan-Meier censoring estimates (Graf et al., 1999) are calculated for the survival estimates of the two methods, and a plot of the results is shown in the left part of Figure 1. For higher times, we see that the BCE method has better scores than the Logistic-Hazard method. This is unexpected, as the censoring proportion increases with time, meaning the bias of the BCE estimates increases with time. However, the Kaplan-Meier censoring estimates are not covariate dependent, so the results are likely explained by poor estimation of the censoring distribution. This was addressed by Gerds and Schumacher (2006), but the Kaplan-Meier estimator continues to be the most common choice for estimating the weights in the IPCW Brier score.

In the right part of Figure 1, we have plotted the IPCW Brier scores where the censoring distribution is estimated with a Logistic-Hazard model similar to that of the survival estimates. We now see that the BCE generally performs worse than the Logistic-Hazard, but we still find that it gets better scores than the Logistic-Hazard for the largest times. We do, however, believe that the BCE actually performs worse than the Logistic-Hazard, also for the largest times, and that the reason for these results can be explained by some artifacts of the IPCW Brier scores applied to a data set with administrative censoring.

In the following sections, we will present what we believe to be the cause of these results, and we will use simulated data to reproduce the artifacts. We will, also, propose the administrative Brier score, which does not suffer from the same vulnerabilities as the IPCW Brier score, and does not even require estimation of the censoring distribution. In Section 6, we will revisit the KKBox data set for a more in-depth analysis.

## 3 The Brier Score

In the following, we present the Brier score for evaluating time-to-event predictions in the form of survival estimates. We will then introduce the topic of right-censoring in survival analysis, and show how the IPCW Brier score accounts for censored observations. This is followed by a discussion of administratively censored observations and how they affect the IPCW Brier score. We end the section by introducing a new Brier score for handling administratively right-censored observations.

Let denote the event time of individual and let denote the density function of this event time. The survival function of individual is then given by

In time-to-event prediction, we want to estimate for all individuals, and we will let denote these estimates. To simplify the presentation, we will disregard the estimation uncertainty, and consider the ’s as known (non-random) functions.

A reasonable metric for evaluating the predictive performance of the ’s, is to calculate the mean squared error of the estimates to the true survival functions. For a data set with individuals, this score is

(1) |

However, is not known outside of simulations, and we instead observe event times drawn from the event time distribution. The Brier score for uncensored data approximates the true survival functions with step-functions with jumps at the event times, giving

(2) | ||||

The expectation of the Brier score is

So the expected Brier score is the sum of the MSE in (1) and a constant that does not depend on our survival estimates . This constant is the irreducible error of approximating the true survival functions with the step-functions . So minimizing the expected Brier score is equivalent to minimizing the MSE, and the minimum is obtained for the true survival functions, i.e., .

### 3.1 The IPCW Brier Score

In many applications, only a subset of the event times is observed. For some individuals, we instead only know that the event time occurs after some censoring time . For modeling these data sets, we consider the right-censored event time and the event indicator . We follow the common convention in survival analysis that when an event and censoring time coincide, we observe the occurrence of the event.

As we only have partial information, the Brier score in (2) cannot be calculated. We can, however, approximate it by weighting the scores of the observed event times by the inverse probability of censoring (Graf et al., 1999; Gerds and Schumacher, 2006). This is called inverse probability of censoring weighting (IPCW), and the IPCW Brier score is given by

(3) |

where , is the survival function of the censoring distribution for individual . Assuming, for simplicity, that the ’s are known functions, the expectation of the IPCW Brier score is

We see that this is identical to the expectation of the uncensored Brier score (2), so the IPCW Brier score is a reasonable approximation of the uncensored Brier Score.

#### IPCW with Administrative Censoring

For administrative censoring, all the censoring times are known, regardless of whether individual experienced an event or was censored. Although , in this case, is not random, we will continue to capitalize for simplicity. As we know the censoring times, the censoring distribution is given by the step-function , and the IPCW Brier score (3) can be written as

(4) |

This is equivalent to the unweighted Brier score (2) on the subset of individuals that are not censored, meaning we simply disregard the set of censored individuals . The step-function violates the assumptions of the IPCW Brier score which, as stated in Section 3.1, requires . As a result, the expectation of the score is no longer equal to that of the uncensored Brier score,

We see that the expectation is only equal to that of the uncensored Brier score if all individuals have censoring times . For this group, with , we still have that the minimizers of the score are . However, for the other group, with , we see that the minimizers are . Hence, according to this score, the optimal survival estimates are . This is problematic, as we want an evaluation metric that decreases as our estimates approach the true survival functions .

Due to these biases, one would, in practice, not use the Brier score (4) with , but instead use the regular IPCW Brier score (3) with estimates of the censoring distributions. We will, however, see that these issues can still be present. Specifically, if the covariates hold sufficient information to identify the administrative censoring times, then the estimated censoring distributions can approach the step functions , and we get a Brier score close to that of (4).

#### Estimation of the Censoring Distribution

In practice, we need to estimate the censoring distributions to use the IPCW Brier score. This estimation is, in itself, a time-to-event prediction problem, and can be addressed in the same manner as the original time-to-event problem. Graf et al. (1999) proposed to use the Kaplan-Meier estimates of the censoring distribution, and this is still the most common approach. However, the Kaplan-Meier estimator disregards the covariates, meaning all individuals are assumed to have the same censoring distribution. This can lead to biased censoring estimates, as addressed by Gerds and Schumacher (2006).

In predictive modeling, we typically split our data set into a training set used to fit the models, a validation set used for hyperparameter tuning, and a test set used for evaluating the models’ predictions. We, therefore, only consider the Brier score calculated on a test set in this paper. Both Graf et al. (1999) and Gerds and Schumacher (2006) use the test set to estimate the censoring distribution, which is reasonable for simple models such as the Kaplan-Meier estimator and Cox regression. If we, however, want to use more flexible models, such as the Logistic-Hazard with neural networks, fitting to the test set is likely to results in overfitted censoring estimates. To the best of our knowledge, this topic has not been addressed in the literature, so there are no “best practices” for how to approach such estimation problems. In this paper, we will, therefore, treat the censoring distribution in the same manner as the event-time distribution, meaning we fit the censoring model to the training set, and use a validation set to verify that the estimates are reasonable.

When the censoring distribution is obtained from the Kaplan-Meier estimator on the test set, the weights of the IPCW Brier score (3) sum to . But for more flexible methods, and methods fitted to the training set, this is not necessarily the case. A more general version of the IPCW Brier score (3), that works for any reasonable censoring estimates, is obtained by replacing with the sum of the weights

This ensures that the Brier score is between 0 and 1,

(5) |

This is the version of the IPCW Brier score we use in all our experiments.

The censoring estimates in (5) are actually functions of the covariates, . If there is enough information in the covariates to identify the censoring times , the censoring estimates will approach the step-functions , meaning the scores approach the biased Brier score (4). As an example of this, consider a study where all censoring is due to administrative censoring at the closure of the study, and one includes the start date for each individual as a covariate, then the ’s can be identified. A more realistic example would be that a combination of certain covariates can identify the start date of some individuals and, consequently, a subset of the censoring times can be identified.

If we estimate the censoring distribution with a flexible method, we might experience that some of the estimates of become very small. This corresponds to very large weights, meaning that a single individual can potentially dominate the score. To prevent this from occurring, we set a maximum weight allowed. As an example, if we have a maximum weight of 100, we do not allow estimates of , giving the interpretation that a single individual can maximally represent 100 individuals. In practice, this is ensured by setting weights larger than 100 to be 100. By decreasing the maximum allowed weight, we reduce the variance of the IPCW Brier score at the expense of introducing some bias.

It should now be clear that the IPCW Brier score’s dependence on the censoring estimates is undesirable, as we do not want an evaluation criterion to be heavily dependent on the choices made by the researcher. If the scores are substantially affected by the censoring estimates, one can question the validity of the results.

### 3.2 The Administrative Brier Score

We have shown that the IPCW Brier score may have undesirable behavior under administrative censoring, and that estimation of the censoring distribution can be problematic in general. We propose an alternative that alleviates both of these problems. Our approach is to use the uncensored Brier score (2) and simply remove individuals from evaluation after their administrative censoring time. This is possible as we know the censoring times for all individuals. However, this also means that the score is not applicable in studies where we do not know all the censoring times.

To define the administrative Brier score, first note that for we know whether or (remember that if we assume that we observe the occurrence of the event). The administrative Brier score is then

(6) |

where we scale by

instead of . The expectation of the administrative Brier score is

which, for the subset of individuals with , is identical to the uncensored Brier score. Individuals with give no contribution to the score. As we have no information about event times , it is reasonable to only consider this subset.

In practice, it is probably reasonable that a subset of the administrative censoring times can be identified by the covariates. We saw in Sections 3.1.1 and 3.1.2 that the IPCW Brier score can be biased for this type of data. As the administrative Brier score is still minimized by the true survival functions , it is more robust to these issues than the IPCW Brier score.

## 4 Binary Classifiers for Time-to-Event Prediction

In machine learning, it is quite common to approach time-to-event prediction with binary classifiers rather than relying on survival methodology. Especially for someone without experience in survival analysis, a binary classifier may seem like a reasonable approach to time-to-event modeling.

A binary classifier can be constructed for a given time by minimizing the binary cross-entropy (negative log-likelihood for Bernoulli data) and disregarding individuals that were censored before time . This gives the loss function,

(7) | ||||

were the labels denote if the events happen after time . If we want survival estimates for a range of times , we can fit a model for each . Alternatively, if we use a model for that can be estimated for multiple ’s simultaneously, such as a neural network with output nodes, we can fit the model to the sum of the losses

(8) |

We refer to this approach as the BCE method.

The binary classifiers and BCE method are clearly biased, as the removal of censored individuals decreases the survival estimates. In fact, if there is sufficient information in the covariates to identify the censoring times , the survival estimates of the binary classifiers will approach

(9) |

The derivations leading to (9) are given in Appendix A. We recognize these estimates as the minimizers of the IPCW Brier score in Section 3.1.1. So given that the ’s can be identified from the covariates and we have a sufficiently large data set, the binary classifiers are essentially optimal for minimizing the IPCW Brier score. If the covariates only identify a subset of the ’s, it is not clear whether or not a binary classifier will give better IPCW Brier scores than a corresponding survival method. We believe this might explain why the BCE method got a lower Brier score on the KKBox data set in Section 2.

If a researcher ignores censored individuals in the test set in the same manner as for the binary classifiers (7), the uncensored Brier score (2) will take the form of the IPCW Brier score in (4) with ,

We know from Section 3.1.1 that the expectation of this Brier score is minimized by the estimates (9).

### 4.1 Simulation with BCE

To illustrate the concerns addressed in the previous sections, we conduct a simple simulation study. We draw event times from a discrete event-time distribution with 1000 discrete time points and constant hazard . The censoring times are drawn uniformly over the full duration span from 0 to 100. We will use the censoring times in two distinct ways. The first is to consider the censoring times as random, which corresponds to a linear survival function for the censoring, identical for all individuals. The second approach is to consider the censoring times identifiable from the covariates, meaning we have step-functions . In this case, we will use a monotone function of as a covariate, which represents the inclusion of covariates that can be used to identify the censoring time. For brevity, we will denote this covariate as , as it is equivalent to using as a covariate. In summary, the only distinction between the two approaches is whether or not the covariates contain information about the censoring time .

We fit two neural networks with the BCE loss (8) to 10,000 simulated samples, the first without the covariate identifying and the second with this covariate. In Figure 2 we have plotted the estimated survival function for six distinct censoring times , marked by the vertical red dotted lines. The plots also contain the true survival function (in blue) for comparison. From the orange lines, we see a clear bias of the BCE method (binary classifier) applied to right-censored data, as it always underestimates the survival. On the other hand, the survival estimates by the BCE method with censoring information, represented by the green lines, follow the true survival function reasonably well up till the censoring times and then fall to zero. This agrees well with the optimal estimates in (9).

Next, we investigate the resulting Brier scores of these survival estimates. In Figure 3 we have plotted the uncensored Brier score (top left), the IPCW Brier scores (5) with Kaplan-Meier estimates of (top right), the IPCW scores with the true censoring distribution (bottom left), and the proposed administrative Brier score (bottom right) given by (3.2). Recall that the IPCW scores with the true censoring distribution (bottom left) correspond to removing censored individuals in the same manner as that of a binary classifier. The uncensored Brier score is computed on an uncensored test set, while the three other scores are computed on a censored test set with the same censoring distributions as in the training set.

From the uncensored Brier score, it is clear that the true survival function does better than the two BCE methods. However, for both the IPCW Brier scores we see that the BCE method with censoring information performs the best. In fact, when we use the true censoring step-function as weights, even the BCE without knowledge of the censoring time has better scores than the true survival function. These figures, therefore, clearly illustrate a potential weakness of the IPCW applied to data sets with identifiable censoring times.

The administrative Brier score (bottom right), rightfully identifies the true survival function as better than the estimates of the BCE methods. For this score, the BCE with censoring information performs very closely to the true survival function, but this is expected as the survival estimates are close to the true survival function for .

The simulations were made to represent a worst-case scenario for the IPCW Brier score, in the sense that the covariates contain information that makes the censoring times easily identifiable for all individuals. It is unlikely to find such extreme results in real data sets, but it serves the purpose of illustrating the potential issues.

## 5 Simulations

The simulations in Section 4.1 illustrated the potential issues of using the IPCW Brier score on a data set with administrative censoring. The simulations were, however, tailored to emphasize such issues by having a very simple survival function, and censoring times that were very easy to identify from the covariates. In reality, the scores depend on how well the BCE method is able to estimate both the survival and censoring distribution. Hence, we conduct a more reasonable simulation study to investigate the extent of these issues.

We will use the framework presented by Kvamme and Borgan (2019) to create simulated data sets. This means that we draw event times by sequentially sampling from discrete-time hazards on the time grid . The hazards are specified through the logit-hazard function . Note that is just the notation used by Kvamme and Borgan (2019), and is not related to . The discrete-time hazards are given by the sigmoid

The logit-hazards are made up of a weighted sum of three functions , , and ,

(10) | ||||

(11) | ||||

Here each of the nine ’s is a function of the covariates , meaning is just a simplified notation of . The exact definitions of these ’s are given in Kvamme and Borgan (2019, Appendix A.1) which also explains the scheme used to draw the covariates. With denoting the ’th time point in , the survival function is given by

(12) |

To incorporate administrative censoring in the simulations, we consider a monotonically decreasing function , and let the censoring time be defined by a threshold such that . This gives the survival function of the censoring distribution

(13) |

Hence, the censoring is deterministic, while the complexity of controls the complexity of the relationship between the covariates and the censoring time . We will let have the same functional form as a survival function, meaning it is defined in the same manner as (12), but with its own set of covariates independent of the covariates of the event-time distribution. In all the experiments we set .

In the experiments, we compare the BCE method with the Logistic-Hazard method. We use the Logistic-Hazard because of its similarity to the BCE method, in that it also minimizes the binary cross-entropy, but use the discrete hazards instead of the survival estimates. We use the implementation of the Logistic-Hazard by Kvamme and Borgan (2019), but the method has been described by multiple authors (Brown, 1975; Biganzoli et al., 1998; Gensheimer and Narasimhan, 2019).

### 5.1 Complicated Censoring Distribution

In the first study, we draw event times using (10) with two covariates per . The function is also defined using (10) and (12), but with an independent set of covariates, ensuring that the event times are independent of the censoring times. Note that although the covariates contain enough information to identify the censoring times, the complexity of the censoring distribution makes this a somewhat hard task.

We fit models using all the covariates from both the event-time distribution and the censoring distribution, giving a total of 36 covariates. We draw 10,000 individuals for training and testing, and 4,000 for a validation set used for early stopping of our training procedure. The networks are ReLU-nets with 4 layers and 32 nodes in each layer. Batch normalization and dropout with a rate of 0.1 are applied between each layer. The BCE and Logistic-Hazard both give estimates for 50 equidistant times points between 0 and 100, but we perform constant density interpolation (linear interpolation of survival estimates, see Kvamme and Borgan, 2019) to obtain predictions for all 1,000 time points.

In Figure 4, we have plotted the Brier score of an uncensored test set, the administrative Brier score (3.2), and four IPCW Brier scores using (5). The IPCW scores are calculated with Kaplan-Meier estimates for the censoring distribution, the true censoring function in (13), and censoring distributions estimated with Logistic-Hazard and max weights of 100 and 1000. Recall from Section 3.1.2 that when we estimate the censoring distribution with other methods than Kaplan-Meier, the weights can become very large, resulting in unstable results. Hence we set a max weight, which is given above the respective plots.

From the figure, we see that both methods perform rather poorly compared to the true survival function, and the Logistic-Hazard performs better than the BCE for all scores except for the IPCW with the true censoring distribution. The problems with IPCW do not appear here because the functional form of the censoring distribution is quite complicated. Hence, the BCE method is not able to identify the ’s to the extent that it can take advantage of the bias of the IPCW scores.

### 5.2 Simple Censoring Distribution

In the second simulations study, we let be defined by (12) with from (11) and let be a function of 5 covariates. This gives a very simple censoring distribution, meaning that the censoring times are quite easily identifiable. We repeat the simulations with the event-time distribution unchanged, and the results are displayed in Figure 5. First, we note that the performance of the BCE method on the uncensored test set is worse than before. This is expected because it is now easier for the BCE method to identify the censoring time , meaning the survival estimates fall to zero right after (as in Figure 2). Nevertheless, the BCE method now gets better IPCW scores than the Logistic-Hazard. This shows that when the administrative censoring function is simple to learn, the BCE method exhibit more of the step behavior of the optimal estimates in (9). We observe, however, that the administrative Brier score rightfully considers the BCE method worse than the Logistic-Hazard. In this regard, we argue that even though the IPCW Brier score might work well for administrative censoring, the administrative Brier score is the safer choice.

In the simulations, we have considered a censoring distribution with covariates that are independent of the survival covariates, so the event-time distribution is independent of the censoring distribution. This might explain why the Kaplan-Meier estimates work so well. If one would encounter this independence for real data, one could simply remove the censoring covariates, as they do not affect the event-time distribution, and use IPCW with Kaplan-Meier weights.

## 6 KKBox Churn Prediction

Finally, we revisit the KKBox data set discussed in Section 2. We fit the BCE method, corresponding to multiple binary classifiers, and the Logistic-Hazard method. The training set is of size 100,000 and we use a validation set of size 20,000 for early stopping. We use 6-layer ReLU networks with 128 nodes in each layer, and with batch normalization and a dropout rate of 0.1 between each layer. Entity embeddings are used for the categorical covariates (Guo and Berkhahn, 2016) with embedding sizes that are the square root of the number of categories. The outputs of the networks are of size 150 representing an equidistant grid of the full time-scale of the training set. Constant density interpolation (Kvamme and Borgan, 2019) is applied to the survival estimates to obtain predictions for all the time-points in the test set. We use the AdamWR (Loshchilov and Hutter, 2019) optimizer with a cycle length of 1 epoch, but we double the cycle length and multiply the learning rate by 0.8 after each cycle. Also, we do not include weight decay. The data set is, essentially, the data set presented by Kvamme et al. (2019), but including all censoring times and an extra categorical covariate stating the payment method. The code for obtaining the data set is available at github.com/havakv/pycox.

The censoring distribution is estimated in two ways. The first is with the Kaplan-Meier estimator and the second is with a Logistic-Hazard with the same hyperparameters as the Logistic-Hazard used to estimate the churn distribution. The Brier scores are computed on a test set of size 100,000 and displayed in Figure 6. The figure contains the two plots from Figure 1, the administrative Brier scores (3.2), and three other IPCW scores (5) with different max weights. Note that a maximum weight of 1 corresponds to the unweighted score (4) where the set of censored individuals are removed. This is the same way the BCE method (7) handles censored individuals.

As we recall from Section 2, the IPCW Brier score with Kaplan-Meier estimates considers the BCE method the best (top left Figure 6). We now see from the administrative scores (top right) that this is probably a wrong conclusion, as the BCE method has worse administrative Brier scores than the Logistic-Hazard. We also see that the IPCW scores obtained with Logistic-Hazard are very dependent on the maximum weight allowed in the scores. For higher allowed weights, we would consider the BCE method to perform worse than the Logistic-Hazard, but note that for the highest times this is not the case.

The administrative scores are simpler to obtain than the IPCW scores, as they do not require estimation of the censoring distribution. Also, from this case study, we see that the administrative scores are probably a safer chose of evaluation metric than the IPCW weighted Brier score.

## 7 Discussion

In this paper, we have addressed potential issues of the inverse probability of censoring weighted (IPCW) Brier score, in particular for administrative censoring. If the covariates have sufficient information to determine the administrative censoring times, the IPCW Brier score will not be minimized by the true survival functions, but instead by a function that falls to zero after the censoring time. We have also shown that a binary classifier that disregards censored individuals will approach this minimizer. As a consequence, the binary classifier might actually get better IPCW Brier scores than the true survival function. Due to this bias, we argue that the IPCW Brier score needs to be applied with care.

The IPCW Brier score can be substantially affected by the estimated censoring distribution. If this is the case, the validity of the scores can be questioned. In regards to both issues, we propose the administrative Brier score that works for administrative right-censored event times and does not require estimation of the censoring distribution. This means that it is simpler to use than the IPCW scores.

We simulate examples where the IPCW score fails to identify the best survival estimates, but the administrative Brier score still provides reasonable scores. We also investigate a real-world churn data set with administrative censoring and find that it exhibits some of the same behavior as our simulations. This shows that the proposed administrative Brier score can be a very useful evaluation metric.

In this paper, we have only investigated the IPCW Brier score, but note that there are multiple other IPCW scores that might suffer from the same drawbacks as the Brier score. For example the IPCW Binomial log-likelihood (Graf et al., 1999) and the IPCW concordance index (Uno et al., 2011; Gerds et al., 2013). Preliminary investigations of the IPCW Binomial log-likelihood (Graf et al., 1999) suggest that it has the same issues as the IPCW Brier score while an administrative version of the Binomial log-likelihood behaves in the same manner as the administrative Brier score.

The issues discussed in this paper are mostly relevant for machine learning methods, such as neural networks, as the issues are only notable for quite precise estimates of the event-time distribution and the censoring times . Hence, we are unlikely to encounter such issues with classical statistical models.

Large parts of the machine learning literature rely on empirical evaluation of predictive methodology. We, therefore, believe that more research on the evaluation metrics for right-censored survival data is needed.

This work was supported by The Norwegian Research Council 237718 through the Big Insight Center for research-driven innovation.

## Appendix A The BCE Survival Estimates

To better understand the survival estimates of the binary classifiers in Section 4, we investigate the minimizers of the expected loss. Again, we stress that the BCE method corresponds to a set of binary classifiers in the manner given by the loss (7) and (8).

The expected loss of the binary classifier (7) is

The minimizers of this expectation with respect to can be found by equating the partial derivatives with zero,

This gives the minimizers

(A.1) | ||||

We see that the ’s are underestimating the true survival as long as there is censoring present. This is expected as the binary classifiers remove censored individuals, decreasing the proportion of survived individuals.

If the censoring times can be identified by the covariates, the censoring distribution is given by the step function . The minimizer (A.1) can then be written as

If , we have

and if , we have . The minimizer can, therefore, be written as

We recognize this as the minimizer of the IPCW Brier score with administrative censoring from Section 3.1.1. So if there is sufficient information in the covariates to identify the administrative censoring times , the binary classifiers will approach the minimizers of the IPCW Brier score.

### References

- Laura Antolini, Patrizia Boracchi, and Elia Biganzoli. A time-dependent discrimination index for survival data. Statistics in Medicine, 24(24):3927–3944, 2005.
- Elia Biganzoli, Patrizia Boracchi, Luigi Mariani, and Ettore Marubini. Feed forward neural networks for the analysis of censored survival data: a partial logistic regression approach. Statistics in Medicine, 17(10):1169–1186, 1998.
- Charles C. Brown. On the use of indicator variables for studying the time-dependence of parameters in a response-time model. Biometrics, 31(4):863–872, 1975.
- Travers Ching, Xun Zhu, and Lana X. Garmire. Cox-nnet: An artificial neural network method for prognosis prediction of high-throughput omics data. PLoS Computational Biology, 14(4):e1006076, 2018.
- Stephane Fotso. Deep neural networks for survival analysis based on a multi-task framework. arXiv preprint arXiv:1801.05512, 2018.
- Michael F. Gensheimer and Balasubramanian Narasimhan. A scalable discrete-time survival model for neural networks. PeerJ, 7:e6257, 2019.
- Thomas A. Gerds and Martin Schumacher. Consistent estimation of the expected Brier score in general survival models with right-censored event times. Biometrical Journal, 48(6):1029–1040, 2006.
- Thomas A. Gerds, Michael W. Kattan, Martin Schumacher, and Changhong Yu. Estimating a time-dependent concordance index for survival prediction models with covariate dependent censoring. Statistics in Medicine, 32(13):2173–2184, 2013.
- Erika Graf, Claudia Schmoor, Willi Sauerbrei, and Martin Schumacher. Assessment and comparison of prognostic classification schemes for survival data. Statistics in Medicine, 18(17-18):2529–2545, 1999.
- Cheng Guo and Felix Berkhahn. Entity embeddings of categorical variables. arXiv preprint arXiv:1604.06737, 2016.
- Frank E. Harrell Jr, Robert M. Califf, David B. Pryor, Kerry L. Lee, and Robert A. Rosati. Evaluating the yield of medical tests. Journal of the American Medical Association, 247(18):2543–2546, 1982.
- Hemant Ishwaran, Udaya B. Kogalur, Eugene H. Blackstone, and Michael S. Lauer. Random survival forests. Annals of Applied Statistics, 2(3):841–860, 2008.
- Jared L. Katzman, Uri Shaham, Alexander Cloninger, Jonathan Bates, Tingting Jiang, and Yuval Kluger. Deepsurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Medical Research Methodology, 18(1), 2018.
- Håvard Kvamme, Ørnulf Borgan, and Ida Scheel. Time-to-event prediction with neural networks and Cox regression. Journal of Machine Learning Research, 20(129):1–30, 2019.
- Håvard Kvamme and Ørnulf Borgan. Continuous and discrete-time survival prediction with neural networks. arXiv preprint arXiv:1910.06724, 2019.
- Changhee Lee, William R Zame, Jinsung Yoon, and Mihaela van der Schaar. Deephit: A deep learning approach to survival analysis with competing risks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
- Margaux Luck, Tristan Sylvain, Héloïse Cardinal, Andrea Lodi, and Yoshua Bengio. Deep learning for patient-specific kidney graft survival analysis. arXiv preprint arXiv:1705.10245, 2017.
- Hajime Uno, Tianxi Cai, Michael J. Pencina, Ralph B. D’Agostino, and L. J. Wei. On the C-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Statistics in Medicine, 30(10):1105–1117, 2011.
- Safoora Yousefi, Fatemeh Amrollahi, Mohamed Amgad, Chengliang Dong, Joshua E. Lewis, Congzheng Song, David A. Gutman, Sameer H. Halani, Jose Enrique Velazquez Vega, Daniel J. Brat, et al. Predicting clinical outcomes from large scale cancer genomic profiles with deep survival models. Scientific Reports, 7(11707), 2017.