Boosted nonparametric hazards with time-dependent covariates
Donald K. K. Lee^1 (Emory University), Ningyuan Chen^2 (University of Toronto), Hemant Ishwaran^3 (University of Miami)
^1 Correspondence: donald.lee@emory.edu.
^2 Supported by the HKUST start-up fund R9382.
^3 Supported by the NIH grant R01 GM125072.
Given functional data from a survival process with time-dependent covariates, we derive a smooth convex representation for its nonparametric log-likelihood functional and obtain its functional gradient. From this we devise a generic gradient boosting procedure for estimating the hazard function nonparametrically. An illustrative implementation of the procedure using regression trees is described to show how to recover the unknown hazard. We show that the generic estimator is consistent if the model is correctly specified; alternatively, an oracle inequality can be demonstrated for tree-based models. To avoid overfitting, boosting employs several regularization devices. One of them is step-size restriction, but the rationale for this is somewhat mysterious from the viewpoint of consistency. Our work brings some clarity to this issue by revealing that step-size restriction is a mechanism for preventing the curvature of the risk from derailing convergence.
MSC 2010 subject classifications. Primary 62N02; secondary 62G05, 90B22.
Keywords. Survival analysis, gradient boosting, functional data, step-size shrinkage, regression trees, likelihood functional.
1. Introduction. Flexible hazard models involving time-dependent covariates are indispensable tools for studying systems that track covariates over time. In medicine, electronic health record systems make it possible to log patient vitals throughout the day, and these measurements can be used to build real-time warning systems for adverse outcomes such as cancer mortality [2]. In financial technology, lenders track obligors' behaviours over time to assess and revise default rate estimates. Such models are also used in many other fields of scientific inquiry, since they form the building blocks for transitions within a Markovian state model. Indeed, this work was partly motivated by our study of patient transitions in emergency department queues and in organ transplant waitlist queues [20]. For example, allocation for a donor heart in the U.S. is defined in terms of coarse tiers [23], and transplant candidates are assigned to tiers based on their health status at the time of listing. However, a patient's condition may change rapidly while awaiting a heart, and this time-dependent information, rather than the static covariates collected far in the past, may be the most predictive of mortality.
The main contribution of this paper is to introduce a fully nonparametric boosting procedure for hazard estimation with time-dependent covariates, one capable of boosting arbitrary weak base learners. Generally speaking, gradient boosting views boosting as an iterative gradient descent algorithm for minimizing a loss functional over a target function space. Early work includes Breiman [6, 7, 8] and Mason et al. [21, 22]. A unified treatment was provided by Friedman [13], who coined the term "gradient boosting", which is now generally taken to be the modern interpretation of boosting.
Most existing boosting approaches for survival data focus on time-static covariates and involve boosting the Cox model. Examples include the popular R packages mboost [10] and gbm [25], which apply gradient boosting to the Cox partial likelihood loss; see also Bühlmann and Hothorn [10]. Related work includes the penalized Cox partial likelihood approach of Binder and Schumacher [4]. Other important approaches not based on the Cox model include L2Boosting [11] with inverse probability of censoring weighting (IPCW) [17, 18], boosted transformation models of parametric families [15], and boosted accelerated failure time models [27].
As just described, there are many proposed boosting methods for dealing with time-static covariates. For time-dependent covariates, however, the literature is much sparser; in fact, so far there is no general nonparametric approach for this setting. This is because, in order to implement a fully nonparametric estimator, one has to contend with the issue of identifying the gradient, which turns out to be a nontrivial problem due to the functional nature of the data. This is unlike most standard applications of gradient boosting, where the gradient can easily be identified and calculated.
1.1. Time-dependent covariate framework. To explain why this is so challenging, we start by formally defining the survival problem with time-dependent covariates. Our description follows the framework of Aalen [1]. Let $T$ denote the potentially unobserved failure time. Conditional on the history up to time $t^-$, the probability of failing at $t$ equals
(1) $\mathbb{P}\{T \in [t, t + dt) \mid \mathcal{F}_{t^-}\} = \lambda(t, X(t))\, Y(t)\, dt.$
Here $\lambda$ denotes the unknown hazard function, $X(t)$ is a predictable covariate process, and $Y(t)$ is a predictable indicator of whether the subject is at risk at time $t$.^1 To simplify notation, without loss of generality we normalize the units of time so that $Y(t) = 0$ for $t > 1$.^2 In other words, the subject is not at risk after time $1$, so we can restrict attention to the time interval $[0, 1]$.
^1 The filtration of interest is the one generated by the covariate, at-risk, and failure processes. If $X$ is only observable when $Y = 1$, we can fix $X(t)$ at an arbitrary value whenever $Y(t) = 0$.
^2 Since the data is always observed up to some finite time, there is no information loss from censoring at that point. For example, if $\tilde{T}$ is the failure time in minutes and the longest duration in the data is $60$ minutes, then the failure time in hours, $T = \tilde{T}/60$, is at most $1$ hour. The hazard function on the minute timescale, $\tilde{\lambda}$, can be recovered from the hazard function on the hourly timescale, $\lambda$, via $\tilde{\lambda}(t, x) = \lambda(t/60, x)/60$.
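The minute/hour rescaling above can be checked numerically. The sketch below is illustrative only: `lam_hr` is a made-up hazard on the hourly scale, and the cumulative hazards are approximated by left Riemann sums; the two survival probabilities agree, confirming that the time change preserves the distribution.

```python
import numpy as np

# Hazard on the hourly scale (arbitrary choice, for illustration only).
lam_hr = lambda s: 2.0 + s
# Recovered hazard on the minute scale: lam_min(t) = lam_hr(t/60)/60.
lam_min = lambda t: lam_hr(t / 60.0) / 60.0

# Left Riemann sums of the cumulative hazards over [0, 1] hour and [0, 60] min.
s = np.linspace(0.0, 1.0, 100001)
t = 60.0 * s
cum_hr = np.sum(lam_hr(s[:-1]) * np.diff(s))
cum_min = np.sum(lam_min(t[:-1]) * np.diff(t))

surv_hr = np.exp(-cum_hr)    # survival probability on the hourly scale
surv_min = np.exp(-cum_min)  # identical probability on the minute scale
```

The change of variables $t = 60 s$ makes the two integrals term-by-term identical, so the agreement holds to floating-point precision.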
If failure is observed at time $T \le 1$ then the indicator $\delta$ equals $1$; otherwise $\delta = 0$ and we set $T = 1$. Throughout, we assume we observe $n$ independent and identically distributed functional data samples $\{(X_i(t), Y_i(t), T_i, \delta_i)\}_{i=1}^n$. The evolution of observation $i$'s failure status can then be thought of as a sequence of coin flips at time increments $dt$, with the probability of "heads" at each time point given by (1). Therefore, observation $i$'s contribution to the likelihood is
$$\lim_{dt \to 0}\; \lambda(T_i, X_i(T_i))^{\delta_i} \prod_{t} \big\{ 1 - \lambda(t, X_i(t))\, Y_i(t)\, dt \big\} = \lambda(T_i, X_i(T_i))^{\delta_i} \exp\Big( -\int_0^1 \lambda(t, X_i(t))\, Y_i(t)\, dt \Big),$$
where the limit can be understood as a product integral. Hence, if the log-hazard function is
$$F(t, x) = \log \lambda(t, x),$$
then the (scaled) negative log-likelihood functional is
(2) $R_n(F) = \frac{1}{n} \sum_{i=1}^n \Big\{ \int_0^1 e^{F(t, X_i(t))}\, Y_i(t)\, dt - \delta_i F(T_i, X_i(T_i)) \Big\},$
which we shall refer to as the likelihood risk. The goal is to estimate the log-hazard function $F$ nonparametrically.
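The coin-flip description above can be made concrete with a small simulation. This is a sketch under stated assumptions: the hazard `lam` and the random-walk covariate are made-up stand-ins for $\lambda$ and $X(t)$, and time is discretized into steps of size `dt` as in (1).

```python
import numpy as np

def simulate_subject(lam, dt=0.001, seed=0):
    """Simulate one subject on [0, 1]: at each time step, "heads" (failure)
    occurs with probability lam(t, x) * dt, per the discretization of (1)."""
    rng = np.random.default_rng(seed)
    t, x = 0.0, 0.0
    while t < 1.0:
        if rng.random() < lam(t, x) * dt:
            return t, 1                        # failure time T, indicator delta = 1
        x += rng.normal(scale=np.sqrt(dt))     # covariate evolves over time
        t += dt
    return 1.0, 0                              # still at risk at t = 1: delta = 0

lam = lambda t, x: 1.0 + 0.5 * x ** 2          # made-up hazard for illustration
sample = [simulate_subject(lam, seed=i) for i in range(20)]
```

Each tuple in `sample` plays the role of one functional observation $(T_i, \delta_i)$, with the covariate path generated on the fly.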
1.2. The likelihood does not have a gradient in generic function spaces. As mentioned, our approach is to boost the log-hazard by functional gradient descent. The chief difficulty, however, is that the canonical representation of the likelihood risk functional does not have a gradient. To see this, observe that the directional derivative of (2) in the direction of a function $G$ equals
(3) $\frac{d}{d\eta} R_n(F + \eta G)\Big|_{\eta = 0} = \frac{1}{n} \sum_{i=1}^n \Big\{ \int_0^1 G(t, X_i(t))\, e^{F(t, X_i(t))}\, Y_i(t)\, dt - \delta_i G(T_i, X_i(T_i)) \Big\},$
which is the difference of two inner products taken with respect to two different measures on the time-covariate domain: the first averages $G e^{F}$ over the observed at-risk trajectories, while the second averages $G$ over the observed failure points.
Hence, (3) cannot be expressed as a single inner product of the form $\langle G, g \rangle$ for some function $g$. Were it possible to do so, $g$ would then be the gradient function.
In simpler non-functional data settings like regression or classification, the loss can be written as $\sum_i \phi(f(x_i), y_i)$, where $f$ is the non-functional target statistical model and $y_i$ is the outcome, so the gradient at $x_i$ is simply $\frac{\partial}{\partial f} \phi(f(x_i), y_i)$. The negative gradient is then approximated using a base learner $g$ from a predefined class of functions (either parametric, for example linear learners, or nonparametric, for example tree learners). Typically, the optimal base learner $g^*$ is chosen to minimize the approximation error and then scaled by a regularization parameter $\nu$ to obtain the updated estimate of $f$:
$$f \leftarrow f + \nu g^*.$$
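For concreteness, here is a minimal sketch of this non-functional recipe for the squared-error loss $\phi(f, y) = (f - y)^2 / 2$, whose negative gradient at each sample point is just the residual $y_i - f(x_i)$. The constant-fitting base learner is a deliberately trivial stand-in for a tree.

```python
import numpy as np

def gradient_boost(X, y, fit_base, n_steps=200, nu=0.1):
    """Gradient boosting for squared-error loss: repeatedly fit a weak learner
    to the negative gradient (the residuals) and take a damped step."""
    pred = np.zeros(len(y))
    for _ in range(n_steps):
        residual = y - pred           # negative gradient of the loss at pred
        g = fit_base(X, residual)     # weak-learner approximation to it
        pred = pred + nu * g(X)       # update f <- f + nu * g
    return pred

# Trivial base learner: fit a constant (the residual mean), a "depth-0 tree".
def fit_mean(X, r):
    c = r.mean()
    return lambda X: np.full(len(X), c)

X = np.arange(10.0)
y = np.ones(10)
pred = gradient_boost(X, y, fit_mean)
```

With this learner the error shrinks geometrically by the factor $1 - \nu$ per iteration, illustrating how the damped update still reaches the minimizer.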
Importantly, in the simpler non-functional setting the gradient does not depend on the space that $f$ belongs to. By contrast, a key insight of this paper is that the gradient of $R_n$ can only be defined after carefully specifying an appropriate sample-dependent domain for it. The likelihood risk can then be re-expressed as a smooth convex functional, and an analogous representation also exists for the population risk. These representations resolve the difficulty above, allow us to describe and implement a gradient boosting procedure, and are also crucial to establishing guarantees for our estimator.
1.3. Contributions of the paper. A key discovery of this paper is Proposition 1 of Section 2.2, which provides an integral representation for the likelihood risk from which several results follow, including, importantly, an explicit representation for the gradient. Proposition 1 relies on defining a suitable space of log-hazard functions on the time-covariate domain. Identifying this space is the key insight that allows us to rescue the likelihood approach and to derive the gradient needed to implement gradient boosting. Arriving at this framework is not conceptually trivial, and may explain the absence of boosted nonparametric hazard estimators until now.
Algorithm 1 of Section 2.3 describes our boosted hazard estimator. The algorithm minimizes the likelihood risk (2) over the defined space of log-hazard functions. In the special case of regression tree learners, expressions for the likelihood risk and its gradient are obtained from Proposition 1 and used to describe a tree-based implementation of our estimator. The implementation is then applied to a high-dimensional dataset generated from a naturalistic simulation of patient service times in an emergency department.
Section 3 establishes the consistency of the procedure. We show that the hazard estimator is consistent if the model space is correctly specified. In particular, if the space is the span of regression trees, then the hazard estimator satisfies an oracle inequality and recovers the underlying hazard up to some error tolerance (Propositions 3 and 4).
Another contribution of our work is to clarify the mechanisms used by gradient boosting to avoid overfitting. Gradient boosting typically applies two types of regularization to enforce slow learning: (i) a small step-size is used for the update; and (ii) the number of boosting iterations is capped. The number of iterations used in our algorithm is set using the framework of Zhang and Yu [31], whose work explains why early stopping is necessary for consistency. On the other hand, the role of step-size restriction is more mysterious. While [31] demonstrates that small step-sizes are needed to prove consistency, unrestricted greedy step-sizes are already small enough for classification problems [28] and also for commonly used regression losses (see the Appendix of [31]). We show in Section 3 that, for the setting considered here, shrinkage acts as a counterweight to the curvature of the risk (see Lemma 2). Hence, if the curvature is unbounded, as is the case for hazard regression, then the step-sizes may need to be explicitly controlled to ensure convergence. This is an important result which adds to our knowledge of the numerical convergence of gradient boosting. As noted by Biau and Cadre [3], the literature on this topic is still relatively sparse, which motivated their study of the numerical convergence of two types of gradient boosting procedures. Our work adds to this by considering the functional data setting.
Concluding remarks can be found in the final section. Proofs not appearing in the body of the paper can be found in the appendix.
2. The boosted hazard estimator. In this section, we describe our boosted hazard estimator. To provide readers with concrete examples of the ideas introduced here, we show how the quantities defined in this section specialize in the case of regression trees, one of several possible ways to implement boosting.
We begin by defining in Section 2.1 an appropriate sample-dependent domain for the likelihood risk $R_n$. As explained, the key insight of this paper is that this allows us to re-express the likelihood risk and its population analogue as smooth convex functionals, thereby enabling us to compute their gradients in Propositions 1 and 2 of Section 2.2. Following this, the boosting algorithm is formally stated in Sections 2.3-2.4.
2.1. Specifying a domain for $R_n$. We will make use of two identifiability conditions, (A1) and (A2), to define the domain for $R_n$. Condition (A1) below is the same as Condition 1(iv) of Huang and Stone [19].
Assumption (A1). The true hazard function $\lambda$ is bounded away from zero and infinity on the time-covariate domain: $0 < \lambda_{\min} \le \lambda \le \lambda_{\max} < \infty$.
Recall we defined $X(t)$ and $Y(t)$ to be predictable processes, so it can be shown that the integrals and expectations appearing in this paper are all well defined. Denoting the indicator function by $1\{\cdot\}$, define the following population and empirical subprobability measures on the time-covariate domain:
$$\mu(A) = \mathbb{E}\Big[ \int_0^1 1\{(t, X(t)) \in A\}\, Y(t)\, dt \Big], \qquad \hat\mu(A) = \frac{1}{n} \sum_{i=1}^n \int_0^1 1\{(t, X_i(t)) \in A\}\, Y_i(t)\, dt,$$
and note that $\mathbb{E}[\hat\mu(A)] = \mu(A)$ because the data is i.i.d. by assumption. Intuitively, $\hat\mu$ measures the denseness of the observed sample time-covariate paths. For any integrable $g$,
(4) $\int g\, d\mu = \mathbb{E}\Big[ \int_0^1 g(t, X(t))\, Y(t)\, dt \Big],$
(5) $\int g\, d\hat\mu = \frac{1}{n} \sum_{i=1}^n \int_0^1 g(t, X_i(t))\, Y_i(t)\, dt.$
This allows us to define the following (random) norms and inner products:
$$\|g\|_{\hat\mu}^2 = \int g^2\, d\hat\mu, \quad \langle g, h \rangle_{\hat\mu} = \int g h\, d\hat\mu, \qquad \|g\|_{\mu}^2 = \int g^2\, d\mu, \quad \langle g, h \rangle_{\mu} = \int g h\, d\mu,$$
and note that $\mathbb{E}\big[ \|g\|_{\hat\mu}^2 \big] = \|g\|_{\mu}^2$ because $\mathbb{E}[\hat\mu] = \mu$.
By design, $\hat\mu$ allows us to specify a natural domain for $R_n$. Let $\{f_j\}_j$ be a set of bounded functions that are linearly independent, in the sense that $\sum_j \beta_j f_j = 0$ almost everywhere if and only if $\beta = 0$ (when some of the covariates are discrete-valued, the reference measure should be interpreted as the product of a counting measure and the Lebesgue measure). The span of the functions is $\mathcal{F} = \big\{ \sum_j \beta_j f_j \big\}$.
For example, the span of all regression tree functions that can be defined on the time-covariate domain is of this form,^3 these being linear combinations of indicator functions over disjoint time-covariate cubes $\{R_k\}_k$ indexed^4 by $k$:
^3 It is clear that said span is contained in the span of the indicators. For the converse, it suffices to show that each indicator is also contained in the span of trees of some depth. This is easy to show because trees with enough splits can generate partitions of the required form (Section 3 of [8]).
^4 With a slight abuse of notation, the index $k$ is only considered multidimensional when describing the geometry of $R_k$, such as in (6). In all other situations $k$ should be interpreted as a scalar index.
(6) $F(t, x) = \sum_k \beta_k\, 1\{(t, x) \in R_k\}.$
The regions $\{R_k\}_k$ are formed using all possible split points for each covariate, with the spacings determined by the precision of the measurements. For example, if weight is measured to the closest kilogram, then the possible split points will be spaced one kilogram apart. Note that these split points are the finest possible for any realization of weight that is measured to the nearest kilogram. While abstract treatments of trees assume that there is a continuum of split points, in reality they fall on a discrete (but fine) grid.
When $\mathcal{F}$ is equipped with $\|\cdot\|_{\hat\mu}$, we obtain a sample-dependent subspace of square-integrable functions, which is the appropriate domain for $R_n$.
Note that the elements of this subspace are equivalence classes rather than actual functions that have well-defined values at each point $(t, x)$. This is a problem because the likelihood (2) requires evaluating $F$ at the failure points $(T_i, X_i(T_i))$ where $\delta_i = 1$. We resolve this by fixing an orthonormal basis $\{\phi_j\}_j$ for the subspace, and represent each member uniquely in the form $\sum_j \beta_j \phi_j$. For example, in the case of regression trees, applying the Gram-Schmidt procedure to $\{1_{R_k}\}_k$ gives
$$\phi_k = \frac{1_{R_k}}{\|1_{R_k}\|_{\hat\mu}} \quad \text{for } k \text{ such that } \hat\mu(R_k) > 0,$$
which by design have disjoint support.
The second condition we impose is for $\{f_j\}_j$ to be linearly independent in $L^2(\mu)$, that is, $\|\sum_j \beta_j f_j\|_{\mu} = 0$ if and only if $\beta = 0$. Since $\{f_j\}_j$ is by construction already linearly independent on the time-covariate domain, the condition intuitively requires the set of all possible time-covariate trajectories to be adequately dense, intersecting a sufficient amount of the support of every $f_j$. This is weaker than the identifiability conditions 1(ii)-1(iii) in Huang and Stone [19], which require the time-covariate process to have a positive joint probability density on the domain.
Assumption (A2). The Gram matrix $\big( \langle f_j, f_k \rangle_{\mu} \big)_{j,k}$ is positive definite.
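Condition (A2) can be probed numerically. The sketch below is illustrative only: a made-up uniform measure stands in for $\mu$, the toy basis is $\{1, t, x\}$, and positive definiteness of the Monte Carlo Gram matrix is checked via its smallest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.random((2000, 2))                   # (t, x) points drawn under a toy mu
basis = np.stack([np.ones(2000),              # f_1 = 1
                  pts[:, 0],                  # f_2 = t
                  pts[:, 1]])                 # f_3 = x
gram = basis @ basis.T / 2000                 # Monte Carlo estimate of <f_j, f_k>_mu
eig_min = np.linalg.eigvalsh(gram).min()      # > 0 iff the basis is identifiable
```

A smallest eigenvalue bounded away from zero is exactly what (A2) demands; if a nonzero combination of the basis vanished $\mu$-almost everywhere, `eig_min` would collapse to zero.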
2.2. Integral representations for the likelihood risk. Having deduced the appropriate domain for $R_n$, we can now recast the risk as a smooth convex functional on it, with an explicit closed-form expression for the gradient.
Proposition 1. For functions of the form $F = \sum_j \beta_j \phi_j$, the likelihood risk (2) can be written as
(7) $R_n(F) = \int \big( e^{F} - F \hat{h} \big)\, d\hat\mu,$
where $\hat{h}$ is the function
$$\hat{h} = \sum_j \Big( \frac{1}{n} \sum_{i=1}^n \delta_i\, \phi_j(T_i, X_i(T_i)) \Big) \phi_j.$$
Thus there exists $\xi \in [0, 1]$ (depending on $F$ and $G$) for which the Taylor representation
(8) $R_n(F + G) = R_n(F) + \langle G, g_F \rangle_{\hat\mu} + \frac{1}{2} \int G^2 e^{F + \xi G}\, d\hat\mu$
holds, where the gradient
(9) $g_F = \Pi\big( e^{F} \big) - \hat{h}$
of $R_n$ at $F$ is expressed via the projection $\Pi$ onto the span of $\{\phi_j\}_j$. Hence if $\log \hat{h}$ belongs to the span, then the infimum of $R_n$ over the span is uniquely attained at $\log \hat{h}$.
For regression trees, the expressions (7) and (9) simplify further because the span is closed under pointwise exponentiation: since the $R_k$'s are disjoint and cover the domain, $F = \sum_k \beta_k 1_{R_k}$ implies $e^{F} = \sum_k e^{\beta_k} 1_{R_k}$, which again lies in the span. Thus
(10) $R_n(F) = \sum_k \Big( e^{\beta_k}\, \hat\mu(R_k) - \beta_k\, \frac{N_k}{n} \Big),$
(11) $g_F = \sum_{k :\, \hat\mu(R_k) > 0} \Big( e^{\beta_k} - \frac{N_k}{n\, \hat\mu(R_k)} \Big) 1_{R_k},$
where
$$N_k = \sum_{i=1}^n \delta_i\, 1\{(T_i, X_i(T_i)) \in R_k\}$$
is the number of observed failures in the time-covariate region $R_k$.
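In this tree setting, setting each cell's gradient to zero yields the familiar occurrence/exposure estimate: hazard in a cell equals observed failures divided by total at-risk time. The following sketch is illustrative only, restricted to time bins (covariate splits omitted for brevity) with made-up data.

```python
import numpy as np

def cell_hazards(T, delta, edges):
    """Occurrence/exposure hazard estimate on time bins [edges[k], edges[k+1]):
    failures in the bin divided by the total time at risk in the bin. This is
    the cell-by-cell minimizer of a tree-based risk like (10), restricted to
    time splits only."""
    T, delta = np.asarray(T, float), np.asarray(delta)
    out = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        exposure = np.sum(np.clip(T, lo, hi) - lo)            # at-risk time in bin
        failures = np.sum((delta == 1) & (T >= lo) & (T < hi))
        out.append(failures / exposure if exposure > 0 else 0.0)
    return np.array(out)

# Two subjects: one fails at t = 0.5, one is censored at t = 1.
haz = cell_hazards([0.5, 1.0], [1, 0], edges=[0.0, 1.0])
```

With one failure and total exposure $0.5 + 1.0 = 1.5$, the single-bin estimate is $2/3$, matching $N_k / (n\,\hat\mu(R_k))$ with $n = 2$ and $\hat\mu(R_k) = 0.75$.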
Proof of Proposition 1.
Fix a realization of the data. Using (5) we can rewrite (2) as
$$R_n(F) = \int e^{F}\, d\hat\mu - \frac{1}{n} \sum_{i=1}^n \delta_i F(T_i, X_i(T_i)).$$
We can express $F$ in terms of the basis as $F = \sum_j \beta_j \phi_j$. Hence
$$\frac{1}{n} \sum_{i=1}^n \delta_i F(T_i, X_i(T_i)) = \sum_j \beta_j \cdot \frac{1}{n} \sum_{i=1}^n \delta_i\, \phi_j(T_i, X_i(T_i)) = \langle F, \hat{h} \rangle_{\hat\mu} = \int F \hat{h}\, d\hat\mu,$$
where the second equality follows from the orthonormality of the basis. This completes the derivation of (7).
By an interchange argument we obtain
$$\frac{d^2}{d\eta^2} R_n(F + \eta G) = \int G^2 e^{F + \eta G}\, d\hat\mu,$$
the latter being positive whenever $G \neq 0$; i.e., $R_n$ is convex. The Taylor representation (8) then follows from noting that $g_F$ is defined via the orthogonal projection of $e^{F}$ onto the span of $\{\phi_j\}_j$. ∎
The expectation of the likelihood risk also has an integral representation. A special case of the representation (12) below is proved in Proposition 3.2 of Huang and Stone [19] for right-censored data only, under assumptions that do not allow for internal covariates. In the statement of the proposition below, recall that $\lambda_{\min}$ and $\lambda_{\max}$ are defined in (A1); the constant appearing in the coercivity bound is defined later in (26).
Proposition 2. For $F \in \mathcal{F}$,
(12) $\mathbb{E}\, R_n(F) = \int \big( e^{F} - \lambda F \big)\, d\mu.$
Furthermore, the restriction of $\mathbb{E}\, R_n$ to $\mathcal{F}$ is coercive:
(13) $\mathbb{E}\, R_n(F) \to \infty \quad \text{as } \|F\|_{\mu} \to \infty,$
and it attains its minimum at a unique point $F^*$. If $\mathcal{F}$ contains the underlying log-hazard function $\log \lambda$, then $F^* = \log \lambda$.
Remark 1. Coerciveness (13) implies that any $F \in \mathcal{F}$ with expected risk less than that of $F \equiv 0$ is uniformly bounded:
(14) $\|F\|_{\mu} \le C_0,$
where the constant $C_0$, defined in (15), is by design no smaller than $1$ in order to simplify subsequent analyses.
2.3. The boosting procedure. Algorithm 1 describes the proposed boosting procedure for estimating $\log \lambda$. A weak learner approximation to the negative gradient $-g_F$ is used as the descent direction. Some popular choices are:
- Shallow regression trees that are correlated with $-g_F$. This is the framework considered by Friedman [13].
- The member of the basis most aligned with $-g_F$. This is a variant of coordinate descent (Chapter 7.2 of Schapire and Freund [26]).
- If coordinate descent is used and each $f_j$ depends on only one component of the covariate vector, we obtain componentwise learners as in Bühlmann and Yu [11].
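The coordinate-descent choice can be sketched as follows. This is illustrative only: basis functions and the negative gradient are represented by their values on sample points (a made-up design), and alignment is measured by an empirical weighted inner product.

```python
import numpy as np

def most_aligned(basis_vals, neg_grad_vals, w):
    """Pick the basis function phi_j maximizing |<phi_j, -g>| under the
    empirical (weighted) inner product; return its index and sign.
    basis_vals: (J, n) array of J basis functions evaluated at n sample points."""
    scores = basis_vals @ (neg_grad_vals * w)   # empirical inner products
    j = int(np.argmax(np.abs(scores)))
    return j, float(np.sign(scores[j]))

basis_vals = np.eye(4)                      # 4 disjoint indicator-like basis fns
neg_grad = np.array([0.2, -1.5, 0.3, 0.9])  # negative gradient at sample points
j, s = most_aligned(basis_vals, neg_grad, w=np.ones(4) / 4)
```

The selected signed basis function then serves as the descent direction for the current boosting iteration.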
To model a generic approximation to a nonzero gradient, we introduce the concept of an $\epsilon$-gradient.
Definition 1. Suppose $g_F \neq 0$. We say that a unit vector $G$ (i.e., $\|G\|_{\hat\mu} = 1$) is an $\epsilon$-gradient at $F$ if for some $\epsilon \in (0, 1]$,
(16) $\langle G, g_F \rangle_{\hat\mu} \ge \epsilon\, \|g_F\|_{\hat\mu}.$
Call $-G$ a negative $\epsilon$-gradient if $G$ is an $\epsilon$-gradient.
Since the remainder term in the Taylor expansion (8) is nonnegative due to convexity, $R_n$ can be decreased along a direction if and only if that direction is a negative $\epsilon$-gradient for some $\epsilon > 0$. The larger $\epsilon$ is, the closer the alignment between the $\epsilon$-gradient and the gradient, and the greater the risk reduction. In particular, $g_F / \|g_F\|_{\hat\mu}$ is the unique 1-gradient, with steepest descent and maximal risk reduction. While at first glance this might seem to suggest that $\epsilon$-gradients with higher alignment should be preferred, it is well known that the statistical performance of gradient descent generally improves when simple base learners are used. This slow learning (regularization) is essentially a tradeoff between the complexity of the base learner space and the amount of risk reduction achieved in one boosting iteration. Thus using simpler descent directions with smaller $\epsilon$ has the balancing effect of improving statistical performance. Our consistency results, Propositions 3 and 4 of Section 3, qualitatively capture this tradeoff.
In the case of regression trees, the alignment $\epsilon$ is nondecreasing in the number of tree splits (the complexity of the learner). For a given $\epsilon$, it is always possible to find a deep enough tree that is an $\epsilon$-gradient, because with enough splits we can recover the gradient itself up to normalization, $\hat\mu$-almost everywhere.^5 Conversely, the tree complexity can be set first (using cross-validation to select the number of splits), and afterward $\epsilon$ is implicitly the minimum of $\epsilon_m$ over the iterations, where $\epsilon_m$ is the alignment achieved in (16) at the $m$-th iteration. Since the $\epsilon$-gradient at each iteration is nonzero (enforced by line 2 of Algorithm 1), this ensures that $\epsilon > 0$.
^5 Split the tree until each terminal node contains just one of the regions $R_k$ in (6) with $\hat\mu(R_k) > 0$. Then set the value of the node equal to the value of the gradient function (11) inside $R_k$.
2.4. Regularization steps. Algorithm 1 makes use of two regularization parameters: one defines the early stopping criterion, while the other controls the step-size. These are two common regularization techniques for boosting:
- Early stopping. The number of boosting iterations is controlled by stopping the algorithm before the uniform norm of the estimator reaches or exceeds the threshold in (17), which is defined in terms of the branch $W_0$ of the Lambert function that returns the real root of the equation $w e^w = z$ for $z \ge 0$.
- Step-sizes. The step-size used in gradient boosting is typically held constant across iterations. While we can also do this in our procedure,^6 the role of step-size shrinkage becomes more salient if we use a decaying step-size for the $m$-th iteration in Algorithm 1. This step-size is controlled in two ways. First, it is made to decrease with each iteration according to the Robbins-Monro condition: the sum of the steps diverges while the sum of squared steps converges. Second, the shrinkage factor is selected to make the step-sizes decay with $n$ at the rate specified in (18). This acts as a counterbalance to the unbounded curvature of $R_n$ in (19): by (8), the curvature along a direction $G$ is $\int G^2 e^{F + \eta G}\, d\hat\mu$, which grows exponentially with the magnitude of the iterates but remains controlled when the uniform norms of $F$ and $G$ are bounded.
^6 The relevant term in condition (18) would need to be adjusted if a constant step-size is used.
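The Robbins-Monro requirement can be sanity-checked numerically with the canonical choice $\epsilon_m \propto 1/m$ (illustrative only; the paper's actual step-sizes follow the rate in (18)):

```python
import numpy as np

m = np.arange(1, 200001, dtype=float)
steps = 1.0 / m                 # candidate step-sizes eps_m = 1/m
sum_steps = steps.sum()         # partial sums diverge like log(m)
sum_sq = (steps ** 2).sum()     # converges (to pi^2 / 6 = 1.6449...)
```

The divergent first sum lets the iterates travel arbitrarily far if needed, while the convergent second sum caps the cumulative damage from the curvature term, which scales with the squared step-size in (8).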
3. Consistency. Under (A1)-(A2), guarantees for our hazard estimator in Algorithm 1 can be derived for two scenarios of interest. In the following development, recall from Proposition 2 that $F^*$ is the unique minimizer of $\mathbb{E}\, R_n$, so it satisfies the first-order condition
(20) $\int \big( e^{F^*} - \lambda \big) f_j\, d\mu = 0$
for all $j$. Recall also that the span of all trees is closed under pointwise exponentiation, in which case (20) implies that $e^{F^*}$ is the orthogonal projection of $\lambda$ onto the span.

- Consistency when the model is correctly specified. If the true log-hazard function $\log \lambda$ is in the span, then Proposition 2 asserts that $F^* = \log \lambda$. It will be shown in this case that the boosted estimator converges to $\log \lambda$.
- Oracle inequality for regression trees. If the span is closed under pointwise exponentiation, it follows from (20) that $e^{F^*}$ is the best approximation to $\lambda$ among all candidate hazard estimators $e^{F}$ with $F$ in the span. It can then be shown that the boosted hazard estimator converges to this best approximation. This oracle result is in the spirit of the guarantees available for tree-based boosting in the non-functional data setting. For example, if tree stumps are used for regression, then the regression function estimate will converge to the best approximation to the true regression function in the span of tree stumps [9]. Similar results also exist for boosted classifiers [5].
Propositions 3 and 4 below formalize these guarantees by providing bounds on the error terms above. While sharper bounds may exist, the chief purpose of this paper is to introduce our generic estimator for the first time and to provide guarantees that apply across different implementations. More refined convergence rates may exist for a specific implementation, as in the analysis of Bühlmann and Yu [11] for L2Boosting with componentwise spline learners. We leave this investigation for future research.
En route to establishing the guarantees, Lemma 2 clarifies the role played by step-size restriction in ensuring convergence of the estimator. As explained in the Introduction, explicit shrinkage is not necessary for classification and regression problems, where the risk has bounded curvature. Lemma 2 suggests that it may, however, be needed when the risk has unbounded curvature, as is the case with $R_n$. Seen in this light, shrinkage is really a mechanism for controlling the growth of the risk curvature.
3.1. Strategy for establishing guarantees. The representations for $R_n$ and its population analogue from Section 2.2 are the key ingredients for formalizing the guarantees. We use them to first show that the estimator $\hat{F}_n$ converges to $F^*$: applying Taylor's theorem to (12) about $F^*$ yields (21), which shows that the distance between $e^{\hat{F}_n}$ and $e^{F^*}$ is controlled by the excess risk $\mathbb{E}\, R_n(\hat{F}_n) - \mathbb{E}\, R_n(F^*)$.
The problem is thus transformed into one of risk minimization, for which [31] suggests analyzing separately the terms of the decomposition
$$\mathbb{E} R_n(\hat{F}_n) - \mathbb{E} R_n(F^*) = \underbrace{\big\{ \mathbb{E} R_n(\hat{F}_n) - R_n(\hat{F}_n) \big\}}_{\text{(I)}} + \underbrace{\big\{ R_n(F^*) - \mathbb{E} R_n(F^*) \big\}}_{\text{(II)}} + \underbrace{\big\{ R_n(\hat{F}_n) - R_n(F^*) \big\}}_{\text{(III)}}.$$
The authors argue that in boosting, the point of limiting the number of iterations (enforced by lines 5-10 in Algorithm 1) is to prevent the estimator from growing too fast, so that (I) converges to zero as $n \to \infty$. At the same time, the estimator is allowed to grow with $n$ in a controlled manner so that the empirical risk in (III) is eventually minimized as $n \to \infty$. Lemmas 1 and 2 below show that our procedure achieves both goals. Lemma 1 makes use of complexity theory via empirical processes, while Lemma 2 deals with the curvature of the likelihood risk. The term (II) will be bounded using standard concentration results.
3.2. Bounding (I) using complexity. To capture the effect of using a simple $\epsilon$-gradient (16) as the descent direction, we bound (I) in terms of the complexity of the class (23) of functions that the algorithm can actually visit.^7 Depending on the choice of weak learners for the $\epsilon$-gradients, this class may be much smaller than the full span. For example, coordinate descent might only ever select a small subset of basis functions because of sparsity. As another example, if $\log \lambda$ is additively separable in time and in each covariate, then regression trees might only ever select simple tree stumps (one tree split).
^7 For technical convenience, the class in (23) is enlarged to include the unit ball.
The measure of complexity we use below comes from empirical process theory. Suppose that $\tilde\mu$ is a subprobability measure on the time-covariate domain, and let $B_{\tilde\mu}(g_0, r) = \{ g : \|g - g_0\|_{\tilde\mu} \le r \}$ denote the ball of radius $r$ centred at $g_0$. The covering number $N(r, \mathcal{G}, \tilde\mu)$ is the minimum number of such balls needed to cover $\mathcal{G}$ (Definitions 2.1.5 and 2.2.3 of van der Vaart and Wellner [29]). A complexity measure for $\mathcal{G}$ is given by the entropy quantity in (24), where the supremum is taken over all radii and over all nonzero subprobability measures. As discussed, this complexity is never greater than, and potentially much smaller than, the complexity of the full span, which is fixed and finite.
Before stating Lemma 1, we note that the result also shows that an empirical analogue to the norm equivalences (25) exists, with the constant defined in (26). The factor of 2 appearing there serves to simplify the presentation, and can be replaced with anything greater than 1.
Lemma 1.
There exists a universal constant such that for any , with probability at least
an analogue to (25) holds for all :
(27) 
and for all ,
(28) 
Remark 2.
The equivalences (27) imply that equals its upper bound . That is, if , then , so because are linearly independent on .
3.3. Bounding (III) using curvature. We use the representation in Proposition 1 to study the minimization of the empirical risk by boosting. Standard results for exact gradient descent like Theorem 2.1.15 of Nesterov [24] are in terms of the norm of the minimizer, which may not exist for .^{8}^{8}8The infimum of is not always attainable: If is nonpositive and vanishes on the set , then