Constructing Dynamic Treatment Regimes in InfiniteHorizon Settings
Abstract
The application of existing methods for constructing optimal dynamic treatment regimes is limited to cases where investigators are interested in optimizing a utility function over a fixed period of time (finite horizon). In this manuscript, we develop an inferential procedure based on temporal difference residuals for optimal dynamic treatment regimes in infinitehorizon settings, where there is no a priori fixed end of followup point. The proposed method can be used to determine the optimal regime in chronic diseases where patients are monitored and treated throughout their life. We derive large sample results necessary for conducting inference. We also simulate a cohort of patients with diabetes to mimic the third wave of the National Health and Nutrition Examination Survey, and we examine the performance of the proposed method in controlling the level of hemoglobin A1c. Supplementary materials for this article are available online.
Some key words: Actionvalue function; Causal inference; Backward induction; Temporal difference residual.
A dynamic treatment regime (DTR) is a treatment process that considers patients’ individual characteristics and their ongoing performance to decide which treatment option to assign. DTRs can, potentially, reduce side effects and treatment costs, which makes the process attractive for policy makers. The optimal DTR is the one that, if followed, yields the most favorable outcome on average. Depending on the context, the DTR is also called an adaptive intervention (Collins et al., 2004) or adaptive strategy (Lavori & Dawson, 2000).
The goal of this manuscript is to devise a new methodology that can be used to construct the optimal DTR in infinitehorizon settings (i.e., when the number of decision points is not necessarily fixed for all individuals). The estimation procedure, however, is based on observational data collected over a fixed period of time that includes many decision points. One potential application of our method is to estimate the optimal treatment regime for chronic diseases using data extracted from an electronic medical record data set during a fixed period of time.
This work was motivated by the National Health and Nutrition Examination Survey (NHANES), which was designed to assess the health status of adults and children in the United States. Timbie et al. (2010) simulates a cohort of subjects diagnosed with diabetes that mimics the third wave of the NHANES and uses this cohort to evaluate the ability of available treatments to control risk factors. Specifically, Timbie and colleagues’ study was designed to manage the risk factors for vascular complications such as high blood pressure, high cholesterol and high blood glucose (Grundy et al., 2004; Hunt, 2008). In this manuscript, we simulate a cohort similar to Timbie’s. Our focus is on constructing a DTR for lowering hemoglobin A1c among patients with diabetes.
One challenge in constructing an optimal regime is avoiding treatments that are optimal in the short term but do not result in an optimal longterm outcome. To address this challenge, Murphy (2003) introduced a method based on backward induction (dynamic programming) to estimate the optimal regime using experimental or observational data. Murphy’s method starts from the last decision point and finds the treatment option that optimizes the outcome and goes backward in time to find the best treatment regime for all the decision points (Bather, 2000; Jordan, 2002). More specifically, backward induction maps the covariate history of each individual to an optimal regime. Another method was introduced by Robins (2004) using structural nested mean models (SNMMs). The key notion in Murphy’s and Robins’ methods is that the optimal regime can be characterized by just modeling the difference between the outcome under different treatment regimes, rather than the full outcome model. Robins (2004) proposes Gestimation as a tool to estimate the parameters of SNMMs, while Murphy (2003) uses a least square characterization method (Moodie et al., 2007).
Qlearning, a reinforcement learning algorithm, is also widely used in constructing the optimal regime (Murphy et al., 2006; Zhao et al., 2009; Chakraborty et al., 2010; NahumShani et al., 2012). Qlearning is an extension of the standard regression method that can be used with longitudinal data in which treatments vary over time. Goldberg & Kosorok (2011) introduce a new Qlearning method that can be used when individuals are subject to right censoring. Their new method creates a pseudo population in which everyone has an equal number of decision points and they show that the results obtained by the pseudo population can be translated to the original problem (Zhao et al., 2011; Zhang et al., 2012, 2013). Schulte et al. (2014) provides a selfcontained description of different methods for estimating the optimal treatment regime in finite horizon settings.
The existing methods in the statistics literature are specifically designed to estimate the optimal treatment regime that optimizes a utility function over a fixed period of time. However, in this manuscript our inferential goal is to construct the optimal regime in infinite horizon settings with data that are collected over a fixed period of time. This requires a methodology that estimates the Qfunction and the optimal decision rule without the time index. We achieve this by developing an estimating equation that estimates the optimal regime without requiring backward induction from the last to the first decision point. In order to capture the disease dynamic and the longterm treatment effects, our dataset should contain a long trajectory of data with many decision points.
The remainder of this manuscript is organized as follows. Section Constructing Dynamic Treatment Regimes in InfiniteHorizon Settings explains the data structure and presents the proposed method. In Section Constructing Dynamic Treatment Regimes in InfiniteHorizon Settings, we develop asymptotic properties of the method. We conduct a simulation study in Section Constructing Dynamic Treatment Regimes in InfiniteHorizon Settings to examine the performance of our method. The last section contains some concluding remarks. All the proofs are relegated to an online supplementary document.
We study the effect of a timedependent treatment on a function of outcome. Our data set is composed of i.i.d. trajectories. The th trajectory is composed of the sequence , where is the set of variables measured at the th decision point and is the treatment assigned at that decision point after measuring . is the maximal number of decision points, and the observed length of trajectories are allowed to be different. At each decision point , we define a variable as a summary function of the observed history (such as timevarying covariates, prior response, baseline covariates and treatment history) that depends on, at most, the last time points. We assume that the support of is the same for all s and denote it as . If a patient dies before the last decision point, say , we set (absorbing state). Given , takes values in for all where , . We set for . The treatment and the summary function history through are denoted by and , respectively. We use lowercase letters to refer to the possible values of the corresponding capital letter random variable. From this point forward, for simplicity of notation, we drop the subscript .
We use a counterfactual or potential outcomes framework to define the causal effect of interest and to state assumptions. Potential outcomes models were introduced by Neyman (1990) and Rubin (1978) for timeindependent treatment and later extended by Robins (1986, 1987) to assess the timedependent treatment effect from experimental and observational longitudinal studies.
Associated with each fixed value of the treatment vector , we conceptualize a vector of the potential outcomes , where is the value of the summary function at the th decision point that we would have observed had the individual been assigned the treatment history .
In the potential outcomes framework, we make the following assumptions to identify the causal effect of a dynamic regime.

Consistency: for each

Sequential randomization: .
These assumptions link the potential outcome and the observed data (Robins, 1994, 1997). Assumption 1 means that the potential outcome of a treatment regime corresponds to the actual outcome if assigned to that regime. Assumption 2 means that within levels of , treatment at time , , is randomized. Throughout this manuscript, we assume that these identifiability assumptions hold.
Besides the above assumptions, we assume that the data generating law satisfies the following assumptions:

A.1 Markovian assumption: Fot each ,
(0) (0) 
A.2 Time homogeneity: For each and ,
where and are the summary functions at the previous and the next time, respectively. From this point forward, we refer to as a state variable.

A.3 Positivity assumption: Let be the conditional probability of receiving treatment given . For each action and for each possible value , .
Assumption A.2 means that the conditional distribution of the s does not depend on . A.3 ensures that all treatment options in have been assigned to some patients (i.e., for each , all actions in are possible). This assumption is also known as an exploration assumption.
Assumptions and provide guidance for how to construct the state variable . The Markovian assumptions (Constructing Dynamic Treatment Regimes in InfiniteHorizon Settings) and (Constructing Dynamic Treatment Regimes in InfiniteHorizon Settings) seem to be unrealistic in many studies of chronic diseases. But they are not. This is because, if it is necessary, one can construct the state variable such that it includes previous treatments and observed intermediate outcomes. Thus, for example, (Constructing Dynamic Treatment Regimes in InfiniteHorizon Settings) does not indicate that decision makers make the next treatment decision taking into account only the last outcome data, i.e. disregarding the earlier treatments and outcomes, because these information can be included in the preceding state variable, say .
In cases where has to depend on the observed history of the last time points, assumption is satisfied only if we ignore the first time points of the observed trajectory of patients. This is because the support of is the same only for .
Under assumptions 1–3, the distribution of the observed trajectories is composed of the distribution of the trajectory given , say , and the density . The likelihood of the observed trajectory is given by
(0) 
Expectations with respect to this distribution are denoted by .
The treatment regime (policy), , is a deterministic decision rule where for every , the output is an action , where is the space of feasible actions (Robins, 2004; Schulte et al., 2014). The likelihood of the trajectory corresponding with this law is
(0) 
Expectations with respect to this distribution are denoted by . Note the likelihood ( ‣ Constructing Dynamic Treatment Regimes in InfiniteHorizon Settings) is not welldefined and it may be identical to zero unless holds for each possible value and . Note that, the observed trajectory may be truncated by death at time . In this case, we have , and and by definition, for all , and .
We define the reward value as a known function of at each time and denote it by . The reward value is a longitudinal outcome that is coded such that high values are preferable. We set if .
The actionvalue function at time , , is defined as an expected value of the cumulative discounted reward if taking treatment at state at time and following the policy afterward. In other words, quantifies the quality of policy when and . Hence, is defined as , where is called a discount factor, , which is fixed a priori by the researcher. Note that by definition of the reward function, . Under the Markovian assumption, the actionvalue function does not depend on . Thus we can drop the subscript and denote it by . Note that the actionvalue function has a finite value when and the rewards are bounded.
The discount factor balances the immediate and longterm effect of treatments on the actionvalue function. If , the objective would be maximizing the immediate reward and ignoring the consequences of the action on future rewards or outcomes. As approaches one, future rewards become more important. In other words, specifies our inferential goal. In Section S4 of the supplementary materials we discuss the effect of the choice of on the estimated optimal regime.
The actionvalue function can be written as
The last equation is known as Bellman equation for (Sutton & Barto, 1998; Si, 2004). The inner expectation quantifies the quality of policy at time , in state and with treatment . Taking treatment at time ensures treatment policy is followed in the interval .
Our goal is to construct a treatment policy that, if implemented, would lead to an optimal actionvalue function for each pair . Accordingly, the optimal actionvalue function can be defined as
where . Taking treatment at time ensures that we are taking an optimal treatment in the interval . The last equality follows from the definition of and can also be written as
(0) 
Note that the only distribution involved in the is . Denote any policy for which
as an optimal policy. So, for state , we can define the optimal policy as and the optimal value function as .
The actionvalue function can be estimated by turning the recurrence relation ( ‣ Constructing Dynamic Treatment Regimes in InfiniteHorizon Settings) into an update rule that relies on estimating the conditional density . However, when the cardinality of and the dimension of are large, estimation of the conditional densities is infeasible. We refer to this method as the classical approach and explain it in Section 4 (Simester et al., 2006; Mannor et al., 2007). One way to overcome this limitation is to use a linear function approximation for , which is discussed in the following subsection.
The optimal actionvalue function ( ‣ Constructing Dynamic Treatment Regimes in InfiniteHorizon Settings) is unknown and needs to be estimated in order to construct the optimal regime. Suppose the function can be represented using a linear function of parameters ,
where is the parameter vector of dimension and can be any vector of features summarizing the state and treatment pair (Sutton et al., 2009a, b; Maei et al., 2010). Features are constructed such that . Accordingly, we define the optimal dynamic treatment regime as .
Now we discuss how to estimate the unknown vector of parameters . First, we define an error term and then we construct an estimating equation. In view of the Bellman equation ( ‣ Constructing Dynamic Treatment Regimes in InfiniteHorizon Settings), for each , we have
(0) 
Thus, the error term at time (t+1) in the linear setting can be defined as , which is known as temporal difference error in computer science literature. In order to account for the influence of the feature function in the estimation of the s, we multiply by and define as a value of such that
(0) 
The expectation in the above equation depends on the transition densities and . Note that as in ( ‣ Constructing Dynamic Treatment Regimes in InfiniteHorizon Settings),
Hence, given the observed data, an unbiased estimating equation for can be defined as
(0) 
where is the empirical average.
In practice, sometimes there is no that solves the system of equations ( ‣ Constructing Dynamic Treatment Regimes in InfiniteHorizon Settings), and sometimes the solution is not unique. One way to deal with this is to take an approach similar to the least square technique and define as a minimizer of an objective function. As in ( ‣ Constructing Dynamic Treatment Regimes in InfiniteHorizon Settings), a simple objective function can be defined as
(Sutton et al., 2009b). The objective function used in this manuscript is the above function weighted by the inverse of the feature covariance matrix and defined as
(0) 
where is a fullrank matrix. The weight improves the performance of the proposed stochastic minimization algorithm in Section S1 of the supplementary materials. The function is a generalization of the objective function presented in Maei et al. (2010).
The objective function can be estimated using the observed by
(0) 
where and . Define . Then, the estimated optimal dynamic treatment regime is . By law of large numbers, the estimator is a consistent estimator of .
The following theorem presents the consistency and asymptotic normality of estimator where the asymptotic normality result relies on the uniqueness of the optimal treatment at each decision point. This allows investigators to test the significance of variables for use in this sequential decision making problem. Assumptions required in this section are listed in Appendix 3.
Theorem 1.
(Consistency and asymptotic normality) For a map , defined in ( ‣ Constructing Dynamic Treatment Regimes in InfiniteHorizon Settings), under assumptions , any sequence of estimators with satisfies the following statements:

For small enough , .

Under , where
where is an identity matrix and .
Proof.
See the supplementary material. □
In Section S3 of the supplementary materials, we show that under assumption , for any . The asymptotic variance may be estimated consistently by replacing the expectations with expectations with respect to the empirical measure and replacing with its estimate .
The objective function is a nonconvex and nondifferentiable function of , which complicates the estimation process. Standard optimization techniques often fail to find the global minimizer of this function. In Section S1 of the supplementary materials, we present an incremental approach, which is a generalization of greedy gradient Qlearning (GGQ), an iterative stochastic minimization algorithm, introduced by Maei et al. (2010) as a tool to minimize . Hence, from this point forward, we refer to our proposed method as GGQ.
We simulate a cohort of patients with diabetes and focus on constructing a dynamic treatment regime for maintaining the hemoglobin A1c below 7%. The A1clowering treatments that we consider in this manuscript are similar to those of Timbie et al. (2010) and include metformin, sulfonylurea, glitazone, and insulin. We used treatment discontinuation rates to measure patients’ intolerance to treatment and reflect both the side effects and burdens of treatment. The treatment efficacies and discontinuation rates are extracted from Kahn et al. (2006) and Timbie et al. (2010). We assume that patients who discontinue a treatment do not drop out but just take the next available treatment. These simulated data mimic the third wave of NHANES.
In Section 4.3, we compare the performance of our proposed approach with the classical approach using a simulated dataset and then in Section 4.4 we perform a Monte Carlo study to examine the asymptotic results.
Our study consists of 20 decision points, and the time between each decision point is 3 months. Eligible individuals start with metformin and augment with treatments sulfonylurea, glitazone, and insulin through the followup. At each decision point, there are two treatment options: 1) augment the treatment 2) continue the treatment received at the previous decision point. The discontinuation variable is generated from a Bernoulli distribution given at the last augmented treatment. is the number of augmented treatments by the end of interval where . The variable is the augmented treatment at time . As soon as a treatment is augmented the variable increases by one, whether or not the treatment will be discontinued. The death indicator variable at time is generated as a function of previous observed covariates.
Here are the steps we take to generate the dataset:

Baseline variables: Variables are generated from a multivariate normal distribution with mean (13,160,9.4) and the covariance matrix , where BP is the systolic blood pressure. Also, .

Assigned treatment at time : Given the state variable , the sets of available treatments are , , , , and , where 0 means continue with the treatment received at the previous decision point. Although the ideal level is below 7%, Timbie et al. (2010) raises concern about the feasibility and polypharmacy burden needed for treating patients whose . Our simulation study investigates the optimal treatment regime for these patients. More specifically,

if , the treatment is not augmented because is under control and .

if and , the treatment is augmented with the next available treatment. Hence, . Note that these are patients whose is too high. Thus, the only option is augmenting their treatment.

If and , then a binary variable is generated from , where denotes the discontinuation indicator. implies that the patient continues with the same treatment as time and we set (), while implies that the patient takes the next available treatment (treatment is augmented) and we set . For example, if and , a patient can be assigned to either augmenting the treatment taken at time with or continuing with the same treatment as time (), depending on the generated variable .
Note: When , no new treatment is added. Hence, .


Treatment discontinuation indicator at time : A binary variable is generated from a Bernoulli distribution given the last augmented treatment. For all , the treatment discontinuation rates are , and .
Note: We assume no treatment discontinuation for patients who are taking the same treatment at time as at time (i.e., ).

Intermediate outcome at time : To avoid variance inflation through time, we use the following generative model for at time , , where and
with being the treatment effect of the augmented treatment . The treatment effects of metformin, sulfonylurea, glitazone and insulin are 0.14, 0.20, 0.02, and 0.14, respectively. Note that the treatment effects are reported as a percentage reduction in . The treatment effect of glitazone is listed as 0.12 in Timbie et al. (2010), which is similar to metformin. However, in order to study the effect of the treatment discontinuation on the optimal regime, we set its treatment effect to 0.02 and, from now on, denote it by glitazone.

Timevarying variables at time : and .

Death indicator at time : A binary variable is generated from a Bernoulli distribution with probability . is the indicator of death.

Reward function at time : In order to be able to find an optimal treatment regime, we need an operational definition of controlled . Hence, we define the following reward function at time as a function of , and ,

if , 2 if , 10 if and zero otherwise.
This reward structure helps us to identify treatments whose discontinuation rate outweighs their efficacy while reducing the chance of death.

Note that the state space at time includes . However, the Markov property holds with , and variables and are noise variables. In order to satisfy assumption , we ignored the first four time points in the observed trajectory of each patient.
We generate two datasets of sizes 2,000 and 5,000 and compare the quality of the estimated optimal treatment policy using the proposed and the classical approach. The latter, also known as actionvalue iteration method, turns the recurrence relation of ( ‣ Constructing Dynamic Treatment Regimes in InfiniteHorizon Settings) into an update rule as
where is the reward function. This procedure can be summarized as

set for all

for each ,


repeat 2 and 3 until where is a small positive value

for each , .
The above 5step procedure is similar to the one presented in Chapter 4 of Sutton & Barto (1998). Note that the classical approach requires estimation of the transition probabilities , which limits its usage to cases where state and action space is small. We categorize the continuous variables and estimate nonparametrically. The variables are categorized based on the percentiles and denoted as . The categorized is formed by breaking the variable on . Hence the state variable used in the classical approach is . Note that corresponded to .
Unlike the classical approach, the optimal treatment policy using our proposed GGQ method utilizes the continuous state variable . In our example, we parametrize the optimal action value function using a 72dimensional vector of parameters and construct the features using radial basis functions (Gaussian kernels). See Appendix 2 for more details. To specify the step sizes of the stochastic minimization algorithm, first we select two functions that satisfy the conditions listed in Appendix 1 and multiply them by . Then we run the algorithm for different values of and select the one that minimizes the objective function. In this simulation study we set the step sizes and where is set to 0.05. Section S5 in the supplementary materials discusses the effect of the choice of tuning parameters on the estimated optimal regime.
True optimal policy. As the sample size increases, the optimal actionvalue function estimated using the classical approach converges to the true optimal actionvalue function. Hence, for the purpose of finding the true optimal policy, we generate a large dataset of size 500,000 and estimate transition probabilities using a nonparametric approach, where is the oracle state . Then by the 5step procedure (classical approach), the true optimal policy is approximated and set as our benchmark.
Figure 1 depicts the true and estimated optimal treatment for each discretized oracle state using the GGQ and classical approaches. As in this example, we set the discount factor to 0.6. Note that the estimated optimal policy using classical and GGQ methods is based on the states and , respectively. However, in Figure 1, we averaged it over the noise variables and, for comparability, we report the results on the discretized oracle state. The vertical axis on the left hand side is the percentage of time that the treatment choices are selected as optimal. The left vertical and both horizontal axes represent the elements of the state , respectively. This plot shows that the proposed GGQ method outperforms the classical approach for moderate sample sizes.
More specifically, the estimated optimal policy using GGQ () recommends not augmenting the third (glitazone) and fourth (insulin) treatments (the left plots). This makes sense since glitazone has a small treatment effect (0.02) and insulin has high discontinuation rate (0.35) that outweighs its efficacy. However, the estimated optimal policy using a classical approach () recommends augmenting these treatments by some positive probabilities. Specifically, augments insulin about 50% of the time when and .
Figure 2 presents the difference between the values of the estimated optimal policies and . The values are calculated using the Monte Carlo method, where the value of a treatment policy , for each state , is defined as . For both sample sizes and all of the states, the value of is higher than the value of . This indicates that the estimated optimal policy has better quality.
Our simulation result is consistent with Timbie et al. (2010), and suggests that we should not always augment the treatment when . In other words, depending on the treatment already taken, sometimes we should consider not augmenting the treatment to avoid side effects.
We generate 500 datasets each of sizes 2,000 and 5,000 to examine the asymptotic behavior of the proposed method. Figure 3 shows the confidence intervals of , where the standard errors are estimated using the variance formula presented in Theorem 1. The dark circles are the average of the 500 s and asterisks are the true parameter values approximated using a Monte Carlo study with . These confidence intervals may be used to identify the parts of the feature function that should be kept in the decision rule and may be used as a feature selection tool. Another important use of the asymptotic results in Theorem 1 is to investigate whether there is a significant difference between treatment options. Specifically, one may build the confidence interval for the difference between the estimated optimal actionvalue function for different treatment options (augment vs. continue) and check whether it contains zero (while adjusting for TypeI error rate for more than two treatment options). In Figure 4, we constructed and evaluated the quality of these 95% confidence intervals for , given each state variable for . The results for is similar and omitted due to space limitations. These results confirm the accuracy of our variance estimator in Theorem 1.
We have proposed a new method that can be used to form optimal dynamic treatment regimes in infinitehorizon settings (i.e., there is no a priori fixed end of follow up), while our data were collected over a fixed period of time with many decision points. We have assumed that the value of the optimal regime can be presented using a linear function of parameters, and we developed an estimating procedure based on temporal difference residuals to estimate the parameters of this function. We developed the asymptotic properties of the estimated parameters and evaluated the proposed method using simulation studies.
This work raises a number of interesting issues. We have derived the asymptotic distribution of the estimators under some assumptions. One important practical problem is to provide a valid inference when the optimal treatment is not unique for some states, (i.e., assumption is violated). This may lead to nonregular estimators and inflate the TypeI error rate (Bickel et al., 1993). Among others, Robins (2004) and Laber et al. (2010) proposed solutions to this issue. However, the existing methods may not be directly applied to our method and require major modifications. The second issue is how to construct the feature functions. In this manuscript, we used the radial basis functions (Moody & Darken, 1989; Poggio & Girosi, 1990). One simple method is to try different feature functions () and select the one that minimizes the function (Parr et al., 2008). Alternatively, one may use support vector regression to approximate the actionvalue function (Vapnik et al., 1997; Tang & Kosorok, 2012).
The proposed method can be used in settings where the time between decision points is fixed, say 3 months. This assumption often holds (approximately) for some chronic diseases such as diabetes, cyclic fibrosis and asthma. It would, however, be of interest to extend the method to cases with a random decision point (clinic visits). Usually, the random time between decision points happens either when doctors decide to schedule the next visit sooner or later than the prespecified time or when patients request an appointment due to, for example, side effects or acute symptoms. The former is easier to deal with because we have the covariates required to model the visit process. The latter, however, is more difficult and results in nonignorable missing data because we do not have information about those patients who did not show up. Robins et al. (2008) discusses the issue of the random visit process in detail.
Acknowledgement Acknowledgements should appear after the body of the paper but before any appendices and be as brief as possible subject to politeness. Information, such as contract numbers, of no interest to readers, must be excluded.
Supplementary material Supplementary material available at Biometrika online includes the stochastic minimization algorithm and proof of Theorem 1. It also discusses the effect of the discount factor and tuning parameters of the stochastic minimization algorithm on the estimated optimal treatment regime.
Appendix 1: Tuning Parameters The tuning parameters and in the GGQ algorithm need to satisfy the following assumptions (Maei et al., 2010):

, and are deterministic.

.

.

.
Appendix 2: feature functions The feature functions are constructed using the radial basis functions and
where