Personalized Cancer Chemotherapy Schedule: a numerical comparison of performance and robustness in model-based and model-free scheduling methodologies

Personalized Cancer Chemotherapy Schedule: a numerical comparison of performance and robustness in model-based and model-free scheduling methodologies

Jesus Tordesillas and Juncal Arbelaiz
J. Tordesillas, MIT Department of Aeronautics and Astronautics  jtorde@mit.eduJ. Arbelaiz, MIT Department of Applied Mathematics. Both authors contributed equally to this work.

Reinforcement learning algorithms are gaining popularity in fields in which optimal scheduling is important, and oncology is not an exception. The complex and uncertain dynamics of cancer limit the performance of traditional model-based scheduling strategies like Optimal Control. Motivated by the recent success of model-free Deep Reinforcement Learning (DRL) in challenging control tasks and in the design of medical treatments, we use Deep Q-Network (DQN) and Deep Deterministic Policy Gradient (DDPG) to design a personalized cancer chemotherapy schedule. We show that both of them succeed in the task and outperform the Optimal Control solution in the presence of uncertainty. Furthermore, we show that DDPG can exterminate cancer more efficiently than DQN presumably due to its continuous action space. Finally, we provide some insight regarding the amount of samples required for the training.


Cancer is a common name that is given to a group of diseases that involve the repeated and uncontrolled division and spreading of abnormal cells. These abnormal tissues are called tumors [1]. Early diagnosis and effective treatment improve the survival rate of these diseases.

The optimal treatment schedule and drug dose vary according to the stage of the tumor, the weight of the patient, the white blood cell levels (immune cells), concurrent illness and age of the patient. Thus, proper scheduling and personalization of the chemotherapy treatment are vital to reducing the mortality rate. This motivated the use of techniques originating in engineering fields – such as optimal control – to derive optimal drug dosing for cancer chemotherapy [2]. A review of model-based scheduling strategies is provided in [3].

One of the main challenges associated with the study of cancer as a dynamical system is that it is known to be complex, nonlinear, and its mechanisms of action are uncertain. Consequently, first principle mathematical models may not be able to account for all the variations in the patient dynamics [1].

Fig. 1: Immune cells (in orange) attached to a tumour cell (brown). This image was captured by a scanning electron microscope [4].

Motivated by the challenging nature of generating accurate models of cancer dynamics and the recent success stories of using RL for control [5, 6], we use model-free Deep-Reinforcement Learning (DRL) algorithms to design a personalized cancer chemotherapy schedule. Particularly, we will use Deep Q-Network (DQN) [7] (with a discrete action space) and Deep Deterministic Policy Gradient [8] (with continuous action space), and provide a comparison of their performance.

RL and DRL have been successfully applied in different medical treatment designs. For instance, in [9] DRL was used to determine the optimal treatment regimes from medical data. [10] developed automated radiation adaptation protocols for Lung Cancer using DDPG. In [1], Q-learning was used to design a chemotherapy schedule with a discrete action space, and in [11] DL was used for the detection of metastatic breast cancer. Moreover, [12] used DDPG to control the drug dosing for suppressing cell growth modeled through a stochastic logistic growth model.

The aim of this work is to develop in-silico trials to compare the performance of model-based and model-free techniques in scheduling a personalized chemotherapy treatment. Particularly, our objectives include:

  1. Scheduling the personalized chemotherapy treatment by solving the optimal control problem to establish a baseline for comparison and evaluate whether DQN and DDPG provide similar performance.

  2. Gaining a qualitative understanding of the amount of training data (episodes) required by DQN and DDPG to perform similarly to the baseline policy.

  3. Evaluating the robustness of the optimal controller, DQN and DDPG policies in the presence of different types of relevant uncertainties: parametric in the dynamics, parametric in the initial conditions of the model (diagnosis) and in the presence of a stochastic forcing term in the tumor cell population dynamics.

  4. Comparing the performance of DQN and DDPG in the scenarios described above.

Ii Model

In order to compute the optimal control policy and the reward in DRL, a mathematical model that captures the distribution and effects of the chemotherapy drug is required. A realistic model should address tumor growth, the reaction of the human immune system to the tumor growth, and the effects of chemotherapy treatment on immune cells, normal cells and tumor growth [2], [3].

We will simulate patient’s response to the treatment through a pharmacological model of cancer chemotherapy, given by a nonlinear and coupled system of 4 deterministic Ordinary Differential Equations (ODEs) [1]:


where is the number of immune cells, is the number of normal cells, is the number of tumor cells and is the drug concentration. The action or control is the chemotherapy drug infusion rate [], which we will schedule through Optimal Control and DRL, respectively. The initial conditions, are determined according to the diagnosis. From now on, we will refer to this nonlinear model by , where and . For the sake of visualization, an electron micrograph of immune and cancer cells is shown in Figure 1.

The model parameters and the corresponding values used in our simulations are provided in Table I. Note that the patient and his diagnosis determine both the initial conditions and the value of the model parameters.

TABLE I: Parameters of the model, description and the values used in our optimal control policy computation and training of DRL algorithms unless otherwise indicated. Extracted from [1].

As mentioned in the Introduction, in order to test the robustness of the optimal control and DRL policies, we simulated the previously presented system of ODEs with parametric uncertainty and also with a stochastic forcing term. The reader is referred to the next Section for further details on the uncertain scenarios considered.

Iii Scenarios Considered

A personalized scheduling of chemotherapy treatment is fundamental to patient’s recovery from the disease. When designing the treatment schedule, it is important to optimize the amount of drug used in order to regulate the potentially lethal side effects of chemotherapy, since often the patient’s immune system weakens and becomes prone to life-threatening infections, diminishing its capability to eradicate cancer [1].

Consequently, we obtained the Optimal Control and DRL policies for two different cases: a preliminary and somewhat unrealistic Case 0, in which the goal is to exterminate cancer regardless of the state of the rest of the cell populations. We used this case to validate our algorithms, codes and parameter and hypermarameter values. Then, we simulated a Case Patient, in which cancer is eradicated while preserving a minimum population of normal and immune cells in order to guarantee patient’s safety. In both cases, the initial condition was the same: .

Iii-a Case 0

The optimal control problem for Case 0 is formulated through the following nonlinear program:

s. t.

Both the policy and the length of the treatment are determined by the optimal solution. The hard constraints correspond to the dynamics of cancer and the initial state of the patient, bounds in the drug-rate and the target of cancer eradication. The cost is the area enclosed by the tumor cell population (note that is nonnegative) and the time axis, between and . For the DRL algorithms, the reward used is , where corresponds to the length of the timestep, and it is a fixed value. Note that is included in the reward just to make the comparison with the cost functional used in the Optimal Control program straightforward.

Iii-B Case Patient

In order to guarantee patient’s safety during the treatment, additional state constraints are added to the program (2). Particularly, the program is augmented with the constraints and .

In the case of the DRL algorithms, the constraints are imposed softly by modifying the reward function:


where denotes the Iverson bracket. The last two terms of equation (3) penalize the violation of the constraints and . Note that, for the cases in which the soft constraints are satisfied, this reward is the negative of the Riemann sum approximation of the area below the curve of the number of tumor cells (i.e. it is a rectangular approximation to the negative of the integral cost used in the optimal control problem).

Furthermore, in order to test the robustness of the obtained policies, different types of relevant sources of uncertainty will be considered regarding the diagnosis, growth-rate and dynamics of the tumor. The corresponding results are provided in Section VI.

Iii-B1 Parametric uncertainty: model parameter

The per-unit growth-rate of the tumor, , represents how aggressive the disease is. Thus, an accurate estimation of its value is important for the computation of the optimal control policy. We systematically varied its value and obtained the range for which the optimal control problem is feasible, as well as the sensitivity of the nominal optimal control policy to perturbations in the value of this parameter.

Iii-B2 Parametric uncertainty: initial condition

A wrong estimation of the initial size of the tumor , will also have an impact on the performance of the optimal policy. We will systematically evaluate the robustness of the nominal policy to perturbations on .

Iii-B3 Stochastic forcing in tumor dynamics

We simulated a system of SDEs of the form:


where the first term is the drift and corresponds to the deterministic dynamics given by the ODEs, and the second one is the diffusion term, modeled by a constant in our case and applied only to the equation that gives the evolution of the population of tumor cells . denotes a Wiener process, common in the motion of cells. We simulated the SDEs using the Euler-Maruyama scheme.

Iv Methods and Algorithms

Iv-a Optimal Control

The general form of a finite-horizon optimal control problem in Bolza form is:

s. t.

Inspection of the program (2) reveals that it constitutes a particular case of problem (5) and thus can be solved using standard optimal control techniques. Something to note is that the control solution provided by (5) will usually be open-loop (i.e. the control policy will be a function of time, as opposed to feedback or closed-loop control laws, in which the control is a function of the state. Usually this last case is more desirable since it accounts for discrepancies between the real dynamics and the predictions made by the model and corrects the control input accordingly, making the controller more robust to uncertainty).

Iv-B Deep Reinforcement Learning

DDPG and DQN algorithms are considered. Both DDPG and DQN share two common features: they are off-policy and model-free algorithms. Off-policy means that the behaviour policy used is not the same as the policy being improved. This allows the use of a memory replay, and the use of any exploration strategy. Model-free means that the algorithm does not try to estimate the transition matrix of the dynamics of the environment . Instead, it estimates the optimal policy or value function directly.

The differences between these two algorithms are highlighted below.

Iv-B1 Deep Q Network (DQN)

Deep Q Network was proposed in [7], and the main difference with respect to standard Q-learning is the use of a neural network to approximate the action-value function . The DQN algorithm is shown in Alg. 1 (taken from [7]).

The inclusion of a neural network, usually leads to an unstable training due to two factors: the correlation between samples and the non-stationary targets. These two challenges are addressed by DQN using [13]:

  • An experience replay: In a replay buffer, a dataset of tuples are saved. During the training, the agent will randomly sample mini-batch samples from this replay buffer (line 1 of the algorithm). This allows a stabilization of the training process, and a better approximation to i.i.d samples removing the correlation between them.

  • Fixed Q-targets: During several updates, the target network weights used in the target calculation are fixed (lines 1, 1, 1).

DQN is mainly used for problems with a discrete action space. However, some applications make use of the Normalized Advantage Function (NAF) to apply DQN in continuous action spaces [14].

Iv-B2 Deep Deterministic Policy Gradient (DDPG))

DDPG was proposed in [8]. The algorithm used by DDPG is shown in Alg. 2 (taken from [8]), and its main characteristics are these ones [8], [15], [16]:

  • Policy-gradient: DDPG tries to estimate the gradient of the expected return, and the policy is updated using this estimate.

  • Actor-critic: DDPG has two different structures (see also Fig. 2) :

    • Actor: The actor contains the policy function. It takes the state of the environment as input, and produces an action.

    • Critic: From the state of the environment and the reward received by the action taken by the actor, the critic produces the temporal difference error. This error is used to update both the actor (line 2) and the critic (line 2).

Both the actor and critic described above are represented using neural networks.

Fig. 2: Actor and Critic used in DDPG. Adapted from [15].
1 Initialize replay memory D to capacity N Initialize action-value function Q with random weights Initialize target action-value function with random weights for episode= do
2        Initialize sequence and preprocessed sequence for t= do
3               With probability select a random action otherwise select Execute action in emulator and observe reward and image Set and preprocess Store transition in Sample random minibatch of transitions from Set Perform a gradient descent step on with respect to the network parameters Every C steps reset
Algorithm 1 DQN
1 Initialize critic network and actor with weights and . Initialize target network and with weights , Initialize replay buffer for episode= do
2        Initialize a random process for active exploration Receive initial observation state for  do
3               Select action according to the current policy and exploration noise Execute action and observe reward and new state Store transition in Sample a random minibatch of transitions from Set Update critic by minimizing the loss: Update the actor policy using the sampled policy gradient: Update the target networks:
Algorithm 2 DDPG

V Implementation

V-a Optimal Control

We solved the optimal control problem using direct collocation methods [17], which transcribe the continuous dynamics and control functions to a finite set of algebraic variables and then solve a high-dimensional non-linear program (NLP). We used the MATLAB implementation of ICLOCS2 [18], the open-source Imperial College London Optimal Control software (available for download here) in conjunction with the open-source NLP solver IPOPT [19]. We modified the open-source code and introduced the dynamics of cancer (the ODE system), cost functional and constrains, as well as the desired options and tolerances for the solvers.

We used h-methods for the transcription, particularly the Hermite-Simpson method with automatic mesh refinement and an initial number of 200 nodes. Regarding numerical tolerances, we allowed errors of up to in the state and control bounds, and in the terminal condition of cancer eradication. The optimal control solution for both the Case 0 and Case Patient is provided in Section VI.

V-B Reinforcement Learning

To implement the DRL algorithms, we used the open-source library ChainerRL [20]. Moreover, we created an environment using the OpenAI Gym [21] framework. This environment takes an action and the current state, and returns the next observation and the reward obtained. In this environment, we implemented the system of ODEs, and solved it using the numerical integration methods provided by Scipy [22].

The values of the most relevant parameters of the DQN and DDPG implementations are shown in Table II. Note that we used to match the reward used in Optimal Control as well as possible.

To improve the convergence and the training times, all the states, actions and the reward were normalized to values in , and CuPy was used.

Note that both in DDPG and DQN, it is required to select the size of the time steps taken by the agent during an episode. We used a time step size of days, which achieved a reasonable training time.

Parameter DQN DDPG
Replay Start Size
Layers 2 3
Hidden Units per layer 100 300
Discount Factor 0.99 0.995
TABLE II: Parameters used in DQN and DDPG

Vi Results

Vi-a Case 0

The results for the preliminary Case 0 (in which there are not constraints on and ) are shown in Fig. 3. Note that the policy found by all the algorithms (O.C., DDPG and DQN) is the same: apply the maximum allowed drug infusion rate until the cancer is exterminated, which is the expected solution.

Fig. 3: Results for the preliminary Case 0. (a) Trajectories of the states when the optimal drug-rate is applied. Policies provided by (b) O.C. (c) DQN (d) DDPG. Note that (b)-(d) match. The shaded regions represent the feasible control region.
Fig. 4: Optimal policies provided by the different methods when sweeping the values of .

Vi-B Case Patient

In this case, we add the constraints and to the program. The optimal policies found by the three algorithms for the nominal growth tumor rate () are shown in Fig. 5. All the policies are able to exterminate the cancer in days. Note also that the policies provided by the different algorithms share some resemblances: Maximum allowed drug infusion at the beginning of the treatment, followed by an average value . In this case, the cost of the solutions provided by DDPG and DQN is slightly higher than that of O.C.

Fig. 5: Results for the Case Patient. (a) O.C. (b) DQN (c) DDPG. Row 1 provides the policies and row 2 the state evolution when the corresponding policy is applied. The shaded regions represent the feasible control region.

The convergence rates for DDPG and DQN are shown in Fig. 6. The training was stopped when the average of the Q-value reached a stationary behaviour. The training for each case considered took minutes. We also note that the shape of the average Q (see Figure 6) is that expected from a successful training procedure (see [23] for example): After the replay buffer has been filled, the average Q increases at the beginning of the training, it reaches a maximum (where the agent overestimates the Q value), and then decreases to achieve a stationary value.

Fig. 6: Convergence rates in both DQN10 and DDPG. For DDPG, both the Critic and Actor losses are shown. The values of these plots are normalized.

Vi-B1 Parametric uncertainty: Model Parameter

To evaluate the sensitivity of the optimal policy to changes in , we perturbed around its nominal value and obtained the range for which an optimal policy exists. The problem is feasible for values and the corresponding policies are plotted in Fig. 4. Again, DDPG and DQN obtain a similar policy to that provided by O.C., although some of them present an oscillatory behaviour. Both for DQN and DDPG, these oscillations may be due to the length of the time step ( days), and might be reduced by refining it. Moreover, for the case of DQN the coarse discretization (10 nodes) of the action space should be accounted for.

Once we obtained the values of for which the problem is feasible, we tested the optimal policy obtained for in different scenarios, where . The results are shown in Fig. 7. In all the algorithms, the policy found eradicates the cancer for , and does not cure it for . However, for the case , DRL algorithms are successful exterminating the cancer, but O.C. is not. The increased robustness of DRL to parametric uncertainty might be due to its exploration phase: exploration makes the policies found by DRL algorithms contain more information than the optimal control solution, which only focuses on finding a minimum for the given model.

Fig. 7: Test of the performance of the nominal policies (computed/trained with ) to perturbations in the value of . (a) O.C. (b) DQN10 (c) DDPG. DRL methods show improved robustness to perturbations than O.C.

Vi-B2 Parametric uncertainty: initial condition

For this case, the agent is trained with the nominal initial condition , and tested with perturbed values of . The perturbed values and corresponding results are shown in Fig. 8. Note that DRL is able to exterminate the cancer for all the cases, while O.C. fails to do it for . Again, this argues for the increased robustness of the policies provided by DRL in comparison to O.C.

Fig. 8: Evolution of the tumor cell population when the policy obtained for the nominal initial condition (computed/trained with ) is applied for perturbed initial size of the tumor. (a) O.C. (b) DQN10 (c) DDPG. It is remarkable that while optimal control does not manage to exterminate the disease in all the cases, DRL methods do, showing increased robustness. The first 5 days have been removed from the graph for the sake of visualization.

Vi-B3 Stochastic forcing in tumor dynamics

In this last scenario, the agent is tested with a stochastic forcing term in the dynamics of . The plots of the mean and standard deviations of the solutions found are shown in Fig. 9. DQN and DDPG drive the tumor closer to extermination than O.C. at the end of the treatment.

Fig. 9: Evolution of for the noisy case. Note the improved performance of DDPG and DQN at .
Fig. 10: Mean cost (-reward) of the policies found as a function of the training episodes for each algorithm compared to that of the optimal controller. Legend indicates the algorithms with the respective training episodes.

Vi-C Sampling

An interesting question to investigate is the dependence of learning in DQN and DDPG on the number of episodes. This question is specially relevant in data-poor scenarios, where one may wonder if the size of the dataset is enough to obtain a suitable policy. With this aim, the agent was evaluated after each training episode, and its cost was compared with that of O.C. In the case of DQN, seven different discretizations of the action space were considered: 7, 10, 20, 30, 40, 50 and 60 nodes. The resulting plot is shown in Fig. 10. It is observed that the DRL costs asymptote to the optimal one and that, after episodes, DDPG obtains a cost that is close to that of Optimal Control. Note also that, in general, DDPG tends to obtain better agents than DQN.

Vii Conclusions and Future Work

This work presented a comparison between classical O.C. and model-free DRL approaches, both in discrete and continuous action space. We showed that, with an accurate model of the dynamics, O.C. provides the best solution, but closely followed by DRL. Moreover, we showed that in the presence of parametric or stochastic uncertainty, DRL approaches have the potential to outperform classical trajectory optimization O.C. techniques.

In the Case 0, all the algorithms found the same optimal policy. In the Case Patient, the policies found by DRL perform similarly to O.C, but they exhibit increased robustness to uncertainties. Regarding the relative merits of DQN and DDPG, we found that DDPG outperforms DQN, presumably due to its continuous action space. Furthermore, it seems to learn faster.

The sampling analysis of the algorithms showed that approximately 1500 calls to the model are needed for DDPG to obtain a performance close to optimal.

As future work, we consider comparing the perfomance of DDPG and DQN with robust optimal control, where uncertainty on the dynamics is taken into account in the formulation of the optimal control program at the expense of increasing the cost associated to the policy found. Moreover, we plan also to compare the performace of DDPG and DQN with Model Predictive Control, that benefits from feedback in each iteration.


Both authors would like to thank Prof. David Sontag, Alejandro Rodriguez-Ramos and Dong-Ki Kim for valuable discussions and ideas. They are also grateful for the economic support of Fundacion Bancaria “la Caixa”.


  • [1] Regina Padmanabhan, Nader Meskin, and Wassim M Haddad. Reinforcement learning-based control of drug dosing for cancer chemotherapy treatment. Mathematical biosciences, 293:11–20, 2017.
  • [2] L.G.D. Pillis and A. Radunskaya. A mathematical tumor model with immune resistance and drug therapy: an optimal control approach. Comput. Math. Methods, 2001.
  • [3] H. Sbeity and R. Younes. Review of optimization methods for cancer chemotherapy treatment planning. J. Comput. Sci. Syst. Bio., 2015.
  • [4] Memorial sloan kettering cancer center. Accessed: 8-Dec-2018.
  • [5] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis. A general reinforcement learning algorithm that masters chess, shogi and go through self-play. Science, 2018.
  • [6] B. Recht. A tour of reinforcement learning: The view from continuous control. 2018.
  • [7] V. Mnih, K. Kavukcuoglu, D Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Peterseu, C. Beattie, A. Sadik, I. Antonoglou, H. King, D Kumaran, D. Wierstra, S. Legg, and D. Hassaan. Human-level control through deep reinforcement learning. Nature, 2016.
  • [8] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. ICLR, 2016.
  • [9] Ying Liu, Brent Logan, Ning Liu, Zhiyuan Xu, Jian Tang, and Yangzhi Wang. Deep reinforcement learning for dynamic treatment regimes on medical registry data. In 2017 IEEE International Conference on Healthcare Informatics (ICHI), pages 380–385. IEEE, 2017.
  • [10] Huan-Hsin Tseng, Yi Luo, Sunan Cui, Jen-Tzung Chien, Randall K Ten Haken, and Issam El Naqa. Deep reinforcement learning for automated radiation adaptation in lung cancer. Medical physics, 44(12):6690–6705, 2017.
  • [11] Google A.I.: Using deep-learning for breast tumor detection. Accessed: 11-Dec-2018.
  • [12] Dalit Engelhardt. Dynamic control of stochastic evolution: A deep reinforcement learning approach to adaptively targeting emergent drug resistance. arXiv preprint arXiv:1903.11373, 2019.
  • [13] Emma Brunskill. CNNs and Deep Q-Learning. Accessed: 10-Dec-2018.
  • [14] Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep q-learning with model-based acceleration. In International Conference on Machine Learning, pages 2829–2838, 2016.
  • [15] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • [16] Deep deterministic policy gradients. Accessed: 10-Dec-2018.
  • [17] J.T. Betts. Practical methods for optimal control and estimation using nonlinear programming. SIAM, 2010.
  • [18] P. Falugi, E. Kerrigan, and E. van Wyk. Imperial College London Optimal Control Software: User Guide (ICLOCS). 2010.
  • [19] A. Watcher and L.T. Biegler. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Math. Program., Ser. A, 2005.
  • [20] ChainerRL, a deep reinforcement learning library. Accessed: 10-Dec-2018.
  • [21] OpenAI Gym. Accessed: 10-Dec-2018.
  • [22] Scipy. Accessed: 10-Dec-2018.
  • [23] Ardi Tampuu, Tambet Matiisen, Dorian Kodelja, Ilya Kuzovkin, Kristjan Korjus, Juhan Aru, Jaan Aru, and Raul Vicente. Multiagent cooperation and competition with deep reinforcement learning. PloS one, 12(4):e0172395, 2017.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description