Personalized Cancer Chemotherapy Schedule: a numerical comparison of performance and robustness in model-based and model-free scheduling methodologies
Reinforcement learning algorithms are gaining popularity in fields in which optimal scheduling is important, and oncology is not an exception. The complex and uncertain dynamics of cancer limit the performance of traditional model-based scheduling strategies like Optimal Control. Motivated by the recent success of model-free Deep Reinforcement Learning (DRL) in challenging control tasks and in the design of medical treatments, we use Deep Q-Network (DQN) and Deep Deterministic Policy Gradient (DDPG) to design a personalized cancer chemotherapy schedule. We show that both of them succeed in the task and outperform the Optimal Control solution in the presence of uncertainty. Furthermore, we show that DDPG can exterminate cancer more efficiently than DQN presumably due to its continuous action space. Finally, we provide some insight regarding the amount of samples required for the training.
I INTRODUCTION and OBJECTIVES
Cancer is a common name that is given to a group of diseases that involve the repeated and uncontrolled division and spreading of abnormal cells. These abnormal tissues are called tumors . Early diagnosis and effective treatment improve the survival rate of these diseases.
The optimal treatment schedule and drug dose vary according to the stage of the tumor, the weight of the patient, the white blood cell levels (immune cells), concurrent illness and age of the patient. Thus, proper scheduling and personalization of the chemotherapy treatment are vital to reducing the mortality rate. This motivated the use of techniques originating in engineering fields – such as optimal control – to derive optimal drug dosing for cancer chemotherapy . A review of model-based scheduling strategies is provided in .
One of the main challenges associated with the study of cancer as a dynamical system is that it is known to be complex, nonlinear, and its mechanisms of action are uncertain. Consequently, first principle mathematical models may not be able to account for all the variations in the patient dynamics .
Motivated by the challenging nature of generating accurate models of cancer dynamics and the recent success stories of using RL for control [5, 6], we use model-free Deep-Reinforcement Learning (DRL) algorithms to design a personalized cancer chemotherapy schedule. Particularly, we will use Deep Q-Network (DQN)  (with a discrete action space) and Deep Deterministic Policy Gradient  (with continuous action space), and provide a comparison of their performance.
RL and DRL have been successfully applied in different medical treatment designs. For instance, in  DRL was used to determine the optimal treatment regimes from medical data.  developed automated radiation adaptation protocols for Lung Cancer using DDPG. In , Q-learning was used to design a chemotherapy schedule with a discrete action space, and in  DL was used for the detection of metastatic breast cancer. Moreover,  used DDPG to control the drug dosing for suppressing cell growth modeled through a stochastic logistic growth model.
The aim of this work is to develop in-silico trials to compare the performance of model-based and model-free techniques in scheduling a personalized chemotherapy treatment. Particularly, our objectives include:
Scheduling the personalized chemotherapy treatment by solving the optimal control problem to establish a baseline for comparison and evaluate whether DQN and DDPG provide similar performance.
Gaining a qualitative understanding of the amount of training data (episodes) required by DQN and DDPG to perform similarly to the baseline policy.
Evaluating the robustness of the optimal controller, DQN and DDPG policies in the presence of different types of relevant uncertainties: parametric in the dynamics, parametric in the initial conditions of the model (diagnosis) and in the presence of a stochastic forcing term in the tumor cell population dynamics.
Comparing the performance of DQN and DDPG in the scenarios described above.
In order to compute the optimal control policy and the reward in DRL, a mathematical model that captures the distribution and effects of the chemotherapy drug is required. A realistic model should address tumor growth, the reaction of the human immune system to the tumor growth, and the effects of chemotherapy treatment on immune cells, normal cells and tumor growth , .
We will simulate patient’s response to the treatment through a pharmacological model of cancer chemotherapy, given by a nonlinear and coupled system of 4 deterministic Ordinary Differential Equations (ODEs) :
where is the number of immune cells, is the number of normal cells, is the number of tumor cells and is the drug concentration. The action or control is the chemotherapy drug infusion rate , which we will schedule through Optimal Control and DRL, respectively. The initial conditions, are determined according to the diagnosis. From now on, we will refer to this nonlinear model by , where and . For the sake of visualization, an electron micrograph of immune and cancer cells is shown in Figure 1.
The model parameters and the corresponding values used in our simulations are provided in Table I. Note that the patient and his diagnosis determine both the initial conditions and the value of the model parameters.
As mentioned in the Introduction, in order to test the robustness of the optimal control and DRL policies, we simulated the previously presented system of ODEs with parametric uncertainty and also with a stochastic forcing term. The reader is referred to the next Section for further details on the uncertain scenarios considered.
Iii Scenarios Considered
A personalized scheduling of chemotherapy treatment is fundamental to patient’s recovery from the disease. When designing the treatment schedule, it is important to optimize the amount of drug used in order to regulate the potentially lethal side effects of chemotherapy, since often the patient’s immune system weakens and becomes prone to life-threatening infections, diminishing its capability to eradicate cancer .
Consequently, we obtained the Optimal Control and DRL policies for two different cases: a preliminary and somewhat unrealistic Case 0, in which the goal is to exterminate cancer regardless of the state of the rest of the cell populations. We used this case to validate our algorithms, codes and parameter and hypermarameter values. Then, we simulated a Case Patient, in which cancer is eradicated while preserving a minimum population of normal and immune cells in order to guarantee patient’s safety. In both cases, the initial condition was the same: .
Iii-a Case 0
The optimal control problem for Case 0 is formulated through the following nonlinear program:
Both the policy and the length of the treatment are determined by the optimal solution. The hard constraints correspond to the dynamics of cancer and the initial state of the patient, bounds in the drug-rate and the target of cancer eradication. The cost is the area enclosed by the tumor cell population (note that is nonnegative) and the time axis, between and . For the DRL algorithms, the reward used is , where corresponds to the length of the timestep, and it is a fixed value. Note that is included in the reward just to make the comparison with the cost functional used in the Optimal Control program straightforward.
Iii-B Case Patient
In order to guarantee patient’s safety during the treatment, additional state constraints are added to the program (2). Particularly, the program is augmented with the constraints and .
In the case of the DRL algorithms, the constraints are imposed softly by modifying the reward function:
where denotes the Iverson bracket. The last two terms of equation (3) penalize the violation of the constraints and . Note that, for the cases in which the soft constraints are satisfied, this reward is the negative of the Riemann sum approximation of the area below the curve of the number of tumor cells (i.e. it is a rectangular approximation to the negative of the integral cost used in the optimal control problem).
Furthermore, in order to test the robustness of the obtained policies, different types of relevant sources of uncertainty will be considered regarding the diagnosis, growth-rate and dynamics of the tumor. The corresponding results are provided in Section VI.
Iii-B1 Parametric uncertainty: model parameter
The per-unit growth-rate of the tumor, , represents how aggressive the disease is. Thus, an accurate estimation of its value is important for the computation of the optimal control policy. We systematically varied its value and obtained the range for which the optimal control problem is feasible, as well as the sensitivity of the nominal optimal control policy to perturbations in the value of this parameter.
Iii-B2 Parametric uncertainty: initial condition
A wrong estimation of the initial size of the tumor , will also have an impact on the performance of the optimal policy. We will systematically evaluate the robustness of the nominal policy to perturbations on .
Iii-B3 Stochastic forcing in tumor dynamics
We simulated a system of SDEs of the form:
where the first term is the drift and corresponds to the deterministic dynamics given by the ODEs, and the second one is the diffusion term, modeled by a constant in our case and applied only to the equation that gives the evolution of the population of tumor cells . denotes a Wiener process, common in the motion of cells. We simulated the SDEs using the Euler-Maruyama scheme.
Iv Methods and Algorithms
Iv-a Optimal Control
The general form of a finite-horizon optimal control problem in Bolza form is:
Inspection of the program (2) reveals that it constitutes a particular case of problem (5) and thus can be solved using standard optimal control techniques. Something to note is that the control solution provided by (5) will usually be open-loop (i.e. the control policy will be a function of time, as opposed to feedback or closed-loop control laws, in which the control is a function of the state. Usually this last case is more desirable since it accounts for discrepancies between the real dynamics and the predictions made by the model and corrects the control input accordingly, making the controller more robust to uncertainty).
Iv-B Deep Reinforcement Learning
DDPG and DQN algorithms are considered. Both DDPG and DQN share two common features: they are off-policy and model-free algorithms. Off-policy means that the behaviour policy used is not the same as the policy being improved. This allows the use of a memory replay, and the use of any exploration strategy. Model-free means that the algorithm does not try to estimate the transition matrix of the dynamics of the environment . Instead, it estimates the optimal policy or value function directly.
The differences between these two algorithms are highlighted below.
Iv-B1 Deep Q Network (DQN)
Deep Q Network was proposed in , and the main difference with respect to standard Q-learning is the use of a neural network to approximate the action-value function . The DQN algorithm is shown in Alg. 1 (taken from ).
The inclusion of a neural network, usually leads to an unstable training due to two factors: the correlation between samples and the non-stationary targets. These two challenges are addressed by DQN using :
An experience replay: In a replay buffer, a dataset of tuples are saved. During the training, the agent will randomly sample mini-batch samples from this replay buffer (line 1 of the algorithm). This allows a stabilization of the training process, and a better approximation to i.i.d samples removing the correlation between them.
DQN is mainly used for problems with a discrete action space. However, some applications make use of the Normalized Advantage Function (NAF) to apply DQN in continuous action spaces .
Iv-B2 Deep Deterministic Policy Gradient (DDPG))
Policy-gradient: DDPG tries to estimate the gradient of the expected return, and the policy is updated using this estimate.
Actor-critic: DDPG has two different structures (see also Fig. 2) :
Actor: The actor contains the policy function. It takes the state of the environment as input, and produces an action.
Both the actor and critic described above are represented using neural networks.
V-a Optimal Control
We solved the optimal control problem using direct collocation methods , which transcribe the continuous dynamics and control functions to a finite set of algebraic variables and then solve a high-dimensional non-linear program (NLP). We used the MATLAB implementation of ICLOCS2 , the open-source Imperial College London Optimal Control software (available for download here) in conjunction with the open-source NLP solver IPOPT . We modified the open-source code and introduced the dynamics of cancer (the ODE system), cost functional and constrains, as well as the desired options and tolerances for the solvers.
We used h-methods for the transcription, particularly the Hermite-Simpson method with automatic mesh refinement and an initial number of 200 nodes. Regarding numerical tolerances, we allowed errors of up to in the state and control bounds, and in the terminal condition of cancer eradication. The optimal control solution for both the Case 0 and Case Patient is provided in Section VI.
V-B Reinforcement Learning
To implement the DRL algorithms, we used the open-source library ChainerRL . Moreover, we created an environment using the OpenAI Gym  framework. This environment takes an action and the current state, and returns the next observation and the reward obtained. In this environment, we implemented the system of ODEs, and solved it using the numerical integration methods provided by Scipy .
The values of the most relevant parameters of the DQN and DDPG implementations are shown in Table II. Note that we used to match the reward used in Optimal Control as well as possible.
To improve the convergence and the training times, all the states, actions and the reward were normalized to values in , and CuPy was used.
Note that both in DDPG and DQN, it is required to select the size of the time steps taken by the agent during an episode. We used a time step size of days, which achieved a reasonable training time.
|Replay Start Size|
|Hidden Units per layer||100||300|
Vi-a Case 0
The results for the preliminary Case 0 (in which there are not constraints on and ) are shown in Fig. 3. Note that the policy found by all the algorithms (O.C., DDPG and DQN) is the same: apply the maximum allowed drug infusion rate until the cancer is exterminated, which is the expected solution.
Vi-B Case Patient
In this case, we add the constraints and to the program. The optimal policies found by the three algorithms for the nominal growth tumor rate () are shown in Fig. 5. All the policies are able to exterminate the cancer in days. Note also that the policies provided by the different algorithms share some resemblances: Maximum allowed drug infusion at the beginning of the treatment, followed by an average value . In this case, the cost of the solutions provided by DDPG and DQN is slightly higher than that of O.C.
The convergence rates for DDPG and DQN are shown in Fig. 6. The training was stopped when the average of the Q-value reached a stationary behaviour. The training for each case considered took minutes. We also note that the shape of the average Q (see Figure 6) is that expected from a successful training procedure (see  for example): After the replay buffer has been filled, the average Q increases at the beginning of the training, it reaches a maximum (where the agent overestimates the Q value), and then decreases to achieve a stationary value.
Vi-B1 Parametric uncertainty: Model Parameter
To evaluate the sensitivity of the optimal policy to changes in , we perturbed around its nominal value and obtained the range for which an optimal policy exists. The problem is feasible for values and the corresponding policies are plotted in Fig. 4. Again, DDPG and DQN obtain a similar policy to that provided by O.C., although some of them present an oscillatory behaviour. Both for DQN and DDPG, these oscillations may be due to the length of the time step ( days), and might be reduced by refining it. Moreover, for the case of DQN the coarse discretization (10 nodes) of the action space should be accounted for.
Once we obtained the values of for which the problem is feasible, we tested the optimal policy obtained for in different scenarios, where . The results are shown in Fig. 7. In all the algorithms, the policy found eradicates the cancer for , and does not cure it for . However, for the case , DRL algorithms are successful exterminating the cancer, but O.C. is not. The increased robustness of DRL to parametric uncertainty might be due to its exploration phase: exploration makes the policies found by DRL algorithms contain more information than the optimal control solution, which only focuses on finding a minimum for the given model.
Vi-B2 Parametric uncertainty: initial condition
For this case, the agent is trained with the nominal initial condition , and tested with perturbed values of . The perturbed values and corresponding results are shown in Fig. 8. Note that DRL is able to exterminate the cancer for all the cases, while O.C. fails to do it for . Again, this argues for the increased robustness of the policies provided by DRL in comparison to O.C.
Vi-B3 Stochastic forcing in tumor dynamics
In this last scenario, the agent is tested with a stochastic forcing term in the dynamics of . The plots of the mean and standard deviations of the solutions found are shown in Fig. 9. DQN and DDPG drive the tumor closer to extermination than O.C. at the end of the treatment.
An interesting question to investigate is the dependence of learning in DQN and DDPG on the number of episodes. This question is specially relevant in data-poor scenarios, where one may wonder if the size of the dataset is enough to obtain a suitable policy. With this aim, the agent was evaluated after each training episode, and its cost was compared with that of O.C. In the case of DQN, seven different discretizations of the action space were considered: 7, 10, 20, 30, 40, 50 and 60 nodes. The resulting plot is shown in Fig. 10. It is observed that the DRL costs asymptote to the optimal one and that, after episodes, DDPG obtains a cost that is close to that of Optimal Control. Note also that, in general, DDPG tends to obtain better agents than DQN.
Vii Conclusions and Future Work
This work presented a comparison between classical O.C. and model-free DRL approaches, both in discrete and continuous action space. We showed that, with an accurate model of the dynamics, O.C. provides the best solution, but closely followed by DRL. Moreover, we showed that in the presence of parametric or stochastic uncertainty, DRL approaches have the potential to outperform classical trajectory optimization O.C. techniques.
In the Case 0, all the algorithms found the same optimal policy. In the Case Patient, the policies found by DRL perform similarly to O.C, but they exhibit increased robustness to uncertainties. Regarding the relative merits of DQN and DDPG, we found that DDPG outperforms DQN, presumably due to its continuous action space. Furthermore, it seems to learn faster.
The sampling analysis of the algorithms showed that approximately 1500 calls to the model are needed for DDPG to obtain a performance close to optimal.
As future work, we consider comparing the perfomance of DDPG and DQN with robust optimal control, where uncertainty on the dynamics is taken into account in the formulation of the optimal control program at the expense of increasing the cost associated to the policy found. Moreover, we plan also to compare the performace of DDPG and DQN with Model Predictive Control, that benefits from feedback in each iteration.
Both authors would like to thank Prof. David Sontag, Alejandro Rodriguez-Ramos and Dong-Ki Kim for valuable discussions and ideas. They are also grateful for the economic support of Fundacion Bancaria “la Caixa”.
-  Regina Padmanabhan, Nader Meskin, and Wassim M Haddad. Reinforcement learning-based control of drug dosing for cancer chemotherapy treatment. Mathematical biosciences, 293:11–20, 2017.
-  L.G.D. Pillis and A. Radunskaya. A mathematical tumor model with immune resistance and drug therapy: an optimal control approach. Comput. Math. Methods, 2001.
-  H. Sbeity and R. Younes. Review of optimization methods for cancer chemotherapy treatment planning. J. Comput. Sci. Syst. Bio., 2015.
-  Memorial sloan kettering cancer center. https://www.mskcc.org. Accessed: 8-Dec-2018.
-  D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis. A general reinforcement learning algorithm that masters chess, shogi and go through self-play. Science, 2018.
-  B. Recht. A tour of reinforcement learning: The view from continuous control. 2018.
-  V. Mnih, K. Kavukcuoglu, D Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Peterseu, C. Beattie, A. Sadik, I. Antonoglou, H. King, D Kumaran, D. Wierstra, S. Legg, and D. Hassaan. Human-level control through deep reinforcement learning. Nature, 2016.
-  T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. ICLR, 2016.
-  Ying Liu, Brent Logan, Ning Liu, Zhiyuan Xu, Jian Tang, and Yangzhi Wang. Deep reinforcement learning for dynamic treatment regimes on medical registry data. In 2017 IEEE International Conference on Healthcare Informatics (ICHI), pages 380–385. IEEE, 2017.
-  Huan-Hsin Tseng, Yi Luo, Sunan Cui, Jen-Tzung Chien, Randall K Ten Haken, and Issam El Naqa. Deep reinforcement learning for automated radiation adaptation in lung cancer. Medical physics, 44(12):6690–6705, 2017.
-  Google A.I.: Using deep-learning for breast tumor detection. https://ai.googleblog.com/2018/10/applying-deep-learning-to-metastatic.html. Accessed: 11-Dec-2018.
-  Dalit Engelhardt. Dynamic control of stochastic evolution: A deep reinforcement learning approach to adaptively targeting emergent drug resistance. arXiv preprint arXiv:1903.11373, 2019.
-  Emma Brunskill. CNNs and Deep Q-Learning. https://web.stanford.edu/class/cs234/slides/cs234_2018_l6.pdf. Accessed: 10-Dec-2018.
-  Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep q-learning with model-based acceleration. In International Conference on Machine Learning, pages 2829–2838, 2016.
-  Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
-  Deep deterministic policy gradients. https://pemami4911.github.io/blog/2016/08/21/ddpg-rl.html. Accessed: 10-Dec-2018.
-  J.T. Betts. Practical methods for optimal control and estimation using nonlinear programming. SIAM, 2010.
-  P. Falugi, E. Kerrigan, and E. van Wyk. Imperial College London Optimal Control Software: User Guide (ICLOCS). 2010.
-  A. Watcher and L.T. Biegler. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Math. Program., Ser. A, 2005.
-  ChainerRL, a deep reinforcement learning library. https://chainerrl.readthedocs.io/en/latest/. Accessed: 10-Dec-2018.
-  OpenAI Gym. https://gym.openai.com/. Accessed: 10-Dec-2018.
-  Scipy. https://docs.scipy.org/doc/scipy/reference/index.html. Accessed: 10-Dec-2018.
-  Ardi Tampuu, Tambet Matiisen, Dorian Kodelja, Ilya Kuzovkin, Kristjan Korjus, Juhan Aru, Jaan Aru, and Raul Vicente. Multiagent cooperation and competition with deep reinforcement learning. PloS one, 12(4):e0172395, 2017.