Personalized Cancer Chemotherapy Schedule: a numerical comparison of performance and robustness in modelbased and modelfree scheduling methodologies
Abstract
Reinforcement learning algorithms are gaining popularity in fields in which optimal scheduling is important, and oncology is not an exception. The complex and uncertain dynamics of cancer limit the performance of traditional modelbased scheduling strategies like Optimal Control. Motivated by the recent success of modelfree Deep Reinforcement Learning (DRL) in challenging control tasks and in the design of medical treatments, we use Deep QNetwork (DQN) and Deep Deterministic Policy Gradient (DDPG) to design a personalized cancer chemotherapy schedule. We show that both of them succeed in the task and outperform the Optimal Control solution in the presence of uncertainty. Furthermore, we show that DDPG can exterminate cancer more efficiently than DQN presumably due to its continuous action space. Finally, we provide some insight regarding the amount of samples required for the training.
I INTRODUCTION and OBJECTIVES
Cancer is a common name that is given to a group of diseases that involve the repeated and uncontrolled division and spreading of abnormal cells. These abnormal tissues are called tumors [1]. Early diagnosis and effective treatment improve the survival rate of these diseases.
The optimal treatment schedule and drug dose vary according to the stage of the tumor, the weight of the patient, the white blood cell levels (immune cells), concurrent illness and age of the patient. Thus, proper scheduling and personalization of the chemotherapy treatment are vital to reducing the mortality rate. This motivated the use of techniques originating in engineering fields – such as optimal control – to derive optimal drug dosing for cancer chemotherapy [2]. A review of modelbased scheduling strategies is provided in [3].
One of the main challenges associated with the study of cancer as a dynamical system is that it is known to be complex, nonlinear, and its mechanisms of action are uncertain. Consequently, first principle mathematical models may not be able to account for all the variations in the patient dynamics [1].
Motivated by the challenging nature of generating accurate models of cancer dynamics and the recent success stories of using RL for control [5, 6], we use modelfree DeepReinforcement Learning (DRL) algorithms to design a personalized cancer chemotherapy schedule. Particularly, we will use Deep QNetwork (DQN) [7] (with a discrete action space) and Deep Deterministic Policy Gradient [8] (with continuous action space), and provide a comparison of their performance.
RL and DRL have been successfully applied in different medical treatment designs. For instance, in [9] DRL was used to determine the optimal treatment regimes from medical data. [10] developed automated radiation adaptation protocols for Lung Cancer using DDPG. In [1], Qlearning was used to design a chemotherapy schedule with a discrete action space, and in [11] DL was used for the detection of metastatic breast cancer. Moreover, [12] used DDPG to control the drug dosing for suppressing cell growth modeled through a stochastic logistic growth model.
The aim of this work is to develop insilico trials to compare the performance of modelbased and modelfree techniques in scheduling a personalized chemotherapy treatment. Particularly, our objectives include:

Scheduling the personalized chemotherapy treatment by solving the optimal control problem to establish a baseline for comparison and evaluate whether DQN and DDPG provide similar performance.

Gaining a qualitative understanding of the amount of training data (episodes) required by DQN and DDPG to perform similarly to the baseline policy.

Evaluating the robustness of the optimal controller, DQN and DDPG policies in the presence of different types of relevant uncertainties: parametric in the dynamics, parametric in the initial conditions of the model (diagnosis) and in the presence of a stochastic forcing term in the tumor cell population dynamics.

Comparing the performance of DQN and DDPG in the scenarios described above.
Ii Model
In order to compute the optimal control policy and the reward in DRL, a mathematical model that captures the distribution and effects of the chemotherapy drug is required. A realistic model should address tumor growth, the reaction of the human immune system to the tumor growth, and the effects of chemotherapy treatment on immune cells, normal cells and tumor growth [2], [3].
We will simulate patient’s response to the treatment through a pharmacological model of cancer chemotherapy, given by a nonlinear and coupled system of 4 deterministic Ordinary Differential Equations (ODEs) [1]:
(1)  
where is the number of immune cells, is the number of normal cells, is the number of tumor cells and is the drug concentration. The action or control is the chemotherapy drug infusion rate [], which we will schedule through Optimal Control and DRL, respectively. The initial conditions, are determined according to the diagnosis. From now on, we will refer to this nonlinear model by , where and . For the sake of visualization, an electron micrograph of immune and cancer cells is shown in Figure 1.
The model parameters and the corresponding values used in our simulations are provided in Table I. Note that the patient and his diagnosis determine both the initial conditions and the value of the model parameters.
As mentioned in the Introduction, in order to test the robustness of the optimal control and DRL policies, we simulated the previously presented system of ODEs with parametric uncertainty and also with a stochastic forcing term. The reader is referred to the next Section for further details on the uncertain scenarios considered.
Iii Scenarios Considered
A personalized scheduling of chemotherapy treatment is fundamental to patient’s recovery from the disease. When designing the treatment schedule, it is important to optimize the amount of drug used in order to regulate the potentially lethal side effects of chemotherapy, since often the patient’s immune system weakens and becomes prone to lifethreatening infections, diminishing its capability to eradicate cancer [1].
Consequently, we obtained the Optimal Control and DRL policies for two different cases: a preliminary and somewhat unrealistic Case 0, in which the goal is to exterminate cancer regardless of the state of the rest of the cell populations. We used this case to validate our algorithms, codes and parameter and hypermarameter values. Then, we simulated a Case Patient, in which cancer is eradicated while preserving a minimum population of normal and immune cells in order to guarantee patient’s safety. In both cases, the initial condition was the same: .
Iiia Case 0
The optimal control problem for Case 0 is formulated through the following nonlinear program:
(2)  
s. t.  
Both the policy and the length of the treatment are determined by the optimal solution. The hard constraints correspond to the dynamics of cancer and the initial state of the patient, bounds in the drugrate and the target of cancer eradication. The cost is the area enclosed by the tumor cell population (note that is nonnegative) and the time axis, between and . For the DRL algorithms, the reward used is , where corresponds to the length of the timestep, and it is a fixed value. Note that is included in the reward just to make the comparison with the cost functional used in the Optimal Control program straightforward.
IiiB Case Patient
In order to guarantee patient’s safety during the treatment, additional state constraints are added to the program (2). Particularly, the program is augmented with the constraints and .
In the case of the DRL algorithms, the constraints are imposed softly by modifying the reward function:
(3) 
where denotes the Iverson bracket. The last two terms of equation (3) penalize the violation of the constraints and . Note that, for the cases in which the soft constraints are satisfied, this reward is the negative of the Riemann sum approximation of the area below the curve of the number of tumor cells (i.e. it is a rectangular approximation to the negative of the integral cost used in the optimal control problem).
Furthermore, in order to test the robustness of the obtained policies, different types of relevant sources of uncertainty will be considered regarding the diagnosis, growthrate and dynamics of the tumor. The corresponding results are provided in Section VI.
IiiB1 Parametric uncertainty: model parameter
The perunit growthrate of the tumor, , represents how aggressive the disease is. Thus, an accurate estimation of its value is important for the computation of the optimal control policy. We systematically varied its value and obtained the range for which the optimal control problem is feasible, as well as the sensitivity of the nominal optimal control policy to perturbations in the value of this parameter.
IiiB2 Parametric uncertainty: initial condition
A wrong estimation of the initial size of the tumor , will also have an impact on the performance of the optimal policy. We will systematically evaluate the robustness of the nominal policy to perturbations on .
IiiB3 Stochastic forcing in tumor dynamics
We simulated a system of SDEs of the form:
(4) 
where the first term is the drift and corresponds to the deterministic dynamics given by the ODEs, and the second one is the diffusion term, modeled by a constant in our case and applied only to the equation that gives the evolution of the population of tumor cells . denotes a Wiener process, common in the motion of cells. We simulated the SDEs using the EulerMaruyama scheme.
Iv Methods and Algorithms
Iva Optimal Control
The general form of a finitehorizon optimal control problem in Bolza form is:
(5)  
s. t.  
Inspection of the program (2) reveals that it constitutes a particular case of problem (5) and thus can be solved using standard optimal control techniques. Something to note is that the control solution provided by (5) will usually be openloop (i.e. the control policy will be a function of time, as opposed to feedback or closedloop control laws, in which the control is a function of the state. Usually this last case is more desirable since it accounts for discrepancies between the real dynamics and the predictions made by the model and corrects the control input accordingly, making the controller more robust to uncertainty).
IvB Deep Reinforcement Learning
DDPG and DQN algorithms are considered. Both DDPG and DQN share two common features: they are offpolicy and modelfree algorithms. Offpolicy means that the behaviour policy used is not the same as the policy being improved. This allows the use of a memory replay, and the use of any exploration strategy. Modelfree means that the algorithm does not try to estimate the transition matrix of the dynamics of the environment . Instead, it estimates the optimal policy or value function directly.
The differences between these two algorithms are highlighted below.
IvB1 Deep Q Network (DQN)
Deep Q Network was proposed in [7], and the main difference with respect to standard Qlearning is the use of a neural network to approximate the actionvalue function . The DQN algorithm is shown in Alg. 1 (taken from [7]).
The inclusion of a neural network, usually leads to an unstable training due to two factors: the correlation between samples and the nonstationary targets. These two challenges are addressed by DQN using [13]:

An experience replay: In a replay buffer, a dataset of tuples are saved. During the training, the agent will randomly sample minibatch samples from this replay buffer (line 1 of the algorithm). This allows a stabilization of the training process, and a better approximation to i.i.d samples removing the correlation between them.
DQN is mainly used for problems with a discrete action space. However, some applications make use of the Normalized Advantage Function (NAF) to apply DQN in continuous action spaces [14].
IvB2 Deep Deterministic Policy Gradient (DDPG))
DDPG was proposed in [8]. The algorithm used by DDPG is shown in Alg. 2 (taken from [8]), and its main characteristics are these ones [8], [15], [16]:

Policygradient: DDPG tries to estimate the gradient of the expected return, and the policy is updated using this estimate.

Actorcritic: DDPG has two different structures (see also Fig. 2) :

Actor: The actor contains the policy function. It takes the state of the environment as input, and produces an action.

Both the actor and critic described above are represented using neural networks.
V Implementation
Va Optimal Control
We solved the optimal control problem using direct collocation methods [17], which transcribe the continuous dynamics and control functions to a finite set of algebraic variables and then solve a highdimensional nonlinear program (NLP). We used the MATLAB implementation of ICLOCS2 [18], the opensource Imperial College London Optimal Control software (available for download here) in conjunction with the opensource NLP solver IPOPT [19]. We modified the opensource code and introduced the dynamics of cancer (the ODE system), cost functional and constrains, as well as the desired options and tolerances for the solvers.
We used hmethods for the transcription, particularly the HermiteSimpson method with automatic mesh refinement and an initial number of 200 nodes. Regarding numerical tolerances, we allowed errors of up to in the state and control bounds, and in the terminal condition of cancer eradication. The optimal control solution for both the Case 0 and Case Patient is provided in Section VI.
VB Reinforcement Learning
To implement the DRL algorithms, we used the opensource library ChainerRL [20]. Moreover, we created an environment using the OpenAI Gym [21] framework. This environment takes an action and the current state, and returns the next observation and the reward obtained. In this environment, we implemented the system of ODEs, and solved it using the numerical integration methods provided by Scipy [22].
The values of the most relevant parameters of the DQN and DDPG implementations are shown in Table II. Note that we used to match the reward used in Optimal Control as well as possible.
To improve the convergence and the training times, all the states, actions and the reward were normalized to values in , and CuPy was used.
Note that both in DDPG and DQN, it is required to select the size of the time steps taken by the agent during an episode. We used a time step size of days, which achieved a reasonable training time.
Parameter  DQN  DDPG 

Replay Start Size  
Layers  2  3 
Hidden Units per layer  100  300 
Discount Factor  0.99  0.995 
Vi Results
Via Case 0
The results for the preliminary Case 0 (in which there are not constraints on and ) are shown in Fig. 3. Note that the policy found by all the algorithms (O.C., DDPG and DQN) is the same: apply the maximum allowed drug infusion rate until the cancer is exterminated, which is the expected solution.
ViB Case Patient
In this case, we add the constraints and to the program. The optimal policies found by the three algorithms for the nominal growth tumor rate () are shown in Fig. 5. All the policies are able to exterminate the cancer in days. Note also that the policies provided by the different algorithms share some resemblances: Maximum allowed drug infusion at the beginning of the treatment, followed by an average value . In this case, the cost of the solutions provided by DDPG and DQN is slightly higher than that of O.C.
The convergence rates for DDPG and DQN are shown in Fig. 6. The training was stopped when the average of the Qvalue reached a stationary behaviour. The training for each case considered took minutes. We also note that the shape of the average Q (see Figure 6) is that expected from a successful training procedure (see [23] for example): After the replay buffer has been filled, the average Q increases at the beginning of the training, it reaches a maximum (where the agent overestimates the Q value), and then decreases to achieve a stationary value.
ViB1 Parametric uncertainty: Model Parameter
To evaluate the sensitivity of the optimal policy to changes in , we perturbed around its nominal value and obtained the range for which an optimal policy exists. The problem is feasible for values and the corresponding policies are plotted in Fig. 4. Again, DDPG and DQN obtain a similar policy to that provided by O.C., although some of them present an oscillatory behaviour. Both for DQN and DDPG, these oscillations may be due to the length of the time step ( days), and might be reduced by refining it. Moreover, for the case of DQN the coarse discretization (10 nodes) of the action space should be accounted for.
Once we obtained the values of for which the problem is feasible, we tested the optimal policy obtained for in different scenarios, where . The results are shown in Fig. 7. In all the algorithms, the policy found eradicates the cancer for , and does not cure it for . However, for the case , DRL algorithms are successful exterminating the cancer, but O.C. is not. The increased robustness of DRL to parametric uncertainty might be due to its exploration phase: exploration makes the policies found by DRL algorithms contain more information than the optimal control solution, which only focuses on finding a minimum for the given model.
ViB2 Parametric uncertainty: initial condition
For this case, the agent is trained with the nominal initial condition , and tested with perturbed values of . The perturbed values and corresponding results are shown in Fig. 8. Note that DRL is able to exterminate the cancer for all the cases, while O.C. fails to do it for . Again, this argues for the increased robustness of the policies provided by DRL in comparison to O.C.
ViB3 Stochastic forcing in tumor dynamics
In this last scenario, the agent is tested with a stochastic forcing term in the dynamics of . The plots of the mean and standard deviations of the solutions found are shown in Fig. 9. DQN and DDPG drive the tumor closer to extermination than O.C. at the end of the treatment.
ViC Sampling
An interesting question to investigate is the dependence of learning in DQN and DDPG on the number of episodes. This question is specially relevant in datapoor scenarios, where one may wonder if the size of the dataset is enough to obtain a suitable policy. With this aim, the agent was evaluated after each training episode, and its cost was compared with that of O.C. In the case of DQN, seven different discretizations of the action space were considered: 7, 10, 20, 30, 40, 50 and 60 nodes. The resulting plot is shown in Fig. 10. It is observed that the DRL costs asymptote to the optimal one and that, after episodes, DDPG obtains a cost that is close to that of Optimal Control. Note also that, in general, DDPG tends to obtain better agents than DQN.
Vii Conclusions and Future Work
This work presented a comparison between classical O.C. and modelfree DRL approaches, both in discrete and continuous action space. We showed that, with an accurate model of the dynamics, O.C. provides the best solution, but closely followed by DRL. Moreover, we showed that in the presence of parametric or stochastic uncertainty, DRL approaches have the potential to outperform classical trajectory optimization O.C. techniques.
In the Case 0, all the algorithms found the same optimal policy. In the Case Patient, the policies found by DRL perform similarly to O.C, but they exhibit increased robustness to uncertainties. Regarding the relative merits of DQN and DDPG, we found that DDPG outperforms DQN, presumably due to its continuous action space. Furthermore, it seems to learn faster.
The sampling analysis of the algorithms showed that approximately 1500 calls to the model are needed for DDPG to obtain a performance close to optimal.
As future work, we consider comparing the perfomance of DDPG and DQN with robust optimal control, where uncertainty on the dynamics is taken into account in the formulation of the optimal control program at the expense of increasing the cost associated to the policy found. Moreover, we plan also to compare the performace of DDPG and DQN with Model Predictive Control, that benefits from feedback in each iteration.
Acknowledgment
Both authors would like to thank Prof. David Sontag, Alejandro RodriguezRamos and DongKi Kim for valuable discussions and ideas. They are also grateful for the economic support of Fundacion Bancaria “la Caixa”.
References
 [1] Regina Padmanabhan, Nader Meskin, and Wassim M Haddad. Reinforcement learningbased control of drug dosing for cancer chemotherapy treatment. Mathematical biosciences, 293:11–20, 2017.
 [2] L.G.D. Pillis and A. Radunskaya. A mathematical tumor model with immune resistance and drug therapy: an optimal control approach. Comput. Math. Methods, 2001.
 [3] H. Sbeity and R. Younes. Review of optimization methods for cancer chemotherapy treatment planning. J. Comput. Sci. Syst. Bio., 2015.
 [4] Memorial sloan kettering cancer center. https://www.mskcc.org. Accessed: 8Dec2018.
 [5] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis. A general reinforcement learning algorithm that masters chess, shogi and go through selfplay. Science, 2018.
 [6] B. Recht. A tour of reinforcement learning: The view from continuous control. 2018.
 [7] V. Mnih, K. Kavukcuoglu, D Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Peterseu, C. Beattie, A. Sadik, I. Antonoglou, H. King, D Kumaran, D. Wierstra, S. Legg, and D. Hassaan. Humanlevel control through deep reinforcement learning. Nature, 2016.
 [8] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. ICLR, 2016.
 [9] Ying Liu, Brent Logan, Ning Liu, Zhiyuan Xu, Jian Tang, and Yangzhi Wang. Deep reinforcement learning for dynamic treatment regimes on medical registry data. In 2017 IEEE International Conference on Healthcare Informatics (ICHI), pages 380–385. IEEE, 2017.
 [10] HuanHsin Tseng, Yi Luo, Sunan Cui, JenTzung Chien, Randall K Ten Haken, and Issam El Naqa. Deep reinforcement learning for automated radiation adaptation in lung cancer. Medical physics, 44(12):6690–6705, 2017.
 [11] Google A.I.: Using deeplearning for breast tumor detection. https://ai.googleblog.com/2018/10/applyingdeeplearningtometastatic.html. Accessed: 11Dec2018.
 [12] Dalit Engelhardt. Dynamic control of stochastic evolution: A deep reinforcement learning approach to adaptively targeting emergent drug resistance. arXiv preprint arXiv:1903.11373, 2019.
 [13] Emma Brunskill. CNNs and Deep QLearning. https://web.stanford.edu/class/cs234/slides/cs234_2018_l6.pdf. Accessed: 10Dec2018.
 [14] Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep qlearning with modelbased acceleration. In International Conference on Machine Learning, pages 2829–2838, 2016.
 [15] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
 [16] Deep deterministic policy gradients. https://pemami4911.github.io/blog/2016/08/21/ddpgrl.html. Accessed: 10Dec2018.
 [17] J.T. Betts. Practical methods for optimal control and estimation using nonlinear programming. SIAM, 2010.
 [18] P. Falugi, E. Kerrigan, and E. van Wyk. Imperial College London Optimal Control Software: User Guide (ICLOCS). 2010.
 [19] A. Watcher and L.T. Biegler. On the implementation of an interiorpoint filter linesearch algorithm for largescale nonlinear programming. Math. Program., Ser. A, 2005.
 [20] ChainerRL, a deep reinforcement learning library. https://chainerrl.readthedocs.io/en/latest/. Accessed: 10Dec2018.
 [21] OpenAI Gym. https://gym.openai.com/. Accessed: 10Dec2018.
 [22] Scipy. https://docs.scipy.org/doc/scipy/reference/index.html. Accessed: 10Dec2018.
 [23] Ardi Tampuu, Tambet Matiisen, Dorian Kodelja, Ilya Kuzovkin, Kristjan Korjus, Juhan Aru, Jaan Aru, and Raul Vicente. Multiagent cooperation and competition with deep reinforcement learning. PloS one, 12(4):e0172395, 2017.