Explicit approximation of stochastic optimal feedback control for combined therapy of cancer
Abstract
In this paper, a tractable methodology is proposed to approximate stochastic optimal feedback treatment in the context of mixed immuno-chemotherapy of cancer. The method uses a fixed-point value iteration that approximately solves a stochastic dynamic programming-like equation. It is shown in particular that introducing a variance-related penalty in the latter yields better results, mitigating the consequences of softening the health safety constraint in the cost function. The convergence of the value-function iteration is revisited in the presence of the variance-related term. The implementation involves Machine Learning tools in order to represent the optimal value function and to perform complexity reduction by clustering. A quantitative illustration is given using a commonly used model of combined therapy involving twelve highly uncertain parameters.
Keywords. Stochastic optimal control; stochastic dynamic programming; cancer therapy; support vector machines; clustering; fixed-point value iteration; convergence analysis.
Mazen Alamir
Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, 38000 Grenoble, France
1 Introduction
Rationalizing drug delivery is an active research field that commonly involves population models with a high number of uncertain parameters. Unfortunately, these parameters, which are by nature highly variable between individuals, are inaccessible to identification because of the lack of excitation [9].
The great majority of applied mathematical works solve deterministic optimal control problems that are formulated using the nominal values of the parameters [15, 6, 11]. While this might be important to draw qualitative conclusions regarding the patterns of optimal strategies (intensive treatment, presence of singular arcs, etc.), the resulting profiles do not account for the high dispersion of the parameters.
Using repeated solutions of such open-loop nominal scheduling strategies in a feedback mode through the receding-horizon principle (apply the first part of the optimal strategy, then recompute a new optimal injection profile at the next decision instant, and so on, leading to the so-called Model Predictive Control (MPC) design) obviously reduces the drawback of parameter mismatch [4, 5]. However, the performance can only be observed a posteriori, as nothing is explicitly done to address the presence of uncertainties.
Another option is to use a parameterized state feedback law and to tune its design parameters optimally, using explicitly the statistical description of the dispersion of the model's parameters by means of the randomized optimization framework [1]. Unfortunately, as far as available works are concerned, this has to be done anew for each initial state.
Stochastic model predictive control (SMPC) offers an attractive alternative through the use of a set of scenarios in the online computation in order to minimize an approximate version of the expectation of the cost function [13, 12, 2]. This leads to a solution for each initial state, but does not give a global overview of the performance over a whole region of the state space. On the other hand, this solution can be scalable in the dimension of the state, which enables the use of almost arbitrarily complex models of the underlying dynamics.
Stochastic Dynamic Programming (SDP) [7, 3, 10] is a framework that, at least conceptually, outperforms the previous approaches in that it gives 1) an explicit state feedback law that 2) explicitly incorporates the statistical description of the parameters' dispersion while being 3) defined on a whole region of interest within the state space.
Unfortunately, the price to pay for these advantages is having to represent the Bellman function over the extended space of state and control, making this approach non-scalable in the dimensions of the state and control. Moreover, when minimizing the residual of the SDP equation, each call involves the computation of statistically defined quantities, which, for the computation to be relevant, should theoretically involve a high number of randomly sampled instances of the uncertain model parameters. Note, however, that this last drawback is obviously shared by the SMPC approach.
The literature on solving the stochastic dynamic programming equation is huge, and a complete survey of it is out of the scope of this paper; for a complete and recent account, one can consult [3]. This paper does not claim a particularly novel algorithm for solving the SDP equation. Rather, it focuses on the introduction of a variance-related term in the definition of the cost function, investigates its impact on handling the health-related constraint in the combined therapy of cancer, and revisits the convergence proof of the underlying value-function fixed-point iteration when such an additional variance-related term is added.
More precisely, a simple framework is proposed for moderate-sized models (the combined-therapy model used involves four states and two controls, which are commonly encountered sizes in the related works) in which Machine Learning (ML) tools are used. In particular, a Support Vector Machine (SVM) regression model is used to represent the value function, while a clustering approach is used to approximate the statistical quantities with a lower number of samples. This leads to a tractable approximate solution that can be obtained in less than 20 minutes on a grid of 9604 points in the state/control space when no parallelization is used. A drastic reduction of the computation time can be obtained, since the proposed scheme is amenable to massive parallelization.
This paper is organized as follows: the problem of combined therapy is stated in Section 2. Some recalls regarding SDP are given in Section 3, together with results on the convergence of the fixed-point iteration that is used to solve the associated equations. Section 4 details the proposed approximate solution framework, while Section 5 shows the results on the combined therapy of cancer and discusses some issues. Finally, Section 6 summarizes the paper's contributions and gives some hints for further investigation.
2 Problem statement
2.1 The dynamic model
Let us consider the population model used in [8] to describe the dynamics involved in the combined immuno-chemotherapy:
\dot{x}_{1}=ax_{1}(1-bx_{1})-c_{1}x_{4}x_{1}-k_{3}x_{3}x_{1}  (1)
\dot{x}_{2}=-\delta x_{2}-k_{2}x_{3}x_{2}+s_{2}  (2)
\dot{x}_{3}=-\gamma_{0}x_{3}+u_{2}  (3)
\dot{x}_{4}=g\dfrac{x_{1}}{h+x_{1}}x_{4}-rx_{4}-p_{0}x_{4}x_{1}-k_{1}x_{4}x_{3}+s_{1}u_{1}  (4)
where
x_{1}: tumor cell population
x_{2}: circulating lymphocyte population
x_{3}: chemotherapy drug concentration
x_{4}: effector immune cell population
u_{1}: rate of introduction of immune cells
u_{2}: rate of introduction of chemotherapy
The description of the role of each group of terms is given in Table 1 for the sake of clarity.
Eq. | Term | Description
(1) | ax_{1}(1-bx_{1}) | Logistic tumor growth
(1) | -c_{1}x_{4}x_{1} | Death of tumor due to effector cells
(1) | -k_{3}x_{3}x_{1} | Death of tumor due to chemotherapy
(2) | -\delta x_{2} | Death of circulating lymphocytes
(2) | -k_{2}x_{3}x_{2} | Death of lymphocytes due to chemo
(2) | s_{2} | Constant source of lymphocytes
(3) | -\gamma_{0}x_{3} | Exponential decay of chemotherapy
(4) | g\dfrac{x_{1}}{h+x_{1}}x_{4} | Stimulation of tumor on effector cells
(4) | -rx_{4} | Death of effector cells
(4) | -p_{0}x_{4}x_{1} | Inactivation of effector cells by tumor
(4) | -k_{1}x_{4}x_{3} | Death of effector cells due to chemo

Table 1: Role of each group of terms in the model (1)-(4).
When referring to the nominal values of the parameters involved in the model (1)-(4), the following values are used:
a=0.25,\quad b=1.02\times 10^{-14},\quad c_{1}=4.41\times 10^{-10},\quad g=1.5\times 10^{-2}
h=20.2,\quad k_{2}=k_{3}=0.6,\quad k_{1}=0.8,\quad p_{0}=2\times 10^{-11},\quad r=0.04
s_{1}=1.2\times 10^{7},\quad s_{2}=7.5\times 10^{6},\quad\delta=1.2\times 10^{-2},\quad\gamma_{0}=0.9
Note that the dynamic model (1)-(4) involves 14 parameters. It is worth noting, however, that the order of magnitude of x_{1} is about 10^{9} while h\approx 20. This means that h acts only when the tumor has almost disappeared (on the other hand, without h, the effector immune population would keep growing in the absence of the tumor, which contradicts its very definition as a tumor-stimulated immune population). That is the reason why, in the remainder of this paper, this parameter is supposed to be known and hence is not included in the set of uncertain parameters. As shown later, this enables a separable structure of the evolution map to be used [see (6)].
Gathering all the other parameters in a vector p\in\mathbb{R}^{13} and using some discretization scheme with sampling period \tau, the dynamics above can be put in the following condensed form for ease of notation:
x^{+}=f(x,u,p),\quad (x,u,p)\in\mathbb{R}^{4}\times\mathbb{R}^{2}\times\mathbb{R}^{13}  (5)
\phantom{x^{+}}=\Phi(x,u)\Psi(p)  (6)
for straightforward definition of \Phi and \Psi.
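As an illustration of how the separable form (6) can be obtained, the sketch below implements one explicit-Euler step of (1)-(4), gathering the parameter-dependent coefficients in a vector playing the role of \Psi(p) and the state/control-dependent factors in a matrix playing the role of \Phi(x,u). The parameter ordering, the Euler scheme and the way the identity part is kept outside the product are illustrative choices, not the paper's exact construction.

```python
import numpy as np

def euler_step(x, u, p, tau=0.25):
    """One explicit-Euler step of (1)-(4) in the separable spirit of (6):
    x+ = x + tau * Phi(x, u) @ psi(p). Parameter ordering is illustrative."""
    x = np.asarray(x, dtype=float)
    a, b, c1, g, k1, k2, k3, p0, r, s1, s2, delta, gamma0 = p
    x1, x2, x3, x4 = x
    u1, u2 = u
    h = 20.2  # h is treated as known (see the discussion above)
    # psi plays the role of Psi(p): 14 coefficient entries
    psi = np.array([a, a * b, c1, k3, delta, k2, s2,
                    gamma0, g, r, p0, k1, s1, 1.0])
    Phi = np.zeros((4, psi.size))
    Phi[0, 0] = x1;            Phi[0, 1] = -x1 ** 2       # a x1 (1 - b x1)
    Phi[0, 2] = -x4 * x1;      Phi[0, 3] = -x3 * x1
    Phi[1, 4] = -x2;           Phi[1, 5] = -x3 * x2
    Phi[1, 6] = 1.0                                        # constant source s2
    Phi[2, 7] = -x3;           Phi[2, 13] = u2             # chemo input
    Phi[3, 8] = x1 / (h + x1) * x4
    Phi[3, 9] = -x4;           Phi[3, 10] = -x4 * x1
    Phi[3, 11] = -x4 * x3;     Phi[3, 12] = u1             # immune input s1 u1
    return x + tau * Phi @ psi
```

Note that, with this ordering, psi indeed has the 14 entries mentioned for \Psi(p) in Section 4.3.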
The trajectory of the system starting at initial state x, under a control profile \bm{u} and for the parameter value p, is denoted by x^{\bm{u}}(k|x,p), namely:
x^{\bm{u}}(k+1|x,p)=f\bigl(x^{\bm{u}}(k|x,p),u(k),p\bigr)
x^{\bm{u}}(0|x,p)=x
2.2 The control objective and constraints
The aim of the combined therapy is to reduce the size of the tumor population x_{1} while keeping the level of lymphocyte cells x_{2} above a lower bound x_{2}^{min}. This has to be done using quantized drug delivery:
u\in\mathbb{U}:=\{0,u_{1}^{max}\}\times\{0,u_{2}^{max}\}  (7) 
so that the set of infinite control sequences taking values in \mathbb{U} is denoted hereafter by \mathbb{U}^{\infty}.
In order to address the control objective, the following stage cost function can be used, which incorporates the constraint on x_{2} as a soft constraint:
L(x,u):=x_{1}^{2}+\rho_{c}\max\{0,x_{2}^{min}-x_{2}\}+\rho_{1}u_{1}+\rho_{2}u_{2}  (8)
so that, if the model's parameters were perfectly known, an optimization problem could be defined for any given initial state x as follows:
\mathcal{P}(x|p):\quad\min_{\bm{u}\in\mathbb{U}^{\infty}}J(\bm{u}|x,p):=\sum_{k=0}^{\infty}\gamma^{k}L\Bigl(x^{\bm{u}}(k|x,p),u(k)\Bigr)  (9)
for some \gamma\in(0,1]. Now, since the parameters are supposed to be unknown, the stochastic control approach amounts to replacing the deterministic cost function (9) by a statistically defined one such as:
J_{\alpha}(\bm{u}|x,\Pi):=\mu(J(\bm{u}|x,\cdot))+\alpha\,\sigma(J(\bm{u}|x,\cdot))  (10)
\phantom{J_{\alpha}(\bm{u}|x,\Pi)}=:\mathcal{S}_{\alpha}^{(J)}(\bm{u},x)  (11)
where \mu(\cdot) and \sigma(\cdot) stand respectively for the expectation and the variance of their argument when p is sampled according to some supposedly known probability density function (pdf) \Pi. Note that (11) simply introduces a short notation for the r.h.s. of (10).
This leads to the following stochastic optimization problem:
\mathcal{P}_{\alpha}(x|\Pi):\quad\min_{\bm{u}\in\mathbb{U}^{\infty}}J_{\alpha}(\bm{u}|x,\Pi)  (12)
Note that in many stochastic-control-related works, attention is focused on the expectation of the cost function, so that \alpha=0 is widely used in (10). This is not totally appropriate in the case of cancer therapy. Indeed, the expectation is a relevant indicator only when a high number of realizations is expected to take place (such as saving energy in building management despite poor knowledge of exogenous parameters such as power demands and weather conditions), in which case only the global average is of interest. In the case of cancer therapy, a scenario is a patient being treated. This is why one should obviously be interested in the risk that each patient faces during the treatment. In such cases, it is precisely the bad scenarios with non-negligible likelihood that really matter. This has to do not only with the expectation but definitely with the associated variance as well, which explains the use of \alpha>0 in (12).
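To fix ideas, the quantized action set (7), the soft-constrained stage cost (8) and the mean-plus-variance scalarization (10) can be sketched as follows. The weights rho_c, rho1, rho2 and the normalized drug levels are illustrative placeholders (their numerical values are not reported in this section of the paper):

```python
import itertools
import numpy as np

# Quantized action set (7), with normalized levels u_max = 1 (assumption)
U = list(itertools.product([0.0, 1.0], repeat=2))

def stage_cost(x, u, x2_min=0.05, rho_c=1.0, rho1=1.0, rho2=1.0):
    """Stage cost (8); the weights rho_* are illustrative placeholders."""
    return x[0] ** 2 + rho_c * max(0.0, x2_min - x[1]) + rho1 * u[0] + rho2 * u[1]

def mean_var_cost(cost_samples, alpha=0.1):
    """Scalarization (10): expectation plus alpha times the variance of
    the per-scenario costs J(u|x, p) over sampled parameter vectors p."""
    c = np.asarray(cost_samples, dtype=float)
    return float(c.mean() + alpha * c.var())
```

With alpha = 0 this reduces to the pure-expectation criterion discussed above; alpha > 0 penalizes the dispersion of the per-patient outcomes.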
A part of the discussion and a large part of the results shown in this paper are dedicated to showing the relevance of introducing the variance-related term in the cancer therapy stochastic control problem formulation on the one hand, and to the convergence of the computation scheme in the presence of a non-vanishing \alpha in the stochastic cost formulation on the other hand.
In the next section, some recalls are proposed on the Stochastic Dynamic Programming equations that might be invoked to approximately solve the stochastic optimization problem (12).
3 Recalls and preliminary results
3.1 Stochastic Dynamic Programming
SDP attempts to solve the following functional equation, in which the unknown function V(\cdot) is the optimal value function of (12). More precisely:
V(x):=\min_{u\in\mathbb{U}}Q(x,u)\qquad\mbox{where}  (13)
Q(x,u):=L(x,u)+\gamma\min_{v\in\mathbb{U}}\Bigl[\mathcal{S}_{\alpha}^{(Q)}(x,u,v)\Bigr]  (14)
\mbox{with}\quad x^{+}=f(x,u,p)  (15)
in which \mathcal{S}^{(Q)}_{\alpha}(x,u,v) is defined in a similar way to \mathcal{S}^{(J)}_{\alpha}(\bm{u},x) [see (11)]:
\mathcal{S}^{(Q)}_{\alpha}(x,u,v):=\mu\bigl(Q(f(x,u,\cdot),v)\bigr)+\alpha\,\sigma\bigl(Q(f(x,u,\cdot),v)\bigr)  (16)
where here again, \mu(\cdot) and \sigma(\cdot) are respectively the expectation and the variance of their argument when p (involved in the definition of f(x,u,p)) is sampled according to the pdf \Pi.
The fact that a V satisfying (13)-(14) is a solution to (12) is a direct consequence of the Bellman principle and of the fact that L=\mu(L)+\sigma(L)=L+0, since L(x,u) is deterministic (it does not depend on p).
From the above, it comes out that the truly unknown function to find is Q(\cdot,\cdot) since the optimal value function V(\cdot) as well as the optimal control are then recovered from the static optimization problem (13).
Now it is not hard to see from (14) that Q is a fixed point of the iteration:
Q^{(i+1)}=F\Bigl{(}Q^{(i)}\Bigr{)}  (17) 
where the operator F (commonly called the Bellman operator) is defined by:
F(Q)(x,u):=L(x,u)+\gamma\min_{v\in\mathbb{U}}\Bigl[\mathcal{S}_{\alpha}^{(Q)}(x,u,v)\Bigr]  (18)
In the next section, some convergence results regarding the above fixed-point iteration are derived in a general conceptual setting, before a specific implementation is proposed in Section 4 for the combined therapy of cancer.
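Before moving on, the contraction behavior that the next subsection establishes can be checked numerically on a toy, purely finite problem (random stage costs and successor indices; all sizes are hypothetical and unrelated to the therapy model). With gamma = 0.8 and a small variance penalty alpha, in the spirit of condition (20) below, the iteration settles to a fixed point:

```python
import numpy as np

rng = np.random.default_rng(0)
n_z, n_u, n_p = 12, 4, 8                      # toy grid/action/parameter sizes
L_cost = rng.uniform(0.0, 1.0, n_z)           # toy stage cost per grid point
succ = rng.integers(0, n_z, (n_z, n_u, n_p))  # toy successor index per (z, u, p)
gamma, alpha = 0.8, 0.01                      # small alpha, in the spirit of (20)

def bellman(q):
    """Variance-penalized Bellman operator in the spirit of (18):
    minimize over actions the sample mean plus alpha times the sample
    variance of q evaluated at the random successors."""
    qs = q[succ]                                   # shape (n_z, n_u, n_p)
    s = qs.mean(axis=2) + alpha * qs.var(axis=2)   # mean + alpha * variance
    return L_cost + gamma * s.min(axis=1)

q = L_cost.copy()
for _ in range(500):                # fixed-point iteration (17)
    q_new = bellman(q)
    if np.max(np.abs(q_new - q)) < 1e-10:
        break
    q = q_new
```

The geometric decrease of the successive differences mirrors the contraction argument used in the proof of Lemma 1.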
3.2 FixedPoint iteration convergence analysis
In order to derive sufficient conditions for the convergence of the fixed-point iteration (17), some definitions and assumptions are needed. They are introduced hereafter.
First of all, the following continuity assumption of the dynamics is used:
Assumption 1 (Continuous dynamics)
The map f describing the discrete-time state evolution (5) is continuous in its arguments.
Regarding the uncertain vector p, the following assumption is needed:
Assumption 2 (Finitely supported uncertainties)
There exists a compact set \mathbb{P} to which all possible realizations of the uncertain vector p belong.
This is quite a reasonable assumption when physically meaningful model parameters are invoked. Still, it is rigorously not the case for Gaussian distributions, for instance, under which arbitrarily high values remain admissible.
Definition 3.1 (Bounded excursion)
A map G defined on \mathbb{X}\times\mathbb{U} is said to have B-bounded excursions if the following inequality holds:
|G(f(x,u,p),v)-\mu(G(f(x,u,\cdot),v))|\leq B  (19)
for all (x,u,v,p)\in\mathbb{X}\times\mathbb{U}^{2}\times\mathbb{P}.
This definition simply states that the excursion of the realizations G(f(x,u,p),v) from the mean value \mu\bigl(G(f(x,u,\cdot),v)\bigr) is bounded. It goes without saying that, under Assumption 1, such a bound exists as soon as the map G is continuous, since all the involved bounding sets are compact.
Lemma 1 (FixedPoint convergence)
Under Assumptions 1 and 2, if the fixed-point map F invoked in (17) produces only continuous maps Q^{(i)} with B-bounded excursions for some B>0, then the fixed-point iteration converges on \mathbb{Z}:=\mathbb{X}\times\mathbb{U}, provided that the penalty \alpha on the variance term in (16) satisfies:
\alpha<\dfrac{1-\gamma}{2\gamma B}  (20)
and that the sampling period \tau used to derive the discrete-time dynamics (5) is sufficiently small.
Proof. See Appendix A.1.
From the above lemma, one can clearly derive the standard convergence result that is known to hold when no penalty on the variance is used (see for instance [7, 10, 3] and the references therein):
Corollary 2 (Convergence for pure expectation)
When \alpha=0, condition (20) is trivially satisfied, so that under Assumptions 1 and 2 the fixed-point iteration converges for any \gamma\in(0,1).
It is worth underlining that the convergence results cited above rely on the assumption that the successive iterates Q^{(i)} produced by the Bellman map F admit B-bounded excursions with a common upper bound B. The way this condition is enforced is discussed in the next section.
4 The proposed computation framework
4.1 Parametrization of Q(x,u)
In the previous section, the SDP equation and the related fixed-point iteration have been defined in a conceptual framework in which the unknown function Q(x,u) has no specific finite-dimensional representation compatible with concrete computation schemes. In this section, this parameterization aspect is addressed.
For the sake of simplicity, the notations z=(x,u) and \mathbb{Z}:=\mathbb{X}\times\mathbb{U} are used, so that Q(z) and f(z,p) stand for Q(x,u) and f(x,u,p) respectively. The finite-dimensional parameterization of Q is obtained through the following steps:

- First of all, a fixed grid of points \mathcal{Z}:=\{z^{(i)}\}_{i=1}^{n_{g}} is defined on the compact subset \mathbb{Z} of interest.
- Denote by \bm{q}:=\{q_{i}\}_{i=1}^{n_{g}} the vector of values of the function Q at the grid points, namely q_{i}=Q(z^{(i)}).
- Choose some regression model (polynomial, Gaussian, Support Vector Machine Regressor (SVMR), etc.) and denote it by \hat{Q}. As far as this work is concerned, an SVMR with Gaussian kernel is used, with the default parameters of the scikit-learn Machine Learning (ML) library [14].
- This choice enables one to associate to each vector of values \bm{q}\in\mathbb{R}^{n_{g}} a function \hat{Q}(\cdot|\bm{q}) defined on \mathbb{Z}, which is precisely the one identified (or learned, to use ML vocabulary) from the learning data (\mathcal{Z},\bm{q}).
Based on the above items, it is now possible to state the following definition:
Definition 4.1 (Admissible regression model)
A regression model is said to be admissible if and only if, for every finite set \mathcal{Z} of grid points in \mathbb{Z} and any compact set of function values \mathbb{Q}\subset\mathbb{R}^{n_{g}}, there exists B>0 such that all identified maps \hat{Q}(\cdot|\bm{q}) with \bm{q}\in\mathbb{Q} show B-bounded excursions in the sense of Definition 3.1.
4.2 Main convergence result
Using the above preliminary results and definitions, the main result regarding the convergence of the proposed fixed-point iteration scheme can be stated as follows:
Proposition 3 (Main result)
Under Assumptions 1 and 2, if an admissible regression model (in the sense of Definition 4.1) is used over some grid of points \mathcal{Z}:=\{z^{(i)}\}_{i=1}^{n_{g}} with the initialization q_{i}=L(z^{(i)}) to solve the stochastic optimal feedback problem of the combined therapy of cancer as stated in Section 2, then the resulting fixed-point iteration (17) with \gamma\in(0,1) is convergent, provided that the variance-related penalty \alpha and the sampling period \tau are taken sufficiently small.
Proof. See appendix A.2.
The only remaining task is to find a regression model that is admissible in the sense of Definition 4.1. Fortunately, there are many options. The one used in the remainder of this paper is the Support Vector Machine Regressor (SVMR), which has the following structure:
\hat{Q}(z|\bm{q}):=\beta_{0}(\bm{q})+\sum_{i=1}^{n_{g}}\beta_{i}(\bm{q})G(z^{(i)},z)  (21)
where the coefficients \beta_{i}(\bm{q}) admit a uniform bound when \bm{q} belongs to some compact set \mathbb{Q}.
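A minimal, dependency-free stand-in for the structure (21) is a regularized Gaussian-kernel (kernel-ridge) fit. The paper itself uses scikit-learn's SVR with default RBF settings, so the solver below and its parameters lam and gamma_rbf are only illustrative:

```python
import numpy as np

def fit_rbf_q(Z, q, lam=1e-6, gamma_rbf=50.0):
    """Kernel-ridge stand-in with the structure of (21):
    Qhat(z|q) = beta_0 + sum_i beta_i * G(z^(i), z), G a Gaussian kernel.
    (The paper uses scikit-learn's SVR; this is a self-contained sketch.)"""
    Z = np.asarray(Z, dtype=float)
    q = np.asarray(q, dtype=float)
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # pairwise squared dists
    G = np.exp(-gamma_rbf * d2)                           # kernel Gram matrix
    beta0 = q.mean()                                      # plays the role of beta_0(q)
    beta = np.linalg.solve(G + lam * np.eye(len(q)), q - beta0)
    return lambda z: beta0 + np.exp(-gamma_rbf * ((Z - z) ** 2).sum(-1)) @ beta
```

For a compact set of values q, the coefficients returned by the regularized solve are uniformly bounded, which is the property used in Definition 4.1.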
Remark 4
It is worth mentioning at this stage that the convergence of the fixed-point iteration does not tell us much about the distance between the resulting solution and the true optimal solution. This heavily depends on the density of the grid \mathcal{Z} on the one hand, and on the quality of the approximation of the statistical quantities involved in the very definition of the cost function on the other hand. These two aspects remain obviously at the origin of the curse of dimensionality that restricts the application of the underlying SDP approach to systems of moderate size.
4.3 Computation of the expectation and variance
The fixed-point iteration (17) requires the computation of the map \mathcal{S}_{\alpha}^{(Q)}(x,u,v), which involves expectations and variances of quantities over the probability space of parameter values. In this section, the way this is done concretely is explained.
Recall that the r.h.s. of the dynamics in (6) can be decomposed into two multiplicative terms, x^{+}=\Phi(x,u)\Psi(p), where only the second term is parameter-dependent. Consequently, computing the expectation of x^{+} requires the statistics of \Psi(p)\in\mathbb{R}^{14} to be approximated. This is done by identifying n_{cl} clusters based on a set \mathcal{S}_{\psi} of n_{s} samples of the vector \Psi(\cdot) drawn using the pdf \Pi. The centers of the computed clusters, denoted by \Psi^{(j)}, j=1,\dots,n_{cl}, can be used as representatives of their clusters. Moreover, the ratio \pi_{j} between the size of each cluster and the total population n_{s} can be used as a measure of its probability, namely
\pi_{j}:=\dfrac{1}{n_{s}}\,\mbox{card}\Bigl\{\psi\in\mathcal{S}_{\psi}\;\big|\;\psi\ \mbox{belongs to the cluster of center}\ \Psi^{(j)}\Bigr\}  (22)
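The cluster centers \Psi^{(j)} and the weights \pi_j of (22) can be obtained, for instance, with a plain Lloyd's k-means pass. The paper relies on scikit-learn for this step; the implementation below is a self-contained sketch:

```python
import numpy as np

def cluster_compress(samples, n_cl=20, n_iter=50, seed=0):
    """Compress n_s samples of Psi(p) into n_cl representatives via
    Lloyd's k-means; returns the centers Psi^(j) and the empirical
    probabilities pi_j of (22) (cluster sizes divided by n_s)."""
    rng = np.random.default_rng(seed)
    S = np.asarray(samples, dtype=float)
    centers = S[rng.choice(len(S), size=n_cl, replace=False)].copy()
    for _ in range(n_iter):
        # assign every sample to its nearest center ...
        d2 = ((S[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # ... then move each center to the mean of its cluster
        for j in range(n_cl):
            if np.any(labels == j):
                centers[j] = S[labels == j].mean(axis=0)
    pi = np.bincount(labels, minlength=n_cl) / len(S)
    return centers, pi
```

The pair (centers, pi) is then plugged into the approximations (23)-(24) in place of the full sample set.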
Based on the above definitions, the following approximations are used for the terms in (16):
\hat{\mu}\bigl(x,u,v|\bm{q}\bigr):=\sum_{j=1}^{n_{cl}}\pi_{j}\,\hat{Q}\Bigl(\Phi(x,u)\Psi^{(j)},v\,\big|\,\bm{q}\Bigr)  (23)
\hat{\sigma}\bigl(x,u,v|\bm{q}\bigr):=\sum_{j=1}^{n_{cl}}\pi_{j}\Bigl[\hat{Q}\Bigl(\Phi(x,u)\Psi^{(j)},v\,\big|\,\bm{q}\Bigr)-\hat{\mu}\bigl(x,u,v|\bm{q}\bigr)\Bigr]^{2}  (24)
Using these expressions, the weighted sum (16) can be approximated by:
\hat{\mathcal{S}}_{\alpha}^{(Q)}(x,u,v|\bm{q}):=\hat{\mu}\bigl(x,u,v|\bm{q}\bigr)+\alpha\,\hat{\sigma}\bigl(x,u,v|\bm{q}\bigr)  (25)
The updated values of q_{i} at the underlying grid points z^{(i)} can therefore be obtained through:
q^{+}_{i}:=L(z^{(i)})+\gamma\min_{v\in\mathbb{U}}\Bigl[\hat{\mathcal{S}}_{\alpha}^{(Q)}(z^{(i)},v|\bm{q})\Bigr]  (26)
for each element z^{(i)} of the underlying grid \mathcal{Z}. Note that the optimization problem in v can be solved by simple enumeration over the four elements of the set \mathbb{U}.
This implements a fixed-point iteration on the vector of values \bm{q} that can be written compactly as follows:
\bm{q}^{+}=F(\bm{q})  (27) 
where the notation F used in (17) is overloaded using the finite-dimensional parametrization \bm{q} of Q. This is the fixed-point iteration that is proved to be convergent under the assumptions of Proposition 3.
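Once the values \hat{Q}(\Phi(x,u)\Psi^{(j)},v|\bm{q}) have been tabulated for a given grid point, the update (23)-(26) reduces to a small weighted-moment computation followed by an enumeration over the admissible actions. A sketch (the array shapes are illustrative):

```python
import numpy as np

def s_hat(Q_vals, pi, alpha=0.1):
    """Shat_alpha of (25): cluster-weighted mean (23) plus alpha times
    the cluster-weighted variance (24), per candidate action v.
    Q_vals[j, v] holds Qhat(Phi(x,u)Psi^(j), v | q)."""
    mu = pi @ Q_vals                    # shape (n_v,): one mean per action
    var = pi @ (Q_vals - mu) ** 2       # weighted variance per action
    return mu + alpha * var

def q_update(L_i, Q_vals, pi, gamma=0.95, alpha=0.1):
    """Grid-point update (26): q_i^+ = L(z^(i)) + gamma * min_v Shat_alpha,
    the min over v being a simple enumeration over the admissible actions."""
    return L_i + gamma * float(np.min(s_hat(Q_vals, pi, alpha)))
```

Applying q_update to every grid point yields one sweep of the overloaded map (27).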
In Section 5, the instantiation of the framework on the combined therapy of cancer is detailed and the corresponding results are commented.
5 Numerical investigation on the combined therapy of cancer
5.1 Parameters used in the numerical investigation
As far as the stochastic controller designs are concerned (a nominal controller is also investigated), the above framework has been instantiated with the following parameters. A number n_{s}=10^{4} of samples have been drawn to perform the clustering of the set of values of \Psi (see Section 4.3). The number of identified clusters was taken equal to n_{cl}=20.
Regarding the pdf \Pi, the following definition has been used: p_{i}=(1+\nu)p_{i}^{nom}, where \nu follows a normal distribution with zero mean and variance \sigma=0.4. The grid \mathcal{Z} has been taken as the Cartesian product of a uniform grid on x (with 7 uniformly distributed values over the normalized interval [0,1] per component) and the four-element control set \mathbb{U}. This induces a set \mathcal{Z} composed of 7^{4}\times 4=9604 elements.
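The grid and parameter-sampling setup just described can be sketched as follows. Here p_nom is a placeholder for the nominal parameter vector, the normalized control levels are illustrative, and the dispersion 0.4 is used as the scale of the Gaussian perturbation (an assumption on how the paper's \sigma enters the sampler):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Grid Z: 7 uniform values per normalized state component, crossed with
# the four-element control set U, as described above.
x_vals = np.linspace(0.0, 1.0, 7)
X_grid = list(itertools.product(x_vals, repeat=4))
U = list(itertools.product([0.0, 1.0], repeat=2))   # normalized {0, u_max} levels
Z = [(x, u) for x in X_grid for u in U]             # 7**4 * 4 = 9604 points

# Parameter sampling p_i = (1 + nu) * p_i_nom, nu zero-mean Gaussian
# with scale 0.4 (assumption); p_nom is a placeholder vector here.
p_nom = np.ones(13)
nu = 0.4 * rng.standard_normal((10_000, 13))
P_samples = (1.0 + nu) * p_nom
```

The 10^4 rows of P_samples are what the clustering step of Section 4.3 compresses into n_cl = 20 representatives.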
Clustering has been performed using the scikit-learn library's k-nearest-neighbors utility, and the scikit-learn SVM regression model has been used with a radial-basis-function kernel and the default values of the remaining parameters. The sampling period \tau=0.25 day is used, and the lower bound x_{2}^{min}=0.05 is used for the normalized lymphocyte population size (the state components have been normalized using the reference values (10^{9},10^{9},1,10^{9}) respectively).
Three different controllers are compared, namely:
- The nominal controller, for which \Pi is a Dirac distribution centered at the nominal vector of parameters. This is referred to as Controller #1.
- The expectation-based controller, for which the distribution of the parameters is the one described in Section 5.1 and no penalty is put on the variance term (\alpha=0). This is referred to as Controller #2.
- The mixed expectation/variance-based controller, for which the penalty \alpha=0.1 is used on the variance-related term. This is referred to as Controller #3.
5.2 Results
Despite the fact that the convergence of the fixed-point iteration has been proved only for \gamma\in(0,1) and for sufficiently small \alpha and \tau, the value \gamma=1 has been used, to show that convergence does occur even for this ideal value (no discount on the cost function). On the other hand, \alpha=0.1 is used for Controller #3.
First of all, Figure 1 shows a typical convergence curve of the fixed-point iteration: the difference between two successive iterates, namely \|\bm{q}^{(i+1)}-\bm{q}^{(i)}\|_{\infty}, is shown. Note that, because \gamma=1 is used, no strict contraction is obtained at first, but contraction eventually sets in and the solution is obtained.
Figures 2 and 3 show the benefit of using the stochastic formulation compared to the nominal one. In order to show this, N_{sam}:=20000 different simulations are constructed by randomly drawing pairs of initial states and parameter vectors, and the resulting closed-loop cost is computed before statistics are drawn. More precisely, 100 initial states are sampled and, for each of these initial states, 200 random parameter vectors are drawn. Namely, for each given pair (x^{(i)},p^{(i)}), i=1,\dots,N_{sam}:
J_{cl}(x^{(i)},p^{(i)}):=\sum_{k=0}^{N_{sim}}L(x(k),u(k))  (28) 
where N_{sim}=50 is the simulation horizon length (12.5 days), while x(k) and u(k) stand for the closed-loop trajectories at instant k, starting at x^{(i)} at the initial instant k=0 and with the parameter vector p^{(i)}. Different statistical comparisons are then made when the above procedure is executed with the three controllers mentioned above, namely the nominal one, the stochastic controller based on the expectation term only (\alpha=0) and, finally, the stochastic controller using a penalty \alpha=0.1 on the variance.
More precisely, Figure 2 shows the histogram of the ratios between the closed-loop performance, as defined in (28), obtained with the stochastic controllers and that obtained with the nominal one. One can clearly observe two features:

- Both stochastic controllers lead to statistically better results than the nominal one.
- The stochastic Controller #3, which incorporates a penalty on the variance, shows better results than the one using only the expectation-related penalty term.

It seems clear that, even when focusing on the mean value of the cost function, Controller #3 shows a better average than Controller #2. This might seem contradictory, since the latter is precisely supposed to minimize this expectation. It can be conjectured that this comes from the fact that the expectation cannot be sufficiently well represented by the average over the 20 clusters (see Section 4.3) used to make the computation of the expectation and the variance tractable, and that the penalty on the variance corrects this bias in a favorable way.
Figure 3 shows the comparison between the average levels of the lymphocyte population size (compared to the nominal controller) when the stochastic controllers are used. Here again, it comes out that the performance of Controller #3 is slightly better than that of the expectation-only-based Controller #2.
Another way to look at constraint satisfaction is to count the number of scenarios in which the health constraint x_{2}\geq x_{2}^{min} is satisfied. Figure 4 shows the percentage of improvement in constraint satisfaction, compared to the nominal controller, when the stochastic controllers are used.
6 Conclusion and future works
In this paper, a tractable stochastic control design for the combined therapy of cancer has been proposed. The method is based on an approximate solution of the SDP equations through a value-function fixed-point iteration. The convergence of the latter, in the presence of a variance-related penalty in the cost function, is proved under mild technical assumptions. The results suggest that the inclusion of a variance-related term in the cost function might give better results by partially compensating for the errors induced by the approximation of the statistical quantities through computations involving a reduced number of clusters. Ongoing work involves the use of a GPU-based computation framework to handle sample sizes that are higher by an order of magnitude, in order to examine how these qualitative results scale with the size of the populations over which they are evaluated.
References
 [1] M. Alamir. On probabilistic certification of combined cancer therapies using strongly uncertain models. Journal of Theoretical Biology, 384:59 – 69, 2015.
 [2] M. Alamir. On the use of supervised clustering in stochastic NMPC design. CoRR, abs/1811.09069, 2018.
 [3] D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 2017.
 [4] S. Chareyron and M. Alamir. Mixed immunotherapy and chemotherapy of tumors: Feedback design and model updating schemes. Journal of Theoretical Biology, 45:444–454, 2009.
 [5] Tao Chen, Norman F. Kirkby, and Raj Jena. Optimal dosing of cancer chemotherapy using model predictive control and moving horizon state/parameter estimation. Computer Methods and Programs in Biomedicine, 108(3):973 – 983, 2012.
 [6] Lisette G. de Pillis, Weiqing Gu, and Ami E. Radunskaya. Mixed immunotherapy and chemotherapy of tumors: modeling, applications and biological interpretations. Journal of theoretical biology, 238(4):841–862, 2006.
 [7] T. Jaakkola, M. I. Jordan, and S. P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6):1185–1201, 1994.
 [8] K. Kassara and A. Moustafid. Angiogenesis inhibition and tumor-immune interactions with chemotherapy by a control set-valued method. Mathematical Biosciences, 231(2):135–143, 2011.
 [9] Kanchi Lakshmi Kiran and S. Lakshminarayanan. Global sensitivity analysis and modelbased reactive scheduling of targeted cancer immunotherapy. Biosystems, 101(2):117 – 126, 2010.
 [10] Robert Kirkby. Convergence of discretized value function iteration. Computational Economics, 49(1):117–153, Jan 2017.
 [11] U. Ledzewicz, H. Schattler, and A. d’Onofrio. Optimal control for combination therapy in cancer. In 47th IEEE Conference on Decision and Control, 2008.
 [12] A. Mesbah. Stochastic model predictive control with active uncertainty learning: A survey on dual control. Annual Reviews in Control, 45:107 – 117, 2018.
 [13] A. Mesbah, S. Streif, R. Findeisen, and R. Braatz. Stochastic nonlinear model predictive control with probabilistic constraints. In Proceedings of the American Control Conference ACC2014, June 2014.
 [14] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikitlearn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
 [15] G. W. Swan. General applications of optimal control in cancer chemotherapy. IMA Journal of mathematics Applied in Medicine & Biology, 5:303–316, 1988.
A Appendix
A.1 Proof of Lemma 1
Proof. We shall prove that, under the assumptions of the Lemma, the following inequality holds for the fixed-point map F:
\|F(Q_{1})-F(Q_{2})\|_{\infty}\leq\gamma^{\prime}\|Q_{1}-Q_{2}\|_{\infty}  (A.1) 
where
\|Q\|_{\infty}=\sup_{(x,u)\in\mathbb{X}\times\mathbb{U}}|Q(x,u)|  (A.2) 
for some \gamma^{\prime}\in[0,1), which is obviously sufficient to prove the result. To do so, let us begin by deriving upper bounds on |F(Q_{1})(x,u)-F(Q_{2})(x,u)|. Note that one has, by definition of F:
[F(Q_{1})-F(Q_{2})](x,u):=\gamma\Bigl[G_{\mu}(x,u)+\alpha G_{\sigma}(x,u)\Bigr]  (A.3) 
where G_{\mu} comes from the mean-related term in (16):
\displaystyle G_{\mu}(x,u):=  (A.4)  
\displaystyle\min_{v_{1}\in\mathbb{U}}\mu\Bigl[Q_{1}(f(x,u,\cdot),v_{1})\Bigr]-\min_{v_{2}\in\mathbb{U}}\mu\Bigl[Q_{2}(f(x,u,\cdot),v_{2})\Bigr] 
while G_{\sigma} is induced by the variance-related term in (16), namely:
\displaystyle G_{\sigma}(x,u):=  (A.5)  
\displaystyle\min_{v_{1}\in\mathbb{U}}\mu\Biggl[\Bigl(Q_{1}(f(x,u,\cdot),v_{1})-\mu\bigl[Q_{1}(f(x,u,\cdot),v_{1})\bigr]\Bigr)^{2}\Biggr]  
\displaystyle-\min_{v_{2}\in\mathbb{U}}\mu\Biggl[\Bigl(Q_{2}(f(x,u,\cdot),v_{2})-\mu\bigl[Q_{2}(f(x,u,\cdot),v_{2})\bigr]\Bigr)^{2}\Biggr] 
Let us consider successively upper bounds on G_{\mu} and G_{\sigma}.
Bounding G_{\mu}(x,u):
By using the pdf \Pi, it is possible to write:
\displaystyle|G_{\mu}(x,u)|\leq  
\displaystyle\intop\Pi(p)\left|Q_{1}(f(x,u,p),v_{1}^{*})-Q_{2}(f(x,u,p),v_{2}^{*})\right|dp 
in which v_{1}^{*} and v_{2}^{*} are the solutions of the two optimization problems invoked in (A.4). Now since, by definition:
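(The inequality above also uses the elementary fact, implicit in the argument, that a difference of minima is dominated by the largest pointwise difference. With the shorthand a(v):=\mu[Q_{1}(f(x,u,\cdot),v)] and b(v):=\mu[Q_{2}(f(x,u,\cdot),v)], which is notation introduced here only for this remark, a one-line justification reads:)

```latex
% If v_1^* minimizes a and v_2^* minimizes b, then
%   a(v_1^*) - b(v_2^*) \le a(v_2^*) - b(v_2^*)
% and, symmetrically,
%   b(v_2^*) - a(v_1^*) \le b(v_1^*) - a(v_1^*),
% so that
\Bigl|\min_{v_1\in\mathbb{U}} a(v_1)-\min_{v_2\in\mathbb{U}} b(v_2)\Bigr|
\;\leq\;\max_{v\in\mathbb{U}}\bigl|a(v)-b(v)\bigr|.
```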
(f(x,u,p),v^{*}_{i})\in(1+O(\tau))\mathbb{X}\times\mathbb{U} 
it follows from the previous inequality that:
\displaystyle|G_{\mu}(x,u)|\leq  \displaystyle(1+O(\tau))\intop\Pi(p)\|Q_{1}-Q_{2}\|_{\infty}\,dp  (A.6)  
\displaystyle=  \displaystyle(1+O(\tau))\|Q_{1}-Q_{2}\|_{\infty}  (A.7) 
Bounding G_{\sigma}(x,u)
Here again, denoting by v_{1}^{\dagger} and v_{2}^{\dagger} the solutions to the two optimization problems invoked in (A.5), one can write after straightforward manipulations (mainly using the identity a^{2}-b^{2}=(a-b)(a+b)):
\displaystyle G_{\sigma}(x,u)=  
\displaystyle\intop\Pi(p)\Bigl[Q_{1}(f(x,u,p),v_{1}^{\dagger})-Q_{2}(f(x,u,p),v_{2}^{\dagger})  
\displaystyle-\mu\bigl(Q_{1}(f(x,u,\cdot),v_{1}^{\dagger})-Q_{2}(f(x,u,\cdot),v_{2}^{\dagger})\bigr)\Bigr]\times  
\displaystyle\Bigl[Q_{1}(f(x,u,p),v_{1}^{\dagger})+Q_{2}(f(x,u,p),v_{2}^{\dagger})  
\displaystyle-\mu\bigl(Q_{1}(f(x,u,\cdot),v_{1}^{\dagger})+Q_{2}(f(x,u,\cdot),v_{2}^{\dagger})\bigr)\Bigr]dp 
Therefore,
\displaystyle|G_{\sigma}(x,u)|\leq  
\displaystyle\intop\Pi(p)\Bigl[Q_{1}(f(x,u,p),v_{1}^{\dagger})-Q_{2}(f(x,u,p),v_{2}^{\dagger})\Bigr]\times  
\displaystyle\Bigl[Q_{1}(f(x,u,p),v_{1}^{\dagger})+Q_{2}(f(x,u,p),v_{2}^{\dagger})  
\displaystyle-\mu\bigl(Q_{1}(f(x,u,\cdot),v_{1}^{\dagger})+Q_{2}(f(x,u,\cdot),v_{2}^{\dagger})\bigr)\Bigr]dp  
\displaystyle+  
\displaystyle\mu\bigl(Q_{1}(f(x,u,\cdot),v_{1}^{\dagger})-Q_{2}(f(x,u,\cdot),v_{2}^{\dagger})\bigr)\times  
\displaystyle\intop\Pi(p)\Bigl[Q_{1}(f(x,u,p),v_{1}^{\dagger})+Q_{2}(f(x,u,p),v_{2}^{\dagger})  
\displaystyle-\mu\bigl(Q_{1}(f(x,u,\cdot),v_{1}^{\dagger})+Q_{2}(f(x,u,\cdot),v_{2}^{\dagger})\bigr)\Bigr]dp 
Note that the second term in the above sum vanishes by definition of the mean of Q_{1}+Q_{2}. As for the first term, it can be bounded using the fact that Q_{1} and Q_{2} both have finite excursion on \mathbb{X}\times\mathbb{U}. Indeed,
\displaystyle\Bigl|Q_{1}(f(x,u,p),v_{1}^{\dagger})+Q_{2}(f(x,u,p),v_{2}^{\dagger})  
\displaystyle-\mu\bigl(Q_{1}(f(x,u,\cdot),v_{1}^{\dagger})+Q_{2}(f(x,u,\cdot),v_{2}^{\dagger})\bigr)\Bigr|\leq 2B 
Using this in the first term of the last inequality bounding G_{\sigma}(x,u) leads to
|G_{\sigma}(x,u)|\leq 2B(1+O(\tau))\|Q_{1}-Q_{2}\|_{\infty}  (A.8) 
Using (A.7) and (A.8) in (A.3) yields
\|F(Q_{1})-F(Q_{2})\|_{\infty}\leq\gamma\bigl(1+2\alpha B+O(\tau)\bigr)\|Q_{1}-Q_{2}\|_{\infty}  (A.9) 
which obviously means that, for all \alpha satisfying condition (20), the fixed-point map is contractive and the iteration is hence convergent. \hfill\Box
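As a sanity check, the contraction inequality (A.9) can be probed numerically on a toy finite analogue of the fixed-point map. The sketch below is illustrative only: the finite state/action grid, the parameter pmf standing in for \Pi, and the successor-state lookup table standing in for f(x,u,p) are all assumptions of this sketch, not part of the paper's model (and the stage cost, which cancels in the difference, is omitted).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: finite state/action sets and a random transition table
# standing in for x+ = f(x, u, p); P plays the role of the pdf Pi.
nx, nu, np_ = 5, 3, 4            # states, actions, parameter samples
gamma, alpha = 0.9, 0.02         # discount factor and variance weight
P = rng.random(np_); P /= P.sum()              # parameter pmf
nxt = rng.integers(0, nx, size=(nx, nu, np_))  # successor-state lookup table

def F(Q):
    """One application of the fixed-point map (stage cost dropped): for each
    (x,u), gamma times the minimized mean plus alpha times the minimized
    variance of Q over the successor states, as in the map of (A.3)."""
    out = np.empty_like(Q)
    for x in range(nx):
        for u in range(nu):
            succ = Q[nxt[x, u], :]            # (np_, nu) values Q(f(x,u,p), v)
            mean_v = P @ succ                 # mean over p, for each v
            var_v = P @ (succ - mean_v) ** 2  # variance over p, for each v
            out[x, u] = gamma * (mean_v.min() + alpha * var_v.min())
    return out

Q1 = rng.random((nx, nu))
Q2 = rng.random((nx, nu))
B = max(Q1.max(), Q2.max())      # bound on the excursion of Q1 and Q2
lhs = np.abs(F(Q1) - F(Q2)).max()
rhs = gamma * (1 + 2 * alpha * B) * np.abs(Q1 - Q2).max()
print(f"gap = {lhs:.4f}  <=  bound = {rhs:.4f}")
```

With these values, \gamma(1+2\alpha B)<1, so the map is a contraction on this toy instance and the printed gap never exceeds the bound, in line with (A.9).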
A.2 Proof of Proposition 3
All we have to prove is that, when the q_{i} are initialized as stated in the proposition, all subsequent values of \bm{q} lie in some compact set \mathbb{Q}. Once this is shown, the result follows directly from Definition 4.1, which guarantees that some unique B exists which can be used in Lemma 1 to conclude.
To prove the existence of the compact set \mathbb{Q}, the boundedness of the state evolution of the combined therapy of cancer is invoked; that is to say, under any bounded sequence \bm{u}\in\mathbb{U}^{\infty}, the state vector belongs at any future instant to some bounded set \bar{\mathbb{X}} (this is a known property of the model (2.1) that can easily be shown by proving that, for each component of the state, the r.h.s. becomes negative when the component is sufficiently large).
Having this property at hand and given the initialization q_{i}=L(z^{(i)}), it follows from the definition of the fixed-point iteration that the values q_{i} at any iteration satisfy the inequality:
q_{i}\leq\dfrac{\gamma}{1-\gamma}\left[\max_{z\in\bar{\mathbb{X}}\times\mathbb{U}}L(z)\right]  (A.10) 
which obviously means that the fixed-point iteration produces vectors that always remain in some compact set \mathbb{Q}, which ends the proof. \hfill\Box
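For illustration, the invariance bound (A.10) can be reproduced on a small finite analogue. Everything in the sketch below is a simplifying assumption of this sketch rather than the paper's exact scheme: the grid, the stage cost L, the transition table, and the recursion q <- gamma*(L + min-mean), which stands in for the fixed-point iteration with the variance term omitted. Starting from q = L, all iterates should stay below gamma/(1-gamma) times max L.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy finite analogue: (x, u) grid, bounded stage cost L and a random
# transition table standing in for x+ = f(x, u, p) on the invariant set X_bar.
nx, nu, np_ = 6, 3, 5
gamma = 0.9
L = rng.random((nx, nu))                        # bounded stage cost on X_bar x U
P = np.full(np_, 1.0 / np_)                     # uniform pmf over parameter samples
nxt = rng.integers(0, nx, size=(nx, nu, np_))   # successor-state lookup table

def iterate(q):
    """One sweep of q <- gamma * (L + min_v mean_p q(f(x,u,p), v)),
    a simplified, variance-free analogue of the paper's fixed-point map."""
    out = np.empty_like(q)
    for x in range(nx):
        for u in range(nu):
            succ = q[nxt[x, u], :]              # (np_, nu) successor q-values
            out[x, u] = gamma * (L[x, u] + (P @ succ).min())
    return out

q = L.copy()                                    # initialization q_i = L(z^(i))
M = gamma / (1.0 - gamma) * L.max()             # the compactness bound (A.10)
inside = True
for _ in range(200):
    q = iterate(q)
    inside = inside and q.max() <= M + 1e-12
print("all iterates within the bound:", inside)
```

The bound is preserved by induction: if q <= M componentwise, then the next iterate is at most gamma*(max L + M), which stays below M precisely when M >= gamma/(1-gamma) * max L; here gamma = 0.9 also gives M >= max L, so the initialization q = L starts inside the set.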