Practical Reinforcement Learning of Stabilizing Economic MPC
Abstract
Reinforcement Learning (RL) has demonstrated huge potential for learning optimal policies without any prior knowledge of the process to be controlled. Model Predictive Control (MPC) is a popular control technique which is able to deal with nonlinear dynamics as well as state and input constraints. The main drawback of MPC is the need to identify an accurate model, which in many cases cannot be easily obtained. Because of model inaccuracy, MPC can fail to deliver satisfactory closed-loop performance. Using RL to tune the MPC formulation or, conversely, using MPC as a function approximator in RL allows one to combine the advantages of the two techniques. This approach has important advantages, but it requires an adaptation of the existing algorithms. We therefore propose an improved RL algorithm for MPC and test it in simulations on a rather challenging example.
I Introduction
Reinforcement learning (RL) is a model-free control technique which recursively updates the controller parameters in order to achieve optimality. Once the controller parameters have been learned, the controller implements the control action which yields an infinite-horizon optimal cost for each initial condition [1]. RL has drawn increasing attention thanks to the striking results obtained in beating chess and Go masters [2], and in learning how to make a robot walk or fly without supervision [3, 4].
In order to be able to solve the RL problem in practice, function approximation strategies must be employed. Such approximations typically rely on a set of basis functions, or features, and corresponding parameters that multiply them; in that case, the function approximator is linear in the parameters. Nonlinear function approximators, such as Deep Neural Networks (DNNs), are also commonly deployed.
Model Predictive Control (MPC) is a model-based technique which exploits a model of the system dynamics to predict the system's future behavior and optimize a given performance index, possibly subject to input and state constraints [5, 6, 7]. The success of MPC is due to its ability to enforce constraints and yield optimal trajectories. While a plethora of efficient algorithms for the online solution of MPC problems has been developed, the main drawback of this control technique is the need to identify the open-loop model offline, which is typically the most time-consuming phase of control design.
The advantages of MPC and RL can be combined by framing MPC as a function approximator within an RL context. The tuning parameters of the MPC problem (e.g., cost weighting matrices, model parameters, etc.) can be framed as parameters of a nonlinear function approximator, namely the MPC optimization problem itself. This setup can be adopted to approximate the feedback policy, the value function, or the action-value function.
The main advantages of combining MPC with RL are (a) the ease of introducing a constraint-enforcing policy within RL, (b) the possibility of improving existing models and controllers by using them as an initial guess in RL, and (c) the possibility of interpreting the learned controller in a model-based framework.
The combination of learning and control techniques has been proposed in, e.g., [8, 9, 10, 11]. Some attempts at combining RL and the linear quadratic regulator have been presented in [12, 13]. To the best of our knowledge, however, [14] is the first work proposing to use NMPC as a function approximator in RL.
In this paper, building on the ideas from [14], we further analyze the combination of RL and MPC. The main contribution of this paper is an improved algorithmic framework, tailored to the problem formulation, which overcomes some shortcomings of basic RL algorithms. We demonstrate the potential of our approach in simulations on an involved nonlinear example from the process industry.
This paper is structured as follows. Section II introduces Q-learning. The main contributions of this paper are presented in Section III, which discusses the use of MPC as a function approximator in RL, and Section IV, which presents an algorithm adaptation tailored to the problem. A numerical example is given in Section V. The paper is concluded by Section VI, which also outlines future research directions.
II Reinforcement Learning
Consider a Markov Decision Process (MDP) with state transition dynamics $\rho(s_+ \mid s, a)$, where $s$ and $a$ denote the states and actions (or controls), respectively. We also introduce the stage cost $\ell(s,a)$ and the discount factor $\gamma \in (0,1]$. The action-value function $Q_\star$ and value function $V_\star$ associated with the optimal policy $\pi_\star$ are defined by the Bellman equations:
(1a) $Q_\star(s,a) = \ell(s,a) + \gamma\,\mathbb{E}\left[ V_\star(s_+) \mid s, a \right]$
(1b) $V_\star(s) = \min_{a} Q_\star(s,a)$
Q-learning parametrizes the action-value function as $Q_\theta(s,a)$, where $\theta$ is a vector of parameters whose values have to be learned, and aims at minimizing the mismatch between $Q_\theta$ and $Q_\star$. Given a state-action pair $(s,a)$ and the next state $s_+$, standard Q-learning algorithms [1] update the parameter $\theta$ using
(2a) $\delta = \ell(s,a) + \gamma V_\theta(s_+) - Q_\theta(s,a), \qquad V_\theta(s_+) = \min_{a_+} Q_\theta(s_+, a_+)$
(2b) $\theta \leftarrow \theta + \alpha\,\delta\,\nabla_\theta Q_\theta(s,a)$
where $\delta$ is known as the Temporal-Difference (TD) error and $\alpha > 0$ is the learning rate. In batch policy updates, $\theta$ is kept constant for $n$ steps, after which it is updated using the $n$ collected samples. In instantaneous policy updates, $n = 1$, such that $\theta$ is updated at every time instant.
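For a linear-in-the-parameters approximator, the TD update (2a)-(2b) can be sketched as follows. This is a minimal illustration, not the paper's setup: the feature map `phi`, the action set, and the numerical values are assumptions chosen for the example.

```python
import numpy as np

def td_update(theta, phi, s, a, s_next, actions, stage_cost, gamma, alpha):
    """One instantaneous Q-learning step for Q_theta(s, a) = phi(s, a) @ theta."""
    q = phi(s, a) @ theta
    # V_theta(s_next) = min over actions of Q_theta(s_next, a_next)
    v_next = min(phi(s_next, a_next) @ theta for a_next in actions)
    delta = stage_cost(s, a) + gamma * v_next - q   # TD error (2a)
    return theta + alpha * delta * phi(s, a)        # semi-gradient step (2b)

# Tiny sanity check: a single state and action with unit stage cost and
# gamma = 0.5, whose action-value is 1 / (1 - gamma) = 2.
theta = np.zeros(1)
phi = lambda s, a: np.array([1.0])
for _ in range(200):
    theta = td_update(theta, phi, 0, 0, 0, [0], lambda s, a: 1.0, 0.5, 0.5)
```

Iterating the update drives `theta` to the fixed point of the Bellman equation for this toy chain.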
It is important to underline that, while in this paper we focus on Q-learning (which has been successfully deployed in several applications, e.g., [15, 16, 17]), our approach can be readily deployed on other TD algorithms such as SARSA [18]. Even approaches that directly optimize the policy often need to learn the action-value function [19, 20], such that the developments of this paper apply to a fairly broad class of RL algorithms.
While a very common choice is to use Deep Neural Networks (DNNs) as function approximators, it is hard to analyze closed-loop stability in DNN-based RL. Moreover, in case a controller is already available, the information on how to control the system cannot easily be incorporated in the problem formulation. For these reasons, we propose to use MPC as a function approximator: it makes it possible to enforce closed-loop stability guarantees, and an existing controller can be used as an initial guess for the algorithm.
III MPC-Based RL
The use of MPC to parametrize the action-value function was first advocated in [14]. In the following, we first present the function approximation used and recall the most important result in Theorem 1; we then further analyze the properties of MPC-based RL, and in the next section we propose a new variant of the Q-learning algorithm tailored to MPC-based RL.
III-A Parametrization of the Function Approximations
We parametrize the action-value function using an MPC scheme of the form

(3a) $Q_\theta(s,a) = \min_{x,u,\sigma}\ \ \lambda_\theta(x_0) + \gamma^N \big( V^{\mathrm{f}}_\theta(x_N) + w_{\mathrm{f}}^\top \sigma_N \big) + \sum_{k=0}^{N-1} \gamma^k \big( \ell_\theta(x_k,u_k) + w^\top \sigma_k \big)$
(3b) $\mathrm{s.t.}\ \ x_0 = s, \quad u_0 = a,$
(3c) $x_{k+1} = f_\theta(x_k, u_k),$
(3d) $g(u_k) \le 0,$
(3e) $h_\theta(x_k) \le \sigma_k, \quad \sigma_k \ge 0,$
where $x = (x_0,\ldots,x_N)$, $u = (u_0,\ldots,u_{N-1})$, and $\sigma = (\sigma_0,\ldots,\sigma_N)$ are the primal decision variables. Consequently, we obtain $V_\theta(s) = \min_a Q_\theta(s,a)$ and $\pi_\theta(s) = \arg\min_a Q_\theta(s,a)$.
Note that the policy $\pi_\theta$ and value function $V_\theta$ are equivalently obtained by solving Problem (3) with the constraint $u_0 = a$ removed. In order to address feasibility issues in Problem (3), in the presence of state-dependent constraints we adopt an exact relaxation of such constraints [21].
Remark 1
MPC formulations in which the stage cost is not positive definite are commonly referred to as Economic MPC. In order to enforce closed-loop stability and the existence of the MPC solution, we assume that $\ell_\theta$ and $V^{\mathrm{f}}_\theta$ are positive definite functions. Therefore, in order to deal with the situation in which the true stage cost $\ell$ is not positive definite, we have introduced the arrival penalty $\lambda_\theta(x_0)$. For all details on this topic, we refer to [14] and references therein.
Remark 2
The proposed setup readily accommodates formulations in which the cost penalizes deviations from a given reference. In that case, the stage cost (both in RL and MPC) depends on a reference, passed to the problem as an exogenous signal. Input-output model formulations also readily fit in the proposed framework.
Among the desirable properties of the proposed formulation, we mention nominal stability guarantees and the possibility of introducing constraints accounting for, e.g., actuator limitations and safe operation of the system. While we are aware that the current framework is not able to fully exploit these advantages, with the presented developments we aim at constructing a sound basis for future research, with the intent of developing self-tuning, safe, and stable economic MPC controllers.
III-B Learning the Model: RL and System Identification
We recall the following fundamental theorem from [14], which states that the optimal value and action-value functions, as well as the optimal policy, can be learned even by using an MPC scheme based on a state transition model $f_\theta$ which differs from the true dynamics $\rho$. This also entails that there is no guarantee (and no need) that the RL algorithm will learn a physically meaningful model.
Theorem 1 ([14])
Consider a given (possibly stochastic) state transition model $\hat f$, possibly different from the true model $\rho$. Define the optimal value function

$\hat V(s) = \min_{\pi}\ \mathbb{E}\left[ \gamma^N \hat V^{\mathrm{f}}(\hat x_N) + \sum_{k=0}^{N-1} \gamma^k \hat\ell\big(\hat x_k, \pi(\hat x_k)\big) \ \middle|\ \hat x_0 = s \right]$

associated with stage cost $\hat\ell$ and terminal cost $\hat V^{\mathrm{f}}$ over an optimization horizon $N$, where $\hat x_k$ denotes the (possibly stochastic) trajectories of the state transition model $\hat f$ under a policy $\pi$, starting from $\hat x_0 = s$. Let $\hat\pi$ be the optimal policy associated with $\hat V$, and $\hat Q$ the associated action-value function. Consider the set $\Omega$ of states over which the value functions are bounded. Then, there exist $\hat\ell$ and $\hat V^{\mathrm{f}}$ such that the following identities hold on $\Omega$:

$\hat V = V_\star, \qquad \hat\pi = \pi_\star, \qquad \hat Q(s,a) = Q_\star(s,a)$

for the inputs $a$ such that $Q_\star(s,a)$ is bounded.
We can now further clarify the theorem and formalize a form of “orthogonality” between reinforcement learning and system identification.
Corollary 2
Let us split the parameter as $\theta = (\theta_{\mathrm{m}}, \theta_{\mathrm{c}})$, such that the model $f_\theta$ only depends on $\theta_{\mathrm{m}}$, and assume a perfect parametrization, such that, under adequate exploration of the state-action space, RL learns $V_\star$ and $Q_\star$ perfectly for any fixed $\theta_{\mathrm{m}}$. Then, leaving the task of identifying $\theta_{\mathrm{m}}$ to a separate identification algorithm is not detrimental to the learning of the optimal value function, action-value function, and policy.
Unfortunately, Corollary 2 applies only to the case of perfect parametrization. In practice, function approximation with imperfect parametrization and lack of excitation can destroy this form of orthogonality. Ongoing research is currently further investigating such aspects.
III-C On the Parametrization of $V_\theta$ and $Q_\theta$
Since the Bellman principle of optimality states that $Q_\star(s,a) = \ell(s,a) + \gamma\,\mathbb{E}\left[ V_\star(s_+) \mid s, a \right]$, one could be tempted to replace $Q_\theta(s,a)$ by $\ell_\theta(s,a) + \gamma V_\theta(f_\theta(s,a))$ in the computation of the TD error. We discuss in the following lemma why this approach can minimize the TD error while not learning the correct action-value function and, consequently, not delivering the optimal policy.
Lemma 3
Consider computing the TD error (2a) by replacing $Q_\theta(s,a)$ with $\ell_\theta(s,a) + \gamma V_\theta(f_\theta(s,a))$ to obtain

(4) $\delta = \ell(s,a) + \gamma V_\theta(s_+) - \ell_\theta(s,a) - \gamma V_\theta\big(f_\theta(s,a)\big).$

Then, it is possible to obtain $\delta = 0$ without minimizing the error $Q_\theta - Q_\star$ and, therefore, without learning the optimal policy.
We prove the Lemma by a simple counterexample. Consider the LQR case, with true dynamics $s_+ = A s + B a$, stage cost $\ell(s,a) = s^\top Q s + a^\top R a$, model $f_\theta(s,a) = \bar A s + \bar B a$, parametrized stage cost $\ell_\theta(s,a) = s^\top \bar Q s + a^\top \bar R a$, and value function $V_\theta(s) = s^\top P s$. Then, (4) reads as

$\delta = s^\top Q s + a^\top R a + \gamma\, s_+^\top P s_+ - s^\top \bar Q s - a^\top \bar R a - \gamma\, (\bar A s + \bar B a)^\top P\, (\bar A s + \bar B a).$

Assume now that the correct model is identified, i.e., $\bar A = A$, $\bar B = B$, $\bar Q = Q$, $\bar R = R$. Then, the TD error is identically zero and hence independent of $P$, which can be chosen as desired. This implies that, even though we use a perfect parametrization of the action-value function, $Q_\theta \neq Q_\star$ in general. Consequently, the learned feedback given by

(5) $\pi_\theta(s) = \arg\min_a\ \ell_\theta(s,a) + \gamma V_\theta\big(f_\theta(s,a)\big)$

is not uniquely defined.
For comparison, we consider now the standard case, in which the TD error is computed using

$\delta = \ell(s,a) + \gamma\, Q_\theta\big(s_+, \pi_\theta(s_+)\big) - Q_\theta(s,a),$

where $\pi_\theta$ is given by (5). In this case, if $\delta = 0$ for all $(s,a)$, then $P$ must solve the algebraic Riccati equation associated with the given stage cost and model. It is important to stress, however, that infinitely many solutions $(\bar A, \bar B, P)$ exist in general, such that the true model will not be identified.
III-D Condensed Model-Free Parametrization
Motivated by the previous considerations on the possibility of learning a wrong model, we propose next a formulation which is truly model-free, as it does not include model parameters as coefficients to be learned. For simplicity, we only focus here on the case of linear dynamics with a quadratic cost. We then define

(6a) $Q_\theta(s,a) = \min_{u_1,\ldots,u_{N-1}}\ \ z^\top W z + w^\top z$
(6b) $\mathrm{s.t.}\ \ G z \le g,$
(6c) $z = (s, a, u_1, \ldots, u_{N-1}),$

where $\theta$ collects $(W, w, G, g)$ and $W$ is a symmetric (typically dense) matrix. Problem formulation (6) is sometimes used in MPC and optimal control, in which case the parameters are related to the system dynamics and cost [22]. In this case, the dependence of $V_\theta$ and $Q_\theta$ on the parameters is less nonlinear than in the model-based parametrization (3). However, the introduction of a prediction horizon possibly introduces more parameters than the standard formulation. A special case occurs when there are only input constraints and they are known: in this case $G$, $g$ can be fixed, and only $W$, $w$ need to be learned. The LQR case simplifies to
(7) $Q_\theta(s,a) = z^\top W z, \qquad z = (s, a),$

such that $V_\theta(s) = \min_a Q_\theta(s,a)$ and, partitioning $W$ into its state and action blocks, $\pi_\theta(s) = -W_{aa}^{-1} W_{as}\, s$.
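The greedy policy for the condensed LQR parametrization (7) only requires the partitioning of the learned matrix; a sketch, with illustrative dimensions and data:

```python
import numpy as np

def greedy_action(M, ns, s):
    """Minimize [s; a]' M [s; a] over a, for a learned symmetric matrix M.

    ns is the state dimension; the minimizer is a = -M_aa^{-1} M_as s,
    assuming the action-action block M_aa is positive definite.
    """
    M_as = M[ns:, :ns]      # action-state block
    M_aa = M[ns:, ns:]      # action-action block
    return -np.linalg.solve(M_aa, M_as @ s)

# Illustrative 1-state / 1-input example
M = np.array([[2.0, 1.0],
              [1.0, 2.0]])
a = greedy_action(M, 1, np.array([4.0]))   # = -(1/2) * 4
```

Note that no model appears: the dynamics are implicitly encoded in the learned off-diagonal block of $W$ (here `M`).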
IV Algorithm
The main practical difficulties related to using MPC as a function approximator in RL are that (a) the function approximator is nonlinear in the parameters $\theta$, and (b) the MPC problem is guaranteed to have a meaningful solution only if the cost is positive definite. In this section, we propose an algorithm that addresses both issues.
We start by recalling that the main motivation for the stochastic gradient approach typically used in Q-learning stems from an equivalence with the table-lookup case, i.e., when the state-action space is discrete and the action-value function is parametrized as

$Q_\theta(s,a) = \sum_{i,j} \theta_{ij}\, \phi_{ij}(s,a),$

with $\phi_{ij}(s,a) = 1$ if $(s,a) = (s_i, a_j)$ and $\phi_{ij}(s,a) = 0$ otherwise. In this case, $Q_\theta$ is linear in the parameter $\theta$. Parameter $\alpha$ in (2b) is used in order to approximate the expected value of the TD error: one can roughly interpret it as the inverse of the number of samples over which the average is computed.
In this case, the update (2b) obtained with the (exact) function approximation matches that of the enumeration

$Q(s_i, a_j) \leftarrow Q(s_i, a_j) + \alpha\, \delta.$
The update (2b) can also be written as $\theta \leftarrow \theta + \Delta\theta$, with

$\Delta\theta = \alpha\,\delta\,\nabla_\theta Q_\theta(s,a),$

i.e., a damped (semi-)gradient step on the squared TD error in which the target $\ell(s,a) + \gamma V_\theta(s_+)$ is kept fixed. We remark that the equivalence with the table-lookup update is lost if $\phi$ is not normalized or if $Q_\theta$ is a nonlinear function of the parameter $\theta$. This also implies that parameter $\alpha$ loses its original meaning, as it is also used to dampen the stochastic gradient step in order to enforce convergence.
We propose to apply the update $\theta \leftarrow \theta + \Delta\theta$, where $\Delta\theta$ is the solution of
(8a) $\min_{\Delta\theta}\ \ \sum_{i=1}^{n} \frac{1}{2} \big( \ell(s_i, a_i) + \gamma V_\theta(s_{+,i}) - Q_{\theta+\Delta\theta}(s_i, a_i) \big)^2$
(8b) $\mathrm{s.t.}\ \ \nabla^2 \ell_{\theta+\Delta\theta} \succ 0, \quad \nabla^2 V^{\mathrm{f}}_{\theta+\Delta\theta} \succ 0,$
Note that we solve Problem (8) to full convergence at each step. In case of a linear parametrization of $Q_\theta$, and provided that Constraints (8b) are inactive, convergence is obtained in one step by using a Gauss-Newton Hessian approximation. This also directly solves the scaling issue present in the table-lookup case when $\phi$ is not normalized. Finally, if, instead of solving Problem (8) to full convergence, Constraints (8b) are neglected and a single damped Newton-type step is taken, the update (2b) is recovered.
We highlight next the main features of Formulation (8):

- Globalization: globalization strategies such as line search ensure descent and, therefore, convergence. Descent is typically not enforced in stochastic gradient approaches such as (2), which can take a step yielding a larger TD error than the previous parameter estimate.

- Positive-definiteness enforcement: Constraints (8b) guarantee that the cost is positive definite and, therefore, that the MPC problem is stabilizing and well posed. Without this constraint, even when starting from a well-posed, positive-definite cost, the RL iterates might produce an indefinite cost yielding an unbounded solution. We have written the positive-definiteness Constraints (8b) using the Hessians of $\ell_\theta$ and $V^{\mathrm{f}}_\theta$, which is reasonable if both functions are quadratic. However, other approaches relying on richer function approximations are possible, e.g., sum-of-squares techniques.

- Best fit: by solving Problem (8) to full convergence, the step is the one which best fits the sampled TD errors, analogously to the lookup-table case. Therefore, the batch size can be chosen purely based on considerations about the expected-value approximation.
In summary, the main differences with the standard update (2b) are: (a) positive-definiteness enforcement, (b) insensitivity to parameter scaling, and (c) guarantee of improvement at each step through globalization and full convergence.
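A minimal sketch of the batch step (8), assuming a linear parametrization: a least-squares fit of the recorded TD errors, followed by an eigenvalue projection standing in for the positive-definiteness constraint (8b). The projection (eigenvalue clipping) is a swapped-in technique for illustration, not the constrained solver used in the paper; all names are assumptions.

```python
import numpy as np

def batch_step(grads, deltas, damping=1e-8):
    """Gauss-Newton step: argmin_d sum_i (deltas[i] - grads[i] @ d)^2."""
    G = np.asarray(grads)    # rows: gradient of Q_theta at each sample
    d = np.asarray(deltas)   # recorded TD errors
    H = G.T @ G + damping * np.eye(G.shape[1])
    return np.linalg.solve(H, G.T @ d)

def project_psd(W, eps=1e-6):
    """Clip eigenvalues of a symmetric matrix to keep the cost positive definite."""
    W = 0.5 * (W + W.T)
    vals, vecs = np.linalg.eigh(W)
    return vecs @ np.diag(np.maximum(vals, eps)) @ vecs.T

# Two orthogonal samples: the fitted step reproduces the TD errors exactly
step = batch_step([[1.0, 0.0], [0.0, 1.0]], [2.0, 3.0])
W = project_psd(np.diag([1.0, -1.0]))   # indefinite matrix pushed back to PD
```

Solving the fit to full convergence (rather than taking a single damped gradient step) is what makes the update insensitive to parameter scaling.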
IV-A Derivative Computation
We detail next how to compute the derivatives of the action-value function with respect to the parameters. To this end, we define the Lagrangian function underlying the NMPC problem (3) as

$\mathcal{L}_\theta(y) = \Phi_\theta(x,u,\sigma) + \chi^\top (x_0 - s) + \zeta^\top (u_0 - a) + \sum_{k=0}^{N-1} \mu_k^\top \big( f_\theta(x_k,u_k) - x_{k+1} \big) + \nu^\top h,$

where $\Phi_\theta$ denotes the objective (3a), $h$ collects the inequality constraints, $\chi, \zeta, \mu, \nu$ are the multipliers associated with constraints (3b)-(3e), and $y$ gathers the primal variables and the multipliers. Note that, with the term $\zeta^\top(u_0 - a)$ removed, $\mathcal{L}_\theta$ is the Lagrangian function associated to the NMPC problem defining the value function $V_\theta$. We observe that [23]
(10) $\nabla_\theta Q_\theta(s,a) = \nabla_\theta \mathcal{L}_\theta(y^\star)$

holds for $y^\star$ given by the primal-dual solution of (3). Note that this equality holds because constraints (3b) are not an explicit function of $\theta$. The gradient (10) is therefore straightforward to build as a by-product of solving the NMPC problem (3). We additionally observe that
(11) $\nabla_\theta V_\theta(s) = \nabla_\theta \mathcal{L}_\theta(\bar y^\star),$

where $\bar y^\star$ is given by the primal-dual solution to (3) with the constraint $u_0 = a$ removed and the corresponding multiplier eliminated.
We remark that second-order derivatives can also be computed, though in general they depend on the derivative of the optimal primal-dual solution with respect to the parameters, given by

(12) $\frac{\partial y^\star}{\partial \theta} = -\left( \frac{\partial R}{\partial y} \right)^{-1} \frac{\partial R}{\partial \theta},$

where $R(y, \theta) = 0$ gathers the primal-dual KKT conditions underlying the NMPC scheme (3). For a complete discussion on parametric sensitivity analysis of NLPs, we refer to [23] and references therein.
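The sensitivity result (10) can be checked numerically on a toy parametric problem (purely illustrative, not the NMPC scheme): the gradient of the optimal value with respect to the parameter equals the partial derivative of the Lagrangian (here just the objective, since the problem is unconstrained) at the solution.

```python
import numpy as np

# Toy problem: Q(theta) = min_x (x - theta)^2 + x
def x_opt(theta):
    return theta - 0.5                     # stationarity: 2(x - theta) + 1 = 0

def dL_dtheta(x, theta):
    return -2.0 * (x - theta)              # partial derivative of the objective

theta = 1.3
g = dL_dtheta(x_opt(theta), theta)         # sensitivity via (10)

# Finite-difference check of dQ/dtheta
Q = lambda t: (x_opt(t) - t) ** 2 + x_opt(t)
h = 1e-6
fd = (Q(theta + h) - Q(theta - h)) / (2.0 * h)
```

The key point, as in (10), is that differentiating through the minimizer is unnecessary: the envelope argument lets one differentiate the Lagrangian at the fixed solution.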
V Numerical Example
We consider an example from the process industry, i.e., the evaporation process modelled in [24, 25] and used in [26, 27] to demonstrate the potential of economic MPC in the nominal case. For the sake of brevity, we omit the model equations and the non-quadratic economic cost function, which involve the states (concentration $X_2$ and pressure $P_2$) and the controls (pressure $P_{100}$ and flow $F_{200}$). All details can be found in [26, 27]. The model further depends on the feed concentration $X_1$, feed flow $F_1$, and temperatures $T_1$, $T_{200}$, which are assumed to be constant in the control model. In reality, these quantities are stochastic, with mean centred on the nominal value. Bounds on the states and on the controls are present. In particular, the bound $X_2 \ge 25\,\%$ is introduced in order to ensure sufficient quality of the product. All state bounds are relaxed as in (3e).
We parametrize an NMPC controller as in (3), i.e., a nonlinear, non-condensed MPC formulation, with $\ell_\theta$, $V^{\mathrm{f}}_\theta$, and $\lambda_\theta$ quadratic functions, each defined by a Hessian, a gradient, and a constant term. The model is parametrized as the nominal model with the addition of a constant, i.e., $f_\theta = f_{\mathrm{nom}} + c_\theta$. The control constraints are fixed, and the state constraints are parametrized as simple bounds, i.e., $\underline{x}_\theta \le x_k \le \bar{x}_\theta$. The vector of parameters $\theta$ therefore gathers the cost coefficients, the model offset $c_\theta$, and the state bounds. Constants $w$, $w_{\mathrm{f}}$ are fixed and assumed to reflect the known cost of violating the state constraints.
We use the batch policy update with batch size $n$ and update the parameters with the learned ones every $n$ time steps. In order to induce enough exploration, we use an $\epsilon$-greedy policy, which is greedy on a fraction $1-\epsilon$ of the samples, while in the remaining ones we apply the perturbed action $\mathrm{sat}\big(\pi_\theta(s) + d\big)$, with $d$ a random perturbation, where $\mathrm{sat}$ saturates the input between its lower and upper bounds.
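The exploration scheme can be sketched as follows; the noise scale, bounds, and function names are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def explore(greedy_u, u_lo, u_hi, eps, rng):
    """Epsilon-greedy with saturation: perturb the greedy input with
    probability eps, then clip to the input bounds."""
    u = np.asarray(greedy_u, dtype=float)
    if rng.random() < eps:
        u = u + rng.normal(scale=0.1 * (u_hi - u_lo), size=u.shape)
    return np.clip(u, u_lo, u_hi)

rng = np.random.default_rng(0)
# eps = 0 takes the greedy branch; an out-of-bounds input is still saturated
u = explore(np.array([2.0]), 0.0, 1.0, eps=0.0, rng=rng)
```

Saturating after the perturbation keeps the applied action feasible with respect to the actuator limits, which is essential when exploring on a real plant.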
We initialize the ENMPC scheme with a naive initial guess for the cost Hessians, while all other parameters are initialized at zero. While at every step we do check the positive-definiteness constraint (8b), during the learning phase the parameters never violate it. As displayed in Figure 1, the algorithm converges to a constant parameter value while reducing the average TD error. If the standard parameter update (2b) is applied, RL does not converge with the proposed learning rate. If a smaller value is used, the algorithm does not diverge, but the parameters are updated very slowly, making the approach impractical.
We performed a simulation to compare the RL-tuned NMPC scheme to the naive initial guess and to the NMPC tuned using the economic-based approach proposed in [27], which relies on the nominal model. RL obtains an economic gain over both schemes. The effectiveness of the economically-tuned NMPC had been demonstrated in [27] in the absence of stochastic perturbations. In the considered scenario we observe that, while the economically-tuned scheme still performs better than the naively-tuned one, the RL tuning is able to significantly outperform the two other NMPC schemes by explicitly accounting for the stochastic perturbations.
The concentration $X_2$ and the difference in instantaneous cost between the RL-tuned and the naively-tuned scheme are displayed in Figure 2. In particular, one can see that RL tries to stabilize the concentration at a value higher than the nominally optimal one: the optimum in the presence of perturbations is obtained as a compromise between the loss due to operating above the bound and the cost of violating the constraint $X_2 \ge 25\,\%$. Indeed, the constraint is violated, but only rarely and by small amounts.
VI Conclusions and Outlook
In this paper, we analyzed the use of reinforcement learning to tune MPC schemes, aiming at a self-tuning controller which guarantees stability, safety (i.e., constraint satisfaction), and optimality. In order to make this approach practically applicable, we have proposed an improved algorithm based on Q-learning and tested it in simulations.
Future work will consider several research directions including: (a) further improvements in the algorithmic framework with the aim of developing dataefficient algorithms; (b) developing new algorithms for other RL paradigms, such as policy gradient methods; (c) further investigating the combination of system identification and RL in order to best update the MPC parameters while guaranteeing safety.
References
 R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. Second Edition. MIT press Cambridge, 2018. [Online]. Available: http://incompleteideas.net/book/thebook2nd.html
 D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, pp. 484–503, 2016.
 P. Abbeel, A. Coates, M. Quigley, and A. Y. Ng, “An application of reinforcement learning to aerobatic helicopter flight,” in In Advances in Neural Information Processing Systems 19. MIT Press, 2007, p. 2007.
 S. Wang, W. Chaovalitwongse, and R. Babuska, “Machine learning algorithms in bipedal robot control,” IEEE Transactions on Systems, Man, and Cybernetics Part C, vol. 42, no. 5, pp. 728–743, Sep. 2012.
 J. Rawlings and D. Mayne, Model Predictive Control: Theory and Design. Nob Hill, 2009.
 L. Grüne and J. Pannek, Nonlinear Model Predictive Control. London: Springer, 2011.
 F. Borrelli, A. Bemporad, and M. Morari, Predictive control for linear and hybrid systems. Cambridge University Press, 2017.
 T. Koller, F. Berkenkamp, M. Turchetta, and A. Krause, “Learning-based Model Predictive Control for Safe Exploration and Reinforcement Learning,” arXiv preprint, 2018.
 A. Aswani, H. Gonzalez, S. S. Sastry, and C. Tomlin, “Provably safe and robust learningbased model predictive control,” Automatica, vol. 49, no. 5, pp. 1216 – 1226, 2013.
 C. J. Ostafew, A. P. Schoellig, and T. D. Barfoot, “Robust Constrained Learningbased NMPC enabling reliable mobile robot path tracking,” The International Journal of Robotics Research, vol. 35, no. 13, pp. 1547–1563, 2016.
 F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause, “Safe Modelbased Reinforcement Learning with Stability Guarantees,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 908–918.
 F. L. Lewis and D. Vrabie, “Reinforcement learning and adaptive dynamic programming for feedback control,” IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32–50, 2009.
 F. L. Lewis, D. Vrabie, and K. G. Vamvoudakis, “Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers,” IEEE Control Systems, vol. 32, no. 6, pp. 76–105, 2012.
 S. Gros and M. Zanon, “DataDriven Economic NMPC using Reinforcement Learning,” IEEE Transactions on Automatic Control, 2018, (under revision). [Online]. Available: https://mariozanon.wordpress.com/rlfornmpc
 C. Watkins, “Learning from delayed rewards,” Ph.D. dissertation, King’s College, Cambridge, 1989.
 V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Humanlevel control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
 G. Theocharous, P. S. Thomas, and M. Ghavamzadeh, “Personalized Ad Recommendation Systems for LifeTime Value Optimization with Guarantees,” in IJCAI, 2015, pp. 1806–1812.
 J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” The International Journal of Robotics Research, vol. 32, 2013.
 R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Proceedings of the 12th International Conference on Neural Information Processing Systems, ser. NIPS’99. Cambridge, MA, USA: MIT Press, 1999, pp. 1057–1063.
 D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in Proceedings of the 31st International Conference on International Conference on Machine Learning  Volume 32, ser. ICML’14, 2014, pp. I–387–I–395.
 P. Scokaert and J. Rawlings, “Feasibility Issues in Linear Model Predictive Control,” AIChE Journal, vol. 45, no. 8, pp. 1649–1659, 1999.
 H. Bock and K. Plitt, “A multiple shooting algorithm for direct solution of optimal control problems,” in Proceedings 9th IFAC World Congress Budapest. Pergamon Press, 1984, pp. 242–247.
 C. Büskens and H. Maurer, Online Optimization of Large Scale Systems. Berlin, Heidelberg: Springer Berlin Heidelberg, 2001, ch. Sensitivity Analysis and RealTime Optimization of Parametric Nonlinear Programming Problems, pp. 3–16.
 F. Y. Wang and I. T. Cameron, “Control studies on a model evaporation process – constrained state driving with conventional and higher relative degree systems,” Journal of Process Control, vol. 4, pp. 59–75, 1994.
 C. Sonntag, O. Stursberg, and S. Engell, “Dynamic Optimization of an Industrial Evaporator using Graph Search with Embedded Nonlinear Programming,” in Proc. 2nd IFAC Conf. on Analysis and Design of Hybrid Systems (ADHS), 2006, pp. 211–216.
 R. Amrit, J. B. Rawlings, and L. T. Biegler, “Optimizing process economics online using model predictive control,” Computers & Chemical Engineering, vol. 58, pp. 334 – 343, 2013.
 M. Zanon, S. Gros, and M. Diehl, “A Tracking MPC Formulation that is Locally Equivalent to Economic MPC,” Journal of Process Control, 2016.