Practical Reinforcement Learning of Stabilizing Economic MPC



Reinforcement Learning (RL) has demonstrated great potential in learning optimal policies without any prior knowledge of the process to be controlled. Model Predictive Control (MPC) is a popular control technique which is able to deal with nonlinear dynamics and state and input constraints. The main drawback of MPC is the need to identify an accurate model, which in many cases cannot be easily obtained. Because of model inaccuracy, MPC can fail to deliver satisfactory closed-loop performance. Using RL to tune the MPC formulation or, conversely, using MPC as a function approximator in RL allows one to combine the advantages of the two techniques. This approach has important advantages, but it requires an adaptation of the existing algorithms. We therefore propose an improved RL algorithm for MPC and test it in simulations on a rather challenging example.

I Introduction

Reinforcement learning (RL) is a model-free control technique which recursively updates the controller parameters in order to achieve optimality. Once the controller parameters have been learned, the controller implements the control action which yields an infinite-horizon optimal cost for each initial condition [1]. RL has drawn increasing attention thanks to the striking results obtained in beating chess and Go masters [2], and in learning how to make a robot walk or fly without supervision [3, 4].

In order to be able to solve the RL problem in practice, function approximation strategies must be employed. Such approximations typically rely on a set of basis functions, or features, and corresponding parameters that multiply them. In this case, the function approximator is linear in the parameters. However, nonlinear function approximators are commonly deployed using Deep Neural Networks (DNN).

Model Predictive Control (MPC) is a model-based technique which exploits a model of the system dynamics to predict the system’s future behavior and optimize a given performance index, possibly subject to input and state constraints [5, 6, 7]. The success of MPC is due to its ability to enforce constraints and yield optimal trajectories. While a plethora of efficient algorithms for the online solution of MPC problems has been developed, the main drawback of this control technique is the need to identify the open-loop model offline, which is typically the most time-consuming phase of control design.

The advantages of MPC and RL can be combined together by framing MPC as a function approximator within an RL context. The tuning parameters of the MPC problem (e.g., cost weighting matrices, model parameters, etc.) can be framed as parameters of a nonlinear function approximator, namely the MPC optimization problem. This setup can be adopted to approximate the feedback policy, the value function, or the action-value function.

The main advantages of combining MPC with RL are (a) the ease of introducing a constraint-enforcing policy within RL, (b) the possibility of improving existing models and controllers by using them as an initial guess in RL, and (c) the possibility of interpreting the learned algorithm in a model-based framework.

The combination of learning and control techniques has been proposed in, e.g., [8, 9, 10, 11]. Some attempts at combining RL and the linear quadratic regulator have been presented in [12, 13]. To the best of our knowledge, however, [14] is the first work proposing to use NMPC as a function approximator in RL.

In this paper, building on the ideas from [14], we further analyse the combination of RL and MPC. The main contribution of this paper is the development of an improved algorithmic framework tailored to the problem formulation which overcomes some shortcomings of basic RL algorithms. We demonstrate the potential of our approach in simulations on an involved nonlinear example from the process industry.

This paper is structured as follows. Section II introduces Q-learning. The main contributions of this paper are presented in Section III, which discusses the use of MPC as a function approximator in RL, and Section IV, which presents an algorithm adaptation tailored to the problem. Some numerical examples are given in Section V. The paper is concluded by Section VI, which also outlines future research directions.

II Reinforcement Learning

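Consider a Markov Decision Process (MDP) with state transition dynamics $\mathbb{P}[s_+ \,|\, s, a]$, where $s$ and $a$ denote the states and actions (or controls), respectively, and $s_+$ denotes the next state. We also introduce the stage cost $\ell(s,a)$ and the discount factor $\gamma$. The action-value function $Q_\star(s,a)$ and value function $V_\star(s)$ associated with the optimal policy $\pi_\star(s)$ are defined by the Bellman equations:

$$Q_\star(s,a) = \ell(s,a) + \gamma\, \mathbb{E}\left[ V_\star(s_+) \,|\, s, a \right], \qquad V_\star(s) = Q_\star(s,\pi_\star(s)) = \min_a Q_\star(s,a). \tag{1}$$

Q-learning parametrizes the action-value function as $Q_\theta(s,a)$, where $\theta$ is a vector of parameters whose values have to be learned, and aims at minimizing $\mathbb{E}\left[(Q_\theta(s,a) - Q_\star(s,a))^2\right]$. Given a state-action pair $(s,a)$ and the next state $s_+$, standard algorithms [1] update the parameter using

$$\delta = \ell(s,a) + \gamma V_\theta(s_+) - Q_\theta(s,a), \tag{2a}$$
$$\theta \leftarrow \theta + \alpha\, \delta\, \nabla_\theta Q_\theta(s,a), \tag{2b}$$

where $V_\theta(s) = \min_a Q_\theta(s,a)$, $\alpha$ is the step size, and $\delta$ is known as the Temporal-Difference (TD) error. In batch policy updates, the parameter used by the policy is kept constant for a fixed number of steps, after which it is replaced by the learned parameter. In instantaneous policy updates, the policy parameter is updated at every time instant.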
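The update described above can be sketched in a few lines. This is a minimal illustration only, assuming the standard TD error (stage cost plus discounted minimum next action value, minus the current estimate) and a linear-in-parameters approximator; the quadratic feature map, the discrete action grid, and all numerical values are hypothetical and not taken from the paper.

```python
import numpy as np

def features(s, a):
    """Hypothetical quadratic features of a scalar state and action."""
    return np.array([s * s, s * a, a * a])

def q_value(theta, s, a):
    """Linear-in-parameters action-value estimate."""
    return features(s, a) @ theta

def td_error(theta, s, a, cost, s_next, actions, gamma):
    """delta = cost + gamma * min_a' Q(s_next, a') - Q(s, a); costs are minimized."""
    v_next = min(q_value(theta, s_next, b) for b in actions)
    return cost + gamma * v_next - q_value(theta, s, a)

def q_learning_step(theta, s, a, cost, s_next, actions, gamma, alpha):
    """One stochastic-gradient step; the gradient of Q w.r.t. theta is the feature vector."""
    delta = td_error(theta, s, a, cost, s_next, actions, gamma)
    return theta + alpha * delta * features(s, a), delta

actions = np.linspace(-1.0, 1.0, 5)   # assumed discrete action grid
theta = np.zeros(3)
theta, delta = q_learning_step(theta, s=1.0, a=1.0, cost=3.0, s_next=0.5,
                               actions=actions, gamma=0.9, alpha=0.1)
```

With a nonlinear approximator such as an MPC scheme, only `q_value` and the gradient in `q_learning_step` would change; the structure of the update stays the same.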

It is important to underline that, while in this paper we focus on Q-learning (which has been successfully deployed on some applications, e.g., [15, 16, 17]), our approach can be readily deployed on other TD algorithms such as SARSA [18]. Even in approaches directly optimizing the policy, learning the action-value function is often necessary [19, 20], such that the developments of this paper apply to a fairly broad class of RL algorithms.

While a very common choice is to use Deep Neural Networks (DNN) as function approximators, it is hard to analyze closed-loop stability in DNN-based RL. Moreover, in case a controller is already available, information on how to control the system cannot be easily incorporated in the problem formulation. For this reason, we propose to use MPC as a function approximator, since it makes it possible to enforce closed-loop stability guarantees and an existing controller can be used as initial guess for the algorithm.

III MPC-Based RL

The use of MPC to parametrize the action-value function was first advocated in [14]. In the following, we first present the function approximation used and recall the most important result in Theorem 1. We then further analyze the properties of MPC-based RL, and in the next section we propose a new variant of the Q-learning algorithm tailored to MPC-based RL.

III-A Parametrization of the Function Approximations

We parametrize the action-value function using an MPC scheme of the form


where . Consequently, we obtain

Note that the policy and value function are equivalently obtained by solving Problem (3) with the constraint fixing the first control input to the action $a$ removed. In order to address feasibility issues in Problem (3), in the presence of state-dependent constraints we adopt an exact relaxation of such constraints [21].

Remark 1

MPC formulations in which the stage cost is not positive-definite are commonly referred to as Economic MPC. In order to enforce closed-loop stability and the existence of the MPC solution, we assume that the stage and terminal costs of the MPC scheme are positive-definite functions. Therefore, in order to deal with the situation in which the true stage cost is not positive-definite, we have introduced an arrival penalty in the MPC cost. For all details on this topic, we refer to [14] and references therein.

Remark 2

The proposed setup readily accommodates formulations in which the cost penalizes deviations from a given reference. In that case, the stage cost (both in RL and MPC) depends on a reference, passed to the problem as an exogenous signal. Input-output model formulations also readily fit in the proposed framework.

Among the desirable properties of the proposed formulation, we mention nominal stability guarantees, and the possibility to introduce constraints accounting for, e.g., actuator limitations, safe operation of the system, etc. While being aware that the current framework is not able to fully exploit these advantages, with the presented developments we aim at constructing a sound basis which will be used in future research with the intent of developing self-tuning, safe, and stable economic MPC controllers.

III-B Learning the Model: RL and System Identification

We recall the following fundamental theorem from [14], which states that the optimal value and action-value functions, as well as the optimal policy, can be learned even by using an MPC scheme based on a state transition model that differs from the true one. This also entails that there is no guarantee (and no need) that the RL algorithm will learn a physically meaningful model.

Theorem 1 ([14])

Consider a given (possibly stochastic) state transition model , possibly different from the true model . Define the optimal value function

associated with stage cost , and the terminal cost over an optimization horizon . Define as the (possibly stochastic) trajectories of the state transition model under a policy , starting from , and the optimal policy associated with and the associated action-value function. Consider the set such that

Then, such that the following identities hold on :

  • for the inputs such that .

We can now further clarify the theorem and formalize a form of “orthogonality” between reinforcement learning and system identification.

Corollary 2

Let us split the parameter as , such that the model only depends on and assume a perfect parametrization, such that, under adequate exploration of the state-action space, RL learns and perfectly with , i.e.,

Then, leaving the task of identifying to a separate identification algorithm is not detrimental to the learning of the optimal value, action-value function, and policy.

Unfortunately, Corollary 2 applies only to the case of perfect parametrization. In practice, function approximation with imperfect parametrization and lack of excitation can destroy this form of orthogonality. Ongoing research is currently further investigating such aspects.

III-C On the Parametrization of the Value and Action-Value Functions

Since the Bellman principle of optimality states that , one could be tempted to replace by in the computation of the TD error. We discuss in the following lemma why this approach can minimize the TD error while not learning the correct action-value function and, consequently, not delivering the optimal policy.

Lemma 3

Consider computing the TD error (2a) by replacing by to obtain


Then, it is possible to obtain without minimising the error and, therefore, without learning the optimal policy.


We prove the lemma by a simple counterexample. Consider the LQR case, with and

Then, (4) reads as

Assume now that the correct model is identified, i.e., , . Then, the TD error is independent of , which can be chosen as desired. This implies that, even though we use a perfect parametrization of the action-value function, . Consequently, the learned feedback given by


is not uniquely defined.

For comparison, we consider now the standard case, in which the TD error is computed using

where is given by (5). In this case, if , , must solve the algebraic Riccati equation associated with the given stage cost and model. It is important to stress, however, that infinitely many solutions exist in general such that the true model will not be identified. As an example we provide : it is immediate to verify that .

III-D Condensed Model-Free Parametrization

Motivated by the previous considerations on the possibility of learning a wrong model, we propose next a formulation which is truly model-free as it does not include model parameters as coefficients to be learned. For simplicity we only focus here on the case of linear dynamics with a quadratic cost. We define ; then


where and is a symmetric (typically dense) matrix. Problem formulation (6) is sometimes used in MPC and optimal control, in which case the parameters are related to the system dynamics and cost [22]. In this case, the dependence of and on the parameters is less nonlinear than in the model-based parametrization, see (3). However, the introduction of a prediction horizon possibly introduces more parameters than the standard formulation. A special case occurs when there are only input constraints and they are known: in this case can be fixed and only , need to be learned. The LQR case simplifies to


such that and
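To make the LQR special case concrete, the following sketch assumes that the action-value function is a quadratic form in the stacked state-action vector, defined by a symmetric matrix with a positive-definite action block. The name `M`, the block partition, and the numerical values are assumptions made for illustration and are not taken from the text. Under that assumption, the greedy feedback and the value function follow from minimizing over the action in closed form.

```python
import numpy as np

def split_blocks(M, ns):
    """Partition a symmetric matrix into state, cross, and action blocks."""
    return M[:ns, :ns], M[:ns, ns:], M[ns:, ns:]

def q_value(M, s, a):
    """Quadratic action-value function Q(s, a) = [s; a]' M [s; a]."""
    z = np.concatenate([s, a])
    return z @ M @ z

def greedy_gain(M, ns):
    """Gain K such that a = -K s minimizes Q(s, a) over a."""
    _, M_sa, M_aa = split_blocks(M, ns)
    return np.linalg.solve(M_aa, M_sa.T)

def value_matrix(M, ns):
    """Matrix P with V(s) = s' P s (Schur complement of the action block)."""
    M_ss, M_sa, M_aa = split_blocks(M, ns)
    return M_ss - M_sa @ np.linalg.solve(M_aa, M_sa.T)

# Hypothetical example with two states and one input.
M = np.array([[3.0, 0.2, 0.5],
              [0.2, 2.0, -0.4],
              [0.5, -0.4, 1.0]])
s = np.array([1.0, -2.0])
K = greedy_gain(M, ns=2)
a_star = -K @ s
V = s @ value_matrix(M, ns=2) @ s
```

In this form the learned parameters are the entries of the matrix itself, so no model coefficients appear explicitly.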

IV Algorithm

The main practical difficulties related to using MPC as a function approximator in RL are that: (a) the function approximator is nonlinear in the parameters $\theta$, and (b) the MPC problem is guaranteed to have a meaningful solution only if the cost is positive-definite. In this section we propose an algorithm that addresses both issues.

We start by recalling that the main motivation for the stochastic gradient approach typically used in Q-learning stems from an equivalence with the table-lookup case, i.e., when the state-action space is discrete and the action-value function is parametrized as a linear combination of indicator features, one per state-action pair, each equal to 1 at its own pair and 0 otherwise. In this case, the action-value function is linear in the parameter. The step-size parameter in (2b) is used in order to approximate the expected value of the TD error: one can roughly interpret it as the inverse of the number of samples over which the average is computed.

In this case, the update (2b) obtained with the (exact) function approximation matches that of the enumeration

The update (2b) can also be written as with

We remark that this is not the case if the features are not normalized or if the action-value function is a nonlinear function of the parameter. This also implies that the step-size parameter loses its original meaning, as it is also used to dampen the stochastic gradient step in order to enforce convergence.
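The role of normalization can be checked numerically. The sketch below is an illustration under simplifying assumptions: a single state-action pair, fixed (hypothetical) TD targets that do not depend on the parameter, and a step size equal to the inverse of the sample count. With a unit indicator feature the gradient update reproduces the running sample mean, whereas the same schedule applied to a feature scaled by 3 does not.

```python
import numpy as np

def sgd_update(theta, phi, target, alpha):
    """Gradient step for a linear approximator Q = phi @ theta toward a fixed target."""
    delta = target - phi @ theta
    return theta + alpha * delta * phi

targets = [4.0, 2.0, 6.0]            # hypothetical targets for one state-action pair

# Normalized indicator feature: with alpha = 1/n the estimate is the running mean.
phi = np.array([1.0, 0.0])
theta = np.zeros(2)
for n, y in enumerate(targets, start=1):
    theta = sgd_update(theta, phi, y, alpha=1.0 / n)
mean_estimate = phi @ theta

# Same data with the feature scaled by 3: the effective step is 9 * alpha,
# so the same alpha schedule no longer yields the sample mean.
phi_scaled = 3.0 * phi
theta_s = np.zeros(2)
for n, y in enumerate(targets, start=1):
    theta_s = sgd_update(theta_s, phi_scaled, y, alpha=1.0 / n)
scaled_estimate = phi_scaled @ theta_s
```

In the scaled case the step size would have to be reduced to keep the iteration stable, which is the sense in which it stops being a pure averaging weight.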

We propose to apply the update where is the solution of


Note that we solve Problem (8) to full convergence at each step. In the case of a linear parametrization, convergence is obtained in one step by using a Gauss-Newton Hessian approximation, provided that Constraints (8b) are inactive. This also directly solves the issue of scaling, which is present in the table-lookup case if the features are not normalized. Finally, if instead of solving Problem (8) to full convergence Constraints (8b) are neglected and one takes a single full Newton step, the update (2b) is recovered.

We highlight next the main features of Formulation (8):

  • Globalization: globalization strategies such as line search ensure descent and, therefore, convergence. Descent is typically not enforced in stochastic gradient approaches such as (2), which could take a step which yields a larger TD error than with the previous parameter estimate.

  • Positive-definiteness enforcement: Constraints (8b) guarantee that the cost is positive-definite and, therefore, let the MPC problem be stabilizing and well-posed. Without this constraint, even when starting from a well-posed, positive-definite cost, the RL iterates might produce an indefinite cost yielding an unbounded solution. We have written the positive-definiteness Constraints (8b) using the Hessians of the MPC stage and terminal costs, which is reasonable if both functions are quadratic. However, other approaches relying on richer function approximations are possible, e.g., using sum-of-squares techniques.

  • Best fit: by solving Problem (8) to full convergence, the step is the one which minimizes , analogously to the lookup-table case. Therefore, the choice of parameter can be done purely by considerations about the expected value approximation.

In summary, the main differences with the standard update (2b) are: (a) positive-definiteness enforcement, (b) insensitivity to parameter scaling and (c) guarantee of improvement at each step through globalization and full convergence.
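The insensitivity to parameter scaling can be illustrated with a least-squares fit. The sketch below is not the authors' Problem (8): it assumes a linear parametrization, fixed noise-free targets, no positive-definiteness constraints, and a damped update of the form theta + alpha * step, all of which are simplifying assumptions. In this setting a single Gauss-Newton step solves the fit exactly, and rescaling the features changes the parameters but not the fitted values.

```python
import numpy as np

def gauss_newton_step(theta, Phi, targets):
    """Step d minimizing ||targets - Phi @ (theta + d)||^2 (exact for a linear model)."""
    residual = targets - Phi @ theta
    d, *_ = np.linalg.lstsq(Phi, residual, rcond=None)
    return d

rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 3))                # hypothetical feature rows, one per sample
theta_true = np.array([1.0, -2.0, 0.5])
targets = Phi @ theta_true                    # noise-free targets, for illustration only

alpha = 1.0                                   # full step
theta_new = np.zeros(3) + alpha * gauss_newton_step(np.zeros(3), Phi, targets)

# Rescale the feature columns: parameters change, fitted values do not.
scale = np.array([10.0, 1.0, 0.1])
theta_scaled = alpha * gauss_newton_step(np.zeros(3), Phi * scale, targets)
fit_plain = Phi @ theta_new
fit_scaled = (Phi * scale) @ theta_scaled
```

A plain gradient step on the same data would behave differently for the two scalings, since its effective step length depends on the magnitude of the features.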

IV-A Derivative Computation

We detail next how to compute the derivatives of the action-value function with respect to the parameters. To this end, we define the Lagrangian function underlying NMPC problem (3) as

where are the multipliers associated with constraints (3b)-(3e) and . Note that, for , is the Lagrangian function associated to the NMPC problem defining the value function . We observe that [23]


holds for given by the primal-dual solution of (3). Note that this equality holds because constraints (3b) are not an explicit function of . The gradient (10) is therefore straightforward to build as a by-product of solving the NMPC problem (3). We additionally observe that


where is given by the primal-dual solution to (3) with constraint removed and .
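The statement that the gradient of the optimal value equals the partial derivative of the Lagrangian at the primal-dual solution can be verified on a toy problem. The sketch below uses a small equality-constrained parametric QP, not the NMPC problem (3); the matrices `H`, `g`, `A`, `b` and the scalar parameter are hypothetical. The Lagrangian-based gradient is compared against a central finite difference of the optimal value.

```python
import numpy as np

H = np.array([[2.0, 0.5], [0.5, 1.0]])
g = np.array([1.0, -1.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

def solve_qp(theta):
    """Solve min_x 0.5 x'Hx + theta * g'x  s.t.  A x = theta * b via the KKT system."""
    n, m = H.shape[0], A.shape[0]
    kkt = np.block([[H, A.T], [A, np.zeros((m, m))]])
    rhs = np.concatenate([-theta * g, theta * b])
    sol = np.linalg.solve(kkt, rhs)
    x, lam = sol[:n], sol[n:]
    value = 0.5 * x @ H @ x + theta * g @ x
    return value, x, lam

def value_gradient(theta):
    """Partial derivative of L = 0.5 x'Hx + theta g'x + lam'(Ax - theta b) w.r.t. theta."""
    _, x, lam = solve_qp(theta)
    return g @ x - lam @ b

theta0, h = 0.7, 1e-6
fd = (solve_qp(theta0 + h)[0] - solve_qp(theta0 - h)[0]) / (2 * h)
grad = value_gradient(theta0)
```

No derivative of the solution with respect to the parameter is needed for this first-order quantity, which is why it comes as a by-product of solving the problem.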

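We remark that second-order derivatives can also be computed, though in general they depend on the derivative of the optimal primal-dual solution with respect to the parameters. Denoting the primal-dual solution by $y^\star$, this derivative is given by the implicit function theorem as

$$\frac{\partial y^\star}{\partial \theta} = -\left(\frac{\partial R}{\partial y}\right)^{-1} \frac{\partial R}{\partial \theta},$$

where $R(y,\theta) = 0$ gathers the primal-dual KKT conditions underlying the NMPC scheme (3). For a complete discussion on parametric sensitivity analysis of NLPs we refer to [23] and references therein.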
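The solution sensitivity can be sketched on a small nonlinear program. The example below is a toy equality-constrained problem, not the NMPC scheme (3): the KKT residual, its Jacobians, and the Newton solver are hypothetical illustrations. It computes the derivative of the primal-dual solution with respect to the parameter via the implicit function theorem applied to the KKT conditions and compares it with a finite difference.

```python
import numpy as np

def residual(y, theta):
    """KKT residual of: min 0.5*(x1-theta)^2 + 0.5*x2^2  s.t.  x1 + x2 + 0.1*x1^2 = 1."""
    x1, x2, lam = y
    return np.array([
        x1 - theta + lam * (1.0 + 0.2 * x1),   # stationarity w.r.t. x1
        x2 + lam,                              # stationarity w.r.t. x2
        x1 + x2 + 0.1 * x1**2 - 1.0,           # equality constraint
    ])

def jac_y(y, theta):
    """Jacobian of the residual w.r.t. the primal-dual variables y = (x1, x2, lam)."""
    x1, _, lam = y
    return np.array([
        [1.0 + 0.2 * lam, 0.0, 1.0 + 0.2 * x1],
        [0.0, 1.0, 1.0],
        [1.0 + 0.2 * x1, 1.0, 0.0],
    ])

def jac_theta(y, theta):
    """Jacobian of the residual w.r.t. the parameter theta."""
    return np.array([-1.0, 0.0, 0.0])

def solve_kkt(theta, iters=20):
    """Plain Newton iteration on the KKT residual, started from zero."""
    y = np.zeros(3)
    for _ in range(iters):
        y = y - np.linalg.solve(jac_y(y, theta), residual(y, theta))
    return y

def solution_sensitivity(theta):
    """dy/dtheta = -(dR/dy)^{-1} dR/dtheta at the primal-dual solution."""
    y = solve_kkt(theta)
    return -np.linalg.solve(jac_y(y, theta), jac_theta(y, theta))

theta0, h = 0.5, 1e-5
dy_ift = solution_sensitivity(theta0)
dy_fd = (solve_kkt(theta0 + h) - solve_kkt(theta0 - h)) / (2 * h)
```

The linear system solved in `solution_sensitivity` reuses the same KKT matrix that a Newton-type solver factorizes at the solution, so the extra cost is one additional back-solve per parameter.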

V Numerical Example

We consider an example from the process industry, i.e. the evaporation process modelled in [24, 25] and used in [26, 27] to demonstrate the potential of economic MPC in the nominal case. For the sake of brevity we omit the model equations and non-quadratic economic cost function, which include states (concentration and pressure) and controls (pressure and flow). All details can be found in [26, 27]. The model further depends on concentration , flow , and temperatures , which are assumed to be constant in the control model. In reality, these quantities are stochastic with variance , , , , and mean centred on the nominal value. Bounds on the states and on the controls are present. In particular, the bound is introduced in order to ensure sufficient quality of the product. All state bounds are relaxed as in (3e).

We parametrize an NMPC controller as in (3), i.e., a nonlinear non-condensed MPC formulation, with , quadratic functions defined by Hessian , gradient , and constant , . The model is parametrized as the nominal model with the addition of a constant, i.e., . The control constraints are fixed and the state constraints are parametrized as simple bounds, i.e., . The vector of parameter therefore reads as:

Constants , are fixed and assumed to reflect the known cost of violating the state constraints.

We use the batch policy update with and update the parameters with the learned ones every time steps. In order to induce enough exploration, we use an $\epsilon$-greedy policy which is greedy of the samples, while in the remaining we apply the action

where saturates the input between its lower and upper bounds , respectively.
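The exploration scheme can be sketched as follows. This is a hedged illustration: the greedy action is assumed to be supplied by the MPC policy, the exploratory action is assumed to be the greedy action plus a zero-mean Gaussian perturbation, and the names `epsilon`, `noise_scale`, and the numerical bounds are hypothetical. Only the saturation between the input bounds is taken from the text.

```python
import numpy as np

def saturate(u, lower, upper):
    """Clip the input between its lower and upper bounds."""
    return np.clip(u, lower, upper)

def epsilon_greedy_action(greedy_action, lower, upper, epsilon, noise_scale, rng):
    """Return the greedy action with probability 1 - epsilon, else a perturbed, saturated one."""
    greedy_action = np.asarray(greedy_action, dtype=float)
    if rng.random() >= epsilon:
        return greedy_action
    # Assumed perturbation: zero-mean Gaussian noise added to the greedy action.
    perturbation = noise_scale * rng.standard_normal(greedy_action.shape)
    return saturate(greedy_action + perturbation, lower, upper)

rng = np.random.default_rng(0)
lower = np.array([-1.0, -1.0])
upper = np.array([1.0, 1.0])
greedy = np.array([0.5, -0.2])     # would come from the MPC policy in practice

a_greedy = epsilon_greedy_action(greedy, lower, upper, epsilon=0.0,
                                 noise_scale=10.0, rng=rng)
a_explore = [epsilon_greedy_action(greedy, lower, upper, epsilon=1.0,
                                   noise_scale=10.0, rng=rng) for _ in range(100)]
```

Saturating after perturbing keeps the exploratory inputs within the actuator limits, at the price of concentrating some probability mass on the bounds.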

Fig. 1: Evolution of the parameters (increment w.r.t. the initial guess value) and of the TD error (averaged over the preceding samples).
Fig. 2: NMPC closed-loop simulations. Top graph: concentration for NMPC with RL tuning. Middle graph: naive tuning (red) and nominal economic tuning (blue). In both graphs, the quality constraint is displayed in thick dashed black line. Bottom graph: difference of instantaneous cost between NMPC with RL tuning and naive tuning.

We initialize the ENMPC scheme by the naive initial guess , , , while all other parameters are . While at every step we do check that , during the learning phase the parameters never violate this constraint. As displayed in Figure 1, the algorithm converges to a constant parameter value while reducing the average TD-error. If the standard parameter update (2b) is applied, RL does not converge with the proposed . If a smaller value is used, the algorithm does not diverge but the parameters are updated very slowly, making the approach impractical.

We performed a simulation to compare the RL-tuned NMPC scheme to the naive initial guess and the NMPC tuned using the economic-based approach proposed in [27], which relies on the nominal model. The economic gain obtained by RL is approximately and respectively. The effectiveness of the economically-tuned NMPC had been demonstrated in [27] in the absence of stochastic perturbations. In the considered scenario we observe that, while the economically-tuned scheme still performs better than the naively-tuned one, the RL tuning is able to significantly outperform the two other NMPC schemes by explicitly accounting for the stochastic perturbations.

The concentration and the difference in instantaneous cost between the RL-tuned and the naively-tuned scheme are displayed in Figure 2. In particular, one can see that RL is trying to stabilize the concentration to a value which is higher than the nominally optimal one: the optimum in the presence of perturbations is obtained as a compromise between the loss due to operating at and the cost of violating the constraint . Indeed, the constraint is violated but only rarely and by small amounts.

VI Conclusions and Outlook

In this paper we analyzed the use of reinforcement learning to tune MPC schemes, aiming at a self-tuning controller which guarantees stability, safety (i.e., constraint satisfaction), and optimality. In order to be able to apply this approach in practice, we have proposed an improved algorithm based on Q-learning and we have tested it in simulations.

Future work will consider several research directions including: (a) further improvements in the algorithmic framework with the aim of developing data-efficient algorithms; (b) developing new algorithms for other RL paradigms, such as policy gradient methods; (c) further investigating the combination of system identification and RL in order to best update the MPC parameters while guaranteeing safety.


References

  1. R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed.   Cambridge, MA: MIT Press, 2018.
  2. D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, pp. 484–503, 2016.
  3. P. Abbeel, A. Coates, M. Quigley, and A. Y. Ng, “An application of reinforcement learning to aerobatic helicopter flight,” in Advances in Neural Information Processing Systems 19.   MIT Press, 2007.
  4. S. Wang, W. Chaovalitwongse, and R. Babuska, “Machine learning algorithms in bipedal robot control,” IEEE Transactions on Systems, Man, and Cybernetics Part C, vol. 42, no. 5, pp. 728–743, Sep. 2012.
  5. J. Rawlings and D. Mayne, Model Predictive Control: Theory and Design.   Nob Hill, 2009.
  6. L. Grüne and J. Pannek, Nonlinear Model Predictive Control.   London: Springer, 2011.
  7. F. Borrelli, A. Bemporad, and M. Morari, Predictive control for linear and hybrid systems.   Cambridge University Press, 2017.
  8. T. Koller, F. Berkenkamp, M. Turchetta, and A. Krause, “Learning-based Model Predictive Control for Safe Exploration and Reinforcement Learning,” 2018, published on Arxiv.
  9. A. Aswani, H. Gonzalez, S. S. Sastry, and C. Tomlin, “Provably safe and robust learning-based model predictive control,” Automatica, vol. 49, no. 5, pp. 1216 – 1226, 2013.
  10. C. J. Ostafew, A. P. Schoellig, and T. D. Barfoot, “Robust Constrained Learning-based NMPC enabling reliable mobile robot path tracking,” The International Journal of Robotics Research, vol. 35, no. 13, pp. 1547–1563, 2016.
  11. F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause, “Safe Model-based Reinforcement Learning with Stability Guarantees,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds.   Curran Associates, Inc., 2017, pp. 908–918.
  12. F. L. Lewis and D. Vrabie, “Reinforcement learning and adaptive dynamic programming for feedback control,” IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32–50, 2009.
  13. F. L. Lewis, D. Vrabie, and K. G. Vamvoudakis, “Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers,” IEEE Control Systems, vol. 32, no. 6, pp. 76–105, 2012.
  14. S. Gros and M. Zanon, “Data-Driven Economic NMPC using Reinforcement Learning,” IEEE Transactions on Automatic Control, 2018, (under revision).
  15. C. Watkins, “Learning from delayed rewards,” Ph.D. dissertation, King’s College, Cambridge, 1989.
  16. V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
  17. G. Theocharous, P. S. Thomas, and M. Ghavamzadeh, “Personalized Ad Recommendation Systems for Life-Time Value Optimization with Guarantees,” in IJCAI, 2015, pp. 1806–1812.
  18. J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” The International Journal of Robotics Research, vol. 32, 2013.
  19. R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Proceedings of the 12th International Conference on Neural Information Processing Systems, ser. NIPS’99.   Cambridge, MA, USA: MIT Press, 1999, pp. 1057–1063.
  20. D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ser. ICML’14, 2014, pp. I–387–I–395.
  21. P. Scokaert and J. Rawlings, “Feasibility Issues in Linear Model Predictive Control,” AIChE Journal, vol. 45, no. 8, pp. 1649–1659, 1999.
  22. H. Bock and K. Plitt, “A multiple shooting algorithm for direct solution of optimal control problems,” in Proceedings 9th IFAC World Congress Budapest.   Pergamon Press, 1984, pp. 242–247.
  23. C. Büskens and H. Maurer, Online Optimization of Large Scale Systems.   Berlin, Heidelberg: Springer Berlin Heidelberg, 2001, ch. Sensitivity Analysis and Real-Time Optimization of Parametric Nonlinear Programming Problems, pp. 3–16.
  24. F. Y. Wang and I. T. Cameron, “Control studies on a model evaporation process — constrained state driving with conventional and higher relative degree systems,” Journal of Process Control, vol. 4, pp. 59–75, 1994.
  25. C. Sonntag, O. Stursberg, and S. Engell, “Dynamic Optimization of an Industrial Evaporator using Graph Search with Embedded Nonlinear Programming,” in Proc. 2nd IFAC Conf. on Analysis and Design of Hybrid Systems (ADHS), 2006, pp. 211–216.
  26. R. Amrit, J. B. Rawlings, and L. T. Biegler, “Optimizing process economics online using model predictive control,” Computers & Chemical Engineering, vol. 58, pp. 334 – 343, 2013.
  27. M. Zanon, S. Gros, and M. Diehl, “A Tracking MPC Formulation that is Locally Equivalent to Economic MPC,” Journal of Process Control, 2016.