Deep Value Model Predictive Control
Abstract
In this paper, we introduce an actor-critic algorithm called Deep Value Model Predictive Control (DMPC), which combines model-based trajectory optimization with value function estimation. The DMPC actor is a Model Predictive Control (MPC) optimizer with an objective function defined in terms of a value function estimated by the critic. We show that our MPC actor is an importance sampler, which minimizes an upper bound of the cross-entropy to the state distribution of the optimal sampling policy. In our experiments with a Ballbot system, we show that our algorithm can work with sparse and binary reward signals to efficiently solve obstacle avoidance and target reaching tasks. Compared to previous work, we show that including the value function in the running cost of the trajectory optimizer speeds up the convergence. We also discuss the necessary strategies to robustify the algorithm in practice.
Footnote: Both authors contributed equally to this work (alphabetical ordering).
Keywords: Reinforcement Learning, Value Function Learning, Trajectory Optimization, Model Predictive Control
1 Introduction
Learning in environments with sparse reward/cost functions remains a challenging problem for Reinforcement Learning (RL). As the exploration strategy plays a vital role in such scenarios, the agent has to find small sets of informative samples and make maximal use of them. Often, an agent can be provided with prior knowledge about the environment in the form of an incomplete model, such as the agent’s dynamics. The sparsity of the reward and potential non-differentiability rule out the possibility of using Trajectory Optimization (TO). Furthermore, the sparsity of the cost function can be problematic even for RL methods that do not use structured and directed exploration policies, e.g., ε-greedy techniques. Thus, the goal of this work is to combine model-based and sample-based approaches in order to exploit the knowledge of the system dynamics while effectively exploring the environment.
Model Predictive Control (MPC) as a TO technique has proven to be a powerful tool in many robotic tasks [1, 2, 3]. While this model-based approach truncates the time horizon of the task, it continually shifts the shortened horizon forward and optimizes the state-input trajectory based on new state measurements. The main disadvantage of the MPC approach is its relatively high computational cost. As a consequence, the optimization time horizon is often kept short (e.g., in the order of seconds), which in turn prevents MPC from finding temporally global solutions. Furthermore, MPC heavily relies on the differentiability of the formulation and has a hard time dealing with sparse and non-continuous reward/cost signals.
On the other hand, RL solves the same problem by exploring and collecting information about the environment and making decisions based on samples. The RL agent has to learn about its environment, including its dynamics, from scratch via trial and error. Nevertheless, deep reinforcement learning has displayed remarkable performance in long-horizon tasks with sparse rewards [4], even in the continuous domain [5, 6]. The main drawbacks of RL are that it still requires enormous amounts of data and suffers from the exploration-exploitation dilemma [7].
In this work, we derive an actor-critic approach where the critic is a value function learner and the actor an MPC strategy. Leveraging the generality of value functions, we propose to extend past work such as [8], resulting in an algorithm called Deep Value Model Predictive Control (DMPC). The DMPC algorithm uses an MPC policy to interact with the environment and collect informative samples to update its approximation of the value function. The running cost and the heuristic function (also known as the terminal cost) of MPC are defined in terms of the value function estimated by the critic.
We also provide an in-depth analysis of the bilateral effect of this actor-critic scheme. We show that the MPC actor is an importance sampler that minimizes an upper bound of the cross-entropy to the state trajectory distribution of the optimal sampling policy. Using the value function to define the MPC cost enables us to transform an initially stochastic task into a deterministic optimal control problem. We further empirically validate that defining a running cost in addition to the heuristic function accelerates the convergence of the value function, which makes the DMPC algorithm suitable for tasks with sparse reward functions.
2 Problem Formulation
In this section, we provide more details on the control problem we solve. We consider the problem of sequential decision making in which an agent interacts with an environment to minimize the cumulative cost from the current time onwards. We formulate this problem as a discounted Markov Decision Process (MDP) [7] with a stochastic termination time.
The agent interacts with an environment with state and performs actions according to a time- and state-dependent control policy . For the sake of brevity, the dependency of and on is dropped when clear from the context. We consider stochastic dynamical systems where the state evolves according to
(1)  
(2) 
where is a Brownian motion with and is the initial state at time .
The agent selects actions according to a policy to minimize the discounted expected return for a set of initial states . The discounted expected return is defined as
(3)  
(4) 
where is the termination time of the problem, the discount factor, the running cost, the termination cost, a positive definite matrix regularizing the control inputs. is a stochastic variable called path cost. is computed by averaging over path costs generated from rollouts of the stochastic process in Equation (1) given a policy .
3 Preliminaries
Path Integral Optimal Control
The optimal policy and the value function it induces can be computed according to
(5)  
(6) 
The optimal value function, , satisfies the stochastic Hamilton-Jacobi-Bellman (HJB) equation. The derivation of the HJB equation is based on the principles of dynamic programming. This formulation is quite general, but unfortunately, computing an analytical solution is only possible for some special cases such as LQ regulators. The original work of Kappen [9, 10] studies one of these cases in which the nonlinear HJB equation can be transformed into a linear equation by enforcing a constraint on the input regularization matrix and the covariance of the noise: where . As a result of this linearity, the backward computation of the HJB equation can be replaced by a forward diffusion process that can be computed by stochastic integration. Therefore, the stochastic optimal control solution can be estimated with a Monte Carlo sampling method, resulting in the path integral control formulation
(7)  
(8)  
(9) 
where is a state trajectory in the time interval , its corresponding probability distribution under policy , and is called the desirability function for the optimal policy.
Equations (7) and (8) give an explicit expression for the optimal value function and the optimal distribution of state trajectories in terms of the expectation of the cumulative cost over trajectories. The quality of this estimation depends on how well we can estimate the desirability function. In general, for problems with sparse cost functions, using the path integral approach is challenging and requires the use of more advanced approaches to extract samples.
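To make the sampling issue concrete, the following is a minimal sketch of a Monte Carlo estimator of this kind. It assumes the standard path integral form in which the optimal value is the negative, temperature-scaled logarithm of the expected exponentiated path cost; the function names, the temperature argument `lam`, and the plain list of path costs are illustrative assumptions and not notation taken from this paper.

```python
import math


def soft_value_estimate(path_costs, lam):
    """Monte Carlo estimate of -lam * log E[exp(-R / lam)].

    Assumption: path_costs holds the path cost R of each sampled rollout,
    and lam > 0 is the temperature that links noise level and control cost.
    """
    if lam <= 0.0:
        raise ValueError("lam must be positive")
    scaled = [-cost / lam for cost in path_costs]
    shift = max(scaled)  # log-sum-exp shift for numerical stability
    mean_exp = sum(math.exp(s - shift) for s in scaled) / len(scaled)
    return -lam * (shift + math.log(mean_exp))


def importance_weights(path_costs, lam):
    """Normalized rollout weights proportional to exp(-R / lam)."""
    scaled = [-cost / lam for cost in path_costs]
    shift = max(scaled)
    unnormalized = [math.exp(s - shift) for s in scaled]
    total = sum(unnormalized)
    return [w / total for w in unnormalized]


def effective_sample_size(weights):
    """1 / sum(w^2); close to 1 when a single rollout dominates."""
    return 1.0 / sum(w * w for w in weights)
```

With a sparse cost, most rollouts receive nearly identical, uninformative costs and the few informative ones dominate the weights, so the effective sample size collapses. This is one way to see why the quality of the sampler matters so much.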
Cross Entropy
Efficient estimation of the solution to the path integral control problem critically depends on the quality of the collected samples. As shown by [11], the best sampling strategy for estimating the optimal solution is, ironically, the optimal policy itself. (Footnote: Thijssen and Kappen [11] show that the variance of the desirability function estimate approaches zero when the optimal policy is used as the sampling distribution.) Based on this observation, open-loop approaches such as [12, 13] use the latest estimate of the optimal policy to generate trajectory samples. However, they do not provide a systematic approach to control the variance of the sampling policy, which could lead to an inefficient sampling method, in particular if the underlying system is unstable.
A policy that also controls the variance of the sampled state trajectories requires state-dependent feedback [10]. However, finding such a feedback policy is equivalent to estimating the optimal sampling policy over the entire state space for each time. The challenge of this approach lies in the design and the update scheme of such a policy.
A common approach to tackle this issue is the cross-entropy method, which is an adaptive importance sampling scheme to estimate the probability of rare events. In this approach, the optimal sampler is estimated by a sequence of more tractable distributions, which are iteratively improved based on the collected samples. To this end, the cross-entropy between the state probability distributions of the optimal () and the current sampler () is minimized, where the cross-entropy is defined as
It follows that minimizing the cross-entropy with respect to , is equivalent to minimizing the KL divergence between and . This method has proven to be useful for path-integral-based problems. For example, Kappen and Ruiz [10] have employed a parameterized distribution to estimate the optimal sampler. Similarly, we here use a cross-entropy approach to estimate the optimal sampler. However, instead of a parameterized distribution family, we use an MPC strategy to estimate the optimal sampler implicitly.
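The equivalence between cross-entropy and KL minimization can be spelled out in one short derivation. The notation here is an assumption made for illustration: $p^{*}$ denotes the state trajectory distribution of the optimal sampler, $p$ that of the current sampler, and $\tau$ a state trajectory.

```latex
\begin{align}
H(p^{*}, p)
  &= -\mathbb{E}_{\tau \sim p^{*}}\left[\log p(\tau)\right] \\
  &= \mathbb{E}_{\tau \sim p^{*}}\left[\log \frac{p^{*}(\tau)}{p(\tau)}\right]
     - \mathbb{E}_{\tau \sim p^{*}}\left[\log p^{*}(\tau)\right] \\
  &= \mathrm{KL}\left(p^{*} \,\|\, p\right) + H(p^{*}).
\end{align}
```

Since the entropy $H(p^{*})$ does not depend on the current sampler, the two objectives differ only by a constant and share the same minimizer.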
Model Predictive Control
The MPC strategy replaces the infinite-horizon optimization problem by a sequence of finite-horizon optimal control problems with prediction horizon , which are numerically more tractable. At each control step, MPC solves the following optimal control problem from the current state and time. Then, only the first segment of the optimized sequence is used until the controller receives a new state and repeats the procedure.
(10)  
subject to  
Here, is the running cost, the heuristic function which accounts for the truncated-time accumulated cost. We denote the state by to clearly distinguish between the actor’s computation and the state of the MDP (indicated by ). Note that the system dynamics are deterministic, and therefore the optimal control problem is formulated deterministically.
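The receding-horizon procedure can be illustrated with a deliberately small sketch. Everything specific in it is an assumption made for illustration: the model is a hypothetical one-dimensional single integrator, the controls come from a small discrete candidate set, and the finite-horizon problem is solved by brute-force enumeration instead of the gradient-based solvers normally used for MPC. Only the structure (optimize over the horizon, apply the first control, repeat from the new state) reflects the description above.

```python
import itertools


def rollout_cost(x0, controls, dt, running_cost, heuristic):
    """Cost of a control sequence under the deterministic nominal model."""
    x = x0
    total = 0.0
    for u in controls:
        total += running_cost(x, u) * dt
        x = x + u * dt  # hypothetical model: single integrator
    return total + heuristic(x)  # heuristic accounts for the truncated tail


def mpc_step(x0, horizon_steps, dt, candidates, running_cost, heuristic):
    """Solve the finite-horizon problem by enumeration; return first control."""
    best = min(
        itertools.product(candidates, repeat=horizon_steps),
        key=lambda seq: rollout_cost(x0, seq, dt, running_cost, heuristic),
    )
    return best[0]


def run_mpc(x0, n_steps, horizon_steps, dt, candidates, running_cost, heuristic):
    """Receding-horizon loop: re-plan from every newly observed state."""
    x = x0
    trajectory = [x]
    for _ in range(n_steps):
        u = mpc_step(x, horizon_steps, dt, candidates, running_cost, heuristic)
        x = x + u * dt  # here the true system equals the nominal model
        trajectory.append(x)
    return trajectory
```

For example, with `running_cost = lambda x, u: x * x + 0.01 * u * u`, `heuristic = lambda x: 10.0 * x * x`, and candidates `(-1.0, 0.0, 1.0)`, the loop drives a state starting at 1.0 towards the origin and then holds it there.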
4 Deep Value Model Predictive Control
Our DMPC algorithm is an actor-critic approach where a value function is used to provide a measure of how well an MPC actor performs. We assume that a model of the robot dynamics is available to the agent and that this internal nominal model is accurate. In the following, we briefly discuss the structure of the critic and the actor.
DMPC Critic
The goal of the critic is to assess the performance of the actor by means of the value function. When the actor interacts with the environment (i.e., rolls out the policy), it collects transition tuples [7]. Using these samples, the critic is able to compute the empirical value along each trajectory and thus refine its estimate of the actual value function. The value function is represented by a function approximator parametrized by . Computing the one-step Bellman target for a sample taken at time , i.e.,
(11) 
where is the timestep and a cost reflecting the performance on a given task, the critic can refine the value function estimate by solving
(12) 
Similar to [5], while the target depends on , we neglect this dependency during the optimization.
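The idea of treating the target as a constant during the optimization can be sketched as a semi-gradient update. The specifics below are assumptions made for illustration only: the target is taken to be the cost accumulated over one timestep plus a discounted bootstrap value, the approximator is a small polynomial model standing in for a neural network, and all function names are hypothetical.

```python
def features(x):
    """Hypothetical polynomial features standing in for a neural network."""
    return [1.0, x, x * x]


def value(weights, x):
    return sum(w * f for w, f in zip(weights, features(x)))


def bellman_target(weights, cost, x_next, dt, gamma, terminal):
    """Assumed one-step target: cost * dt + gamma * V(x_next)."""
    bootstrap = 0.0 if terminal else gamma * value(weights, x_next)
    return cost * dt + bootstrap


def critic_step(weights, batch, lr, dt, gamma):
    """One semi-gradient step on the mean squared Bellman error.

    batch: list of (x, cost, x_next, terminal) tuples.
    """
    # Targets are computed first and then held fixed, so no gradient
    # flows through the bootstrap term.
    targets = [bellman_target(weights, c, xn, dt, gamma, term)
               for (_, c, xn, term) in batch]
    grad = [0.0] * len(weights)
    for (x, _, _, _), y in zip(batch, targets):
        error = value(weights, x) - y
        for i, f in enumerate(features(x)):
            grad[i] += error * f / len(batch)
    return [w - lr * g for w, g in zip(weights, grad)]


def bellman_loss(weights, batch, dt, gamma):
    """Half the mean squared Bellman error over the batch."""
    errors = [value(weights, x)
              - bellman_target(weights, c, xn, dt, gamma, term)
              for (x, c, xn, term) in batch]
    return sum(e * e for e in errors) / (2 * len(batch))
```

In an automatic differentiation framework, the same effect is usually obtained by wrapping the target in a stop-gradient operation or by evaluating it with a separate target network.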
DMPC Actor
The DMPC actor is a trajectory optimizer as defined in equation (10), where the heuristic function and the running cost are defined as
(13)  
(14) 
with .
This MPC formulation replaces the original horizon problem by a sequence of finite-horizon optimal control subproblems with prediction horizon (). The dynamic programming principle ensures that if the subproblems are formulated with the exact value function as a heuristic function, then the MPC method solves the problem globally. However, in practice, the actor only has an approximation of the optimal value function. While the approximation error degrades the performance of greedy policies, the MPC benefits from the lookahead mechanism. It can be shown that if the horizon of the actor is long enough, the effect of the value function error is mitigated [8]. Therefore, MPC strategies using an approximate value function for the heuristic function, in general, outperform methods that are based on the instantaneous minimization of the value function estimate.
Another significant advantage of using such an approach is that we can compute a temporally global optimal sequence of actions using the temporally local solution of the MPC. This further allows us to tune the MPC time horizon based on the available computational resource and use the value function to account for the truncated accumulated cost.
Actor-Critic Interaction
The procedure for the DMPC algorithm directly follows: From an estimate of given by the critic, the MPC actor computes the policy by solving Equation (10). The critic improves the estimate based on the collected samples using Equation (12). This results in an off-policy actor-critic method since instead of assessing the value of the current policy, the actor directly estimates the optimal value function. The main steps of DMPC are outlined in Algorithm 1.
The MPC policy is an importance sampler for the value function learner. To motivate this, consider a problem with a sparse cost. Since the MPC actor predicts the state evolution, it can foresee less costly areas of the state space and propagate this information back to the current time. As a result, it can coordinate the action sequence to steer the agent towards future rewarding regions. In the next section, we provide more formal insights and show that MPC minimizes an upper bound of the cross-entropy between the state trajectory distribution of the optimal policy and the current policy.
By repeating the actor-critic interactions, the estimation of the value function gets closer to the optimal one, which, in turn, improves the MPC performance by enhancing the estimate of the truncated cumulative cost.
5 DMPC Properties
In this section, we analyze the properties of the DMPC algorithm. We start with a setup where the MPC actor uses the learned value function only for the heuristic function. There, we study the effect of this actor-critic setup on the convergence of the value function learning. Next, we propose a setting where a running cost is defined based on the value function. We discuss that such a cost extends the application of the actor-critic method to problems with a sparse and non-differentiable cost. Moreover, it allows us to use a deterministic MPC solver instead of a stochastic one.
5.1 Role of the Heuristic Function
As a first step, we focus on the case where we only use the learned value function as the heuristic function of the MPC actor. This setup is similar to the approach proposed by [8]. We here build upon their result and provide a more in-depth analysis of the impact of the MPC actor on the acceleration of the value function convergence.
As discussed, the convergence of the value function critically depends on the sampling distribution, and the optimal sampler for the stochastic optimal control problem defined in Equations (1)-(4) is the optimal control policy [11]. However, we cannot compute the optimal distribution during the learning process; instead, we wish to find a near-optimal control policy such that is close to . As discussed in the introduction of the cross-entropy method, we wish to minimize
(15) 
Theorem 1.
Assuming that from a state the policy remains in the vicinity of the optimal policy up to time , i.e., , an upper bound for the forward KL divergence between the state trajectory distributions of the optimal policy and policy is given by
(16) 
Proof.
The proof is provided in Appendix A.2. ∎
This shows that has an upper bound defined by the reverse KL divergence and the variance of the normalized desirability function using the policy for sampling. Next, we show that the MPC actor minimizes this reverse KL divergence and that its performance is bounded by the approximation error of the value function. Moreover, we show that the quality of the estimation improves as the MPC horizon increases.
Theorem 2.
Suppose that is an approximation of the optimal value function with infinity norm . The policy is the solution to the problem (10) with terminal cost . Then for all MDPs, the reverse KL divergence can be bounded as
(17) 
Proof.
The proof is provided in Appendix A.3. ∎
Note that if then . Therefore, as the horizon increases the MPC is less susceptible to the value function approximation error.
5.2 Role of the Running Cost
In this section, we motivate the choice of the DMPC running cost without providing a formal proof. The goal is to see how we can extend the idea of using the value function in the heuristic function to the running cost. We show that, at least for the case where we have the optimal value function, this is indeed possible. Thus, for the following analysis, we assume that the exact optimal value function is provided. Later, in the results section, we empirically show that for scenarios with sparse costs, the running cost based on the approximate value function also accelerates the convergence of the critic.
Proposition 1.
The control policy that minimizes the reverse KL divergence of the problem defined in Equations (1)-(4) is also the solution to the deterministic problem
(18) 
where , the running cost, and , the heuristic function are defined as
(19)  
(20) 
with is the optimal value function of the stochastic problem defined in (6) and the state trajectory evolves based on the following deterministic dynamics
(21) 
Proof.
The proof is provided in Appendix A.4. ∎
Proposition 1 has an important implication; the primary stochastic optimization problem is transformed into a deterministic one. Therefore, a deterministic MPC solver such as the one defined in Equation (10) can be used. This ultimately means that in order to find the optimal sampling policy, we only need to solve a deterministic MPC problem. This further allows us to employ more sophisticated tools from deterministic optimal control, e.g., Differential Dynamic Programming (DDP) [14].
6 Results
In this section, we describe the implementation details of the DMPC pipeline. We describe how the different components are designed and highlight the techniques used to make the interaction between the actor and the critic more robust. We then present the experiment results.
Critic Implementation
During the rollout of the policy, the critic stores the transition tuples in a replay buffer. Regularly, it samples minibatches of size to update the value function. The value function is represented by a multilayer perceptron (in our experiments, 3 layers with 12 units each and tanh activations). Following an approach similar to [15] and [16], we condition the value function on a goal and a time to reach the goal. It can be interpreted as the value of being in a particular state if a specific target has to be reached within a given amount of time. Since the derivatives of the network are computed on the actor side, particular attention has to be given to the architecture and the training procedure. The choice of a differentiable activation function, such as tanh, is necessary to guarantee the differentiability of the whole network. Moreover, we noticed that decaying the weights results in smoother behavior. Finally, the critic uses a target network [5] so that the actor only receives a Polyak averaged version of the value function.
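The bookkeeping described above (a bounded replay buffer, minibatch sampling, weight decay, and a Polyak-averaged target) can be sketched in a few lines. This is a generic illustration and not the authors' implementation: the class and function names, the flat list of parameters, and the averaging rate argument are assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer; the oldest transitions are dropped first."""

    def __init__(self, capacity, seed=0):
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, transition):
        self._storage.append(transition)

    def sample(self, batch_size):
        """Uniform minibatch without replacement (smaller if buffer is short)."""
        k = min(batch_size, len(self._storage))
        return self._rng.sample(list(self._storage), k)

    def __len__(self):
        return len(self._storage)


def polyak_update(target_params, online_params, rate):
    """Move the target parameters a fraction `rate` towards the online ones."""
    return [(1.0 - rate) * t + rate * o
            for t, o in zip(target_params, online_params)]


def decayed_gradient(grad, params, weight_decay):
    """L2 weight decay expressed as an extra gradient term."""
    return [g + weight_decay * p for g, p in zip(grad, params)]
```

A small averaging rate makes the value function seen by the actor change slowly between updates, which is one plausible reason why the derivatives used by the trajectory optimizer remain well behaved.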
Actor Implementation
6.1 Experimental Results
With the following experiments in simulation, we would like to confirm the theory and intuition derived above. More specifically, we would like to show that the algorithm is capable of handling a sparse binary reward setup, that using the running cost (19) yields better convergence, and highlight the importance of the MPC time horizon during learning.
Our experiments are based on the Ballbot robot (see Appendix B for more details). In order to answer the questions above, we design a task where the agent has to reach a target in 3 seconds from any initial position. Additionally, we add walls to the environment so that, most of the time, the agent cannot reach the target via the shortest path (see Figure 1(a)). While the target reaching cost is simply encoded as the Euclidean distance from the current position to the goal, the walls are encoded by a termination of the episode with a fixed penalty. First, we analyze the system for different MPC time horizons. Then, we assess the performance when the running cost (19) is left out (i.e., we only use the value function as the heuristic function), which corresponds to the vanilla case [8]. Trajectories of the Ballbot’s center of mass for different starting positions are shown in Figure 1(a).


Influence of the Running Cost
As shown in Figure 1(b), the running cost plays a vital role in solving the task. While the actors devoid of the running cost solve the reaching task well (it is a smooth and straightforward cost after all), they often collide with the walls. When the horizon is too long for them, the optimization is prone to overlooking the wall. This is because the value is only used at the end of the trajectory. Also, the exploration provided by the value function is not sufficient, and the Ballbot often collides with the wall. When using the running cost, the task is solved successfully upon convergence. The impact of that term is shown in Figure 3. The information encoded in the value function is able to produce regions of low cost and guide the solution to avoid the obstacles along the path. When the robot starts on the left side of the field, a region of low cost on the bottom right side is created, encouraging the robot to move forward and avoid the obstacle. The opposite happens when the robot starts on the right. Moreover, at the beginning of the trajectories, regions of high cost at the top of the field encourage the robot to move towards the goal.
Influence of the Horizon on the Convergence of the Value Function
As described in the previous section, the time horizon of the MPC plays a vital role. Indeed, with a longer time horizon, the actor can look further into the future during the trajectory optimization step. It accelerates the convergence of the value function and improves exploration. As can be seen in Figure 1(b), during training, the actor with the longer time horizon outperforms the one with a shorter one. The actor with a longer time horizon is able to see beyond the wall faster in order to reach the target. The actor with the shorter horizon tends to focus more on the wall and needs more time to get past it. On the other hand, increasing the time horizon results in a higher computational cost for the trajectory optimizer so that a tradeoff has to be made.
7 Related Work
The use of learning in conjunction with planning has been studied extensively in the past. Prominently, in the discrete domain, learning evaluation functions has been employed with tree search methods resulting in systems capable of planning over long time horizons [4, 18]. In inverse reinforcement learning/optimal control [19, 20, 21], the planner is taught to match the behavior of an expert by inferring a cost function. These methods, however, are bounded by the performance of the expert, which is assumed to showcase optimal behavior. In our problem setting, the cost function is inferred indirectly via the value function that is learned from the environment-issued rewards/costs. The performance is only bounded by the capacity to learn the optimal value function.
The use of trajectory optimization with value function learning has been studied most recently in the Plan Online Learn Offline framework [8]. By using the value function as the terminal cost of their trajectory optimizer, they show improved performance of the policy beyond the optimizer’s time horizon. Here, we further extend their idea to handle stochastic systems explicitly and formulate the optimizer’s running cost such that it results in an importance sampler of the optimal value function.
When combining planning and learning, exploration plays a key role because it is difficult to efficiently cover the task-relevant state space in high-dimensional problems. To this end, methods such as path integral optimal control employ importance sampling schemes [10]. In RL, methods such as Guided Policy Search [22] use a planner to direct the policy learning stage and sample more efficiently from high reward regions.
8 Conclusion and Future Work
In this paper, we presented an off-policy actor-critic algorithm called DMPC that extends previous work on the combination of trajectory optimization and global value function learning. We first showed that using a value function in the heuristic function leads to a temporally global optimal solution. Next, we showed that using the running cost (19) not only results in an importance sampling scheme that improves the convergence of the value function estimation but is also capable of taking system uncertainties into account. In future work, we would like to extend the value function to encode more information. For example, we could condition it on a local robot-centric map that would allow it to make decisions in dynamic environments to avoid obstacles.
This work was supported by NVIDIA, the Swiss National Science Foundation through the National Centre of Competence in Research Robotics (NCCR Robotics), and the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 852044.
References
 Alexis et al. [2011] K. Alexis, C. Papachristos, G. Nikolakopoulos, and A. Tzes. Model predictive quadrotor indoor position control. In 2011 19th Mediterranean Conference on Control Automation (MED), pages 1247–1252, June 2011. doi: 10.1109/MED.2011.5983144.
 Farshidian et al. [2017] F. Farshidian, E. Jelavic, A. Satapathy, M. Giftthaler, and J. Buchli. Real-time motion planning of legged robots: A model predictive control approach. In Humanoids, pages 577–584, 2017. doi: 10.1109/HUMANOIDS.2017.8246930.
 Koenemann et al. [2015] J. Koenemann, A. D. Prete, Y. Tassa, E. Todorov, O. Stasse, M. Bennewitz, and N. Mansard. Whole-body model-predictive control applied to the HRP-2 humanoid. In IROS, pages 3346–3351, 2015.
 Silver et al. [2017] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis. Mastering the game of go without human knowledge. Nature, 550:354–, Oct. 2017. URL http://dx.doi.org/10.1038/nature24270.
 Lillicrap et al. [2016] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. In 4th International Conference on Learning Representations, ICLR 2016, Conference Track Proceedings, 2016. URL http://arxiv.org/abs/1509.02971.
 Hwangbo et al. [2019] J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun, and M. Hutter. Learning agile and dynamic motor skills for legged robots. Science Robotics, 4(26), 2019. doi: 10.1126/scirobotics.aau5872. URL https://robotics.sciencemag.org/content/4/26/eaau5872.
 Sutton and Barto [2018] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018. URL http://incompleteideas.net/book/the-book-2nd.html.
 Lowrey et al. [2019] K. Lowrey, A. Rajeswaran, S. Kakade, E. Todorov, and I. Mordatch. Plan online, learn offline: Efficient learning and exploration via model-based control. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Byey7n05FQ.
 Kappen [2007] H. J. Kappen. An introduction to stochastic control theory, path integrals and reinforcement learning. In AIP conference proceedings, volume 887, pages 149–181. AIP, 2007.
 Kappen and Ruiz [2016] H. J. Kappen and H. C. Ruiz. Adaptive importance sampling for control and inference. Journal of Statistical Physics, 162(5):1244–1266, Mar 2016. ISSN 1572-9613. doi: 10.1007/s10955-016-1446-7.
 Thijssen and Kappen [2015] S. Thijssen and H. Kappen. Path integral control and state-dependent feedback. Physical Review E, 91(3):032104, 2015.
 Theodorou et al. [2010] E. Theodorou, J. Buchli, and S. Schaal. A generalized path integral control approach to reinforcement learning. Journal of Machine Learning Research, 11(Nov):3137–3181, 2010.
 Williams et al. [2016] G. Williams, P. Drews, B. Goldfain, J. M. Rehg, and E. A. Theodorou. Aggressive driving with model predictive path integral control. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 1433–1440. IEEE, 2016.
 Mayne [1966] D. Mayne. A second-order gradient method for determining optimal trajectories of non-linear discrete-time systems. International Journal of Control, 3(1):85–95, 1966.
 Pong* et al. [2018] V. Pong*, S. Gu*, M. Dalal, and S. Levine. Temporal difference models: Model-free deep RL for model-based control. In 6th International Conference on Learning Representations (ICLR), May 2018. URL https://openreview.net/forum?id=Skw0n-W0Z&noteId=Skw0n-W0Z. *equal contribution.
 Schaul et al. [2015] T. Schaul, D. Horgan, K. Gregor, and D. Silver. Universal value function approximators. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, pages 1312–1320. JMLR.org, 2015. URL http://dl.acm.org/citation.cfm?id=3045118.3045258.
 Farshidian et al. [2017] F. Farshidian, M. Neunert, A. W. Winkler, G. Rey, and J. Buchli. An efficient optimal planning and control framework for quadrupedal locomotion. In ICRA, pages 93–100. IEEE, 2017.
 Guez et al. [2018] A. Guez, T. Weber, I. Antonoglou, K. Simonyan, O. Vinyals, D. Wierstra, R. Munos, and D. Silver. Learning to search with MCTSnets, 2018. URL http://arxiv.org/pdf/1802.04697v2.
 Ross et al. [2010] S. Ross, G. J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. Journal of Machine Learning Research - Proceedings Track, 15, 11 2010.
 Abbeel and Ng [2004] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML ’04, pages 1–, New York, NY, USA, 2004. ACM. ISBN 1-58113-838-5. doi: 10.1145/1015330.1015430. URL http://doi.acm.org/10.1145/1015330.1015430.
 Dvijotham and Todorov [2010] K. Dvijotham and E. Todorov. Inverse optimal control with linearly-solvable MDPs. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, pages 335–342, USA, 2010. Omnipress. ISBN 978-1-60558-907-7. URL http://dl.acm.org/citation.cfm?id=3104322.3104366.
 Levine and Koltun [2013] S. Levine and V. Koltun. Guided policy search. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ICML’13, pages III–1–III–9. JMLR.org, 2013. URL http://dl.acm.org/citation.cfm?id=3042817.3042937.
 Minniti et al. [2019] M. V. Minniti, F. Farshidian, R. Grandia, and M. Hutter. Whole-body MPC for a dynamically stable mobile manipulator. ArXiv, abs/1902.10415, 2019.
Appendix A Proof of Theorems
A.1 Girsanov Theorem
For the stochastic process defined in Equation (1), it is possible to relate the state trajectory distribution of two policies and via the Girsanov theorem. Here, we briefly outline the steps of an informal derivation by discretizing the state trajectory with infinitesimal time . The distribution of conditioned on is a Gaussian distribution with mean and variance . As a result, for the state trajectory distribution under policy we have
After further simplifications,
Thus we have
(22) 
Using the Girsanov theorem, it can be readily shown that the KL divergence between and takes the following form
(23) 
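The discretization argument above can be checked numerically. The following is a minimal sketch, not the paper's setup: it assumes a scalar system in which the control and the noise enter through the same channel, dx = (drift(x) + u(x)) dt + sigma dw, and the names `path_kl_estimates`, `drift`, `u1`, and `u2` are hypothetical. Each Euler-Maruyama step is Gaussian with variance sigma^2 dt, so the log-ratio of the two discretized path densities, averaged under the first policy, should agree with the accumulated control-difference energy (u1 - u2)^2 dt / (2 sigma^2).

```python
import math
import random


def path_kl_estimates(u1, u2, drift, sigma, x0, dt, n_steps, n_paths, seed=0):
    """Estimate KL(p1 || p2) between discretized path distributions.

    Assumed (illustrative) dynamics: dx = (drift(x) + u(x)) dt + sigma dw.
    Returns two averages over paths sampled under policy u1:
    the Monte Carlo log-density ratio and the control-energy expression.
    """
    rng = random.Random(seed)
    var = sigma ** 2 * dt  # variance of a single Euler-Maruyama step
    total_log_ratio = 0.0
    total_energy = 0.0
    for _ in range(n_paths):
        x = x0
        for _ in range(n_steps):
            # One-step Gaussian means under each policy.
            m1 = x + (drift(x) + u1(x)) * dt
            m2 = x + (drift(x) + u2(x)) * dt
            x_next = rng.gauss(m1, math.sqrt(var))
            # log N(x_next; m1, var) - log N(x_next; m2, var)
            total_log_ratio += ((x_next - m2) ** 2 - (x_next - m1) ** 2) / (2.0 * var)
            # KL between the two one-step Gaussians, given the current state.
            total_energy += (u1(x) - u2(x)) ** 2 * dt / (2.0 * sigma ** 2)
            x = x_next
    return total_log_ratio / n_paths, total_energy / n_paths


if __name__ == "__main__":
    kl_mc, kl_energy = path_kl_estimates(
        u1=lambda x: -x, u2=lambda x: 0.0, drift=lambda x: 0.0,
        sigma=1.0, x0=1.0, dt=0.01, n_steps=100, n_paths=2000,
    )
    print(f"log-ratio estimate: {kl_mc:.3f}, energy expression: {kl_energy:.3f}")
```

The two returned numbers differ only by Monte Carlo noise, which shrinks as `n_paths` grows. With identical policies both are exactly zero.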
A.2 Proof of Theorem 1
Proof.
Based on the Girsanov theorem and Equation (23), the KL-divergence between the optimal and the current state trajectory distributions, starting from an initial time and an initial state , can be written as
Equation (8) defines the relationship between the state trajectory probability distributions of the optimal policy, , and the sampling policy, as
where we have replaced the second term by . The second line naturally follows by using Equation (23) for . Now we further examine the term .
In the first line, we used the Cauchy-Schwarz inequality. Then, in the second line, we used the definition of the desirability function in Equation (9). Finally, in the third line, we used the Bhatia-Davis inequality, which provides an upper bound on the variance of a bounded probability distribution. For the above we have
Thus, based on the BhatiaDavis inequality we can write
Thus, we will have the following upper bound on
∎
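For reference, the Bhatia-Davis inequality used in the proof above states that a random variable with mean mu and support in [m, M] has variance at most (M - mu)(mu - m). The sketch below checks this on an empirical distribution. The sampled quantity, exp(-cost) with cost drawn uniformly from [0, 2], is only an illustrative bounded positive variable and is not the desirability function of Equation (9); `bhatia_davis_check` is a hypothetical helper name.

```python
import math
import random


def bhatia_davis_check(samples, lower, upper):
    """Return (variance, bound) for samples known to lie in [lower, upper].

    The bound is the Bhatia-Davis expression (upper - mean) * (mean - lower),
    evaluated with the empirical mean of the samples.
    """
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((s - mean) ** 2 for s in samples) / n  # population variance
    bound = (upper - mean) * (mean - lower)
    return variance, bound


if __name__ == "__main__":
    rng = random.Random(0)
    # Illustrative bounded variable: exp(-cost), with cost in [0, 2].
    values = [math.exp(-rng.uniform(0.0, 2.0)) for _ in range(10000)]
    variance, bound = bhatia_davis_check(values, math.exp(-2.0), 1.0)
    print(f"variance = {variance:.4f} <= bound = {bound:.4f}")
```

The bound is tight only when all probability mass sits at the two endpoints, for example `bhatia_davis_check([0, 0, 1, 1], 0, 1)`, where variance and bound are both 0.25.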
A.3 Proof of Theorem 2
Proof.
By using the relationship between the state trajectory probability distribution of the optimal policy, , and the sampling policy, , defined in Equation (8), we get
(24) 
where we used Equation (7) and then .
The rest of this proof follows similarly to the result presented in [8]. The difference is that our formulation is continuous-time, while the formulation in [8] is discrete-time. For the sake of brevity, we will use during this proof.
(25) 
where and . We here truncate the horizon of optimization to the time horizon of MPC. To compensate for the truncated cost, we use the optimal value function as the termination cost.
When the actor uses an MPC strategy, it only has access to the approximated value function.
(26) 
We can then write the KL divergence as
Adding and subtracting
(27) 
Using our assumption ,
using these bounds in Equation (27), we get
(28) 
Since is generated to minimize , it follows that
We then have
(29) 
Recursively applying the KL bound to , we get
∎
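The truncation step in the proof above, where the optimal value function serves as the termination cost of the shortened horizon, can be illustrated on a problem with a closed-form solution. The sketch below is an assumption-laden toy, not the paper's continuous-time stochastic setting: it uses a deterministic, discrete-time, scalar linear-quadratic problem x' = a x + b u with stage cost q x^2 + r u^2, and all function names are hypothetical. With the exact infinite-horizon value function as terminal cost, a short-horizon optimizer reproduces the infinite-horizon feedback gain; with a zero terminal cost, it does not.

```python
def riccati_step(p, a, b, q, r):
    """One backward Riccati update for x' = a*x + b*u with cost q*x^2 + r*u^2."""
    return q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)


def gain(p, a, b, r):
    """Optimal gain (u = -gain * x) when the next-step cost-to-go is p * x^2."""
    return a * b * p / (r + b * b * p)


def infinite_horizon_p(a, b, q, r, iters=1000):
    """Fixed point of the Riccati update, i.e. the optimal value V(x) = p * x^2."""
    p = q
    for _ in range(iters):
        p = riccati_step(p, a, b, q, r)
    return p


def first_step_gain(a, b, q, r, horizon, p_terminal):
    """Gain applied at the first step of a truncated-horizon problem
    whose termination cost is p_terminal * x^2."""
    p = p_terminal
    k = 0.0
    for _ in range(horizon):
        k = gain(p, a, b, r)
        p = riccati_step(p, a, b, q, r)
    return k


if __name__ == "__main__":
    a, b, q, r = 1.2, 1.0, 1.0, 1.0  # illustrative unstable scalar system
    p_star = infinite_horizon_p(a, b, q, r)
    print("infinite-horizon gain:     ", gain(p_star, a, b, r))
    print("horizon 3, value terminal: ", first_step_gain(a, b, q, r, 3, p_star))
    print("horizon 3, zero terminal:  ", first_step_gain(a, b, q, r, 3, 0.0))
```

In the paper's setting the actor only has an approximated value function, so the match is no longer exact, which is what the bound derived in this proof quantifies.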
A.4 Proof of Proposition 1
First, we note that the HJB equation for the problem defined in Equations (1)-(4) with has the form
(30) 
with and . For the sake of brevity, we define and .
The optimal control policy can be derived as
(31) 
Lemma 1.
Proof.
The proof follows directly from the definitions of the HJB equations for the stochastic and deterministic optimal control problems. ∎
We finally provide the proof of Proposition 1.
Proof.
Starting from the Girsanov theorem and Equation (23), we get
(35)  
(36)  
(37) 
In Equation (35) we have replaced using Equation (31); in Equation (36) we have used Itô’s Lemma for the process
(38) 
After reordering the terms in Equation (A.4), taking the time integral over the interval , and taking the expectation with respect to , we get
(39) 
where the term involving cancels out.
As a result, the policy which minimizes the reverse KL-divergence can be derived as
(40) 
This expectation is taken with respect to the probability distribution generated by the stochastic system in Equation (1). According to Lemma 1, we can transform this optimization problem into the equivalent deterministic problem in Equation (18). ∎
Appendix B Experimental Details
Platform
The Ballbot robot depicted in Figure 4 is a 3D inverted pendulum that balances on a ball with the help of three actuators. The mathematical formulation of the system dynamics can be found in [23]. Because it balances on a single ball, it is dynamically stable, omnidirectional, and capable of agile movements. Due to its inherent instability and highly nonlinear dynamics, this robot can also serve as a testing platform for validating general control algorithms, which may then be applied to other types of mobile platforms.
Experiments
An advantage of using an MPC strategy is that a nominal controller can be used to stabilize the system. In these experiments, we add a cost term to the trajectory optimizer that stabilizes the system in an upright position. As a result, the system stays upright from the beginning, which results in much faster convergence to the desired behaviour. This is not a limitation of the pipeline, since the value function could also be trained to encode the upright stabilization.
For the results in Figure 1(b), we tuned the hyperparameters to achieve the best performance for each scenario. The performance at each learning iteration is the average performance over 8 trajectories sampled from different starting positions.