# Guided Policy Search as Approximate Mirror Descent

###### Abstract

Guided policy search algorithms can be used to optimize complex nonlinear policies, such as deep neural networks, without directly computing policy gradients in the high-dimensional parameter space. Instead, these methods use supervised learning to train the policy to mimic a “teacher” algorithm, such as a trajectory optimizer or a trajectory-centric reinforcement learning method. Guided policy search methods provide asymptotic local convergence guarantees by construction, but it is not clear how much the policy improves within a small, finite number of iterations. We show that guided policy search algorithms can be interpreted as an approximate variant of mirror descent, where the projection onto the constraint manifold is not exact. We derive a new guided policy search algorithm that is simpler and provides appealing improvement and convergence guarantees in simplified convex and linear settings, and show that in the more general nonlinear setting, the error in the projection step can be bounded. We provide empirical results on several simulated robotic navigation and manipulation tasks that show that our method is stable and achieves similar or better performance when compared to prior guided policy search methods, with a simpler formulation and fewer hyperparameters.

Guided Policy Search as Approximate Mirror Descent

William Montgomery Dept. of Computer Science and Engineering University of Washington wmonty@cs.washington.edu Sergey Levine Dept. of Computer Science and Engineering University of Washington svlevine@cs.washington.edu

## 1 Introduction

Policy search algorithms based on supervised learning from a computational or human “teacher” have gained prominence in recent years due to their ability to optimize complex policies for autonomous flight [16], video game playing [15; 4], and bipedal locomotion [11]. Among these methods, guided policy search algorithms [6] are particularly appealing due to their ability to adapt the teacher to produce data that is best suited for training the final policy with supervised learning. Such algorithms have been used to train complex deep neural network policies for vision-based robotic manipulation [6], as well as a variety of other tasks [19; 11]. However, convergence results for these methods typically follow by construction from their formulation as a constrained optimization, where the teacher is gradually constrained to match the learned policy, and guarantees on the performance of the final policy only hold at convergence if the constraint is enforced exactly. This is problematic in practical applications, where such algorithms are typically executed for a small number of iterations.

In this paper, we show that guided policy search algorithms can be interpreted as approximate variants of mirror descent under constraints imposed by the policy parameterization, with supervised learning corresponding to a projection onto the constraint manifold. Based on this interpretation, we can derive a new, simplified variant of guided policy search, which corresponds exactly to mirror descent under linear dynamics and convex policy spaces. When these convexity and linearity assumptions do not hold, we can show that the projection step is approximate, up to a bound that depends on the step size of the algorithm, which suggests that for a small enough step size, we can achieve continuous improvement. The form of this bound provides us with intuition about how to adjust the step size in practice, so as to obtain a simple algorithm with a small number of hyperparameters.

The main contribution of this paper is a simple new guided policy search algorithm that can train complex, high-dimensional policies by alternating between trajectory-centric reinforcement learning and supervised learning, as well as a connection between guided policy search methods and mirror descent. We also extend previous work on bounding policy cost in terms of KL divergence [15; 17] to derive a bound on the cost of the policy at each iteration, which provides guidance on how to adjust the step size of the method. We provide empirical results on several simulated robotic navigation and manipulation tasks that show that our method is stable and achieves similar or better performance when compared to prior guided policy search methods, with a simpler formulation and fewer hyperparameters.

## 2 Guided Policy Search Algorithms

We first review guided policy search methods and background. Policy search algorithms aim to optimize a parameterized policy over actions conditioned on the state . Given stochastic dynamics and cost , the goal is to minimize the expected cost under the policy’s trajectory distribution, given by , where we overload notation to use to denote the marginals of , where denotes a trajectory. A standard reinforcement learning (RL) approach to policy search is to compute the gradient and use it to improve [18; 14]. The gradient is typically estimated using samples obtained from the real physical system being controlled, and recent work has shown that such methods can be applied to very complex, high-dimensional policies such as deep neural networks [17; 10]. However, for complex, high-dimensional policies, such methods tend to be inefficient, and practical real-world applications of such model-free policy search techniques are typically limited to policies with about one hundred parameters [3].

Instead of directly optimizing , guided policy search algorithms split the optimization into a “control phase” (which we’ll call the C-step) that finds multiple simple local policies that can solve the task from different initial states , and a “supervised phase” (S-step) that optimizes the global policy to match all of these local policies using standard supervised learning. In fact, a variational formulation of guided policy search [7] corresponds to the EM algorithm, where the C-step is actually the E-step, and the S-step is the M-step. The benefit of this approach is that the local policies can be optimized separately using domain-specific local methods. Trajectory optimization might be used when the dynamics are known [19; 11], while local RL methods might be used with unknown dynamics [5; 6], which still requires samples from the real system, though substantially fewer than the direct approach, due to the simplicity of the local policies. This sample efficiency is the main advantage of guided policy search, which can train policies with nearly a hundred thousand parameters for vision-based control using under 200 episodes [6], in contrast to direct deep RL methods that might require orders of magnitude more experience [17; 10].

A generic guided policy search method is shown in Algorithm 1. The C-step invokes a local policy optimizer (trajectory optimization or local RL) for each on line 2, and the S-step uses supervised learning to optimize the global policy on line 3 using samples from each , which are generated during the C-step. On line 4, the surrogate cost for each is adjusted to ensure convergence. This step is crucial, because supervised learning does not in general guarantee that will achieve similar long-horizon performance to [15]. The local policies might not even be reproducible by a single global policy in general. To address this issue, most guided policy search methods have some mechanism to force the local policies to agree with the global policy, typically by framing the entire algorithm as a constrained optimization that seeks at convergence to enforce equality between and each . The form of the overall optimization problem resembles dual decomposition, and usually looks something like this:

(1) |

Since , we have when the constraints are enforced exactly. The particular form of the constraint varies depending on the method: prior works have used dual gradient descent [8], penalty methods [11], ADMM [12], and Bregman ADMM [6]. We omit the derivation of these prior variants due to space constraints.

### 2.1 Efficiently Optimizing Local Policies

A common and simple choice for the local policies is to use time-varying linear-Gaussian controllers of the form , though other options are also possible [12; 11; 19]. Linear-Gaussian controllers represent individual trajectories with linear stabilization and Gaussian noise, and are convenient in domains where each local policy can be trained from a different (but consistent) initial state . This represents an additional assumption beyond standard RL, but allows for an extremely efficient and convenient local model-based RL algorithm based on iterative LQR [9]. The algorithm proceeds by generating samples on the real physical system from each local policy during the C-step, using these samples to fit local linear-Gaussian dynamics for each local policy of the form using linear regression, and then using these fitted dynamics to improve the linear-Gaussian controller via a modified LQR algorithm [5]. This modified LQR method solves the following optimization problem:

(2) |

where we again use to denote the trajectory distribution induced by and the fitted dynamics . Here, denotes the previous local policy, and the constraint ensures that the change in the local policy is bounded, as proposed also in prior works [1; 14; 13]. This is particularly important when using linearized dynamics fitted to local samples, since these dynamics are not valid outside of a small region around the current controller. In the case of linear-Gaussian dynamics and policies, the KL-divergence constraint can be shown to simplify, as shown in prior work [5] and Appendix A:

and the resulting Lagrangian of the problem in Equation (2) can be optimized with respect to the primal variables using the standard LQR algorithm, which suggests a simple method for solving the problem in Equation (2) using dual gradient descent [5]. The surrogate objective typically includes some term that encourages the local policy to stay close to the global policy , such as a KL-divergence of the form .

### 2.2 Prior Convergence Results

Prior work on guided policy search typically shows convergence by construction, by framing the C-step and S-step as block coordinate ascent on the (augmented) Lagrangian of the problem in Equation (1), with the surrogate cost for the local policies corresponding to the (augmented) Lagrangian, and the overall algorithm being an instance of dual gradient descent [8], ADMM [12], or Bregman ADMM [6]. Since these methods enforce the constraint at convergence (up to linearization or sampling error, depending on the method), we know that at convergence.^{1}^{1}1As mentioned previously, the initial state of each local policy is assumed to be drawn from , hence the outer sum corresponds to Monte Carlo integration of the expectation under . However, prior work does not say anything about at intermediate iterations, and the constraints of policy search in the real world might often preclude running the method to full convergence. We propose a simplified variant of guided policy search, and present an analysis that sheds light on the performance of both the new algorithm and prior guided policy search methods.

## 3 Mirror Descent Guided Policy Search

In this section, we propose our new simplified guided policy search, which we term mirror descent guided policy search (MDGPS). This algorithm uses the constrained LQR optimization in Equation (2) to optimize each of the local policies, but instead of constraining each local policy against the previous local policy , we instead constraint it directly against the global policy , and simply set the surrogate cost to be the true cost, such that . The method is summarized in Algorithm 2. In the case of linear dynamics and a quadratic cost (i.e. the LQR setting), and assuming that supervised learning can globally solve a convex optimization problem, we can show that this method corresponds to an instance of mirror descent [2] on the objective . In this formulation, the optimization is performed on the space of trajectory distributions, with a constraint that the policy must lie on the manifold of policies with the chosen parameterization. Let be the set of all possible policies for a given parameterization, where we overload notation to also let denote the set of trajectory distributions that are possible under the chosen parameterization. The return can be optimized according to . Mirror descent solves this optimization by alternating between two steps at each iteration :

The first step finds a new distribution that minimizes the cost and is close to the previous policy in terms of the divergence , while the second step projects this distribution onto the constraint set , with respect to the divergence . In the linear-quadratic case with a convex supervised learning phase, this corresponds exactly to Algorithm 2: the C-step optimizes , while the S-step is the projection. Monotonic improvement of the global policy follows from the monotonic improvement of mirror descent [2]. In the case of linear-Gaussian dynamics and policies, the S-step, which minimizes KL-divergence between trajectory distributions, in fact only requires minimizing the KL-divergence between policies. Using the identity in Appendix A, we know that

(3) |

### 3.1 Implementation for Nonlinear Global Policies and Unknown Dynamics

In practice, we aim to optimize complex policies for nonlinear systems with unknown dynamics. This requires a few practical considerations. The C-step requires a local quadratic cost function, which can be obtained via Taylor expansion, as well as local linear-Gaussian dynamics , which we can fit to samples as in prior work [5]. We also need a local time-varying linear-Gaussian approximation to the global policy , denoted . This can be obtained either by analytically differentiating the policy, or by using the same linear regression method that we use to estimate , which is the approach in our implementation. In both cases, we get a different global policy linearization around each local policy. Following prior work [5], we use a Gaussian mixture model prior for both the dynamics and global policy fit.

The S-step can be performed approximately in the nonlinear case by using the samples collected for dynamics fitting to also train the global policy. Following prior work [6], our S-step minimizes^{2}^{2}2Note that we flip the KL-divergence inside the expectation, following [6]. We found that this produced better results. The intuition behind this is that, because is proportional to the Q-function of (see Appendix B.1), minimizes the cost-to-go under with respect to , which provides for a more informative objective than the unweighted likelihood in Equation (3).

where is the sample from obtained by running on the real system. For linear-Gaussian and (nonlinear) conditionally Gaussian , where and can be any function (such as a deep neural network), the KL-divergence can easily be evaluated and differentiated in closed form [6]. However, in the nonlinear setting, minimizing this objective no longer minimizes the KL-divergence between trajectory distributions exactly, which means that MDGPS does not correspond exactly to mirror descent: although the C-step can still be evaluated exactly, the S-step now corresponds to an approximate projection onto the constraint manifold. In the next section, we discuss how we can bound the error in this projection. A summary of the nonlinear MDGPS method is provided in Algorithm 4, and additional details are in Appendix B. The samples for linearizing the dynamics and policy can be obtained by running either the last local policy , or the last global policy . Both variants produce good results, and we compare them in Section 6.

### 3.2 Analysis of Prior Guided Policy Search Methods as Approximate Mirror Descent

The main distinction between the proposed method and prior guided policy search methods is that the constraint is enforced on the local policies at each iteration, while in prior methods, this constraint is iteratively enforced via a dual descent procedure over multiple iterations. This means that the prior methods perform approximate mirror descent with step sizes that are adapted (by adjusting the Lagrange multipliers) but not constrained exactly. In our empirical evaluation, we show that our approach is somewhat more stable, though sometimes slower than these prior methods. This empirical observation agrees with our intuition: prior methods can sometimes be faster, because they do not exactly constrain the step size, but our method is simpler, requires less tuning, and always takes bounded steps on the global policy in trajectory space.

## 4 Analysis in the Nonlinear Case

Although the S-step under nonlinear dynamics is not an optimal projection onto the constraint manifold, we can bound the additional cost incurred by this projection in terms of the KL-divergence between and . This analysis also reveals why prior guided policy search algorithms, which only have asymptotic convergence guarantees, still attain good performance in practice even after a small number of iterations. We will drop the subscript from in this section for conciseness, though the same analysis can be repeated for multiple local policies .

### 4.1 Bounding the Global Policy Cost

The analysis in this section is based on the following lemma, which we prove in Appendix C.1, building off of earlier results by Ross et al. [15] and Schulman et al. [17]:

###### Lemma 4.1

Let . Then .

This means that if we can bound the KL-divergence between the policies, then the total variation divergence between their state marginals (given by ) will also be bounded. This bound allows us in turn to relate the total expected costs of the two policies to each other according to the following lemma, which we prove in Appendix C.2:

###### Lemma 4.2

If , then we can bound the total cost of as

where , the maximum total cost from time to .

This bound on the cost of tells us that if we update so as to decrease its total cost or decrease its KL-divergence against , we will eventually reduce the cost of . For the MDGPS algorithm, this bound suggests that we can ensure improvement of the global policy within a small number of iterations by appropriately choosing the constraint during the C-step. Recall that the C-step constrains , so if we choose to be small enough, we can close the gap between the local and global policies. Optimizing the bound directly turns out to produce very slow learning in practice, because the bound is very loose. However, it tells us that we can either decrease toward the end of the optimization process or if we observe the global policy performing much worse than the local policies. We discuss how this idea can be put into action in the next section.

### 4.2 Step Size Selection

In prior work [8], the step size in the local policy optimization is adjusted by considering the difference between the predicted change in the cost of the local policy under the fitted dynamics, and the actual cost obtained when sampling from that policy. The intuition is that, because the linearized dynamics are local, we incur a larger cost the further we deviate from the previous policy. We can adjust the step size by estimating the rate at which the additional cost is incurred and choose the optimal tradeoff. Let denote the expected cost under the previous local policy , the cost under the current local policy and the previous fitted dynamics (which were estimated using samples from and used to optimize ), and the cost of the current local policy under the dynamics estimated using samples from itself. Each of these can be computed analytically under the linearized dynamics. We can view the difference as the additional cost we incur from imperfect dynamics estimation. Previous work suggested modeling the change in cost as a function of as following: , where is the change in cost per unit of KL-divergence, and is additional cost incurred due to inaccurate dynamics [8]. This model is reasonable because the integral of a quadratic cost under a linear-Gaussian system changes roughly linearly with KL-divergence. The additional cost due to dynamics errors is assumes to scale superlinearly, allowing us to solve for by looking at the difference and then solving for a new optimal according to , resulting in the update .

In MDGPS, we propose to use two step size adjustment rules. The first rule simply adapts the previous method to the case where we constrain the new local policy against the global policy , instead of the previous local policy . In this case, we simply replace with the expected cost under the previous global policy, given by , obtained using its linearization . We call this the “classic” step size: .

However, we can also incorporate intuition from the bound in the previous section to obtain a more conservative step adjustment that reduces not only when the obtained local policy improvement doesn’t meet expectations, but also when we detect that the global policy is unable to reproduce the behavior of the local policy. In this case, reducing reduces the KL-divergence between the global and local policies which, as shown in the previous section, tightens the bound on the global policy return. As mentioned previously, directly optimizing the bound tends to perform poorly because the bound is quite loose. However, if we estimate the cost of the global policy using its linearization, we can instead adjust the step size based on a simple model of global policy cost. We use the same model for the change in cost, given by . However, for the term , which reflects the actual cost of the new policy, we instead use the cost of the new global policy , so that now models the additional loss due to both inaccurate dynamics and inaccurate projection: if is much worse than , then either the dynamics were too local, or S-step failed to match the performance of the local policies. In either case, we decrease the step size.^{3}^{3}3Although we showed before that the discrepancy depends on , here we use . This is a simplification, but the net result is the same: when the global policy is worse than expected, is reduced. As before, we can solve for the new step size according to . We call this the “global” step size. Details of how each quantity in this equation is computed are provided in Appendix B.3.

## 5 Relation to Prior Work

While we’ve discussed the connections between MDGPS and prior guided policy search methods, in this section we’ll also discuss the connections between our method and other policy search methods. One popular supervised policy learning methods is DAGGER [15], which also trains the policy using supervised learning, but does not attempt to adapt the teacher to provide better training data. MDGPS removes the assumption in DAGGER that the supervised learning stage has bounded error against an arbitrary teacher policy. MDGPS does not need to make this assumption, since the teacher can be adapted to the limitations of the global policy learning. This is particularly important when the global policy has computational or observational limitations, such as when learning to use camera images for partially observed control tasks or, as shown in our evaluation, blind peg insertion.

When we sample from the global policy , our method resembles policy gradient methods with KL-divergence constraints [14; 13; 17]. However, policy gradient methods update the policy at each iteration by linearizing with respect to the policy parameters, which often requires small steps for complex, nonlinear policies, such as neural networks. In contrast, we linearize in the space of time-varying linear dynamics, while the policy is optimized at each iteration with many steps of supervised learning (e.g. stochastic gradient descent). This makes MDGPS much better suited for quickly and efficiently training highly nonlinear, high-dimensional policies.

## 6 Experimental Evaluation

We compare several variants of MDGPS and a prior guided policy search method based on Bregman ADMM (BADMM) [6]. We evaluate all methods on one simulated robotic navigation task and two manipulation tasks. Guided policy search code, including BADMM and MDGPS methods, is available at https://www.github.com/cbfinn/gps.

#### Obstacle Navigation.

In this task, a 2D point mass (grey) must navigate around obstacles to reach a target (shown in green), using velocities and positions relative to the target. We use initial states, with 5 samples per initial state per iteration. The target and obstacles are fixed, but the starting position varies.

#### Peg Insertion.

This task, which is more complex, requires controlling a 7 DoF 3D arm to insert a tight-fitting peg into a hole. The hole can be in different positions, and the state consists of joint angles, velocities, and end-effector positions relative to the target. This task is substantially more challenging physically. We use different hole positions, with 5 samples per initial state per iteration.

#### Blind Peg Insertion.

The last task is a blind variant of the peg insertion task, where the target-relative end effector positions are provided to the local policies, but not to the global policy . This requires the global policy to search for the hole, since no input to the global policy can distinguish between the different initial state . This makes it much more challenging to adapt the global and local policies to each other, and makes it impossible for the global learner to succeed without adaptation of the local policies. We use different hole positions, with 5 samples per initial state per iteration.

The global policy for each task consists of a fully connected neural network with two hidden layers with rectified linear units. The same settings are used for MDGPS and the prior BADMM-based method, except for the difference in surrogate costs, constraints, and step size adjustment methods discussed in the paper. Results are presented in Figure 1. On the easier point mass and peg tasks, all of the methods achieve similar performance. However, the MDGPS methods are all substantially easier to apply to these tasks, since they have very few free hyperparameters. An initial step size must be selected, but the adaptive step size adjustment rules make this choice less important. In contrast, the BADMM method requires choosing an initial weight on the augmented Lagrangian term, an adjustment schedule for this term, a step size on the dual variables, and a step size for local policies, all of which have a substantial impact on the final performance of the method (the reported results are for the best setting of these parameters, identified with a hyperparameter sweep).

On the harder blind peg task, MDGPS consistently outperforms BADMM when sampling from the local policies (“off policy”), with both the classic and global step sizes. This is particularly apparent in the success rates in Table 1, which shows that the MDGPS policies succeed at actually inserting the peg into the hole more often and on more conditions. This suggests that our method is better able to improve global policies particularly in situations where informational or representational constraints make naïve imitation of the local policies insufficient to solve the task. On-policy sampling tends to learn slower, since the approximate projection causes the global policy to lag behind the local policy in performance, but this method is still able to consistently improve the global policies. Sampling from the global policies may be desirable in practice, since the global policies can directly use observations at runtime instead of requiring access to the state [6]. The global step size also tends to be more conservative, but produces more consistent and monotonic improvement.

Iteration | BADMM | Off Pol., Classic | Off Pol., Global | On Pol., Classic | On Pol., Global | |
---|---|---|---|---|---|---|

Peg |
3 | 0.00 % | 0.00 % | 0.00 % | 0.00 % | 0.00 % |

6 | 51.85 % | 62.96 % | 22.22 % | 48.15 % | 33.33 % | |

9 | 51.85 % | 77.78 % | 74.07 % | 77.78 % | 81.48 % | |

12 | 77.78 % | 70.73 % | 92.59 % | 92.59 % | 85.19 % | |

Blind Peg |
3 | 0.00 % | 0.00 % | 0.00 % | 0.00 % | 0.00 % |

6 | 50.00 % | 58.33 % | 25.00 % | 33.33 % | 25.00 % | |

9 | 58.33 % | 75.00 % | 50.00 % | 58.33 % | 33.33 % | |

12 | 75.00 % | 83.33 % | 91.67 % | 58.33 % | 58.33 % |

## 7 Discussion and Future Work

We presented a new guided policy search method that corresponds to mirror descent under linearity and convexity assumptions, and showed how prior guided policy search methods can be seen as approximating mirror descent. We provide a bound on the return of the global policy in the nonlinear case, and argue that an appropriate step size can provide improvement of the global policy in this case also. Our analysis provides us with the intuition to design an automated step size adjustment rule, and we illustrate empirically that our method achieves good results on a complex simulated robotic manipulation task while requiring substantially less tuning and hyperparameter optimization than prior guided policy search methods. Manual tuning and hyperparameter searches are a major challenge across a range of deep reinforcement learning algorithms, and developing scalable policy search methods that are simple and reliable is vital to enable further progress.

As discussed in Section 5, MDGPS has interesting connections to other policy search methods. Like DAGGER [15], MDGPS uses supervised learning to train the policy, but unlike DAGGER, MDGPS does not assume that the learner is able to reproduce an arbitrary teacher’s behavior with bounded error, which makes it very appealing for tasks with partial observability or other limits on information, such as learning to use camera images for robotic manipulation [6]. When sampling directly from the global policy, MDGPS also has close connections to policy gradient methods that take steps of fixed KL-divergence [14; 17], but with the steps taken in the space of trajectories rather than policy parameters, followed by a projection step. In future work, it would be interesting to explore this connection further, so as to develop new model-free policy gradient methods.

## References

- [1] J. A. Bagnell and J. Schneider. Covariant policy search. In International Joint Conference on Artificial Intelligence (IJCAI), 2003.
- [2] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, May 2003.
- [3] M. Deisenroth, G. Neumann, and J. Peters. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1-2):1–142, 2013.
- [4] X. Guo, S. Singh, H. Lee, R. L. Lewis, and X. Wang. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In Advances in Neural Information Processing Systems (NIPS), 2014.
- [5] S. Levine and P. Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems (NIPS), 2014.
- [6] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research (JMLR), 17, 2016.
- [7] S. Levine and V. Koltun. Variational policy search via trajectory optimization. In Advances in Neural Information Processing Systems (NIPS), 2013.
- [8] S. Levine, N. Wagener, and P. Abbeel. Learning contact-rich manipulation skills with guided policy search. In International Conference on Robotics and Automation (ICRA), 2015.
- [9] W. Li and E. Todorov. Iterative linear quadratic regulator design for nonlinear biological movement systems. In ICINCO (1), pages 222–229, 2004.
- [10] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2016.
- [11] I. Mordatch, K. Lowrey, G. Andrew, Z. Popovic, and E. Todorov. Interactive control of diverse complex characters with neural networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
- [12] I. Mordatch and E. Todorov. Combining the benefits of function approximation and trajectory optimization. In Robotics: Science and Systems (RSS), 2014.
- [13] J. Peters, K. Mülling, and Y. Altün. Relative entropy policy search. In AAAI Conference on Artificial Intelligence, 2010.
- [14] J. Peters and S. Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, 2008.
- [15] S. Ross, G. Gordon, and A. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. Journal of Machine Learning Research, 15:627–635, 2011.
- [16] S. Ross, N. Melik-Barkhudarov, K. Shaurya Shankar, A. Wendel, D. Dey, J. A. Bagnell, and M. Hebert. Learning monocular reactive UAV control in cluttered natural environments. In International Conference on Robotics and Automation (ICRA), 2013.
- [17] J. Schulman, S. Levine, P. Moritz, M. Jordan, and P. Abbeel. Trust region policy optimization. In International Conference on Machine Learning (ICML), 2015.
- [18] R. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, May 1992.
- [19] T. Zhang, G. Kahn, S. Levine, and P. Abbeel. Learning deep control policies for autonomous aerial vehicles with mpc-guided policy search. In International Conference on Robotics and Automation (ICRA), 2016.

## Appendix A KL Divergence Between Gaussian Trajectory Distributions

In this appendix, we derive the KL-divergence between two Gaussian trajectory distributions corresponding to time-varying linear-Gaussian dynamics $p(\mathbf{x}_{t+1}|\mathbf{x}_t,\mathbf{u}_t)$ and two policies $p(\mathbf{u}_t|\mathbf{x}_t)$ and $\pi_\theta(\mathbf{u}_t|\mathbf{x}_t)$. The two policies induce Gaussian trajectory distributions according to

$$
p(\tau) = p(\mathbf{x}_1)\prod_{t=1}^{T} p(\mathbf{x}_{t+1}|\mathbf{x}_t,\mathbf{u}_t)\,p(\mathbf{u}_t|\mathbf{x}_t), \qquad
\pi_\theta(\tau) = p(\mathbf{x}_1)\prod_{t=1}^{T} p(\mathbf{x}_{t+1}|\mathbf{x}_t,\mathbf{u}_t)\,\pi_\theta(\mathbf{u}_t|\mathbf{x}_t).
$$

We can therefore derive their KL-divergence as

$$
\begin{aligned}
D_{\text{KL}}(p(\tau)\,\|\,\pi_\theta(\tau)) &= E_{p(\tau)}\left[\log p(\tau) - \log \pi_\theta(\tau)\right] \\
&= E_{p(\tau)}\left[\sum_{t=1}^{T} \log p(\mathbf{u}_t|\mathbf{x}_t) - \log \pi_\theta(\mathbf{u}_t|\mathbf{x}_t)\right] \\
&= \sum_{t=1}^{T} E_{p(\mathbf{x}_t,\mathbf{u}_t)}\left[\log p(\mathbf{u}_t|\mathbf{x}_t) - \log \pi_\theta(\mathbf{u}_t|\mathbf{x}_t)\right] \\
&= \sum_{t=1}^{T} E_{p(\mathbf{x}_t,\mathbf{u}_t)}\left[-\log \pi_\theta(\mathbf{u}_t|\mathbf{x}_t)\right] - E_{p(\mathbf{x}_t)}\left[\mathcal{H}(p(\mathbf{u}_t|\mathbf{x}_t))\right] \\
&= \sum_{t=1}^{T} E_{p(\mathbf{x}_t,\mathbf{u}_t)}\left[-\log \pi_\theta(\mathbf{u}_t|\mathbf{x}_t)\right] - \mathcal{H}(p(\mathbf{u}_t|\mathbf{x}_t)),
\end{aligned}
$$

where the second step follows because the dynamics and initial state distribution cancel, the third step follows by linearity of expectation, the fourth step from the definition of differential entropy, and the last step follows from the fact that the entropy of a conditional Gaussian distribution is independent of the quantity that it is conditioned on, since it depends only on the covariance and not the mean. We therefore have

$$
D_{\text{KL}}(p(\tau)\,\|\,\pi_\theta(\tau)) = \sum_{t=1}^{T} E_{p(\mathbf{x}_t,\mathbf{u}_t)}\left[-\log \pi_\theta(\mathbf{u}_t|\mathbf{x}_t)\right] - \mathcal{H}(p(\mathbf{u}_t|\mathbf{x}_t)).
$$

By the definition of KL-divergence, we can also write this as

$$
D_{\text{KL}}(p(\tau)\,\|\,\pi_\theta(\tau)) = \sum_{t=1}^{T} E_{p(\mathbf{x}_t)}\left[D_{\text{KL}}(p(\mathbf{u}_t|\mathbf{x}_t)\,\|\,\pi_\theta(\mathbf{u}_t|\mathbf{x}_t))\right].
$$
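As an illustrative check, the decomposition above can be verified numerically on a tiny two-step linear-Gaussian system: the KL-divergence between the two joint trajectory Gaussians equals the sum of the expected per-step conditional KL-divergences. The sketch below (NumPy/SciPy; all dimensions and system matrices are hypothetical) builds each trajectory Gaussian as an affine function of independent noise terms:

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(0)
n, m = 2, 1  # state and action dimensions (illustrative)

def rand_psd(d):
    A = rng.normal(size=(d, d))
    return A @ A.T + d * np.eye(d)

# Shared time-invariant linear-Gaussian dynamics and initial state.
Fx = 0.5 * rng.normal(size=(n, n))
Fu = rng.normal(size=(n, m))
W = 0.1 * rand_psd(n)                      # dynamics noise covariance
m1, S1 = rng.normal(size=n), rand_psd(n)   # initial state distribution

# Two linear-Gaussian policies: p uses (K, k, C), pi uses (L, l, D).
K, k, C = rng.normal(size=(m, n)), rng.normal(size=m), rand_psd(m)
L, l, D = rng.normal(size=(m, n)), rng.normal(size=m), rand_psd(m)

def traj_gaussian(K, k, C):
    """Mean and covariance of tau = (x1, u1, x2, u2) for policy N(Kx+k, C)."""
    Z = np.zeros
    M = np.vstack([
        np.block([np.eye(n), Z((n, m)), Z((n, n)), Z((n, m))]),   # x1
        np.block([K, np.eye(m), Z((m, n)), Z((m, m))]),           # u1
        np.block([Fx + Fu @ K, Fu, np.eye(n), Z((n, m))]),        # x2
        np.block([K @ (Fx + Fu @ K), K @ Fu, K, np.eye(m)]),      # u2
    ])
    mx2 = Fx @ m1 + Fu @ (K @ m1 + k)
    b = np.concatenate([m1, K @ m1 + k, mx2, K @ mx2 + k])
    return b, M @ block_diag(S1, C, W, C) @ M.T

def kl_gauss(m0, S0, m1_, S1_):
    """KL(N(m0, S0) || N(m1_, S1_))."""
    Si, dm = np.linalg.inv(S1_), m1_ - m0
    return 0.5 * (np.trace(Si @ S0) - len(m0) + dm @ Si @ dm
                  + np.log(np.linalg.det(S1_) / np.linalg.det(S0)))

def expected_cond_kl(mx, Sx):
    """E_{p(x_t)}[KL(p(u|x) || pi(u|x))] for x_t ~ N(mx, Sx)."""
    Di, dK, dk = np.linalg.inv(D), K - L, k - l
    quad = np.trace(Di @ dK @ Sx @ dK.T) + (dK @ mx + dk) @ Di @ (dK @ mx + dk)
    return 0.5 * (np.trace(Di @ C) - m
                  + np.log(np.linalg.det(D) / np.linalg.det(C)) + quad)

bp, Sp = traj_gaussian(K, k, C)
bq, Sq = traj_gaussian(L, l, D)
kl_joint = kl_gauss(bp, Sp, bq, Sq)

# Marginals of x1 and x2 under p, then the per-step decomposition.
mx2 = Fx @ m1 + Fu @ (K @ m1 + k)
Sx2 = (Fx + Fu @ K) @ S1 @ (Fx + Fu @ K).T + Fu @ C @ Fu.T + W
kl_sum = expected_cond_kl(m1, S1) + expected_cond_kl(mx2, Sx2)
print(abs(kl_joint - kl_sum))  # agrees up to numerical error
```

The two quantities agree because the dynamics and initial state terms cancel in the log-ratio, exactly as in the derivation above.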

## Appendix B Details of the MDGPS Algorithm

A summary of the MDGPS algorithm appears in Algorithm 2; its two main steps are detailed below.

### B.1 C-Step Details

The C-step solves the following constrained optimization problem:

$$
p \leftarrow \arg\min_{p}\; E_{p(\tau)}\left[\sum_{t=1}^{T} \ell(\mathbf{x}_t,\mathbf{u}_t)\right] \text{ such that } D_{\text{KL}}(p(\tau)\,\|\,\bar\pi_\theta(\tau)) \le \epsilon,
$$

where $\bar\pi_\theta$ denotes the linearized global policy.

The solution to this problem follows prior work [5], and is reviewed here for completeness. First, the Lagrangian of this problem is given by

$$
\mathcal{L}(p,\eta) = E_{p(\tau)}\left[\sum_{t=1}^{T} \ell(\mathbf{x}_t,\mathbf{u}_t)\right] + \eta\left(D_{\text{KL}}(p(\tau)\,\|\,\bar\pi_\theta(\tau)) - \epsilon\right)
= \sum_{t=1}^{T} E_{p(\mathbf{x}_t,\mathbf{u}_t)}\left[\ell(\mathbf{x}_t,\mathbf{u}_t) - \eta\log\bar\pi_\theta(\mathbf{u}_t|\mathbf{x}_t)\right] - \eta\mathcal{H}(p(\mathbf{u}_t|\mathbf{x}_t)) - \eta\epsilon,
$$

where the equality follows from the identity in Appendix A. As discussed in prior work [5], we can minimize this Lagrangian with respect to $p$ by solving an LQR problem (assuming a quadratic expansion of the cost $\ell$) with a surrogate cost

$$
\tilde\ell(\mathbf{x}_t,\mathbf{u}_t) = \frac{1}{\eta}\ell(\mathbf{x}_t,\mathbf{u}_t) - \log\bar\pi_\theta(\mathbf{u}_t|\mathbf{x}_t).
$$

This follows because LQR can be shown to solve the following problem [5]:

$$
p \leftarrow \arg\min_{p}\; \sum_{t=1}^{T} E_{p(\mathbf{x}_t,\mathbf{u}_t)}\left[\tilde\ell(\mathbf{x}_t,\mathbf{u}_t)\right] - \mathcal{H}(p(\mathbf{u}_t|\mathbf{x}_t))
$$

if we set $p(\mathbf{u}_t|\mathbf{x}_t) = \mathcal{N}(\mathbf{K}_t\mathbf{x}_t + \mathbf{k}_t,\, Q_{\mathbf{u},\mathbf{u}\,t}^{-1})$, where $\mathbf{K}_t$ and $\mathbf{k}_t$ are the optimal feedback and feedforward terms, respectively, and $Q_{\mathbf{u},\mathbf{u}\,t}$ is the action component of the Q-function matrix computed by LQR, where the full Q-function is given by

$$
Q(\mathbf{x}_t,\mathbf{u}_t) = \frac{1}{2}\begin{bmatrix}\mathbf{x}_t \\ \mathbf{u}_t\end{bmatrix}^T
\begin{bmatrix} Q_{\mathbf{x},\mathbf{x}\,t} & Q_{\mathbf{x},\mathbf{u}\,t} \\ Q_{\mathbf{u},\mathbf{x}\,t} & Q_{\mathbf{u},\mathbf{u}\,t}\end{bmatrix}
\begin{bmatrix}\mathbf{x}_t \\ \mathbf{u}_t\end{bmatrix}
+ \begin{bmatrix}\mathbf{x}_t \\ \mathbf{u}_t\end{bmatrix}^T
\begin{bmatrix} Q_{\mathbf{x}\,t} \\ Q_{\mathbf{u}\,t}\end{bmatrix} + \text{const}.
$$

This maximum entropy LQR solution also follows directly from the so-called Kalman duality, which describes a connection between LQR and Kalman smoothing.
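To make the backward pass concrete, the following sketch implements a standard finite-horizon LQR recursion on a hypothetical stationary linear system (all matrices below are illustrative, and the linear cost terms and the Lagrangian surrogate cost are omitted for brevity). The maximum-entropy controller keeps the same gains and simply adds Gaussian action noise with covariance $Q_{\mathbf{u},\mathbf{u}\,t}^{-1}$; over a long horizon, the value function Hessian approaches the infinite-horizon Riccati solution, which gives a convenient check:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Hypothetical stationary system and quadratic cost (illustrative values).
A = np.array([[0.9, 0.2], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)            # state cost
R = np.array([[1.0]])    # action cost

def lqr_backward(A, B, Q, R, T):
    """Finite-horizon LQR backward pass.

    Returns the time-varying feedback gains K_t and the maximum-entropy
    action covariances Quu_t^{-1} (the feedforward terms k_t vanish here
    because the cost has no linear component).
    """
    V = np.zeros_like(Q)            # value function Hessian after step T
    Ks, covs = [], []
    for _ in range(T):
        Qxx = Q + A.T @ V @ A
        Quu = R + B.T @ V @ B
        Qux = B.T @ V @ A
        K = -np.linalg.solve(Quu, Qux)
        Ks.append(K)
        covs.append(np.linalg.inv(Quu))   # max-ent controller covariance
        V = Qxx + Qux.T @ K               # = Qxx - Qux^T Quu^{-1} Qux
        V = 0.5 * (V + V.T)               # symmetrize against round-off
    Ks.reverse(); covs.reverse()
    return Ks, covs, V

Ks, covs, V = lqr_backward(A, B, Q, R, T=200)
P = solve_discrete_are(A, B, Q, R)  # infinite-horizon Riccati solution
print(np.abs(V - P).max())          # small: the recursion has converged
```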

Once we can minimize the Lagrangian with respect to $p$, we can solve the original constrained problem by using dual gradient descent to iteratively adjust the dual variable $\eta$. Since there is only a single dual variable, we can find it very efficiently by using a bracketing line search, exploiting the fact that the dual function is concave in $\eta$.
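The dual search can be illustrated on a one-dimensional analogue of the C-step in which the Lagrangian minimizer is available in closed form. In this sketch the cost $\ell(x) = (x-a)^2$, the prior $q = \mathcal{N}(0,1)$, and all constants are hypothetical, but the structure (closed-form inner minimization plus a bracketing search over a single dual variable) mirrors the trajectory-level procedure:

```python
import numpy as np

# Toy C-step: minimize E_p[(x - a)^2] over Gaussians p,
# subject to KL(p || q) <= eps with prior q = N(0, 1).
a, eps = 3.0, 0.5

def minimize_lagrangian(eta):
    """Closed-form minimizer of E_p[l(x)] + eta * KL(p || q)."""
    prec = 1.0 + 2.0 / eta          # posterior precision
    mu = (2.0 * a / eta) / prec     # posterior mean
    return mu, 1.0 / prec           # mean, variance

def kl_to_prior(mu, var):
    return 0.5 * (var + mu ** 2 - 1.0 - np.log(var))

# The constraint slack KL - eps decreases monotonically in eta here, so a
# bracketing (bisection) search over the single dual variable suffices.
lo, hi = 1e-6, 1e6
for _ in range(200):
    eta = np.sqrt(lo * hi)          # bisect in log space
    mu, var = minimize_lagrangian(eta)
    if kl_to_prior(mu, var) > eps:
        lo = eta                    # constraint violated: increase eta
    else:
        hi = eta
mu, var = minimize_lagrangian(hi)
print(mu, var, kl_to_prior(mu, var))  # KL is (approximately) eps
```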

As discussed in the paper, the dynamics are estimated by using samples (drawn from either the local policy or the global policy) and linear regression. Following prior work [5], the dynamics at each step are fitted using linear regression with a Gaussian mixture model prior. This prior incorporates samples from other time steps and previous iterations to allow the regression procedure to use a very small number of sampled trajectories.
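The regression itself is straightforward; a minimal version without the GMM prior, fitting $\mathbf{x}_{t+1} = \mathbf{F}[\mathbf{x}_t;\mathbf{u}_t] + \mathbf{f}$ by least squares on data generated from a hypothetical ground-truth linear system, looks like this:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, N = 3, 2, 500  # state dim, action dim, sample count (illustrative)

# Hypothetical ground-truth linear-Gaussian dynamics used to generate data.
F_true = 0.3 * rng.normal(size=(n, n + m))
f_true = rng.normal(size=n)

X = rng.normal(size=(N, n + m))                       # sampled (x_t, u_t)
Y = X @ F_true.T + f_true + 0.01 * rng.normal(size=(N, n))  # noisy x_{t+1}

# Least-squares fit with a bias column (the GMM prior used in the paper,
# which regularizes this fit from few samples, is omitted in this sketch).
Xa = np.hstack([X, np.ones((N, 1))])
sol, *_ = np.linalg.lstsq(Xa, Y, rcond=None)
F_fit, f_fit = sol[:-1].T, sol[-1]
print(np.abs(F_fit - F_true).max())                   # small recovery error
```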

### B.2 S-Step Details

The S-step solves the following optimization problem:

$$
\theta \leftarrow \arg\min_{\theta}\; \sum_{t=1}^{T}\sum_{i=1}^{N} E_{p_i(\mathbf{x}_t)}\left[D_{\text{KL}}(\pi_\theta(\mathbf{u}_t|\mathbf{x}_t)\,\|\,p_i(\mathbf{u}_t|\mathbf{x}_t))\right],
$$

where the expectations are estimated using the samples.

Since both $\pi_\theta(\mathbf{u}_t|\mathbf{x}_t)$ and $p_i(\mathbf{u}_t|\mathbf{x}_t)$ are assumed to be conditionally Gaussian, this objective can be rewritten in closed form as

$$
D_{\text{KL}}(\pi_\theta(\mathbf{u}_t|\mathbf{x}_t)\,\|\,p_i(\mathbf{u}_t|\mathbf{x}_t)) = \frac{1}{2}\Big[
\operatorname{tr}\!\big(\mathbf{C}_{ti}^{-1}\Sigma^{\pi}(\mathbf{x}_t)\big) - \log\big|\Sigma^{\pi}(\mathbf{x}_t)\big| + \log\big|\mathbf{C}_{ti}\big|
+ \big(\mu^{\pi}(\mathbf{x}_t) - \mu^{p}_{ti}(\mathbf{x}_t)\big)^T \mathbf{C}_{ti}^{-1}\big(\mu^{\pi}(\mathbf{x}_t) - \mu^{p}_{ti}(\mathbf{x}_t)\big)
\Big] + \text{const},
$$

where $\mu^{\pi}$ and $\Sigma^{\pi}$ denote the mean and covariance of the global policy, and $\mu^{p}_{ti}$ and $\mathbf{C}_{ti}$ denote the mean and covariance of the local policy $p_i$ at time $t$.

Note that the last term is simply a weighted quadratic cost on the policy mean $\mu^{\pi}(\mathbf{x}_t)$, which lends itself to simple and straightforward optimization using stochastic gradient descent. In our implementation, we use a policy where the covariance is independent of the state $\mathbf{x}_t$, and therefore we can solve for the covariance in closed form, as discussed in prior work [6]. However, in general, the covariance could also be optimized using stochastic gradient descent.
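This closed form is easy to check numerically; the sketch below compares it against direct quadrature for a one-dimensional action (all values below are hypothetical):

```python
import numpy as np

# KL between conditional Gaussians pi(u|x) = N(mu_pi, S) and
# p(u|x) = N(mu_p, C) at a single state; a 1-D action makes the
# integral easy to evaluate directly.
mu_pi, S = 0.7, 0.4
mu_p, C = -0.2, 0.9

def kl_closed_form(mu_pi, S, mu_p, C):
    # The trace and log-det terms are constant in mu_pi; the last term is
    # the weighted quadratic penalty on the policy mean used in the S-step.
    return 0.5 * (S / C - 1.0 + np.log(C / S) + (mu_pi - mu_p) ** 2 / C)

# Direct quadrature of KL(pi || p) on a dense grid, using log-densities
# so that underflow in the tails yields 0 * finite rather than NaN.
u = np.linspace(-15.0, 15.0, 400001)
log_pi = -0.5 * (u - mu_pi) ** 2 / S - 0.5 * np.log(2 * np.pi * S)
log_p = -0.5 * (u - mu_p) ** 2 / C - 0.5 * np.log(2 * np.pi * C)
kl_num = np.sum(np.exp(log_pi) * (log_pi - log_p)) * (u[1] - u[0])
print(kl_closed_form(mu_pi, S, mu_p, C), kl_num)  # both ~0.5777
```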

### B.3 Step Size Adjustment

As discussed in Section 4.2, the step size adjustment procedure requires estimating quantities of the type $E_{p^{k}_{k'}(\mathbf{x}_t,\mathbf{u}_t)}[\ell(\mathbf{x}_t,\mathbf{u}_t)]$, where $p^{k}_{k'}(\mathbf{x}_t,\mathbf{u}_t)$ is the marginal of the local policy used to generate samples at iteration $k$ under the dynamics fitted at iteration $k'$ (not to be confused with $p_i$, which we use to denote the local policy for the $i$-th initial state, independent of the iteration number). We also use terms of the form $E_{\bar\pi^{k}_{k'}(\mathbf{x}_t,\mathbf{u}_t)}[\ell(\mathbf{x}_t,\mathbf{u}_t)]$, which give the expected cost under the dynamics at iteration $k'$ and the linearized global policy at iteration $k$. Specifically, we require $E_{p^{k-1}_{k-1}}[\ell]$, $E_{p^{k-1}_{k}}[\ell]$, and $E_{p^{k}_{k}}[\ell]$, as well as the corresponding global policy terms $E_{\bar\pi^{k-1}_{k-1}}[\ell]$, $E_{\bar\pi^{k-1}_{k}}[\ell]$, and $E_{\bar\pi^{k}_{k}}[\ell]$.

All of these terms can be computed analytically, since the fitted dynamics, local policies, and linearized global policy are all linear-Gaussian. The state-action marginals in linear-Gaussian policies can be computed simply by propagating Gaussian densities forward in time, according to

$$
\mu_{\mathbf{u}_t} = \mathbf{K}_t\mu_{\mathbf{x}_t} + \mathbf{k}_t,\qquad
\Sigma_{(\mathbf{x},\mathbf{u})_t} = \begin{bmatrix}\Sigma_{\mathbf{x}_t} & \Sigma_{\mathbf{x}_t}\mathbf{K}_t^T \\ \mathbf{K}_t\Sigma_{\mathbf{x}_t} & \mathbf{K}_t\Sigma_{\mathbf{x}_t}\mathbf{K}_t^T + \mathbf{C}_t\end{bmatrix},
$$

$$
\mu_{\mathbf{x}_{t+1}} = f_{\mathbf{xu},t}\,\mu_{(\mathbf{x},\mathbf{u})_t} + f_{c,t},\qquad
\Sigma_{\mathbf{x}_{t+1}} = f_{\mathbf{xu},t}\,\Sigma_{(\mathbf{x},\mathbf{u})_t}\,f_{\mathbf{xu},t}^T + \mathbf{F}_t,
$$

where we have $p(\mathbf{u}_t|\mathbf{x}_t) = \mathcal{N}(\mathbf{K}_t\mathbf{x}_t + \mathbf{k}_t,\, \mathbf{C}_t)$ and $p(\mathbf{x}_{t+1}|\mathbf{x}_t,\mathbf{u}_t) = \mathcal{N}(f_{\mathbf{xu},t}[\mathbf{x}_t;\mathbf{u}_t] + f_{c,t},\, \mathbf{F}_t)$, with $\mu_{(\mathbf{x},\mathbf{u})_t} = [\mu_{\mathbf{x}_t};\mu_{\mathbf{u}_t}]$, and then we can estimate the expectation of the cost at time $t$ simply by integrating the quadratic cost under the Gaussian state-action marginals.
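The propagation can be written out directly; this sketch (hypothetical time-invariant matrices, NumPy) checks the analytic marginals against a Monte Carlo rollout of the same policy and dynamics:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 2, 1  # state and action dimensions (illustrative)

# Hypothetical time-invariant linear-Gaussian policy and dynamics.
K, k, C = np.array([[0.3, -0.2]]), np.array([0.1]), np.array([[0.05]])
Fx = np.array([[0.9, 0.1], [0.0, 0.9]])
Fu = np.array([[0.0], [0.5]])
fc, W = np.array([0.05, 0.0]), 0.01 * np.eye(n)
mu1, S1 = np.zeros(n), 0.1 * np.eye(n)

def propagate(mu_x, S_x):
    """One step of Gaussian marginal propagation."""
    mu_u = K @ mu_x + k
    mu_xu = np.concatenate([mu_x, mu_u])
    S_xu = np.block([[S_x,      S_x @ K.T],
                     [K @ S_x,  K @ S_x @ K.T + C]])
    Fxu = np.hstack([Fx, Fu])
    return Fxu @ mu_xu + fc, Fxu @ S_xu @ Fxu.T + W

# Analytic marginals over a short horizon...
mu_x, S_x = mu1, S1
for _ in range(3):
    mu_x, S_x = propagate(mu_x, S_x)

# ...match a Monte Carlo rollout of the same policy and dynamics.
N = 200000
x = rng.multivariate_normal(mu1, S1, size=N)
for _ in range(3):
    u = x @ K.T + k + rng.multivariate_normal(np.zeros(m), C, size=N)
    x = x @ Fx.T + u @ Fu.T + fc + rng.multivariate_normal(np.zeros(n), W, size=N)
print(np.abs(x.mean(axis=0) - mu_x).max())  # small sampling error
```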

## Appendix C Global Policy Cost Bounds

In this appendix, we prove the bound on the policy cost discussed in Section 4.1. The proof combines the earlier results from Ross et al. [15] and Schulman et al. [17], and extends them to the case of finite-horizon episodic tasks.

### C.1 Policy State Distribution Bound

We begin by proving Lemma 4.1, which we restate below with slightly simplified notation, replacing $\pi_\theta(\mathbf{u}_t|\mathbf{x}_t)$ by $\pi(\mathbf{u}_t|\mathbf{x}_t)$:

###### Lemma C.1

Let $\epsilon_t = \max_{\mathbf{x}_t} D_{\text{TV}}(p(\mathbf{u}_t|\mathbf{x}_t)\,\|\,\pi(\mathbf{u}_t|\mathbf{x}_t))$. Then $D_{\text{TV}}(p(\mathbf{x}_t)\,\|\,\pi(\mathbf{x}_t)) \le 2\sum_{t'=1}^{t}\epsilon_{t'}$.

The proof first requires introducing a lemma that relates the total variation divergence between two policies to the probability that the policies will take the same action in a discrete setting (extensions to the continuous setting are also possible):

###### Lemma C.2

Assume that $D_{\text{TV}}(p(\mathbf{u}_t|\mathbf{x}_t)\,\|\,\pi(\mathbf{u}_t|\mathbf{x}_t)) = \frac{1}{2}\sum_{\mathbf{u}_t}|p(\mathbf{u}_t|\mathbf{x}_t) - \pi(\mathbf{u}_t|\mathbf{x}_t)| \le \epsilon_t$. Then the probability that $p$ and $\pi$ take the same action at time step $t$ is at least $1 - \epsilon_t$.

The proof for this lemma was presented by Schulman et al. [17]. We can use it to bound the state distribution difference as follows. First, if we are acting according to $\pi$, the probability that the same action would have been taken by $p$, based on Lemma C.2, is at least $1 - \epsilon_t$, so the probability that all actions up to time $t$ would have been taken by $p$ is given by $\prod_{t'=1}^{t-1}(1-\epsilon_{t'})$. We can therefore express the state distribution as

$$
\pi(\mathbf{x}_t) = \prod_{t'=1}^{t-1}(1-\epsilon_{t'})\, p(\mathbf{x}_t) + \left(1 - \prod_{t'=1}^{t-1}(1-\epsilon_{t'})\right) \mu(\mathbf{x}_t),
$$

where $\mu(\mathbf{x}_t)$ is some other distribution. In order to bound $D_{\text{TV}}(p(\mathbf{x}_t)\,\|\,\pi(\mathbf{x}_t))$, we can substitute this equation into the definition of total variation divergence to get