Don’t Forget Your Teacher: A Corrective Reinforcement Learning Framework

Mohammadreza Nazari, Majid Jahani, Lawrence V. Snyder, Martin Takáč
Department of Industrial and Systems Engineering
Lehigh University, Bethlehem, PA 18015

Although reinforcement learning (RL) can provide reliable solutions in many settings, practitioners are often wary of the discrepancies between the RL solution and their status quo procedures. Therefore, they may be reluctant to adapt to the novel way of executing tasks proposed by RL. On the other hand, many real-world problems require relatively small adjustments from the status quo policies to achieve improved performance. Therefore, we propose a student-teacher RL mechanism in which the RL (the “student”) learns to maximize its reward, subject to a constraint that bounds the difference between the RL policy and the “teacher” policy. The teacher can be another RL policy (e.g., trained under a slightly different setting), the status quo policy, or any other exogenous policy. We formulate this problem using a stochastic optimization model and solve it using a primal-dual policy gradient algorithm. We prove that the policy is asymptotically optimal. However, a naive implementation suffers from high variance and convergence to a stochastic optimal policy. With a few practical adjustments to address these issues, our numerical experiments confirm the effectiveness of our proposed method in multiple GridWorld scenarios.

1 Introduction

We encourage using a new paradigm called corrective RL, in which a reinforcement learning (RL) agent is trained to maximize its reward while not straying “too far” from a previously defined policy. The motivation is twofold: (1) to provide a gentler transition for a decision-maker who is accustomed to using a certain policy but now considers implementing RL, and (2) to develop a framework for gently transitioning from one RL solution to another when the underlying environment has changed.

RL has recently achieved considerable success in artificially created environments, such as Atari games Mnih et al. (2013, 2016) or robotic simulators Lillicrap et al. (2015). RL algorithms that exploit the power of neural networks have been shown to achieve super-human performance by enabling automatic feature extraction and policy representation, but real-world applications are still very limited, conceivably due to lack of representativity of the optimized policies. Over the past few years, a major portion of the RL literature has been developed for RL agents with no prior information about how to do a task. Typically, these algorithms start with random actions and learn while interacting with the environment through trial and error. However, in many practical settings, prior information about good solutions is available, whether from a previous RL algorithm or a human decision-maker’s prior experience. Our approach trains the RL agent to make use of this prior information when optimizing, in order to avoid deviating too far from a target policy.

Although toy environments and Atari games are prevalent in the RL literature due to their simplicity, RL has recently been trying to find its path to real-world applications such as recommender systems Chen et al. (2018), transportation Nazari et al. (2018), Internet of Things Feng et al. (2017), supply chain Gijsbrechts et al. (2018); Oroojlooyjadid et al. (2017) and various control tasks in robotics Gu et al. (2017). In all of these applications, there is a crucial risk that the new policy might not operate logically or safely, as one was expecting it to do. A policy that attains a large reward but deviates too much from a known policy—which follows logical steps and processes—is not desirable for these tasks. For example, users of a system who were accustomed to the old way of doing things would likely find it hard to switch to a newly discovered policy, especially if the benefit of the new policy is not obvious or immediately forthcoming. Indeed, we argue that many real-world tasks only need a small corrective fix to the currently running policies to achieve their desired goals, instead of designing everything from scratch. Throughout this paper, we adhere to this paradigm—we call it “corrective RL”—which utilizes an acceptable policy as a gauge when designing novel policies. We consider two agents, namely a teacher and a student. Our main question is how the student can improve upon the teacher’s policy while not deviating too far from it. More formally, we would like to train a student in a way that it maximizes a long-term RL objective while keeping its own policy close to that of the teacher.

For example, consider an airplane that is controlled by an autopilot that follows the shortest haversine path policy towards the destination. Then, some turbulence occurs, and we want to modify the current path to avoid the turbulence. A “pure” RL algorithm would re-optimize the trajectory from scratch, potentially deviating too far from the optimal path in order to avoid the turbulence. Corrective RL would ensure that the adjustments to the current policy are small, so that the flight follows a similar path and has a similar estimated time of arrival, while ensuring that the passengers experience a more comfortable (less turbulent) flight. Another example is in predictive maintenance, where devices are periodically inspected for possible failures. Inspection schedules are usually prescribed by the device designers, but many environmental conditions affect failure rates, hence there is no guarantee that factory schedules are perfect. If the objective is to reduce downtime with only slight adjustments to the current schedules, conventional RL algorithms would have a hard time finding such policies.

Similar concerns arise in other business and engineering domains as well, including supply chain management, queuing systems, finance, and robotics. For example, an enormous number of exact and inexact methods have been proposed for classical inventory management problems under some assumptions on the demand, e.g., that it follows a normal distribution Snyder and Shen (2019). Once we add more stochasticity to the demand distribution or consider more complicated cost functions, these problems often become intractable using classical methods. Of course, vanilla RL can provide a mechanism for solving more complex inventory optimization problems, but practitioners may prefer policies that are simple to implement and intuitive to understand. Corrective RL can help steer the policy toward the preferred ones, while still maintaining near-optimal performance. Given these examples, one can interpret our approach as an improvement on black-box heuristics, which uses a data-driven approach to improve the performance of these algorithms without dramatic reformulations.

The contributions of this work are as follows: i) we introduce a new paradigm for RL tasks, convenient for many real-world tasks, that improves upon the currently running system’s policy with a small perturbation of the policy; ii) we formulate this problem as a stochastic optimization problem and propose a primal–dual policy gradient algorithm, which we prove to be asymptotically optimal; and iii) using practical adjustments, we illustrate how an RL framework can act as an improvement heuristic. We show the effectiveness and properties of the algorithm in multiple GridWorld motion planning experiments.

2 Problem Definition

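We consider the standard definition of a Markov decision process (MDP) using a tuple $(\mathcal{S}, \mathcal{A}, C, P, P_0)$. In our notation, $\mathcal{S} = \mathcal{S}' \cup \{s_{\mathrm{Tar}}\}$ is the state space, where $\mathcal{S}'$ is the set of transient states and $s_{\mathrm{Tar}}$ is the terminal state; $\mathcal{A}$ is the set of actions; $C(s, a)$ is the cost function; $P(\cdot \mid s, a)$ is the transition probability distribution; and $P_0(\cdot)$ is the distribution of the initial state $s_0$. At each time step $t$, the agent observes $s_t$, selects $a_t$, and incurs a cost $c_t = C(s_t, a_t)$. Selecting the action $a_t$ at state $s_t$ transitions the agent to the next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$.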

Consider two agents, a teacher and a student. The teacher has prior knowledge about the task and prescribes an action for any state that the student encounters, and the student has the authority to follow the teacher’s action or act independently based on its own learned policy. Let $\pi_S$ denote the policy of the student and $\pi_T$ be the policy of the teacher, where both $\pi_S$ and $\pi_T$ are stationary stochastic policies defined as mappings from state–action pairs to probabilities, i.e., $\pi_S: \mathcal{S} \times \mathcal{A} \to [0, 1]$, $\pi_T: \mathcal{S} \times \mathcal{A} \to [0, 1]$. For example, $\pi_S(a \mid s)$ denotes the probability of choosing action $a$ in state $s$ by the student. In policy gradient methods, the policies are represented with a function approximator, usually modeled by a neural network, where we denote by $\theta$ and $\theta_T$ the corresponding policy weights of the student and teacher, respectively; the teacher and student parameterized policies are denoted by $\pi_T(\cdot \mid \cdot; \theta_T)$ and $\pi_S(\cdot \mid \cdot; \theta)$. In what follows, we adapt this parameterization structure into the notation and interchangeably refer to $\pi_S$ and $\pi_T$ with their associated weights, $\theta$ and $\theta_T$.

We consider the simulation optimization setting, where we can sample from the underlying MDP and observe the costs. Consider a possible state–action–cost trajectory $\xi = \{s_0, a_0, c_0, s_1, a_1, c_1, \ldots, s_{T-1}, a_{T-1}, c_{T-1}, s_T\}$ and let $\Xi$ be the set of all possible trajectories under all admissible policies. For simplicity of exposition, we assume that the first hitting time of a terminal state from any given $s_0$ and following a stationary policy is bounded almost surely with an upper bound $T$. Since the sample trajectories in many RL tasks terminate in finite time, this assumption is not restrictive. For example, a game may end upon reaching a certain state, or a time-out signal may terminate the trajectory. Along a trajectory $\xi$, the system incurs a discounted cost $J(\xi) = \sum_{t=0}^{T-1} \gamma^t C(s_t, a_t)$, with discount factor $\gamma$, and the probability of sampling such a trajectory is $P_\theta(\xi) = P_0(s_0) \prod_{t=0}^{T-1} \pi_S(a_t \mid s_t; \theta) \, P(s_{t+1} \mid s_t, a_t)$. We denote the expected cost from state $s$ onward until hitting the terminal state by $V^\theta(s)$, i.e.,

$$V^\theta(s) = \mathbb{E}_{\xi \sim P_\theta} \big[ J(\xi) \mid s_0 = s \big]. \tag{1}$$


2.1 Distance Measure

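An important question that arises is how to quantify the distance between the policies of the teacher and the student. There are several distance measures studied in the literature for computing the closeness of two probability measures. Among those, the Kullback-Leibler (KL) divergence Nasrabadi (2007) is a widely used measure. In this work, we consider both reverse (KL-R) and forward (KL-F) KL-divergence, defined as

$$D_{KL}(\theta \,\|\, \theta_T) = \sum_{\xi \in \Xi} P_\theta(\xi) \log \frac{P_\theta(\xi)}{P_{\theta_T}(\xi)}, \tag{KL-R}$$

$$D_{KL}(\theta_T \,\|\, \theta) = \sum_{\xi \in \Xi} P_{\theta_T}(\xi) \log \frac{P_{\theta_T}(\xi)}{P_\theta(\xi)}. \tag{KL-F}$$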


KL-divergence is known to be an asymmetric distance measure, meaning that changing the order of the student and teacher distributions will cause different learning behaviors. We will use the reverse KL-divergence in the theoretical analysis since it provides more compact notations. However, in all of the experiments, we will consider the forward setting, i.e., $D_{KL}(\theta_T \,\|\, \theta)$, unless otherwise specified. Informally speaking, this form of KL-divergence, which is also known as the mean-seeking KL, allows the student to perform actions that are not included in the teacher’s behavior. This is because i) when the teacher can perform an action $a$ in a given state $s$, the student should also have $\pi_S(a \mid s) > 0$ to keep the distance finite, and ii) the student can have $\pi_S(a \mid s) > 0$ irrespective of whether the teacher is doing that action or not. The reverse direction, $D_{KL}(\theta \,\|\, \theta_T)$, known as the mode-seeking KL, can be useful as well. For example, let’s assume that the teacher policy is a mixture of several Gaussian sub-policies. Using the reverse order will allow the student to assign only one sub-policy as its decision making policy. Hence, the choice of reverse KL would be preferred if the student wants to find a policy which is close to a teacher’s sub-policy with the highest return. The justification of this behavior is also visible from the definition: when $\pi_S(a \mid s) > 0$ for a given state and action, then the teacher also should have $\pi_T(a \mid s) > 0$. Also, $\pi_T(a \mid s) = 0$ would not allow $\pi_S(a \mid s) > 0$. For more detailed discussion and examples, we refer the interested reader to Appendix C.1 and Section 10.1.2 of Nasrabadi (2007).
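The asymmetry can be checked numerically on a single state with three actions. The sketch below is only an illustration: the helper `kl` and the three example distributions are hypothetical and not taken from the paper, and the divergence is computed per state rather than over trajectories.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # the term 0 * log(0 / q) is taken as 0
        if q_i == 0.0:
            return math.inf  # p puts mass where q has none
        total += p_i * math.log(p_i / q_i)
    return total


teacher = [0.5, 0.5, 0.0]    # the teacher never takes the third action
student_a = [0.4, 0.4, 0.2]  # student that also explores the third action
student_b = [1.0, 0.0, 0.0]  # student that commits to one teacher action

forward_a = kl(teacher, student_a)  # finite: covering extra actions is allowed
reverse_a = kl(student_a, teacher)  # infinite: student acts where teacher cannot
forward_b = kl(teacher, student_b)  # infinite: student drops a teacher action
reverse_b = kl(student_b, teacher)  # finite: picking a single mode is allowed
```

Under the forward order, only the student that keeps positive probability on every teacher action stays at a finite distance; under the reverse order, only the student that stays inside the teacher's support does.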

2.2 Optimization Problems

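The student’s optimization problems that we would like to solve for reverse KL-divergence (OPT-R) and the forward KL-divergence (OPT-F) are defined as

$$\min_{\theta \in \Theta} \; V^\theta(s_0) \quad \text{s.t.} \quad D_{KL}(\theta \,\|\, \theta_T) \le \delta, \tag{OPT-R}$$

$$\min_{\theta \in \Theta} \; V^\theta(s_0) \quad \text{s.t.} \quad D_{KL}(\theta_T \,\|\, \theta) \le \delta, \tag{OPT-F}$$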


where $\delta$ is an upper bound on the KL-divergence and $\Theta$ is a convex compact set of possible policy parameters. Most of the theoretical analysis of the two optimization problems is quite similar, so we will use (OPT-R) as our main formulation. In Appendix B.3, we investigate the equivalence of both problems and state their minor differences.

The widely adopted problem studied for MDPs only contains the objective function; however, we impose an additional constraint to restrict the student’s policy. By fixing an appropriate value for $\delta$, one can enforce a constraint on the maximum allowed deviation of the student policy from that of the teacher. The objective is to find a set of optimal points $\theta^*$ that minimizes the discounted expected cost while not violating the KL constraint. Notice that $\theta = \theta_T$ is a trivial feasible solution. In addition, we need to have the following assumption to ensure that (OPT-R) is well-defined:

Assumption 1.

(Well-defined (OPT-R)) For any state–action pair $(s, a)$ with $\pi_T(a \mid s) = 0$, we have $\pi_S(a \mid s) = 0$.

Intuitively, Assumption 1 specifies that when the teacher does not take a specific action in a given state, the student also cannot choose that action. Even though this assumption might seem restrictive, it is valid in situations in which the student is indeed limited to the positive-probability action space of the teacher. Alternatively, we can make the assumption hold trivially by adding a small noise term to the outcome of the teacher’s policy, so that every action has positive probability, at the expense of some information loss.
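One simple way to add such a noise term, shown here only as a sketch, is to mix the teacher's action distribution with the uniform distribution. The function name `smooth_policy` and the mixing weight `eps` are assumptions for illustration; the paper does not specify the form of the noise.

```python
def smooth_policy(probs, eps=1e-3):
    """Mix a discrete distribution with the uniform one.

    Every action ends up with probability at least eps / len(probs),
    so the support of the smoothed teacher covers the whole action set.
    """
    n = len(probs)
    return [(1.0 - eps) * p + eps / n for p in probs]


# A deterministic teacher over four actions, smoothed with eps = 0.02.
teacher = [1.0, 0.0, 0.0, 0.0]
smoothed = smooth_policy(teacher, eps=0.02)
```

A larger `eps` makes the KL constraint more forgiving of disagreement but also blurs what the teacher actually recommends, which is the information loss mentioned above.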

2.3 Lagrangian Relaxation of (OPT-R)

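The standard method for solving (OPT-R) is by applying Lagrangian relaxation Bertsekas (1999). We define the Lagrangian function

$$L(\theta, \lambda) = V^\theta(s_0) + \lambda \big( D_{KL}(\theta \,\|\, \theta_T) - \delta \big), \tag{2}$$

where $\lambda \ge 0$ is the Lagrange multiplier. Then the optimization problem (OPT-R) can be converted into the following problem:

$$\max_{\lambda \ge 0} \; \min_{\theta \in \Theta} \; L(\theta, \lambda). \tag{3}$$

The intuition behind (3) is that we now allow the student to deviate arbitrarily much from the teacher’s policy in order to decrease the cumulative cost, but we penalize any such deviation.

Next, we define a dynamical system which, as we will prove in Appendix B, solves problem (OPT-R) under several common assumptions for stochastic approximation methods. Once we know the optimal Lagrange multiplier $\lambda^*$, then the student’s optimal policy is

$$\theta^* \in \arg\min_{\theta \in \Theta} L(\theta, \lambda^*). \tag{4}$$

A point $(\theta^*, \lambda^*)$ is a saddle point of $L(\theta, \lambda)$ if for some $r > 0$, we have

$$L(\theta, \lambda^*) \ge L(\theta^*, \lambda^*) \ge L(\theta^*, \lambda) \tag{5}$$

for all $\theta \in \Theta \cap B_r(\theta^*)$ and $\lambda \ge 0$, where $B_r(\theta^*)$ represents a ball around $\theta^*$ with radius $r$. Then, the saddle point theorem Bertsekas (1999) immediately implies that $\theta^*$ is the local optimal solution of (OPT-R).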

3 Primal–Dual Policy Gradient Algorithm

We propose a primal–dual policy gradient (PDPG) algorithm for solving (OPT-R). Due to space limitations, we leave the detailed algorithm to Appendix A.2, but the overall scheme is as follows. After initializing the student’s policy parameters, possibly with those of the teacher, we sample multiple trajectories under the student’s policy at each iteration $k$. Then the sampled trajectories are used to calculate the approximate gradient of the Lagrangian function with respect to $\theta$ and $\lambda$. Finally, using an optimization algorithm, we update $\theta$ and $\lambda$ according to the approximated gradients.
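The primal–dual scheme can be illustrated on a toy problem. The sketch below is not the PDPG algorithm of Appendix A.2: it assumes a single-state problem with three actions, a softmax student policy, a per-state reverse KL constraint, exact gradients in place of sampled trajectories, and constant step sizes. The cost vector, the teacher distribution, the bound `delta`, and all function names are made up for illustration.

```python
import math


def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def primal_dual(costs, teacher, delta, iters=5000, lr_theta=0.5, lr_lam=0.05):
    """Gradient descent on theta and projected gradient ascent on lambda for
    L(theta, lam) = E_pi[cost] + lam * (KL(pi || teacher) - delta)."""
    theta = [math.log(q) for q in teacher]  # initialize the student at the teacher
    lam = 1.0
    for _ in range(iters):
        pi = softmax(theta)
        log_ratio = [math.log(p / q) for p, q in zip(pi, teacher)]
        kl = sum(p * r for p, r in zip(pi, log_ratio))
        # Penalized per-action cost: c_a + lam * log(pi_a / teacher_a).
        g = [c + lam * r for c, r in zip(costs, log_ratio)]
        baseline = sum(p * g_a for p, g_a in zip(pi, g))
        # Exact gradient of the Lagrangian with respect to the softmax logits.
        grad_theta = [p * (g_a - baseline) for p, g_a in zip(pi, g)]
        theta = [t - lr_theta * d for t, d in zip(theta, grad_theta)]
        # Dual ascent on the constraint violation, projected onto lam >= 0.
        lam = max(0.0, lam + lr_lam * (kl - delta))
    pi = softmax(theta)
    kl = sum(p * math.log(p / q) for p, q in zip(pi, teacher))
    cost = sum(p * c for p, c in zip(pi, costs))
    return pi, lam, kl, cost


# The teacher prefers the first action (cost 1); the second action is cheaper.
pi, lam, kl, cost = primal_dual(costs=[1.0, 0.0, 2.0],
                                teacher=[0.8, 0.1, 0.1],
                                delta=0.1)
```

In this toy setting the student shifts probability toward the cheaper action only until the KL budget `delta` is used up, and the multiplier settles at the value that prices that budget.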

In order to prove our main convergence result, we need some technical assumptions on the student’s policy and step sizes.

Assumption 2.

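(Smooth policy) For any $(s, a) \in \mathcal{S} \times \mathcal{A}$, $\pi_S(a \mid s; \theta)$ is a continuously differentiable function in $\theta$ and its gradient is $\kappa$-Lipschitz continuous, i.e., for any $\theta_1$ and $\theta_2$,

$$\big\| \nabla_\theta \pi_S(a \mid s; \theta_1) - \nabla_\theta \pi_S(a \mid s; \theta_2) \big\| \le \kappa \, \| \theta_1 - \theta_2 \|. \tag{6}$$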

Assumption 3.

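(Step-size rules) The step-sizes $\alpha_\theta(k)$ (policy update) and $\alpha_\lambda(k)$ (Lagrange multiplier update) in update rules (15) and (16) satisfy the following relations:

  1. $\sum_{k} \alpha_\theta(k) = \sum_{k} \alpha_\lambda(k) = \infty$,

  2. $\sum_{k} \alpha_\theta(k)^2 < \infty$ and $\sum_{k} \alpha_\lambda(k)^2 < \infty$,

  3. $\alpha_\lambda(k) = o\big(\alpha_\theta(k)\big)$.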
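As a concrete illustration, polynomially decaying schedules are one common family that satisfies relations of this kind. The sketch below is an assumption, not the schedule used in the paper: the policy step decays as $k^{-0.6}$ and the multiplier step as $k^{-1}$, so both sums diverge, both sums of squares converge, and the ratio of multiplier step to policy step goes to zero.

```python
def policy_step(k, exponent=0.6):
    """Step size for the policy update at iteration k >= 1.

    Any exponent in (0.5, 1] gives a divergent sum with a convergent sum of squares.
    """
    return 1.0 / (k ** exponent)


def multiplier_step(k):
    """Step size for the Lagrange multiplier update at iteration k >= 1."""
    return 1.0 / k


# The ratio multiplier_step / policy_step equals k ** -0.4 and shrinks toward
# zero, so the multiplier moves on the slower time scale.
ratios = [multiplier_step(k) / policy_step(k) for k in (1, 10, 100, 1000, 10000)]
```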

Relations 1 and 2 in Assumption 3 are common in stochastic approximation algorithms, and 3 indicates that the Lagrange multiplier update is in a slower time-scale compared to the policy updates. The latter condition simplifies the convergence proof by allowing us to study the PDPG as a two-time-scale stochastic approximation algorithm. The following theorem states the main theoretical result of this paper.

Theorem 1.

Under Assumptions 1, 2, and 3, the sequence of policy updates $\{\theta_k\}$ (starting from $\theta_0$ sufficiently close to a local optimum point $\theta^*$) and Lagrange multipliers $\{\lambda_k\}$ converges almost surely to a saddle point of the Lagrangian, i.e., $(\theta_k, \lambda_k) \to (\theta^*, \lambda^*)$ almost surely. Moreover, $\theta^*$ is a local optimal solution of (OPT-R).


Proof (sketch). The proof is similar to those found in Tamar et al. (2012); Chow et al. (2017). It is based on representing the $\theta$ and $\lambda$ update rules as a two-time-scale stochastic approximation algorithm. For each timescale, the algorithm can be shown to converge to the stationary points of the corresponding continuous-time system. Finally, it can be shown that the fixed point is, in fact, a locally optimal point. In Appendix B.1, we provide a formal proof of this theorem. ∎

Corollary 1.

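Under Assumptions 1, 2, and 3, the sequence of policy updates and Lagrange multipliers converges globally to a stationary point $(\theta^*, \lambda^*)$ of the Lagrangian almost surely. Moreover, if $\theta^*$ is in the interior of $\Theta$, then $\theta^*$ is a feasible first order stationary point of (OPT-R), i.e., $\nabla_\theta L(\theta^*, \lambda^*) = 0$ and $D_{KL}(\theta^* \,\|\, \theta_T) \le \delta$.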

Theorem 1 and Corollary 1 are also valid for the forward KL constraint case, as we discuss in Appendix B.3.

4 Practical PDPG Algorithm

Although the algorithm presented in the previous section is proved to converge to a first-order stationary point, it cannot directly serve as a practical learning algorithm. The main reason is that it produces a high-variance approximation of the gradients, which would lead to unstable learning. In this section, we propose several approximations to the theoretically-justified PDPG in order to develop a more practical algorithm. For this algorithm, we will consider the forward definition of KL-divergence due to its mean-seeking property.

One source of variance is the reward bias, which can be handled by adding a critic, similar to Konda and Tsitsiklis (2000). Our next adjustment is to use an approximation of the step-wise KL-divergence, defined as

$$\hat{D}_{KL}(\xi) = \sum_{t=0}^{T-1} D_{KL}\big( \pi_S(\cdot \mid s_t; \theta) \,\|\, \pi_T(\cdot \mid s_t; \theta_T) \big), \tag{7}$$

where $D_{KL}\big( \pi_S(\cdot \mid s_t; \theta) \,\|\, \pi_T(\cdot \mid s_t; \theta_T) \big) = \sum_{a \in \mathcal{A}} \pi_S(a \mid s_t; \theta) \log \frac{\pi_S(a \mid s_t; \theta)}{\pi_T(a \mid s_t; \theta_T)}$. As we discuss in Appendix C.1, using (7) results in a much smaller variance, while still ensuring the convergence results. Intuitively, this equation suggests that instead of computing the trajectory probabilities and then computing the KL-divergence, as in (KL-R), one can compute the KL in every visited state along a trajectory and sum them up. In addition to this change, we will further normalize each $\hat{D}_{KL}(\xi)$ by its trajectory length to remove the effect of the variable horizon length. The latter modification will lead to more sensible KL values and will make the choice of $\delta$ easier.
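The step-wise computation described above can be sketched as follows, assuming discrete actions and that the action distributions of both policies are available at every visited state. The function names and the two-state example are hypothetical; the reverse order (student first) is used here, and swapping the two arguments of `categorical_kl` gives the forward order.

```python
import math


def categorical_kl(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0.0)


def stepwise_kl(student_probs, teacher_probs, normalize=True):
    """Sum the per-state KL over the visited states of one trajectory.

    student_probs[t] and teacher_probs[t] are the action distributions of the
    two policies at the t-th visited state. With normalize=True the sum is
    divided by the trajectory length.
    """
    total = sum(categorical_kl(p, q) for p, q in zip(student_probs, teacher_probs))
    return total / len(student_probs) if normalize else total


# Two visited states: the student agrees with the teacher in the first one
# and commits to a single action in the second one.
student_traj = [[0.5, 0.5], [1.0, 0.0]]
teacher_traj = [[0.5, 0.5], [0.5, 0.5]]
raw = stepwise_kl(student_traj, teacher_traj, normalize=False)
per_step = stepwise_kl(student_traj, teacher_traj)
```

Dividing by the trajectory length keeps the value on the same scale for short and long episodes, which is what makes a single bound on the divergence easier to choose.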

A second difficulty with the algorithm in Section 3 is that, unlike conventional policy gradient algorithms, there is no guarantee that the student’s optimal policy is a deterministic one. In fact, in most of our experiments, it happens that the optimal policy is stochastic, especially when the teacher’s policy itself is stochastic. To illustrate this, consider two scenarios: i) The student refuses to do the suggested action of a deterministic teacher. In this case, she would incur an infinite cost as a result of her disobedience, so the problem will be infeasible. ii) The teacher is less informative and has no clue about most of the state space, so often takes random actions. Trying to emulate this teacher would cause degraded performance for the student as well, so the student would also take many less informed actions.

A stochastic optimal policy is usually not desirable since it poses major safety and reliability challenges, so our next adjustments are an attempt to address this issue. One possible mitigation for the first scenario might be using a bounded distance measure such as the Hellinger distance Cramer (1946) instead of KL-divergence, but our numerical experiments did not confirm that this is effective. We observe that by using the Hellinger constraint, the total entropy of the student’s policy stays high, without any improvement in the student’s policy. Instead, we propose using percentile KL-clipping, which is defined as


In fact, the clip function enables the student to totally disagree with the teacher in % of the visited states, without receiving an extremely large penalty. Selecting the values for depends on our perception about how perfect the teacher is. Setting close to 100 means that we believe in the teacher’s suggestions. As we decrease , we rely less on the teacher and can disobey more freely.
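One plausible reading of percentile clipping is sketched below: the per-state KL values collected from the visited states are capped at their own p-th percentile, so the few states with the largest disagreement stop dominating the penalty. The nearest-rank percentile, the function names, and the sample values are assumptions for illustration and are not taken from the paper.

```python
import math


def percentile(values, p):
    """Nearest-rank p-th percentile (0 < p <= 100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def clip_kl(step_kls, p):
    """Cap every per-state KL value at the p-th percentile of the batch."""
    cap = percentile(step_kls, p)
    return [min(v, cap) for v in step_kls]


# Ten visited states; in two of them the student strongly disagrees.
step_kls = [0.01, 0.02, 0.03, 5.0, 0.02, 0.01, 0.04, 0.02, 9.0, 0.03]
clipped = clip_kl(step_kls, p=80)      # the two large values are capped
unclipped = clip_kl(step_kls, p=100)   # a percentile of 100 leaves values as-is
```

With a percentile of 80, the two outlying states contribute no more than the largest of the remaining eight, so strong disagreement in roughly 20% of the visited states is tolerated.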

The last major modification is to control the expected entropy at a certain small level , i.e.,


The justification for adding (9) is that we would like the optimal policy to be close to a deterministic one as much as possible. By setting a small value for , we can enforce this property. Also, this constraint tries to avoid having a deterministic policy in intermediate training steps, in order to allow more exploration. To add this constraint, we use the same Lagrangian technique, adding an extra term to the Lagrangian function:


All of these modifications, along with a few others, are summarized in the practical PDPG algorithm given in the appendix.
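For reference, the quantity controlled by the entropy constraint can be estimated along a trajectory as the average Shannon entropy of the student's action distribution over the visited states. This is a sketch assuming discrete actions; the function names are hypothetical and the exact estimator used in the paper may differ.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(policy_probs):
    """Average entropy of the action distributions at the visited states."""
    return sum(entropy(p) for p in policy_probs) / len(policy_probs)


# A policy that is deterministic in one visited state and uniform in another.
visited = [[1.0, 0.0], [0.5, 0.5]]
avg = mean_entropy(visited)
```

A value near zero indicates a nearly deterministic policy, while a value near the logarithm of the number of actions indicates a nearly uniform one.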

5 Experiments

We illustrate the efficiency of the proposed methods with multiple GridWorld experiments. In the first set of experiments, the teacher tries to teach the student to perform an oscillating maneuver around the walls. In the second set, we study how the student can comprehend changes in the environment and utilize them to increase its rewards.

5.1 Square-Wave Teacher

In this experiment, we consider a teacher who gives a suggestion in every state of a GridWorld. We study two variants of the teacher, one who is very determined about all of his suggestions and the other who is less confident. Figure 1 illustrates the environment and both teachers’ suggestions. A student wants to find a path from the blue state to the green target. Each step has a reward and reaching the target brings reward. If the student wants to act independently, the optimal path is a trivial horizontal line. However, our objective is to force the student to “listen” to her teacher up to some level.

(a) “Determined” teacher with the corresponding suggested actions for every state. The red line shows the teacher’s suggested path to target
(b) “Less confident” teacher with deterministic actions in a subset of states and uniformly random actions in the rest
(c) Optimal Path (in green) versus a sample path found by PDPG (in purple) with and
Figure 1: Two different teachers with suggested actions and optimal path.

Determined Teacher: In this part, a teacher has a preferred action in every state with a probability of around 98%. As we observe in Figure 1(a), these suggestions might help the student in reaching the target, but they are inefficient. For instance, if the student follows a square-wave sequence of actions as illustrated by the red line, she will be able to reach the target while following all of the teacher’s suggestions exactly. Our objective is to allow the student to deviate from the teacher for a few steps, to find shorter routes.

By using PDPG, the student is able to find policies that are a mixture of the horizontal path and the square-wave route. For example, in Figure 1(c), we have illustrated an instance of the student’s optimized path with . The extent to which either policy is followed depends on values of and the KL-clipping parameter . Figure 2(a) illustrates the student’s total reward for different quantities without KL-clipping. We observe that as we increase , we allow the student to act more freely, hence she gets a higher reward. However, after 5000 training iterations, the reward remains at the same level with too much oscillation. Recalling the discussions of Section 4, this behavior indicates convergence to a stochastic policy.

To reduce the oscillating behavior, we proposed adding an entropy constraint and KL-clipping. Figure 2(b) shows how adding the entropy constraint results in a more deterministic (i.e., lower entropy) policy. Also, in Figure 2(c), we have added KL-clipping. As we decrease , the student can totally disagree with the teacher in a larger proportion of the visited states, so she can find better policies with higher rewards. For different values of , we see that the policy can converge to either a stochastic or deterministic one. For , it converges to a deterministic horizontal line policy. With , it learns to deterministically follow one -shaped path followed by a horizontal route, and for , it follows a -shaped path with a horizontal line at the end. Notice that even with , which means no clipping, the student is not exactly following the teacher. We also observe that for , it fails to converge to a deterministic policy. One justification for such a failure is that the student’s policy is far better than the less-rewarding deterministic one, but not good enough to get to the next level of performance. Finally, Figure 2(d) shows how the and values converge to their optimal values.

Less Confident Teacher: This experiment is designed to illustrate how a less confident teacher can still teach the student to follow some of his suggestions, but it will yield a lower level of confidence of the student. Figure 1(b) shows the suggested actions of the teacher; he is deterministic only in a subset of the states. For the rest, he does not have any information, so he suggests actions uniformly at random. The less confident teacher still has the square wave as the general idea (which is bad, just like the determined teacher), but also has extra randomness that points the student in even worse directions. In other words, the less confident teacher has a worse policy overall than the determined one. Recommending random actions causes the student to have more volatile behavior. We can observe this fact by comparing Figure 3(a) with Figure 2(a), where the student’s converged policy produces a wider range of rewards for the less-confident teacher’s case. Also, the average reward for this case is slightly lower, which can be explained by the inadequate information that the less-confident teacher provides for solving the task.

Figure 3(b) shows that adding KL-clipping helps in reducing the volatility, but one needs to choose a much smaller value for (compare it with Figure 2(c)). Yet, even a small does not necessarily result in a deterministic policy; for as small as , the student has converged to a stochastic one.

(a) The effect of on reward; no KL-clipping
(b) The effect of entropy constraint;
(c) Total reward for different and
(d) Convergence of and ;
Figure 2: Performance of a student learning from the deterministic teacher
Figure 3: Performance of a student learning from the less confident teacher
Figure 4: Teacher and student environments as well as their policies

5.2 Wall Leaping

The purpose of this experiment is to show that PDPG can act as an improvement method when the student encounters a slightly modified environment. The teacher’s reward structure is similar to the structure in the previous experiment, i.e., for every step and for reaching the target. However, the student comprehends that she can leap over some of the walls with a reward of . We use a vanilla policy gradient algorithm to train the teacher, which provides paths like the one illustrated in Figure 4(a). If we allow the student to learn without any constraint, it will find the green path in Figure 4 with a KL-divergence of . However, this is not what we are looking for since it is extremely different from the teacher. Instead, we use the PDPG algorithm to constrain the policy deviation with . Using this parameter, the student learns to follow the purple path, with a KL-divergence of .

6 Related Work

Learning from a teacher is a well-studied problem in the literature on supervised learning Girshick et al. (2014) and imitation learning Schaal (1999); Thomaz et al. (2006). However, we are not aware of any work using a teacher to control specific behaviors of a student. The typical use case of a student–teacher framework in RL is in “policy compression,” where the objective is to train a student from a collection of well-trained RL policies. Policy distillation Rusu et al. (2015) and actor–mimic Parisotto et al. (2015) are two methods that distill the trained RL agents, in a supervised learning fashion, into a unified policy of the student. In contrast, we follow a completely distinct objective, where a student is continually interacting with an environment and it only uses the teacher’s signals as a guideline for shaping her policy.

Closest to ours, Schmitt et al. (2018) propose “kickstarting RL,” a method that uses the teacher’s information for better training. Incorporating the idea of population-based training, they design a hand-crafted decreasing schedule of Lagrange multipliers, . Nevertheless, the justification for such a schedule is not clearly visible. However, noticing that their problem is a special case of ours with , our findings confirm the credibility of their approach, i.e., our findings indicate that according to strong duality. This observation also conforms with the experimental findings of Schmitt et al. (2018), and our theoretical results indicate that when there is no obligation on being similar to the teacher, the student is better off eventually operating independently. Similarly, their method only uses the teacher for faster learning.

Imposing certain constraints on the behavior of a policy is also a common problem in the context of “safe RL” Achiam et al. (2017); Leike et al. (2017); Chow et al. (2018). Typically, these problems look for policies that avoid hazardous states either during training or execution. Our problem is different in that we follow another type of constraint, yet similar methods might be applied. Using a domain-specific programming language instead of neural networks can be an alternative method to add interpretability Verma et al. (2018), but it lacks the numerous advantages inherent in end-to-end and differentiable learning. In an alternative direction, it is also possible to manipulate the policy shape by introducing auxiliary tasks or reward shaping Jaderberg et al. (2016). Despite the simplicity of the latter approach, it has a very limited capability. For example, it is unclear how reward shaping can suggest directions similar to our square-wave teacher. In summary, we believe that our end-to-end method, by implicitly adding interpretable components, can partially alleviate the concerns related to the RL policies.

7 Concluding Remarks

In this paper, we introduce a new paradigm called corrective RL, which allows a “student” agent to learn to optimize its own policy while also staying sufficiently close to the policy of a “teacher.” Our approach is motivated by the fact that practitioners may be reluctant to adopt the policies proposed by RL algorithms if they differ too much from the status quo. Even if the RL policy produces an impressive expected return, this may not be satisfactory evidence to switch the operation of a billion-dollar company to a policy found by an RL algorithm. We believe that corrective RL provides a straightforward remedy by constraining how far the new policy can deviate from the old one or another desired, target policy. Doing so will help reduce the stresses of adopting a novel policy.

We believe that, with further extensions, corrective RL has the potential to address some of RL’s interpretability challenges. Using more advanced optimization algorithms, studying different distance measures, considering continuous-action problems, and having multiple teachers represent fruitful avenues for future research.


This work was partially supported by the U.S. National Science Foundation, under award numbers NSF:CCF:1618717, NSF:CMMI:1663256 and NSF:CCF:1740796, and XSEDE IRI180020.


  • Mnih et al. [2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Mnih et al. [2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
  • Lillicrap et al. [2015] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • Chen et al. [2018] Shi-Yong Chen, Yang Yu, Qing Da, Jun Tan, Hai-Kuan Huang, and Hai-Hong Tang. Stabilizing reinforcement learning in dynamic environment with application to online recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1187–1196. ACM, 2018.
  • Nazari et al. [2018] MohammadReza Nazari, Afshin Oroojlooy, Lawrence Snyder, and Martin Takac. Reinforcement learning for solving the vehicle routing problem. In Advances in Neural Information Processing Systems, pages 9861–9871, 2018.
  • Feng et al. [2017] Shuo Feng, Peyman Setoodeh, and Simon Haykin. Smart home: Cognitive interactive people-centric internet of things. IEEE Communications Magazine, 55(2):34–39, 2017.
  • Gijsbrechts et al. [2018] Joren Gijsbrechts, Robert N Boute, Jan A Van Mieghem, and Dennis Zhang. Can deep reinforcement learning improve inventory management? Performance and implementation of dual sourcing-mode problems. December 17, 2018.
  • Oroojlooyjadid et al. [2017] Afshin Oroojlooyjadid, MohammadReza Nazari, Lawrence Snyder, and Martin Takáč. A deep q-network for the beer game with partial information. arXiv preprint arXiv:1708.05924, 2017.
  • Gu et al. [2017] Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In IEEE International Conference on Robotics and Automation (ICRA), pages 3389–3396, 2017.
  • Snyder and Shen [2019] Lawrence V Snyder and Zuo-Jun Max Shen. Fundamentals of supply chain theory, 2nd Edition. John Wiley & Sons, 2019.
  • Nasrabadi [2007] Nasser M Nasrabadi. Pattern recognition and machine learning. Journal of Electronic Imaging, 16(4):049901, 2007.
  • Bertsekas [1999] Dimitri P Bertsekas. Nonlinear programming. Athena Scientific, Belmont, 1999.
  • Tamar et al. [2012] Aviv Tamar, Dotan Di Castro, and Shie Mannor. Policy gradients with variance related risk criteria. In Proceedings of the twenty-ninth International Conference on Machine Learning, pages 387–396, 2012.
  • Chow et al. [2017] Yinlam Chow, Mohammad Ghavamzadeh, Lucas Janson, and Marco Pavone. Risk-constrained reinforcement learning with percentile risk criteria. Journal of Machine Learning Research, 18(167):1–167, 2017.
  • Konda and Tsitsiklis [2000] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in Neural Information Processing Systems, pages 1008–1014, 2000.
  • Cramer [1946] Harald Cramér. Mathematical methods of statistics. Princeton University Press, Princeton, NJ, 1946.
  • Girshick et al. [2014] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
  • Schaal [1999] Stefan Schaal. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 3(6):233–242, 1999.
  • Thomaz et al. [2006] Andrea Lockerd Thomaz, Cynthia Breazeal, et al. Reinforcement learning with human teachers: Evidence of feedback and guidance with implications for learning performance. In AAAI, volume 6, pages 1000–1005. Boston, MA, 2006.
  • Rusu et al. [2015] Andrei A Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. arXiv preprint arXiv:1511.06295, 2015.
  • Parisotto et al. [2015] Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Actor-mimic: Deep multitask and transfer reinforcement learning. arXiv preprint arXiv:1511.06342, 2015.
  • Schmitt et al. [2018] Simon Schmitt, Jonathan J Hudson, Augustin Zidek, Simon Osindero, Carl Doersch, Wojciech M Czarnecki, Joel Z Leibo, Heinrich Kuttler, Andrew Zisserman, Karen Simonyan, et al. Kickstarting deep reinforcement learning. arXiv preprint arXiv:1803.03835, 2018.
  • Achiam et al. [2017] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 22–31. JMLR. org, 2017.
  • Leike et al. [2017] Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, and Shane Legg. Ai safety gridworlds. arXiv preprint arXiv:1711.09883, 2017.
  • Chow et al. [2018] Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. A lyapunov-based approach to safe reinforcement learning. In Advances in Neural Information Processing Systems, pages 8092–8101, 2018.
  • Verma et al. [2018] Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri. Programmatically interpretable reinforcement learning. arXiv preprint arXiv:1804.02477, 2018.
  • Jaderberg et al. [2016] Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
  • Bhatnagar et al. [2009] Shalabh Bhatnagar, Richard S Sutton, Mohammad Ghavamzadeh, and Mark Lee. Natural actor-critic algorithms. Automatica, 45(11), 2009.
  • Borkar [2009] Vivek S Borkar. Stochastic approximation: a dynamical systems viewpoint, volume 48. Springer, 2009.
  • Bertsekas [2009] Dimitri P Bertsekas. Convex optimization theory. Athena Scientific, Belmont, 2009.
  • Slotine et al. [1991] Jean-Jacques E Slotine, Weiping Li, et al. Applied nonlinear control, volume 199. Prentice Hall, Englewood Cliffs, NJ, 1991.
  • Teh et al. [2017] Yee Teh, Victor Bapst, Wojciech M Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, and Razvan Pascanu. Distral: Robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, pages 4496–4506, 2017.
  • Ghosh et al. [2017] Dibya Ghosh, Avi Singh, Aravind Rajeswaran, Vikash Kumar, and Sergey Levine. Divide-and-conquer reinforcement learning. arXiv preprint arXiv:1711.09874, 2017.
  • Liu et al. [2018] Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pages 5361–5371, 2018.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Bello et al. [2016] Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940, 2016.

Appendix A PDPG Algorithms

A.1 Computing the Gradients

The Lagrangian function in the optimization problem (3) can be re-written as


Recall that is the set of all trajectories under all admissible policies. By taking the gradient of with respect to , we have:


and the term can be simplified as


The gradient of with respect to is


By using a set of sample trajectories generated under the student policy, one can approximate the gradients (12) and (14) as

which are the update rules that will be used later on, in (15) and (16).
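As a rough illustration of how sampled trajectories can stand in for an exact gradient, the sketch below compares a likelihood-ratio (score-function) estimate with the exact gradient in a one-step problem. Everything in it is an assumption made for illustration and not the paper's setting: a softmax policy over three actions, a hypothetical `cost` vector, and no constraint term. It only shows the sampling idea behind such approximations.

```python
import math
import random


def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def exact_gradient(theta, cost):
    """Exact gradient of the expected cost sum_a pi(a) * cost[a] w.r.t. theta."""
    pi = softmax(theta)
    mean_cost = sum(p * c for p, c in zip(pi, cost))
    return [p * (c - mean_cost) for p, c in zip(pi, cost)]


def sampled_gradient(theta, cost, num_samples, rng):
    """Likelihood-ratio estimate: mean of cost(a) * grad log pi(a) over samples."""
    pi = softmax(theta)
    actions = rng.choices(range(len(pi)), weights=pi, k=num_samples)
    grad = [0.0] * len(theta)
    for a in actions:
        for j in range(len(theta)):
            # For a softmax policy, d/d theta_j log pi(a) = 1[a == j] - pi_j.
            score = (1.0 if a == j else 0.0) - pi[j]
            grad[j] += cost[a] * score
    return [g / num_samples for g in grad]


# Hypothetical toy instance (not taken from the paper).
rng = random.Random(0)
theta = [0.5, -0.2, 0.1]
cost = [1.0, 0.0, 2.0]

g_exact = exact_gradient(theta, cost)
g_hat = sampled_gradient(theta, cost, num_samples=20000, rng=rng)
max_error = max(abs(a - b) for a, b in zip(g_exact, g_hat))
print("exact:  ", g_exact)
print("sampled:", g_hat)
```

The estimate is unbiased but noisy, which is consistent with the high-variance issue mentioned in the abstract; increasing `num_samples` tightens it at the usual Monte Carlo rate.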

A.2 PDPG Algorithm

Having derived the gradients of the Lagrangian (in Appendix A.1), we have all the necessary information for proposing our primal–dual policy gradient (PDPG) algorithm, which is described in Algorithm 1.

1:  input: teacher’s policy with weights
2:  initialize: student’s policy with , possibly equal to ; initialize step size schedules and
3:  while TRUE do
4:     for  do
5:        following policy , generate a set of trajectories , each starting from an initial state
6:        (-update) update according to
7:        (-update) update according to
8:     end for
9:     if  converges to  then
11:     else
12:        return and ; break
13:     end if
14:  end while
Algorithm 1 Primal-Dual Policy Gradient (PDPG) Algorithm for (OPT-R)

After initializing the student with the teacher’s policy, at each iteration , we take a mini-batch of sample trajectories under the student’s policy . In step 6, we use the sampled trajectories to compute an approximate gradient of the Lagrangian function with respect to and update the policy parameters in the negative direction of the approximate gradient with step size . In addition to policy parameter updates, the dual variables are learned concurrently using the recursive formula


where represents the associated step-size rule.

In this algorithm, we need to use two projection operators to ensure the convergence of the algorithm. Specifically, is an operator that projects to the closest point in , i.e., . Similarly, is an operator that maps to the interval . Finally, in steps 9–13, we check whether has converged to some point on the boundary. Such a convergence means that the projection space for the Lagrange multipliers is too small, so we increment the upper bound and repeat searching for a better policy.
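The overall structure of the procedure (a projected descent step on the policy parameters, a projected ascent step on the multiplier, and an outer loop that enlarges the multiplier bound) can be sketched on a toy problem. The following is a minimal sketch under several assumptions that are not taken from the paper: a one-step problem with a softmax student, a KL divergence to the teacher bounded by a hypothetical `delta`, exact expectations in place of sampled trajectories so that the run is deterministic, doubling as the rule for increasing the upper bound, and illustrative step sizes.

```python
import math


def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def project(x, lo, hi):
    """Euclidean projection of a scalar onto the interval [lo, hi]."""
    return min(max(x, lo), hi)


def kl(pi, q):
    """KL divergence KL(pi || q); used here as an assumed distance measure."""
    return sum(p * math.log(p / t) for p, t in zip(pi, q))


def primal_dual(cost, teacher, delta, lam_max, iters=5000, theta_bound=10.0):
    theta = [math.log(q) for q in teacher]  # initialize the student at the teacher
    lam = 0.0
    for k in range(1, iters + 1):
        step_theta = k ** -0.6  # faster time-scale (assumed schedule)
        step_lam = 1.0 / k      # slower time-scale (assumed schedule)
        pi = softmax(theta)
        gap = kl(pi, teacher) - delta  # constraint violation at the current policy
        # Gradient in theta of  sum_a pi_a cost_a + lam * (KL(pi || teacher) - delta).
        f = [c + lam * math.log(p / q) for c, p, q in zip(cost, pi, teacher)]
        mean_f = sum(p * v for p, v in zip(pi, f))
        grad_theta = [p * (v - mean_f) for p, v in zip(pi, f)]
        # Projected descent in theta, projected ascent in lam.
        theta = [project(t - step_theta * g, -theta_bound, theta_bound)
                 for t, g in zip(theta, grad_theta)]
        lam = project(lam + step_lam * gap, 0.0, lam_max)
    return softmax(theta), lam


def solve(cost, teacher, delta, lam_max=0.5, max_rounds=5):
    """Outer loop: enlarge the multiplier bound while lam sits on it."""
    for _ in range(max_rounds):
        pi, lam = primal_dual(cost, teacher, delta, lam_max)
        if lam < lam_max - 1e-6:
            break
        lam_max *= 2.0  # assumed increment rule
    return pi, lam, lam_max


# Hypothetical toy instance (not taken from the paper).
cost = [1.0, 0.0, 2.0]
teacher = [0.6, 0.3, 0.1]
delta = 0.3
student, lam, lam_max_used = solve(cost, teacher, delta)
teacher_cost = sum(q * c for q, c in zip(teacher, cost))
student_cost = sum(p * c for p, c in zip(student, cost))
print("student policy:", student)
print("multiplier:", lam, "bound:", lam_max_used)
```

In this toy instance the student lowers its expected cost relative to the teacher while the multiplier adjusts to keep the divergence near the budget. With sampled gradients, the same loop would be noisier and would need the variance-reduction adjustments discussed in the main text.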

Appendix B Convergence Analysis of PDPG for (OPT-R)

Before starting the proof of Theorem 1, noting the definition of and , one can make the following observations:

Lemma 1.

Under Assumption 2, the following holds:

  1. is Lipschitz continuous in , which further implies that


    for some .

  2. is Lipschitz continuous in , which further implies that


    for some constant .

  3. is Lipschitz continuous in .


Recall from (13) that whenever for all and for some . Assumption 2 indicates that is -Lipschitz continuous in . Then, using the fact that sums of products of (bounded) Lipschitz functions are themselves Lipschitz, one can conclude the Lipschitz continuity of , and we denote by its finite Lipschitz constant. Also, noting that w.p. 1, then w.p. 1. The Lipschitz continuity implies that for any fixed ,


The first inequality follows from the linear growth condition of Lipschitz functions and the last one holds for a suitable value of . Taking the square of both sides of (20) yields (18) with .

Since and are continuously differentiable in whenever , the Lipschitz continuity of can be investigated, from its definition (12), as the sums of products of (bounded) Lipschitz functions. From the definition (12), and recalling Assumption 1 and the compactness of , one can verify the validity of (19) with


Finally, item 3 follows immediately from the fact that is a constant function of . ∎

B.1 Convergence of PDPG Algorithm

We use the standard procedure for proving the convergence of the PDPG algorithm. The proof steps are common for stochastic approximation methods and we refer the reader to Chow et al. [2017], Bhatnagar et al. [2009] and references therein for more details. We summarize the scheme of the proof in the following steps:

  1. Tracking o.d.e.: Under Assumption 3, one can view the PDPG as a two-time-scale stochastic approximation method. Then, using the results of Section 6 of Borkar [2009], we show that the sequence of converges almost surely to a stationary point of the corresponding continuous-time dynamical system.

  2. Lyapunov Stability: By using Lyapunov analysis, we show that the continuous-time system is locally asymptotically stable at a first-order stationary point.

  3. Saddle Point Analysis: Since we use the Lagrangian as the Lyapunov function, the system is stable at the stationary point of the Lagrangian, which is, in fact, a local saddle point. Finally, we show that with an appropriate initial policy, the policy converges to a locally optimal solution of (OPT-R).
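To make the two-time-scale idea in step 1 concrete, the sketch below checks numerically the conditions that such analyses typically require of the step sizes: each sequence has a divergent sum and a convergent sum of squares, and the slow step size becomes negligible relative to the fast one. Assumption 3 is not restated in this appendix, so these conditions and the schedules `fast_step` and `slow_step` are assumptions chosen for illustration, not the paper's exact requirements.

```python
def fast_step(k):
    """Step size for the fast (policy-parameter) iterate; hypothetical choice."""
    return k ** -0.6


def slow_step(k):
    """Step size for the slow (dual-variable) iterate; hypothetical choice."""
    return 1.0 / k


def partial_sums(step, n):
    """Return (sum of steps, sum of squared steps) for k = 1..n."""
    total = 0.0
    total_sq = 0.0
    for k in range(1, n + 1):
        s = step(k)
        total += s
        total_sq += s * s
    return total, total_sq


report = {}
for n in (10**2, 10**3, 10**4, 10**5):
    fast_sum, fast_sq = partial_sums(fast_step, n)
    slow_sum, slow_sq = partial_sums(slow_step, n)
    ratio = slow_step(n) / fast_step(n)  # should shrink toward zero
    report[n] = (fast_sum, fast_sq, slow_sum, slow_sq, ratio)
    print(n, report[n])
```

The partial sums keep growing with `n`, the sums of squares level off, and the ratio of slow to fast step sizes decays, which is the separation that lets the fast iterate be treated as converged when analyzing the slow one.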

First, let us denote by the right directional derivative of in the direction of , defined as

for any compact set and .

Since converges on a faster time-scale than by Assumption 3, one can write the -update rule (15) with a relation that is invariant to :

Consider the continuous-time dynamics of defined as


where by using the right directional derivative in the gradient descent algorithm for , the gradient will point in the descent direction of along the boundary of (denoted by ) whenever the -update hits the boundary. We refer the interested reader to Section 5.4 of Borkar [2009] for discussions about the existence of the limit in (22).

Since converges in the slowest time-scale, the -update rule (16) can be re-written for a converged value as

Consider the continuous-time dynamics corresponding to , i.e.


where by using in the gradient ascent algorithm, the gradient will point in the ascent direction along the boundary of (denoted by ) whenever the -update hits the boundary.

We prove Theorem 1 next.


Convergence of the -update: First, we need to show that the assumptions of Lemma 1 in Chapter 6 of Borkar [2009] hold for the -update and an arbitrary value of . Let us justify these assumptions: (i) the Lipschitz continuity follows from Lemma 1, and (ii) the step-size rules follow from Assumption 3. (iii) For an arbitrary value , one can write the -update as a stochastic approximation, i.e.,




For to be a Martingale difference error term, we need to show that its expectation with respect to the filtration is zero and that it is square integrable with for some . Since the trajectories are generated from the probability mass function , it immediately follows that . Also, we have:


The first and second inequalities use the relation . The second also uses the results of Lemma 1. The boundedness of follows from Assumption 1 and having , w.p. 1. Finally, (iv) almost surely, because all are within the compact set . Hence, by Theorem 2 of Chapter 2 in Borkar [2009], the sequence converges almost surely to a (possibly sample-path-dependent) internally chain transitive invariant set of the o.d.e. (22).

For a given , define the Lyapunov function


where is a local minimum point. For the sake of simplifying the proof, let us consider that is an isolated local minimum point, i.e., there exists such that for all , . This means that the Lyapunov function is locally positive definite, i.e., and for .

If we establish the negative semi-definiteness of , then we can use the Lyapunov stability theorems to show the convergence of the dynamical system. Consider the time derivative of the corresponding continuous-time system for , i.e.,


Consider two cases:

  1. For a fixed , there exists such that the update for all . In this case, , which further implies that

    and this quantity is non-zero as long as .

  2. For fixed and any , there exists such that . The projection maps to a point in . This projection is single-valued because of the compactness and convexity of , and we denote the projected point by . Consider , then

    where the last inequality follows from the Projection Theorem (see Proposition 1.1.9 of Bertsekas [2009]). Again, one can verify that the time-derivative quantity is non-zero as long as .

In summary, and this quantity is nonzero as long as . Then by LaSalle’s Local Invariant Set Theorem (see, e.g., Theorem 3.4 of Slotine et al. [1991]), we conclude that the dynamical system tends to the largest positive invariant set within . Notice that . Let be equal to . Then every trajectory starting from the attraction region will tend to the local minimum . Since we chose to be arbitrary, this holds for all local minima. Hence, using Corollary of Chapter 2 in Borkar [2009], we conclude that if the initial policy is within the attraction region of a local minimum point , then it will converge to it almost surely.


The case in which is not isolated can be handled similarly, with the minor difference that the convergence happens to a set of optimal points instead of to a single point.
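The Lyapunov argument above has a simple discrete-time analogue that can be checked numerically: along projected gradient steps with a small enough step size, the gap between the objective and its constrained minimum never increases, including when the iterate slides along the boundary. The sketch below uses a hypothetical convex quadratic on a box rather than the paper's Lagrangian, so it illustrates the mechanism only.

```python
def project_box(x, lo, hi):
    """Componentwise projection onto the box [lo, hi]^n."""
    return [min(max(v, lo), hi) for v in x]


def f(x):
    """Toy objective whose unconstrained minimizer (2, -0.5) lies outside the box."""
    return (x[0] - 2.0) ** 2 + (x[1] + 0.5) ** 2


def grad_f(x):
    return [2.0 * (x[0] - 2.0), 2.0 * (x[1] + 0.5)]


def projected_descent(x0, step, iters, lo=-1.0, hi=1.0):
    """Euler-style discretization of a projected gradient flow."""
    x = list(x0)
    values = [f(x)]
    for _ in range(iters):
        g = grad_f(x)
        x = project_box([xi - step * gi for xi, gi in zip(x, g)], lo, hi)
        values.append(f(x))
    return x, values


# Constrained minimizer on the box [-1, 1]^2: the first coordinate is clipped.
x_star = [1.0, -0.5]
x_final, values = projected_descent([-1.0, 1.0], step=0.1, iters=200)

# Lyapunov-style gap: objective value minus the constrained optimum.
lyapunov = [v - f(x_star) for v in values]
print("final point:", x_final)
print("first gaps:", lyapunov[:5])
```

The first coordinate reaches the boundary after a few steps and stays there, while the gap keeps shrinking through the second coordinate. This mirrors the two cases in the proof: interior updates and updates that hit the boundary.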

Convergence of the -update: We need to show that the assumptions of Theorem 2 in Chapter 6 of Borkar [2009] hold for the two-time-scale stochastic approximation theory. Let us verify the validity of these assumptions: (i) is a Lipschitz function in from Lemma 1, and (ii) step-size rules follow from Assumption 3. (iii) Since converges in a slower time-scale, we have almost surely as , which, according to the Lipschitz continuity of , implies that


Hence the -update can be written as



From (29), we can verify that , where is a filtration of generated by different independent trajectories. Also, we have:

Hence, is a Martingale difference error. Also, (v) . Recall that from the convergence analysis of the -update for a , we know that is an asymptotically stable point. Then by Theorem 2 of Chapter 6 in Borkar [2009], we can conclude that converges almost surely to