Abstract
Dealing with high variance is a significant challenge in model-free reinforcement learning (RL). Existing methods are unreliable, exhibiting high variance in performance from run to run using different initializations/seeds. Focusing on problems arising in continuous control, we propose a functional regularization approach to augmenting model-free RL. In particular, we regularize the behavior of the deep policy to be similar to a policy prior, i.e., we regularize in function space. We show that functional regularization yields a bias-variance trade-off, and propose an adaptive tuning strategy to optimize this trade-off. When the policy prior has control-theoretic stability guarantees, we further show that this regularization approximately preserves those stability guarantees throughout learning. We validate our approach empirically on a range of settings, and demonstrate significantly reduced variance, guaranteed dynamic stability, and more efficient learning than deep RL alone.
Control Regularization for Reduced Variance Reinforcement Learning
Richard Cheng, Abhinav Verma, Gábor Orosz, Swarat Chaudhuri, Yisong Yue, Joel W. Burdick
Proceedings of the International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).
Reinforcement learning (RL) focuses on finding an agent’s policy (i.e. controller) that maximizes long-term accumulated reward. This is done by the agent repeatedly observing its state, taking an action (according to a policy), and receiving a reward. Over time the agent modifies its policy to maximize its long-term reward. Among other applications, this method has been successfully applied to control tasks (Lillicrap et al., 2016; Schulman et al., 2015; Ghosh et al., 2018), such as learning to stabilize complex robots.
In this paper, we focus particularly on policy gradient (PG) RL algorithms, which have become popular in solving continuous control tasks (Duan et al., 2016). Since PG algorithms focus on maximizing the long-term reward through trial and error, they can learn to control complex tasks without a prior model of the system. This comes at the cost of slow, high-variance learning: complex tasks can take millions of iterations to learn. More importantly, variation between learning runs can be very high, meaning some runs of an RL algorithm succeed while others fail depending on randomness in initialization and sampling. Several studies have noted this high variability in learning as a significant hurdle for the application of RL, since learning becomes unreliable (Henderson et al., 2018; Arulkumaran et al., 2017; Recht, 2019). All policy gradient algorithms face the same issue.
We can alleviate the aforementioned issues by introducing a control-theoretic prior into the learning process using functional regularization. Theories and procedures exist to design stable controllers for the vast majority of real-world physical systems (from humanoid robots to robotic grasping to smart power grids). However, conventional controllers for complex systems can be highly suboptimal and/or require great effort in system modeling and controller design. It would be ideal then to leverage simple, suboptimal controllers in RL to reliably learn high-performance policies.
In this work, we propose a policy gradient algorithm, CORE-RL (COntrol REgularized Reinforcement Learning), that utilizes a functional regularizer around a (typically suboptimal) control prior, i.e. a controller designed from any prior knowledge of the system. We show that this approach significantly lowers variance in the policy updates, and leads to higher performance policies when compared to both the baseline RL algorithm and the control prior. In addition, we prove that our policy can maintain control-theoretic stability guarantees throughout the learning process. Finally, we empirically validate our approach using three benchmarks: a car-following task with real driving data, the TORCS racecar simulator, and a simulated cart-pole problem. In summary, the main contributions of this paper are as follows:

We introduce functional regularization using a control prior, and prove that this significantly reduces variance during learning at the cost of potentially increasing bias.

We provide control-theoretic stability guarantees throughout learning when utilizing a robust control prior.

We validate experimentally that our algorithm, CORE-RL, exhibits reliably higher performance than the base RL algorithm (and control prior), achieves significant variance reduction in the learning process, and maintains stability throughout learning for stabilization tasks.
Significant previous research has examined variance reduction and bias in policy gradient RL. It has been shown that an unbiased estimate of the policy gradient can be obtained from sample trajectories (Williams, 1992; Sutton et al., 1999; Baxter & Bartlett, 2000), though these estimates exhibit extremely high variance. This variance can be reduced without introducing bias by subtracting a baseline from the reward function in the policy gradient (Weaver & Tao, 2001; Greensmith et al., 2004). Several works have studied the optimal baseline for variance reduction, often using a critic structure to estimate a value function or advantage function for the baseline (Zhao et al., 2012; Silver et al., 2014; Schulman et al., 2016; Wu et al., 2018). Other works have examined variance reduction in the value function using temporal regularization or regularization directly on the sampled gradient variance (Zhao et al., 2015; Thodoroff et al., 2018). However, even with these tools, variance remains problematically high in reinforcement learning (Islam et al., 2017; Henderson et al., 2018). Our work aims to achieve significant further variance reduction directly on the policy using control-based functional regularization.
Recently, there has been increased interest in functional regularization of deep neural networks, both in reinforcement learning and other domains. Work by Le et al. (2016) has utilized functional regularization to guarantee smoothness of learned functions, and Benjamin et al. (2018) studied properties of functional regularization to limit function distances, though they relied on pointwise sampling from the functions, which can lead to high regularizer variance. In terms of utilizing control priors, work by Johannink et al. (2018) adds a control prior during learning, and empirically demonstrates improved performance. Farshidian et al. (2014) and Nagabandi et al. (2017) used model-based priors to produce a good initialization for their RL algorithm, but did not use regularization during learning.
Another thread of related work is that of safe RL. Several works on model-based RL have looked at constrained learning such that stability is always guaranteed using Lyapunov-based methods (Perkins & Barto, 2003; Chow et al., 2018; Berkenkamp et al., 2017). However, these approaches do not address reward maximization or they overly constrain exploration. On the other hand, work by Achiam et al. (2017) has incorporated constraints (such as stability) into the learning objective, though model-free methods only guarantee approximate constraint satisfaction after a learning period, not during learning (García & Fernández, 2015). Our work proves stability properties throughout learning by taking advantage of the robustness of control-theoretic priors.
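Consider an infinite-horizon discounted Markov decision process (MDP) with deterministic dynamics defined by the tuple $(S, A, f, r, \gamma)$, where $S$ is a set of states, $A$ is a continuous and convex action space, and $f: S \times A \rightarrow S$ describes the system dynamics, which is unknown to the learning agent. The evolution of the system is given by the following dynamical system and its continuous-time analogue,
$$s_{t+1} = f(s_t, a_t) = f^{\text{known}}(s_t, a_t) + f^{\text{unknown}}(s_t, a_t), \qquad \dot{s} = f_c(s, a) = f_c^{\text{known}}(s, a) + f_c^{\text{unknown}}(s, a), \tag{1}$$
where $f^{\text{known}}$ captures the known dynamics, $f^{\text{unknown}}$ represents the unknowns, $\dot{s}$ denotes the continuous time-derivative of the state $s$, and $f_c(s, a)$ denotes the continuous-time analogue of the discrete time dynamics $f(s_t, a_t)$. A control prior can typically be designed from the known part of the system model, $f_c^{\text{known}}$.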
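Consider a stochastic policy $\pi_\theta(a|s): S \times A \rightarrow [0, 1]$ parameterized by $\theta$. RL aims to find the policy (i.e. parameters, $\theta$) that maximizes the expected accumulated reward $J(\theta)$:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t} \gamma^{t} r(s_t, a_t)\Big]. \tag{2}$$
Here $\tau \sim \pi_\theta$ is a trajectory $\tau = \{s_t, a_t, \ldots, s_{t+n}, a_{t+n}\}$ whose actions and states are sampled from the policy distribution $\pi_\theta(a|s)$ and the environmental dynamics (1), respectively. The function $r(s, a): S \times A \rightarrow \mathbb{R}$ is the reward function, and $\gamma \in (0, 1)$ is the discount factor.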
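As a small illustration of the objective in (2), the sketch below estimates the expected discounted return by averaging over sampled trajectories. The function names and the list-of-rewards input format are assumptions made for this example only; they do not come from the paper or its code.

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t for the rewards of one sampled trajectory."""
    total = 0.0
    # Iterate backwards so each step is a single multiply-add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def estimate_objective(reward_trajectories, gamma):
    """Monte Carlo estimate of the objective: mean return over trajectories."""
    returns = [discounted_return(rs, gamma) for rs in reward_trajectories]
    return sum(returns) / len(returns)
```

Because only finitely many trajectories are averaged, this estimate (and any gradient built from it) fluctuates from run to run, which is the source of variance discussed in this section.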
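This work focuses on policy gradient RL methods, which estimate the gradient of the expected return $J(\theta)$ with respect to the policy based on sampled trajectories. We can estimate the gradient, $\nabla_\theta J$, as follows (Sutton et al., 1999),
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(\tau)\, Q^{\pi_\theta}(\tau)\big] \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} \big[\nabla_\theta \log \pi_\theta(s_{i,t}, a_{i,t})\, Q^{\pi_\theta}(s_{i,t}, a_{i,t})\big], \tag{3}$$
where $Q^{\pi_\theta}(s, a) = \mathbb{E}_{\tau \sim \pi_\theta}\big[\sum_{k=0}^{T} \gamma^{k} r(s_{t+k}, a_{t+k}) \,\big|\, s_t = s, a_t = a\big]$. With a good Q-function estimate, the term (3) is a low-bias estimator of the policy gradient, and utilizes the variance-reduction technique of subtracting a baseline from the reward. However, the resulting policy gradient still has very high variance with respect to $\theta$, because the expectation in term (3) must be estimated using a finite set of sampled trajectories. This high variance in the policy gradient, $\text{var}_\theta[\nabla_\theta J(\theta_k)]$, translates to high variance in the updated policy, $\text{var}_\theta[\pi_{\theta_{k+1}}]$, as seen below,
$$\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\theta_k), \qquad \pi_{\theta_{k+1}} = \pi_{\theta_k} + \alpha \frac{d\pi_{\theta_k}}{d\theta} \nabla_\theta J(\theta_k) + \mathcal{O}(\Delta\theta^2), \tag{4}$$
where $\alpha$ is the user-defined learning rate. It is important to note that the variance we are concerned about is with respect to the parameters $\theta$, not the noise in the exploration process.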
To illustrate the variance issue, Fig. 1 shows the results of 100 separate learning runs using direct policy search on the OpenAI gym task Humanoid-v1 (Recht, 2019). Though high rewards are often achieved, huge variance arises from random initializations and seeds. In this paper, we show that introducing a control prior reduces learning variability, improves learning efficiency, and can provide control-theoretic stability guarantees during learning.
The policy gradient allows us to optimize the objective from sampled trajectories, but it does not utilize any prior model. However, in many cases we have enough system information to propose at least a crude nominal controller. Therefore, suppose we have a (suboptimal) control prior, $u_{prior}: S \rightarrow A$, and we want to combine our RL policy, $\pi_{\theta_k}$, with this control prior at each learning stage, $k$. Before we proceed, let us define $u_{\theta_k}: S \rightarrow A$ to represent the realized controller sampled from the stochastic RL policy $\pi_{\theta_k}(a|s)$ (we will use $u$ to represent deterministic policies and $\pi$ to represent the analogous stochastic ones). We propose to combine the RL policy with the control prior as follows,
$$u_k(s) = \frac{1}{1+\lambda} u_{\theta_k}(s) + \frac{\lambda}{1+\lambda} u_{prior}(s), \tag{5}$$
where we assume a continuous, convex action space. Note that $u_k(s)$ is the realized controller sampled from stochastic policy $\pi_k$, whose distribution over actions has been shifted by $u_{prior}$ such that $\pi_k\big(\frac{1}{1+\lambda} a + \frac{\lambda}{1+\lambda} u_{prior} \,\big|\, s\big) = \pi_{\theta_k}(a|s)$. We refer to $u_k$ as the mixed policy, and $u_{\theta_k}$ as the RL policy.
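The mixing step can be sketched in a few lines, assuming the mixed policy is the convex combination that weights the sampled RL action by 1/(1+λ) and the control prior by λ/(1+λ). The function name `mixed_action` and the use of plain Python lists for action vectors are illustrative choices, not part of the paper.

```python
def mixed_action(u_rl, u_prior, lam):
    """Blend a sampled RL action with the control-prior action.

    u_rl, u_prior: sequences of floats with the same length (action vectors).
    lam: regularization weight (lam = 0 recovers the pure RL action; a large
         lam approaches the control prior).
    """
    if lam < 0:
        raise ValueError("lam must be non-negative")
    return [(a + lam * b) / (1.0 + lam) for a, b in zip(u_rl, u_prior)]
```

Since the weights sum to one, the result stays inside a convex action space whenever both inputs do, which is why the convexity assumption matters.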
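Utilizing the mixed policy (5) is equivalent to placing a functional regularizer on the RL policy, $u_{\theta_k}$, with regularizer weight $\lambda$. Let $\pi_{\theta_k}(a|s)$ be Gaussian distributed: $\pi_{\theta_k} = \mathcal{N}(\bar{u}_{\theta_k}, \Sigma)$, so that $\Sigma$ describes the exploration noise. Then we obtain the following,
$$\bar{u}_k(s) = \frac{1}{1+\lambda} \bar{u}_{\theta_k}(s) + \frac{\lambda}{1+\lambda} u_{prior}(s), \tag{6}$$
where the control prior, $u_{prior}$, can be interpreted as a Gaussian prior on the mixed control policy (see Appendix A). Let us define the norm $\|u_1 - u_2\|_\Sigma = (u_1 - u_2)^T \Sigma^{-1} (u_1 - u_2)$.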
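Lemma 1.
The policy $\bar{u}_k(s)$ in Equation (6) is the solution to the following regularized optimization problem,
$$\bar{u}_k(s) = \arg\min_{u} \; \|u(s) - \bar{u}_{\theta_k}(s)\|_\Sigma + \lambda \|u(s) - u_{prior}(s)\|_\Sigma, \quad \forall s \in S, \tag{7}$$
which can be equivalently expressed as the constrained optimization problem,
$$\bar{u}_k(s) = \arg\min_{u} \; \|u(s) - \bar{u}_{\theta_k}(s)\|_\Sigma \quad \text{s.t.} \quad \|u(s) - u_{prior}(s)\|_\Sigma \le \tilde{\mu}(\lambda), \quad \forall s \in S, \tag{8}$$
where $\tilde{\mu}$ constrains the policy search. Assuming convergence of the RL algorithm, $\bar{u}_k(s)$ converges to the solution,
$$\bar{u}_k(s) = \arg\min_{u} \; \Big\|u(s) - \arg\max_{u_\theta} \mathbb{E}_{\tau \sim u}\big[r(s, a)\big]\Big\|_\Sigma + \lambda \|u(s) - u_{prior}(s)\|_\Sigma, \quad \forall s \in S, \text{ as } k \rightarrow \infty. \tag{9}$$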
This lemma is proved in Appendix A. The equivalence between (6) and (7) illustrates that the control prior acts as a functional regularization (recall that the RL policy solves the reward maximization problem appearing in (9)). The policy mixing (6) can also be interpreted as constraining policy search near the control prior, as shown by (8). More weight on the control prior (higher $\lambda$) constrains the policy search more heavily. In certain settings, the problem can be solved in the constrained optimization formulation (Le et al., 2019).
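As a quick sanity check on the first claim of the lemma (not a substitute for the proof in Appendix A), the short derivation below assumes the norm is the quadratic form $\|x\|_\Sigma = x^T \Sigma^{-1} x$ with $\Sigma$ positive definite, and that the regularized objective is the distance to the RL policy mean plus $\lambda$ times the distance to the control prior. Setting the gradient to zero at a fixed state then recovers the mixing rule:

```latex
\begin{align*}
0 &= \nabla_u \Big[ (u - \bar{u}_{\theta_k})^T \Sigma^{-1} (u - \bar{u}_{\theta_k})
      + \lambda\, (u - u_{prior})^T \Sigma^{-1} (u - u_{prior}) \Big] \\
  &= 2\,\Sigma^{-1} (u - \bar{u}_{\theta_k}) + 2\lambda\, \Sigma^{-1} (u - u_{prior}) \\
\Rightarrow\quad (1 + \lambda)\, u &= \bar{u}_{\theta_k} + \lambda\, u_{prior} \\
\Rightarrow\quad u &= \frac{1}{1+\lambda}\, \bar{u}_{\theta_k} + \frac{\lambda}{1+\lambda}\, u_{prior}.
\end{align*}
```

The objective is strictly convex in $u$ for $\lambda \ge 0$, so this stationary point is the unique minimizer.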
Our learning algorithm is described in Algorithm 1. At a high level, the process can be described as:

First compute the control prior based on prior knowledge (Line 1). See Section 5 for details on controller synthesis.

For a given policy iteration, compute the regularization weight, $\lambda$, using the strategy described in Section 4.3 (Lines 7-9). The algorithm can also use a fixed regularization weight, $\lambda$ (Lines 10-11).

Deploy the mixed policy (5) on the system, and record the resulting states/actions/rewards (Lines 13-15).

At the end of each policy iteration, update the policy based on the recorded states/actions/rewards (Lines 16-18).
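The four steps above can be sketched on a toy problem. This is a minimal illustration under strong assumptions, not a reproduction of Algorithm 1: the scalar dynamics, the proportional control prior, the quadratic reward, the fixed λ, and the REINFORCE-style update (standing in for DDPG, PPO, or TRPO) are all hypothetical choices made for this example.

```python
import random


def run_control_regularized_sketch(lam=2.0, iterations=20, horizon=30, seed=0):
    rng = random.Random(seed)
    # Step 1: a hypothetical control prior u_prior(s) = -k_prior * s.
    k_prior = 1.0
    theta = 0.0  # RL policy mean is theta * s (linear Gaussian policy)
    sigma, lr, gamma, dt = 0.3, 1e-3, 0.99, 0.1
    episode_returns = []
    for _ in range(iterations):
        # Step 2: the regularization weight is held fixed (lam) in this sketch.
        s = 1.0
        states, rl_actions, rewards = [], [], []
        for _ in range(horizon):
            # Step 3: sample the RL policy, mix with the prior, apply, record.
            u_rl = theta * s + sigma * rng.gauss(0.0, 1.0)
            u_prior = -k_prior * s
            u = (u_rl + lam * u_prior) / (1.0 + lam)
            states.append(s)
            rl_actions.append(u_rl)
            s = s + dt * u  # toy dynamics (assumed)
            rewards.append(-(s ** 2) - 0.01 * u ** 2)
        # Step 4: policy-gradient update from the recorded data.
        to_go, running = [], 0.0
        for r in reversed(rewards):
            running = r + gamma * running
            to_go.append(running)
        to_go.reverse()
        baseline = sum(to_go) / len(to_go)
        # d/dtheta log N(u_rl; theta*s, sigma^2) = (u_rl - theta*s) * s / sigma^2
        grad = sum((a - theta * x) * x / sigma ** 2 * (ret - baseline)
                   for a, x, ret in zip(rl_actions, states, to_go))
        theta += lr * grad
        episode_returns.append(sum(rewards))
    return theta, episode_returns
```

Note that the gradient is taken with respect to the RL policy's own sample `u_rl`, while the environment only ever sees the mixed action `u`.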
Theorem 1 formally states that mixing the policy gradient-based controller, $\pi_{\theta_k}$, with the control prior, $\pi_{prior}$, decreases learning variability. However, the mixing may introduce bias into the learned policy that depends on (a) the regularization weight $\lambda$, and (b) the sub-optimality of the control prior. Bias is defined in (10) and refers to the difference between the mixed policy and the (potentially locally) optimal RL policy at convergence.
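Theorem 1.
Consider the mixed policy (5) where $\pi_{\theta_k}$ is a policy gradient-based RL policy, and denote the (potentially local) optimal policy to be $\pi_{opt}$. The variance (4) of the mixed policy arising from the policy gradient is reduced by a factor $\big(\frac{1}{1+\lambda}\big)^2$ when compared to the RL policy with no control prior.
However, the mixed policy may introduce bias proportional to the sub-optimality of the control prior. If we let $D_{sub} = D_{TV}(\pi_{opt}, \pi_{prior})$, then the policy bias (i.e. $D_{TV}(\pi_k, \pi_{opt})$) is bounded as follows,
$$D_{TV}(\pi_k, \pi_{opt}) \ge D_{sub} - \frac{1}{1+\lambda} D_{TV}(\pi_{\theta_k}, \pi_{prior}), \qquad D_{TV}(\pi_k, \pi_{opt}) \le \frac{\lambda}{1+\lambda} D_{sub} \ \text{ as } k \rightarrow \infty, \tag{10}$$
where $D_{TV}(\cdot, \cdot)$ represents the total variation distance between two probability measures (i.e. policies). Thus, if $D_{sub}$ and $\lambda$ are large, this will introduce policy bias.
The proof can be found in Appendix B. Recall that $\pi_{prior}$ is the stochastic analogue to the deterministic control prior $u_{prior}$, such that $\pi_{prior}(a|s) = \mathbb{1}(a = u_{prior}(s))$ where $\mathbb{1}$ is the indicator function. Note that the bias/variance results apply to the policy, not the accumulated reward.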
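The variance-reduction factor can be checked numerically under a simplifying assumption: the run-to-run spread of the RL action at a fixed state is modeled as a Gaussian sample, and the mixed action is the affine combination with weights 1/(1+λ) and λ/(1+λ). The function name, the spread parameters, and the prior value are hypothetical.

```python
import random
import statistics


def mixed_variance_ratio(lam, n_runs=5000, seed=0):
    """Ratio var(mixed action) / var(RL action) across simulated learning runs."""
    rng = random.Random(seed)
    u_prior = 0.5  # deterministic control-prior action at this state (assumed)
    # Hypothetical run-to-run spread of the RL policy's action at one state.
    u_rl = [rng.gauss(1.0, 2.0) for _ in range(n_runs)]
    u_mixed = [(u + lam * u_prior) / (1.0 + lam) for u in u_rl]
    return statistics.pvariance(u_mixed) / statistics.pvariance(u_rl)
```

Because the prior is deterministic, the mixed action is an affine function of the RL action, so the ratio equals 1/(1+λ)² regardless of the sampled values; for example, λ = 3 gives a sixteen-fold reduction.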
Intuition: Using Figure 2, we provide some intuition for the control regularization discussed above. Note the following:

The explorable region of the state space is denoted by the set $S_{st}$, which grows as $\lambda$ decreases and vice versa. This illustrates the constrained policy search interpretation of regularization in the state space.

The difference between the control prior trajectory and optimal trajectory (i.e. $D_{sub}$) may bias the final policy depending on the explorable region $S_{st}$ (i.e. $\lambda$). Fig. 2 shows this difference, and its implications, in state space.

If the optimal trajectory is within the explorable region, then we can learn the corresponding optimal policy – otherwise the policy will remain suboptimal.
Points 1 and 3 will be formally addressed in Section 5.
A remaining challenge is automatically tuning $\lambda$, especially as we acquire more training data. While setting a fixed $\lambda$ can perform well, intuitively, $\lambda$ should be large when the RL controller is highly uncertain, and it should decrease as we become more confident in our learned controller.
Consider the multiple model adaptive control (MMAC) framework, where a set of controllers (each based on a different underlying model) are generated. A meta-controller computes the overall controller by selecting the weighting for different candidate controllers, based on how close the underlying system model for each candidate controller is to the “true” model (Kuipers & Ioannou, 2010). Inspired by this approach, we should weight the RL controller proportional to our confidence in its model. Our confidence should be state-dependent (i.e. low confidence in areas of the state space where little data has been collected). However, since the RL controller does not utilize a dynamical system model, we propose measuring confidence in the RL controller via the magnitude of the temporal difference (TD) error,
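$$|\delta^{\pi}(s_t)| = \big| r_{t+1} + \gamma Q^{\pi}(s_{t+1}, a_{t+1}) - Q^{\pi}(s_t, a_t) \big|, \tag{11}$$
where $a_t \sim \pi(a|s_t)$ and $a_{t+1} \sim \pi(a|s_{t+1})$. This TD error measures how poorly the RL algorithm predicts the value of subsequent actions from a given state. A high TD-error implies that the estimate of the action-value function at a given state is poor, so we should rely more heavily on the control prior (a high $\lambda$ value). In order to scale the TD-error to a value in the interval $\lambda \in [0, \lambda_{max}]$, we take the negative exponential of the TD-error, computed at run-time,
$$\lambda(s_t) = \lambda_{max} \big( 1 - e^{-C |\delta(s_{t-1})|} \big). \tag{12}$$
The parameters $C$ and $\lambda_{max}$ are tuning parameters of the adaptive weighting strategy. Note that Equation (12) uses $\delta(s_{t-1})$ rather than $\delta(s_t)$, because computing $\delta(s_t)$ requires measurement of state $s_{t+1}$. Thus we rely on the reasonable assumption that $\delta(s_t) \approx \delta(s_{t-1})$, since $s_t$ should be very close to $s_{t-1}$ in practice.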
Equation (12) yields a low value of $\lambda$ if the RL action-value function predictions are accurate. This measure is chosen because the (explicit) underlying model of the RL controller is the value function (rather than a dynamical system model). Our experiments show that this adaptive scheme based on the TD error allows better tuning of the variance and performance of the policy.
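The adaptive weighting can be sketched as two small functions, assuming the weight takes the saturating form λ_max · (1 − exp(−C·|δ|)) applied to the TD error from the previous step. The function and argument names are illustrative, and the critic values are passed in as plain numbers rather than computed by a learned Q-network.

```python
import math


def td_error(reward, q_next, q_current, gamma):
    """One-step TD error: r + gamma * Q(s', a') - Q(s, a)."""
    return reward + gamma * q_next - q_current


def adaptive_lambda(delta_prev, lam_max, c):
    """Map the previous step's TD error to a weight in [0, lam_max).

    A large |delta_prev| (poor value prediction) pushes the weight toward
    lam_max, i.e. toward the control prior; an accurate critic gives ~0.
    """
    return lam_max * (1.0 - math.exp(-c * abs(delta_prev)))
```

In a control loop, `delta_prev` would be the TD error computed once the state following the previous step has been observed, matching the one-step lag described above.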
In many controls applications, it is crucial to ensure dynamic stability, not just high rewards, during learning. When a (crude) dynamical system model is available, we can utilize classic controller synthesis tools (e.g. LQR, PID, etc.) to obtain a stable control prior in a region of the state space. In this section, we utilize a well-established tool from robust control theory ($\mathcal{H}^\infty$ control) to analyze system stability under the mixed policy (5), and prove stability guarantees throughout learning when using a robust control prior.
Our work is built on the idea that the control prior should maximize robustness to disturbances and model uncertainty, so that we can treat the RL control, $u_{\theta_k}$, as a performance-maximizing “disturbance” to the control prior, $u_{prior}$. The mixed policy then takes advantage of the stability properties of the robust control prior, and the performance optimization properties of the RL algorithm. To obtain a robust control prior, we utilize concepts from $\mathcal{H}^\infty$ control (Doyle, 1996).
Consider the nonlinear dynamical system (1), and let us linearize the known part of the model $f_c^{\text{known}}(s, a)$ around a desired equilibrium point to obtain the following,
$$\dot{s} = A s + B_1 w + B_2 a, \qquad z = C_1 s + D_{11} w + D_{12} a, \tag{13}$$
where $w$ is the disturbance vector, and $z$ is the controlled output. Note that we analyze the continuous-time dynamics rather than discrete-time, since all mechanical systems have continuous time dynamics that can be discovered through analysis of the system Lagrangian. However, similar analysis can be done for discrete-time dynamics. We make the following standard assumption; conditions for its satisfaction can be found in (Doyle et al., 1989),
Assumption 1.
An $\mathcal{H}^\infty$ controller exists for linear system (13) that stabilizes the system in a region of the state space.
Stability here means that system trajectories are bounded around the origin/setpoint. We can then synthesize an $\mathcal{H}^\infty$ controller, $u^{\mathcal{H}^\infty}(s) = -Ks$, using established techniques described in (Doyle et al., 1989). The resulting controller is robust, with worst-case disturbances attenuated by a factor $\zeta_k$ before entering the output, where $\zeta_k < 1$ is a parameter returned by the synthesis algorithm. See Appendix F for further details on $\mathcal{H}^\infty$ control and its robustness properties.
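To give a concrete feel for Riccati-based synthesis of a linear state-feedback prior, the sketch below computes an LQR gain for a double integrator. This deliberately swaps in LQR (mentioned earlier as a classical synthesis tool) for the robust synthesis used in the paper: it solves the standard LQR Riccati equation, not the robust one, and provides no disturbance-attenuation guarantee. The system matrices, cost weights, and helper name `lqr_gain` are assumptions for illustration.

```python
import numpy as np


def lqr_gain(A, B, Q, R):
    """Solve A'P + PA - P B R^-1 B' P + Q = 0 via the stable invariant
    subspace of the Hamiltonian matrix, then return K = R^-1 B' P."""
    n = A.shape[0]
    G = B @ np.linalg.solve(R, B.T)
    H = np.block([[A, -G], [-Q, -A.T]])
    eigvals, eigvecs = np.linalg.eig(H)
    stable = eigvecs[:, eigvals.real < 0]  # n stable directions
    X1, X2 = stable[:n, :], stable[n:, :]
    P = np.real(X2 @ np.linalg.inv(X1))
    K = np.linalg.solve(R, B.T @ P)
    return K, P


# Double integrator: state = (position, velocity), input = acceleration.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
K, P = lqr_gain(A, B, np.eye(2), np.array([[1.0]]))
closed_loop_eigs = np.linalg.eigvals(A - B @ K)
```

The resulting feedback `u = -K s` plays the same structural role as a linear control prior; all closed-loop eigenvalues have negative real part, so the linear model is stabilized.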
Having synthesized a robust $\mathcal{H}^\infty$ controller for the linear system model (13), we are interested in how those robustness properties (e.g. disturbance attenuation by $\zeta_k$) influence the nonlinear system (1) controlled by the mixed policy (5). We rewrite the system dynamics (1) in terms of the linearization (13) plus a disturbance term as follows,
$$\dot{s} = f_c(s, a) = A s + B_2 a + d(s, a), \tag{14}$$
where $d(s, a)$ gathers together all dynamic uncertainties and nonlinearities. To keep this small, we could use feedback linearization based on the nominal nonlinear model (1).
We now analyze stability of the nonlinear system (14) under the mixed policy (5) using Lyapunov analysis (Khalil, 2000). Consider the Lyapunov function $V(s) = s^T P s$, where $P$ is obtained when synthesizing the $\mathcal{H}^\infty$ controller (see Appendix F). If we can define a closed region, $S_{st}$, around the origin such that $\dot{V}(s) < 0$ outside that region, then by standard Lyapunov analysis, $S_{st}$ is forward invariant and asymptotically stable (note $\dot{V}(s)$ is the time-derivative of the Lyapunov function). Since the $\mathcal{H}^\infty$ control law satisfies an Algebraic Riccati Equation, we obtain the following relation,
Lemma 2.
For any state $s$, satisfaction of the condition,
$$2 s^T P \Big( d(s, a) + \frac{1}{1+\lambda} B_2 u_e \Big) < s^T \Big( C_1^T C_1 + \frac{1}{\zeta_k^2} P B_1 B_1^T P \Big) s,$$
implies that $\dot{V}(s) < 0$.
This lemma is proved in Appendix C. Note that $u_e = u_{\theta_k} - u^{\mathcal{H}^\infty}$ denotes the difference between the RL controller and control prior, and $(A, B_1, B_2, C_1)$ come from (13). Let us bound the RL control output such that $\|u_e\|_2 \le C_\pi$, and define the set $\mathcal{C} = \{ s \in S : \|u_e\|_2 \le C_\pi \}$. We also bound the “disturbance” $\|d(s, a)\|_2 \le C_D$, for all $s \in \mathcal{C}$, and define the minimum singular value $\sigma_m(\zeta_k) = \sigma_{min}\big( C_1^T C_1 + \frac{1}{\zeta_k^2} P B_1 B_1^T P \big)$, which reflects the robustness of the control prior (i.e. larger $\sigma_m$ implies greater robustness). Then using Lemma 2 and Lyapunov analysis tools, we can derive a conservative set that is guaranteed asymptotically stable and forward invariant under the mixed policy, as described in the following theorem (proof in Appendix D).
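Theorem 2.
Assume a stabilizing $\mathcal{H}^\infty$ control prior within the set $\mathcal{C}$ for the dynamical system (14). Then asymptotic stability and forward invariance of the set $S_{st} \subseteq \mathcal{C}$,
$$S_{st} : \Big\{ s \in \mathbb{R}^n : \|s\|_2 \le \frac{1}{\sigma_m(\zeta_k)} \Big( 2 \|P\|_2 C_D + \frac{2}{1+\lambda} \|P B_2\|_2 C_\pi \Big), \ s \in \mathcal{C} \Big\}, \tag{15}$$
is guaranteed under the mixed policy (5) for all $s \in \mathcal{C}$. The set $S_{st}$ contracts as we (a) increase robustness of the control prior (increase $\sigma_m(\zeta_k)$), (b) decrease our dynamic uncertainty/nonlinearity $C_D$, or (c) increase weighting $\lambda$ on the control prior.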
Put simply, Theorem 2 says that all states in $\mathcal{C}$ will converge to (and remain within) set $S_{st}$ under the mixed policy (5). Therefore, the stability guarantee is stronger if $S_{st}$ has smaller cardinality. The set $S_{st}$ is drawn pictorially in Fig. 2, and essentially dictates the explorable region. Note that $S_{st}$ is not the region of attraction.
Theorem 2 highlights the trade-off between the robustness parameter, $\sigma_m(\zeta_k)$, of the control prior, the nonlinear uncertainty in the dynamics $C_D$, and the utilization of the learned controller, $\lambda$. If we have a more robust control prior (higher $\sigma_m(\zeta_k)$) or better knowledge of the dynamics (smaller $C_D$), we can heavily weight the learned controller (lower $\lambda$) during the learning process while still guaranteeing stability.
While shrinking the set $S_{st}$ and achieving asymptotic stability along a trajectory or equilibrium point may seem desirable, Fig. 2 illustrates why this is not necessarily the case in an RL context. The optimal trajectory for a task typically deviates from the nominal trajectory (i.e. the control-theoretic trajectory), as shown in Fig. 2, where the set $S_{st}$ illustrates the explorable region under regularization. Fig. 2(a) shows that we do not want strict stability of the nominal trajectory, and instead would like limited flexibility (a sufficiently large $S_{st}$) to explore. By increasing the weighting on the learned policy (decreasing $\lambda$), we expand the set $S_{st}$ and allow for greater exploration around the nominal trajectory (at the cost of stability) as seen in Fig. 2(b).
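The qualitative effect of the weighting on the reachable region can be illustrated with a toy simulation. Everything here is an assumption made for the example: a scalar unstable system, a proportional gain standing in for a robust control prior, a constant bounded "RL" action acting as a worst-case push, and forward-Euler integration. It illustrates the trend only and does not compute the bound of Theorem 2.

```python
def max_deviation(lam, steps=2000, dt=0.01):
    """Largest |s| reached from s = 0 under the mixed policy on a toy system."""
    a_sys, k_prior = 0.5, 4.0  # open loop is unstable; the prior stabilizes it
    u_rl = 1.0                 # bounded RL action, held at its worst case
    s, worst = 0.0, 0.0
    for _ in range(steps):
        u_prior = -k_prior * s
        u = (u_rl + lam * u_prior) / (1.0 + lam)  # mixed policy
        s += dt * (a_sys * s + u)                 # forward-Euler step
        worst = max(worst, abs(s))
    return worst


# Larger weight on the prior gives a smaller region that the state settles into.
deviations = {lam: max_deviation(lam) for lam in (1.0, 4.0, 16.0)}
```

In this toy setting the state settles at 1/3 for λ = 1 and moves progressively closer to the origin as λ grows, mirroring the contraction of the stable set with heavier weighting on the prior.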
We apply the CORE-RL algorithm to three problems: (1) cart-pole stabilization, (2) car-following control with experimental data, and (3) racecar driving with the TORCS simulator. We show results using DDPG, PPO, or TRPO (Lillicrap et al., 2016; Schulman et al., 2017; 2015) as the policy gradient RL algorithm (PPO and TRPO results are in Appendix G), though any similar RL algorithm could be used. All code can be found at https://github.com/rcheng805/CORERL.
Note that our results focus on reward rather than bias. Bias (as defined in Section 4.2) assumes convergence to a (locally) optimal policy, and does not include many factors influencing performance (e.g. slow learning, failure to converge, etc.). In practice, deep RL algorithms often do not converge (or take very long to do so). Therefore, reward better demonstrates the influence of control regularization on performance, which is of greater practical interest.
We apply the CORE-RL algorithm to control of the cart-pole from the OpenAI gym environment (CartPole-v1). We modified the CartPole environment so that it takes a continuous input, rather than discrete input, and we utilize a reward function that encourages the cart-pole to maintain its position while keeping the pole upright. Further details on the environment and reward function are in Appendix E. To obtain a control prior, we assume a crude model (i.e. linearization of the nonlinear dynamics with error in the mass and length values), and from this we synthesize an $\mathcal{H}^\infty$ controller. Using this control prior, we run Algorithm 1 with several different regularization weights, $\lambda$. For each $\lambda$, we run CORE-RL 6 times with different random seeds.
Figure 4a plots reward improvement over the control prior, which shows that the regularized controllers perform much better than the baseline DDPG algorithm (in terms of variance, reward, and learning speed). We also see that intermediate values of $\lambda$ result in the best learning, demonstrating the importance of policy regularization.
Figure 4b better illustrates the performance-variance trade-off. For small $\lambda$, we see high variance and poor performance. With intermediate $\lambda$, we see higher performance and lower variance. As we further increase $\lambda$, variance continues to decrease, but the performance also decreases since policy exploration is heavily constrained. The adaptive mixing strategy performs very well, exhibiting low variance through learning, and converging on a high-performance policy.
While Lemma 1 proved that the mixed controller (6) has the same optimal solution as optimization problem (7), when we ran experiments directly using the loss in (7), we found that performance (i.e. reward) was worse than CORE-RL and still suffered high variance. In addition, learning with pre-training on the control prior likewise exhibited high variance and had worse performance than CORE-RL.
Importantly, according to Theorem 2, the system should maintain stability (i.e. remain within an invariant set around our desired equilibrium point) throughout the learning process, and the stable region shrinks as we increase $\lambda$. Our simulations exhibit exactly this property as seen in Figure 3, which shows the maximum deviation from the equilibrium point across all episodes. The system converges to a stability region throughout learning, and this region contracts as we increase $\lambda$. Therefore, regularization not only improves learning performance and decreases variance, but can capture stability guarantees from a robust control prior.
We next examine experimental data from a chain of 5 cars following each other on an 8-mile segment of a single-lane public road. We obtain position (via GPS), velocity, and acceleration data from each of the cars, and we control the acceleration/deceleration of one car in the chain. The goal is to learn an optimal controller for this car that maximizes fuel efficiency while avoiding collisions. The experimental setup and data collection process are described in (Ge et al., 2018). For the control prior, we utilize a bang-bang controller that (inefficiently) tries to maintain a large distance from the cars in front of and behind the controlled car. The reward function penalizes fuel consumption and collisions (or near-collisions). Specifics of the control prior, reward function, and experiments are in Appendix E.
For our experiments, we split the data into 10-second “episodes”, shuffle the episodes, and run CORE-RL six times with different random seeds (for several different $\lambda$).
Figure 4a shows again that the regularized controllers perform much better than the baseline DDPG algorithm for the car-following problem, and demonstrates that regularization leads to performance improvements over the control prior and gains in learning efficiency. Figure 4b reinforces that intermediate values of $\lambda$ exhibit optimal performance. Low values of $\lambda$ exhibit significant deterioration of performance, because the car must learn (with few samples) in a much larger policy search space; the RL algorithm does not have enough data to converge on an optimal policy. High values of $\lambda$ also exhibit lower performance because they heavily constrain learning. Intermediate $\lambda$ allow for the best learning using the limited number of experiments.
Using an adaptive strategy for setting $\lambda$ (or alternatively tuning to an optimal $\lambda$), we obtain high-performance policies that improve upon both the control prior and RL baseline controller. The variance is also low, so that the learning process reliably learns a good controller.
Finally, we run CORE-RL to generate controllers for cars in The Open Racing Car Simulator (TORCS) (Wymann et al., 2014). The simulator provides readings from sensors, which describe the environment state. The sensors provide information such as car speed, distance from track center, and wheel spin. The controller decides values for the acceleration, steering, and braking actions taken by the car.
To obtain a control prior for this environment, we use a simple PID-like linearized controller for each action, similar to the one described in (Verma et al., 2018). These types of controllers are known to have suboptimal performance, while still being able to drive the car around a lap. We perform all our experiments on the CG-Speedway track in TORCS. For each $\lambda$, we run the algorithm multiple times with different initializations and random seeds.
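For readers unfamiliar with this style of controller, a generic discrete-time PID loop is sketched below. It is a textbook form with hypothetical gains and error signal, and is not the specific controller of Verma et al. (2018) or the one used in these experiments.

```python
class PID:
    """Minimal discrete-time PID controller for a single action channel."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None  # no derivative term on the first step

    def step(self, error):
        """Return the control output for the current tracking error."""
        self.integral += error * self.dt
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```

One such controller per action (for example, steering driven by the negative distance from the track center, a hypothetical choice of error signal) would yield a simple prior of the kind described above.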
For TORCS, we plot lap-time improvement over the control prior so that values above zero denote improved performance over the prior. Laps are subject to a time-out, and the objective is to minimize lap-time by completing a lap as fast as possible. Due to the sparsity of the lap-time signal, we use a pseudo-reward function during training that provides a heuristic estimate of the agent’s performance at each time step during the simulation (details in Appendix E).
Once more, Figure 4a shows that regularized controllers perform better on average than the baseline DDPG algorithm, and that we improve upon the control prior with proper regularization. Figure 4b shows that intermediate values of $\lambda$ exhibit good performance, but using the adaptive strategy for setting $\lambda$ in the TORCS setting gives us the highest-performance policy, which significantly beats both the control prior and DDPG baseline. Also, the variance with the adaptive strategy is significantly lower than for the DDPG baseline, which again shows that the learning process reliably learns a good controller.
Note that we have only shown results for DDPG. Results for PPO and TRPO are similar for CartPole and car-following (different for TORCS), and can be found in Appendix G.
A significant criticism of RL is that random seeds can produce vastly different learning behaviors, limiting application of RL to real systems. This paper shows, through theoretical results and experimental validation, that our method of control regularization substantially alleviates this problem, enabling significant variance reduction and performance improvements in RL. This regularization can be interpreted as constraining the explored action space during learning.
Our method also allows us to capture dynamic stability properties of a robust control prior to guarantee stability during learning, and has the added benefit that it can easily incorporate different RL algorithms (e.g. PPO, DDPG, etc.). The main limitation of our approach is that it relies on a reasonable control prior, and it remains to be analyzed how poor a control prior can be used while still aiding learning.
Acknowledgements. This work was funded in part by Raytheon under the Learning to Fly program, and by DARPA under the Physics-Infused AI Program.
References
 Achiam et al. (2017) Achiam, J., Held, D., Tamar, A., and Abbeel, P. Constrained Policy Optimization. In International Conference on Machine Learning (ICML), 2017.
 Arulkumaran et al. (2017) Arulkumaran, K., Deisenroth, M. P., Brundage, M., and Bharath, A. A. Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 2017.
 Baxter & Bartlett (2000) Baxter, J. and Bartlett, P. Reinforcement learning in POMDP’s via direct gradient ascent. International Conference on Machine Learning, 2000.
 Benjamin et al. (2018) Benjamin, A. S., Rolnick, D., and Kording, K. Measuring and Regularizing Networks in Function Space. arXiv:1805.08289, 2018.
 Berkenkamp et al. (2017) Berkenkamp, F., Turchetta, M., Schoellig, A. P., and Krause, A. Safe Model-based Reinforcement Learning with Stability Guarantees. In Neural Information Processing Systems (NeurIPS), 2017.
 Chow et al. (2018) Chow, Y., Nachum, O., Duenez-Guzman, E., and Ghavamzadeh, M. A Lyapunov-based Approach to Safe Reinforcement Learning. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
 Doyle (1996) Doyle, J. Robust and Optimal Control. In Conference on Decision and Control, 1996.
 Doyle et al. (1989) Doyle, J., Glover, K., Khargonekar, P., and Francis, B. State-space solutions to standard $H_2$ and $H_\infty$ control problems. IEEE Transactions on Automatic Control, 1989. ISSN 0018-9286. doi: 10.1109/9.29425.
 Duan et al. (2016) Duan, Y., Chen, X., Schulman, J., and Abbeel, P. Benchmarking Deep Reinforcement Learning for Continuous Control. In International Conference on Machine Learning (ICML), 2016.
 Farshidian et al. (2014) Farshidian, F., Neunert, M., and Buchli, J. Learning of closed-loop motion control. In IEEE International Conference on Intelligent Robots and Systems, 2014.
 García & Fernández (2015) García, J. and Fernández, F. A Comprehensive Survey on Safe Reinforcement Learning. JMLR, 2015.
 Ge et al. (2018) Ge, J. I., Avedisov, S. S., He, C. R., Qin, W. B., Sadeghpour, M., and Orosz, G. Experimental validation of connected automated vehicle design among human-driven vehicles. Transportation Research Part C: Emerging Technologies, 2018.
 Ghosh et al. (2018) Ghosh, D., Singh, A., Rajeswaran, A., Kumar, V., and Levine, S. Divide-and-conquer reinforcement learning. In Neural Information Processing Systems (NeurIPS), volume abs/1711.09874, 2018.
 Greensmith et al. (2004) Greensmith, E., Bartlett, P., and Baxter, J. Variance reduction techniques for gradient estimates in reinforcement learning. JMLR, 2004.
 Henderson et al. (2018) Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. Deep Reinforcement Learning that Matters. In AAAI Conference on Artificial Intelligence, 2018.
 Islam et al. (2017) Islam, R., Henderson, P., Gomrokchi, M., and Precup, D. Reproducibility of Benchmarked Deep Reinforcement Learning of Tasks for Continuous Control. In Reproducibility in Machine Learning Workshop, 2017.
 Johannink et al. (2018) Johannink, T., Bahl, S., Nair, A., Luo, J., Kumar, A., Loskyll, M., Aparicio Ojea, J., Solowjow, E., and Levine, S. Residual Reinforcement Learning for Robot Control. arXiv e-prints, art. arXiv:1812.03201, Dec 2018.
 Khalil (2000) Khalil, H. K. Nonlinear Systems (Third Edition). Prentice Hall, 2000.
 Kuipers & Ioannou (2010) Kuipers, M. and Ioannou, P. Multiple model adaptive control with mixing. IEEE Transactions on Automatic Control, 2010.
 Le et al. (2016) Le, H., Kang, A., Yue, Y., and Carr, P. Smooth Imitation Learning for Online Sequence Prediction. In International Conference on Machine Learning (ICML), 2016.
 Le et al. (2019) Le, H. M., Voloshin, C., and Yue, Y. Batch policy learning under constraints. In International Conference on Machine Learning, 2019.
 Lillicrap et al. (2016) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.
 Nagabandi et al. (2017) Nagabandi, A., Kahn, G., Fearing, R. S., and Levine, S. Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning. arXiv e-prints, art. arXiv:1708.02596, Aug 2017.
 Perkins & Barto (2003) Perkins, T. J. and Barto, A. G. Lyapunov design for safe reinforcement learning. Journal of Machine Learning Research, 2003.
 Recht (2019) Recht, B. A Tour of Reinforcement Learning: The View from Continuous Control. Annual Review of Control, Robotics, and Autonomous Systems, 2(1):253–279, 2019.
 Schulman et al. (2015) Schulman, J., Levine, S., Moritz, P., Jordan, M., and Abbeel, P. Trust Region Policy Optimization. In International Conference on Machine Learning (ICML), 2015.
 Schulman et al. (2016) Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-Dimensional Continuous Control Using Generalized Advantage Estimation. International Conference on Learning Representations, 2016.
 Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal Policy Optimization Algorithms. arXiv e-prints, art. arXiv:1707.06347, Jul 2017.
 Silver et al. (2014) Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic Policy Gradient Algorithms. Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014.
 Sutton et al. (1999) Sutton, R., McAllester, D., Singh, S. P., and Mansour, Y. Policy Gradient Methods for Reinforcement Learning with Function Approximation. Advances in Neural Information Processing Systems, 1999.
 Thodoroff et al. (2018) Thodoroff, P., Durand, A., Pineau, J., and Precup, D. Temporal Regularization for Markov Decision Process. In Advances in Neural Information Processing Systems, 2018.
 Verma et al. (2018) Verma, A., Murali, V., Singh, R., Kohli, P., and Chaudhuri, S. Programmatically interpretable reinforcement learning. In International Conference on Machine Learning (ICML), 2018.
 Weaver & Tao (2001) Weaver, L. and Tao, N. The Optimal Reward Baseline for Gradient-Based Reinforcement Learning. In Uncertainty in Artificial Intelligence (UAI), 2001.
 Williams (1992) Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
 Wu et al. (2018) Wu, C., Rajeswaran, A., Duan, Y., Kumar, V., Bayen, A. M., Kakade, S., Mordatch, I., and Abbeel, P. Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines. In International Conference on Learning Representations, 2018.
 Wymann et al. (2014) Wymann, B., Espié, E., Guionneau, C., Dimitrakakis, C., Coulom, R., and Sumner, A. TORCS, The Open Racing Car Simulator. http://www.torcs.org, 2014.
 Zhao et al. (2012) Zhao, T., Hachiya, H., Niu, G., and Sugiyama, M. Analysis and improvement of policy gradient estimation. Neural Networks, 2012.
 Zhao et al. (2015) Zhao, T., Niu, G., Xie, N., Yang, J., and Sugiyama, M. Regularized Policy Gradients: Direct Variance Reduction in Policy Gradient Estimation. Proceedings of the Asian Conference on Machine Learning, 2015.
Appendix: Control Regularization for Reduced Variance Reinforcement Learning
Lemma 1.
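The policy $u_k(s)$ in Equation (6) is the solution to the following regularized optimization problem,
(16) $u_k(s) = \arg\min_{u} \|u(s) - u_{\theta_k}\|_{\Sigma} + \lambda \|u(s) - u_{\mathrm{prior}}(s)\|_{\Sigma}, \quad \forall s \in S,$
which can be equivalently expressed as the constrained optimization problem:
(17) $u_k(s) = \arg\min_{u} \|u(s) - u_{\theta_k}\|_{\Sigma} \quad \text{s.t.} \quad \|u(s) - u_{\mathrm{prior}}(s)\|_{\Sigma} \leq \tilde{\mu}(\lambda), \quad \forall s \in S,$
where $\tilde{\mu}$ constrains the policy search. Assuming convergence of the RL algorithm, $u_k(s)$ converges to the solution,
(18) $u_k(s) = \arg\min_{u} \big\|u(s) - \arg\max_{u_\theta} E_{\tau \sim u}[r(s,a)]\big\|_{\Sigma} + \lambda \|u(s) - u_{\mathrm{prior}}(s)\|_{\Sigma}, \quad \forall s \in S, \ \text{as } k \to \infty.$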
Proof.
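Equivalence between (6) and (16): Let $\pi_{\theta_k}(a|s)$ be a Gaussian distributed policy with mean $u_{\theta_k}(s)$: $\pi_{\theta_k}(a|s) \sim \mathcal{N}(u_{\theta_k}(s), \Sigma)$. Thus, $\Sigma$ describes exploration noise. From the mixed policy definition (6), we can obtain the following Gaussian distribution describing the mixed policy:
(19) $\pi_k(a|s) = \mathcal{N}\Big(\frac{1}{1+\lambda} u_{\theta_k} + \frac{\lambda}{1+\lambda} u_{\mathrm{prior}}, \ \frac{\Sigma}{1+\lambda}\Big) = \frac{1}{c_N} \, \mathcal{N}(u_{\theta_k}, \Sigma) \cdot \mathcal{N}\Big(u_{\mathrm{prior}}, \frac{\Sigma}{\lambda}\Big),$
where $c_N$ is a normalizing constant and the second equality follows based on the properties of products of Gaussians. Let us define $\|u_1 - u_2\|_{\Sigma} = (u_1 - u_2)^T \Sigma^{-1} (u_1 - u_2)$, and let $|\Sigma|$ be the determinant of $\Sigma$. Then, distribution (19) can be rewritten as the product,
(20) $P(A = a) = \frac{1}{c_N \sqrt{|2\pi\Sigma| \, |2\pi\Sigma/\lambda|}} \exp\Big(-\frac{1}{2}\|a - u_{\theta_k}\|_{\Sigma}\Big) \times \exp\Big(-\frac{1}{2}\|a - u_{\mathrm{prior}}\|_{\Sigma/\lambda}\Big),$
where $A$ is a random variable with $P(A = a)$ representing the probability of taking action $a$ from state $s$ under policy (6). Further simplifying this PDF, we obtain:
(21) $P(A = a) = \frac{1}{c_N \sqrt{|2\pi\Sigma| \, |2\pi\Sigma/\lambda|}} \exp\Big(-\frac{1}{2}\|a - u_{\theta_k}\|_{\Sigma} - \frac{\lambda}{2}\|a - u_{\mathrm{prior}}\|_{\Sigma}\Big).$
Since the probability is maximized when the argument of the exponential in Equation (21) is minimized, the maximum probability policy can be expressed as the solution to the following regularized optimization problem,
(22) $u_k(s) = \arg\min_{u} \|u(s) - u_{\theta_k}\|_{\Sigma} + \lambda \|u(s) - u_{\mathrm{prior}}(s)\|_{\Sigma}, \quad \forall s \in S.$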
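As a numerical sanity check of this step, the following minimal sketch (not taken from the paper) assumes that the mixed action has the weighted form $(u_\theta + \lambda u_{\mathrm{prior}})/(1+\lambda)$ and that the norm is the quadratic form $x^T \Sigma^{-1} x$. All function names and numerical values are hypothetical. It checks that this weighted action matches the mean of the product of the two Gaussians and that random perturbations do not lower the regularized cost.

```python
import numpy as np

def mixed_action(u_rl, u_prior, lam):
    """Weighted combination of the RL action and the prior action (assumed form)."""
    return (u_rl + lam * u_prior) / (1.0 + lam)

def regularized_cost(u, u_rl, u_prior, lam, sigma_inv):
    """||u - u_rl||_Sigma + lam * ||u - u_prior||_Sigma, with ||x||_Sigma = x^T Sigma^-1 x."""
    d_rl, d_prior = u - u_rl, u - u_prior
    return d_rl @ sigma_inv @ d_rl + lam * (d_prior @ sigma_inv @ d_prior)

def gaussian_product_mean(mu1, cov1, mu2, cov2):
    """Mean of the renormalized product of two Gaussian densities."""
    p1, p2 = np.linalg.inv(cov1), np.linalg.inv(cov2)
    return np.linalg.solve(p1 + p2, p1 @ mu1 + p2 @ mu2)

# Illustrative values only (assumptions, not from the experiments).
rng = np.random.default_rng(0)
lam = 3.0
sigma = np.array([[0.5, 0.1], [0.1, 0.2]])
sigma_inv = np.linalg.inv(sigma)
u_rl = np.array([1.0, -2.0])
u_prior = np.array([0.0, 0.5])

u_mix = mixed_action(u_rl, u_prior, lam)
u_prod = gaussian_product_mean(u_rl, sigma, u_prior, sigma / lam)
best = regularized_cost(u_mix, u_rl, u_prior, lam, sigma_inv)

# Cost at randomly perturbed actions; none should fall below the cost at u_mix.
perturbed = [
    regularized_cost(u_mix + 0.1 * rng.standard_normal(2), u_rl, u_prior, lam, sigma_inv)
    for _ in range(100)
]
print("mixed action:", u_mix, "product-of-Gaussians mean:", u_prod)
print("minimum perturbed cost minus cost at u_mix:", min(perturbed) - best)
```

The check is local and numerical only; it does not replace the argument in the proof.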
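Convergence of (16) to (18): Note that $\pi_{\theta_k}$ and $u_{\theta_k}$ are parameterized by the same $\theta_k$ and represent the iterative solution to the optimization problem $\arg\max_{u_\theta} E_{\tau \sim u}[r(s,a)]$ at the latest policy iteration. Thus, assuming convergence of the RL algorithm, we can rewrite problem (22) as follows,
(23) $u_k(s) = \arg\min_{u} \big\|u(s) - \arg\max_{u_\theta} E_{\tau \sim u}[r(s,a)]\big\|_{\Sigma} + \lambda \|u(s) - u_{\mathrm{prior}}(s)\|_{\Sigma}, \quad \forall s \in S, \ \text{as } k \to \infty.$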
Equivalence between (16) and (17) : Finally, we want to show that the solutions for regularized problem (16) and the constrained optimization problem (17) are equivalent.
First, note that Problem (16) is the dual to Problem (17), where $\lambda$ is the dual variable. Clearly problem (16) is convex in $u$. Furthermore, Slater's condition holds, since there is always a feasible point (e.g. trivially $u(s) = u_{\mathrm{prior}}(s)$). Therefore strong duality holds. This means that $\exists \lambda \geq 0$ such that the solution to Problem (17) must also be optimal for Problem (16).
To show the other direction, fix and define and for all . Let us denote as the optimal solution for Problem (16) with (note we can choose ). However supposed is not optimal for Problem (17). Then there exists such that and . Denote the difference in the two rewards by . Thus the following relations hold,
(24) 
This leads to the conditional statement,
(25) 
For fixed , there always exists such that the condition holds. However, this leads to a contradiction, since we assumed that is optimal for Problem (16). We can conclude then that such that the solution to Problem (16) must be optimal for Problem (17). Therefore, Problems (16) and (17) have equivalent solutions.
∎
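The equivalence between the penalized and the constrained forms can be illustrated in one dimension. The sketch below is not from the paper: it assumes scalar actions and squared-distance costs, and the names `penalized_solution` and `constrained_solution` are hypothetical. For each penalty weight it sets the constraint level to the prior distance attained by the penalized solution, then checks that the constrained problem returns the same action.

```python
def penalized_solution(a, b, lam):
    """argmin_u (u - a)^2 + lam * (u - b)^2, available in closed form."""
    return (a + lam * b) / (1.0 + lam)

def constrained_solution(a, b, mu):
    """argmin_u (u - a)^2 subject to (u - b)^2 <= mu: project a onto the feasible interval."""
    radius = mu ** 0.5
    return min(max(a, b - radius), b + radius)

# a and b stand in for the RL action and the control-prior action (illustrative values).
a, b = 2.0, -1.0
results = []
for lam in (0.0, 0.5, 1.0, 4.0, 20.0):
    u_pen = penalized_solution(a, b, lam)
    mu = (u_pen - b) ** 2  # constraint level matched to this penalty weight
    u_con = constrained_solution(a, b, mu)
    results.append((lam, u_pen, u_con))
    print(f"lam={lam:5.1f}  penalized={u_pen:.6f}  constrained={u_con:.6f}")
```

A larger penalty weight corresponds to a tighter constraint radius around the prior action, which is the sense in which the regularization constrains the policy search.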
Theorem 1.
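Consider the mixed policy (5) where $\pi_{\theta_k}$ is an RL controller learned through policy gradients, and denote the (potentially local) optimal policy to be $\pi_{\mathrm{opt}}$. The variance (4) of the mixed policy arising from the policy gradient is reduced by a factor $\big(\frac{1}{1+\lambda}\big)^2$ when compared to the RL policy with no control prior.
However, the mixed policy may introduce bias proportional to the sub-optimality of the control prior. More formally, if we let $D_{\mathrm{sub}} = D_{TV}(\pi_{\mathrm{opt}}, \pi_{\mathrm{prior}})$, then the policy bias (i.e. $D_{TV}(\pi_k, \pi_{\mathrm{opt}})$) is bounded as follows:
(26) $D_{TV}(\pi_k, \pi_{\mathrm{opt}}) \geq D_{\mathrm{sub}} - \frac{1}{1+\lambda} D_{TV}(\pi_{\theta_k}, \pi_{\mathrm{prior}}), \qquad D_{TV}(\pi_k, \pi_{\mathrm{opt}}) \leq \frac{\lambda}{1+\lambda} D_{\mathrm{sub}} \ \text{ as } k \to \infty,$
where $D_{TV}(\cdot, \cdot)$ represents the total variation distance between two probability measures (i.e. policies). Thus, if $D_{\mathrm{sub}}$ and $\lambda$ are large, this will introduce policy bias.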
Proof.
Let us define the stochastic action (i.e. random variable) . Then recall from Equation (4) that assuming a fixed, Gaussian distributed policy, ,
(27) 
Based on the mixed policy definition (5), we obtain the following relation between the variance of $\pi_k$ and $\pi_{\theta_k}$ (the mixed policy and RL policy, respectively),
(28) 
Compared to the variance (4), we achieve a variance reduction when utilizing the same learning rate $\alpha$. Taking the same policy gradient as in (4), the variance is reduced by a factor of $\big(\frac{1}{1+\lambda}\big)^2$ by introducing policy mixing.
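The scaling of the variance under mixing can be checked with a small Monte Carlo sketch. This is an illustration rather than the paper's experiment: it assumes scalar actions, models the variability of the RL action as Gaussian noise, treats the control prior as deterministic, and assumes the mixing weights $\frac{1}{1+\lambda}$ and $\frac{\lambda}{1+\lambda}$. All numerical values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

lam = 4.0         # regularization weight (illustrative)
sigma = 0.7       # standard deviation of the RL action (illustrative)
u_prior = 1.5     # deterministic control-prior action (illustrative)
u_rl_mean = -0.3  # mean RL action (illustrative)

# Samples of the RL action and the corresponding mixed action.
a_rl = u_rl_mean + sigma * rng.standard_normal(200_000)
a_mix = a_rl / (1.0 + lam) + lam / (1.0 + lam) * u_prior

ratio = a_mix.var() / a_rl.var()
expected = 1.0 / (1.0 + lam) ** 2
print(f"empirical variance ratio: {ratio:.6f}, 1/(1+lam)^2: {expected:.6f}")
```

Because the mixed action is an affine function of the RL action, the ratio matches $\frac{1}{(1+\lambda)^2}$ up to floating-point error, while the mean shifts toward the prior action, which is the source of the bias discussed next.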
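Lower variance comes at a price: the potential introduction of bias into the policy. Let us define the policy bias as $D_{TV}(\pi_k, \pi_{\mathrm{opt}})$, and let us denote $D_{\mathrm{sub}} = D_{TV}(\pi_{\mathrm{opt}}, \pi_{\mathrm{prior}})$. Since the total variation distance, $D_{TV}$, is a metric, we can use the triangle inequality to obtain:
(29) $D_{TV}(\pi_k, \pi_{\mathrm{opt}}) \geq D_{TV}(\pi_{\mathrm{opt}}, \pi_{\mathrm{prior}}) - D_{TV}(\pi_k, \pi_{\mathrm{prior}}).$
We can further break down the term $D_{TV}(\pi_k, \pi_{\mathrm{prior}})$:
(30) $D_{TV}(\pi_k, \pi_{\mathrm{prior}}) = \sup_{(s,a)} \Big|\frac{1}{1+\lambda}\pi_{\theta_k} + \frac{\lambda}{1+\lambda}\pi_{\mathrm{prior}} - \pi_{\mathrm{prior}}\Big| = \frac{1}{1+\lambda} \sup_{(s,a)} |\pi_{\theta_k} - \pi_{\mathrm{prior}}| = \frac{1}{1+\lambda} D_{TV}(\pi_{\theta_k}, \pi_{\mathrm{prior}}).$
This holds for all $k$. From (29) and (30), we can obtain the lower bound in (26), $D_{TV}(\pi_k, \pi_{\mathrm{opt}}) \geq D_{\mathrm{sub}} - \frac{1}{1+\lambda} D_{TV}(\pi_{\theta_k}, \pi_{\mathrm{prior}})$.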
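To obtain the upper bound, let the policy gradient algorithm with no control prior achieve asymptotic convergence to the (locally) optimal policy (as proven for certain classes of function approximators in (Sutton et al., 1999)). Denote this policy as $\pi_{\mathrm{opt}}$, such that $\pi_{\theta_k} \to \pi_{\mathrm{opt}}$ as $k \to \infty$. In this case, we can derive the total variation distance between the mixed policy (5) and the optimal policy as follows,
(31) $D_{TV}(\pi_k, \pi_{\mathrm{opt}}) = \sup_{(s,a)} \Big|\pi_{\mathrm{opt}} - \frac{1}{1+\lambda}\pi_{\theta_k} - \frac{\lambda}{1+\lambda}\pi_{\mathrm{prior}}\Big| = \frac{\lambda}{1+\lambda} \sup_{(s,a)} |\pi_{\mathrm{opt}} - \pi_{\mathrm{prior}}| = \frac{\lambda}{1+\lambda} D_{TV}(\pi_{\mathrm{opt}}, \pi_{\mathrm{prior}}) \ \text{ as } k \to \infty.$
Note that this represents an upper bound on the bias, since it assumes that $\pi_{\theta_k}$ is uninfluenced by $\pi_{\mathrm{prior}}$ during learning. It shows that $\frac{1}{1+\lambda}\pi_{\mathrm{opt}} + \frac{\lambda}{1+\lambda}\pi_{\mathrm{prior}}$ is a feasible policy, but not necessarily optimal when accounting for regularization with $\pi_{\mathrm{prior}}$. Therefore, we can obtain the upper bound:
(32) $D_{TV}(\pi_k, \pi_{\mathrm{opt}}) \leq \frac{\lambda}{1+\lambda} D_{\mathrm{sub}} \ \text{ as } k \to \infty.$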
∎
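The two bias bounds can be illustrated on a toy example with discrete action distributions. The sketch below is not from the paper. It assumes the mixed policy is the convex combination with weights $\frac{1}{1+\lambda}$ and $\frac{\lambda}{1+\lambda}$, and it uses the half-$L_1$ form of the total variation distance, which scales under mixing in the same way as the supremum form. The three distributions are hypothetical.

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two discrete distributions (half the L1 distance)."""
    return 0.5 * np.abs(p - q).sum()

def mix(pi_rl, pi_prior, lam):
    """Convex combination of the RL policy and the prior policy (assumed mixing rule)."""
    return pi_rl / (1.0 + lam) + lam / (1.0 + lam) * pi_prior

# Hypothetical policies over three actions in a single state.
pi_opt = np.array([0.7, 0.2, 0.1])
pi_prior = np.array([0.2, 0.3, 0.5])
pi_rl = np.array([0.5, 0.4, 0.1])  # RL policy before convergence
lam = 3.0

d_sub = tv_distance(pi_opt, pi_prior)

# Lower bound, valid at any iteration.
bias = tv_distance(mix(pi_rl, pi_prior, lam), pi_opt)
lower = d_sub - tv_distance(pi_rl, pi_prior) / (1.0 + lam)

# Upper bound, evaluated for the case where the RL policy has converged to pi_opt.
bias_converged = tv_distance(mix(pi_opt, pi_prior, lam), pi_opt)
upper = lam / (1.0 + lam) * d_sub

print(f"bias={bias:.4f} >= lower={lower:.4f}")
print(f"converged bias={bias_converged:.4f}, upper={upper:.4f}")
```

In this toy case the converged bias meets the upper bound exactly, reflecting the assumption in the proof that the RL policy is uninfluenced by the prior during learning.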
Lemma 2.
For any state , satisfaction of the condition,