On-Policy Trust Region Policy Optimisation with Replay Buffers
Abstract
Building upon the recent success of deep reinforcement learning methods, we investigate the possibility of improving on-policy reinforcement learning by reusing the data from several consecutive policies. On-policy methods bring many benefits, such as the ability to evaluate each resulting policy. However, they usually discard all the information about the policies which existed before. In this work, we propose an adaptation of the replay buffer concept, borrowed from the off-policy learning setting, to create a method that combines the advantages of on- and off-policy learning. To achieve this, the proposed algorithm generalises the Q-, value and advantage functions for data from multiple policies. The method uses trust region optimisation, while avoiding some of the common problems of algorithms such as TRPO or ACKTR: it uses hyperparameters to replace the trust region selection heuristics, as well as a trainable covariance matrix instead of a fixed one. In many cases, the method not only improves the results compared to the state-of-the-art trust region on-policy learning algorithms such as PPO, ACKTR and TRPO, but also with respect to their off-policy counterpart DDPG.
Dmitry Kangin & Nicolas Pugeault 

Department of Computer Science, University of Exeter, UK 
{d.kangin, n.pugeault}@exeter.ac.uk 
1 Introduction
The past few years have been marked by active development of reinforcement learning methods. Although the mathematical foundations of reinforcement learning were known long before (Sutton & Barto, 1998), from 2013 onwards novel deep learning techniques have made it possible to solve vision-based discrete control tasks such as Atari 2600 games (Mnih et al., 2013) as well as continuous control problems (Lillicrap et al., 2015; Mnih et al., 2016). Many of the leading state-of-the-art reinforcement learning methods share the actor-critic architecture (Crites & Barto, 1995). Actor-critic methods separate the actor, providing a policy, and the critic, providing an approximation for the expected discounted cumulative reward or some derived quantities such as advantage functions (Baird III, 1993). However, despite improvements, state-of-the-art reinforcement learning still suffers from poor sample efficiency and extensive parameterisation. For most real-world applications, in contrast to simulations, there is a need to learn in real time and over a limited training period, while minimising any risk that would cause damage to the actor or the environment.
Reinforcement learning algorithms can be divided into two groups: on-policy and off-policy learning. On-policy approaches (e.g., SARSA (Rummery & Niranjan, 1994), ACKTR (Wu et al., 2017)) evaluate the target policy by assuming that future actions will be chosen according to it, hence the exploration strategy must be incorporated as a part of the policy. Off-policy methods (e.g., Q-learning (Watkins, 1989), DDPG (Lillicrap et al., 2015)) separate the exploration strategy, which modifies the policy to explore different states, from the target policy.
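The distinction can be illustrated with the one-step bootstrap targets of the two tabular examples cited above. The following is a minimal Python sketch rather than anything taken from the paper; the function names and the dictionary-based action-value table are hypothetical.

```python
def sarsa_target(reward, gamma, q_next, next_action):
    """On-policy target: bootstrap from the action the policy actually took."""
    return reward + gamma * q_next[next_action]


def q_learning_target(reward, gamma, q_next):
    """Off-policy target: bootstrap from the greedy action, whatever was taken."""
    return reward + gamma * max(q_next.values())


# Action values in the next state (illustrative numbers only).
q_next = {"left": 1.0, "right": 3.0}
on_policy = sarsa_target(reward=0.5, gamma=0.9, q_next=q_next, next_action="left")
off_policy = q_learning_target(reward=0.5, gamma=0.9, q_next=q_next)
```

The SARSA target depends on the exploratory action that was sampled, whereas the Q-learning target does not, which is why the latter can evaluate a target policy different from the one generating the data.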
Off-policy methods commonly use the concept of replay buffers to memorise the outcomes of previous policies and therefore exploit the information accumulated through the previous iterations (Lin, 1993). Mnih et al. (2013) combined this experience replay mechanism with Deep Q-Networks (DQN), demonstrating end-to-end learning on Atari 2600 games. One limitation of DQN is that it can only operate on discrete action spaces. Lillicrap et al. (2015) proposed an extension of DQN to handle continuous action spaces based on the Deep Deterministic Policy Gradient (DDPG). There, exponential smoothing of the target actor and critic weights was introduced to ensure stability of the rewards and critic predictions over subsequent iterations. In order to reduce the variance of policy gradients, Schulman et al. (2015b) proposed the Generalised Advantage Estimator. Mnih et al. (2016) combined advantage function learning with a parallelisation of exploration using differently trained actors in their Asynchronous Advantage Actor Critic model (A3C); however, Wang et al. (2016) demonstrated that such parallelisation may also have a negative impact on sample efficiency. Although some work has been performed on improving exploratory strategies for reinforcement learning (Hester et al., 2013), it still does not solve the fundamental restriction of being unable to evaluate the actual policy, nor does it remove the necessity to provide an exploratory strategy as a separate part of the method.
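The exponential smoothing of target weights mentioned above for DDPG can be sketched as follows. This is an illustrative Python fragment, assuming flat lists of weights; the name `soft_update` and the value of `tau` are hypothetical and not taken from the paper.

```python
def soft_update(target_weights, online_weights, tau):
    """Move each target weight a fraction tau towards the online weight."""
    return [(1.0 - tau) * t + tau * w
            for t, w in zip(target_weights, online_weights)]


target = [0.0, 1.0]
online = [1.0, 1.0]
# A small tau makes the target network a slowly moving average.
target = soft_update(target, online, tau=0.1)
```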
In contrast to those, state-of-the-art on-policy methods have many attractive properties: they are able to evaluate exactly the resulting policy with no need to provide a separate exploration strategy. However, they suffer from poor sample efficiency, to a larger extent than off-policy reinforcement learning. The TRPO method (Schulman et al., 2015a) introduced trust region policy optimisation to explicitly control the speed of policy evolution of Gaussian policies over time, expressed in the form of the Kullback-Leibler divergence, during the training process. Nevertheless, the original TRPO method suffered from poor sample efficiency in comparison to off-policy methods such as DDPG. One way to address this issue is to replace the first-order gradient descent methods, standard for deep learning, with the second-order natural gradient (Amari, 1998). Wu et al. (2017) used a Kronecker-factored Approximate Curvature (K-FAC) optimiser (Martens & Grosse, 2015) in their ACKTR method. The PPO method (Schulman et al., 2017) proposes a number of modifications to the TRPO scheme, including changing the objective function formulation and clipping the probability ratio in the surrogate objective. Wang et al. (2016) proposed another approach in their ACER algorithm: in this method, the target network is still maintained in the off-policy way, similar to DDPG (Lillicrap et al., 2015), while the trust region constraint is built upon the difference between the current and the target network.
Related to our approach, a group of methods has recently appeared that attempts to combine the benefits of both groups. Gu et al. (2017) propose the interpolated policy gradient, which uses a weighted sum of the stochastic (Sutton et al., 2000) and deterministic (Silver et al., 2014) policy gradients. Nachum et al. (2018) propose an off-policy trust region method, Trust-PCL, which exploits off-policy data within the trust region optimisation framework, while maintaining stability of optimisation by using relative entropy regularisation.
While it is common practice to use replay buffers for off-policy reinforcement learning, the concept is not used in existing on-policy scenarios, which results in discarding all policies but the last. Furthermore, many on-policy methods, such as TRPO (Schulman et al., 2015a), rely on the stochastic policy gradient (Sutton et al., 2000), which is restricted by stationarity assumptions, in contrast to those based on the deterministic policy gradient (Silver et al., 2014), like DDPG (Lillicrap et al., 2015). In this article, we describe a novel reinforcement learning algorithm, allowing the joint use of replay buffers with trust region optimisation and leading to sample efficiency improvement. The contributions of the paper are as follows:

a reinforcement learning method that enables the replay buffer concept to be used along with on-policy data;

theoretical insights into replay buffer usage within the on-policy setting;

experimental evidence that, unlike the state-of-the-art methods such as ACKTR (Wu et al., 2017), PPO (Schulman et al., 2017) and TRPO (Schulman et al., 2015a), a single non-adaptive set of hyperparameters such as the trust region radius is sufficient for achieving better performance on a number of reinforcement learning tasks.
The code for this paper is available at https://github.com/dkangin/baselines/tree/master/baselines/trpo_replay.
2 Background
2.1 Actor-critic reinforcement learning
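Consider an agent interacting with the environment by responding to the states $s_t$, $t \geq 0$, from the state space $\mathcal{S}$, which are assumed to be also the observations, with actions $a_t$ from the action space $\mathcal{A}$ chosen by the policy distribution $\pi_\theta(\cdot \mid s_t)$, where $\theta$ are the parameters of the policy. The initial state distribution is $\rho_0: \mathcal{S} \to \mathbb{R}$. Every time the agent produces an action, the environment gives back a reward $r(s_t, a_t) \in \mathbb{R}$, which serves as feedback on how good the action choice was, and switches to the next state $s_{t+1}$ according to the transition probability $P(s_{t+1} \mid s_t, a_t)$. Altogether, it can be formalised as an infinite-horizon discounted Markov Decision Process $(\mathcal{S}, \mathcal{A}, P, r, \rho_0, \gamma)$, $\gamma \in [0, 1)$ (Wu et al., 2017; Schulman et al., 2015a). The expected discounted return (Bellman, 1957) is defined as per Schulman et al. (2015a):
$$\eta(\pi) = \mathbb{E}_{s_0, a_0, \ldots}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right] \qquad (1)$$
The advantage function (Baird III, 1993), the value function and the $Q$-function are defined as per Mnih et al. (2016); Schulman et al. (2015a):
$$A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t), \qquad (2)$$
$$V^\pi(s_t) = \mathbb{E}_{a_t, s_{t+1}, \ldots}\left[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l}, a_{t+l})\right], \qquad (3)$$
$$Q^\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots}\left[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l}, a_{t+l})\right]. \qquad (4)$$
In all above definitions $s_0 \sim \rho_0(s_0)$, $a_t \sim \pi(a_t \mid s_t)$, $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$, and the policy $\pi = \pi_\theta$ is defined by its parameters $\theta$.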
2.2 Trust Region Policy Optimisation (TRPO)
A straightforward approach for learning a policy is to perform unconstrained maximisation of the expected discounted return $\eta(\pi_\theta)$ with respect to the policy parameters $\theta$. However, for state-of-the-art iterative gradient-based optimisation methods, this approach would lead to unpredictable and uncontrolled changes in the policy, which would impede efficient exploration. Furthermore, in practice the exact values of $\eta(\pi_\theta)$ are unknown, and the quality of its estimates depends on approximators which tend to be correct only in the vicinity of the parameters of observed policies.
Schulman et al. (2015a), based on theorems by Kakade (2002), derive a minorisation-maximisation (MM) algorithm (Hunter & Lange, 2004) for policy parameter optimisation. They note that in practice the algorithm's convergence rate and the complexity of the maximum KL divergence computations make it impractical to apply this method directly.
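Therefore, they proposed replacing the unconstrained optimisation with a similar constrained optimisation problem, the Trust Region Policy Optimisation (TRPO) problem:
$$\max_{\theta} \sum_{s} \rho_{\theta_{old}}(s) \sum_{a} \pi_\theta(a \mid s)\, A^{\pi_{\theta_{old}}}(s, a) \qquad (5)$$
$$\text{subject to } \bar{D}_{KL}(\theta_{old}, \theta) \leq \delta, \qquad (6)$$
where $\rho_{\theta_{old}}$ denotes the discounted state visitation frequencies under the old policy, $\bar{D}_{KL}(\theta_{old}, \theta)$ is the average KL divergence between the old and the new policy, $\pi_{\theta_{old}}$ and $\pi_\theta$ respectively, and $\delta$ is the trust region radius. Despite this improvement, further enhancements are needed to solve this problem efficiently, as elaborated in the next section.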
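For Gaussian policies with a diagonal covariance matrix, the KL divergence in the trust region constraint has a closed form. The following Python sketch shows how the constraint could be checked on a batch of visited states; the function names and the numbers are illustrative assumptions, not part of the original method description.

```python
import math


def gaussian_kl_diag(mean_old, var_old, mean_new, var_new):
    """KL(old || new) for two Gaussians with diagonal covariance matrices."""
    kl = 0.0
    for m0, v0, m1, v1 in zip(mean_old, var_old, mean_new, var_new):
        kl += 0.5 * (math.log(v1 / v0) + (v0 + (m0 - m1) ** 2) / v1 - 1.0)
    return kl


def within_trust_region(kl_per_state, delta):
    """Check the average-KL constraint over a batch of visited states."""
    return sum(kl_per_state) / len(kl_per_state) <= delta


# One-dimensional action, unit variance, mean shifted by one.
kl_shifted = gaussian_kl_diag([0.0], [1.0], [1.0], [1.0])
```

Because the constraint is placed on the distributions rather than on the parameters, the same radius has a comparable meaning for differently parameterised policies.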
2.3 Second-order actor-critic natural gradient optimisation
Many of the state-of-the-art trust-region-based methods, including TRPO (Schulman et al., 2015a) and ACKTR (Wu et al., 2017), use second-order natural-gradient-based actor-critic optimisation (Amari, 1998; Kakade, 2002). The motivation is that the steepest descent direction of ordinary gradient descent, measured with the Euclidean norm in parameter space, depends on the parameterisation. To address this, the Fisher information matrix is used, which, as follows from Amari (1998) and Kakade (2002), normalises per-parameter changes in the objective function. In the context of actor-critic optimisation it can be written as (Wu et al., 2017; Kakade, 2002)
$$F = \mathbb{E}_{p(\tau)}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \left(\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right)^T\right], \qquad (7)$$
where $p(\tau)$ is the trajectory distribution $p(\tau) = p(s_0) \prod_{t=0}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$.
However, the computation of the Fisher matrix is intractable in practice due to the large number of parameters involved; therefore, there is a need to resort to approximations, such as the Kronecker-factored approximate curvature (K-FAC) method (Martens & Grosse, 2015), which was first used for reinforcement learning in ACKTR (Wu et al., 2017). In the proposed method, as detailed in Algorithm 1, this optimisation method is used for optimising the policy.
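To make the role of the Fisher matrix concrete, the sketch below computes a natural gradient step for a toy one-dimensional Gaussian policy with two parameters (mean and log standard deviation). It inverts a damped empirical Fisher matrix exactly, which is only feasible because the matrix is 2x2; it is not the K-FAC approximation. All names, the damping term and the sample values are assumptions made for illustration.

```python
import math


def score(action, mean, log_std):
    """Gradient of log N(action; mean, exp(log_std)^2) w.r.t. (mean, log_std)."""
    var = math.exp(2.0 * log_std)
    return ((action - mean) / var, (action - mean) ** 2 / var - 1.0)


def empirical_fisher(actions, mean, log_std):
    """Average outer product of the score over sampled actions (2x2 matrix)."""
    f = [[0.0, 0.0], [0.0, 0.0]]
    for a in actions:
        g = score(a, mean, log_std)
        for i in range(2):
            for j in range(2):
                f[i][j] += g[i] * g[j] / len(actions)
    return f


def natural_gradient(fisher, grad, damping):
    """Solve (F + damping * I) x = grad for a 2x2 Fisher matrix."""
    a, b = fisher[0][0] + damping, fisher[0][1]
    c, d = fisher[1][0], fisher[1][1] + damping
    det = a * d - b * c
    return ((d * grad[0] - b * grad[1]) / det,
            (a * grad[1] - c * grad[0]) / det)


actions = [2.0, -2.0]  # illustrative samples for a policy with mean 0, std 2
fisher = empirical_fisher(actions, mean=0.0, log_std=math.log(2.0))
step = natural_gradient(fisher, grad=(1.0, 1.0), damping=0.25)
```

The ordinary gradient `(1.0, 1.0)` is rescaled differently per parameter, reflecting how strongly each parameter changes the action distribution.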
3 Method description
While the original trust region optimisation method can only use the samples from the very last policy, discarding the potentially useful information from the previous ones, we make use of samples over several consecutive policies. The rest of the section defines the proposed adaptation of the replay buffer concept, and then formulates and discusses the proposed algorithm.
3.1 Usage of Replay Buffers
Mnih et al. (2013) suggested using replay buffers for DQN to improve the stability of learning; the idea has since been extended to other off-policy methods such as DDPG (Lillicrap et al., 2015). The concept has not been applied to on-policy methods like TRPO (Schulman et al., 2015a) or ACKTR (Wu et al., 2017), which do not make use of previous data generated by other policies. Although based on trust region optimisation, ACER (Wang et al., 2016) uses replay buffers for its off-policy part.
In this paper, we propose a different concept of replay buffers, which combines the on-policy data with data from several previous policies, to avoid the restrictions of policy distribution stationarity for the stochastic policy gradient (Sutton et al., 2000). Such replay buffers store simulations from several policies at the same time; these are then utilised in the method, which is built upon generalised value and advantage functions accommodating data from these policies. The following definitions are necessary for the formalisation of the proposed algorithm and theorems.
We define a generalised $Q$-function for multiple policies as
(8) 
(9) 
We also define the generalised value function and the generalised advantage function as
(10) 
(11) 
To conform with the notation from Sutton et al. (2000), we define
(12) 
$\Pr(s \to x, k, \pi)$, as in Sutton et al. (2000), is the probability of transition from the state $s$ to the state $x$ in $k$ steps using policy $\pi$.
Theorem 1.
For the set of policies the following equality will be true for the gradient:
(13) 
where are the joint parameters of all policies and is a bias function for the policy.
3.2 Algorithm description
The proposed approach is summarised in Algorithm 1. The replay buffer contains data collected from several subsequent policies. The size of this buffer is . During Stage 1, the data are collected for every path until the termination state is received, but at least steps in total for all paths. The policy actions are assumed to be sampled from the Gaussian distribution, with the mean values predicted by the policy estimator along with the covariance matrix diagonal. The covariance matrix output was inspired, although the idea is different, by the EPG paper (Ciosek & Whiteson, 2017).
At Stage 2, the obtained data for every policy are saved in the policy replay buffer .
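Stages 1 and 2 can be pictured with a small data structure that keeps the paths of the most recent policies and drops the oldest policy when a new one arrives. This is a minimal Python sketch under that reading; the class name, method names and path layout are hypothetical and do not reproduce the released implementation.

```python
from collections import deque


class PolicyReplayBuffer:
    """Keeps the paths of the most recent `max_policies` policies."""

    def __init__(self, max_policies):
        # A bounded deque silently drops the oldest policy when full.
        self._per_policy_paths = deque(maxlen=max_policies)

    def add_policy(self, paths):
        """Store all paths collected with one policy (Stage 2)."""
        self._per_policy_paths.append(list(paths))

    def iter_paths(self):
        """Yield (policy_slot, path) pairs over every stored policy."""
        for slot, paths in enumerate(self._per_policy_paths):
            for path in paths:
                yield slot, path

    def __len__(self):
        return len(self._per_policy_paths)


buffer = PolicyReplayBuffer(max_policies=3)
for policy_index in range(4):
    # Each path is a dict of per-step lists; contents are illustrative only.
    buffer.add_policy([{"policy": policy_index, "rewards": [1.0, 0.5]}])
stored = [path["policy"] for _, path in buffer.iter_paths()]
```

Note that the capacity is counted in policies rather than in transitions, so the number of stored steps varies with the length of the collected paths.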
At Stage 3, the regression of the value function is trained using the Adam optimiser (Kingma & Ba, 2015) with step size for iterations. For this regression, the sum-of-squares loss function is used. The value function target values are computed for every state for every policy in the replay buffer using the actual sampled policy values, where is the maximum policy step index:
(15) 
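A common way to obtain such regression targets from sampled rewards is the discounted reward-to-go of each visited state, accumulated backwards from the last step of the path. The sketch below assumes this reading of the targets; the function name is hypothetical.

```python
def value_targets(rewards, gamma):
    """Discounted reward-to-go for every step of one sampled path."""
    targets = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the final step so each state sums its future rewards.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        targets[t] = running
    return targets


targets = value_targets([1.0, 1.0, 1.0], gamma=0.5)
```

Applied to every path of every policy in the buffer, this yields one target per stored state for the sum-of-squares regression.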
During Stage 4, we perform the advantage function estimation. Schulman et al. (2015b) proposed the Generalised Advantage Estimator for the advantage function as follows:
(16) 
where
(17) 
(18) 
Here is a cutoff value, defined by the length of the sequence of occurred states and actions within the MDP, is an estimator parameter, and is the approximation for the value function , with the approximation targets defined in Equation (15). As shown in Schulman et al. (2015b), after rearrangement this would result in the generalised advantage function estimator
(19) 
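The single-policy estimator of Schulman et al. (2015b) is usually computed with a backward recursion over temporal-difference residuals. The Python sketch below follows that standard form, truncated at the end of the path; it does not implement the generalised multi-policy estimator introduced next, and the argument names are assumptions.

```python
def gae_advantages(rewards, values, gamma, lam):
    """Generalised advantage estimates for one path.

    `values` holds one value prediction per visited state plus a final
    bootstrap value (0.0 if the path ended in a termination state).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # Temporal-difference residual of the value approximation.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Discounted sum of residuals with factor gamma * lam.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


full_horizon = gae_advantages([1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
one_step = gae_advantages([1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=0.0)
```

Setting `lam` to zero reduces the estimate to the one-step residual, while `lam` equal to one recovers the discounted return minus the value baseline.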
For the proposed advantage function (see Equation 11), the estimator could be defined similarly to Schulman et al. (2015b) as
(20) 
(21) 
(22) 
However, this would require the estimation of multiple value functions, which would defeat the purpose of the replay buffer. To avoid it, we modify this estimator for the proposed advantage function as
(23) 
The proof of Theorem 2 is given in Appendix C. It shows that the difference between the two estimators depends on the difference between the conventional and the generalised value functions; given a continuous value function approximator, it reveals that the closer the policies are, within a few trust region radii, the smaller the bias will be.
During Stage 5, the policy function is approximated, using the K-FAC optimiser (Martens & Grosse, 2015) with the constant step size . As one can see from the description, and differently from ACKTR, we do not use any adaptation of the trust region radius and/or optimisation algorithm parameters. Also, the output parameters include the diagonal of the (diagonal) policy covariance matrix. The elements of the covariance matrix, for the purpose of efficient optimisation, are restricted to universal minimum and maximum values and . As an extension of Schulman et al. (2015b) and following Theorem 1 with the substitution of the likelihood ratio, the policy gradient estimation is defined as
(25) 
To practically implement this gradient, we substitute the parameters , derived from the latest policy for the replay buffer, instead of joint parameters assuming that the parameters would not deviate far from each other due to the trust region restrictions; it is still possible to calculate the estimation of for each policy using Equation (23) as these policies are observed. For the constrained optimisation we add the linear barrier function to the function :
(26) 
where is a barrier function parameter and are the parameters of the policy on the previous iteration. Besides removing the need for heuristic estimation of the optimisation parameters, it also conforms with the theoretical propositions shown in Schulman et al. (2017) and, while our approach was proposed independently, pursues similar ideas of using an actual constrained optimisation method instead of changing the gradient step size parameters as per Schulman et al. (2015a).
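One plausible reading of a linear barrier on the trust region constraint, assumed here purely for illustration, is a penalty that is zero inside the trust region and grows linearly with the constraint violation outside it. The sketch below encodes that assumption; the function name, the exact functional form and the numbers are hypothetical and may differ from Equation (26).

```python
def barrier_objective(surrogate, kl, delta, alpha):
    """Surrogate objective with a linear penalty once KL exceeds delta.

    Assumed form: surrogate - alpha * max(0, kl - delta).
    """
    return surrogate - alpha * max(0.0, kl - delta)


inside = barrier_objective(surrogate=1.0, kl=0.005, delta=0.01, alpha=10.0)
outside = barrier_objective(surrogate=1.0, kl=0.02, delta=0.01, alpha=10.0)
```

With such a term the optimiser itself is discouraged from leaving the trust region, so no separate step size adaptation is needed to enforce the constraint.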
The network architectures correspond to the OpenAI Baselines ACKTR implementation (Dhariwal et al., 2017), which was implemented by the ACKTR authors (Wu et al., 2017). The only departure from that architecture is the diagonal covariance matrix outputs, which are present, in addition to the mean output, in the policy network.
4 Experiments
4.1 Experimental results
In order to provide experimental evidence for the method, we have compared it with the on-policy ACKTR (Wu et al., 2017), PPO (Schulman et al., 2017) and TRPO (Schulman et al., 2015a) methods, as well as with the off-policy DDPG (Lillicrap et al., 2015) method on the MuJoCo (Todorov et al., 2012) robotic simulations. The technical implementation is described in Appendix A, and additional ablation studies are given in Appendix D.
Figure 1 shows the total reward values and their standard deviations, averaged over every one hundred simulation steps over three randomised runs. The results show substantial improvements over the state-of-the-art methods, including the on-policy ones (ACKTR, TRPO, PPO), on most problems. In contrast to those methods, the proposed method shows that the adaptive values for the trust region radius can be advantageously replaced by a fixed value in combination with the trainable policy distribution covariance matrix, thus reducing the number of necessary hyperparameters. The results for ACKTR on the tasks HumanoidStandup, Striker and Thrower are not included as the baseline ACKTR implementation (Dhariwal et al., 2017) diverged at the first iterations with the predefined parameterisation. PPO results are obtained from the Baselines PPO1 implementation (Dhariwal et al., 2017).
Figure 2 compares results for different replay buffer sizes; the size of the replay buffer reflects the number of policies in it and not actions (i.e. a buffer size of 3 means data from three successive policies in the replay buffer). We see that in most cases the use of replay buffers shows a performance improvement over a replay buffer of size 1 (i.e., no replay buffer, with only the current policy used for the policy gradient); substantial improvements can be seen for the HumanoidStandup task.
Figure 3 shows the performance comparison with the DDPG method (Lillicrap et al., 2015). In all the tasks except HalfCheetah and Humanoid, the proposed method outperforms DDPG. For HalfCheetah, the versions with a replay buffer marginally outperform the one without. It is also notable that the method demonstrates stable performance on the tasks HumanoidStandup, Pusher, Striker and Thrower, on which DDPG failed (and these tasks were not included in the DDPG article).
5 Conclusion
The paper combines replay buffers and on-policy data for reinforcement learning. Experimental results on various tasks from the MuJoCo suite (Todorov et al., 2012) show significant improvements compared to the state of the art. Moreover, we proposed replacing the heuristically calculated trust region parameters with a single fixed hyperparameter, which also reduces the computational expense, together with a trainable diagonal covariance matrix.
The proposed approach opens the door to using a combination of replay buffers and trust regions for reinforcement learning problems. While it is formulated for continuous tasks, it is possible to reuse the same ideas for discrete reinforcement learning tasks, such as Atari games.
References
 Abadi et al. (2016) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
 Amari (1998) Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
 Baird III (1993) Leemon C Baird III. Advantage updating. Technical report, Wright Laboratory, Wright-Patterson AFB, OH, 1993.
 Bellman (1957) Richard Bellman. A Markovian decision process. Journal of Mathematics and Mechanics, pp. 679–684, 1957.
 Ciosek & Whiteson (2017) Kamil Ciosek and Shimon Whiteson. Expected policy gradients. arXiv preprint arXiv:1706.05374, 2017.
 Crites & Barto (1995) Robert H Crites and Andrew G Barto. An actor/critic algorithm that is equivalent to Q-learning. In Advances in Neural Information Processing Systems, pp. 401–408, 1995.
 Dhariwal et al. (2017) Prafulla Dhariwal, Christopher Hesse, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. Openai baselines. https://github.com/openai/baselines, 2017.
 Gu et al. (2017) Shixiang Gu, Tim Lillicrap, Richard E Turner, Zoubin Ghahramani, Bernhard Schölkopf, and Sergey Levine. Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 3846–3855, 2017.
 Hester et al. (2013) Todd Hester, Manuel Lopes, and Peter Stone. Learning exploration strategies in model-based reinforcement learning. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-agent Systems, pp. 1069–1076. International Foundation for Autonomous Agents and Multiagent Systems, 2013.
 Hunter & Lange (2004) David R Hunter and Kenneth Lange. A tutorial on MM algorithms. The American Statistician, 58(1):30–37, 2004.
 Kakade (2002) Sham M Kakade. A natural policy gradient. In Advances in neural information processing systems, pp. 1531–1538, 2002.
 Kingma & Ba (2015) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. Proceedings of the International Conference on Learning Representations, 2015.
 Lillicrap et al. (2015) Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. Proceedings of the International Conference on Learning Representations, 2015.
 Lin (1993) Long-Ji Lin. Reinforcement learning for robots using neural networks. Technical report, Carnegie-Mellon University, School of Computer Science, Pittsburgh, PA, 1993.
 Martens & Grosse (2015) James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417, 2015.
 Mnih et al. (2013) V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 Mnih et al. (2016) V. Mnih, A.P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. International Conference on Machine Learning, pp. 1928–1937, 2016.
 Nachum et al. (2018) Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Trust-PCL: An off-policy trust region method for continuous control. International Conference on Learning Representations, 2018.
 Rummery & Niranjan (1994) Gavin A Rummery and Mahesan Niranjan. On-line Q-learning using connectionist systems. Technical report, Cambridge, England: University of Cambridge, Department of Engineering, 1994.
 Schulman et al. (2015a) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML15), pp. 1889–1897, 2015a.
 Schulman et al. (2015b) John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.
 Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Silver et al. (2014) David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014.
 Sutton & Barto (1998) Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
 Sutton et al. (2000) Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063, 2000.
 Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.
 Wang et al. (2016) Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actorcritic with experience replay. arXiv preprint arXiv:1611.01224, 2016.
 Watkins (1989) Christopher John Cornish Hellaby Watkins. Learning from delayed rewards. PhD thesis, King’s College, Cambridge, 1989.
 Wu et al. (2017) Yuhuai Wu, Elman Mansimov, Roger B Grosse, Shun Liao, and Jimmy Ba. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in Neural Information Processing Systems, pp. 5285–5294, 2017.
Appendix A Technical implementation
The parameters of Algorithm 1 used in the experiments are given in Table 1; the parameters were initially set, where possible, to the ones taken from the state-of-the-art trust region approach implementation (Wu et al., 2017; Dhariwal et al., 2017), and then some of them were changed based on the experimental evidence. As the underlying numerical optimisation algorithms are out of the scope of the paper, the parameters of the K-FAC optimiser from Dhariwal et al. (2017) have been used for the experiments; for the Adam algorithm (Kingma & Ba, 2015), the default parameters from the TensorFlow (Abadi et al., 2016) implementation ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) have been used.
The method has been implemented in Python 3 using TensorFlow (Abadi et al., 2016) as an extension of the OpenAI Baselines package (Dhariwal et al., 2017). The neural network for the control experiments consists of two fully connected layers, containing 64 neurons each, following the OpenAI ACKTR network implementation (Dhariwal et al., 2017).
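The shape of such a policy network, with a mean output and a bounded diagonal covariance output, can be sketched in plain Python as below. The tanh activations, the log-variance parameterisation, the initialisation and the bounds `var_min` and `var_max` are assumptions made for illustration only; the released TensorFlow implementation may differ in all of these.

```python
import math
import random


def dense(inputs, weights, biases):
    """Fully connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Small uniform random weights and zero biases (illustrative choice)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class GaussianPolicyNet:
    """Two hidden layers of 64 units with mean and diagonal variance heads."""

    def __init__(self, obs_dim, act_dim, hidden=64,
                 var_min=1e-2, var_max=1.0, seed=0):
        rng = random.Random(seed)
        self.l1 = init_layer(obs_dim, hidden, rng)
        self.l2 = init_layer(hidden, hidden, rng)
        self.mean_head = init_layer(hidden, act_dim, rng)
        self.logvar_head = init_layer(hidden, act_dim, rng)
        self.var_min, self.var_max = var_min, var_max

    def forward(self, obs):
        h = [math.tanh(v) for v in dense(obs, *self.l1)]
        h = [math.tanh(v) for v in dense(h, *self.l2)]
        mean = dense(h, *self.mean_head)
        # Variances are clipped to fixed minimum and maximum values.
        var = [min(max(math.exp(v), self.var_min), self.var_max)
               for v in dense(h, *self.logvar_head)]
        return mean, var


net = GaussianPolicyNet(obs_dim=3, act_dim=2)
mean, var = net.forward([0.1, -0.2, 0.3])
```

An action would then be sampled from the Gaussian with this mean and diagonal covariance, so exploration is governed by the learned variances rather than by a separate exploration strategy.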
Appendix B Proof of Theorem 1
Appendix C Proof of Theorem 2
Proof.
The difference between the two $k$-th estimators is given as
(29) 
By substituting this into the GAE estimator difference one can obtain
(30) 
∎
Appendix D Further experiments outlining the impact of trainable covariance matrix
To demonstrate the impact of the trainable covariance matrix on the performance of the method, we have carried out two experiments: a comparison between the proposed method and its version with a fixed (identity) covariance matrix, and an assessment of the impact of covariance training on the ACKTR method (Wu et al., 2017), which shares similar aspects with the proposed method but uses a fixed (identity) covariance matrix for the policies. The latter comparison shows that such training of the covariance matrix can be successfully used in combination with other methods such as ACKTR; it is also justified by the fact that ACKTR provides evidence of the impact of covariance matrix training without the influence of the other proposed improvements.
In Figure 4, the results of the proposed method are shown in comparison with its version using a fixed identity covariance matrix; the replay buffer size is set to the value , and all hyperparameters used in both versions correspond to the ones for the experiments depicted in Figure 2. In most of the tasks, the original version with the trainable matrix outperforms the modified one with no covariance matrix training; for those tasks where it does not (Striker-v2, Humanoid-v2, Pusher-v2, and Reacher-v2), one can notice that the proposed method reaches higher reward values faster; a possible explanation for the better final performance of the fixed-covariance version on those tasks is that the identity covariance matrix fits those tasks well in terms of exploratory capabilities. It is also notable that, without the trainable covariance matrix, substantial fluctuation of performance between runs is observed on a number of tasks (HalfCheetah-v2, Walker2d-v2, HumanoidStandup-v2, Hopper-v2).
Figure 5 shows the comparison between ACKTR (Wu et al., 2017) and its modification implementing the same covariance matrix training as the proposed method (i.e. outputting the diagonal covariance matrix from the policy network as in Stage 5 of Algorithm 1); there are no other differences from the original ACKTR method. The results for HumanoidStandup-v2 are excluded as the official implementation (Dhariwal et al., 2017) of the ACKTR method (and our modification) numerically diverges on this task; unlike the original version of ACKTR, the modification does not diverge on the tasks Striker-v2 and Thrower-v2, therefore the results for those tasks are included in the comparison graphs. The modification shows trends similar to those in Figure 4; unexpectedly, the modified ACKTR with the trainable covariance matrix gives better results on Humanoid-v2 in these experimental settings than the proposed method.