# Regularized Anderson Acceleration for Off-Policy Deep Reinforcement Learning

## Abstract

Model-free deep reinforcement learning (RL) algorithms have been widely used for a range of complex control tasks. However, slow convergence and sample inefficiency remain challenging problems in RL, especially when handling continuous and high-dimensional state spaces. To tackle this problem, we propose a general acceleration method for model-free, off-policy deep RL algorithms by drawing on the idea underlying regularized Anderson acceleration (RAA), which is an effective approach to accelerating the solution of fixed-point problems with perturbations. Specifically, we first explain how policy iteration can be applied directly with Anderson acceleration. Then we extend RAA to the case of deep RL by introducing a regularization term to control the impact of perturbation induced by function approximation errors. We further propose two strategies, i.e., progressive update and adaptive restart, to enhance the performance. The effectiveness of our method is evaluated on a variety of benchmark tasks, including Atari 2600 and MuJoCo. Experimental results show that our approach substantially improves both the learning speed and final performance of state-of-the-art deep RL algorithms. The code and models are available at: https://github.com/shiwj16/raa-drl.

## 1 Introduction

Reinforcement learning (RL) is a principled mathematical framework for experience-based autonomous learning of policies. In recent years, model-free deep RL algorithms have been applied in a variety of challenging domains, from game playing Mnih et al. (2015); Silver et al. (2016) to robot navigation Shi et al. (2018); Mirowski et al. (2017). However, sample inefficiency, i.e., an impractically high number of required interactions with the environment, remains a major limitation of current RL algorithms for problems with continuous and high-dimensional state spaces. For example, even on tasks with low-dimensional state spaces and fairly benign dynamics, many RL approaches may require thousands of trials to learn. Sample inefficiency makes learning in real physical systems impractical and severely limits the applicability of RL approaches in more challenging scenarios.

A promising way to improve the sample efficiency of RL is to learn models of the underlying system dynamics. However, learning models of the underlying transition dynamics is difficult and inevitably leads to modelling errors. Alternatively, off-policy algorithms such as deep Q-learning (DQN) Mnih et al. (2015) and its variants Wang et al. (2016); Van Hasselt et al. (2016), deep deterministic policy gradient (DDPG) Lillicrap et al. (2016), soft actor-critic (SAC) Haarnoja et al. (2018); Shi et al. (2019) and off-policy hierarchical RL Nachum et al. (2018), which instead aim to reuse past experience, are commonly used to alleviate the sample inefficiency problem. Unfortunately, off-policy algorithms are typically based on policy iteration or value iteration, which repeatedly apply the Bellman operator of interest and generally require an infinite number of iterations to converge exactly to the optimum. Moreover, the Bellman iteration constructs a contraction mapping which converges asymptotically to the optimal value function Bertsekas and Tsitsiklis (1996). Iterating this mapping essentially results in a fixed-point problem Granas and Dugundji (2013) and thus may be unacceptably slow to converge. These issues are further exacerbated when a nonlinear function approximator such as a neural network is used or the tasks have continuous state and action spaces.

This paper explores how to accelerate the convergence or improve the sample efficiency of model-free, off-policy deep RL. We make the observation that RL is closely linked to fixed-point iteration: the optimal policy can be found by solving a fixed-point problem of the associated Bellman operator. Therefore, we attempt to embrace the idea underlying Anderson acceleration (also known as Anderson mixing or Pulay mixing) Walker and Ni (2011); Toth and Kelley (2015), which is a method capable of speeding up the computation of fixed-point iterations. While the classic fixed-point iteration repeatedly applies the operator to the last estimate, Anderson acceleration searches for the optimal point that has minimal residual within the subspace spanned by several previous estimates, and then applies the operator to this optimal estimate. Prior work Geist and Scherrer (2018) has successfully applied Anderson acceleration to value iteration, and preliminary experiments show a significant speedup of convergence. However, the existing application is only feasible on simple tasks with low-dimensional, discrete state and action spaces. Besides, as far as we know, Anderson acceleration has never been applied to deep RL due to some long-standing issues, including biases induced by sampling a minibatch and function approximation errors.

In this paper, Anderson acceleration is first applied to policy iteration under a tabular setting. Then, we propose a practical acceleration method for model-free, off-policy deep RL algorithms based on regularized Anderson acceleration (RAA) Scieur et al. (2016), which is a general paradigm with a Tikhonov regularization term to control the impact of perturbations. The perturbations may be noise injected from the outside or high-order error terms induced by a nonlinear fixed-point iteration function. In the context of deep RL, function approximation errors are the major source of perturbation for RAA. We present two bounds to characterize how the regularization term controls the impact of function approximation errors. Two strategies, i.e., progressive update and adaptive restart, are further proposed to enhance the performance. Moreover, our acceleration method can be readily applied to deep RL algorithms including Dueling-DQN Wang et al. (2016) and twin delayed DDPG (TD3) Fujimoto et al. (2018) to solve very complex, high-dimensional tasks, such as the Atari 2600 and MuJoCo Todorov et al. (2012) benchmarks. Finally, the empirical results show that our approach exhibits a substantial improvement in both learning speed and final performance over vanilla deep RL algorithms.

## 2 Related Work

Prior works have made a number of efforts to improve the sample efficiency and speed up the convergence of deep RL from different perspectives, such as variance reduction Greensmith et al. (2004); Schulman et al. (2016), model-based RL Deisenroth and Rasmussen (2011); Williams et al. (2017); Buckman et al. (2018), guided exploration Levine and Abbeel (2014); Chebotar et al. (2017), etc. One of the most widely used techniques is off-policy RL, which combines temporal difference Sutton (1988) and experience replay Lin (1993); Wang et al. (2017) so as to make use of all the previous samples before each update to the policy parameters. Although using previous samples introduces biases, off-policy RL alleviates the high variance in the estimation of the Q-value and policy gradient Gu et al. (2017). Consequently, fast convergence can be achieved with careful parameter tuning.

As one core technique of off-policy RL, temporal difference is derived from the Bellman iteration, which can be regarded as a fixed-point problem Granas and Dugundji (2013). Our work focuses on speeding up the convergence of off-policy RL by speeding up the convergence of the essential fixed-point problem, relying on a technique called Anderson acceleration. This method is exploited by prior work Walker and Ni (2011); Henderson and Varadhan (2019) to accelerate the fixed-point iteration by computing the new iterate as a linear combination of previous evaluations. In the linear case, the convergence rate of Anderson acceleration has been analyzed in detail and proved to be equal to or better than that of fixed-point iteration in Toth and Kelley (2015). For nonlinear fixed-point iteration, regularized Anderson acceleration is proposed by Scieur et al. (2016) to constrain the norm of the coefficient vector and reduce the impact of perturbations. Recent works Geist and Scherrer (2018); Xie et al. (2018) have applied Anderson acceleration to value iteration and deep neural networks, and preliminary experiments show that a significant speedup of convergence is achieved. However, as far as we know, there is still no research showing its acceleration effect on deep RL for complex high-dimensional problems.

## 3 Preliminaries

Under the RL paradigm, the interaction between an agent and the environment is described as a Markov decision process (MDP). Specifically, at a discrete timestep $t$, the agent takes an action $a_t$ in a state $s_t$ and transitions to a subsequent state $s_{t+1}$ while obtaining a reward $r_t = r(s_t, a_t)$ from the environment. The transition between states satisfies the Markov property, i.e., $P(s_{t+1} \mid s_t, a_t, \dots, s_1, a_1) = P(s_{t+1} \mid s_t, a_t)$. Usually, the RL algorithm aims to find a policy $\pi$ that maximizes the expected sum of discounted future rewards. The Q-value function describes the expected return starting from a state-action pair $(s, a)$: $Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \mid s_0 = s, a_0 = a\right]$, where $\gamma \in [0, 1)$ is the discount factor and the policy $\pi$ is a function or conditional distribution mapping the state space $\mathcal{S}$ to the action space $\mathcal{A}$.
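As a small illustration of the discounted return that the Q-value function averages over, the following sketch sums a finite reward sequence with geometric weights. The function name `discounted_return` and the finite horizon are assumptions made for the example, not part of the paper.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma^t * r_t for a finite reward sequence."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))   # 1, gamma, gamma^2, ...
    return float(np.dot(discounts, rewards))

# Three unit rewards with gamma = 0.5: 1 + 0.5 + 0.25
example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

Averaging this quantity over many rollouts that start from the same state-action pair and follow the same policy would give a Monte Carlo estimate of the Q-value.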

### 3.1 Off-policy reinforcement learning

Most off-policy RL algorithms are derived from policy iteration, which alternates between policy evaluation and policy improvement to monotonically improve the policy and the value function until convergence. For complex environments with unknown dynamics and continuous spaces, policy iteration is generally combined with function approximation, and a parameterized Q-value function (or critic) $Q_\theta$ and policy function $\pi_\phi$ are learned from sampled interactions with the environment. Since the critic is represented as a parameterized function instead of a look-up table, policy evaluation is replaced with an optimization problem that minimizes the squared temporal difference error, i.e., the discrepancy between the outputs of the critic before and after applying the Bellman operator $\mathcal{T}$,

$$\min_{\theta}\; \mathbb{E}_{(s_t, a_t)}\left[\left(Q_{\theta}(s_t, a_t) - \mathcal{T} Q_{\theta'}(s_t, a_t)\right)^{2}\right], \tag{1}$$

where typically the Bellman operator is applied to a separate target value network $Q_{\theta'}$ whose parameter $\theta'$ is periodically replaced or softly updated with a copy of the current Q-network weight $\theta$.
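The two ways of refreshing the target network mentioned above can be sketched as follows. This is a minimal illustration on dictionaries of NumPy arrays; the helper names `hard_update` and `soft_update` and the mixing rate `tau` are assumptions introduced for the example.

```python
import numpy as np

def hard_update(online):
    """Periodic replacement: the target becomes a copy of the online parameters."""
    return {name: value.copy() for name, value in online.items()}

def soft_update(target, online, tau):
    """Soft update: move each target parameter a fraction tau toward the online one."""
    return {name: (1.0 - tau) * target[name] + tau * online[name] for name in target}

online = {"w": np.array([1.0, 2.0])}
target = {"w": np.array([0.0, 0.0])}
target = soft_update(target, online, tau=0.5)   # halfway between old target and online
copied = hard_update(online)
```

A small `tau` makes the target change slowly, which keeps the regression target in the squared temporal difference error nearly stationary between updates.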

In the off-policy RL literature, prior works have proposed a number of modifications to the Bellman operator to alleviate overestimation or function approximation errors, and have thereby achieved significant improvements. Similar to policy improvement, DQN replaces the current policy with a greedy policy for the next state in the Bellman operator,

$$\mathcal{T} Q_{\theta'}(s_t, a_t) = \mathbb{E}_{s_{t+1}}\left[r_t + \gamma \max_{a_{t+1}} Q_{\theta'}(s_{t+1}, a_{t+1})\right]. \tag{2}$$
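A minimal sketch of how the greedy DQN target and the squared temporal difference error might be computed on a minibatch is given below. The `dones` mask that stops bootstrapping at terminal transitions is a common implementation detail assumed here; it is not stated in the text, and the function names are illustrative.

```python
import numpy as np

def dqn_targets(rewards, next_q_target, dones, gamma=0.99):
    """r + gamma * max_a' Q_target(s', a'), without bootstrapping on terminal steps."""
    return rewards + gamma * (1.0 - dones) * next_q_target.max(axis=1)

def td_loss(q_pred, targets):
    """Mean squared temporal difference error over the minibatch."""
    return float(np.mean((q_pred - targets) ** 2))

rewards = np.array([1.0, 0.0])
dones = np.array([0.0, 1.0])                         # second transition is terminal (assumed)
next_q_target = np.array([[0.5, 2.0], [3.0, 1.0]])   # target-network outputs, shape (batch, actions)
targets = dqn_targets(rewards, next_q_target, dones, gamma=0.9)
loss = td_loss(np.array([2.8, 1.0]), targets)
```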

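As a state-of-the-art actor-critic algorithm for continuous control, TD3 Fujimoto et al. (2018) proposes a clipped double Q-learning variant and a target policy smoothing regularization to modify the Bellman operator, which alleviates overestimation and overfitting problems,

$$\mathcal{T} Q_{\theta'}(s_t, a_t) = \mathbb{E}_{s_{t+1}}\left[r_t + \gamma \min_{j=1,2} Q_{\theta'_j}\left(s_{t+1}, \pi_{\phi'}(s_{t+1}) + \epsilon\right)\right], \quad \epsilon \sim \operatorname{clip}\left(\mathcal{N}(0, \sigma), -c, c\right), \tag{3}$$

where $Q_{\theta_1}, Q_{\theta_2}$ denote two critics with decoupled parameters $\theta_1, \theta_2$, $\theta'_1, \theta'_2$ are the corresponding target parameters, and $\pi_{\phi'}$ is the target policy. The added noise $\epsilon$ is clipped by the positive constant $c$.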
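The clipped double Q-learning target with target policy smoothing can be sketched as below. The toy actor and critics are hypothetical stand-ins for the target networks, chosen only so that the arithmetic is easy to follow; argument names such as `sigma` and `c` are assumptions for the example.

```python
import numpy as np

def td3_target(rewards, next_states, actor_target, critic1_target, critic2_target,
               rng, gamma=0.99, sigma=0.2, c=0.5):
    """Clipped double-Q target: bootstrap from the smaller of two target critics."""
    next_actions = actor_target(next_states)
    noise = np.clip(rng.normal(0.0, sigma, size=next_actions.shape), -c, c)
    smoothed = next_actions + noise                      # target policy smoothing
    q_min = np.minimum(critic1_target(next_states, smoothed),
                       critic2_target(next_states, smoothed))
    return rewards + gamma * q_min

# Toy stand-ins for the target networks (hypothetical, for illustration only).
actor = lambda s: 0.1 * s.sum(axis=1, keepdims=True)
critic1 = lambda s, a: s.sum(axis=1) + a[:, 0]
critic2 = lambda s, a: s.sum(axis=1) + a[:, 0] - 1.0

states = np.array([[0.5, 0.5], [1.0, -1.0]])
rewards = np.array([1.0, 0.0])
rng = np.random.default_rng(0)
y_clean = td3_target(rewards, states, actor, critic1, critic2, rng, gamma=0.9, sigma=0.0)
y_noisy = td3_target(rewards, states, actor, critic1, critic2, rng, gamma=0.9, sigma=5.0, c=0.5)
```

Because the toy critics are linear in the action with unit slope, clipping the noise to `[-c, c]` bounds how far the noisy target can move from the noise-free one, which is the purpose of the clipping constant.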

### 3.2 Anderson acceleration for value iteration

Most RL algorithms are derived from a fundamental framework named policy iteration, which consists of two phases, i.e., policy evaluation and policy improvement. Policy evaluation estimates the Q-value function induced by the current policy by iterating a Bellman operator from an initial estimate. Following policy evaluation, policy improvement acquires a better policy from a greedy strategy, $\pi_{k+1}(s) = \arg\max_{a} Q^{\pi_k}(s, a)$. Policy iteration alternates the two phases to update the Q-value and the policy respectively until convergence. As a special variant of policy iteration, value iteration merges policy evaluation and policy improvement into one iteration,

$$Q_{k+1}(s, a) = \mathcal{T} Q_k(s, a) = \mathbb{E}_{s'}\left[r(s, a) + \gamma \max_{a'} Q_k(s', a')\right], \tag{4}$$

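and iterates it until convergence from an initial estimate $Q_0$, where the Bellman operator is only repeatedly applied to the last estimate. Anderson acceleration is a widely used technique to speed up the convergence of fixed-point iterations and has been successfully applied to speed up value iteration Geist and Scherrer (2018) by linearly combining the $m$ most recent value estimates,

$$Q_{k+1} = \sum_{i=1}^{m} \alpha_i^{k}\, \mathcal{T} Q_{k-m+i}, \tag{5}$$

where the coefficient vector $\alpha^{k} \in \mathbb{R}^{m}$ is determined by minimizing the norm of the total Bellman residuals of these estimates,

$$\alpha^{k} = \arg\min_{\alpha \in \mathbb{R}^{m}} \left\| \sum_{i=1}^{m} \alpha_i \left(\mathcal{T} Q_{k-m+i} - Q_{k-m+i}\right) \right\|, \quad \text{s.t.} \quad \sum_{i=1}^{m} \alpha_i = 1. \tag{6}$$

For the $\ell_2$-norm, the minimum can be solved analytically by using the Karush-Kuhn-Tucker conditions. The corresponding coefficient vector is given by

$$\alpha^{k} = \frac{\left(\Delta_k^{T} \Delta_k\right)^{-1} \mathbf{1}}{\mathbf{1}^{T} \left(\Delta_k^{T} \Delta_k\right)^{-1} \mathbf{1}}, \tag{7}$$

where $\Delta_k = [\delta_{k-m+1}, \dots, \delta_k]$ is a Bellman residuals matrix with $\delta_i = \mathcal{T} Q_i - Q_i$, and $\mathbf{1}$ denotes the vector with all components equal to one Geist and Scherrer (2018).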
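To make the procedure concrete, the following sketch runs value iteration on a small random MDP, with and without Anderson mixing of the last `m` Bellman images. Several parts are assumptions added for the example rather than taken from the paper: the random MDP itself, the use of a least-squares solve in place of an explicit inverse, the fallback to the latest estimate when normalization fails, and the safeguard that keeps a mixed iterate only when its Bellman residual is no worse than that of the plain iterate.

```python
import numpy as np

def bellman(Q, P, R, gamma):
    """Optimality operator: (TQ)(s,a) = R(s,a) + gamma * E_{s'}[max_a' Q(s',a')]."""
    return R + gamma * (P @ Q.max(axis=1))

def anderson_coefficients(Delta):
    """Minimize ||Delta @ alpha||_2 subject to sum(alpha) = 1 (closed form)."""
    m = Delta.shape[1]
    x = np.linalg.lstsq(Delta.T @ Delta, np.ones(m), rcond=None)[0]
    total = x.sum()
    if not np.isfinite(total) or abs(total) < 1e-12:
        alpha = np.zeros(m)          # assumption: fall back to the latest estimate
        alpha[-1] = 1.0
        return alpha
    return x / total

def value_iteration(P, R, gamma, m=1, tol=1e-8, max_iters=1000):
    """Value iteration; m > 1 mixes the last m Bellman images (safeguarded)."""
    Q = np.zeros_like(R)
    history = []                     # pairs (Q_i, T Q_i)
    for k in range(max_iters):
        TQ = bellman(Q, P, R, gamma)
        if np.max(np.abs(TQ - Q)) < tol:
            return Q, k
        history = (history + [(Q, TQ)])[-m:]
        Delta = np.stack([(t - q).ravel() for q, t in history], axis=1)
        alpha = anderson_coefficients(Delta)
        candidate = sum(a * t for a, (_, t) in zip(alpha, history))
        # Safeguard (assumption, not from the text): keep the mixed iterate only
        # if its Bellman residual is no worse than that of the plain iterate.
        res_cand = np.max(np.abs(bellman(candidate, P, R, gamma) - candidate))
        res_plain = np.max(np.abs(bellman(TQ, P, R, gamma) - TQ))
        Q = candidate if res_cand <= res_plain else TQ
    return Q, max_iters

rng = np.random.default_rng(0)
S, A, gamma = 20, 4, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)    # rows become transition distributions
R = rng.random((S, A))
Q_plain, k_plain = value_iteration(P, R, gamma, m=1)
Q_aa, k_aa = value_iteration(P, R, gamma, m=5)
print("iterations (plain, mixed):", k_plain, k_aa)
```

With `m=1` the coefficient vector is the single weight 1 and the loop reduces to ordinary value iteration. The safeguard costs two extra Bellman evaluations per step; it is included only so that the sketch inherits the contraction guarantee of the plain iteration.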

## 4 Regularized Anderson Acceleration for Deep Reinforcement Learning

Our regularized Anderson acceleration (RAA) method for deep RL can be derived starting from a direct implementation of Anderson acceleration to the classic policy iteration algorithm. We will first present this derivation to show that the resulting algorithm converges faster to the optimal policy than the vanilla form. Then, a regularized variant is proposed for a more general case with function approximation. Based on this theory, a progressive and practical acceleration method with adaptive restart is presented for off-policy deep RL algorithms.

### 4.1 Anderson acceleration for policy iteration

As described above, Anderson acceleration can be directly applied to value iteration. However, policy iteration is more fundamental and better suited to scaling to deep RL than value iteration. Unfortunately, applying Anderson acceleration to policy iteration is complicated, because there is no explicit fixed-point mapping between the policies in any two consecutive steps, which makes it impossible to straightforwardly apply Anderson acceleration to the policy $\pi$.

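Due to the one-to-one mapping between policies and Q-value functions, policy iteration can be accelerated by applying Anderson acceleration to the policy improvement, which establishes a mapping from the current Q-value estimate to the next policy. In this section, our derivation is based on a tabular setting, to enable theoretical analysis. Specifically, for the prototype policy iteration, suppose that estimates have been computed up to iteration $k$, and that in addition to the current estimate $Q^{\pi_k}$, the $m-1$ previous estimates $Q^{\pi_{k-1}}, \dots, Q^{\pi_{k-m+1}}$ are also known. Then, a linear combination of estimates with coefficients $\alpha_i$^{1} reads

$$Q_{\alpha}^{k} = \sum_{i=1}^{m} \alpha_i Q^{\pi_{k-m+i}}, \quad \text{with} \quad \sum_{i=1}^{m} \alpha_i = 1. \tag{8}$$

Due to this equality constraint, we define the combined Bellman operator $\mathcal{T}_c$ as follows

$$\mathcal{T}_c Q_{\alpha}^{k} = \sum_{i=1}^{m} \alpha_i \mathcal{T} Q^{\pi_{k-m+i}}. \tag{9}$$

Then, one searches for a coefficient vector $\alpha^{k}$ that minimizes the following objective function, defined as the combined Bellman residuals over the entire state-action space $\mathcal{S} \times \mathcal{A}$,

$$\alpha^{k} = \arg\min_{\alpha} \left\| \mathcal{T}_c Q_{\alpha}^{k} - Q_{\alpha}^{k} \right\| = \arg\min_{\alpha} \left\| \sum_{i=1}^{m} \alpha_i \left(\mathcal{T} Q^{\pi_{k-m+i}} - Q^{\pi_{k-m+i}}\right) \right\|, \quad \text{s.t.} \quad \sum_{i=1}^{m} \alpha_i = 1. \tag{10}$$

In this paper, we consider the $\ell_2$-norm, although a different norm may also be feasible (for example $\ell_1$ and $\ell_\infty$, in which case the optimization problem becomes a linear program). The solution to this optimization problem is identical to (7) except that $\Delta_k = [\delta_{k-m+1}, \dots, \delta_k]$ with $\delta_i = \mathcal{T} Q^{\pi_i} - Q^{\pi_i}$. Detailed derivation can be found in Appendix A.1 of the supplementary material. Then, the new policy improvement steps are given by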

(11) |

Meanwhile, the Q-value estimate can be obtained by iteratively applying the following policy evaluation operator, starting from some initial function $Q_0$,

(12) |

In fact, the effect of acceleration can be explained intuitively. The linear combination is a better estimate of Q-value than the last one in terms of combined Bellman residuals. Accordingly, the policy is improved from a better policy baseline corresponding to the better estimate of Q-value.
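The claim that the linear combination is a better estimate in terms of combined Bellman residuals can be checked numerically: the latest estimate alone and the uniform average are both feasible coefficient vectors, so the minimizer can do no worse than either. The sketch below uses a random matrix as a stand-in for the residual columns, which is an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 50, 5                           # number of state-action entries, stored estimates
Delta = rng.normal(size=(n, m))        # column i stands in for T Q^{pi_i} - Q^{pi_i}

# Closed-form minimizer of ||Delta @ alpha||_2 subject to sum(alpha) = 1.
x = np.linalg.solve(Delta.T @ Delta, np.ones(m))
alpha = x / x.sum()

combined = np.linalg.norm(Delta @ alpha)                  # optimal combination
latest = np.linalg.norm(Delta[:, -1])                     # alpha = (0, ..., 0, 1)
uniform = np.linalg.norm(Delta @ np.full(m, 1.0 / m))     # alpha = (1/m, ..., 1/m)
```

Note that the coefficients are not required to be non-negative, so the optimal combination may extrapolate beyond the stored estimates.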

### 4.2 Regularized variant with function approximation

For RL control tasks with continuous state and action spaces, or a high-dimensional state space, we generally consider the case in which the Q-value function is approximated by a parameterized function approximator $Q_\theta$. If the approximation is sufficiently good, it might be appropriate to use it in place of $Q^{\pi}$ in (8)-(12). However, there are several key challenges when implementing Anderson acceleration with function approximation.

First, notice that the Bellman residuals in (10) are calculated over the entire state-action space. Unfortunately, sweeping the entire state-action space is intractable for continuous RL, and a fine-grained discretization will lead to the curse of dimensionality. A feasible alternative to avoid this issue is to use a sampled Bellman residuals matrix instead. To alleviate the bias induced by sampling a minibatch, we adopt a large sample size specifically for Anderson acceleration.

Second, function approximation errors are unavoidable and lead to a biased solution of Anderson acceleration. This issue is exacerbated by deep models. Therefore, function approximation errors will induce severe perturbation when applying Anderson acceleration to policy iteration with function approximation. In addition to the perturbation, the solution (7) contains the inverse of a squared Bellman residuals matrix, which may suffer from ill-conditioning when the squared Bellman residuals matrix is rank-deficient, and this is a major source of numerical instability in vanilla Anderson acceleration. In other words, even if the perturbation is small, its impact on the solution can be arbitrarily large.
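The ill-conditioning described above is easy to reproduce. In the sketch below the residual columns are made nearly collinear, as would happen when successive estimates barely differ, and the condition number of the squared residuals matrix is compared with and without a small diagonal shift. The synthetic data, the sample size of 128 and the shift of 0.1 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 128, 5
base = rng.normal(size=(n, 1))
# Nearly collinear residual columns: a shared direction plus tiny differences.
Delta = base + 1e-6 * rng.normal(size=(n, m))

gram = Delta.T @ Delta
cond_plain = np.linalg.cond(gram)                       # close to singular
cond_shifted = np.linalg.cond(gram + 0.1 * np.eye(m))   # diagonal shift restores conditioning
print(f"condition number: {cond_plain:.2e} without shift, {cond_shifted:.2e} with shift")
```

With a condition number this large, a small change in the residuals can change the solution of the linear system drastically, which is the instability the paragraph refers to.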

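Based on the above observations, we extend the idea underlying RAA to policy iteration with function approximation in this section. The coefficient vector in (10) is now adjusted to $\tilde{\alpha}^{k}$, which minimizes the perturbed objective function with an added Tikhonov regularization term,

$$\tilde{\alpha}^{k} = \arg\min_{\alpha} \left\| \sum_{i=1}^{m} \alpha_i \left(\mathcal{T} Q^{\pi_{k-m+i}} - Q^{\pi_{k-m+i}} + e_{k-m+i}\right) \right\|_2^{2} + \lambda \|\alpha\|_2^{2}, \quad \text{s.t.} \quad \sum_{i=1}^{m} \alpha_i = 1, \tag{13}$$

where $e_{k-m+i}$ represents the perturbation induced by function approximation errors. The solution to this regularized optimization problem can be obtained analytically similar to (10),

$$\tilde{\alpha}^{k} = \frac{\left(\tilde{\Delta}_k^{T} \tilde{\Delta}_k + \lambda I\right)^{-1} \mathbf{1}}{\mathbf{1}^{T} \left(\tilde{\Delta}_k^{T} \tilde{\Delta}_k + \lambda I\right)^{-1} \mathbf{1}}, \tag{14}$$

where $\lambda$ is a positive scalar representing the scale of regularization. $\tilde{\Delta}_k = [\tilde{\delta}_{k-m+1}, \dots, \tilde{\delta}_k]$ is the sampled Bellman residuals matrix with $\tilde{\delta}_i = \mathcal{T} Q^{\pi_i} - Q^{\pi_i} + e_i$ evaluated on the sampled state-action pairs.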
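The closed form of the regularized coefficients can be checked with a short Lagrangian argument. The block below is a sketch of the standard argument, not the derivation from the supplementary material. It assumes the squared $\ell_2$ objective, writes $\tilde{\Delta}$ for the matrix whose columns are the perturbed, sampled Bellman residuals, $\lambda > 0$ for the regularization scale, and introduces a multiplier $\mu$ for the sum-to-one constraint.

```latex
\begin{aligned}
&\min_{\alpha \in \mathbb{R}^{m}} \; \|\tilde{\Delta}\alpha\|_2^{2} + \lambda \|\alpha\|_2^{2}
  \quad \text{s.t.} \quad \mathbf{1}^{T}\alpha = 1, \\
&\mathcal{L}(\alpha, \mu) = \alpha^{T}\left(\tilde{\Delta}^{T}\tilde{\Delta} + \lambda I\right)\alpha
  - 2\mu\left(\mathbf{1}^{T}\alpha - 1\right), \\
&\nabla_{\alpha}\mathcal{L} = 0
  \;\Rightarrow\; \left(\tilde{\Delta}^{T}\tilde{\Delta} + \lambda I\right)\alpha = \mu \mathbf{1}
  \;\Rightarrow\; \alpha = \mu \left(\tilde{\Delta}^{T}\tilde{\Delta} + \lambda I\right)^{-1}\mathbf{1}, \\
&\mathbf{1}^{T}\alpha = 1
  \;\Rightarrow\; \mu = \frac{1}{\mathbf{1}^{T}\left(\tilde{\Delta}^{T}\tilde{\Delta} + \lambda I\right)^{-1}\mathbf{1}}, \\
&\alpha = \frac{\left(\tilde{\Delta}^{T}\tilde{\Delta} + \lambda I\right)^{-1}\mathbf{1}}
  {\mathbf{1}^{T}\left(\tilde{\Delta}^{T}\tilde{\Delta} + \lambda I\right)^{-1}\mathbf{1}}.
\end{aligned}
```

Since $\lambda > 0$ makes $\tilde{\Delta}^{T}\tilde{\Delta} + \lambda I$ positive definite, the inverse always exists. Setting $\lambda = 0$ recovers the unregularized coefficients, provided the squared residuals matrix is invertible.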

In fact, the regularization term controls the norm of coefficient vector produced by RAA and reduces the impact of perturbation induced by function approximation errors, as shown analytically by the following proposition.

###### Proposition 1.

Consider two identical policy iterations $I_1$ and $I_2$ with function approximation. $I_1$ is implemented with regularized Anderson acceleration and takes into account approximation errors, whereas $I_2$ is only implemented with vanilla Anderson acceleration. Let $\tilde{\alpha}^{k}$ and $\alpha^{k}$ be the coefficient vectors of $I_1$ and $I_2$ respectively. Then, we have the following bounds

(15) |

###### Proof.

See Appendix A.2 of the supplementary material. ∎

From the above bounds, we can observe that regularization allows better control of the impact of function approximation errors, but also causes an inevitable gap between $\tilde{\alpha}^{k}$ and $\alpha^{k}$. Qualitatively, a large regularization scale $\lambda$ means less impact of function approximation errors. On the other hand, an overly large $\lambda$ leads to a very small norm of the coefficient vector $\tilde{\alpha}^{k}$, which means the coefficients for previous estimates are nearly identical. However, according to (10), equal coefficients are probably far from the optimum and thus result in a great performance loss of Anderson acceleration.
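The qualitative trade-off can be seen numerically. The sketch below computes regularized coefficients for increasing regularization scales on a random stand-in for the sampled residuals matrix; the data, the grid of scales and the function name `raa_coefficients` are assumptions made for illustration.

```python
import numpy as np

def raa_coefficients(Delta, lam):
    """Solve (Delta^T Delta + lam * I) x = 1 and normalize x to sum to one."""
    m = Delta.shape[1]
    x = np.linalg.solve(Delta.T @ Delta + lam * np.eye(m), np.ones(m))
    return x / x.sum()

rng = np.random.default_rng(0)
Delta = rng.normal(size=(128, 5))          # stand-in for sampled Bellman residuals
lams = [1e-3, 1e-1, 1e1, 1e3, 1e6]
norms = [np.linalg.norm(raa_coefficients(Delta, lam)) for lam in lams]
alpha_large = raa_coefficients(Delta, lams[-1])
```

The norm of the coefficient vector does not increase as the scale grows, and for a very large scale the coefficients approach the uniform vector with entries $1/m$, whose norm $1/\sqrt{m}$ is the smallest possible under the sum-to-one constraint. At that point the combination no longer adapts to the residuals, which is the performance loss described above.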

### 4.3 Implementation on off-policy deep reinforcement learning

As discussed in the last section, it is impossible to directly use policy iteration in very large continuous domains. To that end, most off-policy deep RL algorithms apply the mechanism underlying policy iteration to learn approximations to both the Q-value function and the policy. Instead of iterating policy evaluation and policy improvement to convergence, these off-policy algorithms alternate between optimizing two networks with stochastic gradient descent. For example, the actor-critic method is a well-known implementation of this mechanism. In this section, we show that RAA for policy iteration can be readily extended to existing off-policy deep RL algorithms for both discrete and continuous control tasks, with only a few modifications to the update of the critic.

#### Regularized Anderson acceleration for actor-critic

Consider a parameterized Q-value function $Q_\theta(s, a)$ and a tractable policy $\pi_\phi(a \mid s)$, whose parameters are $\theta$ and $\phi$. In the following, we first give the main results of RAA for actor-critic. Then, RAA is combined with Dueling-DQN and TD3 respectively.

Under the paradigm of off-policy deep RL (actor-critic), the RAA variant of policy iteration (11)-(12) reduces to the following Bellman equation

(16) |

where is the parameters of target network before update steps. Furthermore, to mitigate the instability resulting from drastic update step of Anderson acceleration, the following progressive Bellman equation (or progressive update) with RAA is used practically,

(17) |

where is a small positive coefficient.
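One plausible reading of the progressive update is a blend of two bootstrap values: the RAA-weighted combination over the stored target networks and the value from the most recent target network alone, with the small coefficient controlling how much of the accelerated term is used. The sketch below implements that reading. The blending formula, the function name `progressive_target`, and the names `alpha` and `beta` are assumptions made for illustration and should be checked against the original equations.

```python
import numpy as np

def progressive_target(rewards, next_values, alpha, beta=0.05, gamma=0.99):
    """Blend the RAA-combined bootstrap value with the most recent one.

    next_values has shape (m, batch): row i holds the bootstrap value of the
    next state under the i-th stored target network (last row = most recent).
    """
    accelerated = alpha @ next_values          # weighted combination, shape (batch,)
    latest = next_values[-1]
    return rewards + gamma * (beta * accelerated + (1.0 - beta) * latest)

next_values = np.array([[1.0, 2.0], [3.0, 4.0]])   # m = 2 stored estimates
alpha = np.array([0.25, 0.75])                      # coefficients summing to one
rewards = np.array([0.0, 1.0])
y_blend = progressive_target(rewards, next_values, alpha, beta=0.5, gamma=1.0)
y_plain = progressive_target(rewards, next_values, alpha, beta=0.0, gamma=1.0)
```

Under this reading, a coefficient of zero recovers the ordinary target and a coefficient of one applies the full accelerated step, so a small value limits how drastically each update can move the target.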

Generally, the loss function of critic is then formulated as the following squared consistency error of Bellman equation,

(18) |

where is the distribution of previously sampled transitions, or a replay buffer. The target value of Q-value function or critic is represented by .

**RAA-TD3.** For the case of TD3, where an actor and two critics are learned for the deterministic policy and the Q-value function respectively, the implementation of RAA is more complicated. Specifically, two critics are simultaneously trained with clipped double Q-learning. Then, the target values for RAA-TD3 are given by

(19) |

where .

#### Adaptive restart

The idea of restarting an algorithm is well known in the numerical analysis literature. Vanilla Anderson acceleration has shown substantial improvements when combined with periodic restarts Henderson and Varadhan (2019), where one periodically starts the acceleration scheme anew by only using information from the most recent iteration. In this section, to alleviate the problem that deep RL is notoriously prone to being trapped in local optima, we propose an adaptive restart strategy for our RAA method.

During the training of actor-critic with RAA, periodic restart checks are enforced to clear the memory before the iteration completely crashes. More explicitly, the iteration is restarted whenever the average squared residual of the current period exceeds the average squared residual of the last period. A complete description of RAA-Dueling-DQN is summarized in Algorithm 1, and RAA-TD3 is given in Appendix B of the supplementary material.
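The restart rule described above can be sketched as a small bookkeeping class. The class name `RestartMonitor`, the fixed period length, and the choice to keep the most recent stored estimate after clearing the memory are assumptions made for illustration; the algorithm listings in the paper remain the authoritative description.

```python
from collections import deque

class RestartMonitor:
    """Clears the memory of stored estimates when the average squared residual grows."""

    def __init__(self, period, memory_size):
        self.period = period
        self.memory = deque(maxlen=memory_size)   # stored estimates (e.g. target parameters)
        self.current = []                         # squared residuals in the current period
        self.last_avg = None

    def store(self, estimate):
        self.memory.append(estimate)

    def record(self, squared_residual):
        """Record one residual; return True if a restart is triggered at a period boundary."""
        self.current.append(float(squared_residual))
        if len(self.current) < self.period:
            return False
        avg = sum(self.current) / len(self.current)
        self.current = []
        restarted = self.last_avg is not None and avg > self.last_avg
        if restarted and self.memory:
            latest = self.memory[-1]              # assumption: keep only the newest estimate
            self.memory.clear()
            self.memory.append(latest)
        self.last_avg = avg
        return restarted

monitor = RestartMonitor(period=2, memory_size=3)
for name in ["a", "b", "c"]:
    monitor.store(name)
flags = [monitor.record(r) for r in [1.0, 1.0, 0.5, 0.5, 2.0, 2.0]]
```

In this toy run the first two periods have decreasing average residuals, so nothing happens; the third period's average exceeds the second's, which triggers a restart and leaves only the latest stored estimate in memory.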

## 5 Experiments

In this section, we present our experimental results and discuss their implications. We first give a detailed description of the environments (Atari 2600 and MuJoCo) used to evaluate our methods. Then, we report results on both discrete and continuous control tasks. Finally, we provide an ablative analysis for the proposed methodology. All default hyperparameters used in these experiments are listed in Appendix C of the supplementary material.

### 5.1 Experimental setup

**Atari 2600.** For discrete control tasks, we perform experiments in the Arcade Learning Environment. We select four games (Breakout, Enduro, Qbert and SpaceInvaders) varying in their difficulty of convergence. The agent receives stacked grayscale images as inputs, as described in Mnih et al. (2015).

**MuJoCo.** For continuous control tasks, we conduct experiments in environments built on the MuJoCo physics engine. We select a number of control tasks to evaluate the performance of the proposed methodology and the baseline methods. In each task, the agent takes a vector of physical states as input, and generates an action to manipulate the robots in the environment.

### 5.2 Comparative evaluation

To evaluate our RAA variant method, we select Dueling-DQN and TD3 as the baselines for discrete and continuous control tasks, respectively. Please note that we do not select DDPG as the baseline for continuous control tasks, as DDPG performs poorly in difficult control tasks such as robotic manipulation. Figure 1 shows the total average return of evaluation rollouts during training for Dueling-DQN, TD3 and their RAA variants. We train five and seven different instances of each algorithm for Atari 2600 and MuJoCo, respectively. Besides, each baseline and its corresponding RAA variant are trained with the same set of random seeds and evaluated every 10000 environment steps, where each evaluation reports the average return over ten different rollouts.

The results in Figure 1 show that, overall, RAA variants outperform the corresponding baselines by a large margin on most tasks, such as HalfCheetah-v2, and perform comparably to them on easier tasks such as Enduro in terms of learning speed, which indicates that RAA is a feasible method to make existing off-policy RL algorithms more sample efficient. In addition to the direct benefit of acceleration mentioned above, we also observe that our RAA variants demonstrate superior or comparable final performance to the baseline methods in all tasks. In fact, RAA-Dueling-DQN can be seen as a weighted variant of Averaged-DQN Anschel et al. (2017), which can effectively reduce the variance of approximation error in the target values and thus shows improved performance. In summary, our approach brings an improvement in both the learning speed and final performance.

### 5.3 Ablation studies

The results in the previous section suggest that our RAA method can improve the sample efficiency of existing off-policy RL algorithms. In this section, we further examine how sensitive our approach is to the scaling of regularization. We also perform ablation studies to understand the contribution of each individual component: progressive update and adaptive restart. Additionally, we analyze the impact of different number of previous estimates and compare the behavior of our proposed RAA method over different learning rates.

**Regularization scale.** Our approach is sensitive to the regularization scale $\lambda$, because it controls the norm of the coefficient vector and reduces the impact of approximation error. According to Proposition 1, a larger regularization magnitude implies less impact of approximation error, but an overly large regularization will make the coefficients nearly identical and thus result in substantial degradation of acceleration performance. Figure 2 shows how learning performance changes on discrete control tasks when the regularization scale is varied, and the results are consistent with this conclusion. For continuous control tasks, it is difficult to draw the same conclusion because the bias induced by sampling a minibatch dominates function approximation errors. Additional learning curves on continuous control tasks can be found in Appendix D of the supplementary material.

**Progressive update and adaptive restart.** This experiment compares our proposed approach with: (i) RAA without progressive update (no progressive); (ii) RAA without adaptive restart (no restart); (iii) RAA with neither progressive update nor adaptive restart (no progressive and no restart). Figure 3 shows comparative learning curves on continuous control tasks. Although the significance of each component varies from task to task, we see that using progressive update is essential for reducing the variance on all four tasks. A consistent conclusion can also be drawn from Figure 1. Moreover, adding adaptive restart marginally improves the performance. Additional results on discrete control tasks can be found in Appendix D of the supplementary material.

**The number of previous estimates $m$.** In our experiments, the number of previous estimates $m$ is set to 5. In fact, there is a tradeoff between performance and computational cost. Fig. 4 shows the results of RAA-TD3 using different $m$ on the Walker2d task. Overall, we can conclude that larger $m$ leads to faster convergence and better final performance, but the improvement becomes small when $m$ exceeds a threshold. In practice, we suggest taking into account the available computing resources and sample efficiency when applying our proposed RAA method to other works.

**Learning rate.** To compare the behavior of our proposed RAA method over different learning rates, we perform additional experiments on the Walker2d task, and the results of TD3 and our RAA-TD3 are shown in Fig. 5. Overall, the improvement of our method is consistent across all learning rates, though the performance of both TD3 and our RAA-TD3 is poor under non-optimal learning rates, and the improvement is more significant when the learning rate is smaller. Moreover, the consistent improvement in performance indicates that our proposed RAA method is effective and robust.

## 6 Conclusion

In this paper, we presented a general acceleration method for existing deep reinforcement learning (RL) algorithms. The main idea is drawn from regularized Anderson acceleration (RAA), which is an effective approach to speeding up the solution of fixed-point problems with perturbations. Our theoretical results explain that vanilla Anderson acceleration can be directly applied to policy iteration under a tabular setting. Furthermore, RAA is extended to model-free deep RL by introducing an additional regularization term. Two rigorous bounds on the coefficient vector demonstrate that the regularization term controls the norm of the coefficient vector produced by RAA and reduces the impact of perturbation induced by function approximation errors. Moreover, we verified that the proposed method can significantly accelerate off-policy deep RL algorithms such as Dueling-DQN and TD3. The ablation studies show that the progressive update and adaptive restart strategies can enhance the performance. For future work, how to combine Anderson acceleration or its variants with on-policy deep RL is an exciting avenue.

## Acknowledgments

Gao Huang is supported in part by Beijing Academy of Artificial Intelligence (BAAI) under grant BAAI2019QN0106 and Tencent AI Lab Rhino-Bird Focused Research Program under grant JR201914. This research is supported by the National Science Foundation of China (NSFC) under grant 41427806.

Supplementary Material

## Appendix A Proofs

### A.1 Solution to Anderson Acceleration

### A.2 Proof of Proposition 1

###### Proof.

We begin by the bound on . Indeed, with (14),

(22) | ||||

(23) | ||||

(24) | ||||

(25) | ||||

(26) | ||||

(27) |

where the last inequality is because , we have .

We will bound from now on. Let be the dual variable of the equality constraint in (13), then and should satisfy the KKT system

(28) |

Expanding the LHS of (28), we obtain

(29) |

The explicit solution is obtained by inverting the block matrix, and is written

(31) |

Then, we can bound the norm of by

(32) | ||||

(33) | ||||

(34) |

which is the desired result. ∎

## Appendix B RAA-TD3

## Appendix C Hyperparameters

Hyperparameters | Value |
---|---|
**Network** | |
channels | 32, 64, 64 |
filter size | |
stride | 4, 2, 1 |
Val: (hidden units, output units) | (512, 1) |
Adv: (hidden units, output units) | (512, action dimensions) |
**Shared** | |
optimizer | RMSprop |
start time steps | 5 |
discount factor | 0.99 |
replay buffer size | 10 |
batch size | 32 |
frames stacked | 4 |
action repetitions | 4 |
learning rate | 0.00025 |
**RAA-Dueling-DQN** | |
progressive coefficient () | 0.05 |
sample size for RAA () | 128 |
regularization scale | 0.1 |
number of previous estimates | 5 |
target update interval | 2000 |
**Dueling-DQN** | |
target update interval | 10000 |

Hyperparameters | Value |
---|---|
**Network** | |
Critic: hidden units | 400, 300 |
output units | 1 |
Actor: hidden units | 400, 300 |
output units | action dimensions |
**Shared** | |
optimizer | Adam |
start time steps | 10 |
discount factor | 0.99 |
replay buffer size | 10 |
batch size | 100 |
exploration noise | 0.1 |
target update rate () | |
actor update frequency | 2 |
exploration policy | |
**RAA-TD3** | |
progressive coefficient () | 0.1 |
sample size for RAA () | 400 |
regularization scale | 0.001 |
number of previous estimates | 5 |
**TD3** | |

## Appendix D Additional Learning Curves

### Footnotes

- Notice that we don’t impose a positivity condition on the coefficients.

### References

- Anschel et al. (2017). Averaged-DQN: variance reduction and stabilization for deep reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70, pp. 176–185.
- Bertsekas and Tsitsiklis (1996). Neuro-dynamic programming. Vol. 5, Athena Scientific, Belmont, MA.
- Buckman et al. (2018). Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, pp. 8224–8234.
- Chebotar et al. (2017). Path integral guided policy search. In 2017 IEEE International Conference on Robotics and Automation, pp. 3381–3388.
- Deisenroth and Rasmussen (2011). PILCO: a model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning, pp. 465–472.
- Fujimoto et al. (2018). Addressing function approximation error in actor-critic methods. In Proceedings of the 35th International Conference on Machine Learning, Vol. 80, pp. 1587–1596.
- Geist and Scherrer (2018). Anderson acceleration for reinforcement learning. arXiv preprint arXiv:1809.09501.
- Granas and Dugundji (2013). Fixed point theory. Springer Science & Business Media.
- Greensmith et al. (2004). Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research 5 (Nov), pp. 1471–1530.
- Gu et al. (2017). Q-Prop: sample-efficient policy gradient with an off-policy critic. In Proceedings of the International Conference on Learning Representations.
- Haarnoja et al. (2018). Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, Vol. 80, pp. 1861–1870.
- Henderson and Varadhan (2019). Damped Anderson acceleration with restarts and monotonicity control for accelerating EM and EM-like algorithms. Journal of Computational and Graphical Statistics, pp. 1–42.
- Levine and Abbeel (2014). Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, pp. 1071–1079.
- Lillicrap et al. (2016). Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations.
- Lin (1993). Reinforcement learning for robots using neural networks. Technical report, Carnegie-Mellon Univ Pittsburgh PA School of Computer Science.
- Learning to navigate in complex environments. In Proceedings of the International Conference on Learning Representations, Cited by: §1.
- Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: §1, §1, §5.1.
- Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, pp. 3303–3313. Cited by: §1.
- High-dimensional continuous control using generalized advantage estimation. In Proceedings of the International Conference on Learning Representations, Cited by: §2.
- Regularized nonlinear acceleration. In Advances In Neural Information Processing Systems, pp. 712–720. Cited by: §1, §2.
- Multi pseudo q-learning-based deterministic policy gradient for tracking control of autonomous underwater vehicles. IEEE Transactions on Neural Networks and Learning Systems, pp. 3534–3546. Cited by: §1.
- Soft policy gradient method for maximum entropy deep reinforcement learning. In Proceedings of the International Joint Conference on Artificial Intelligence, pp. 3425–3431. Cited by: §1.
- Mastering the game of go with deep neural networks and tree search. Nature 529 (7587), pp. 484. Cited by: §1.
- Learning to predict by the methods of temporal differences. Machine learning 3 (1), pp. 9–44. Cited by: §2.
- Mujoco: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §1.
- Convergence analysis for anderson acceleration. SIAM Journal on Numerical Analysis 53 (2), pp. 805–819. Cited by: §1, §2.
- Deep reinforcement learning with double q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, Cited by: §1.
- Anderson acceleration for fixed-point iterations. SIAM Journal on Numerical Analysis 49 (4), pp. 1715–1735. Cited by: §1, §2.
- Sample efficient actor-critic with experience replay. In Proceedings of the International Conference on Learning Representations, Cited by: §2.
- Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, Vol. 48, pp. 1995–2003. Cited by: §1, §1.
- Information theoretic mpc for model-based reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation, pp. 1714–1721. Cited by: §2.
- Interpolatron: interpolation or extrapolation schemes to accelerate optimization for deep neural networks. arXiv preprint arXiv:1805.06753. Cited by: §2.