# Mean Actor-Critic

###### Abstract

We propose a new algorithm, Mean Actor-Critic (MAC), for discrete-action continuous-state reinforcement learning. MAC is a policy gradient algorithm that uses the agent’s explicit representation of all action values to estimate the gradient of the policy, rather than using only the actions that were actually executed. We prove that this approach reduces variance in the policy gradient estimate relative to traditional actor-critic approaches. We show empirical results on two control domains and six Atari games, where MAC is competitive with state-of-the-art policy search methods.

Mean Actor-Critic

Cameron Allen^{†}^{†}thanks: These authors contributed equally. Please send correspondence to Cameron Allen <csal@brown.edu> and Kavosh Asadi <kavosh@brown.edu>. Kavosh Asadi Melrose Roderick Abdel-rahman Mohamed
George Konidaris Michael Littman
Brown University
Amazon
Providence, RI
Seattle, WA

## Introduction

In reinforcement learning (RL), two important classes of algorithms are value-function-based methods and policy search methods. Value-function-based methods maintain an estimate of the value of performing each action in each state, and choose the actions associated with the most value in their current state (?). By contrast, policy search algorithms maintain an explicit policy, and agents draw actions directly from that policy to interact with their environment (?). A subset of policy search algorithms, policy gradient methods, represent the policy using a differentiable parameterized function approximator (for example, a neural network) and use stochastic gradient ascent to update its parameters to achieve more reward.

To facilitate gradient ascent, the agent interacts with its environment according to the current policy and keeps track of the outcomes of its actions. From these (potentially noisy) sampled outcomes, the agent estimates the gradient of the objective function. A critical question here is how to compute an accurate gradient using these samples, which may be costly to acquire, while using as few sample interactions as possible.

Actor-critic algorithms compute the policy gradient using a learned value function to estimate expected future reward (?; ?). Since the expected reward is a function of the environment’s dynamics, which the agent does not know, it is typically estimated by executing the policy in the environment. Existing algorithms compute the policy gradient using the value of states the agent visits, and critically, these methods take into account only the actions the agent actually executes during environmental interaction.

We propose a new policy gradient algorithm, Mean Actor-Critic (or MAC), for the discrete-action continuous-state case. MAC uses the agent’s policy distribution to average the value function over all actions, rather than using the action-values of only the sampled actions. We prove that, under modest assumptions, this approach reduces variance in the policy gradient estimates relative to traditional actor-critic approaches. We implement MAC using deep neural networks, and we show empirical results on two control domains and six Atari games, where MAC is competitive with state-of-the-art policy search methods.

We note that the core idea behind MAC has also been independently and concurrently explored by ? (?). However, their results mainly focus on continuous action spaces and are more theoretical. We introduce a simpler proof of variance reduction that makes fewer assumptions, and we also show that the algorithm works well in discrete-action domains.

## Background

In RL, we train an agent to select actions in its environment so that it maximizes some notion of long-term reward. We formalize the problem as a Markov decision process (MDP) (?), which we specify by the tuple , where is a set of states, is a fixed initial state, is a set of discrete actions, the functions and respectively describe the reward and transition dynamics of the environment, and is a discount factor representing the relative importance of immediate versus long-term rewards.

More concretely, we denote the expected reward for performing action in state as:

and we denote the probability that performing action in state results in state as:

In the context of policy search methods, the agent maintains an explicit policy denoting the probability of taking action in state under the policy parameterized by . Note that for each state, the policy outputs a probability distribution over the discrete set of actions: . At each timestep , the agent takes an action drawn from its policy , then the environment provides a reward signal and transitions to the next state .

The agent’s goal at every timestep is to maximize the sum of discounted future rewards, or simply return, which we define as:

In a slight abuse of notation, we will also denote the total return for a trajectory as , which is equal to for that same trajectory.

The agent’s policy induces a value function over the state space. The expression for return allows us to define both a state value function, , and a state-action value function, . Here, represents the expected return starting from state , and following the policy thereafter, and represents the expected return starting from , executing action , and then following the policy thereafter:

Note that:

The agent’s goal is to find a policy that maximizes the return for every timestep, so we define an objective function that allows us to score an arbitrary policy parameter :

where denotes a trajectory. Note that the probability of a specific trajectory depends on policy parameters as well as the dynamics of the environment. Our goal is to be able to compute the gradient of with respect to the policy parameters :

(1) | |||||

where is the discounted state distribution. In the second and third lines we rewrite the gradient term using a score function. In the fourth line, we convert the summation to an expectation, and use the notation in place of . Next, we make use of the fact that , given by ? (?). Intuitively this makes sense, since the policy for a given state should depend only on the rewards achieved after that state. Finally, we invoke the definition that .

A nice property of expectation (1) is that, given access to , the expectation can be estimated through implementing policy in the environment. Alternatively, we can estimate using the return , which is an unbiased (and usually a high variance) sample of . This is essentially the idea behind the REINFORCE algorithm (?), which uses the following gradient estimator:

(2) |

Alternatively, we can estimate using some sort of function approximation: , which results in variants of actor-critic algorithms. Perhaps the simplest actor-critic algorithm approximates (1) as follows:

(3) |

Note that value function approximation can, in general, bias the gradient estimation (?).

One way of reducing variance in both REINFORCE and actor-critic algorithms is to use an additive control variate as a baseline (?; ?; ?). The baseline function is typically a function that is fixed over actions, and so subtracting it from either the sampled returns or the estimated Q-values does not bias the gradient estimation. We refer to techniques that use such a baseline as advantage variations of the basic algorithms, since they approximate the advantage of choosing action over some baseline representing “typical” performance for the policy in state (?). The update performed by advantage REINFORCE is:

where is a scalar baseline measuring the performance of the policy, such as a running average of the observed return over the past few episodes of interaction.

Advantage actor-critic uses an approximation of the expected value of each state as its baseline: , which leads to the following update rule:

Another way of estimating the advantage function is to use the TD-error signal . This approach is convenient, because it only requires estimating one set of parameters, namely for . However, because the TD-error is a sample of the advantage function , this approach has higher variance (due to the environmental dynamics) than methods that explicitly compute . Moreover, given and , can easily be computed as , so in practice, it is still only necessary to estimate one set of parameters (for ).

## Mean Actor-Critic

An overwhelming majority of recent actor-critic papers have computed the policy gradient using an estimate similar to Equation (3) (?; ?; ?). This estimate samples both states and actions from trajectories executed according to the current policy in order to compute the gradient of the objective function with respect to the policy weights.

Instead of using only the sampled actions, Mean Actor-Critic (MAC) explicitly computes the probability-weighted average over all Q-values, for each state sampled from the trajectories. In doing so, MAC is able to produce an estimate of the policy gradient where the variance due to action sampling is reduced to zero. This is exactly the difference between computing the sample mean (whose variance is inversely proportional to the number of samples), and calculating the mean directly (which is simply a scalar with no variance).

MAC is based on the observation that expectation (1), which we repeat here, can be rewritten in the following way:

(4) |

We can estimate (4) by sampling states from a trajectory and using function approximation:

In our implementation, the inner summation is computed by combining two neural networks that represent the policy and state-action value function. The value function can be learned using a variety of methods, such as temporal-difference learning or Monte Carlo sampling. After performing a few updates to the value function, we update the parameters of the policy with the following update rule:

(5) |

To improve stability, repeated updates to the value and policy networks are interleaved, as in Generalized Policy Iteration (?).

In traditional actor-critic approaches, which we refer to as sampled-action actor-critic, the only actions involved in the computation of the policy gradient estimate are those that were actually executed in the environment. In MAC, computing the policy gradient estimate will frequently involve actions that were not actually executed in the environment. This results in a trade-off between bias and variance. In domains where we can expect accurate Q-value predictions from our function approximator, despite not actually executing all of the relevant state-action pairs, MAC results in lower variance gradient updates and increased sample-efficiency. In domains where this assumption is not valid, MAC may perform worse than sampled-action actor-critic due to increased bias.

In some ways, MAC is similar to Expected Sarsa (?). Expected Sarsa considers all next-actions , then computes the expected TD-error, , and uses the resulting error signal to update the function. By contrast, MAC considers all current-actions , and uses the corresponding values to update the policy directly.

It is natural to consider whether MAC could be improved by subtracting an action-independent baseline, as in sampled-action actor-critic and REINFORCE:

However, we can simplify the expectation as follows:

In doing so, we see that both and the gradient operator can be moved outside of the summation, leaving just the sum of the action probabilities, which is always 1, and hence the gradient of the baseline term is always zero. This is true regardless of the choice of baseline, since the baseline cannot be a function of the actions or else it will bias the expectation. Thus, we see that subtracting a baseline is unnecessary in MAC, since it has no effect on the policy gradient estimate.

## Analysis of Bias and Variance

In this section we prove that MAC does not increase variance over sampled-action actor-critic (AC), and also, that given a fixed , both algorithms have the same bias. We start with the bias result.

### Theorem 1

If the estimated Q-values, , for both MAC and AC are the same in expectation, then the bias of MAC is equal to the bias of AC.

### Proof

See Appendix A.

This result makes sense because in expectation, AC will choose all of the possible actions with some probability according to the policy. MAC simply calculates this expectation over actions explicitly. We now move to the variance result.

### Theorem 2

If the estimated Q-values, , for both MAC and AC are the same in expectation, and if is independent of for , then . For deterministic policies, there is equality, and for stochastic policies the inequality is strict.

### Proof

See Appendix B.

Intuitively, we can see that for cases where the policy is deterministic, MAC’s formulation of the policy gradient is exactly equivalent to AC, and hence we can do no better than AC. For high-entropy policies, MAC will beat AC in terms of variance.

## Experiments

This section presents an empirical evaluation of MAC across three different problem domains. We first evaluate the performance of MAC versus popular policy gradient benchmarks on two classic control problems. We then evaluate MAC on a subset of Atari 2600 games and investigate its performance compared to state-of-the-art policy search methods.

### Classic Control Experiments

In order to determine whether MAC’s lower variance policy gradient estimate translates to faster learning, we chose two classic control problems, namely Cart Pole and Lunar Lander, and compared MAC’s performance against four standard sampled-action policy gradient algorithms. We used the open-source implementations of Cart Pole and Lunar Lander provided by OpenAI Gym (?), in which both domains have continuous state spaces and discrete action spaces. Screenshots of the two domains are provided in Figure 1.

Algorithm | Cart Pole | Lunar Lander |
---|---|---|

REINFORCE | ||

Adv. REINFORCE | ||

Actor-Critic | ||

Adv. Actor-Critic | ||

MAC |

For each problem domain, we implemented MAC using two independent neural networks, representing the policy and Q function. We then performed a hyperparameter search to determine the best network architectures, optimization method, and learning rates. Specifically, the hyperparameter search considered: 0, 1, 2, or 3 hidden layers; 50, 75, 100, or 300 neurons per layer; ReLU, Leaky ReLU (with leak factor 0.3), or tanh activation; SGD, RMSProp, Adam, or Adadelta as the optimization method; and a learning rate chosen from 0.0001, 0.00025, 0.0005, 0.001, 0.005, 0.01, or 0.05. To find the best setting, we ran 10 independent trials for each combination of hyperparameters and chose the setting with the best asymptotic performance over the 10 trials. We terminated each episode after 200 and 1000 timesteps (in Cart Pole and Lunar Lander, respectively), regardless of the state of the agent.

We compared MAC against four standard benchmarks: REINFORCE, advantage REINFORCE, actor-critic, and advantage actor-critic. We implemented the REINFORCE benchmarks using just a single neural network to represent the policy, and we implemented the actor-critic benchmarks using two networks to represent both the policy and Q function. For each benchmark algorithm, we then performed the same hyperparameter search that we had used for MAC.

In order to keep the variance as low as possible for the advantage actor-critic benchmark, we explicitly computed the advantage function , where , rather than sampling it using the TD-error (see Section 2).

Once we had determined the best hyperparameter settings for MAC and each of the benchmark algorithms, we then ran each algorithm for 100 independent trials. Figure 2 shows learning curves for the different algorithms, and Table 1 summarizes the results using the mean performance over trials and episodes. On Cart Pole, MAC learns substantially faster than all of the benchmarks, and on Lunar Lander, it performs competitively with the best benchmark algorithm, advantage actor-critic.

### Atari Experiments

To test whether MAC can scale to larger problem domains, we evaluated it on several Atari 2600 games using the Arcade Learning Environment (ALE) (?) and compared MAC’s performance against that of state-of-the-art policy search methods, namely, Trust Region Policy Optimization (TRPO) (?), Evolutionary Strategies (ES) (?), and Advantage Actor-Critic (A2C) (?). Due to the computational load inherent in training deep networks to play Atari games, we limited our experiments to a subset of six Atari games: Beamrider, Breakout, Pong, Q*bert, Seaquest and Space Invaders. These six games are commonly selected for tuning hyperparameters (?; ?; ?), and thus provide a fair comparison against established benchmarks, despite our limited computational resources.

The MAC network architecture was derived from the OpenAI Baselines implementation of A2C (?). It uses three convolutional layers (size/stride/filters: 8/4/32, 4/2/64, 3/1/64), followed by a fully-connected layer (size 512), all with ReLU activation. A final fully-connected layer is split into two batches of N outputs each, where N is the number of actions. One batch uses a linear activation and corresponds to the Q-values; the other batch uses a softmax activation and corresponds to the policy. We used this architecture for both the MAC results and the A2C results. The TRPO and ES results are taken from their respective papers.

Game | Random | TRPO | ES | A2C | MAC |
---|---|---|---|---|---|

Beam Rider | 363.9 | 1425.2 | 744.0 | 5846.0 | 6072.0 |

Breakout | 1.7 | 10.8 | 9.5 | 370.9 | 372.7 |

Pong | -20.7 | 20.9 | 21.0 | 18.0 | 10.6 |

Q*bert | 183.0 | 1973.5 | 147.5 | 1651.5 | 243.4 |

Seaquest | 68.4 | 1908.6 | 1390.0 | 1702.5 | 1703.4 |

Space Invaders | 148.0 | 568.4 | 678.5 | 1201.2 | 1173.1 |

We trained the network using a variation of the multi-part loss function used in A2C (?). The value loss at each timestep was equal to the mean squared error between the observed reward and the Q-value of the selected action. The policy entropy loss was simply the negative entropy of the policy at each timestep. For the A2C experiments, the policy improvement loss was the negative log probability of the selected action times its advantage value. For the MAC experiments, the policy improvement loss became the negative sum of action probabilities times their associated Q-values. The overall loss function was a linear combination of the policy improvement loss (coefficient 0.1), policy entropy loss (coefficient 0.001), and value loss (coefficient 0.5), and the network was trained using RMSProp with a learning rate of 1.5e-3. These coefficients trade off the importance of learning good Q-values, improving the policy, and preventing the policy from converging prematurely. This configuration of hyperparameters was found to perform well experimentally for both methods after a small hyperparameter search. The only difference between the A2C and MAC implementations was to replace A2C’s sampled-action policy improvement loss with MAC’s sum-over-actions loss; the algorithms used exactly the same architecture and hyperparameters.

For A2C and MAC, we trained a network for each game on 50 million frames of play, across 16 parallel threads, pausing every 200K frames to evaluate performance and compute learning curves. In each evaluation, we ran 16 agents in parallel, for 4500 frames (5 minutes) each, or 50 total episodes, whichever came first, and averaged the scores of the completed (or timed-out) episodes. Agents were trained and evaluated under the typical random start condition, where the game is initialized with a random number of no-op ALE actions (between 0 and 30) (?). The A2C and MAC results in Table 2 come from the final evaluation after all 50M frames, and they are averaged across 5 trials involving separately trained networks. Learning curves for each game can be found in Figure 3 in the Appendix. In addition to A2C, we also compared MAC against TRPO (results from a single trial) (?), and ES (results averaged over 30 trials) (?), and found that MAC performed competitively with all three benchmark algorithms.

Note that MAC’s performance on Pong and Q*bert was low relative to A2C. For Pong this was due to one of the five MAC trials obtaining a final score of -20.1 and pulling the average performance down significantly. The individual Pong scores for MAC were {20.5, 19.7, 18.3, 14.7, -20.1}; the scores for A2C were {19.4, 19.4, 19.3, 16.3, 15.6}. For Q*bert, the performance for both algorithms was much more variable. A2C scored 0.0 on 3 out of 5 trials, and MAC scored 0.0 on 2 out of 5 trials. The reason A2C’s average score is so much higher than MAC’s is that it had one lucky trial where it scored 7780.9 points. The individual Q*bert scores for MAC were {557.4, 504.7, 155.1, 0.0, 0.0}; the scores for A2C were {7780.9, 476.6, 0.0, 0.0, 0.0}. Additional hyperparameter tuning might lead to improved performance; however, the purpose of this Atari experiment was mainly to show that MAC is competitive with state-of-the-art policy search algorithms, and these results seem to indicate that it is.

## Discussion

At its core, MAC offers a new way of computing the policy gradient that can substantially reduce variance and increase learning speed. There are a number of orthogonal improvements to policy gradient methods, such as using natural gradients (?; ?), off-policy learning (?; ?; ?), second-order methods (?), and asynchronous exploration (?). We have not investigated how MAC performs with these extensions; however, just as these improvements were added to basic actor-critic methods, they could be added to MAC as well, and we expect they would improve its performance in a similar way.

A typical use-case for actor-critic algorithms is for problem domains with continuous actions, which are awkward for value-function-based methods (?). One approach to dealing with continuous actions is Deterministic Policy Gradients (DPG) (?; ?), which uses a deterministic policy to perform off-policy policy gradient updates. However, in settings where on-policy learning is necessary, using a deterministic policy leads to sub-optimal behavior (?), and hence a stochastic policy is typically used instead. The recently-introduced Expected Policy Gradients (EPG) (?) addresses this problem by generalizing DPG for stochastic policies. However, while EPG has good experimental performance on domains with continuous action spaces, the authors do not provide experimental results for discrete domains. MAC’s discrete results and EPG’s continuous results are in some sense complementary.

## Conclusion

The basic formulation of policy gradient estimators presented here—where the gradient is estimated by averaging the state-action value function across actions—leads to a new family of actor-critic algorithms. This family has the advantage of not requiring an additional variance-reduction baseline, substantially reducing the design effort required to apply them. It is also a natural fit with deep neural network function approximators, resulting in a network architecture that is identical to some sampled-action actor-critic algorithms, but with less variance.

We prove that for stochastic policies, the MAC algorithm (the simplest member of the resulting family), reduces variance relative to traditional actor-critic approaches, while maintaining the same bias. Our neural network implementation of MAC either outperforms, or is competitive with, state-of-the-art policy search algorithms, and our experimental results show that MAC’s lower variance lead to dramatically faster training in some cases. In future work, we aim to develop this family of algorithms further by including typical elaborations of the basic actor-critic architecture like natural or second-order gradients. Our results so far suggest that our new approach is highly promising, and that extensions to it will provide even further improvement in performance.

## References

- [Asadi and Williams 2016] Asadi, K., and Williams, J. D. 2016. Sample-efficient deep reinforcement learning for dialog control. arXiv preprint arXiv:1612.06000.
- [Baird 1994] Baird, L. C. 1994. Reinforcement learning in continuous time: Advantage updating. In Neural Networks, 1994. IEEE World Congress on Computational Intelligence., 1994 IEEE International Conference on, volume 4, 2448–2453. IEEE.
- [Baxter and Bartlett 2001] Baxter, J., and Bartlett, P. L. 2001. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research 15:319–350.
- [Bellemare et al. 2013] Bellemare, M. G.; Naddaf, Y.; Veness, J.; and Bowling, M. 2013. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research 47:253–279.
- [Brockman et al. 2016] Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. Openai gym. CoRR abs/1606.01540.
- [Ciosek and Whiteson 2017] Ciosek, K., and Whiteson, S. 2017. Expected policy gradients. arXiv preprint arXiv:1706.05374.
- [Degris, White, and Sutton 2012] Degris, T.; White, M.; and Sutton, R. S. 2012. Off-policy actor-critic. arXiv preprint arXiv:1205.4839.
- [Furmston, Lever, and Barber 2016] Furmston, T.; Lever, G.; and Barber, D. 2016. Approximate newton methods for policy search in markov decision processes. Journal of Machine Learning Research 17(227):1–51.
- [Greensmith, Bartlett, and Baxter 2004] Greensmith, E.; Bartlett, P. L.; and Baxter, J. 2004. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research 5(Nov):1471–1530.
- [Gu et al. 2016] Gu, S.; Lillicrap, T.; Ghahramani, Z.; Turner, R. E.; and Levine, S. 2016. Q-prop: Sample-efficient policy gradient with an off-policy critic. arXiv preprint arXiv:1611.02247.
- [Jensen 1906] Jensen, J. L. W. V. 1906. Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta mathematica 30(1):175–193.
- [Kakade 2002] Kakade, S. M. 2002. A natural policy gradient. In Advances in neural information processing systems, 1531–1538.
- [Konda and Tsitsiklis 2000] Konda, V. R., and Tsitsiklis, J. N. 2000. Actor-critic algorithms. In Advances in neural information processing systems, 1008–1014.
- [Lillicrap et al. 2015] Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
- [Mnih et al. 2015] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529–533.
- [Mnih et al. 2016] Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T. P.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning.
- [Peters and Schaal 2008] Peters, J., and Schaal, S. 2008. Natural actor-critic. Neurocomputing 71(7):1180–1190.
- [Puterman 1990] Puterman, M. L. 1990. Markov decision processes. Handbooks in operations research and management science 2:331–434.
- [Salimans et al. 2017] Salimans, T.; Ho, J.; Chen, X.; and Sutskever, I. 2017. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864.
- [Schulman et al. 2015] Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; and Moritz, P. 2015. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 1889–1897.
- [Silver et al. 2014] Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; and Riedmiller, M. 2014. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), 387–395.
- [Sutton and Barto 1998] Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning: An Introduction, volume 1. MIT press Cambridge.
- [Sutton et al. 2000] Sutton, R. S.; McAllester, D. A.; Singh, S. P.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, 1057–1063.
- [Van Seijen et al. 2009] Van Seijen, H.; Van Hasselt, H.; Whiteson, S.; and Wiering, M. 2009. A theoretical and empirical analysis of expected sarsa. In Adaptive Dynamic Programming and Reinforcement Learning, 2009. ADPRL’09. IEEE Symposium on, 177–184. IEEE.
- [Wang et al. 2016] Wang, Z.; Bapst, V.; Heess, N.; Mnih, V.; Munos, R.; Kavukcuoglu, K.; and de Freitas, N. 2016. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224.
- [Williams 1992] Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8(3-4):229–256.
- [Wu et al. 2017] Wu, Y.; Mansimov, E.; Grosse, R. B.; Liao, S.; and Ba, J. 2017. Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. In Advances in neural information processing systems, 5285–5294.

## Appendix

### A. Proof of Theorem 1

Both AC and MAC are estimators of the true policy gradient (PG). Given a batch of data , we can write the bias of AC and MAC as:

(6) | ||||

(7) |

For clarity, we will rewrite the AC and MAC expectations (1) and (4) to explicitly denote the way that each algorithm estimates the policy gradient, given a batch of data (with size ):

AC | (8) | |||

MAC | (9) |

Substituting (8) and (1) into Eqn. (6) gives:

(10) |

Since is sampled from trajectories that were carried out according to the policy, we can drop the dependence on inside the expectation, and rewrite (10) as follows:

(11) | ||||

(12) | ||||

(13) |

Now we turn our attention to MAC, and substitute (9) and (1) into Eqn. (7), to obtain:

(14) | ||||

(15) | ||||

(16) | ||||

(17) |

Comparing (13) and (17), we see that AC and MAC have the same bias.

### B. Proof of Theorem 2

For any random variable , the variance can be written as:

If we assume the estimated Q-values for MAC and AC are the same in expectation, then the squared expectation’s contribution to the variance of each algorithm will be equal. We are only interested in determining which estimator has lower variance, so we can drop the second term and simply compare , the second moments.

Again we will employ the explicit definitions of the AC and MAC estimators, for a data set , given by (8) and (9), respectively.

For ease of notation, we define the following two functions:

(18) | ||||

(19) |

Here, represents a single parameter of the parameter vector . We consider an arbitrary choice of , so the following proof holds for all .

The above expressions allow us to rewrite the AC and MAC estimators (Eqn. 8 & 9) in terms of and :

(20) | ||||

(21) |

For convenience, we drop the subscript for the rest of this analysis.

Now we are ready to compare vs. .

By the assumption that is independent of for , we can distribute the expectation through in line 3 on the left, to obtain . In the last line, we can drop the second term in each expression, because they are the same. At this point we just need to compare vs. . In order to make this comparison, we make use of Jensen’s Inequality (?), which says that for a convex function and a vector :

We note that is convex, and as such, the following holds:

Thus, we can conclude that . Moreover, since is strictly convex, this inequality is strict as long as is not almost surely constant for a given state. That means for deterministic policies, we have , and for stochastic policies, .