An Alternative to Backpropagation in Deep Reinforcement Learning

Abstract

State-of-the-art deep learning algorithms mostly rely on gradient backpropagation to train a deep artificial neural network, which is generally regarded as biologically implausible. For a network of stochastic units trained on a reinforcement learning or supervised learning task, one biologically plausible way of learning is to train each unit by REINFORCE. In this case, only a global reward signal has to be broadcast to all units, and the resulting learning rule is local; it can be interpreted as the reward-modulated spike-timing-dependent plasticity (R-STDP) that is observed biologically. Although this learning rule follows the gradient of return in expectation, it suffers from high variance and cannot be used to train a deep network in practice. In this paper, we propose an algorithm called MAP propagation that reduces this variance significantly while retaining the local property of the learning rule. Unlike prior work on local learning rules (e.g. contrastive divergence), which mostly applies to undirected models in unsupervised learning tasks, our proposed algorithm applies to directed models in reinforcement learning tasks. We show that the newly proposed algorithm can solve common reinforcement learning tasks at a speed similar to that of backpropagation when applied to an actor-critic network.

1 Introduction

Gradient backpropagation Rumelhart et al. (1986) efficiently computes the gradient of an objective function with respect to parameters by iterating backward from the last layer of a multi-layer artificial neural network. However, backpropagation is generally regarded as being biologically implausible Crick (1989); Mazzoni et al. (1991); O’Reilly (1996); Bengio et al. (2015); Hassabis et al. (2017). First, the learning rule given by backpropagation is non-local, since it relies on information other than input and output of a neuron-like unit during feed-forward computation, while synaptic plasticity observed biologically depends mostly on local information (e.g. STDP Gerstner et al. (2014)) and possibly some global signals (e.g. R-STDP Gerstner et al. (2014); Florian (2007); Pawlak et al. (2010)). To address this issue, different local learning rules have been proposed. Energy-based models (e.g. the continuous Hopfield model Hopfield (1984)) can be trained by algorithms that are based on a local learning rule (e.g. contrastive Hebbian learning Movellan (1991), contrastive divergence Hinton (2002), equilibrium propagation Scellier and Bengio (2017)). Local learning rules have also been proposed for hierarchical auto-encoder networks Bengio et al. (2015). These proposed algorithms apply to undirected networks in unsupervised learning tasks. A second reason that backpropagation is generally regarded as being biologically implausible is that it relies on symmetric feedback connections for transmitting error signals. However, such symmetric connections have not been widely observed in biological systems. This is also one of the limitations for the learning rules mentioned above, except Bengio et al. (2015), which proposed learning the feedback weights instead of assuming symmetric feedback connections. A third reason backpropagation is considered biologically implausible is that learning by backpropagation requires separate feedforward and backpropagation phases, and it is unclear how a biological system can synchronize an entire network to precisely alternate between these two phases.

For a reinforcement learning task, a multi-layer artificial neural network can be trained efficiently by training the output unit with REINFORCE Williams (1992) while all hidden units are trained by backpropagation. If the hidden units are stochastic and continuous, we can still train them by backpropagation using the re-parameterization trick Weber et al. (2019). Another possible way of training a network of stochastic units is to apply REINFORCE to all units; it can be shown that the resulting learning rule is still an unbiased estimator of the gradient of the return Williams (1992). Another interpretation of training each unit by REINFORCE is to treat each unit as an actor Barto et al. (1983), with each of them trying to maximize the global reward. This idea is related to Klopf's hedonistic neuron hypothesis Klopf (1972, 1982).

REINFORCE, when applied on a Bernoulli-logistic unit, gives a three-factor learning rule which depends on a reinforcement signal, input and output of the unit. This is similar to R-STDP observed biologically, which depends on a neuromodulatory input, presynaptic spikes and postsynaptic spikes. It has been proposed that dopamine, a neurotransmitter, plays the role of neuromodulatory input in R-STDP and corresponds to the TD error concept from reinforcement learning. We refer readers to Chapter 15 of Sutton and Barto (2018) for a review and discussion of the connection between REINFORCE and neuroscience.

Though REINFORCE is simple and biologically plausible, if it is used to train all units in a network, it suffers from variance so high that it is impractical for all but the smallest networks. The first reason for the high variance is inefficient structural credit assignment: a single reward signal is responsible for guiding the learning of all units, and the correlation of a unit's output with the reward signal is difficult to estimate, since the reward signal is influenced by all units in the network and it is unknown how each unit influences it. The second reason is a non-stationary distribution: if we consider a multi-layer network and view each unit as a reinforcement learning agent, such as an actor unit in an actor-critic architecture, then to every individual actor on the second or higher layers, the dynamics of the environment are non-stationary, since the actors on lower layers are also learning. These reasons make REINFORCE much slower than backpropagation.

To address the high variance, we propose a new algorithm that significantly reduces the variance of the learning rule when training all units by REINFORCE. The idea is simple: we replace the values of all hidden units with the MAP estimate of them conditioned on the state and action before applying REINFORCE to all units. The update rule of the hidden units is local but still requires symmetrically weighted feedback connections. We call the newly proposed algorithm MAP (maximum a posteriori) propagation. We show how it can be applied to an actor-critic network with eligibility traces, where both the actor and critic networks are trained with similar local learning rules modulated by global signals. Our experiments show that MAP propagation can solve common reinforcement learning tasks at a speed similar to that of backpropagation when applied to an actor-critic network.

2 Background and Notation

We will consider a Markov decision process (MDP). An MDP is a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma, \mu_0)$, where $\mathcal{S}$ is the finite set of possible states of the environment (though this work extends to MDPs with an infinite state set), and the state at time $t$ is denoted by $S_t \in \mathcal{S}$. $\mathcal{A}$ is the finite set of possible actions, and the action at time $t$ is denoted by $A_t \in \mathcal{A}$. $P$ is the transition function, which describes the dynamics of the environment and is defined by $P(s', s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$, which has to be a valid probability mass function. $R$ is the reward function, defined by $R(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$, and $R_t \in \mathbb{R}$ is the reward at time $t$. $\gamma \in [0, 1]$ is the discount factor. $\mu_0$ is the initial state distribution, defined by $\mu_0(s) = \Pr(S_0 = s)$, which also has to be a valid probability mass function.

We are interested in finding a policy function $\pi$ such that if actions are sampled from it, that is, $A_t \sim \pi(\cdot \mid S_t)$, then the expected return is maximized, where the return is defined as $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$. The value function for policy $\pi$ is defined as $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$.
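For reference, the return defined above satisfies a standard recursion, and the value function the corresponding Bellman equation; the TD error used in Section 3.3 bootstraps on exactly this relation:

$G_t = R_{t+1} + \gamma G_{t+1}, \qquad v_\pi(s) = \mathbb{E}_\pi\big[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s \big]$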

In this work, we only consider a multi-layer network of stochastic units as the policy function. Let $L$ be the number of layers in the network, including the output layer. Denote by $H^l_t$ the value of the hidden units of layer $l$ at time $t$. For simplicity we also define $H^0_t = S_t$ and $H^L_t = A_t$, and write $H_t = (H^1_t, \dots, H^{L-1}_t)$ for the set of all hidden units. The conditional distribution of $H^l_t$ is given by the function $p(h^l \mid h^{l-1}; \theta_l) = \Pr(H^l_t = h^l \mid H^{l-1}_t = h^{l-1}; \theta_l)$, where $\theta_l$ is the parameter for computing the distribution of layer $l$. To sample an action from the network, we iteratively sample $H^l_t$ from $l = 1$ to $L$. A common choice of $p$ is $H^l_t \sim \mathcal{N}(f(W_l H^{l-1}_t), \sigma_l^2 I)$, where $f$ is a non-linear activation function such as softplus or the rectified linear unit (ReLU).
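As a concrete illustration of this sampling scheme, the following is a minimal NumPy sketch (the layer sizes, the softmax output layer, and all names are illustrative, not taken from the paper's implementation):

import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def sample_forward(s, weights, sigma=1.0, temperature=1.0, rng=np.random):
    """Sample hidden values H^1..H^{L-1} and an action A given state s.

    weights: list [W_1, ..., W_L]; each hidden layer is Gaussian with mean
    softplus(W_l h^{l-1}) and fixed std sigma, and the output layer is a
    softmax over W_L h^{L-1} / temperature (a categorical action unit).
    """
    h, hidden = s, []
    for W in weights[:-1]:
        mean = softplus(W @ h)
        h = mean + sigma * rng.standard_normal(mean.shape)   # H^l ~ N(f(W_l H^{l-1}), sigma^2 I)
        hidden.append(h)
    logits = weights[-1] @ h / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    action = rng.choice(len(probs), p=probs)
    return hidden, action

# Example: a 37-input network with 64 and 32 hidden units and 2 actions (as in Section 5.1)
rng = np.random.default_rng(0)
weights = [rng.normal(0, 0.1, (64, 37)), rng.normal(0, 0.1, (32, 64)), rng.normal(0, 0.1, (2, 32))]
hidden, action = sample_forward(rng.integers(0, 2, 37).astype(float), weights, rng=rng)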

3 Algorithm

3.1 MAP Propagation

Firstly, the gradient of the expected return can be estimated by REINFORCE, which is also known as the likelihood-ratio estimator:

$\nabla_\theta \mathbb{E}[G_t] = \mathbb{E}\big[ G_t \nabla_\theta \log \Pr(A_t \mid S_t; \theta) \big]$ (1)

We can show that the right-hand side of (1) also equals $\mathbb{E}\big[ G_t \sum_{l=1}^{L} \nabla_\theta \log p(H^l_t \mid H^{l-1}_t; \theta_l) \big]$, which is the REINFORCE update rule applied to each hidden unit with the same global reinforcement signal $G_t$:

Theorem 1.

Under the multi-layer network of stochastic units described above, we have, for $l \in \{1, 2, \dots, L\}$:

$\mathbb{E}\big[ G_t \nabla_{\theta_l} \log \Pr(A_t \mid S_t; \theta) \big] = \mathbb{E}\big[ G_t \nabla_{\theta_l} \log p(H^l_t \mid H^{l-1}_t; \theta_l) \big]$ (2)

The proof is in Appendix A.1. Note that the above theorem is also proved in Williams (1992); we include a proof here for completeness. This shows that we can apply REINFORCE to each unit of the network, and the learning rule is still an unbiased estimator of the gradient of the return. Therefore, denoting the step size by $\alpha$, we can learn the parameters of the network by:

$\theta_l \leftarrow \theta_l + \alpha G_t \nabla_{\theta_l} \log p(H^l_t \mid H^{l-1}_t; \theta_l), \qquad l = 1, \dots, L$ (3)

However, this learning rule suffers from high variance, as shown by the following decomposition of variance (here we assume $G_t$ to be a deterministic function of $S_t$ and $A_t$; the details can be found in the Appendix):

$\mathrm{Var}\big[ G_t \nabla_{\theta_l} \log p(H^l_t \mid H^{l-1}_t; \theta_l) \big] = \mathbb{E}\Big[ \mathrm{Var}\big[ G_t \nabla_{\theta_l} \log p(H^l_t \mid H^{l-1}_t; \theta_l) \mid S_t, A_t \big] \Big] + \mathrm{Var}\Big[ \mathbb{E}\big[ G_t \nabla_{\theta_l} \log p(H^l_t \mid H^{l-1}_t; \theta_l) \mid S_t, A_t \big] \Big]$ (4)

For the first term, the inner variance is taken with respect to the hidden values $H_t$ conditioned on $S_t$ and $A_t$. This variance is due to the stochastic nature of the hidden units in the network rather than the environment. For the second term, the outer variance is taken with respect to $S_t$, $A_t$, and $G_t$. This variance is due to the stochastic nature of the state, the chosen action, and the return of the environment. We call the first term the internal variance and the second term the external variance.

Consider a multi-layer network of deterministic hidden units, with the output units trained by REINFORCE and the hidden units trained by backpropagation. Applying the decomposition in (4) to this case, the internal variance vanishes, since the hidden values are a deterministic function of $S_t$, while the external variance remains. Therefore the internal variance is the main contributor to the high variance of training all units with REINFORCE, whereas the external variance is a problem common to both learning rules. Various methods exist to reduce the external variance, such as subtracting a value baseline. In this work, we are more interested in reducing the internal variance, which only appears when training a whole network by REINFORCE.

It can be shown that (see the Appendix for the details):

$\mathbb{E}\big[ G_t \nabla_{\theta_l} \log p(H^l_t \mid H^{l-1}_t; \theta_l) \big] = \mathbb{E}\Big[ G_t \, \mathbb{E}\big[ \nabla_{\theta_l} \log p(H^l_t \mid H^{l-1}_t; \theta_l) \mid S_t, A_t \big] \Big]$ (5)

To reduce the internal variance, we can therefore replace $\nabla_{\theta_l} \log p(H^l_t \mid H^{l-1}_t; \theta_l)$ in learning rule (3) by its conditional expectation given $S_t$ and $A_t$, such that we obtain the following learning rule:

$\theta_l \leftarrow \theta_l + \alpha G_t \, \mathbb{E}\big[ \nabla_{\theta_l} \log p(H^l_t \mid H^{l-1}_t; \theta_l) \mid S_t, A_t \big]$ (6)

Since (3) follows the gradient of return in expectation, and the expected updates of (3) and (6) are the same, we conclude that (6) is also a valid learning rule in the sense that it follows the gradient of return in expectation. It is an ideal learning rule since it removes the internal variance from the learning rule. However, we note that the conditional expectation in (6) is generally intractable to compute. Instead, we propose to use a maximum a posteriori (MAP) estimate to approximate this term:

$\mathbb{E}\big[ \nabla_{\theta_l} \log p(H^l_t \mid H^{l-1}_t; \theta_l) \mid S_t, A_t \big] \approx \nabla_{\theta_l} \log p(\hat{H}^l_t \mid \hat{H}^{l-1}_t; \theta_l), \qquad \hat{H}_t = \arg\max_{h} \Pr(H_t = h \mid S_t, A_t)$ (7)

Denote the MAP estimate by $\hat{H}_t = \arg\max_{h} \Pr(H_t = h \mid S_t, A_t)$; we are interested in estimating $\mathbb{E}\big[ \nabla_{\theta_l} \log p(H^l_t \mid H^{l-1}_t; \theta_l) \mid S_t, A_t \big]$. Essentially, REINFORCE uses a single sample of $H_t$ to estimate this expectation, while we propose to use the mode of $H_t$ conditioned on $S_t$ and $A_t$ instead. Though the former estimate is unbiased, it suffers from a variance so high that learning is extremely slow. The latter estimate is biased but has no variance.

Since $\hat{H}_t$ is intractable to compute analytically, we can approximate it by running gradient ascent on $\log \Pr(H_t = h, A_t \mid S_t)$ as a function of $h$ for fixed $S_t$ and $A_t$, with $h$ initialized to the sample $H_t$ drawn from the network when sampling the action $A_t$, so that $h$ approaches $\hat{H}_t$. This assumes all hidden units to be continuous. Therefore, before applying learning rule (3), we first run gradient ascent on $H_t$ for $N$ steps:

$H_t \leftarrow H_t + \alpha_h \nabla_{H_t} \log \Pr(H_t, A_t \mid S_t; \theta)$ (8)

For $l \in \{1, \dots, L-1\}$, we can see that this is equivalent to (see the Appendix for the details):

$H^l_t \leftarrow H^l_t + \alpha_h \nabla_{H^l_t} \big( \log p(H^l_t \mid H^{l-1}_t; \theta_l) + \log p(H^{l+1}_t \mid H^l_t; \theta_{l+1}) \big)$ (9)

Therefore, the update rule for $H^l_t$ is local: it depends only on $H^{l-1}_t$, $H^l_t$, and $H^{l+1}_t$, and not on other layers. The update rule adjusts the value of a hidden layer so as to maximize its probability given the values of the layers one below and one above.

We can also add noise to $H_t$ during each update, turning the procedure into stochastic gradient ascent. After updating $H_t$ for $N$ steps by (9), we apply (3) to learn the weights of the network. We call this algorithm MAP propagation; the pseudo-code can be found in Algorithm 1.
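To make the two phases concrete, the following is a minimal PyTorch-style sketch of one MAP propagation update for a single sample, assuming Gaussian hidden layers with softplus means and a softmax output layer as in Section 2; the function names, hyperparameter values, and the exact handling of the output layer are illustrative rather than the paper's implementation:

import torch
import torch.nn.functional as F

def layer_logp(W, h_prev, h, sigma):
    # log p(h | h_prev) for a Gaussian layer with mean softplus(W h_prev), up to a constant
    mean = F.softplus(W @ h_prev)
    return -((h - mean) ** 2).sum() / (2 * sigma ** 2)

def joint_logp(weights, s, hidden, action, sigma, temperature):
    # log Pr(H, A | S) for Gaussian hidden layers and a softmax output layer
    logp = layer_logp(weights[0], s, hidden[0], sigma)
    for l in range(1, len(hidden)):
        logp = logp + layer_logp(weights[l], hidden[l - 1], hidden[l], sigma)
    logits = weights[-1] @ hidden[-1] / temperature
    return logp + F.log_softmax(logits, dim=0)[action]

def map_prop_step(weights, s, hidden, action, G,
                  n_steps=20, lr_h=0.1, lr_w=0.01, sigma=1.0, temperature=1.0):
    """One MAP propagation update for a single (S, H, A, G) sample.
    weights: list of weight tensors with requires_grad=True.
    hidden:  list of hidden values sampled during the forward pass.
    G:       global reinforcement signal (return, or a TD error)."""
    hidden = [h.detach().clone().requires_grad_(True) for h in hidden]
    # Phase 1: gradient ascent on log Pr(H, A | S) w.r.t. the hidden values, as in (8)/(9)
    for _ in range(n_steps):
        grads = torch.autograd.grad(joint_logp(weights, s, hidden, action, sigma, temperature), hidden)
        with torch.no_grad():
            for h, g in zip(hidden, grads):
                h += lr_h * g                      # optionally inject noise here
    # Phase 2: REINFORCE-style weight update (3) using the MAP-adjusted hidden values
    wgrads = torch.autograd.grad(joint_logp(weights, s, hidden, action, sigma, temperature), weights)
    with torch.no_grad():
        for W, g in zip(weights, wgrads):
            W += lr_w * G * g
    return hidden

The first loop corresponds to the hidden-value updates of (8)/(9); the final lines apply (3) with the updated hidden values and the global signal $G$.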

We argue that the proposed algorithm is more biologically plausible than backpropagation. Firstly, the update rules used in MAP propagation are local except for the global signal. A local update rule refers to an update rule that depends only on presynaptic spikes (or the input of a unit) and postsynaptic spikes (or the output of a unit) during the forward computation. In contrast, backpropagation can be formulated as an update rule that depends on the input and output of a unit and also on the change in its outgoing weights. However, using the change in outgoing weights to propagate information is less biologically plausible than using only presynaptic and postsynaptic spikes, since the change in synaptic weights is much slower than spike transmission. For example, in experiments on STDP Bi and Poo (2001), it takes multiple pre-post spike pairs for a neuron to show an observable change in synaptic weights. Also, common learning rules observed biologically, such as the Hebbian rule, STDP, and R-STDP, are all independent of the change in outgoing weights and depend only on presynaptic and postsynaptic spikes (and a global signal for R-STDP). From this perspective, MAP propagation is more biologically plausible than backpropagation, since its update rules depend only on presynaptic spikes (or the input of a unit), postsynaptic spikes (or the output of a unit), and a global signal.

Secondly, all update rules in MAP propagation can be applied in parallel for each layer, while backpropagation requires iterating backward from the last layer one layer at a time; thus MAP propagation does not require layer-wise synchronization while backpropagation does. It remains to be investigated whether the phase synchronization between the three phases of MAP propagation (forward computation, the update of hidden values, and the update of weights) can be removed, such that the whole algorithm can be implemented asynchronously.

In the following, we consider how to apply this idea to an actor-critic network. Learning rules (9) and (3) can be used to train the actor network, with $G_t$ in (3) replaced by the TD error computed by the critic network. For the critic network, however, the above learning rules cannot be applied directly, since they apply only to reinforcement learning tasks.

Since the function of the critic network is to predict a scalar value, we consider more generally how to learn a scalar regression task by MAP propagation.

3.2 MAP Propagation for Scalar Regression Task

A scalar regression task can be converted into a single-time-step MDP with an appropriate reward function and $\mathbb{R}$ as the action set. For example, we can set the reward to $R = -(A - A^*)^2$ (we drop the time subscript since there is only a single time step), with $A$ being the output of the network and $A^*$ being the target output given input $S$. Maximizing the reward in this MDP thus minimizes the L2 distance between the predicted value and the target value.

But this conversion is inefficient, since the information about the optimal output is lost: should $A$ increase or decrease given the value of $R$? To address this issue, we suggest subtracting a baseline from $R$. With the resulting adjusted reward, we obtain the following result:

Theorem 2.

Under the multi-layer network of stochastic units described above, we have, for $l \in \{1, 2, \dots, L\}$:

(10)

The proof is in Appendix A.2. We note that the last term encourages the output of the network to have small variance; it can be removed if we instead consider the same term as a regularizer motivating exploration. Therefore, we can use the following learning rule for the task:

(11)
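For intuition about why a variance term arises here, note the standard decomposition of the expected reward under $R = -(A - A^*)^2$ (a generic identity, independent of the particular baseline used above):

$\mathbb{E}\big[ -(A - A^*)^2 \mid S \big] = -\big( \mathbb{E}[A \mid S] - A^* \big)^2 - \mathrm{Var}(A \mid S)$

Maximizing the expected reward therefore both drives the mean output toward the target and shrinks the output variance; the latter effect corresponds to the variance term discussed above, which may instead be kept as a regularizer motivating exploration.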

However, the expected output of the network given the input is generally intractable to compute exactly. We could estimate it by Monte Carlo, running the network several times, but this is inefficient. Instead, we suggest using $\bar{A}$, the mean value of the action given the value of the last hidden layer, to estimate this term. The value of the last hidden layer here is the one obtained while sampling the action $A$, before running update rule (9). $\bar{A}$ can usually be computed directly from the last hidden layer; for example, when the output unit's distribution is Gaussian with mean given by a linear transformation of the last hidden layer, $\bar{A}$ is simply that mean.

Though the use of $\bar{A}$ means the baseline is no longer valid for any layer but the last, since it depends on the value of the last hidden layer, our experiments show that the result is no worse than Monte Carlo estimation when combined with MAP propagation. Therefore, we suggest the following learning rule for a scalar regression task:

(12)

The pseudo-code is the same as that of Algorithm 1, but with the return $G$ in line 15 replaced by the corresponding adjusted term in (12).

3.3 MAP Propagation for Actor Critic Network

In the following, we use superscripts $c$ and $a$ to denote quantities of the critic and actor networks respectively. For example, $A^c_t$ denotes the action of the critic network (that is, the predicted value) at time $t$, and $\bar{A}^c_t$ denotes its mean output. We also use a TD error and a mean TD error, where the mean TD error is computed from the critic's mean outputs $\bar{A}^c$ instead of its sampled outputs $A^c$.
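For reference, written with the critic's sampled and mean outputs standing in for the value estimates, the one-step TD errors take the familiar form (the exact expressions used in (13) and (14) follow from the substitutions described below):

$\delta_t = R_{t+1} + \gamma A^c_{t+1} - A^c_t, \qquad \bar{\delta}_t = R_{t+1} + \gamma \bar{A}^c_{t+1} - \bar{A}^c_t$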

In the critic network, the target output is the bootstrapped return $R_{t+1} + \gamma A^c_{t+1}$. However, a more stable estimate of the target output is $R_{t+1} + \gamma \bar{A}^c_{t+1}$, which uses the critic's mean prediction at the next step, so we choose it as the target output $A^*$. Substituting back into (12), the update rule for the critic network is:

(13)

In the actor network, we replace $G_t$ in (3) with the TD error computed by the critic network. A more stable estimate of the TD error is the mean TD error, which we use instead. Substituting back into (3), the update rule for the actor network is:

(14)

The learning rules of the critic network and the actor network differ only by an additional factor in the critic's update. Two global signals are required for the learning rule of the critic network, while only a single global signal, the mean TD error, is required for the learning rule of the actor network.

Again, we apply (9) $N$ times before applying (13) and (14), to reduce the variance of the learning rules.

We can further generalize the learning rules to use eligibility traces for more efficient temporal credit assignment, by accumulating the gradient terms of (13) and (14) in decaying traces. The full algorithm for an actor-critic network with eligibility traces trained by MAP propagation is given in Algorithm 2.
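As a sketch of the trace mechanics only (a generic actor-critic-with-traces template over a list of PyTorch parameter tensors; the decay product gamma_lambda, the learning rate, and the choice of log-probability term are placeholders, and the exact gradient terms accumulated by Algorithm 2 are those of (13) and (14)):

import torch

def init_traces(params):
    # one eligibility trace per parameter tensor, reset to zero at the start of each episode
    return [torch.zeros_like(p) for p in params]

def accumulate_and_apply(params, traces, logp, delta, lr, gamma_lambda):
    """Decay each trace, add the current log-probability gradient, then apply a
    TD-error-modulated update: z <- gamma*lambda*z + grad, theta <- theta + lr*delta*z.
    params: tensors with requires_grad=True on which the scalar logp depends."""
    grads = torch.autograd.grad(logp, params)
    with torch.no_grad():
        for p, z, g in zip(params, traces, grads):
            z.mul_(gamma_lambda).add_(g)    # decaying accumulation of eligibility
            p.add_(lr * delta * z)          # the global signal (TD error) gates the update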

4 Related Works

A major inspiration for our work is Bengio et al. (2015), which proposed training a deep generative model by using MAP estimates to infer the values of latent variables given the observed variables. Though the inference processes of the two algorithms share some similarity, our algorithm is based on REINFORCE and is used in reinforcement learning tasks, while theirs is based on variational EM and is used in unsupervised learning tasks. They also proposed learning the feedback weights so that the constraint of symmetric weights can be removed. Given the similarity of the inference processes, this idea could possibly be applied to our algorithm as well to remove the requirement of symmetric weights. Bengio and Fischer (2015) also explained the relationship between the inference of latent variables in energy-based models and backpropagation.

Another closely related work is Whittington and Bogacz (2017), which proposes training a network on supervised learning tasks with a local learning rule. They proposed using a MAP estimate to infer the values of the hidden units conditioned on the network's output being the target output. A local Hebbian learning rule is then applied to learn the network parameters after the hidden units' values settle to the MAP estimate. They showed that the proposed algorithm has a close relationship with backpropagation. In our algorithm, we do not fix the value of the network's output to a target output, since the optimal action is not known; we use a global reward signal to guide learning instead.

We can also view a multi-level network of stochastic units as a hierarchy of agents. In hierarchical reinforcement learning, Levy et al. (2017) proposed learning a multi-level hierarchy with the use of hindsight actions, which replace the chosen action of a high-level agent with a more likely action. This idea is similar to ours, as the actions of all actors in the hidden layers are replaced with the most probable actions in our algorithm.

There is a large body of literature on methods for training a network of stochastic units. A comprehensive review can be found in Weber et al. (2019), which covers the re-parametrization trick Kingma and Welling (2014) and REINFORCE Williams (1992), and introduces methods to reduce the variance of the update rules, such as baselines and critics. These ideas are orthogonal to the use of the MAP estimate to reduce the variance of REINFORCE.

5 Experiments

To test the algorithm, we first consider the multiplexer task (Barto, 1985), which can be considered as a single-time-step MDP. This is to test the performance of the algorithm as an actor. Then we consider a scalar regression task to test the performance of the algorithm as a critic. Finally, we consider two popular reinforcement learning tasks, CartPole and LunarLander Brockman et al. (2016), to test the performance of the algorithm as both actor and critic.

In all the tasks below, all networks considered have the same architecture: a two-hidden-layer network, with the first hidden layer having 64 units, the second hidden layer having 32 units, and the output layer having one unit. All hidden units are normally distributed with mean given by a non-linear function of a linear transformation of the previous layer and a fixed variance; that is, $H^l_t \sim \mathcal{N}(f(W_l H^{l-1}_t), \sigma_l^2 I)$ for $l \in \{1, 2\}$. We chose $f$ to be the softplus function. In the case of the network in the multiplexer task and the actor network, the output unit's distribution is given by a softmax function applied to a linear transformation of the previous layer divided by a temperature; that is, $\Pr(A_t = a \mid H^{L-1}_t) \propto \exp\big( (W_L H^{L-1}_t)_a / T \big)$, where $T$ is a scalar hyperparameter representing the temperature. In the case of the network in the scalar regression task and the critic network, the output unit is normally distributed with mean given by a linear transformation of the previous layer and a fixed variance; that is, $A_t \sim \mathcal{N}(W_L H^{L-1}_t, \sigma_L^2)$. The number of hidden-value update steps $N$ used in MAP propagation, other hyper-parameters, and further details of the experiments can be found in the Appendix.
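A compact sketch of this architecture using torch.distributions (the fixed standard deviation and the temperature below are placeholders; the actual values are hyperparameters listed in the Appendix):

import torch
import torch.nn.functional as F
from torch import nn
from torch.distributions import Normal, Categorical

class StochasticMLP(nn.Module):
    """Two Gaussian hidden layers (softplus means, fixed std) with either a softmax
    output (multiplexer / actor) or a Gaussian output (regression / critic)."""
    def __init__(self, in_dim, out_dim, discrete, sigma=1.0, temperature=1.0):
        super().__init__()
        self.l1, self.l2, self.l3 = nn.Linear(in_dim, 64), nn.Linear(64, 32), nn.Linear(32, out_dim)
        self.discrete, self.sigma, self.temperature = discrete, sigma, temperature

    def sample(self, s):
        h1 = Normal(F.softplus(self.l1(s)), self.sigma).sample()
        h2 = Normal(F.softplus(self.l2(h1)), self.sigma).sample()
        if self.discrete:
            out_dist = Categorical(logits=self.l3(h2) / self.temperature)
        else:
            out_dist = Normal(self.l3(h2), self.sigma)
        return (h1, h2), out_dist.sample(), out_dist

actor = StochasticMLP(in_dim=4, out_dim=2, discrete=True)      # e.g. a CartPole actor
critic = StochasticMLP(in_dim=4, out_dim=1, discrete=False)    # scalar value prediction
hidden, action, dist = actor.sample(torch.zeros(4))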

5.1 Multiplexer Task

We consider the task of a k-bit multiplexer here (similar to the one in Barto (1985), which is a 2-bit multiplexer), which can be considered a single-time-step MDP. In the multiplexer, there are $k + 2^k$ binary inputs, where the first $k$ bits represent the address and the last $2^k$ bits represent the data, each of which is associated with an address. The desired output of the multiplexer is given by the value of the data bit associated with the address, which is also binary. The action set is $\{0, 1\}$, and we give a reward of 1 if the action of the agent is the desired output and -1 otherwise. We consider $k = 5$ here, so the dimension of the state is $5 + 2^5 = 37$.
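For concreteness, a small helper for generating samples of this task (the uniform sampling of address and data bits is an assumption; names are illustrative):

import numpy as np

def multiplexer_sample(k=5, rng=None):
    """Return (state, desired_output) for a k-bit-address multiplexer.
    The state is k address bits followed by 2**k data bits (5 + 32 = 37 bits for k = 5)."""
    rng = rng or np.random.default_rng()
    address = rng.integers(0, 2, size=k)
    data = rng.integers(0, 2, size=2 ** k)
    index = int("".join(map(str, address)), 2)      # address bits read as a binary number
    state = np.concatenate([address, data]).astype(np.float32)
    return state, int(data[index])

state, target = multiplexer_sample()
action = 1                                          # the agent's binary action
reward = 1 if action == target else -1              # reward scheme described above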

We trained each network using 2,000,000 training samples and a batch size of 128 with the Adam optimizer Kingma and Ba (2014). Batch here means that we compute the weight update for each sample in a batch and then update the weights using the average of these updates. The Adam optimizer is applied by treating the weight update as a gradient; it is not applied to the update of the hidden values in MAP propagation.
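A sketch of how the averaged updates can be fed to Adam (assuming PyTorch; the negation reflects that optimizers minimize, whereas the weight updates here ascend the return):

import torch

def apply_batched_updates(params, batch_updates, optimizer):
    """batch_updates: list (one entry per sample) of lists of per-parameter update tensors."""
    optimizer.zero_grad()
    for p, *updates in zip(params, *batch_updates):
        # average the per-sample updates and treat the (negated) result as a gradient
        p.grad = -torch.stack(updates).mean(dim=0)
    optimizer.step()

params = [torch.nn.Parameter(torch.randn(32, 64)), torch.nn.Parameter(torch.randn(2, 32))]
optimizer = torch.optim.Adam(params, lr=1e-3)
fake_updates = [[torch.randn_like(p) for p in params] for _ in range(128)]  # placeholder updates
apply_batched_updates(params, fake_updates, optimizer)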

We used Algorithm 1 for MAP propagation. We consider two baselines: (1) backpropagation, where the output unit is trained with REINFORCE and all hidden units are trained with backpropagation using the re-parametrization trick; and (2) REINFORCE, where both the output and hidden units are trained with REINFORCE.

The results are shown in Fig. 1. We observe that MAP propagation performs best, followed closely by backpropagation. For REINFORCE, learning is very slow due to inefficient structural credit assignment. The results demonstrate that MAP propagation can improve the learning speed of REINFORCE significantly by reducing the variance of the weight updates, such that its learning speed is comparable to (or better than) that of backpropagation.

5.2 Scalar Regression Task

Next, we consider a scalar regression task, which can also be considered a single-time-step MDP. The input has dimension 8 and follows the standard normal distribution. The target output is computed by a one-hidden-layer network with randomly chosen weights. The action set is $\mathbb{R}$, and the reward is given by the negative L2 distance between the action and the target output.

Similar to the previous task, we trained each network using 100,000 training samples and a batch size of 16 with the Adam optimizer. For MAP propagation, we used Algorithm 1 and tested two variants. In the first variant, we directly used the negative L2 loss as the reward; this is labeled 'MAP Prop (RL)'. In the second variant, we used (12) as the update rule in Algorithm 1; this is labeled 'MAP Prop'. For the baseline, we trained the network with backpropagation using the re-parametrization trick, performing gradient descent on the L2 loss.

The results are shown in Fig. 1. We observe that if we directly use the negative L2 loss as the reward, the learning speed of MAP propagation is significantly lower than that of backpropagation, since the information about the optimal action is not incorporated. However, if we instead use (12) to incorporate the information about the optimal action, the learning speed of MAP propagation is comparable to that of backpropagation. The results demonstrate that MAP propagation can be used to train a network on scalar regression tasks efficiently.

5.3 Reinforcement Learning Task

Next, we consider two popular reinforcement learning tasks: CartPole and LunarLander Brockman et al. (2016). In CartPole, the goal is to maintain the balance of the pole. A reward of +1 is given for each environment step. An episode ends if it lasts more than 500 steps or the pole falls, so the maximum return of an episode is 500. In LunarLander, there are four available actions for controlling the movement of the lander, and the goal is to land the lander on the landing pad, which yields a score of 200 or above. We used CartPole-v1 and LunarLander-v2 in OpenAI Gym for the implementation of both tasks.
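For reference, the standard interaction loop for these environments (classic Gym API, which is version-dependent; the random policy below is a placeholder for the actor network described above):

import gym

env = gym.make("CartPole-v1")
state = env.reset()                              # old-style Gym API (gym <= 0.25) returns only the observation
done, episode_return = False, 0.0
while not done:
    action = env.action_space.sample()           # placeholder for the actor's sampled action
    state, reward, done, info = env.step(action)
    episode_return += reward
print("episode return:", episode_return)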

For MAP propagation, we used the actor-critic network with eligibility traces given by Algorithm 2. For the baseline, we used actor-critic with eligibility traces (episodic) from Sutton and Barto (2018). The baseline model has an architecture similar to the one used with MAP propagation, but with all hidden units being deterministic, since we found that deterministic units performed better than stochastic units in this setting. We trained the entire baseline network with backpropagation and did not use any batch updates.

We trained the networks for at least 1,000 and 3,000 episodes for CartPole and LunarLander respectively, and stopped training once the agent solved the task. We define solving the task as having a running average reward over the last 100 episodes greater than or equal to 500 for CartPole and 200 for LunarLander.

The average return over 10 independent runs is shown in Fig. 2. The first episode in which the agent achieves a perfect score (500 for CartPole and 200 for LunarLander), averaged over the 10 runs, is 60.70 for CartPole and 288.10 for LunarLander with MAP propagation. In addition, the number of episodes required to solve the task, averaged over the 10 runs, is 183.20 for CartPole and 1997.30 for LunarLander with MAP propagation. A comparison with the baseline model can be found in Table 1.

For both CartPole and LunarLander, we observe that MAP propagation performs better than the baseline, in terms of both the learning curve and the number of episodes required to solve the task. Although other algorithms, such as value-based methods, can solve both tasks more efficiently, the present work aims to demonstrate that MAP propagation can train a multi-layer network within a standard reinforcement learning algorithm at a speed comparable to that of backpropagation. This points to the possibility of using MAP propagation to replace backpropagation when training multi-layer networks in algorithms besides actor-critic networks.

We also notice that for the baseline models, entropy regularization Mnih et al. (2016) is necessary for the agent to solve the task; otherwise the agent gets stuck in early episodes. In contrast, we did not employ any form of regularization in MAP propagation. This may indicate that stochastic hidden units combined with MAP propagation encourage more exploration, such that entropy regularization is unnecessary for the agent to solve the task.

In the experiments above, we plotted the running average of rewards against the episode number. It should be noted, however, that the computation required per step differs between MAP propagation and backpropagation, since each step of MAP propagation performs $N$ additional updates of the hidden values. One advantage of MAP propagation is that the update rule (9) can be applied asynchronously, while backpropagation requires all upper layers to finish computing their gradients before the next layer can be updated.

Figure 1: Running average of rewards over the last 10 episodes in (a) the multiplexer task and (b) the scalar regression task. Results are averaged over 10 independent runs, and the shaded area represents the standard deviation over the runs.
Figure 2: Running average of rewards over the last 100 episodes in (a) CartPole and (b) LunarLander. Results are averaged over 10 independent runs, and the shaded area represents the standard deviation over the runs.
            First episode with perfect score      Episodes to solve the task
            CartPole          LunarLander         CartPole          LunarLander
            Mean    Std.      Mean    Std.        Mean    Std.      Mean     Std.
Proposed    60.70   17.37     288.10  71.47       183.20  65.24     1997.30  1092.33
Baseline    101.40  30.47     656.80  124.17      315.90  82.11     3088.30  1030.17
Table 1: Episodes required for learning the task (over 10 independent runs).

6 Conclusion

In this paper, we propose a new algorithm that significantly reduces the variance of the update step when training all units in a network by REINFORCE. The newly proposed algorithm retains the local property of the update rules, and our experiments show that it can solve reinforcement learning tasks at a speed similar to that of backpropagation when applied to actor-critic networks. This points to the possibility of learning a deep network for more complex tasks, such as Atari games, without backpropagation. We also argue that the proposed algorithm is more biologically plausible than backpropagation, as the update rules of MAP propagation are local except for a global signal, and MAP propagation does not require layer-wise synchronization during the learning of parameters.

One unsatisfying property of the proposed algorithm is that it relies on symmetric feedback weights, which are not observed biologically. However, this issue could be addressed by learning the feedback weights separately, by predicting the values of the lower layer given the values of the upper layer, as is done in Bengio et al. (2015).

The newly proposed algorithm may also give insights into how learning occurs in biological systems. The idea that each neuron in a network learns independently by R-STDP, which is closely related to REINFORCE, is elegant and biologically plausible. However, although such learning follows gradient ascent on the return in expectation, it results in very inefficient structural credit assignment, since a single global signal is responsible for assigning credit to millions of neurons in the network. In MAP propagation, we address this issue by letting all hidden units settle to equilibrium while fixing the value of the output units, which allows more efficient structural credit assignment. Given the large number of feedback connections in biological systems, this may be one of the mechanisms that biological systems employ to facilitate structural credit assignment. For example, Roelfsema and Holtmaat (2018) show that the activity of neurons in the visual system responsible for the selected action is enhanced by feedback connections, which is analogous to the updates of hidden units in MAP propagation.

As shown by Whittington and Bogacz (2017), MAP inference has a close relationship with backpropagation. Is MAP propagation equivalent to backpropagation under certain conditions? Can MAP propagation avoid poor local optima better than backpropagation? The relationship and comparison between MAP propagation and backpropagation deserve further analysis. In addition, it remains to be investigated how the phase synchronization required in MAP propagation can be removed, such that the entire algorithm can be implemented asynchronously.

7 Acknowledgment

We thank Andy Barto who inspired this research and provided valuable insight and comments.

1 Input: a differentiable policy function $p(h^l \mid h^{l-1}; \theta_l)$ for $l = 1, \dots, L$;
2 Algorithm parameters: step sizes $\alpha$, $\alpha_h$; update step number $N$;
3 Initialize policy parameters $\theta_l$ for $l = 1, \dots, L$;
4 Loop forever (for each episode):
5       Generate an episode $S_0, A_0, R_1, \dots, S_{T-1}, A_{T-1}, R_T$ following $p(\cdot \mid \cdot; \theta)$, storing the sampled hidden values $H_t$ at each step;
6       Loop for each step of the episode, $t = 0, 1, \dots, T-1$:
7             $G \leftarrow \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$;
              /* Gradient ascent on probability of hidden values */
8             for $n = 1, 2, \dots, N$ do
9                   for $l = 1, 2, \dots, L-1$ do
10                        $H^l_t \leftarrow H^l_t + \alpha_h \nabla_{H^l_t} \big( \log p(H^l_t \mid H^{l-1}_t; \theta_l) + \log p(H^{l+1}_t \mid H^l_t; \theta_{l+1}) \big)$;
11                        Optional: Inject noise into $H^l_t$;
12                  end for
13            end for
              /* Learn parameters using the updated $H_t$ */
14            for $l = 1, 2, \dots, L$ do
15                  $\theta_l \leftarrow \theta_l + \alpha G \nabla_{\theta_l} \log p(H^l_t \mid H^{l-1}_t; \theta_l)$;
16            end for
17      end loop
18 end loop
Algorithm 1 MAP Propagation: Monte-Carlo Policy-Gradient Control
1 Input: differentiable policy functions for the critic and actor respectively: $p^c(h^l \mid h^{l-1}; \theta^c_l)$ and $p^a(h^l \mid h^{l-1}; \theta^a_l)$ for $l = 1, \dots, L$;
2 Algorithm parameters: step sizes $\alpha^c$, $\alpha^a$, $\alpha^c_h$, $\alpha^a_h$; update step number $N$; trace decay rates $\lambda^c$, $\lambda^a$;
3 Initialize policy parameters $\theta^c_l$, $\theta^a_l$ for $l = 1, \dots, L$;
4 Loop forever (for each episode):
5       Initialize $S$ (first state of episode);
6       $z^c_l \leftarrow 0$, $z^a_l \leftarrow 0$ for all $l$ (initialize eligibility traces for critic and actor);
7       $\delta \leftarrow 0$, $R \leftarrow 0$ (initialize TD error and reward);
8       $V_{\mathrm{last}} \leftarrow 0$, $\bar{V}_{\mathrm{last}} \leftarrow 0$ (initialize last predicted value and last mean value);
9       Loop while $S$ is not terminal (for each time step):
10            $H^{c,0} \leftarrow S$;
11            $H^{a,0} \leftarrow S$;
              /* Sample action $A^a$ and predicted value $A^c$. */
12            for $l = 1, \dots, L$ do
13                  Sample $H^{c,l}$ from $p^c(\cdot \mid H^{c,l-1}; \theta^c_l)$;
14                  Sample $H^{a,l}$ from $p^a(\cdot \mid H^{a,l-1}; \theta^a_l)$;
15            end for
16            $A^c \leftarrow H^{c,L}$;
17            $A^a \leftarrow H^{a,L}$;
18            $\bar{A}^c \leftarrow$ mean of $p^c(\cdot \mid H^{c,L-1}; \theta^c_L)$;
              /* Learn parameters from the TD error */
19            $\delta \leftarrow R + \gamma \bar{A}^c - \bar{V}_{\mathrm{last}}$ (compute the TD error; if $S$ is terminal, replace $\bar{A}^c$ with 0);
20            for $l = 1, \dots, L$ do
21                  $\theta^c_l \leftarrow \theta^c_l + \alpha^c \delta z^c_l$;
22                  $\theta^a_l \leftarrow \theta^a_l + \alpha^a \delta z^a_l$;
23            end for
              /* Gradient ascent on probability of hidden values */
24            for $n = 1, \dots, N$ do
25                  for $l = 1, \dots, L-1$ do
26                        Update $H^{c,l}$ by (9) with step size $\alpha^c_h$;
27                        Update $H^{a,l}$ by (9) with step size $\alpha^a_h$;
28                        Optional: Inject noise into $H^{c,l}$ and $H^{a,l}$;
29                  end for
30            end for
              /* Accumulate eligibility traces */
31            for $l = 1, \dots, L$ do
32                  $z^c_l \leftarrow \gamma \lambda^c z^c_l + $ (gradient term of the critic update (13));
33                  $z^a_l \leftarrow \gamma \lambda^a z^a_l + \nabla_{\theta^a_l} \log p^a(H^{a,l} \mid H^{a,l-1}; \theta^a_l)$;
34            end for
35            $V_{\mathrm{last}} \leftarrow A^c$ (record last predicted value);
36            $\bar{V}_{\mathrm{last}} \leftarrow \bar{A}^c$ (record last mean predicted value);
37            Take action $A^a$, observe $R$ and the next state $S$;
38      end loop
39 end loop
Algorithm 2 MAP Propagation: Actor-critic with Eligibility Trace

Appendix A Proof

a.1 Proof of Theorem 1

(15)
(16)
(17)
(18)
(19)
(20)
(21)
(22)
(23)
(24)

(16) to (17) uses the fact that:

(25)

which marginalizes over all possible values of the hidden units $H_t$.
(17) to (18) uses the fact that:

(26)
(27)
(28)

For (27), we first move the term out from inner summation since it is the only term that depends on . Then we multiply back to get (27). (28) uses the fact that the inner summation in (27) is just .
(18) to (19) uses the fact that .
(19) to (20) uses the definition of expectation.
(20) to (21) uses the fact that, for any random variables and , . In our case, , and .
(21) to (22) uses the fact that is conditional independent with