RUDDER: Return Decomposition for Delayed Rewards
Abstract
We propose a novel reinforcement learning approach for finite Markov decision processes (MDPs) with delayed rewards. In this work, biases of temporal difference (TD) estimates are proved to be corrected only exponentially slowly in the number of delay steps. Furthermore, variances of Monte Carlo (MC) estimates are proved to increase the variance of other estimates, the number of which can grow exponentially in the number of delay steps. We introduce RUDDER, a return decomposition method, which creates a new MDP with the same optimal policies as the original MDP but with redistributed rewards that have largely reduced delays. If the return decomposition is optimal, then the new MDP does not have delayed rewards and TD estimates are unbiased. In this case, the rewards track Q-values so that the future expected reward is always zero. We experimentally confirm our theoretical results on bias and variance of TD and MC estimates. On artificial tasks with different lengths of reward delays, we show that RUDDER is exponentially faster than TD, MC, and MC Tree Search (MCTS). RUDDER outperforms Rainbow, A3C, DDQN, Distributional DQN, Dueling DDQN, Noisy DQN, and Prioritized DDQN on the delayed-reward Atari game Venture in only a fraction of the learning time. RUDDER considerably improves the state of the art on the delayed-reward Atari game Bowling in much less learning time. Source code is available at https://github.com/mljku/baselinesrudder and demonstration videos at https://goo.gl/EQerZV.
1 Introduction
Assigning the credit for a received reward to actions that were performed is one of the central tasks in reinforcement learning Sutton:17book (). Long-term credit assignment has been identified as one of the largest challenges in reinforcement learning Sahni:18 (). Current reinforcement learning methods are still significantly slowed down when facing long-delayed rewards Rahmandad:09 (); Luoma:17 (). To learn delayed rewards there are three phases to consider: (1) discovering the delayed reward, (2) keeping information about the delayed reward, (3) learning to receive the delayed reward to secure it for the future. Recent successful reinforcement learning methods provide solutions to one or more of these phases. Most prominent are Deep Q-Networks (DQNs) Mnih:13 (); Mnih:15 (), which combine Q-learning with convolutional neural networks for visual reinforcement learning Koutnik:13 (). The success of DQNs is attributed to experience replay Lin:93 (), which stores observed state-reward transitions and then samples from them. Prioritized experience replay Schaul:15 (); Horgan:18 () advanced the sampling from the replay memory. Different policies perform exploration in parallel for the Ape-X DQN and share a prioritized experience replay memory Horgan:18 (). DQN was extended to double DQN (DDQN) Hasselt:10 (); Hasselt:16 (), which helps exploration as the overestimation bias is reduced. Noisy DQNs Fortunato:18 () explore by a stochastic layer in the policy network (see Hochreiter:90 (); Schmidhuber:90diff ()). Distributional Q-learning Bellemare:17 () profits from noise since means that have high variance are more likely selected. The dueling network architecture Wang:15 (); Wang:16 () separately estimates state values and action advantages, which helps exploration in unknown states. Policy gradient approaches Williams:92 () explore via parallel policies, too.
A2C has been improved by IMPALA through parallel actors and correction for policy lags between actors and learners Espeholt:18 (). A3C with asynchronous gradient descent Mnih:16 () and Ape-X DPG Horgan:18 () also rely on parallel policies. Proximal policy optimization (PPO) extends A3C by a surrogate objective and a trust region optimization that is realized by clipping or a Kullback-Leibler penalty Schulman:17 ().
Recent approaches aim to solve learning problems caused by delayed rewards. Function approximations of value functions or critics Mnih:15 (); Mnih:16 () bridge time intervals if states associated with rewards are similar to states that were encountered many steps earlier. For example, assume a function that has learned to predict a large reward at the end of an episode if a state has a particular feature. The function can generalize this correlation to the beginning of an episode and already predict a high reward for states possessing the same feature. Multi-step temporal difference (TD) learning Sutton:88td (); Sutton:17book () improved both DQNs and policy gradients Hessel:17 (); Mnih:16 (). AlphaGo and AlphaZero learned to play Go and Chess better than human professionals using Monte Carlo Tree Search (MCTS) Silver:16 (); Silver:17 (). MCTS simulates games from a time point until the end of the game or an evaluation point and therefore captures long-delayed rewards. Recently, world models using an evolution strategy were successful Ha:18 (). These forward view approaches are not feasible in probabilistic environments with a high branching factor of state transitions. Backward view approaches trace back from known goal states Edwards:18 () or from high reward states Goyal:18 (). However, a step-by-step backward model has to be learned.
We propose learning from a backward view, which is constructed from a forward model. The forward model predicts the return, while the backward analysis identifies states and actions which have caused the return. We apply Long Short-Term Memory (LSTM) Hochreiter:91 (); Hochreiter:97 () to predict the return of an episode. LSTM was already used in reinforcement learning Schmidhuber:15 () for advantage learning Bakker:02 () and learning policies Hausknecht:15 (); Mnih:16 (); Heess:16 (). However, sensitivity analysis by “backpropagation through a model” Munro:87 (); Robinson:89 (); RobinsonFallside:89 (); Bakker:07 () has major drawbacks: local minima, instabilities, exploding or vanishing gradients in the world model, the need for proper exploration, and the fact that actions are regarded only by their sensitivity but not by their contribution (relevance) Hochreiter:90 (); Schmidhuber:90diff ().
Since sensitivity analysis substantially hinders learning, we use contribution analysis for backward analysis like contribution-propagation Landecker:13 (), contribution approach Poulin:06 (), excitation backprop Zhang:16 (), layer-wise relevance propagation (LRP) Bach:15 (), Taylor decomposition Bach:15 (); Montavon:17taylor (), or integrated gradients (IG) Sundararajan:17 (). Using contribution analysis, a predicted return can be decomposed into contributions along the state-action sequence. Substituting the prediction by the actual return, we obtain a redistributed reward leading to a new MDP with the same optimal policies as for the original MDP. Redistributing the reward is fundamentally different from reward shaping Ng:99 (); Wiewiora:03 (), which changes the reward as a function of states but not of actions. Reward shaping and “look-back advice” Wiewiora:03icml () both keep the original reward, which may still have long delays that cause an exponential slowdown of learning. We propose RUDDER, which performs reward redistribution by return decomposition and, therefore, overcomes problems of TD and MC stemming from delayed rewards. RUDDER vastly decreases the variance of MC and largely avoids the exponentially slow bias corrections of TD; for an optimal return decomposition, TD is even unbiased.
2 Bias-Variance for MDP Estimates
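We perform a bias-variance analysis for temporal difference (TD) and Monte Carlo (MC) estimators of the action-value function. A finite Markov decision process (MDP) $\mathcal{P}$ is a 6-tuple $\mathcal{P} = (\mathcal{S}, \mathcal{A}, \mathcal{R}, p, \pi, \gamma)$ of finite sets $\mathcal{S}$ of states $s$ (random variable $S_t$ at time $t$), $\mathcal{A}$ of actions $a$ (random variable $A_t$), and $\mathcal{R}$ of rewards $r$ (random variable $R_{t+1}$). Furthermore, $\mathcal{P}$ has transition-reward distributions $p(S_{t+1}=s', R_{t+1}=r \mid S_t=s, A_t=a)$ conditioned on state-actions, a policy given as action distributions $\pi(A_{t+1}=a' \mid S_{t+1}=s')$ conditioned on states, and a discount factor $\gamma \in [0,1]$. The marginals are $p(r \mid s,a) = \sum_{s'} p(s',r \mid s,a)$ and $p(s' \mid s,a) = \sum_{r} p(s',r \mid s,a)$. The expected reward is $r(s,a) = \sum_{r} r \, p(r \mid s,a)$. The return $G_t$ is $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$. We often consider finite horizon MDPs with sequence length $T$ and $\gamma = 1$, giving $G_t = \sum_{k=0}^{T-t} R_{t+k+1}$. The action-value function $q^\pi(s,a)$ for policy $\pi$ is $q^\pi(s,a) = \mathrm{E}_\pi[G_t \mid S_t=s, A_t=a]$. The goal of learning is to maximize the expected return at time $t=0$, that is, $v_0^\pi = \mathrm{E}_\pi[G_0]$.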
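Bias-Variance Analysis for MDP Estimates. MC estimates $q^\pi(s,a)$ by an arithmetic mean of the return, while TD methods like SARSA or Q-learning estimate $q^\pi(s,a)$ by an exponential average of the return. When using Monte Carlo for learning a policy, we use an exponential average, too, since the policy steadily changes. The $i$th update of the action-value $q$ at state-action $(s_t,a_t)$ is $q^{i+1}(s_t,a_t) = q^{i}(s_t,a_t) + \alpha \big( \sum_{\tau=t}^{T} r_{\tau+1} - q^{i}(s_t,a_t) \big)$ (constant-$\alpha$ MC). Assume $n$ samples $\{X_1, \ldots, X_n\}$ from a distribution with mean $\mu$ and variance $\sigma^2$. For these samples, we compute bias and variance of the arithmetic mean $\hat{\mu}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$ and the exponential average $\hat{\mu}_n^{\alpha} = \alpha \sum_{i=1}^{n} (1-\alpha)^{n-i} X_i + (1-\alpha)^n \mu_0$ with $\mu_0$ as initial value and $\alpha \in (0,1)$. We obtain $\mathrm{bias}(\hat{\mu}_n) = 0$ and $\mathrm{var}(\hat{\mu}_n) = \sigma^2 / n$ as well as $\mathrm{bias}(\hat{\mu}_n^{\alpha}) = (1-\alpha)^n (\mu_0 - \mu)$ and $\mathrm{var}(\hat{\mu}_n^{\alpha}) = \sigma^2 \, \alpha \big(1 - (1-\alpha)^{2n}\big) / (2-\alpha)$ (see Appendix A1.2.1 for more details). Both variances are proportional to $\sigma^2$, which is the variance when sampling a return from the MDP $\mathcal{P}$.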
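The bias and variance of the two averages can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes i.i.d. samples and the unrolled update $\hat{\mu}_i = (1-\alpha)\hat{\mu}_{i-1} + \alpha X_i$, and all function names are hypothetical. It compares closed-form expressions with the values obtained directly from the weights of the unrolled exponential average.

```python
def arithmetic_mean_stats(n, sigma2):
    """Bias and variance of the arithmetic mean of n i.i.d. samples."""
    return 0.0, sigma2 / n


def exponential_average_stats(n, alpha, mu, sigma2, mu0):
    """Closed-form bias and variance of the exponential average after n updates."""
    bias = (1.0 - alpha) ** n * (mu0 - mu)
    variance = sigma2 * alpha * (1.0 - (1.0 - alpha) ** (2 * n)) / (2.0 - alpha)
    return bias, variance


def exponential_average_stats_from_weights(n, alpha, mu, sigma2, mu0):
    """Same quantities from the explicit weights of the unrolled update
    mu_i = (1 - alpha) * mu_{i-1} + alpha * X_i (assumed update rule)."""
    weights = [alpha * (1.0 - alpha) ** (n - i) for i in range(1, n + 1)]
    initial_weight = (1.0 - alpha) ** n
    expectation = sum(weights) * mu + initial_weight * mu0
    variance = sigma2 * sum(w * w for w in weights)
    return expectation - mu, variance


if __name__ == "__main__":
    print(arithmetic_mean_stats(20, sigma2=4.0))
    print(exponential_average_stats(20, alpha=0.1, mu=5.0, sigma2=4.0, mu0=0.0))
```

Under these assumptions the exponential average trades a bias that decays geometrically in the number of updates for a variance that does not vanish as the number of samples grows.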
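Using $\mathrm{E}_{s',a'}\big[f(s',a')\big] = \sum_{s'} p(s' \mid s,a) \sum_{a'} \pi(a' \mid s') f(s',a')$, and analogously $\mathrm{Var}_{s',a'}$ and $\mathrm{Var}_{r}$, the next theorem gives mean and variance of sampling returns from an MDP.
Theorem 1.
The mean and variance of sampled returns from an MDP are
(1) $\mathrm{E}_\pi[G_t \mid s,a] = q^\pi(s,a)$, $\quad \mathrm{Var}_\pi[G_t \mid s,a] = \mathrm{Var}_{r}[r \mid s,a] + \mathrm{Var}_{s',a'}\big[q^\pi(s',a') \mid s,a\big] + \mathrm{E}_{s',a'}\big[\mathrm{Var}_\pi[G_{t+1} \mid s',a'] \mid s,a\big]$.
The proof is given after Theorem A1 in the appendix. The theorem extends the deterministic reward case Sobel:82 (); Tamar:12 (). The variance consists of three parts: (i) The immediate variance stemming from the probabilistic reward $p(r \mid s,a)$. (ii) The local variance caused by probabilistic state transitions and probabilistic policy. (iii) The expected variance of the next Q-values, which is zero for TD since it replaces $G_{t+1}$ by a fixed $\hat{q}^\pi(s',a')$. Therefore TD has less variance than MC, which uses the complete future return. See Appendix A1.2.2 for more details.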
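The three-part variance decomposition described above can be illustrated on a toy example. The sketch below is an assumption-laden illustration, not the paper's experiment: it uses a hypothetical two-step process with no discounting, a fixed policy folded into the transition probabilities, and rewards that are independent of the next state given the current state-action. It computes mean and variance of the return once by the recursion (immediate variance, plus variance of the next Q-values, plus expected variance of the next returns) and once by enumerating all trajectories.

```python
# Hypothetical toy model: each node is a state-action pair with a reward
# distribution and a next-node distribution (None means the episode ends).
MODEL = {
    "start": {"rewards": [(1.0, 0.0)], "next": [(0.5, "left"), (0.5, "right")]},
    "left": {"rewards": [(0.5, 0.0), (0.5, 2.0)], "next": [(1.0, None)]},
    "right": {"rewards": [(1.0, 4.0)], "next": [(1.0, None)]},
}


def moments(dist):
    """Mean and variance of a discrete distribution given as (prob, value) pairs."""
    mean = sum(p * x for p, x in dist)
    var = sum(p * (x - mean) ** 2 for p, x in dist)
    return mean, var


def return_mean_var(node):
    """Recursive mean and variance of the undiscounted return from `node`."""
    if node is None:
        return 0.0, 0.0
    reward_mean, reward_var = moments(MODEL[node]["rewards"])
    children = [(p, return_mean_var(nxt)) for p, nxt in MODEL[node]["next"]]
    next_q_mean, next_q_var = moments([(p, m) for p, (m, _) in children])
    expected_next_var = sum(p * v for p, (_, v) in children)
    return reward_mean + next_q_mean, reward_var + next_q_var + expected_next_var


def enumerate_returns(node):
    """All (probability, return) pairs of trajectories starting at `node`."""
    if node is None:
        return [(1.0, 0.0)]
    outcomes = []
    for p_r, r in MODEL[node]["rewards"]:
        for p_n, nxt in MODEL[node]["next"]:
            for p_g, g in enumerate_returns(nxt):
                outcomes.append((p_r * p_n * p_g, r + g))
    return outcomes


if __name__ == "__main__":
    print(return_mean_var("start"))
    print(moments(enumerate_returns("start")))
```

Both computations agree on this toy model, which makes it easy to see how variance at a late step propagates to earlier state-actions.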
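Delayed Reward Aggravates Learning. The $i$th temporal difference update with learning rate $\alpha$ of the action-value $q(s_t,a_t)$ is
(2) $q^{i+1}(s_t,a_t) = q^{i}(s_t,a_t) + \alpha \big( r_{t+1} + \max_{a'} q^{i}(s_{t+1},a') - q^{i}(s_t,a_t) \big)$
with $\max_{a'}$ (Q-learning), $\sum_{a'} \pi(a' \mid s_{t+1})$ (expected SARSA), or a sample $a'$ from $\pi(a' \mid s_{t+1})$ (SARSA) applied to $q^{i}(s_{t+1},a')$. The next theorem states that TD has an exponential decay for Q-value updates even for eligibility traces Klopf:72 (); BartoSutton:81 (); Sutton:81towards (); Singh:96 ().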
Theorem 2.
For initialization and delayed reward with for , receives its first update not earlier than at episode via , where is the reward of episode 1. Eligibility traces with lead to an exponential decay of when the reward is propagated steps back.
The proof is given after Theorem A2 in the appendix. To correct the bias by a certain amount, TD requires exponentially many updates with the number of delay steps.
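The slow backward propagation of a delayed reward under one-step TD can be reproduced in a few lines. The following sketch makes several assumptions that are not from the paper: a deterministic chain with a single action, zero-initialized values, no discounting, a reward of 1 only on the last transition, and updates applied in forward order within each episode. The names `td_chain` and `first_nonzero_episode` are hypothetical.

```python
def td_chain(n_states, alpha, n_episodes):
    """One-step TD on a deterministic chain with a single reward of 1 at the end.

    Returns the estimate for the first state after each episode.
    """
    q = [0.0] * (n_states + 1)  # q[n_states] is the terminal state, fixed at 0
    history = []
    for _ in range(n_episodes):
        for s in range(n_states):
            reward = 1.0 if s == n_states - 1 else 0.0
            q[s] += alpha * (reward + q[s + 1] - q[s])
        history.append(q[0])
    return history


def first_nonzero_episode(history):
    """1-based index of the first episode after which the estimate is nonzero."""
    for episode, value in enumerate(history, start=1):
        if value != 0.0:
            return episode
    return None


if __name__ == "__main__":
    values = td_chain(n_states=5, alpha=0.5, n_episodes=10)
    print(first_nonzero_episode(values), values)
```

In this setting the first state is updated for the first time only in the episode whose index equals the number of delay steps, and that first update is scaled by the learning rate raised to the same power.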
For Monte Carlo the variance of a single delayed reward can increase the variance of action-values of all previously visited state-actions. We define the “on-site” variance
(3) 
is the vector with value at position and the transition matrix from states to with entries at position . For finite time horizon, the “backward induction algorithm” Puterman:90 (); Puterman:05 () gives with , , and rowstochastic matrix :
(4) 
where we define and . We are interested in the number of actionvalues which variances are affected through the increase of the variance of a single delayed reward. Let be the number of all states that are reachable after time steps of an episode. Let be the random average connectivity of a state in to states in . Let be number of states in that are affected by for with (only one actionvalue with delayed reward at time ). Next theorem says that the onsite variance can have large effects on the variance of actionvalues of all previously visited stateactions, which number can grow exponentially.
Theorem 3.
For , onsite variance at step contributes to by the term , where . The number of states affected by is .
The proof can be found after Theorem A3 in the appendix. For small , the number of states affected by on-site variance at step grows exponentially with . For large and after some time , the number of states affected by grows linearly (cf. Corollary A1). Consequently, we aim for decreasing the on-site variance for large , in order to reduce the overall variance. In summary, delayed rewards lead to exponentially slow corrections of biases of temporal difference (TD) and can increase exponentially many variances of Monte Carlo (MC) action-value estimates, where the growth is in both cases exponential in the number of delay steps.
3 Return Decomposition and Reward Redistribution
A Markov decision process (MDP) $\tilde{\mathcal{P}}$ is state-enriched compared to an MDP $\mathcal{P}$ if $\tilde{\mathcal{P}}$ has the same states, actions, transition probabilities, and reward probabilities as $\mathcal{P}$ but with additional information in its states. Thus, $\mathcal{P}$ is a homomorphic image of $\tilde{\mathcal{P}}$ with the same actions. Therefore each optimal policy $\tilde{\pi}^*$ of $\tilde{\mathcal{P}}$ has an equivalent optimal policy $\pi^*$ of $\mathcal{P}$, and vice versa, with the same optimal return Ravindran:01 (); Ravindran:03 (). These properties are known from state abstraction and aggregation Li:06 () and from bisimulations Givan:03 (). For more details see Appendix A1.3.1. Two Markov decision processes $\tilde{\mathcal{P}}$ and $\mathcal{P}$ are return-equivalent if they differ only in $p(\tilde{r} \mid s,a)$ and $p(r \mid s,a)$ but for each policy $\pi$ they have the same expected return at $t=0$: $\tilde{v}_0^\pi = v_0^\pi$. Return-equivalent decision processes have the same optimal policies. For more details see Appendix A1.3.1.
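We assume an MDP $\mathcal{P}$ with immediate reward, which is transformed to a state-enriched MDP $\tilde{\mathcal{P}}$ with delayed reward, where the return is given as the reward at the end of the sequence. The transformed delayed state-enriched MDP has reward $\tilde{r}_t = 0$, $t \leq T$, and $\tilde{r}_{T+1} = \sum_{k=0}^{T} R_{k+1}$. The states are enriched by $\rho$, which records the accumulated already received rewards, therefore $\tilde{s}_t = (s_t, \rho_t)$, where $\rho_t = \sum_{k=0}^{t-1} r_{k+1}$. We show in Proposition A1 that $\tilde{q}^{\tilde{\pi}}(\tilde{s},a) = q^{\pi}(s,a) + \sum_{k=0}^{t-1} r_{k+1}$ for $\tilde{\pi}(a \mid \tilde{s}) = \pi(a \mid s)$. Thus, each immediate reward MDP can be transformed into a delayed reward MDP without changing the optimal policies.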
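The transformation from immediate to delayed reward can be sketched for a single episode. This is a minimal illustration under assumptions: the function name `to_delayed_reward` and the list-based episode format are hypothetical, and the enrichment simply pairs each state with the sum of rewards received before that time step.

```python
def to_delayed_reward(states, rewards):
    """Turn one episode with immediate rewards into its delayed-reward version.

    states:  [s_0, ..., s_T]
    rewards: [r_1, ..., r_{T+1}], where rewards[t] follows states[t]
    Returns the enriched states (s_t, accumulated reward before t) and the
    delayed rewards, which are zero except for the final one.
    """
    enriched_states = []
    accumulated = 0.0
    for state, reward in zip(states, rewards):
        enriched_states.append((state, accumulated))
        accumulated += reward
    delayed_rewards = [0.0] * (len(rewards) - 1) + [accumulated]
    return enriched_states, delayed_rewards


if __name__ == "__main__":
    print(to_delayed_reward(["a", "b", "c"], [1.0, 2.0, 3.0]))
```

The enriched state carries exactly the information that the delayed reward no longer provides step by step, which is why the transformation leaves the ranking of policies unchanged.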
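Next we consider the opposite direction, where the delayed reward MDP $\tilde{\mathcal{P}}$ is given and we want to find an immediate reward MDP $\mathcal{P}$. $\mathcal{P}$ should be return-equivalent to $\tilde{\mathcal{P}}$ and differ from $\tilde{\mathcal{P}}$ only by its reward distributions. We have to redistribute the final reward, which is the return, to previous time steps; therefore we have to decompose the return into a sum of rewards at different time steps. To allow for a return decomposition, we predict the return $\tilde{r}_{T+1}$ by a function $g$ using the state-action sequence: $g((s,a)_{0:T}) = \tilde{r}_{T+1}$, where $(s,a)_{0:T}$ is the state-action sequence from $t=0$ to $t=T$. In a next step, we decompose $g$ into a sum: $g((s,a)_{0:T}) = \sum_{t=0}^{T} h_t$, where $h_t$ is the prediction contribution from the backward analysis. Since $\tilde{\mathcal{P}}$ is an MDP, the reward can be predicted solely from $(s_T, a_T)$. To avoid this Markov property in the input sequence, we replace $s$ by a difference $\Delta(s,s')$ between state $s$ and its successor $s'$. The difference $\Delta$ is assumed to assure statistical independence of $\Delta(s_t,s_{t+1})$ from other components in the sequence $(a,\Delta)_{0:T}$. The function $g$ is decomposed by contribution analysis into a sum of $h$s: $g((a,\Delta)_{0:T}) = \sum_{t=0}^{T} h(a_t, \Delta(s_t,s_{t+1}))$. The actual reward redistribution is $r_{t+1} = \tilde{r}_{T+1} \, h(a_t, \Delta(s_t,s_{t+1})) / g((a,\Delta)_{0:T})$ to ensure $\sum_{t=0}^{T} r_{t+1} = \tilde{r}_{T+1}$.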
If for partial sums holds, then the return decomposition is optimal. The rewards for are and . The term introduces variance in .
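The redistribution step can be sketched as follows. This is an illustration under stated assumptions rather than the released implementation: the per-step contributions are taken as given (for example, from a contribution analysis of a return predictor), and they are rescaled proportionally so that the redistributed rewards sum to the actually observed return. The function name `redistribute_reward` is hypothetical.

```python
def redistribute_reward(contributions, actual_return):
    """Rescale per-step contributions so that they sum to the observed return.

    contributions: list of contributions h_t of each time step to the
                   predicted return (assumed to be given)
    actual_return: the delayed reward actually received at episode end
    """
    predicted_return = sum(contributions)
    if predicted_return == 0.0:
        raise ValueError("predicted return is zero; proportional rescaling is undefined")
    scale = actual_return / predicted_return
    return [scale * h for h in contributions]


if __name__ == "__main__":
    rewards = redistribute_reward([1.0, 3.0, 0.0, 4.0], actual_return=16.0)
    print(rewards, sum(rewards))
```

Because only the scale changes, time steps with zero contribution still receive zero reward, and the return of the episode is preserved.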
Theorem 4.
The MDP based on a redistributed reward (I) has the same optimal policies as of the delayed reward, and (II) for an optimal return decomposition, the values are given by .
The proof can be found after Theorem A4 in the appendix. In particular, when starting with zero initialized values, TD learning of is not biased at the beginning. For policy gradients with eligibility traces using for Sutton:17book (), we have the expected updates , where is replaced during learning by a sample from which is the redistributed reward for an episode.
RUDDER: Return Decomposition using LSTM. We introduce RUDDER, “RetUrn Decomposition for DElayed Rewards”, which performs return decomposition using a Long Short-Term Memory (LSTM) network for redistributing the original reward. RUDDER consists of (I) a safe exploration strategy, (II) a lessons replay buffer, and, most importantly, (III) an LSTM with contribution analysis for return decomposition.
(I) Safe exploration. Exploration strategies should assure that LSTM receives training data with delayed rewards. Toward this end we introduce a new exploration strategy which initiates an exploration sequence at a certain time in the episode to discover delayed rewards. To avoid an early stop of the exploration sequence, we perform a safe exploration which avoids actions associated with low values. Low values hint at states with zero future reward where the agent gets stuck. Exploration parameters are starting time, length, and the action selection strategy with safety constraints.
(II) Lessons replay buffer. The lessons replay buffer is an episodic memory, which has been used for episodic control Lengyel:08 () and for episodic backward update to efficiently propagate delayed rewards Lee:18 (). If RUDDER safe exploration discovers an episode with unexpected delayed reward, it is secured in a lessons replay buffer Lin:93 (). Unexpected is indicated by a large prediction error of the LSTM. Episodes with larger error are more often sampled from the lessons replay buffer similar to prioritized experience replay.
(III) LSTM and contribution analysis. LSTM networks Hochreiter:91 (); Hochreiter:97 () are used to predict the return from an input sequence. LSTM solves the vanishing gradient problem Hochreiter:91 (); Hochreiter:00 (), which severely impedes credit assignment in recurrent neural networks, i.e. the correct identification of input events that are relevant but far in the past. LSTM backward analysis is done through contribution analysis like layer-wise relevance propagation (LRP) Bach:15 (), Taylor decomposition Bach:15 (); Montavon:17taylor (), or integrated gradients (IG) Sundararajan:17 (). These methods identify the contributions of the inputs to the final prediction and, thereby, compute the return decomposition.
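To make the idea of contribution analysis concrete, the sketch below approximates integrated gradients with a midpoint Riemann sum. It is a toy illustration only: instead of an LSTM it uses a hypothetical differentiable model (a weighted sum of squares) with a hand-written gradient, and the zero baseline is an assumption. The point is the completeness property: the contributions add up to the difference between the prediction at the input and at the baseline.

```python
WEIGHTS = [1.0, -2.0, 0.5]  # hypothetical model parameters


def toy_prediction(x):
    """Hypothetical stand-in for a return predictor: weighted sum of squares."""
    return sum(w * xi * xi for w, xi in zip(WEIGHTS, x))


def toy_gradient(x):
    """Analytic gradient of toy_prediction with respect to its inputs."""
    return [2.0 * w * xi for w, xi in zip(WEIGHTS, x)]


def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Midpoint Riemann-sum approximation of integrated gradients."""
    attributions = [0.0] * len(x)
    for k in range(steps):
        t = (k + 0.5) / steps
        point = [b + t * (xi - b) for xi, b in zip(x, baseline)]
        grad = grad_fn(point)
        for i in range(len(x)):
            attributions[i] += grad[i] * (x[i] - baseline[i]) / steps
    return attributions


if __name__ == "__main__":
    x = [1.0, 2.0, -3.0]
    baseline = [0.0, 0.0, 0.0]
    print(integrated_gradients(toy_gradient, x, baseline))
    print(toy_prediction(x) - toy_prediction(baseline))
```

For a sequence model, each input component would correspond to one time step, so the attributions directly provide a per-step decomposition of the predicted return.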
The LSTM return decomposition is optimal if LSTM predicts at every time step the expected return. To push LSTM toward optimal return decomposition, we introduce continuous return predictions as auxiliary tasks, where the LSTM has to predict the expected return at every time step. Hyperparameters are when and how often LSTM predicts and how continuous prediction errors are weighted. Strictly monotonic LSTM architecture (see Appendix A4.3.1) can also ensure that LSTM decomposition is optimal.
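The lessons replay buffer from item (II) can be sketched as a small episodic memory with error-proportional sampling. Everything in this sketch is an assumption made for illustration: the class name `LessonsBuffer`, the eviction of the episode with the smallest prediction error when the buffer is full, and sampling with probability proportional to the absolute prediction error are not specified in the text above.

```python
import random


class LessonsBuffer:
    """Hypothetical episodic buffer that favors episodes with large prediction error."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.lessons = []  # list of (absolute prediction error, episode)
        self.rng = random.Random(seed)

    def add(self, episode, prediction_error):
        """Store an episode; if full, drop the episode with the smallest error."""
        self.lessons.append((abs(prediction_error), episode))
        if len(self.lessons) > self.capacity:
            smallest = min(self.lessons, key=lambda item: item[0])
            self.lessons.remove(smallest)

    def sample(self):
        """Sample one episode with probability proportional to its error."""
        if not self.lessons:
            raise IndexError("buffer is empty")
        errors = [error for error, _ in self.lessons]
        if sum(errors) == 0.0:
            return self.rng.choice(self.lessons)[1]  # fall back to uniform sampling
        return self.rng.choices(self.lessons, weights=errors, k=1)[0][1]


if __name__ == "__main__":
    buffer = LessonsBuffer(capacity=2)
    buffer.add("episode-1", 0.1)
    buffer.add("episode-2", 5.0)
    buffer.add("episode-3", -1.0)
    print(buffer.sample())
```

Other design choices, such as recomputing errors after each LSTM update, would fit the same interface.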
4 Experiments
We set $\gamma = 1$ for delayed rewards in MDPs with finite time horizon or absorbing states to avoid discounting of the delayed rewards. Discount factors close to $1$ have been confirmed to be suited to long delays by meta-gradient reinforcement learning Xu:18 ().
Grid World: RUDDER is tested on a grid world with delayed reward at the end of an episode. The MDP should illustrate an environment where an agent must run to a bomb to defuse it and afterwards run away as it may still explode. The sooner the agent defuses the bomb, the further away it can run. An alternative strategy is to directly run away, which, however, leads to less return than defusing the bomb. The Grid World task consists of a quadratic grid with bomb at coordinate and start at , where is the delay of the task. The agent can move in four different directions (up, right, left, and down), where only moves that keep the agent on the grid are allowed. The episode finishes after steps. At the end of the episode the agent receives a reward of 1000 if it has visited bomb. At each time step the agent receives an immediate reward of , where is a factor that depends on the chosen action, is the current time step, and is the Hamming distance to bomb. Each move the agent reduces the Hamming distance to bomb is penalized by the immediate reward using . Each move the agent increases the Hamming distance to bomb is rewarded by the immediate reward using . Due to this distracting reward the agent is forced to learn the values precisely, since the immediate reward hints at a suboptimal policy. This is because the learning process has to determine that visiting bomb leads to larger values than increasing the distance to bomb.
To investigate how the delay affects bias and variance, Q-values are estimated by TD and MC for an $\epsilon$-greedy optimal policy to assure that all states are visited. After computing the true Q-values by backward induction, we compare the bias, variance, and mean squared error (MSE) of the estimators for the MDP with delayed reward and the new MDP obtained by RUDDER with optimal reward redistribution. Figure 1 shows that RUDDER for MC estimators has a smaller number of Q-values with high variance than the original MDP. Figure 1 also shows that RUDDER for TD estimators corrects the bias faster than TD for the original MDP. After these policy evaluation experiments with a fixed policy, we move on to learning an optimal policy. For learning the optimal policy, we compare Q-learning, Monte Carlo (MC), and Monte Carlo Tree Search (MCTS) on the grid world, where sample updates are used for Q-learning and MC Sutton:17book (). Figure 2 shows the number of episodes required by different methods to learn a policy that achieves 90% of the return of the optimal policy for different delays. Optimal reward redistribution speeds up learning exponentially. More information is available in Appendix A5.1.1.
Charge-Discharge environment: We test RUDDER on another task, the Charge-Discharge environment, which has two states: charged / discharged and two actions charge / discharge . The deterministic reward is , and . The reward is accumulated for the whole episode and given only at time with , which determines the maximal delay of a reward. The deterministic state transitions are and . The optimal policy alternates between charging and discharging to accumulate a reward of 10 every other time step.
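A minimal simulation of an environment of this kind is sketched below. Several details are assumptions made only for illustration: discharging in the charged state yields 10, all other state-action pairs yield 0, actions that do not apply leave the state unchanged, and the episode starts in the discharged state. Only the accumulation of the reward until the end of the episode and the reward of 10 every other step under the alternating policy are taken from the description above.

```python
def run_charge_discharge(actions, start_state="discharged"):
    """Simulate one episode and return the delayed reward sequence.

    Assumed dynamics (hypothetical): 'charge' in the discharged state moves to
    charged, 'discharge' in the charged state moves to discharged and earns 10;
    every other state-action pair earns 0 and keeps the state.
    """
    state = start_state
    accumulated = 0.0
    rewards = []
    for t, action in enumerate(actions):
        if state == "charged" and action == "discharge":
            accumulated += 10.0
            state = "discharged"
        elif state == "discharged" and action == "charge":
            state = "charged"
        is_last_step = t == len(actions) - 1
        rewards.append(accumulated if is_last_step else 0.0)
    return rewards


if __name__ == "__main__":
    alternating = ["charge", "discharge"] * 3
    print(run_charge_discharge(alternating))
```

With the alternating policy all reward arrives in the last step, so a learner sees no feedback about individual charge and discharge actions until the episode ends.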
For this environment, RUDDER is realized as a monotonic LSTM with layer-wise relevance propagation (LRP) for the backward analysis (see Appendix A5.1.2 for more details). The reward redistribution provided by RUDDER served to learn a policy by Q-learning. We compare RUDDER with Q-learning, MC, and MCTS, where Q-learning and MC use sample updates. The results are shown in Figure 2. Reward redistribution requires observing an exponentially smaller number of states than Q-learning, MC, and MCTS to learn the optimal policy.
Atari Games: We investigated the Atari games supported by the Arcade Learning Environment Bellemare:13 () and OpenAI Gym Brockman:16 () for games with delayed reward. Requirements for proper games to demonstrate performance on delayed reward environments are: (I) large delay between an action and the resulting reward, (II) no distractions due to other rewards or changing characteristics of the environment, (III) no skills to be learned to receive the delayed reward. The requirements were met by Bowling and Venture. In Bowling the only reward of the game is given at the end of each turn, i.e. more than frames after the first relevant action. In Venture the first reward has a minimum delay of frames from the first relevant action. Figure 3 shows that RUDDER learns faster than Rainbow Hessel:17 (), Prioritized DDQN Schaul:15 (), Noisy DQN Fortunato:18 (), Dueling DDQN Wang:16 (), DQN Mnih:15 (), C51 (Distributional DQN) Bellemare:17 (), DDQN Hasselt:10 (), A3C Mnih:16 (), and Ape-X DQN Horgan:18 (). RUDDER sets a new state-of-the-art score in Bowling after 12M environment frames. Thus, RUDDER outperforms its competitors in only of their training time, as shown in Table 1. For more details see Appendix A5.2.
Algorithm            Frames     Bowling %   Bowling raw   Venture %   Venture raw
RUDDER               12M        62.10       108.55        96.55       1,147
Rainbow              200M       5.01        30            0.46        5.5
Prioritized DDQN     200M       28.71       62.6          72.67       863
Noisy DQN            200M       39.39       77.3          0           0
Dueling DDQN         200M       30.81       65.5          41.85       497
DQN                  200M       19.84       50.4          13.73       163
Distributional DQN   200M       37.06       74.1          93.22       1,107
DDQN                 200M       32.7        68.1          8.25        98
Ape-X DQN            22,800M    -4          17.6          152.67      1,813
Random               –          0           23.1          0           0
Human                –          100         160.7         100         1,187
RUDDER Implementation for Atari Games. We implemented RUDDER for the proximal policy optimization (PPO) algorithm Schulman:17 (). For policy gradients the expected updates are , where is replaced during learning by the return or its expectation. RUDDER policy gradients replace by the redistributed reward assuming an optimal return decomposition. With eligibility traces using for Sutton:17book (), we have the rewards with and the expected updates . We use integrated gradients Sundararajan:17 () for the backward analysis of RUDDER. The LSTM prediction is decomposed by integrated gradients: . For Atari games, is defined as the pixelwise difference of two consecutive frames. To make static objects visible, we augment the input with the current frame. For more implementation details see Appendix A5.2. Source code is available at https://github.com/mljku/baselinesrudder.
Evaluation Methodology. Agents were trained for 12M frames with the no-op starting condition, i.e. a random number of up to 30 no-operation actions at the start of a game. Training episodes are terminated when a life is lost or at a maximum of 108K frames. After training, the best model was selected based on training data and evaluated on 200 games Hasselt:16 (). For comparison across games, the normalized human-percentage scores according to Bellemare:17 () are reported.
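The normalized scores can be reproduced from raw scores with a short helper. The formula below, which maps the random agent to 0% and the human reference to 100%, is an assumption; it is consistent with the Bowling entries of the results table (random 23.1, human 160.7). The function name `human_normalized_score` is hypothetical.

```python
def human_normalized_score(raw, random_score, human_score):
    """Percentage score where the random agent is 0 and the human reference is 100."""
    if human_score == random_score:
        raise ValueError("human and random scores must differ")
    return 100.0 * (raw - random_score) / (human_score - random_score)


if __name__ == "__main__":
    # Bowling: raw score 108.55, random 23.1, human 160.7 (values from the table)
    print(round(human_normalized_score(108.55, 23.1, 160.7), 2))
```

A raw score below the random baseline yields a negative percentage under this normalization.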
Visual Confirmation of Detecting Relevant Events by Reward Redistribution. We visually confirm a meaningful and helpful redistribution of reward in both Bowling and Venture during training. As illustrated in Figure 4, RUDDER is capable of redistributing a reward to key events in a game, drastically shortening the delay of the reward and quickly steering the agent toward good policies. Furthermore, it enriches sequences that were sparse in reward with a dense reward signal. Video demonstrations are available at https://goo.gl/EQerZV.
5 Discussion and Conclusion
Exploration is most critical, since discovering delayed rewards is the first step to exploit them.
Human expert episodes are an alternative to exploration and can serve to fill the lessons replay buffer. Learning can be sped up considerably when LSTM identifies human key actions. Return decomposition will reward human key actions even for episodes with low return since other actions that thwart high returns receive negative reward. Using human demonstrations in reinforcement learning led to a huge improvement on some Atari games like Montezuma’s Revenge Pohlen:18 (); Aytar:18 ().
Conclusion. We have shown that for finite Markov decision processes with delayed rewards TD corrects biases exponentially slowly and MC can increase exponentially many variances of estimates, both in the number of delay steps. We have introduced RUDDER, a return decomposition method, which creates a new MDP that keeps the optimal policies but whose redistributed rewards do not have delays. In the optimal case TD for the new MDP is unbiased. On two artificial tasks we demonstrated that RUDDER is exponentially faster than TD, MC, and MC Tree Search (MCTS). For the Atari game Venture with delayed reward, RUDDER outperforms all methods except Ape-X DQN in much less learning time. For the Atari game Bowling with delayed reward, RUDDER improves the state of the art and outperforms PPO, Rainbow, and Ape-X with less learning time.
Acknowledgments
This work was supported by NVIDIA Corporation, Bayer AG with Research Agreement 09/2017, Merck KGaA, Zalando SE with Research Agreement 01/2016, Audi.JKU Deep Learning Center, Audi Electronic Venture GmbH, Janssen Pharmaceutica, IWT research grant IWT150865 (Exaptation), LIT grant LSTM4Drive, and FWF grant P 28660N31.
References
References are provided in Appendix A6.
Contents
 1 Introduction
 2 Bias-Variance for MDP Estimates
 3 Return Decomposition and Reward Redistribution
 4 Experiments
 5 Discussion and Conclusion
 A1 Reinforcement Learning and Credit Assignment
 A2 Related Reinforcement Topics
 A3 Backward Analysis Through Contribution Analysis
 A4 Long Short-Term Memory (LSTM)
 A5 Experiments
 A6 References
Appendix A1 Reinforcement Learning and Credit Assignment
A1.1 Finite Markov Decision Process
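We consider a finite Markov decision process (MDP) $\mathcal{P}$, which is a 6-tuple $\mathcal{P} = (\mathcal{S}, \mathcal{A}, \mathcal{R}, p, \pi, \gamma)$:

$\mathcal{S}$ is a finite set of states; $S_t$ is the random variable for states at time $t$ with values $s \in \mathcal{S}$ and has a discrete probability distribution.

$\mathcal{A}$ is a finite set of actions (sometimes state-dependent $\mathcal{A}(s)$); $A_t$ is the random variable for actions at time $t$ with values $a \in \mathcal{A}$ and has a discrete probability distribution.

$\mathcal{R}$ is a finite set of rewards; $R_{t+1}$ is the random variable for rewards at time $t+1$ with values $r \in \mathcal{R}$ and has a discrete probability distribution.

$p(S_{t+1}=s', R_{t+1}=r \mid S_t=s, A_t=a)$ are the transition-reward distributions over state-rewards conditioned on state-actions,

$\pi(A_{t+1}=a' \mid S_{t+1}=s')$ is the policy, which is a distribution over actions given the state,

$\gamma \in [0,1]$ is the discount factor.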
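At time $t$, the random variables give the states, actions, and rewards of the MDP, while lower-case letters give possible values. At each time $t$, the environment is in some state $s_t = s \in \mathcal{S}$. The agent takes an action $a_t = a \in \mathcal{A}$, which causes a transition of the environment to state $s_{t+1} = s' \in \mathcal{S}$ and a reward $r_{t+1} = r \in \mathcal{R}$ for the agent. Therefore, the MDP creates a sequence
(A1) $(S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \ldots)$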
The marginal probabilities are:
(A2)  
(A3)  
(A4) 
We used a sum convention: goes over all possible values of and , that is, all combinations which fulfill the constraints on and . If is a function of (fully determined by ), then .
We denote expectations

is the expectation where the random variable is an MDP sequence of state, actions, and rewards generated with policy .

is the expectation where the random variable is with values .

is the expectation where the random variable is with values .

is the expectation where the random variable is with values .

is the expectation where the random variables are with values , with values , with values , with values , and with values . If more or less random variables are used, the notation is consistently adapted.
The return $G_t$ is the accumulated reward starting from $t+1$:
(A5) $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
The discount factor $\gamma$ determines how much immediate rewards are favored over more distant rewards. For $\gamma = 0$ the return (the objective) is determined by the largest expected immediate reward, while for $\gamma = 1$ the return is determined by the expected sum of future rewards if the sum exists.
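The effect of the discount factor on the return can be seen in a small computation. The helper below is a sketch for a finite reward sequence; the name `discounted_return` is hypothetical, and the rewards are assumed to be listed in the order in which they are received.

```python
def discounted_return(rewards, gamma):
    """Return of a finite reward sequence, accumulated backwards in time."""
    g = 0.0
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0]
    for gamma in (0.0, 0.5, 1.0):
        print(gamma, discounted_return(rewards, gamma))
```

With a discount factor of 0 only the immediate reward counts, while a discount factor of 1 gives the plain sum of all rewards, which is the setting used for delayed rewards with a finite horizon.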
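State-Value and Action-Value Function. The state-value function $v^\pi(s)$ for policy $\pi$ and state $s$ is defined as
(A6) $v^\pi(s) = \mathrm{E}_\pi[G_t \mid S_t = s]$
Starting at $t = 0$:
(A7) $v_0^\pi = \mathrm{E}_\pi[G_0] = \mathrm{E}_\pi\big[\sum_{t=0}^{\infty} \gamma^t R_{t+1}\big]$
The optimal state-value function $v_*$ and policy $\pi_*$ are
(A8) $v_*(s) = \max_{\pi} v^\pi(s)$
(A9) $\pi_* = \arg\max_{\pi} v^\pi(s)$ for all $s$
The action-value function $q^\pi(s,a)$ for policy $\pi$ is the expected return when starting from $S_t = s$, taking action $A_t = a$, and following policy $\pi$.
(A10) $q^\pi(s,a) = \mathrm{E}_\pi[G_t \mid S_t = s, A_t = a]$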
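The optimal action-value function $q_*$ and policy $\pi_*$ are
(A11) $q_*(s,a) = \max_{\pi} q^\pi(s,a)$
(A12) $\pi_* = \arg\max_{\pi} q^\pi(s,a)$ for all $(s,a)$
The optimal action-value function $q_*$ can be expressed via the optimal value function $v_*$:
(A13) $q_*(s,a) = \sum_{s',r} p(s',r \mid s,a) \big( r + \gamma \, v_*(s') \big)$
Vice versa, the optimal state-value function $v_*$ can be expressed via the optimal action-value function $q_*$ using the optimal policy $\pi_*$:
(A14) $v_*(s) = \max_{a} q_*(s,a) = \sum_{a} \pi_*(a \mid s) \, q_*(s,a)$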
Finite time horizon and no discount. We consider a finite time horizon, that is, we consider only episodes of length $T$ but may receive a reward $R_{T+1}$ at the end of the episode at time $T+1$. The finite time horizon MDP creates a sequence
(A15) $(S_0, A_0, R_1, S_1, A_1, R_2, \ldots, S_T, A_T, R_{T+1})$
Furthermore we do not discount future rewards, that is, we set $\gamma = 1$. The return $G_t$ from time $t$ to $T$ is the sum of rewards:
(A16) $G_t = \sum_{k=0}^{T-t} R_{t+k+1}$
The statevalue function for policy is
(A17) 
and the actionvalue function for policy is
(A18)  
From Bellman equation Eq. (A18), we obtain:
(A19)  
(A20) 
The expected return at time for policy is
(A21)  
The agent may start in a particular starting state $S_0$, which is a random variable. Often $S_0$ has only one value $s_0$.
Learning. The goal of learning is to find the policy that maximizes the expected future discounted reward (the return) if starting in . Thus, the optimal policy is
(A22) 
We consider two learning approaches for values: Monte Carlo and temporal difference.
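Monte Carlo (MC). To estimate $q^\pi(s,a)$, MC computes the arithmetic mean of all returns $G_t$ observed in the data for $S_t = s$ and $A_t = a$. When using Monte Carlo for learning a policy we use an exponential average, too, since the policy steadily changes. The $i$th update of the action-value $q$ at state-action $(s_t,a_t)$ is
(A23) $q^{i+1}(s_t,a_t) = q^{i}(s_t,a_t) + \alpha \big( \sum_{\tau=t}^{T} r_{\tau+1} - q^{i}(s_t,a_t) \big)$
This update is called constant-$\alpha$ MC Sutton:17book ().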
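A constant step-size Monte Carlo update can be written in a few lines. The sketch below assumes an undiscounted finite-horizon episode, a dictionary-based table of action-values initialized at zero, and the hypothetical function name `constant_alpha_mc_update`; the update target is the sum of the rewards observed from the current time step to the end of the episode.

```python
def constant_alpha_mc_update(q, state_action, rewards_from_t, alpha):
    """Move q[state_action] toward the observed undiscounted return.

    q:              dict mapping (state, action) to the current estimate
    rewards_from_t: rewards observed from this time step to the episode end
    alpha:          constant step size in (0, 1]
    """
    observed_return = sum(rewards_from_t)
    old_value = q.get(state_action, 0.0)
    q[state_action] = old_value + alpha * (observed_return - old_value)
    return q[state_action]


if __name__ == "__main__":
    q_table = {}
    # Two episodes in which a delayed reward of 8 follows the same state-action.
    print(constant_alpha_mc_update(q_table, ("s0", "a0"), [0.0, 0.0, 8.0], alpha=0.25))
    print(constant_alpha_mc_update(q_table, ("s0", "a0"), [0.0, 0.0, 8.0], alpha=0.25))
```

Repeating the update with the same return shows the exponential-average behavior: the estimate approaches the return geometrically, and the influence of the initial value never vanishes entirely after finitely many updates.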
Temporal difference (TD) methods. TD updates are based on the Bellman equation. If and have been estimated, the values can be updated according to the Bellman equation: