Fixed-Horizon Temporal Difference Methods for Stable Reinforcement Learning


We explore fixed-horizon temporal difference (TD) methods, reinforcement learning algorithms for a new kind of value function that predicts the sum of rewards over a fixed number of future time steps. To learn the value function for horizon $h$, these algorithms bootstrap from the value function for horizon $h-1$, or some shorter horizon. Because no value function bootstraps from itself, fixed-horizon methods are immune to the stability problems that plague other off-policy TD methods using function approximation (also known as “the deadly triad”). Although fixed-horizon methods require the storage of additional value functions, this gives the agent additional predictive power, while the added complexity can be substantially reduced via parallel updates, shared weights, and $n$-step bootstrapping. We show how to use fixed-horizon value functions to solve reinforcement learning problems competitively with methods such as Q-learning that learn conventional value functions. We also prove convergence of fixed-horizon temporal difference methods with linear and general function approximation. Taken together, our results establish fixed-horizon TD methods as a viable new way of avoiding the stability problems of the deadly triad.

1 Temporal Difference Learning

Temporal difference (TD) methods [20] are an important approach to reinforcement learning (RL) that combine ideas from Monte Carlo estimation and dynamic programming. A key view of TD learning is that it incrementally learns testable, predictive knowledge of the environment [18]. The learned values represent answers to questions about how a signal will accumulate over time, conditioned on a way of behaving. In control tasks, this signal is the reward sequence, and the values represent an arbitrarily long sum of rewards an agent expects to receive when acting greedily with respect to its current predictions.

A TD learning agent’s prediction horizon is specified through a discount factor [17]. This parameter adjusts how quickly to exponentially decay the weight given to later outcomes in a sequence’s sum, and allows computational complexity to be independent of span [23]. It is often set to a constant $\gamma$, but prior work generalizes the discount rate to be a transition-dependent termination function [26]. This allows for (variable) finite-length sums dependent on state transitions, as in episodic tasks.

In this paper, we explore a case of time-dependent discounting, where the sum considers a fixed number of future steps regardless of where the agent ends up. We derive and investigate properties of fixed-horizon TD algorithms, and identify benefits over infinite-horizon algorithms in both prediction and control. Specifically, by storing and updating a separate value function for each horizon, fixed-horizon methods avoid feedback loops when bootstrapping, so that learning is stable even in the presence of function approximation. Fixed-horizon agents can approximate infinite-horizon returns arbitrarily well, expand their set of learned horizons freely (computation permitting), and combine forecasts from multiple horizons to make time-sensitive predictions about rewards. We emphasize our novel focus on predicting a fixed-horizon return from each state as a solution method, regardless of the problem setting. Our algorithms can be applied to both finite-horizon and infinite-horizon MDPs.

2 MDPs and One-step TD Methods

The RL problem is usually modeled as a Markov decision process (MDP), in which an agent interacts with an environment over a sequence of discrete time steps.

At each time step $t$, the agent receives information about the environment’s current state, $S_t \in \mathcal{S}$, where $\mathcal{S}$ is the set of all possible states in the MDP. The agent uses this state information to select an action, $A_t \in \mathcal{A}(S_t)$, where $\mathcal{A}(s)$ is the set of possible actions in state $s$. Based on the current state and the selected action, the agent gets information about the environment’s next state, $S_{t+1} \in \mathcal{S}$, and receives a reward, $R_{t+1} \in \mathbb{R}$, according to the environment model, $p(s', r \mid s, a) = P(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a)$.

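Actions are selected according to a policy, $\pi(a \mid s)$, which gives the probability of taking action $a$ given state $s$. An agent is interested in the return:

$$G_t = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1} \tag{1}$$

where $\gamma \in [0, 1]$ and $T$ is the final time step in an episodic task, and $\gamma \in [0, 1)$ and $T = \infty$ for a continuing task.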

Value-based methods approach the RL problem by computing value functions. In prediction, or policy evaluation, the goal is to accurately estimate a policy’s expected return, and a state-value function, denoted $v_\pi(s)$, is estimated. In control, the goal is to learn a policy that maximizes the expected return, and an action-value function, denoted $q_\pi(s, a)$, is estimated. In each case, the value functions represent a policy’s expected return from state $s$ (and action $a$):

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] \tag{2}$$

$$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] \tag{3}$$

TD methods learn to approximate value functions by expressing Equations 2 and 3 in terms of successor values (the Bellman equations). The Bellman equation for $v_\pi$ is:

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right] \tag{4}$$


Based on Equation 4, one-step TD prediction estimates the return by taking an action in the environment according to a policy, sampling the immediate reward, and bootstrapping off of the current estimated value of the next state for the remainder of the return. The difference between this TD target and the value of the previous state (the TD error) is then computed, and the previous state’s value is updated by taking a step proportional to the TD error with step size $\alpha$:

$$\hat{G}_t = R_{t+1} + \gamma V(S_{t+1})$$

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ \hat{G}_t - V(S_t) \right]$$
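As a minimal sketch of this one-step TD update, assuming a tabular value array indexed by integer states (the function name `td0_update` and its arguments are illustrative and do not come from the paper):

```python
import numpy as np

def td0_update(V, s, r, s_next, alpha, gamma, terminal=False):
    """Apply one tabular one-step TD update to V in place; return the TD error."""
    # Bootstrap from the current estimate of the next state (zero if terminal).
    target = r + (0.0 if terminal else gamma * V[s_next])
    td_error = target - V[s]
    V[s] += alpha * td_error
    return td_error

# Hypothetical transition: state 0 -> state 1 with reward 1.
V = np.zeros(3)
V[1] = 2.0
delta = td0_update(V, s=0, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
```

Here the target is the sampled reward plus the discounted estimate of the next state, and the update moves the previous state's value a fraction `alpha` of the way toward it.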


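Q-learning [25] is arguably the most popular TD method for control. It has a similar update, but because policy improvement is involved, its target is a sample of the Bellman optimality equation for action-values:

$$\hat{G}_t = R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')$$

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ \hat{G}_t - Q(S_t, A_t) \right]$$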


For small finite state spaces where the value function can be stored as a single parameter vector, also known as the tabular case, TD methods are known to converge under mild technical conditions [17]. For large or uncountable state spaces, however, one must use function approximation to represent the value function, which does not have the same convergence guarantees.

Some early examples of divergence with function approximation were provided by Boyan and Moore (1995), who proposed the Grow-Support algorithm to combat divergence. Baird (1995) provided perhaps the most famous example of divergence (discussed below) and proposed residual algorithms. Gordon (1995) proved convergence for a specific class of function approximators known as averagers. Convergence of TD prediction using linear function approximation, first proved by Tsitsiklis and Van Roy (1997), requires the training distribution to be on-policy. This approach was later extended to Q-learning [11], but under relatively stringent conditions. In particular, Assumption (7) of Theorem 1 in Melo and Ribeiro (2007) amounts to a requirement that the behaviour policy is already somewhat close to the optimal policy.

The on-policy limitations of the latter two results reflect what has come to be known as the deadly triad [17]: when using (1) TD methods with (2) function approximation, training on (3) an off-policy data distribution can result in instability and divergence. One response to this problem is to shift the optimization target, as done by Gradient TD (GTD) methods [21, 3]; while provably convergent, GTD algorithms are empirically slow [8, 17]. Another approach is to approximate Fitted Value Iteration (FVI) methods [13] using a target network, as proposed by Mnih et al. (2015). Though lacking convergence guarantees, this approach has been empirically successful. In the next section, we propose fixed-horizon TD methods, an alternative approach to ensuring stability in the presence of the deadly triad.

3 Fixed-horizon TD Methods

A fixed-horizon return is a sum of rewards similar to Equation 1 that includes only a fixed number of future steps. For a fixed horizon $h$, the fixed-horizon return is defined to be:

$$G_t^h = \sum_{k=0}^{\min(h,\, T-t)-1} \gamma^k R_{t+k+1}$$

which is well-defined for any finite $h$. This formulation allows the agent’s horizon of interest and sense of urgency to be characterized more flexibly and admits the use of $\gamma = 1$ in the continuing setting. Fixed-horizon value functions are defined as expectations of the above sum when following policy $\pi$ beginning with state $s$ (and action $a$):

$$v_\pi^h(s) = \mathbb{E}_\pi[G_t^h \mid S_t = s]$$

$$q_\pi^h(s, a) = \mathbb{E}_\pi[G_t^h \mid S_t = s, A_t = a]$$
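To make the truncation concrete, the following sketch computes a fixed-horizon return from a list of observed rewards. The helper name `fixed_horizon_return` and the reward sequence are hypothetical; the list is assumed to hold the rewards from time $t+1$ up to the end of the episode, so the sum stops at whichever comes first, the horizon or the episode end.

```python
def fixed_horizon_return(rewards, h, gamma=1.0):
    """Discounted sum of the first min(h, len(rewards)) rewards."""
    steps = min(h, len(rewards))
    return sum(gamma ** k * rewards[k] for k in range(steps))

rewards = [1.0, 0.0, 2.0, 5.0]   # hypothetical rewards until the episode ends
g2 = fixed_horizon_return(rewards, h=2)              # only the first two rewards
g3 = fixed_horizon_return(rewards, h=3, gamma=0.5)   # discounted three-step sum
g9 = fixed_horizon_return(rewards, h=9)              # horizon exceeds the episode
```

With an undiscounted sum, a horizon longer than the remaining episode simply yields the full episodic return.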


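These fixed-horizon value functions can also be written in terms of successor values. Instead of bootstrapping off of the same value function, $v_\pi^h$ bootstraps off of the successor state’s value from an earlier horizon [2]:

$$v_\pi^h(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi^{h-1}(s') \right]$$

where $v_\pi^0(s) = 0$ for all $s \in \mathcal{S}$.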

From the perspective of generalized value functions (GVFs) [18], fixed-horizon value functions are compositional GVFs. If an agent knows about a signal up to a final horizon of $H$, it can specify a question under the same policy with a horizon of $H+1$, and directly make use of existing values. As $H \to \infty$, the fixed-horizon return converges to the infinite-horizon return, so that a fixed-horizon agent with sufficiently large $H$ could solve infinite-horizon tasks arbitrarily well. Indeed, for an infinite-horizon MDP with absolute rewards bounded by $R_{max}$ and $\gamma < 1$, it is easy to see that $|v_\pi^H(s) - v_\pi(s)| \le \gamma^H \frac{R_{max}}{1 - \gamma}$.
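The gap between fixed-horizon and infinite-horizon values can be checked numerically. The sketch below assumes the geometric-tail bound $\gamma^H R_{max} / (1 - \gamma)$ on that gap and uses a hypothetical two-state Markov chain (a fixed policy folded into the transition matrix `P`, with `r` the expected reward on leaving each state); none of these numbers come from the paper.

```python
import numpy as np

def fixed_horizon_values(P, r, gamma, H):
    """Return an (H + 1, n) array whose row h holds v^h, with v^0 = 0."""
    v = np.zeros((H + 1, len(r)))
    for h in range(1, H + 1):
        # Horizon h bootstraps from horizon h - 1, never from itself.
        v[h] = r + gamma * P @ v[h - 1]
    return v

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])          # hypothetical chain under a fixed policy
r = np.array([1.0, -1.0])           # hypothetical expected rewards
gamma, H = 0.9, 50
r_max = np.abs(r).max()

v_inf = np.linalg.solve(np.eye(2) - gamma * P, r)   # infinite-horizon values
v = fixed_horizon_values(P, r, gamma, H)
gaps = np.abs(v - v_inf).max(axis=1)                # worst-case gap per horizon
bounds = gamma ** np.arange(H + 1) * r_max / (1 - gamma)
```

Each entry of `gaps` stays below the corresponding entry of `bounds`, and both shrink geometrically in the horizon.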

For a final horizon $H < \infty$, there may be concerns about suboptimal control. We explore this empirically in Section 5. For now, we note that optimality is never guaranteed when values are approximated. This can result from function approximation, or even in tabular settings from the use of a constant step size. Further, recent work shows benefits in considering shorter horizons [24] based on performance metric mismatches.

One-step Fixed-horizon TD

One-step fixed-horizon TD (FHTD) learns approximate values $V^h \approx v_\pi^h$ by computing, for each $h \in \{1, 2, \ldots, H\}$:

$$\hat{G}_t^h = R_{t+1} + \gamma V^{h-1}(S_{t+1})$$

$$V^h(S_t) \leftarrow V^h(S_t) + \alpha \left[ \hat{G}_t^h - V^h(S_t) \right]$$

where $V^0(s) = 0$ for all $s \in \mathcal{S}$. The general procedure of one-step FHTD was previously described by Sutton (1988), but not tested due to the observation that one-step FHTD’s computation and storage scale linearly with the final horizon $H$, as it performs $H$ value updates per step. We argue that because these value updates can be parallelized, reasonably long horizons are feasible, and we will later show that $n$-step FHTD can make even longer horizons practical. Forms of temporal abstraction [19] can further substantially reduce the complexity.
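The per-horizon updates can be performed together as one vectorized operation. The sketch below is one possible tabular implementation, assuming a value table with one row per horizon (row 0 fixed at zero); the function name `fhtd_update` and the two-state test environment are illustrative only.

```python
import numpy as np

def fhtd_update(V, s, r, s_next, alpha, gamma=1.0, terminal=False):
    """One-step FHTD: update V[h, s] for every horizon h = 1..H in place.

    V has shape (H + 1, num_states); row 0 is the zero-horizon value and stays 0.
    """
    # Horizon h bootstraps from horizon h - 1 at the next state.
    boot = np.zeros(V.shape[0] - 1) if terminal else V[:-1, s_next]
    targets = r + gamma * boot              # targets for h = 1..H, computed first
    V[1:, s] += alpha * (targets - V[1:, s])
    return V

# Hypothetical continuing task: two states that alternate, reward 1 per step.
V = np.zeros((4, 2))                        # horizons 0..3
s = 0
for _ in range(200):
    s_next = 1 - s
    fhtd_update(V, s, r=1.0, s_next=s_next, alpha=0.5)
    s = s_next
```

In this undiscounted example the horizon-$h$ value of each state approaches $h$, since exactly $h$ unit rewards fall inside the horizon.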

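A key property of FHTD is that bootstrapping is grounded by the 0th horizon, which is exactly known from the start. The 1st horizon’s values estimate the expected immediate reward from each state, which is a stable target. The 2nd horizon bootstraps off of the 1st, which eventually becomes a stable target, and so forth. This argument intuitively applies even with general function approximation, assuming weights are not shared between horizons. To see the implications of this on TD’s deadly triad, consider one-step FHTD with linear function approximation. For each horizon’s weight vector, $\mathbf{w}^h$, the following update is performed:

$$\mathbf{w}_{t+1}^h \leftarrow \mathbf{w}_t^h + \alpha \left[ R_{t+1} + \gamma \boldsymbol{\phi}_{t+1}^\top \mathbf{w}_t^{h-1} - \boldsymbol{\phi}_t^\top \mathbf{w}_t^h \right] \boldsymbol{\phi}_t$$

where $\boldsymbol{\phi}_t$ is the feature vector of the state at time $t$, $\boldsymbol{\phi}(S_t)$. The expected update can be written as:

$$\mathbf{w}_{t+1}^h \leftarrow \mathbf{w}_t^h + \alpha \left( \mathbf{b} - \mathbf{A} \mathbf{w}_t^h \right)$$

where $\mathbf{A} = \mathbb{E}[\boldsymbol{\phi}_t \boldsymbol{\phi}_t^\top]$ and $\mathbf{b} = \mathbb{E}[(R_{t+1} + \gamma \boldsymbol{\phi}_{t+1}^\top \mathbf{w}_t^{h-1}) \boldsymbol{\phi}_t]$, having expectations over the transition dynamics and the sampling distribution over states. Because the target uses a different set of weights, the $\mathbf{b}$ vector is non-stationary (but convergent), and the $\mathbf{A}$ matrix is an expectation over the fixed sampling distribution. Defining $\boldsymbol{\Phi}$ to be the matrix containing each state’s feature vector along each row, and $\mathbf{D}$ to be the diagonal matrix containing the probability of sampling each state on its diagonal, we have that $\mathbf{A} = \boldsymbol{\Phi}^\top \mathbf{D} \boldsymbol{\Phi}$. $\mathbf{A}$ is always positive definite and so the updates are guaranteed to be stable relative to the current $\mathbf{b}$ vector. In contrast, TD’s update gives $\mathbf{A} = \boldsymbol{\Phi}^\top \mathbf{D} (\mathbf{I} - \gamma \mathbf{P}) \boldsymbol{\Phi}$, where $\mathbf{P}$ contains the state-transition probabilities under the target policy. TD’s $\mathbf{A}$ matrix can be negative definite if $\mathbf{D}$ and $\mathbf{P}$ do not match (i.e., off-policy updates), and the weights may diverge. We explore the convergence of FHTD formally in Section 4.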
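The contrast between the two key matrices can be illustrated on a tiny example. The sketch below assumes the forms $\boldsymbol{\Phi}^\top \mathbf{D} \boldsymbol{\Phi}$ for FHTD and $\boldsymbol{\Phi}^\top \mathbf{D} (\mathbf{I} - \gamma \mathbf{P}) \boldsymbol{\Phi}$ for TD, and uses a hypothetical two-state problem with a single feature, modeled loosely on the well-known two-state off-policy divergence example rather than on anything in this paper.

```python
import numpy as np

Phi = np.array([[1.0],
                [2.0]])                 # one feature: phi(s1) = 1, phi(s2) = 2
D = np.diag([1.0, 0.0])                 # updates are only ever sampled from s1
P = np.array([[0.0, 1.0],
              [0.0, 1.0]])              # target policy: s1 -> s2, s2 -> s2
gamma = 0.9

# FHTD: the bootstrap target uses another horizon's weights, so it lands in b.
A_fhtd = Phi.T @ D @ Phi
# TD: the bootstrap target uses the same weights, so it enters A.
A_td = Phi.T @ D @ (np.eye(2) - gamma * P) @ Phi
```

In this example `A_fhtd` is the positive scalar 1, while `A_td` is negative, so the expected TD update pushes the weight away from its fixed point while the expected FHTD update does not.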

Because each horizon relies on earlier horizons’ estimates being accurate, FHTD might be expected to have worse sample complexity than TD. If each horizon’s parameters are identically initialized, distant horizons will match infinite-horizon TD’s updates for the first $H-1$ steps. If $H \to \infty$, this suggests that FHTD’s sample complexity might be upper bounded by that of TD. See Appendix D for a preliminary result in this vein.

An FHTD agent has strictly more predictive capabilities than a standard TD agent. FHTD can be viewed as computing an inner product between the rewards and a step function. This gives an agent an exact notion of when rewards occur, as one can subtract the fixed-horizon returns of subsequent horizons to get an individual expected reward at a specific future time step.
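As an illustration of this subtraction, the sketch below computes exact fixed-horizon values for a hypothetical deterministic three-state chain (undiscounted, with a single reward of 5 on leaving the middle state) and differences consecutive horizons. The chain and all names are assumptions made for the example.

```python
import numpy as np

P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])     # hypothetical chain: 0 -> 1 -> 2 -> 2 -> ...
r = np.array([0.0, 5.0, 0.0])       # reward received on leaving each state
H = 4

v = np.zeros((H + 1, 3))
for h in range(1, H + 1):
    v[h] = r + P @ v[h - 1]         # undiscounted fixed-horizon values

# Row h - 1 holds v^h - v^(h-1): the expected reward exactly h steps ahead.
per_step = np.diff(v, axis=0)
```

Starting from state 0, `per_step[:, 0]` shows that the reward of 5 arrives on the second step and nowhere else, which a single infinite-horizon value could not reveal.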

Multi-step Fixed-horizon TD Prediction

Another way to estimate fixed-horizon values is through Monte Carlo (MC) methods. From a state, one can generate reward sequences of fixed lengths, and average their sums into the state’s expected fixed-horizon return. Fixed-horizon MC is an opposite extreme from one-step FHTD in terms of a bias-variance trade-off, similar to the infinite-horizon setting [9]. This trade-off motivates considering multi-step FHTD methods.

Fixed-horizon MC is appealing when one only needs the expected return for the final horizon $H$, and has no explicit use for the horizons leading up to $H$. This is because it can learn the final horizon directly by storing and summing the last $H$ rewards, and performing one value-function update for each visited state. If we use a fixed-horizon analogue to $n$-step TD methods [17], denoted $n$-step FHTD, the algorithm stores and sums the last $n$ rewards, and only has to learn $H/n$ value functions. For each $h \in \{n, 2n, 3n, \ldots, H\}$, $n$-step FHTD computes:

$$\hat{G}_{t:t+n}^h = \gamma^n V^{h-n}(S_{t+n}) + \sum_{k=0}^{n-1} \gamma^k R_{t+k+1}$$

$$V^h(S_t) \leftarrow V^h(S_t) + \alpha \left[ \hat{G}_{t:t+n}^h - V^h(S_t) \right]$$

assuming that $H$ is divisible by $n$. If $H$ is not divisible by $n$, the earliest horizon’s update will only sum the first $H \bmod n$ of the last $n$ rewards. Counting the number of rewards to sum, and the number of value function updates, $n$-step FHTD performs $n - 1 + H/n$ operations per time step. This has a worst case of $H$ operations at $n = 1$ and $n = H$, and a best case of $2\sqrt{H} - 1$ operations at $n = \sqrt{H}$. This suggests that in addition to trading off reliance on sampled information versus current estimates, $n$-step FHTD’s computation can scale sub-linearly in $H$.
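The operation count can be tabulated directly. The sketch below assumes the count $n - 1 + H/n$ (that is, $n - 1$ additions to sum $n$ rewards plus $H/n$ value updates) and a hypothetical final horizon of 64; the function name is illustrative.

```python
def fhtd_ops_per_step(H, n):
    """Operations per time step: (n - 1) reward additions plus H / n value updates."""
    if H % n != 0:
        raise ValueError("this sketch assumes n divides H")
    return (n - 1) + H // n

H = 64                                   # hypothetical final horizon
costs = {n: fhtd_ops_per_step(H, n) for n in range(1, H + 1) if H % n == 0}
best_n = min(costs, key=costs.get)       # the n with the fewest operations
```

For this horizon the cost is 64 at both extremes and reaches its minimum of 15 at a step length of 8, the square root of the horizon.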

Fixed-horizon TD Control

The above FHTD methods for state-values can be trivially extended to learn action-values under fixed policies, but it is less trivial with policy improvement involved.

If we consider the best an agent can do from a state in terms of the next $H$ steps, it consists of an immediate reward, and the best it can do in the next $H-1$ steps from the next state. That is, each horizon has a separate target policy that is greedy with respect to its own horizon. Because the greedy action may differ between one horizon and another, FHTD control is inherently off-policy.

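Q-learning provides a natural way to handle this off-policyness, where the TD target bootstraps off of an estimate under a greedy policy. Based on this, fixed-horizon Q-learning (FHQ-learning) performs the following updates:

$$\hat{G}_t^h = R_{t+1} + \gamma \max_{a'} Q^{h-1}(S_{t+1}, a')$$

$$Q^h(S_t, A_t) \leftarrow Q^h(S_t, A_t) + \alpha \left[ \hat{G}_t^h - Q^h(S_t, A_t) \right]$$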
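A tabular version of these updates might look like the following sketch. It assumes an action-value table with one slice per horizon (slice 0 fixed at zero) and an agent that acts greedily with respect to the final horizon; the names `fhq_update` and `greedy_action` are hypothetical.

```python
import numpy as np

def fhq_update(Q, s, a, r, s_next, alpha, gamma=1.0, terminal=False):
    """One-step FHQ-learning update for all horizons, in place.

    Q has shape (H + 1, num_states, num_actions); Q[0] is all zeros.
    """
    if terminal:
        boot = np.zeros(Q.shape[0] - 1)
    else:
        # Each horizon h bootstraps from the greedy value of horizon h - 1.
        boot = Q[:-1, s_next].max(axis=1)
    targets = r + gamma * boot
    Q[1:, s, a] += alpha * (targets - Q[1:, s, a])
    return Q

def greedy_action(Q, s):
    """Act greedily with respect to the final horizon's action-values."""
    return int(np.argmax(Q[-1, s]))

Q = np.zeros((3, 2, 2))                 # horizons 0..2, two states, two actions
Q[1, 1] = [1.0, 3.0]                    # hypothetical horizon-1 values at state 1
fhq_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=1.0)
action = greedy_action(Q, 0)
```

Note that the maximization is taken separately for each bootstrapped horizon, which is what makes each horizon's target policy greedy with respect to its own values.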


An unfortunate consequence of FHTD control being inherently off-policy is that the computational savings of $n$-step FHTD methods may not be possible without approximations. With policy improvement, an agent needs to know the current greedy action for each horizon. This information is not available if $n$-step FHTD methods avoid learning intermediate horizons. On the other hand, a benefit of each horizon having its own greedy policy is that the optimal policy for a horizon is unaffected by the policy improvement steps of later horizons. As a compositional GVF, the unidirectional decoupling of greedy policies suggests that in addition to a newly specified final horizon leveraging previously learned values for prediction, it can leverage previously learned policies for control.

4 Convergence of Fixed-horizon TD

This section sets forth our convergence results, first for linear function approximation (which includes the tabular case), and second for general function approximation.

Linear Function Approximation

For linear function approximation, we prove the convergence of FHQ-learning under some additional assumptions, outlined in detail in Appendix A. Analogous proofs may be obtained easily for policy evaluation. We provide a sketch of the proof below, and generally follow the outline of Melo and Ribeiro (2007). Full proofs are in the Appendix.

We denote for the feature vector corresponding to the state and action . We will sometimes write for . Assume furthermore that the features are linearly independent and bounded.

We assume that we are learning horizons, and approximate the fixed-horizon -th action-value function linearly.

where . Colon indices denote array slicing, with the same conventions as in NumPy. For convenience, we also define .

We define the feature corresponding to the max action of a given horizon:

The update at time for each horizon is given by

where the step-sizes are assumed to satisfy:

Proposition 1.

For , the following ODE system has an equilibrium.


Denote one such equilibrium by , and define . If, furthermore, we have that for ,


then is a globally asymptotically stable equilibrium of Equation 19.


See Appendix A. The main idea is to explicitly construct an equilibrium of Equation 19 and to use a Lyapunov function along with Equation 20 to show global asymptotic stability. ∎

Equation 20 means that the $h$-th fixed-horizon action-value function must be closer, when taking respective max actions, to its equilibrium point than the $(h+1)$-th fixed-horizon action-value function is to its equilibrium, where distance is measured in terms of squared error of the action-values. Intuitively, the functions for the previous horizons must have converged somewhat for the next horizon to converge, formalizing the cascading effect described earlier. This assumption is reasonable given that value functions for smaller horizons have a bootstrap target that is closer to a Monte Carlo target than the value functions for larger horizons. As a result, Equation 20 is somewhat less stringent than the corresponding Assumption (7) in Theorem 1 of Melo and Ribeiro (2007), which requires that the behaviour policy already be somewhat close to optimal.

Theorem 1.

Viewing the right-hand side of the ODE system eq. 19 as a single function , assume that is locally Lipschitz in . Assuming also Equation 20, a fixed behaviour policy , and the assumptions in the Appendix, the iterates of FHQ-learning converge with probability 1 to the equilibrium of the ODE system eq. 19.


See Appendix A. The main idea is to use Theorem 17 of Benveniste, Métivier, and Priouret (1990) and Proposition 1. ∎

A limitation of Theorem 1 is the assumption of a fixed behaviour policy. It is possible to make a similar claim for a changing policy, assuming that it satisfies a form of Lipschitz continuity with respect to the parameters. See Melo and Ribeiro (2007) for discussion on this point.

General Function Approximation

We now address the case where is represented by a general function approximator (e.g., a neural network). As before, the analysis extends easily to prediction (). General non-linear function approximators have non-convex loss surfaces and may have many saddle points and local minima. As a result, typical convergence results (e.g., for gradient descent) are not useful without some additional assumption about the approximation error (cf., the inherent Bellman error in the analysis of Fitted Value Iteration [13]). We therefore state our result for general function approximators in terms of -strongness for :

Definition 1.

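A function approximator, consisting of function class $\mathcal{H}$ and iterative learning algorithm $\mathcal{A}$, is $\delta$-strong with respect to target function class $\mathcal{G}$ and loss function $J: \mathcal{H} \times \mathcal{G} \to \mathbb{R}^+$ if, for all target functions $g \in \mathcal{G}$, the learning algorithm $\mathcal{A}$ is guaranteed to produce (within a finite number of steps) an $h \in \mathcal{H}$ such that $J(h, g) \le \delta$.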

We consider learning algorithms that converge once some minimum progress can no longer be made:

Assumption 1.

There exists stopping constant such that algorithm is considered “converged” with respect to target function if less than progress would be made by an additional step; i.e., if .

Note that -strongness may depend on stopping constant : a larger naturally corresponds to earlier stopping and looser . Note also that, so long as the distance between the function classes and is upper bounded, say by d, any convergent is “d-strong”. Thus, a -strongness result is only meaningful to the extent that is sufficiently small.

We consider functions parameterized by . Letting denote the Bellman operator , we assume:

Assumption 2.

The target function is Lipschitz continuous in the parameters: there exists constant such that for all , where is a norm on value function space (typically a weighted norm, weighted by the data distribution), which we take to be a Banach space containing both and .

It follows that if , the sequence of target functions converges to in under norm . We can therefore define the “true” loss:


where we drop the square from the usual mean square Bellman error [17] for ease of exposition (the analysis is unaffected after an appropriate adjustment). Since we cannot access , we optimize the surrogate loss:

Lemma 1.

If , and learning has not yet converged with respect to the surrogate loss , then .


Intuitively, enough progress toward a similar enough surrogate loss guarantees progress toward the true loss. Applying the triangle inequality (twice) gives:

where the final inequality uses from Assumption 1. ∎

It follows from Lemma 1 that when is small enough— for some constant —either the true loss falls by at least , or learning has converged with respect to the current target . Since is non-negative (so cannot go to ), it follows that the loss converges to a -strong solution: with . Since there are only a finite number of -sized steps between the current loss at time and (i.e., only a finite number of opportunities for the learning algorithm to have “not converged” with respect to the surrogate ), the parameters must also converge.

Since is stationary, it follows by induction that:

Theorem 2.

Under Assumptions 1 and 2, each horizon of FHQ-learning converges to a -strong solution when using a -strong function approximator.

In contrast to Theorem 1, which applies quite generally to linear function approximators, $\delta$-strongness and Assumption 1 limit the applicability of Theorem 2 in two important ways.

First, since gradient-based learning may stall in a bad saddle point or local minimum, neural networks are not, in general, $\delta$-strong for small $\delta$. Nevertheless, repeated empirical experience shows that neural networks consistently find good solutions [27], and a growing number of theoretical results suggest that almost all local optima are “good” [14, 5, 15]. For this reason, we argue that $\delta$-strongness is reasonable, at least approximately.

Second, Assumption 1 is critical to the result: only if progress is “large enough” relative to the error in the surrogate target is learning guaranteed to make progress on . Without a lower bound on progress—e.g., if the progress at each step is allowed to be less than regardless of —training might accumulate an error on the order of at every step. In pathological cases, the sum of such errors may diverge even if . As stated, Assumption 1 does not reflect common practice: rather than progress being measured at every step, it is typically measured over several, say , steps. This is because training targets are noisy estimates of the expected Bellman operator and several steps are needed to accurately assess progress. Our analysis can be adapted to this more practical scenario by making use of a target network [12] to freeze the targets for steps at a time. Then, considering each step window as a single step in the above discussion, Assumption 1 is fair. This said, intuition suggests that pathological divergence when Assumption 1 is not satisfied is rare, and our experiments with Deep FHTD Control show that training can be stable even with shared weights and no target networks.

5 Empirical Evaluation

This section outlines several hypotheses concerning fixed-horizon TD methods, experiments aimed at testing them, and the results from each experiment. Pseudo-code, diagrams, more experimental details, and additional experiments can be found in the supplementary material.

Stability in Baird’s Counterexample

We hypothesize that FHTD methods provide a stable way of bootstrapping, such that divergence will not occur under off-policy updating with function approximation. To test this, we used Baird’s counterexample [1], a 7-state MDP where every state has two actions. One action results in a uniform random transition to one of the first 6 states, and the other action results in a deterministic transition to the 7th state. Rewards are always zero, and each state has a specific feature vector for use with linear function approximation. It was presented with a discount rate of $\gamma = 0.99$, and a target policy which always chooses to go to the 7th state.

In our experiment, we used one-step FHTD with importance sampling corrections [16] to predict up to a horizon of $H = 100$. Each horizon’s weights were initialized to be $\mathbf{w}^h = [1, 1, 1, 1, 1, 1, 10, 1]^\top$, based on Sutton and Barto (2018), and we used a step size of . We performed 1000 independent runs of 10,000 steps, and the results can be found in Figure 1.

Figure 1: Weight trajectories of one-step FHTD’s 100th horizon value function on Baird’s counterexample, plotted after each time step. Shaded regions represent one standard error.

We see that one-step FHTD eventually and consistently converges. The initial apparent instability is due to each horizon being initialized to the same weight vector, making early updates resemble the infinite-horizon setting where weight updates bootstrap off of the same weight vector. The results emphasize what TD would have done, and how FHTD can recover from it. Of note, the final weights do give optimal state-values of $0$ for each state. In results not shown, FHTD still converges, sometimes more quickly, when each horizon’s weights are initialized randomly (and not identically).
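A small-scale version of this experiment can be sketched as follows. This is not the authors' code: the step size (`0.2 / 7`), the short final horizon (`H=5`), the number of steps, the behaviour policy (the action leading to the 7th state chosen with probability 1/7), the feature matrix, and the initial weights are all assumptions, with the last three following the common textbook presentation of Baird's counterexample.

```python
import numpy as np

def baird_features():
    """Feature matrix (7 states x 8 weights) of Baird's counterexample."""
    Phi = np.zeros((7, 8))
    for i in range(6):
        Phi[i, i] = 2.0
        Phi[i, 7] = 1.0
    Phi[6, 6] = 1.0
    Phi[6, 7] = 2.0
    return Phi

def run_fhtd_baird(H=5, steps=10_000, gamma=0.99, alpha=0.2 / 7, seed=0):
    rng = np.random.default_rng(seed)
    Phi = baird_features()
    # Row h holds the weights for horizon h; row 0 is fixed at zero.
    W = np.tile(np.array([1, 1, 1, 1, 1, 1, 10, 1], dtype=float), (H + 1, 1))
    W[0] = 0.0
    s = rng.integers(7)
    for _ in range(steps):
        to_seventh = rng.random() < 1 / 7     # assumed behaviour policy
        s_next = 6 if to_seventh else rng.integers(6)
        rho = 7.0 if to_seventh else 0.0      # importance sampling ratio
        # Rewards are zero; horizon h bootstraps from horizon h - 1.
        targets = gamma * (W[:-1] @ Phi[s_next])
        td_errors = targets - W[1:] @ Phi[s]
        W[1:] += alpha * rho * np.outer(td_errors, Phi[s])
        s = s_next
    return W @ Phi.T                          # (H + 1, 7) array of state values

values = run_fhtd_baird()
```

Under these assumptions the estimated state values of every horizon shrink toward zero, the true value when all rewards are zero, rather than growing without bound.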

Tabular FHTD Control

Figure 2: Mean episode lengths over 100 episodes of FHQ-learning and Q-learning with various step-sizes and horizons of interest. Results are averaged over 100 runs, and shaded regions represent one standard error.

In this section, we evaluate one-step FHQ-learning in a control problem. We hypothesize that when transitions are highly stochastic, predicting too far into the future results in unnecessarily large variance. Using fixed step sizes, we expect this to be an issue even in tabular settings. Both truncating the horizon and constant-valued discounting can address the variability of long term information, so we compare undiscounted FHQ-learning to discounted Q-learning.

We designed the slippery maze environment, a maze-like grid world with 4-directional movement. The agent starts in the center, and hitting walls keeps the agent in place. The “slipperiness” involves a chance that the agent’s action is overridden by a random action. A reward of $-1$ is given at each step. The optimal deterministic path is 14 steps, but due to stochasticity, an optimal policy averages 61.77 steps.

Each agent behaved $\epsilon$-greedily with . We swept linearly spaced step-sizes, final horizons $H$ for FHQ-learning, and discount rates $\gamma$ for Q-learning. The discount rates were selected such that if $1 - \gamma$ represented a per-step termination probability of a stochastic process, the expected number of steps before termination matches the tested values of $H$. We performed 100 independent runs, and Figure 2 shows the mean episode length over 100 episodes.
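The matching between discount rates and horizons follows from the mean of a geometric distribution: a process that terminates with probability $1 - \gamma$ at each step lasts $1 / (1 - \gamma)$ steps in expectation. A minimal sketch, using hypothetical horizons rather than the values swept in the experiment:

```python
def matching_discount(H):
    """Discount rate whose expected termination time, 1 / (1 - gamma), equals H."""
    return 1.0 - 1.0 / H

horizons = [8, 16, 32]                     # hypothetical horizons for illustration
gammas = [matching_discount(H) for H in horizons]
expected_steps = [1.0 / (1.0 - g) for g in gammas]
```

Each discount rate is therefore paired with the fixed horizon it matches in expectation, although the discounted return still places some weight on rewards beyond that horizon.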

For FHQ-learning, it can be seen that if the final horizon is unreasonably short (), the agent performs poorly. However, does considerably better than if it were to predict further into the future. With Q-learning, the discount rates performed relatively similarly to one another, despite being chosen to have expected sequence lengths comparable to the fixed horizons. This may be because they still include a portion of the highly variable information about further steps. For both algorithms, a shorter horizon was preferable over the full episodic return.

Deep FHTD Control

We further expect FHTD methods to perform well in control with non-linear function approximation. FHTD’s derivation assumes weights are separated by horizon. To see the effect of horizons sharing weights, we treated each horizon’s values as linear neural network outputs over shared hidden layers. Use of this architecture along with parallelization emphasizes that the increased computation can be minimal. Due to bootstrapped targets being decoupled by horizon, we also expect deep FHTD methods to not need target networks.

In OpenAI Gym’s LunarLander-v2 environment [4], we compared Deep FHQ-learning (DFHQ) with a final horizon and DQN [12]. We restricted the neural network to have two hidden layers, and swept over hidden layer widths for each algorithm. We used , and behaviour was $\epsilon$-greedy with $\epsilon$ annealing linearly from to over 50,000 frames. RMSprop [22] was used on sampled mini-batches from an experience replay buffer [12], and ignoring that the target depends on the weights, DFHQ minimized the mean-squared-error across horizons:

$$J(\mathbf{w}) = \frac{1}{H} \sum_{h=1}^{H} \left( R_{t+1} + \gamma \max_{a'} Q^{h-1}(S_{t+1}, a'; \mathbf{w}) - Q^h(S_t, A_t; \mathbf{w}) \right)^2$$
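A loss of this kind can be sketched in NumPy as below. This is an illustrative reading of "mean-squared-error across horizons", not the authors' implementation: it assumes the network outputs an array of shape (batch, horizons, actions) covering horizons 1 to $H$, treats the bootstrapped targets as constants, and averages over both the batch and the horizons. The function name `dfhq_loss` and the `dones` termination flags are hypothetical.

```python
import numpy as np

def dfhq_loss(q, q_next, actions, rewards, dones, gamma):
    """Mean squared fixed-horizon TD error over a batch and all horizons.

    q, q_next: (B, H, A) arrays of action-values for horizons 1..H at S_t and
    S_{t+1}. actions: (B,) integer array. rewards, dones: (B,) float arrays.
    """
    B = q.shape[0]
    # Horizon h bootstraps from the greedy value of horizon h - 1; horizon 0 is 0.
    boot = np.concatenate(
        [np.zeros((B, 1)), q_next[:, :-1, :].max(axis=2)], axis=1
    )                                                   # shape (B, H)
    targets = rewards[:, None] + gamma * (1.0 - dones[:, None]) * boot
    q_taken = q[np.arange(B), :, actions]               # shape (B, H)
    return np.mean((targets - q_taken) ** 2)

# Tiny hypothetical batch: B = 1, H = 2, A = 2.
q = np.array([[[0.5, 0.0], [0.0, 2.0]]])
q_next = np.array([[[1.0, 2.0], [3.0, 4.0]]])
loss = dfhq_loss(q, q_next, actions=np.array([1]), rewards=np.array([1.0]),
                 dones=np.array([0.0]), gamma=0.5)
```

Because every horizon's output shares the same hidden layers in this architecture, a single forward pass yields all the values needed for both the predictions and the targets.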


We performed 30 independent runs of 500,000 frames each (approximately 1000 episodes for each run). Figure 3 shows for each frame, the mean return over the last 10 episodes of each algorithm’s best parameters (among those tested) in terms of area under the curve. Note that the results show DFHQ and DQN without target networks. From additional experiments, we found that target networks slowed DFHQ’s learning more than it could help over a run’s duration. They marginally improve DQN’s performance, but the area under the curve remained well below that of DFHQ.

Figure 3: Mean return over last 10 episodes at each frame of DFHQ and DQN, without target networks, averaged over 30 runs. Shaded regions represent one standard error.

It can be seen that DFHQ had considerably lower variance, and relatively steady improvement. Further, DFHQ was significantly less sensitive to , as DQN with immediately diverged. From the remainder of our sweep, DFHQ appeared relatively insensitive to large hidden layer widths beyond the setting shown, whereas DQN’s performance considerably dropped if the width further increased.

DFHQ’s good performance may be attributed to the representation learning benefit of predicting many outputs [10, 7]; in contrast with auxiliary tasks, however, these outputs are necessary tasks for predicting the final horizon. An early assessment of the learned values can be found in Appendix D.

6 Discussion and future work

In this work, we investigated using fixed-horizon returns in place of the conventional infinite-horizon return. We derived FHTD methods and compared them to their infinite-horizon counterparts in terms of prediction capability, complexity, and performance. We argued that FHTD agents are stable under function approximation and have additional predictive power. We showed that the added complexity can be substantially reduced via parallel updates, shared weights, and $n$-step bootstrapping. Theoretically, we proved convergence of FHTD methods with linear and general function approximation. Empirically, we showed that off-policy linear FHTD converges on a classic counterexample for off-policy linear TD. Further, in a tabular control problem, we showed that greedifying with respect to estimates of a short, fixed horizon could outperform doing so with respect to longer horizons. Lastly, we demonstrated that FHTD methods can scale well to and perform competitively on a deep reinforcement learning control problem.

There are many avenues for future work. Given that using shorter horizons may be preferable (Figure 2), it would be interesting if optimal weightings of horizons could be learned, rather than relying on the furthest horizon to act. Developing ways to handle the off-policyness of $n$-step FHTD control (see Appendix D), incorporating temporal abstraction, and experiments in more complex environments would improve our understanding of the scalability of our methods to extremely long horizon tasks. Finally, the applications to complex and hyperbolic discounting (Appendix B), exploring the benefits of iteratively deepening the final horizon, and the use of fixed-horizon critics in actor-critic methods might be promising.


The authors thank the Reinforcement Learning and Artificial Intelligence research group, Amii, and the Vector Institute for providing the environment to nurture and support this research. We gratefully acknowledge funding from Alberta Innovates – Technology Futures, Google DeepMind, and the Natural Sciences and Engineering Research Council of Canada.

Appendix A Full Proof of Convergence for Linear Function Approximation


We assume throughout a common probability space . Our proof follows the general outline in Melo and Ribeiro (2007).

Assume we have a Markov decision process . , the state space, is assumed to be compact, the action space is assumed to be finite (with σ-algebra ), and is a bounded, deterministic function assigning a reward to every transition tuple. Let be a σ-algebra defined on . The kernel is assumed to be action-dependent, and is defined such that

for all . Throughout, we assume a fixed, measurable behaviour policy such that for all and . With a fixed behaviour policy, we can define a new Markov chain first with a function :

where . It is straightforward to see that is a pre-measure on the algebra which can be extended to a measure on .

To apply the results of Benveniste, Métivier, and Priouret (1990), we must construct another Markov chain so that in Benveniste, Métivier, and Priouret (1990, p. 213) has access to the TD error at time . In the interests of completeness, we provide the full details below, but the reader may safely skip to the next section.

We employ a variation of a standard approach, as in, for example, Tsitsiklis and Van Roy (1999). Let us define a new process . The process has state space and σ-algebra , with kernel defined first on

Similar to before, it is straightforward to see that the above function is a pre-measure and can be extended to a measure on for each fixed .

Lemma 2.

is a Markov kernel.


It remains to show that is measurable with respect to for fixed .

We use Dynkin’s theorem. Define

We have by construction of above and the fact that is a kernel. is also a π-system (i.e., it is closed under finite intersections). is a monotone class by basic properties of measurable functions, so that by the theorem. Hence, is a kernel. ∎

The following is a convenient result.

Lemma 3.

Assume that is uniformly ergodic. Then is also uniformly ergodic.


From Meyn and Tweedie (2012, p. 389), a Markov chain is uniformly ergodic iff it is -small for some . We will show that is -small for some measure . Since is uniformly ergodic, let and a non-trivial measure on such that for all , ,


For any , and , we can write

where . Using Equation 24, we have

Let us define the bottom expression as a function . We claim that is a measure on . First, . Second, for disjoint, and such are themselves disjoint, otherwise if for , then , which is impossible by the assumed disjointness of the . Finally, is itself a measure in the second argument.

It remains to check that is not trivial. Set . Then for all , we have . Then

The last inequality is by assumption of being non-trivial. ∎

Note that our corresponds to the in Benveniste, Métivier, and Priouret (1990, p. 213). With this construction finished, we will assume in the following that whenever we refer to or the Markov chain , we are actually referring to or the Markov chain respectively.

We assume that the step-sizes of the algorithm satisfy the following:
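For orientation, step-size assumptions of this kind are typically the Robbins-Monro conditions. The display below is a sketch under that assumption, not necessarily the exact condition used here; the symbol for the step size at time t is likewise assumed notation.

```latex
\sum_{t=0}^{\infty} \alpha_t = \infty,
\qquad
\sum_{t=0}^{\infty} \alpha_t^{2} < \infty .
```

A schedule such as a step size proportional to 1/(t + 1) satisfies both conditions, whereas a constant step size violates the second.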

We write for the feature vector corresponding to state and action . We will sometimes write for . Assume furthermore that the features are linearly independent and bounded.

We assume that we are learning horizons, and approximate the fixed-horizon -th action-value function linearly.

where . Colon indices denote array slicing, with the same conventions as in NumPy. For convenience, we also define .

We define the feature corresponding to the max action of a given horizon:

The update at time for each horizon is given by

Here, .
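As an illustration of the update above, the following NumPy sketch applies one linear fixed-horizon Q-learning step to every horizon simultaneously. It assumes the usual form of the update, in which horizon h bootstraps from the greedy value of horizon h - 1 at the next state, and a zero-horizon weight vector fixed at zero. The function name `fhql_linear_update`, its argument names, and the array layout are hypothetical choices made for this sketch.

```python
import numpy as np

def fhql_linear_update(theta, phi_sa, r, phi_next, alpha, gamma=1.0):
    """Return updated weights after one linear FHQL step for all horizons.

    theta    : (H + 1, d) weights; theta[0] is the zero-horizon vector (all zeros)
    phi_sa   : (d,) feature vector of the current state-action pair
    phi_next : (num_actions, d) features of the next state paired with each action
    """
    # Greedy next-state value under each of the horizons 0..H-1.
    q_next = theta[:-1] @ phi_next.T          # shape (H, num_actions)
    max_next = q_next.max(axis=1)             # shape (H,)
    # Current estimates for horizons 1..H.
    q_sa = theta[1:] @ phi_sa                 # shape (H,)
    # TD error of horizon h bootstraps from horizon h - 1.
    td_error = r + gamma * max_next - q_sa    # shape (H,)
    new_theta = theta.copy()
    new_theta[1:] += alpha * td_error[:, None] * phi_sa[None, :]
    return new_theta
```

Slicing with `theta[:-1]` and `theta[1:]` pairs each horizon with its predecessor, mirroring the colon-index convention described above. The function returns a new array rather than updating in place, so every horizon is computed from the same pre-update weights.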


Proposition 1.

For , the following ODE system has an equilibrium.


Denote one such equilibrium by , and define . If, furthermore, we have that for ,


then is a globally asymptotically stable equilibrium of Equation 25.


First, we show that there is at least one equilibrium of Equation 25. Finding an equilibrium point amounts to solving the following equations for all :

Since we assume that the features are linearly independent, and using the fact that , we can recursively solve these equations to find an equilibrium.
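The recursive solution can be sketched numerically. The example below replaces the expectations with sample averages over a batch of transitions and solves one linear system per horizon, starting from a zero-horizon weight vector of zero. It assumes that the equilibrium equations take the usual projected form, with the feature second-moment matrix on the left-hand side and the bootstrapped target from the previous horizon on the right. The function name `solve_equilibrium` and the batch layout are hypothetical.

```python
import numpy as np

def solve_equilibrium(Phi, R, Phi_next, H, gamma=1.0):
    """Solve for equilibrium weights horizon by horizon from sampled transitions.

    Phi      : (N, d) features of the sampled state-action pairs
    R        : (N,) sampled rewards
    Phi_next : (N, num_actions, d) next-state features for every action
    """
    N, d = Phi.shape
    # Empirical second-moment matrix; invertible when features are linearly independent.
    Sigma = Phi.T @ Phi / N
    theta = np.zeros((H + 1, d))              # theta[0] is the zero-horizon vector
    for h in range(1, H + 1):
        # Greedy next-state value under the already-solved horizon h - 1.
        max_next = (Phi_next @ theta[h - 1]).max(axis=1)   # shape (N,)
        b = Phi.T @ (R + gamma * max_next) / N
        theta[h] = np.linalg.solve(Sigma, b)
    return theta
```

Each system involves only weights from the previous horizon, which have already been computed, so the max over actions on the right-hand side is a known quantity and the equation for horizon h is linear in its own weights.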

Let be the equilibrium point thus generated. Define and substitute into Equation 25 to obtain the following system.


The equilibrium of Equation 27 corresponds to the equilibrium of Equation 25. By showing global asymptotic stability of 0 for Equation 27 and using the change of variable , we will have global asymptotic stability of for Equation 25.

Let us use the squared Euclidean norm on as our Lyapunov function. Such a function is clearly positive-definite. Let now denote a trajectory of Equation 27. Calculating,

We used Hölder’s inequality for the first inequality and Equation 26 for the last. The claim follows. ∎

Defining , we can rewrite Equation 26 as


Effectively, Equation 28 means that the -th fixed-horizon action-value function must be closer to the corresponding equilibrium, when taking the max action for each function, than the -th fixed-horizon action-value function is to its equilibrium when averaged across states and actions according to the behaviour policy. Intuitively, the functions for the previous horizons must have converged somewhat for the next horizon to converge. This condition can also be more easily satisfied by using a lower value of .

Theorem 1.

Viewing the right-hand side of the ODE system Equation 25 as a single function , assume that is locally Lipschitz in .

Assuming Equation 26, a fixed behaviour policy that results in a Markov chain , and the assumptions in the Preliminaries, the iterates of FHQL converge with probability 1 to the equilibrium of the ODE system Equation 25.


We apply Theorem 17 of Benveniste, Métivier, and Priouret (1990, p. 239). Conditions (A.1)-(A.2) follow easily from the step-size assumption and from the existence of a transition kernel for our Markov chain, which we write here as to keep to the notation in Benveniste, Métivier, and Priouret (1990). We also need to check (A.3)-(A.4). (A.3) and (A.4) (ii)-(iii) will be covered by the verification of conditions (1.9.1)-(1.9.6) below, while (A.4) (i) holds by the assumption that is locally Lipschitz.

It remains to verify the conditions (1.9.1) to (1.9.6). Let us write the invariant measure of as .

(1.9.1) The (also written as ) in Benveniste, Métivier, and Priouret (1990, p. 239) corresponds in our case to the following matrix with components and with replaced by :


with , , where we suppress the index for clarity. Because we are assuming bounded features, and because depends only linearly on , the bound (1.9.1) is easily seen to be satisfied after, for example, expanding and applying the Cauchy-Schwarz inequality several times.

(1.9.2) The bound is trivially satisfied since in our case, .

(1.9.3) This bound is satisfied since we assume that our state space is a compact subset of Euclidean space and is thus bounded.

(1.9.4) We construct explicitly, given the suggestion in Benveniste, Métivier, and Priouret (1990, p. 217). Define

We will show that the above series converges for all . If it does, then it is straightforward to check that (A.4) (ii) is satisfied. Since our chain is assumed to be uniformly ergodic, we have the existence of and such that for all ,