# Issues Concerning the Realizability of Blackwell Optimal Policies in Reinforcement Learning

###### Abstract

n-discount optimality was introduced as a hierarchical form of policy and value function optimality, with Blackwell optimality lying at the top level of the hierarchy [17, 3]. We formalize notions of myopic discount factors, value functions and policies in terms of Blackwell optimality in MDPs, and we provide a novel concept of regret, called Blackwell regret, which measures the regret of a policy compared to a Blackwell optimal policy. Our main analysis focuses on long horizon MDPs with sparse rewards. We show that selecting a discount factor under which zero Blackwell regret can be achieved becomes arbitrarily hard. Moreover, even with oracle knowledge of a discount factor that can realize a Blackwell regret-free value function, an ε-Blackwell optimal value function may not even be gain optimal. We discuss the difficulties associated with this class of problems, and define the notion of a policy gap: the difference in expected return at a state between a given policy and any other policy that differs at that state. We prove certain properties related to this gap. Finally, we provide experimental results that further support our theoretical results.¹

¹ Work in progress.



Nicholas Denis nick.denis.1983@gmail.com

## 1 Introduction

When is one policy better than another, and how does one arrive at the best policy? Additionally, is there a difference between the theoretical answers to these questions and how they are addressed in practice? Within the reinforcement learning and Markov decision process communities, these questions are fundamental and nothing new. Indeed, though these questions have been well defined and well studied, this paper reconsiders important issues with solutions to MDPs and RL problems. Specifically, we explore the role of the discount factor γ in finding an optimal policy π* and value function V*. Once γ is chosen, though an (approximately) optimal solution may be returned by some algorithm, it may still be unsatisfactory in some regards (as demonstrated by OpenAI with the CoastRunners domain). In this paper we explore the role of γ in arriving at an ε-optimal policy, as well as a researcher's preference for, or evaluation of, such a policy. We discuss issues surrounding selecting γ and ε without any domain knowledge of the problem, and how even theoretically sound algorithms such as PAC-MDP solution methods can produce policies that, though they satisfy the PAC criterion, are still not even gain optimal. Especially difficult are long horizon problems (LHPs) with sparse rewards. Motivated by such problems, we introduce a novel concept of regret, called Blackwell regret, which compares the expected return of a given policy to that of a Blackwell optimal policy, evaluated at an appropriate value of γ. We believe Blackwell regret is more akin to how humans experience regret when comparing oneself to the highest of standards. We formalize the notion of myopic discount factors and policies, and introduce the notion of a Blackwell realizable discount factor. We discuss how policies that minimize Blackwell regret are fundamentally difficult to solve for, as recent literature on long horizon problems (LHPs) has hinted [10].
This is due to the existence of pivot states, where discovering the Blackwell optimal policy hinges on discerning the values of a Blackwell optimal policy and a non-Blackwell optimal policy, values which can be arbitrarily close at a given state. Even with oracle knowledge of γ_Bw, the infimum discount factor that can induce a Blackwell optimal and Blackwell regret-free policy, an ε-accurate Blackwell optimal policy may not be Blackwell optimal, and in fact may not even be gain optimal. We provide experimental results using PAC-MDP algorithms that demonstrate this phenomenon. Motivated by these findings, we argue the need for progress within three areas of theoretical research: 1) analytical solution methods for Blackwell optimal policies; 2) provably convergent algorithms for solving n-discount optimal policies; 3) goal-based and human-preference-based RL. Our focus is on the latter.

## 2 Background

### 2.1 Markov Decision Process

Recall that an MDP, M, is a 5-tuple ⟨S, A, P, R, γ⟩, where S is a finite state space, A is a finite action space, P : S × A × S → [0, 1] is the transition kernel, R : S × A → ℝ is the reward function, and γ ∈ [0, 1] is the discounted current value associated to one unit of reward to be received one unit of time into the future. This work focuses on deterministic Markovian policies π : S → A.

### 2.2 Notation

This work considers how γ plays a role both in learning a policy, as well as in evaluating the value function associated to a policy that was perhaps learned with a different discount factor. For this reason, it is important to clearly separate the discount factor γ_l used to learn a policy, and the discount factor γ_e used to evaluate that policy. Hence, by π_{γ_l} we refer to a policy learned using γ_l, whereas V^{π_{γ_l}}_{γ_e}(s) refers to the value of state s when following the policy π_{γ_l} (as defined just previously), with the value function computed using γ_e. Symbolically,

$$V^{\pi_{\gamma_l}}_{\gamma_e}(s) = \mathbb{E}^{\pi_{\gamma_l}}\Big[\sum_{t=0}^{\infty} \gamma_e^{t}\, r_t \,\Big|\, s_0 = s\Big].$$
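This separation can be made concrete with a minimal sketch. The chain MDP below, its length, and its rewards are hypothetical illustrations (not the MDP of Figure 1): a policy is learned by value iteration under γ_l and then evaluated under a different γ_e.

```python
import numpy as np

N, r_min, r_max = 6, 0.1, 1.0   # hypothetical chain: 6 states, two rewards

def step(s, a):
    """Deterministic chain: a=0 drifts left (self-loop with reward r_min at
    s=0); a=1 moves right (self-loop with reward r_max at s=N-1)."""
    if a == 0:
        return max(s - 1, 0), (r_min if s == 0 else 0.0)
    return min(s + 1, N - 1), (r_max if s == N - 1 else 0.0)

def learn_policy(gamma_l, sweeps=5000):
    """Value iteration under gamma_l; returns the greedy deterministic policy."""
    V = np.zeros(N)
    for _ in range(sweeps):
        for s in range(N):
            V[s] = max(step(s, a)[1] + gamma_l * V[step(s, a)[0]] for a in (0, 1))
    return [max((0, 1), key=lambda a: step(s, a)[1] + gamma_l * V[step(s, a)[0]])
            for s in range(N)]

def evaluate(policy, gamma_e, sweeps=5000):
    """V^{pi_{gamma_l}}_{gamma_e}: value of a fixed policy, computed with gamma_e."""
    V = np.zeros(N)
    for _ in range(sweeps):
        for s in range(N):
            s2, r = step(s, policy[s])
            V[s] = r + gamma_e * V[s2]
    return V

pi_myopic = learn_policy(0.5)    # gamma_l = 0.5: prefers staying at s=0
pi_far    = learn_policy(0.99)   # gamma_l = 0.99: heads for the far reward
print(evaluate(pi_myopic, 0.99)[0], evaluate(pi_far, 0.99)[0])
```

On this chain, the policy learned with γ_l = 0.5 stays at the low-reward state, and its value under γ_e = 0.99 falls well short of the policy learned with γ_l = 0.99, which is the distinction the notation is designed to track.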

### 2.3 Optimality in MDPs

If γ = 1 then we are considering undiscounted rewards, and the return of any infinite stream of rewards is ∑_{t=0}^{∞} r_t. Since this sum is often infinite, the gain of a policy is defined, where

$$g^{\pi}(s) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}^{\pi}\Big[\sum_{t=0}^{T-1} r_t \,\Big|\, s_0 = s\Big].$$

Using the gain of a policy, an ordering ⪰_g is defined on a policy class Π, where π ⪰_g π′ if and only if g^π(s) ≥ g^{π′}(s) for all s ∈ S, with the strict inequality defined similarly. It is worth noting that if we define r as the sequence of expected rewards from following policy π, then for any permutation σ and any policy π′ whose sequence of expected rewards is σ(r), we have g^π = g^{π′}. Hence, the temporal ordering of rewards has no bearing on the value or gain of a policy when γ = 1. This is certainly not true for γ < 1.

Most commonly γ ∈ (0, 1). In this setting we can deal with infinite series of expected rewards, as the partial sums converge geometrically fast in γ. The value of a state s when considering discount factor γ is

$$V^{\pi}_{\gamma}(s) = \mathbb{E}^{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r_t \,\Big|\, s_0 = s\Big].$$

Since most frameworks assume rewards are bounded in some interval [0, R_max], we have V^π_γ(s) ≤ R_max/(1 − γ) for all s ∈ S and all π. Such assumptions and the use of the quantity 1/(1 − γ) are integral to theoretical bounds for algorithms and solution methods in RL and MDPs. Similar to the ordering on policies in the undiscounted setting, an ordering ⪰_γ for fixed γ is used to order policies: π ⪰_γ π′ if and only if V^π_γ(s) ≥ V^{π′}_γ(s) for all s ∈ S. Unlike the undiscounted setting, under γ < 1 two policies are not equivalent under permutation of the temporal sequence of rewards. Interestingly, a small value of γ is rarely used in the literature, and is often called myopic. With a small γ, the induced policy does not sufficiently account for the future horizon, and in doing so is generally viewed to lead only to sub-optimal behaviour.

For fixed γ, we say a policy π* is optimal if V^{π*}_γ(s) ≥ V^π_γ(s) for all s ∈ S and all π ∈ Π, where we write V*_γ = V^{π*}_γ. Despite these notions of optimality being the most common in RL, there are other notions of optimality [13].

#### 2.3.1 Bias Optimality

Bias optimality was introduced to supplement the use of gain optimal policies when γ = 1. Since the gain of a policy only considers the asymptotic behaviour of a policy, two policies that have the same gain may experience different reward trajectories before arriving at the stationary distribution of the policy. For this reason the bias of a policy, introduced by [3], is defined as

$$h^{\pi}(s) = \lim_{T \to \infty} \mathbb{E}^{\pi}\Big[\sum_{t=0}^{T-1} \big(r_t - g^{\pi}(s)\big) \,\Big|\, s_0 = s\Big].$$

For any finite state and action space MDP, a bias optimal policy always exists.

#### 2.3.2 n-discount Optimality

n-discount optimality [17] introduces a hierarchical view of policy optimality in MDPs. A policy π* is n-discount optimal for n ∈ {−1, 0, 1, 2, …} if, for all π ∈ Π and all s ∈ S,

$$\liminf_{\gamma \to 1^{-}} \,(1-\gamma)^{-n}\,\big[V^{\pi^*}_{\gamma}(s) - V^{\pi}_{\gamma}(s)\big] \;\geq\; 0.$$

It has been shown [17] that a policy is (−1)-discount optimal if and only if it is gain optimal, and a policy is 0-discount optimal if and only if it is bias optimal. Moreover, if a policy is n-discount optimal, then it is m-discount optimal for all m ≤ n. The strongest and most selective notion of optimality is that of being n-discount optimal for all n ≥ −1. Such a policy is referred to as being ∞-discount optimal.

#### 2.3.3 Blackwell Optimality

A policy π* is Blackwell optimal if there exists a γ_Bw ∈ [0, 1) such that

$$V^{\pi^*}_{\gamma}(s) \;\geq\; V^{\pi}_{\gamma}(s), \qquad \forall \pi \in \Pi,\ \forall s \in S,\ \forall \gamma \in [\gamma_{Bw}, 1).$$

For finite state spaces such a γ_Bw is attained [13]. Intuitively, a Blackwell optimal policy is one such that, upon considering sufficiently far into the future, as encoded as a planning horizon via γ, no other policy has a higher expected cumulative reward. [17] showed that a policy is ∞-discount optimal if and only if it is Blackwell optimal; hence Blackwell optimality implies all other forms of optimality, and for this reason it is the focus of this work. Finally, [3] shows that for finite state and action space MDPs, there always exists a stationary and deterministic Blackwell optimal policy.

## 3 Motivation for Blackwell Regret

Consider the infinite horizon MDP in Figure 1, with initial state s₀. Before proceeding, consider what you would do if you were in this MDP. What do you think is the best policy? What sort of solution would you hope that an RL algorithm return to you, and how did you come to this conclusion?

In wanting to maximize cumulative reward, it is hard to argue for any action selection policy in the provided example other than to always "move right" towards the high reward state, and upon arriving, remain there. Why might someone consider any other policy? Why might a rational agent, with full oracle knowledge of the MDP, consider staying in s₀ to receive a small reward at every time step in perpetuity? It is hard to account for why such a policy would be preferred over the policy that takes the agent to the high reward state, aside from laziness. Computationally, staying at s₀ is worth the discounted sum of the small reward received forever, while moving right is worth the large reward stream discounted by the number of steps needed to reach it. Hence, depending on the two rewards and the distance between them, γ can be set appropriately in order to induce the desired policy behaviour.
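The trade-off can be sketched numerically. The reward eps at s₀, the reward r_max at the far state, and the distance N below are hypothetical stand-ins for the quantities in Figure 1; the closed forms are the geometric series for the two competing stationary behaviours.

```python
# Hypothetical numbers: eps reward at s0, r_max at the far state N steps away.
eps, r_max, N = 0.01, 1.0, 50

def v_stay(g):   return eps / (1 - g)             # stay at s0 forever
def v_right(g):  return g**N * r_max / (1 - g)    # N zero-reward steps, then r_max

# v_right > v_stay  iff  g**N > eps / r_max, giving a crossover discount:
crossover = (eps / r_max) ** (1 / N)
print(crossover)                                  # ≈ 0.912 for these numbers
for g in (0.9, 0.95):
    print(g, v_stay(g), v_right(g), v_right(g) > v_stay(g))
```

Note that γ = 0.9, often considered a "large" discount factor, still prefers staying at s₀ on this chain, while γ = 0.95 prefers moving right; the crossover depends on N and the reward ratio, not on γ being large in any absolute sense.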

Returning to γ, it is widely accepted within the literature that a small γ is myopic. We ask: can a large γ, say 0.9, be myopic? What of 0.99, or 0.999? If we abstract what makes a γ myopic, it is the fact that γ is not sufficiently large so as to provide the agent with the possibility of properly assessing the optimal value of states and actions, where this optimality is, in some sense, not defined with respect to the γ used during learning, but rather with respect to some ideal policy or behaviour. Just as a child might seek to maximize immediate gratification (rewards) by eating candy before bed, which may be optimal given a small γ_child, the role of a parent will be to convey the non-optimality of such a policy by noting the yet to be experienced consequences (poor sleep, fussy behaviour the following day), which can only be taken into consideration with a larger γ_parent. This is paradoxical for the child, as they operate under γ_child, and hence eating candy is optimal from the perspective of V_{γ_child}. The lesson the parent tries to impart to the child is to use γ_parent, so that the child can learn V_{γ_parent}. In this way, we intuitively compare V_{γ_child} to V_{γ_parent}. It is this intuition that we seek to formalize, by noting that eating candy before bed does not sufficiently value the future, and for this reason we attempt to resist this myopic behaviour. In order to do so, sufficiently valuing the future means selecting a suitable γ. We argue that this sufficiency is represented by the γ_Bw found in the definition of a Blackwell optimal policy and value function.

We argue that myopic behaviour, intuitively, is defined with respect to the strongest sense of optimality: Blackwell optimality. Note that for any fixed γ ∈ [0, 1), an optimal policy π*_γ exists [13]. So why, then, is π*_γ dismissed as myopic for small γ? It is still, after all, an optimal policy. We believe this occurs because we intuitively understand that not all optimal policies are equal. It appears that all optimal policies are optimal, but some are more optimal than others. That is, though π*_γ is optimal under γ, not all γ's induce the policies or behaviours that a researcher prefers. This highlights a common issue in machine learning: that of using a given objective function as a surrogate representation for what we want the algorithm to do.

The hierarchical nature of policy optimality, as expressed by n-discount optimality, naturally captures this phenomenon, and we revisit this body of literature to help motivate why our sense of a γ being myopic has nothing to do with not being capable of finding π*_γ, but rather with not finding γ_Bw, the γ that characterizes a Blackwell optimal policy. We introduce a novel notion of regret, called Blackwell regret, and relate the concepts of a myopic γ and a myopic policy to Blackwell regret. Our work looks at a simple class of MDPs, called distracting long horizon MDPs, and shows that even for such a simple class of environments, it is arbitrarily hard to select a γ so as to arrive at a Blackwell optimal policy and value function that achieves zero Blackwell regret.

## 4 Myopic γ, Blackwell Realizable γ and Blackwell Regret

Looking at the MDPs in Figure 1, we intuitively get a sense of what the right policy is, and we agree that a small γ is myopic and will not produce the optimal policy. Moreover, we can check that moderately larger values of γ will suffer the same drawback. Since no formal definition of a myopic γ can be found in the literature, we provide one.

##### Definition: Myopic γ and Blackwell Regret:

Let π_Bw denote a Blackwell optimal policy. Let γ_Bw be as defined above for Blackwell optimality, such that V^{π_Bw}_γ(s) ≥ V^π_γ(s) for all π ∈ Π, all s ∈ S, and all γ ∈ [γ_Bw, 1). Then for γ < γ_Bw, we say γ is myopic. Similarly, a policy π_γ is myopic if it is learned using a myopic γ. Similarly, we say for γ ∈ [γ_Bw, 1) that γ is Blackwell realizable. For a Blackwell realizable γ_e, we define the Blackwell regret ρ_Bw. Let γ_e ∈ [γ_Bw, 1). Then for a given policy π_{γ_l},

$$\rho_{Bw}(\pi_{\gamma_l}) = \mathbb{E}_{s_0}\big[V^{\pi_{Bw}}_{\gamma_e}(s_0) - V^{\pi_{\gamma_l}}_{\gamma_e}(s_0)\big],$$

where the expectation is taken over the initial state distribution. Hence, Blackwell regret is the regret accrued for using a given policy learned with γ_l, when compared to a Blackwell optimal policy. Since it may be that γ_l ≠ γ_e, to ensure commensurability we require γ_e ∈ [γ_Bw, 1) in the definition, since under non-negative rewards, for fixed π and γ ≤ γ′, we see that V^π_γ(s) ≤ V^π_{γ′}(s). It immediately follows that if γ is myopic, and π_γ is the optimal policy induced by γ, then ρ_Bw(π_γ) ≥ 0. We see in the following lemma that Blackwell regret captures the very notion exemplified in the child-parent example previously given, in that for γ_e ∈ [γ_Bw, 1), the Blackwell regret is simply the regret computed using γ_e. The regret of a given policy π, with value evaluated at γ, is defined as

$$\rho_{\gamma}(\pi) = \mathbb{E}_{s_0}\big[V^{*}_{\gamma}(s_0) - V^{\pi}_{\gamma}(s_0)\big].$$
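The definition can be sketched on the chain MDP of Figure 1, using hypothetical numbers and the closed-form returns of the two competing stationary policies ("right", which reaches the high reward state, and "stay", which remains at s₀):

```python
# Sketch of Blackwell regret on a distracting chain (hypothetical numbers).
eps, r_max, N = 0.01, 1.0, 50
gamma_Bw = (eps / r_max) ** (1 / N)   # crossover discount for this chain

def value(policy, g):
    """Closed-form value at s0: travel N steps then r_max forever, or stay."""
    return g**N * r_max / (1 - g) if policy == "right" else eps / (1 - g)

def blackwell_regret(policy, gamma_e):
    """Regret vs. the Blackwell optimal policy, at a realizable gamma_e."""
    assert gamma_e >= gamma_Bw, "gamma_e must be Blackwell realizable"
    return value("right", gamma_e) - value(policy, gamma_e)

print(blackwell_regret("right", 0.99))   # 0.0: no regret for pi_Bw itself
print(blackwell_regret("stay", 0.99))    # positive: the distracted policy
```

The assertion enforces the commensurability requirement from the definition: evaluating at a myopic γ_e would let the distracted policy appear regret-free.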

###### Lemma 1.

Let γ_e ∈ [γ_Bw, 1), with γ_Bw as defined in Blackwell optimality. Then ρ_Bw(π) = ρ_{γ_e}(π).

Previous definitions of regret measure the difference in value of a given policy and the optimal value function, each evaluated with respect to a fixed γ, since for each fixed γ an optimal value function V*_γ exists. As well, the γ used to learn the policy is typically also used to evaluate the value of that policy, and thus the regret. Blackwell regret differs in that it measures the difference in value of a given policy π and a Blackwell optimal policy, evaluated at a γ_e that favours a Blackwell optimal policy and value function. In doing so, a policy that achieves zero Blackwell regret is either itself Blackwell optimal, or, when considering a sufficiently long time horizon (as encoded by γ_e), has the same value as a Blackwell optimal policy.

## 5 Difficulty in Selecting a Blackwell Realizable γ

When implementing an RL algorithm that incorporates discounting, typically no reasoning is provided to explain the choice of γ used, though most often values of γ are set around 0.9. Alternatively, γ may be treated as a hyperparameter and a grid search over values may be performed. However, even under these settings, the probability measure of non-myopic γ's can become vanishingly small for various types of problems, such as LHPs and sparse reward problems. Hence, any randomized selection approach can have a vanishingly small probability of achieving zero Blackwell regret, as in the example in Figure 1 as the distance to the high reward state grows. We show that selecting a non-myopic γ, that is, a Blackwell realizable γ, is quite difficult without oracle knowledge of the problem. Moreover, even using a Blackwell realizable γ, an ε-optimal policy may not even be gain optimal, let alone Blackwell optimal.

Ultimately we would like to consider MDP environments of a particular nature conducive to multi-task RL problems. The environments (problems) we are interested in are those such that, for every task assigned to the agent, the optimal policy for that task induces a partition of the state space into non-empty subsets of transient and recurrent states. This is equivalent to saying that for each task, the optimal policy associated to the task induces a Markov chain on S which is unichain, or that the environment is multichain [13]. The intuition is that the environment is sufficiently controllable, in the sense that the agent can direct the environment towards some preferable subset of the state space, and stay there indefinitely if needed, as encoded by the task MDP. For this paper we consider a particular subset of such environments, where there are only two regions of the state space that produce non-zero rewards, and these two regions are maximally separated from one another. We demonstrate that even for such a simple class of MDPs, selecting Blackwell realizable discount factors can be arbitrarily hard.

More formally, we consider the class of MDPs with finite diameter. That is, D(M) < ∞, where

$$D(M) := \max_{s \neq s'} \ \min_{\pi} \ \mathbb{E}\big[T_{s'} \mid s, \pi\big],$$

where T_{s′} is the first hitting time of s′ when starting in state s, under π. Hence, within the class of environments considered, it is possible to reach any state from any other starting state, and to do so in a finite number of actions, in expectation, under some policy. Furthermore, denote by s_L and s_H two states that realize the diameter D. Suppose there exist actions a_L and a_H such that P(s_L | s_L, a_L) = 1 and P(s_H | s_H, a_H) = 1, with rewards R(s_L, a_L) = r_min and R(s_H, a_H) = r_max, and all other rewards zero. Moreover, 0 < r_min < r_max. Though this structure is quite specific, it abstractly represents two regions of the state space where actions exist that allow the agent to remain in those respective regions, and while remaining in such a region receive, on average, positive rewards r_min and r_max, respectively. An example of such an MDP can be seen in Figure 1. We call these particular environments distracting long horizon problems, in the sense that, due to the nature of the long horizon problem, the high reward region of the environment is many time steps away from an arbitrarily low reward region of the environment, with the rest of the environment producing no rewards. Given a state, such as s₀ in Figure 1, under a Blackwell optimal policy the agent will not be distracted by the nearby yet minuscule rewards, and will traverse to the high reward region around s_H. This setting is a slight step up in complexity from a simple goal based MDP where only a single state produces a positive reward. We show that with oracle domain knowledge of features of the MDP (which we state below), one can select a Blackwell realizable γ and solve for a Blackwell optimal policy in such distracting MDPs. However, even with oracle knowledge and a Blackwell realizable γ selected, one may receive an ε-optimal policy and value function that is not gain optimal, since for LHPs the value of a policy that is not gain optimal and that of a Blackwell optimal policy may differ by less than ε. Interestingly, these results suggest a multi-step learning process for distracting MDP problems may be possible, which we leave for future work.
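The role of the oracle knowledge can be sketched on the idealized chain instance of this class, where the crossover discount has the closed form γ_Bw = (r_min/r_max)^{1/D} (this formula is specific to the illustrative chain, not a general result). The sketch below shows both directions: with D, r_min, r_max known, a Blackwell realizable γ is computable; while any γ committed to in advance is defeated by a large enough diameter.

```python
# Closed-form crossover discount for the illustrative distracting chain:
# stay is worth r_min/(1-g), reaching s_H is worth g**D * r_max/(1-g).
def gamma_blackwell(D, r_min, r_max):
    return (r_min / r_max) ** (1.0 / D)

# With full oracle knowledge, any g above this value is Blackwell realizable.
print(gamma_blackwell(50, 0.01, 1.0))

# Conversely: commit to a seemingly safe gamma first, then grow the horizon
# until that gamma becomes myopic for some MDP consistent with the rewards.
g = 0.999
D = 1
while gamma_blackwell(D, 0.01, 1.0) <= g:
    D += 1
print(D)   # a finite diameter D at which gamma = 0.999 is already myopic
```

Since γ_Bw → 1 as D → ∞, no fixed γ < 1 is Blackwell realizable for every diameter, which is the intuition behind the corollaries that follow.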

We start with a proposition showing that for this class of MDPs, the Blackwell optimal policies are exactly those policies that are not distracted, in the sense that they act solely to minimize the hitting time of the high reward state s_H.

###### Proposition 2.

Let M be a distracting long horizon MDP as described above. Then π is Blackwell optimal if and only if π minimizes the expected hitting time E[T_{s_H} | s, π] for every s ∈ S.

We now provide results showing that, with oracle knowledge of D, r_min and r_max, we may solve for γ_Bw and thus select a Blackwell realizable discount factor.

###### Corollary 3.

For any distracting long horizon MDP M, as described above, if D, r_min and r_max are known, then an RL algorithm can solve for γ_Bw and hence select a Blackwell realizable discount factor.

The following corollary shows that with oracle knowledge of only two of the three properties D, r_min and r_max, then after committing to a particular γ, there exists a distracting long horizon MDP consistent with those known properties wherein π_γ is not gain optimal, but for some γ′ > γ, π_{γ′} is Blackwell optimal.

###### Corollary 4.

Suppose for every distracting long horizon MDP M, as described above, only two of D, r_min and r_max are known. Let F, with |F| = 2, denote the MDP features known with oracle knowledge. Then for any choice of γ there exists an MDP M′ consistent with F such that π_γ is not gain optimal, but for some γ′ > γ, π_{γ′} is Blackwell optimal.

These corollaries demonstrate that there exists sufficient domain knowledge for distracting long horizon MDPs to allow for the computation and use of a Blackwell realizable γ; however, without complete domain knowledge of the MDP, any γ selected may be myopic and may not even lead to a gain optimal policy. These results suggest that for distracting long horizon MDPs a multi-step learning approach may be best, where in the first phase the agent learns the relevant MDP features, and in the second phase uses this knowledge to select a non-myopic γ to solve the task; we leave such results for future work.

The next results show that, even with access to a Blackwell realizable γ, for distracting long horizon problems the value of a policy that is not gain optimal and that of a Blackwell optimal policy may be arbitrarily close (e.g. within ε); hence any learning algorithm that returns a policy that is ε-accurate with respect to a Blackwell optimal policy may not even be gain optimal. Further, we provide empirical results that mirror our theoretical results.

###### Corollary 5.

Let ε > 0. Then there exists a distracting MDP M, with Blackwell optimal discount factor γ_Bw and associated Blackwell optimal policy π_Bw, and a γ ∈ [γ_Bw, 1), such that

$$\big|V^{\pi_{Bw}}_{\gamma}(s_0) - V^{\pi}_{\gamma}(s_0)\big| < \varepsilon,$$

where π is not gain optimal.

### 5.1 Policy Gaps and Pivot States

Prior work has been done in putting forward measurements that can act as indicators of when learning an optimal policy may be difficult [10, 2]. [2] discuss the notion of an action gap at a given state: the difference in expected value at that state between the optimal action and the second best action. More formally, let a* = argmax_a Q*_γ(s, a). Then,

$$AG(s) = Q^{*}_{\gamma}(s, a^{*}) - \max_{a \neq a^{*}} Q^{*}_{\gamma}(s, a).$$

[10] introduce the notion of the maximal action gap (MAG) of a policy as

$$MAG(\gamma) = \max_{s \in S}\ \Big[\, Q^{*}_{\gamma}\big(s, \pi^{*}(s)\big) - \max_{a \neq \pi^{*}(s)} Q^{*}_{\gamma}(s, a) \,\Big].$$

Both studies argue that if their respective measurement is small, then learning the optimal policy can be hard, as it is hard to discern the value of the optimal action from one that is sub-optimal. While each may be useful, we argue that since the action gap measures the difference in value associated with abstaining just once from taking the optimal action, it does not truly measure the difference in value between two policies, nor the associated difficulty in discerning the value of one policy over another. The maximal action gap suffers from this as well. Moreover, [10] show that under certain conditions the maximal action gap collapses to zero, making learning arbitrarily hard. However, in the Appendix we prove that this condition only occurs in environments where the set of states that receive non-zero rewards is transient.

We introduce a novel measurement, the policy gap, which is motivated by the action gap discussed above. For fixed γ, policy π and state s ∈ S, we define the policy gap Δ_π(s) as

$$\Delta_{\pi}(s) = \min_{\pi' :\, \pi'(s) \neq \pi(s)} \ \big[V^{\pi}_{\gamma}(s) - V^{\pi'}_{\gamma}(s)\big].$$

The policy gap at state s is the smallest difference in value at that state between the query policy and any other policy that differs at s. Intuitively, if Δ_π(s) is large for all s, then the ability to discern the optimal action, and thereby learn a Blackwell optimal policy, becomes easier. Conversely, if there exists a state s such that Δ_{π_Bw}(s) → 0, then at that state s, called a pivot state, the ability to discern the value of a Blackwell optimal policy π_Bw from that of another policy becomes increasingly hard. For an MDP where Blackwell optimal policies are non-trivial, that is, not all policies are Blackwell optimal, there exists such a pivot state. For the theorem below, we use π_Bw for a Blackwell optimal policy, and for any γ ∈ [γ_Bw, 1), we use V^{π_Bw}_γ to represent the value function computed following the Blackwell optimal policy with discount factor γ.
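The policy gap can be computed exactly on the illustrative distracting chain (all parameters hypothetical). At s = 0, any policy whose action at s = 0 differs from "move right" self-loops there, so its value at s = 0 is eps/(1 − γ) regardless of its behaviour elsewhere; the single-state deviation below therefore attains the minimum in the definition at that state.

```python
import numpy as np

eps, r_max, N = 0.01, 1.0, 50          # hypothetical chain parameters
gamma_Bw = (eps / r_max) ** (1 / N)    # crossover discount for this chain

def evaluate(policy, g, sweeps=4000):
    """Policy evaluation on the chain by repeated backups (converges since g<1)."""
    V = np.zeros(N + 1)
    for _ in range(sweeps):
        for s in range(N + 1):
            if policy[s] == 1:   # move right; self-loop with r_max at s=N
                s2, r = min(s + 1, N), (r_max if s == N else 0.0)
            else:                # move left; self-loop with eps at s=0
                s2, r = max(s - 1, 0), (eps if s == 0 else 0.0)
            V[s] = r + g * V[s2]
    return V

def policy_gap(g, s):
    """Gap at s between pi_Bw (always right) and its deviation at s only."""
    pi = [1] * (N + 1)
    alt = list(pi)
    alt[s] = 0                   # the deviating policy differs only at s
    return evaluate(pi, g)[s] - evaluate(alt, g)[s]

for g in (0.99, 0.95, gamma_Bw + 1e-4):
    print(round(g, 4), policy_gap(g, 0))   # the gap at s=0 shrinks toward 0
```

As γ decreases toward γ_Bw, the gap at s = 0 collapses, making s = 0 the pivot state of this chain in the sense of the theorem below.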

###### Theorem 6.

(Pivot State Existence) Let π_Bw be a non-trivial Blackwell optimal policy with γ_Bw as defined above, such that V^{π_Bw}_γ(s) ≥ V^π_γ(s) for all π, all s ∈ S, and all γ ∈ [γ_Bw, 1). Then there exists a pivot state s_p and a policy π′, where π′(s_p) ≠ π_Bw(s_p), such that

$$\lim_{\gamma \to \gamma_{Bw}^{+}} \ \big[V^{\pi_{Bw}}_{\gamma}(s_p) - V^{\pi'}_{\gamma}(s_p)\big] = 0.$$

Moreover, lim_{γ → γ_Bw⁺} Δ_{π_Bw}(s_p) = 0.

Theorem 6 shows that for γ values close to γ_Bw, there exists a pivot state such that the value of a Blackwell optimal policy at that state, when computed with γ, is arbitrarily close to the value of a different, non-Blackwell optimal policy at the same state, when computed with γ. Intuitively, if the policy gap is arbitrarily close to zero, an RL algorithm is expected to have greater difficulty evaluating the difference in value associated to such policies, and therefore greater difficulty in determining which is optimal. These results suggest that without oracle knowledge of γ_Bw, an algorithm that attempts to search for γ_Bw by iteratively increasing γ would have increasing difficulty as γ → γ_Bw.

## 6 Experimental Results

In this section we provide experimental results that further illustrate the phenomena discussed in previous sections. We investigate the difficulty of solving for Blackwell optimal policies in distracting long horizon MDPs, similar to those in Figure 1. For these experiments we use the MDP in Figure 2, with initial state s₀. We analytically solve for γ_Bw, and implement the delayed Q-learning PAC-MDP algorithm [16] for our experiments. We run two sets of experiments with two Blackwell realizable discount factors γ₁ < γ₂, with γ₁ close to γ_Bw, using commonly used values for the error tolerance ε and confidence δ. We run each set of experiments with a different random seed for 5 runs. The delayed Q-learning algorithm terminates when it either finds itself in state s_L, greedily selecting a_L while the Learn(s, a) boolean flag is False, or finds itself in state s_H, greedily selecting a_H while the Learn(s, a) boolean flag is False. Both situations indicate that no further learning is possible, and the algorithm has converged on an ε-optimal policy.

For the first set of experiments, with γ₁, in each of the five runs the policy learned was the distracted policy, which is not Blackwell optimal. Moreover, the algorithm terminates in state s_L; hence the policy is not even gain optimal. The mean sample complexity required for convergence and the mean policy gap at s₀ were also measured. In the second set of experiments, using γ₂, in each of the five runs the policy learned was the Blackwell optimal policy, and the measured mean policy gap at s₀ was larger than under γ₁.
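The core numeric phenomenon can be sketched independently of any particular learner, using the hypothetical chain parameters from earlier sections (not the actual MDP of Figure 2): with a barely Blackwell realizable γ, the distracted policy's value at s₀ sits well inside a typical PAC tolerance of the Blackwell optimal value, so an ε-accurate algorithm is free to return it.

```python
eps_r, r_max, N = 0.01, 1.0, 50        # hypothetical chain parameters
gamma_Bw = (eps_r / r_max) ** (1 / N)  # crossover discount for this chain
g = gamma_Bw + 1e-4                    # Blackwell realizable, but only barely

v_bw   = g**N * r_max / (1 - g)        # Blackwell optimal value at s0
v_stay = eps_r / (1 - g)               # distracted policy: not gain optimal
print(v_bw - v_stay)                   # far below a typical epsilon of 0.1
```

Any algorithm guaranteed only to return a policy within ε = 0.1 of optimal at s₀ may therefore return the distracted, non-gain-optimal policy while fully honouring its guarantee.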

These results corroborate the theoretical results obtained. First, for the experiments with γ₁, which is much closer to γ_Bw, we find that the policy gap at s₀ is much smaller than under γ₂, as predicted by the theoretical results stated. More importantly, despite having oracle knowledge of γ_Bw, selecting a Blackwell realizable γ, and implementing a PAC-MDP algorithm with commonly used values of ε and δ, no run under γ₁ returned a Blackwell optimal policy, and in fact none even returned a gain optimal policy. These results further support the difficulty of arriving at Blackwell optimal policies for distracting long horizon MDPs. However, for γ₂, chosen well above γ_Bw, the Blackwell optimal policy was returned in all experiments. Though a positive result in some regards, it also suggests that for distracting long horizon MDPs where γ_Bw → 1, so that the Lebesgue measure of the Blackwell realizable interval [γ_Bw, 1) shrinks to zero, having the luxury of randomly selecting a γ well above γ_Bw becomes arbitrarily hard. Finally, these results corroborate the theoretical results showing the existence of a state where the policy gap approaches zero, and that even with a Blackwell realizable γ and a commonly used error tolerance, the ε-optimal policy returned by a PAC-MDP algorithm was not even gain optimal.

## 7 Discussion and Related Work

The topic of the effect of γ selection on policy quality has been of interest for several decades [17, 3, 13], with n-discount optimality and Blackwell optimality providing a global perspective on this relationship. These works recognize that under discounting, an optimal policy may not be Blackwell optimal, and that this problem is alleviated as γ → 1. However, there are no known convergent algorithms with theoretical guarantees for the undiscounted setting. [3] also showed that for finite MDPs, as γ → 1, the γ-discounted value function can be written as a Laurent series expansion, where each of the terms in this series is a scaled notion of optimality, with the first term being the gain, the second the bias, and so on. Using this construction, [13, 17] show there is both a sequence of nested equations for solving for the Laurent series coefficients, as well as a policy iteration method that provably converges to a policy satisfying these equations for any finite-term approximation of the Laurent series. More recently, [11] utilized an exciting approach in function approximation by constructing value functions using basis functions comprised of terms found within the Laurent series expansion.
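For reference, the Laurent series in question takes the following standard form (as in [13]); writing ρ = (1 − γ)/γ,

```latex
V^{\pi}_{\gamma}(s) \;=\; \sum_{n=-1}^{\infty} \rho^{\,n}\, y^{\pi}_{n}(s),
\qquad \rho = \frac{1-\gamma}{\gamma},
```

where y^π_{−1} = g^π is the gain and y^π_{0} = h^π is the bias; the n-discount optimality hierarchy orders policies by the leading coefficients of this expansion.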

[9] studies the relationship of γ and reward functions with policy quality for goal based MDPs. They argue that under discounting an agent is not risk-averse, and prove that in the undiscounted setting, with a negative reward per time step, an agent is guaranteed to arrive at the goal state; however, with γ < 1 this is not so, as a shorter yet riskier path that may lead to a non-goal absorbing state can have higher value than a longer, safer path to the goal. [12, 6, 7] are motivated by showing that using smaller γ values may be advantageous. Besides faster convergence rates, they argue smaller γ values may also have better error guarantees. These works decompose the error, or value difference, between policies induced with different γ values; the decompositions have one error term dependent on the smaller γ, and another term that goes to zero as the two γ values approach each other. These works argue that the best strategy is to find an intermediate γ value that trades off the two terms. However, as is often the case in theoretical analysis of RL problems, the bounds are stated in terms of 1/(1 − γ), and for various values of γ are vacuous, as the bounds exceed the maximum possible value difference of R_max/(1 − γ) (e.g. between one policy only receiving zero rewards and another always receiving R_max). However, even when the bounds are meaningful, without knowledge of the Blackwell optimal policy and associated value function, as shown in this study, even an ε-optimal policy may not be gain optimal.

For each γ ∈ [0, 1), [10] define a hypothesis class Π_γ of policies realizable under discount factor γ, and show that

$$\gamma \leq \gamma' \ \Longrightarrow\ \Pi_{\gamma} \subseteq \Pi_{\gamma'}.$$

Under this framework, for the family of hypothesis classes Π_γ, indexed by γ, we see that as γ increases, (Π_γ) is a monotonically increasing (nested) sequence of hypothesis classes. [10, 6] formalize that this also corresponds to an increase in a measure of complexity, via a generalized Rademacher complexity ℜ(Π_γ) that depends only on γ. That is,

$$\gamma \leq \gamma' \ \Longrightarrow\ \mathfrak{R}(\Pi_{\gamma}) \leq \mathfrak{R}(\Pi_{\gamma'}).$$

As examined in [10], long horizon MDPs suffer in that γ_Bw may be arbitrarily close to 1, which implies the complexity of realizable hypothesis classes for LHPs grows non-linearly with the horizon size (and γ). [6] argue that using γ < 1 is therefore a mechanism akin to regularization: by selecting a lower complexity hypothesis class, one can prevent overfitting. Their results suggest using smaller γ values earlier in learning; however, as discussed here, the quality of a γ being small or large is problem dependent, and without oracle knowledge of the problem is meaningless. Though [6] does not consider Blackwell optimality, an interesting result [Theorem 2] can be adapted here, which bounds, with probability 1 − δ, the loss of an approximately optimal policy learned using discount factor γ and n samples from each state-action pair by the sum of two terms: a discount mismatch term that is fixed for fixed γ, and an estimation term that shrinks as n grows.

[6] argue that the tradeoff between the two terms involves controlling the complexity of the policy class by using a smaller γ, versus the error induced in the first term when using a smaller γ. Our results show that even as n → ∞ and the second error term goes to zero, the first error term is fixed for any fixed γ, and that without strong domain knowledge, even an ε-optimal approximate policy may not be gain optimal. [15] recently suggested γ-nets, a function approximation architecture that trains using a set of discount factors to learn value functions with respect to several timescales, the idea being that the architecture can generalize and approximate the value of a state for any γ if sufficiently trained.

The work presented here suggests that without considering Blackwell optimality and related concepts, theoretical bounds on value functions in RL may not provide meaningful and interpretable semantics with respect to the optimality of the resulting policy. An apt metaphor is that for a daredevil jumping across a canyon, coming ε-close to success is arbitrarily bad. In that vein, our results show that for LHPs an ε-Blackwell optimal policy may not even be gain optimal. In contrast, in the supervised learning setting, one may search over a particular hypothesis class and arrive at some locally or globally optimal hypothesis, which obtains some empirical accuracy on the training and test datasets. Once a classifier is obtained, though one may not know what the Bayes optimal risk is, one does know that a classifier can, at most, achieve 100% accuracy; hence, in absolute terms, one can obtain meaning from the test and training accuracy of a classifier returned by some SL algorithm. The RL setting is not similar in these regards. Without oracle knowledge of the RL problem, for the policy and value function returned by an RL algorithm, parameterized by γ and any other parameters, it is hard to say just how optimal such a policy in fact is, thereby leaving a researcher in the same boat as the fictitious RL agent: with results that are evaluative, not instructive.

Given that discounting has such a strong effect on the induced hypothesis class, one may ask why discounting is used at all. Authors often cite concepts from utility theory, such as inflation and interest, to motivate the use of discounting. Such concepts of temporal valuation may be useful for agents, such as humans, with finite time horizons; however, such intuitions may not be commensurable for infinite horizon agents. The use of discounting in economic models is also contentious [18]. For economic and environmental policies, how should we discount the value of having a clean environment? Is discounting the future ethical in such settings? Might discounting the future lead us to an arbitrarily bad absorbing state? Utility theory has considered several qualities that two utility streams may possess in forming the binary relations used as orderings on value functions (utility streams) [8], including that of anonymity, which essentially states that two utility streams are equal under an ordering if they are permutations of one another. Hence, anonymity can only be realized in the RL setting if γ = 1 (no discounting). These works introduce and argue for the use of Blackwell optimality in economics research. [14] answers the question of why discounting is used: because it turns an infinite sum into a finite one. That is, it allows us to consider convergent series, and therefore convergent algorithms. It then follows that we are not selecting for optimality, but rather for policies that are representable by convergent algorithms. If RL algorithms are to be used and incorporated in real world processes and products, we raise the rhetorical question: what are the moral and ethical implications of purposefully running a sub-optimal infinite horizon algorithm, in perpetuity?
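Schwartz's observation [14] is, at bottom, the geometric series: for bounded rewards $|r_t| \le R_{\max}$, discounting makes the return absolutely convergent,

```latex
\left| \sum_{t=0}^{\infty} \gamma^{t} r_{t} \right|
\;\le\;
R_{\max} \sum_{t=0}^{\infty} \gamma^{t}
\;=\;
\frac{R_{\max}}{1-\gamma},
\qquad 0 \le \gamma < 1,
```

whereas at γ = 1 the series may diverge, which is why the undiscounted setting resorts to gain (average reward) criteria instead.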

The results provided in this paper suggest that iterative methods for arriving at the Blackwell discount factor are problematic, suggesting a need for analytical methods of computing it. However, even with this discount factor in hand, an approximately optimal policy may not even be gain optimal. For LHPs, since even using a discount factor arbitrarily close to 1 shares this unfortunate result, as demonstrated empirically in our experiments, what can be done to ensure solving for the Blackwell optimal policy? Recent advances in PAC-MDP algorithms [4] introduce uniform PAC learning: PAC algorithms that are ε-optimal for all ε simultaneously. Such algorithms must never explore then commit [4], but rather must never stop learning, as it has been shown that explore-then-commit approaches are necessarily sub-optimal [5]. An interesting direction would be to consider the use of such algorithms for arriving at Blackwell optimal policies.
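To see why iterative search for the Blackwell discount factor is problematic, consider a minimal sketch on a hypothetical distracting MDP (all rewards and distances are illustrative): a distractor loop pays 0.2 per step immediately, while the goal loop pays 1 per step but lies d = 5 zero-reward steps away. A sweep over γ can observe that the greedy choice has changed, but no finite sweep certifies that it has stabilized.

```python
# Hypothetical distracting MDP: compare the value of looping at a nearby
# distractor against travelling d zero-reward steps to a richer goal loop.
def best_action(gamma, distractor_reward=0.2, goal_reward=1.0, d=5):
    v_distractor = distractor_reward / (1 - gamma)   # pays every step now
    v_goal = gamma ** d * goal_reward / (1 - gamma)  # pays after d steps
    return "goal" if v_goal > v_distractor else "distractor"

for gamma in (0.5, 0.7, 0.73, 0.9):
    print(gamma, best_action(gamma))  # flips near gamma = 0.2 ** (1 / 5)
```

Here the crossover sits at γ = 0.2^(1/5) ≈ 0.725; an agent that stops its sweep below it returns a policy that is not even gain optimal.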

Though Blackwell optimality is an ideal, for non-trivial LHPs it is possible that Blackwell optimal policies are hard to discern from policies that may not even be gain optimal. With such results being so dire, we suggest three main areas of focus for future research within the RL community: 1) development of convergent algorithms for solving for n-discount optimal policies, with theoretical bounds, and efficient solution methods for arriving at the Laurent series expansion of a discounted value function as the discount factor approaches 1; 2) analytical solutions for the Blackwell discount factor; 3) human preference and goal-based RL.

Our main focus is on the third area mentioned above. For any applied RL solution, for example a commercial product that relies on RL, we argue that ultimately the quality of a policy is judged by human preferences. Those implementing an RL solution method will receive a policy and a value function, and must evaluate whether or not it is a sufficient solution to the given problem. If not, the researcher will experiment with other parameters, including the discount factor, and repeat until a sufficient policy is found. Such an approach is based on human preference; it may be separate from the value function itself and depend solely on the behaviour of the policy. This can be seen in recent discussions [1] of results on the CoastRunners domain. CoastRunners is a video game where the policy controls a boat in a racing game. The policy solved for by OpenAI resulted in the boat driving in circles, collecting rewards, rather than racing to the finish line and completing the race. Though OpenAI uses this as an example of pathological behaviour induced by a faulty reward function, it can be viewed as the behaviour induced by using a myopic discount factor in a distracting LHP. OpenAI, and most others, would agree that the behaviour observed was pathological; however, what makes it pathological? In fact, it was the optimal policy solved for, given the encoding of the MDP. We argue that what makes this pathological is simply that the policy did not do what the researchers wanted it to do, which was to win the race. For this reason, at the current state of RL research, we claim that, ultimately, the quality of a solved policy is measured by whether it is deemed sufficient, as subjectively defined by the researcher.
We claim that this is equivalent to the researcher ultimately desiring something from the solved policy; hence, if this desideratum can be encoded as an indicator function, then goal-based RL problems should be used, these being among the simplest classes of MDP problems.
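A sketch of what such an encoding could look like (the predicate and state encoding are hypothetical, chosen to echo the CoastRunners example):

```python
# Goal-based reward: the researcher's desideratum ("win the race") encoded
# as an indicator on a goal predicate, rather than a shaped proxy reward.
def indicator_reward(state, is_goal):
    # is_goal: predicate encoding what the researcher actually wants
    return 1.0 if is_goal(state) else 0.0

# The boat is rewarded only for crossing the finish line, never for pickups.
finish_line = lambda s: s.get("crossed_finish", False)
print(indicator_reward({"crossed_finish": True}, finish_line))   # 1.0
print(indicator_reward({"lap_pickups": 99}, finish_line))        # 0.0
```

Under such a reward, collecting pickups in circles has zero value under every discount factor, so the pathological loop of the original encoding cannot be optimal.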

#### Acknowledgments.

The author would like to thank Maia Fraser for discussions and thoughtful edits of prior versions of this manuscript.

## 8 References

[1] OpenAI blog: Faulty reward functions. https://blog.openai.com/faulty-reward-functions/

[2] Bellemare, M., Ostrovski, G., Guez, A., Thomas, P., Munos, R.: Increasing the action gap: New operators for reinforcement learning. In: AAAI. pp. 1476–1483 (2016)

[3] Blackwell, D.: Discrete dynamic programming. Annals of Mathematical Statistics 33, 719–726 (1962)

[4] Dann, C., Lattimore, T., Brunskill, E.: Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In: Neural Information Processing Systems (2016)

[5] Garivier, A., Kaufmann, E.: On explore-then-commit strategies. In: Neural Information Processing Systems (2017)

[6] Jiang, N., Kulesza, A., Singh, S., Lewis, R.: The dependence of effective planning horizon on model accuracy. In: AAMAS. vol. 14 (2015)

[7] Jiang, N., Singh, S., Tewari, A.: On structural properties of MDPs that bound loss due to shallow planning. In: IJCAI (2016)

[8] Jonsson, A., Voorneveld, M.: The limit of discounted utilitarianism. Theoretical Economics 13, 19–37 (2018)

[9] Koenig, S., Liu, Y.: The interaction of representations and planning objectives for decision-theoretic planning tasks. Journal of Experimental and Theoretical Artificial Intelligence 14, 303–326 (2002)

[10] Lehnert, L., Laroche, R., van Seijen, H.: On value function representation of long horizon problems. In: 32nd AAAI Conference on Artificial Intelligence. pp. 3457–3465 (2018)

[11] Mahadevan, S., Liu, B.: Basis construction from power series expansions of value functions. In: Lafferty, J.D., Williams, C.K.I., Shawe-Taylor, J., Zemel, R.S., Culotta, A. (eds.) Advances in Neural Information Processing Systems 23, pp. 1540–1548. Curran Associates, Inc. (2010)

[12] Petrik, M., Scherrer, B.: Biasing approximate dynamic programming with a lower discount factor. In: Advances in Neural Information Processing Systems. pp. 1265–1272 (2009)

[13] Puterman, M.: Markov decision processes: Discrete stochastic dynamic programming. John Wiley and Sons, Inc. (1994)

[14] Schwartz, A.: A reinforcement learning method for maximizing undiscounted rewards. In: ICML (1993)

[15] Sherstan, C., MacGlashan, J., Pilarski, P.: Generalizing value estimation over timescale. In: Prediction and Generative Modeling in Reinforcement Learning Workshop, FAIM (2018)

[16] Strehl, A., Li, L., Wiewiora, E., Langford, J., Littman, M.: PAC model-free reinforcement learning. In: ICML. pp. 881–888 (2006)

[17] Veinott, A.: Discrete dynamic programming with sensitive discount optimality criteria. Annals of Mathematical Statistics 40, 1635–1660 (1969)

[18] Weitzman, M.: Gamma discounting. American Economic Review 91, 260–271 (2001)

## 9 Appendix

##### A Comment on Bounds Related to the Maximal Action Gap:

[10] define a fully connected subset of the state space. Despite the use of the term fully connected, which was intended to describe a subset of the state space that is reachable from anywhere within that subset, a more appropriate term is communicating, as fully connected has connotations that every state can be reached from every other state in a single transition. For this reason we will use the term communicating to describe this subset. From this they define the maximal action gap (MAG). Lemma 2 of [10] bounds the MAG in terms of the diameter of the communicating subset. From this, it is stated that if the optimal value function remains bounded as the discount factor approaches 1, then the bound implies that the MAG vanishes. Though this implication is true, we show that the optimal value function is bounded in this limit if and only if, under all policies, the expected number of times a non-zero reward is obtained is finite. This means that under all policies, all non-zero rewards are transient. Hence, such a result applies to a rather vacuous subset of MDPs.
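The dichotomy behind the proposition below can be seen in a toy closed form (illustrative values, not the general bound): a recurrent unit reward makes the optimal value diverge as the discount factor approaches 1, while a single transient reward keeps it bounded.

```python
def v_recurrent(gamma):
    # Absorbing loop paying reward 1 every step: geometric series.
    return 1.0 / (1 - gamma)

def v_transient(gamma):
    # A single reward of 1 at time 0, zero forever after.
    return 1.0

for gamma in (0.9, 0.99, 0.999):
    print(gamma, v_recurrent(gamma), v_transient(gamma))
```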

###### Proposition 7.

Let M be an MDP. Then the optimal value function remains bounded as the discount factor approaches 1 if and only if, under every policy, all non-zero rewards are transient.

###### Proof.

Suppose the optimal value function remains bounded as the discount factor approaches 1. WLOG we may assume there exists at least one transition with a non-zero reward, since otherwise the statement is trivial. Clearly the non-zero rewards cannot be recurrent under the optimal policy, since otherwise, as there exists at least one transition that induces a non-zero reward, at worst a policy may traverse the entire diameter of the communicating subset to receive that reward, and do so in perpetuity. That is,

But clearly,

So in the limit the value would diverge, contradicting boundedness. Hence each non-zero reward must be transient under the optimal policy, since otherwise, by the same argument above, the value would again diverge. Now, since the chain eventually enters a closed subset of the state space, for sufficiently large times its restriction is irreducible and positive recurrent (e.g. absorbing). We claim that there must not be any possible non-zero rewards within this absorbing subset. Let T be the maximum expected first hitting time of the absorbing subset of the state space under the optimal policy. By a similar argument as above, there cannot be any positive rewards in the absorbing subset, since otherwise the value obtained there would diverge as the discount factor approaches 1, again contradicting boundedness.

Hence, the optimal policy obtains non-zero rewards for only a finite number of time steps. Due to the optimality of this policy, the same must be true for any other policy. Hence it must be that all rewards in M are transient.

For the reverse implication, suppose that all non-zero rewards are transient under every policy. Let T be defined as above, as the maximum expected hitting time of the absorbing subset, which contains no non-zero rewards. T must exist, by a similar argument as above. Then we have,

Hence the optimal value function is bounded as the discount factor approaches 1. ∎

The maximum action gap bounds collapse to zero for an infinite horizon problem as the discount factor approaches 1, but only for environments where all the rewards are transient. [10] argue that representing the value function for such a class of MDPs is quite difficult in this limit; however, such a class of environments is best solved using undiscounted episodic MDP approaches. Since for any policy the number of time steps where a positive reward is possible is finite, finding an optimal policy is only relevant for the first T time steps, since afterwards the behaviour becomes irrelevant. [13] shows such domains can be converted to undiscounted episodic tasks. In doing so, the hypothesis space is completely different, as only episodic value functions need be considered, which have no dependency on a discount factor; hence the Rademacher complexity results stated previously do not apply here. A multi-step learning approach of first learning T, then applying an episodic RL algorithmic approach, is ideal for such environments.

##### Proof of Lemma 1

###### Proof.

Let π be a Blackwell optimal policy with its associated value function. Note that the claim follows from the hypothesis and the definition of Blackwell regret. Then,

∎

##### Proof of Proposition 2

###### Proof.

Let π be a Blackwell optimal policy; then it is bias optimal, and hence must minimize the expected hitting time of the goal. For the reverse implication, let π be the policy that minimizes the expected hitting time of the goal, with the relevant quantities defined in the text. It clearly follows that, for every sufficiently large discount factor, the value of π is at least that of any other policy, and therefore π is a Blackwell optimal policy. ∎

##### Proof of Corollary 3

###### Proof.

Let M be a distracting MDP as described above, with its parameters known to the algorithm. Let π be the Blackwell optimal policy learned and evaluated accordingly. By the previous Proposition, π is the policy that takes the shortest path from any state to the goal, as given in the proof of said Proposition. Moreover, from Proposition 2, the same follows for any policy that does not minimize the expected first hitting time of the goal. Hence, with knowledge of these parameters, the relevant threshold can be computed, and therefore a realizable discount factor may be selected. ∎

##### Proof of Corollary 4

###### Proof.

First, suppose the relevant quantities are given as above. It suffices to show, as in the previous proposition, that for the induced optimal policies and the Blackwell optimal policy,

Hence, it suffices to show :

Take the supremum of the relevant quantity and set the parameters accordingly. Then these choices satisfy the claim, and with the initial state distribution being a point mass at the appropriate state, the induced policy is not gain optimal, while it follows that the other policy is Blackwell optimal.

Without loss of generality, the same proof technique can be applied when either set of quantities is known and the other is held fixed. ∎

##### Proof of Corollary 5

###### Proof.

This follows as a Corollary from Theorem 6 and Proposition 2, since there exists a pivot state where the policy gap vanishes. It is easy to see, under Proposition 2 and the previous two Corollaries, that this is a pivot state. Let the policy equal the Blackwell optimal policy at every state except,