On Value Functions and
the Agent-Environment Boundary
Abstract
When function approximation is deployed in reinforcement learning (RL), the same problem may be formulated in different ways, often by treating a preprocessing step as part of the environment or as part of the agent. As a consequence, fundamental concepts in RL, such as (optimal) value functions, are not uniquely defined as they depend on where we draw this agent-environment boundary, causing problems in theoretical analyses that provide optimality guarantees. We address this issue via a simple and novel boundary-invariant analysis of Fitted Q-Iteration, a representative RL algorithm, where the assumptions and the guarantees are invariant to the choice of boundary. We also discuss closely related issues on state resetting and Monte-Carlo Tree Search, deterministic vs. stochastic systems, imitation learning, and the verifiability of theoretical assumptions from data.
Nan Jiang Department of Computer Science University of Illinois at Urbana-Champaign Urbana, IL 61801 nanjiang@illinois.edu
Preprint. Under review.
1 Introduction
The entire theory of RL, including that of function approximation, is built on mathematical concepts established in the Markov Decision Process (MDP) literature (Puterman, 1994), such as the optimal state-value and Q-value functions ($V^\star$ and $Q^\star$) and their policy-specific counterparts ($V^\pi$ and $Q^\pi$). These functions operate on the state (and action) of the MDP, and classical results tell us that they are always uniquely and well defined.
Are they really well defined?
Consider the following scenario, depicted in Figure 1. In a standard ALE benchmark (Bellemare et al., 2013), raw-pixel screens are produced as states (or strictly speaking, observations^{1}), and the agent feeds the state into a neural net to predict $Q^\star$. Since the original game screen has a high resolution, it is common in practice to downsample the screen as a preprocessing step (Mnih et al., 2015).
^{1} In Atari games, it is common to include past frames in the state representation to resolve partial observability, which is omitted in Figure 1. In most of the paper we stick to MDP terminology for simplicity, but our results and discussions also apply to POMDPs. See Appendix B for details.
There are two equivalent views of this scenario: We can either view the preprocessing step as part of the environment, or as part of the agent. Depending on where we draw this agent-environment boundary, $Q^\star$ will be different in general.^{2} It should also be obvious that there may exist many choices of the boundary (e.g., “boundary 0” in Figure 1), some of which we may not even be aware of. When we design an algorithm to learn $Q^\star$, which $Q^\star$ are we talking about?
^{2} For concreteness, we provide a minimal example of boundary dependence in Appendix A.
The good news is that many existing algorithms (with exceptions; see Section 3) are boundary-invariant, that is, the behavior of the algorithm remains the same however we change the boundary. The bad news is that many existing analyses^{3} are boundary-dependent, as they make assumptions that may either hold or fail in the same problem depending on the choice of the boundary; for example, in the analyses of approximate value iteration algorithms, it is common to assume that $Q^\star$ can be represented by the function approximator (“realizability”), and that the function space is closed under Bellman update (“low inherent Bellman error”, Szepesvári and Munos, 2005; Antos et al., 2008). Such a gap between the mathematical theory and the reality also leads to further consequences, such as the theoretical assumptions being fundamentally unverifiable from naturally generated data.
^{3} There are different kinds of theoretical analyses in RL (e.g., convergence analysis). In this paper we focus on analyses that provide near-optimality guarantees.
In this paper we systematically study the boundary dependence of RL theory. We ground our discussions in a simple and novel boundary-invariant analysis of Fitted Q-Iteration (Ernst et al., 2005), in which the correctness of the assumptions and the guarantees does not change with the subjective choice of the boundary (Sections 4 and 5). Within this analysis, we give up on the classical notions of value functions or even the (state-wise) Bellman equation, and replace them with weaker conditions that are boundary-invariant and that naturally come with improved verifiability. We also discuss closely related issues on state resetting and Monte-Carlo Tree Search (Section 3), deterministic vs. stochastic systems, imitation learning, and the verifiability of theoretical assumptions from data (Section 6).
2 Preliminaries
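Markov Decision Processes An infinite-horizon discounted MDP $M$ is specified by $(\mathcal{S}, \mathcal{A}, P, R, \gamma, d_0)$, where $\mathcal{S}$ is the finite state space,^{4} $\mathcal{A}$ is the finite action space, $P: \mathcal{S}\times\mathcal{A}\to\Delta(\mathcal{S})$ is the transition function,^{5} $R: \mathcal{S}\times\mathcal{A}\to\Delta([0, R_{\max}])$ is the reward function, $\gamma\in[0,1)$ is the discount factor, and $d_0\in\Delta(\mathcal{S})$ is the initial state distribution.
^{4} For the ease of exposition we assume finite $\mathcal{S}$, but its cardinality can be arbitrarily large.
^{5} $\Delta(\cdot)$ is the probability simplex.
A (stationary and deterministic) policy $\pi: \mathcal{S}\to\mathcal{A}$ specifies a decision-making strategy, and induces a distribution over random trajectories: $s_0\sim d_0$, $a_0\sim\pi$, $r_0\sim R(s_0,a_0)$, $s_1\sim P(s_0,a_0)$, $a_1\sim\pi$, …, where $a_t\sim\pi$ is short for $a_t=\pi(s_t)$. In later analyses, we will also consider stochastic policies and non-stationary policies formed by concatenating a sequence of stationary ones.
The performance of a policy is measured by its expected discounted return (or value):^{6}
$$v^\pi := \mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^t r_t \,\Big|\, s_0\sim d_0, \pi\Big].$$
^{6} It is important that the performance of a policy is measured under the initial state distribution. See Appendix F for further discussions.
The value of a policy lies in the range of $[0, V_{\max}]$ with $V_{\max} = R_{\max}/(1-\gamma)$. It will be useful to define the value function of $\pi$: $Q^\pi(s,a) := \mathbb{E}[\sum_{t=0}^{\infty}\gamma^t r_t \mid s_0=s, a_0=a, \pi]$, and $d_t^\pi$, the distribution over state-action pairs induced at time step $t$: $d_t^\pi(s,a) := \Pr[s_t=s, a_t=a \mid s_0\sim d_0, \pi]$. Note that $d_0^\pi = d_0\times\pi$, which means $s\sim d_0$, $a=\pi(s)$.
The goal of the agent is to find a policy $\pi$ that maximizes $v^\pi$. In the infinite-horizon discounted setting, there always exists an optimal policy $\pi^\star$ that maximizes the expected discounted return for all states simultaneously (and hence also for $d_0$). Let $Q^\star$ be a shorthand for $Q^{\pi^\star}$. It is known that $\pi^\star$ is the greedy policy w.r.t. $Q^\star$: For any function $f$, let $\pi_f$ denote its greedy policy $s\mapsto\arg\max_{a\in\mathcal{A}} f(s,a)$, and we have $\pi^\star=\pi_{Q^\star}$. Furthermore, $Q^\star$ satisfies the Bellman equation: $Q^\star=\mathcal{T}Q^\star$, where $\mathcal{T}$ is the Bellman optimality operator:
$$(\mathcal{T}f)(s,a) := \mathbb{E}_{r\sim R(s,a),\, s'\sim P(s,a)}\big[r + \gamma\max_{a'\in\mathcal{A}} f(s',a')\big].$$ (1)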
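To make the Bellman optimality operator concrete, the following is a minimal tabular sketch rather than part of the analysis. It assumes a small, randomly generated MDP; the arrays `P` and `R`, the sizes `S` and `A`, and the function names are all hypothetical, and rewards are represented by their expectations. Repeatedly applying the operator yields an approximate fixed point, from which the greedy policy is read off.

```python
import numpy as np


def bellman_optimality_operator(f, P, R, gamma):
    """Return T f, where (T f)(s, a) = R(s, a) + gamma * E_{s'}[max_a' f(s', a')].

    f: array (S, A); P: array (S, A, S) of transition probabilities;
    R: array (S, A) of expected rewards (an assumption of this sketch).
    """
    next_state_values = f.max(axis=1)            # max_a' f(s', a'), shape (S,)
    return R + gamma * (P @ next_state_values)   # shape (S, A)


def value_iteration(P, R, gamma, tol=1e-10, max_iters=10_000):
    """Iterate the operator from f = 0 until successive iterates differ by < tol."""
    f = np.zeros_like(R, dtype=float)
    for _ in range(max_iters):
        f_next = bellman_optimality_operator(f, P, R, gamma)
        if np.max(np.abs(f_next - f)) < tol:
            return f_next
        f = f_next
    return f


# Hypothetical toy MDP, for illustration only.
rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)   # normalize each P(s, a) into a distribution
R = rng.random((S, A))              # expected rewards in [0, 1]

q_star = value_iteration(P, R, gamma)
greedy_policy = q_star.argmax(axis=1)   # greedy action for each state
```

Note that this computation is tied to one particular state space, which is exactly the boundary-dependent object under discussion.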
Value-function Approximation In complex problems with high-dimensional observations, function approximation is often deployed to generalize over the large state space. In this paper we take a learning-theoretic view of value-function approximation: We are given a function space $\mathcal{F}$, and for simplicity we assume $\mathcal{F}$ is finite.^{7} The goal, stated in the classical, boundary-dependent fashion, is to identify a function $f\in\mathcal{F}$ such that $f\approx Q^\star$, so that $\pi_f$ is a near-optimal policy. This naturally motivates a common assumption, known as realizability, that $Q^\star\in\mathcal{F}$, which will be useful for our discussions in Section 3.
^{7} The only reason that we assume finite $\mathcal{F}$ is for mathematical convenience in Theorem 3, and removing this assumption only has minor impact on our results. See further comments after Theorem 3’s proof in Appendix E.
For most of the paper we will be concerned with batch-mode value-function approximation, that is, the learner is passively given a dataset and cannot directly interact with the environment. The implications in the exploration setting will be briefly discussed at the end of the paper.
Fitted Q-Iteration (FQI) FQI (Ernst et al., 2005; Szepesvári, 2010) is a batch RL algorithm that solves a sequence of least-squares regression problems with $\mathcal{F}$ to approximate each step of value iteration. It is also considered the prototype for the popular DQN algorithm (Mnih et al., 2015), and is often used as a representative of off-policy value-based RL algorithms in empirical studies (Fu et al., 2019). We defer a detailed description of the algorithm to Section 5.1.
3 On the Non-uniqueness of $Q^\star$
In this section we expand the discussions in Section 1 and introduce a few interesting paradoxes and questions to develop a deeper understanding of the agent-environment boundary, and to motivate our later analyses with theoretical and practical concerns. We will leave some questions open and revisit them after the technical sections.
Paradox: $Q^\star$ Uniquely Defined via State Resetting?
While the boundary dependence of $Q^\star$ should be intuitive from Section 1, one might find a conflict in the following fact: In complex simulated RL environments, we often approximate $Q^\star$ via Monte-Carlo Tree Search (MCTS; Kearns et al., 2002; Kocsis and Szepesvári, 2006), either for the purpose of generating expert demonstrations for imitation learning (Guo et al., 2014) or for code debugging and sanity checks. For example, one can collect a regression dataset of state-action pairs labeled with $Q^\star(s,a)$, where $Q^\star(s,a)$ is approximated by MCTS, and verify whether $Q^\star\in\mathcal{F}$ by solving the regression problem over $\mathcal{F}$. MCTS enjoys near-optimality guarantees without using any form of function approximation, so how can $Q^\star$ not be well-defined?
The answer lies in the way that MCTS interacts with the simulator: At each time step, MCTS rolls out multiple trajectories from the current state to determine the optimal action, often by resetting the state after each simulation trajectory is completed. There are many ways of resetting the state: For example, one can clone the RAM configuration of the “real” state and always reset to that (“boundary 0” in Figure 1), as done by Guo et al. (2014). One can also attempt to reproduce the sequence of observations and actions from the beginning of the episode (see POMCP; Silver and Veness, 2010). Both are valid state resetting operations but for different choices of the boundary.^{8}
^{8} Another natural boundary corresponds to resetting the contents of the RAM but leaving the PRG state intact.
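The difference between resetting operations can be illustrated with a toy simulator. This is a hypothetical sketch, not the ALE interface: the class `ToySimulator`, its visible counter `ram`, and its hidden pseudo-random generator are all assumptions made for illustration. Restoring the full configuration (including the generator) makes every rollout identical, whereas restoring only the visible part leaves the generator state intact and produces varied rollouts from the "same" state.

```python
import random


class ToySimulator:
    """Hypothetical simulator: a visible counter ("RAM") plus a hidden generator."""

    def __init__(self, seed):
        self.ram = 0
        self.rng = random.Random(seed)

    def step(self, action):
        noise = self.rng.choice([0, 1])   # hidden source of randomness
        self.ram += action + noise
        return self.ram

    def clone_full_state(self):
        return (self.ram, self.rng.getstate())

    def restore_full_state(self, snapshot):
        # Reset everything, including the generator: rollouts become deterministic.
        self.ram, rng_state = snapshot
        self.rng.setstate(rng_state)

    def restore_ram_only(self, snapshot):
        # Reset only the visible part; the generator keeps advancing.
        self.ram = snapshot[0]


def rollout(sim, actions):
    return [sim.step(a) for a in actions]


sim = ToySimulator(seed=0)
snapshot = sim.clone_full_state()
actions = [1, 0, 1, 1]

full_reset_rollouts = []
for _ in range(20):
    sim.restore_full_state(snapshot)
    full_reset_rollouts.append(tuple(rollout(sim, actions)))

ram_reset_rollouts = []
for _ in range(20):
    sim.restore_ram_only(snapshot)
    ram_reset_rollouts.append(tuple(rollout(sim, actions)))

num_distinct_full = len(set(full_reset_rollouts))   # one distinct trajectory
num_distinct_ram = len(set(ram_reset_rollouts))     # several distinct trajectories
```

Averaging returns over the two sets of rollouts would estimate values for two different notions of state, which is the sense in which each resetting operation commits to a boundary.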
Practical Concern The above discussion reveals a practical concern: when we compute $Q^\star$ (via MCTS or other methods) for sanity checking the realizability of $\mathcal{F}$, we have to explicitly choose a boundary, which may or may not be the best choice for the given $\mathcal{F}$. More importantly, not all RL problems come with the resetting functionality, and without resetting it is fundamentally impossible to check certain assumptions, such as realizability; see Appendix C for a formal argument and proof.
Question 1.
When the appropriate boundary is unclear or state resetting is not available, how can we empirically verify the theoretical assumptions?
Theoretical Concern Relatedly, since the validity of the common assumptions and the guarantees generally depends on the boundary, given an arbitrary MDP $M$ and a function class $\mathcal{F}$, we may naturally ask if re-expressing the problem as some equivalent $M'$ and $\mathcal{F}'$ may result in better guarantees, and what is the “best boundary” for stating the theoretical guarantees for a problem and how to characterize it. Intuition tells us that we might want to choose the “rightmost” boundary (direction defined according to Figure 1), that is, all preprocessing steps should belong to the environment. This corresponds to the state compression scheme, (Sun et al., 2019), but such a definition is very brittle, as even the slightest numerical changes in $\mathcal{F}$ might cause the boundary to change significantly.^{9}
^{9} If we take an arbitrary function and add arbitrarily small perturbations to for each , might reveal all the information in and the mapping is essentially an isomorphism.
Question 2.
How can we robustly define the boundary that provides the best theoretical guarantees?
Solution: Boundary-Invariant Analyses We answer the two questions and address the practical and theoretical concerns via a novel boundary-invariant analysis, which we derive in the following sections. While it is impossible to cover all existing algorithms, we exemplify the analysis for a representative value-based algorithm (FQI), and believe that the spirit and the techniques are widely applicable. For FQI, we show that it is possible to relax the common assumptions in the literature to their boundary-invariant counterparts and still provide the same near-optimality guarantees. This addresses Question 2, as we provide a compelling guarantee compared to that of classical boundary-dependent analyses under any boundary, so there is no need to choose a boundary whatsoever. Our assumptions are also more verifiable than their boundary-dependent counterparts, which partially addresses Question 1.
4 Case Study: Batch Contextual Bandit (CB) with Predictable Rewards
We warm-start with the simple problem of fitting a reward function from batch data in contextual bandits, which may be viewed as MDPs with $\gamma=0$.^{10} The analysis also applies straightforwardly to learning a policy-specific value-function from Monte-Carlo rollouts and performing one-step policy improvement (Sutton and Barto, 2018). This section also provides important building blocks for Section 5, and the simplicity of the analysis allows us to thoroughly discuss the intuitions and the conceptual issues, leaving Section 5 focused on the technical aspects.
^{10} In Sections 4 and 5, the same symbol carries the same (or similar) meaning. However, there are some inevitable differences and the reader should not confuse the two settings in general.
4.1 Setting and Algorithm
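Let $D=\{(x,a,r)\}$ be a dataset, where $x\sim d_0$, $a\sim\pi_b$, and $r\sim R(x,a)$. Let $\mu$ denote the joint distribution over $(x,a)$, or $\mu=d_0\times\pi_b$. For any $f\in\mathcal{F}$, define the empirical squared loss
$$L_D(f) := \frac{1}{|D|}\sum_{(x,a,r)\in D}\big(f(x,a)-r\big)^2,$$ (2)
and the population version $L_\mu(f) := \mathbb{E}_{(x,a)\sim\mu,\, r\sim R(x,a)}[(f(x,a)-r)^2]$. The algorithm fits a reward function by minimizing $L_D$, that is, $\hat{f} := \arg\min_{f\in\mathcal{F}} L_D(f)$, and outputs $\pi_{\hat{f}}$. We are interested in providing a guarantee to the performance of this policy, that is, the expected reward obtained by executing $\pi_{\hat{f}}$, $v^{\pi_{\hat{f}}}$. We will base our analyses on the following inequality:
$$L_\mu(\hat{f}) - \min_{f\in\mathcal{F}} L_\mu(f) \le \epsilon.$$ (3)
In words, we assume that $\hat{f}$ approximately minimizes the population loss. Such a bound can be obtained via a uniform convergence argument, where $\epsilon$ will depend on the sample size and the statistical complexity of $\mathcal{F}$ (e.g., pseudo-dimension (Haussler, 1992)). We do not include this part as it is standard and orthogonal to the discussions in this paper, and rather focus on how to provide a guarantee on $v^{\pi_{\hat{f}}}$ as a function of $\epsilon$.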
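The batch procedure described in this subsection (minimize the empirical squared loss over a finite class, then act greedily) can be sketched as follows. This is a minimal illustration under stated assumptions: the synthetic bandit, the three candidate functions, the uniform behavior policy, and all function names are hypothetical and chosen only to make the example self-contained.

```python
import numpy as np


def empirical_loss(f, contexts, actions, rewards):
    """Average squared error of f's reward predictions on the logged data."""
    preds = np.array([f(x, a) for x, a in zip(contexts, actions)])
    return float(np.mean((preds - rewards) ** 2))


def fit_reward_function(function_class, contexts, actions, rewards):
    """Empirical risk minimization over a finite class of candidates."""
    losses = [empirical_loss(f, contexts, actions, rewards) for f in function_class]
    return function_class[int(np.argmin(losses))]


def greedy_policy(f, num_actions):
    """Policy that picks the action with the largest predicted reward."""
    return lambda x: max(range(num_actions), key=lambda a: f(x, a))


# Hypothetical bandit: context x in [0, 1], two actions, Bernoulli rewards
# with mean x for action 1 and 1 - x for action 0.
rng = np.random.default_rng(0)
n, num_actions = 2000, 2
contexts = rng.random(n)
actions = rng.integers(num_actions, size=n)            # uniform behavior policy
mean_reward = np.where(actions == 1, contexts, 1 - contexts)
rewards = rng.binomial(1, mean_reward).astype(float)

function_class = [
    lambda x, a: x if a == 1 else 1 - x,   # matches the expected reward
    lambda x, a: 0.5,                      # ignores context and action
    lambda x, a: 1 - x if a == 1 else x,   # reversed predictions
]

f_hat = fit_reward_function(function_class, contexts, actions, rewards)
pi_hat = greedy_policy(f_hat, num_actions)
```

The procedure touches contexts and actions only by evaluating candidate functions on logged data points, which is the property the next subsection builds on.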
4.2 A Sufficient Condition for Boundary Invariance
Before we start the analysis, we first show that the algorithm itself is boundary-invariant, leading to a sufficient condition for judging the boundary invariance of analyses. The concept central to the algorithm is the squared loss $L_D(f)$. Although $L_D(f)$ is defined using $(x,a)$, the definition refers to $(x,a)$ exclusively through the evaluation of $f\in\mathcal{F}$ on $(x,a)$, taking expectation over a naturally generated dataset $D$.^{11} The data points are generated by an objective procedure (collecting data with policy $\pi_b$), and on every data point $(x,a,r)$, $(f(x,a)-r)^2$ is the same scalar regardless of the boundary, hence the algorithmic procedure is boundary-invariant.
^{11} Here we use “naturally generated” to contrast with the state resetting operations discussed in Section 3.
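The argument above can be checked numerically on a toy case. In this sketch, the preprocessing step `downsample`, the predictor `g`, and the synthetic data are all hypothetical. The same predictor is evaluated under two boundaries: one where the state is the raw screen and preprocessing belongs to the agent, and one where the state is the downsampled screen and preprocessing belongs to the environment. The empirical squared loss is identical in both views.

```python
import numpy as np


def downsample(screen):
    """Hypothetical preprocessing: average non-overlapping 2x2 blocks."""
    h, w = screen.shape
    return screen.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def g(features, action):
    """Hypothetical predictor that operates on downsampled screens."""
    return 0.01 * float(features.sum()) + 0.1 * action


def f_agent_side(screen, action):
    """Boundary A: the state is the raw screen; the agent preprocesses it."""
    return g(downsample(screen), action)


def f_env_side(small_screen, action):
    """Boundary B: the state is already downsampled by the environment."""
    return g(small_screen, action)


def empirical_squared_loss(f, data):
    return float(np.mean([(f(x, a) - r) ** 2 for x, a, r in data]))


rng = np.random.default_rng(0)
raw_data = [
    (rng.random((4, 4)), int(rng.integers(2)), float(rng.random()))
    for _ in range(100)
]
# The same logged interactions, described under the other boundary.
env_side_data = [(downsample(x), a, r) for x, a, r in raw_data]

loss_a = empirical_squared_loss(f_agent_side, raw_data)
loss_b = empirical_squared_loss(f_env_side, env_side_data)
```

Since the loss is the same scalar under both descriptions, any procedure that depends on the data only through such losses behaves identically under either boundary.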
Inspired by this, we provide the following sufficient condition for boundary-invariant analyses:
Claim 1.
An analysis is boundary-invariant if the assumptions and the optimal value are defined in a way that accesses states and actions exclusively through evaluations of functions in $\mathcal{F}$, with plain expectations (either empirical or population) over naturally generated data distributions.
A number of pitfalls need to be avoided in the specification of such a condition:
- Restricting the functions to $\mathcal{F}$ is important, as one can define conditional expectations (on a single state) through plain expectations via the use of state indicator functions (or the Dirac delta functions for continuous state spaces).
- Besides the assumptions, the very notion of optimality also needs to be taken care of, as $v^{\pi^\star}$ (the usual notion of optimal value) is also a boundary-dependent quantity.^{12}
^{12} For any fixed $\pi$, $v^\pi$ is boundary-invariant. Here the boundary dependence of $v^{\pi^\star}$ is due to that of $\pi^\star$.
That said, this condition is not perfectly rigorous, as we find it difficult to make it mathematically strict without being verbose and/or restrictive. Regardless, we believe it conveys the right intuitions and can serve as a useful guideline for judging the boundary invariance of a theory. Furthermore, the condition provides us with significant mathematical convenience: as long as the condition is satisfied, we can analyze an algorithm under any boundary, allowing us to use the standard MDP formulations and all the objects defined therein (states, actions, their distributions, etc.).
4.3 Classical Assumptions
We now review the classical assumptions in this problem for later reference and comparison. The first assumption is that data is exploratory, often guaranteed by taking randomized actions in the data collection policy (or behavior policy $\pi_b$) and not starving any of the actions:
Assumption 1 ( is exploratory).
There exists a universal constant such that,
,
The second one is the realizability as already discussed in Section 2.
Assumption 2 (Realizability).
. In contextual bandits, , .
Two comments before we move on:
- While we consider exact realizability for simplicity, it is possible to allow an approximation error and state a guarantee that degrades gracefully with the violation of the assumption. Such an extension is routine and we omit it for readability.
4.4 Boundary-Invariant Assumptions and Analysis
Additional Notations For any and any distribution , define . To improve readability we will often omit “” when an expectation involves (and “” for Section 5), and use as a shorthand for .
Definition 1 (Admissible distributions (bandit)).
Given a contextual bandit problem and a space of candidate reward functions , we call the space of admissible distributions.
Assumption 3.
There exists a universal constant such that, for any and any admissible ,
Assumption 3 is a direct consequence of Assumption 1, as is an upper bound on the norm of the importance ratio between and . The proof is elementary and omitted.
Assumption 4.
There exists , such that for all admissible ,
(4) 
and for any ,
(5) 
We say that such an is a valid reward function of the CB.
Assumption 4 is implied by Assumption 2, as satisfies both Eq.(4) and Eq.(5): Eq.(4) can be obtained from by taking the expectation of both sides w.r.t. . Eq.(5) is the standard bias-variance decomposition for squared loss regression when is the Bayes-optimal regressor.^{13} Eq.(4) guarantees that still bears the semantics of reward, although no longer in a pointwise manner. Eq.(5) guarantees that can be reliably identified through squared loss minimization, which is specialized to the batch learning setting with squared loss minimization. In fact, we provide a counterexample in Appendix D showing that dropping Eq.(5) can result in the failure of the algorithm, and also discuss other learning settings where this assumption is not needed.
^{13} It is easy to allow an approximation error in Eq.(4) and/or (5). For example, one can measure the violation of Eq.(4) by , and such errors can be easily incorporated in our later analysis.
Now we are ready to state the main theorem of this section, whose proof can be found in Appendix E.
5 Case Study: Fitted Q-Iteration
5.1 Setting and Algorithm
To highlight the differences between boundary-dependent and boundary-invariant analyses, we adopt a simplified setting assuming i.i.d. data. Interested readers can consult prior works for more general analyses on mixing data (e.g., Antos et al., 2008).
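Let $D=\{(s,a,r,s')\}$ be a dataset, where $(s,a)\sim\mu$, $r\sim R(s,a)$, and $s'\sim P(s,a)$. For any $f, f'\in\mathcal{F}$, define the empirical squared loss
$$L_D(f; f') := \frac{1}{|D|}\sum_{(s,a,r,s')\in D}\Big(f(s,a) - r - \gamma\max_{a'\in\mathcal{A}} f'(s',a')\Big)^2,$$
and the population version $L_\mu(f; f')$. The algorithm initializes $f_0\in\mathcal{F}$ arbitrarily, and
$$f_k := \arg\min_{f\in\mathcal{F}} L_D(f; f_{k-1}).$$
The algorithm repeats this for some $k$ iterations and outputs $\pi_{f_k}$. We are interested in providing a guarantee to the performance of this policy.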
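The iteration described above can be sketched for a finite function class as follows. This is a minimal illustration of the standard FQI recipe (regress onto the reward plus the discounted maximum of the previous iterate at the next state), not the paper's implementation; the two-state toy MDP, the table-backed function class, and all names are hypothetical.

```python
import numpy as np


def fqi(function_class, data, gamma, num_actions, num_iters):
    """Fitted Q-Iteration over a finite class of callables f(s, a).

    data: list of transitions (s, a, r, s_next).
    """
    f_prev = function_class[0]   # arbitrary initialization
    for _ in range(num_iters):
        # Regression targets built from the previous iterate.
        targets = np.array([
            r + gamma * max(f_prev(s_next, b) for b in range(num_actions))
            for (s, a, r, s_next) in data
        ])

        def loss(f):
            preds = np.array([f(s, a) for (s, a, r, s_next) in data])
            return float(np.mean((preds - targets) ** 2))

        # Least-squares fit over the finite class.
        f_prev = min(function_class, key=loss)
    return f_prev


def table(values):
    """Wrap a dict {(s, a): value} as a callable f(s, a)."""
    return lambda s, a: values[(s, a)]


# Hypothetical toy MDP: from state 0, action 1 pays reward 1 and action 0 pays 0;
# both lead to state 1, which is absorbing with zero reward.
function_class = [
    table({(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.0}),
    table({(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 0.0}),
    table({(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.0}),
]
data = [(0, 0, 0.0, 1), (0, 1, 1.0, 1), (1, 0, 0.0, 1), (1, 1, 0.0, 1)] * 5

f_out = fqi(function_class, data, gamma=0.9, num_actions=2, num_iters=3)
policy_at_0 = max(range(2), key=lambda a: f_out(0, a))   # greedy action in state 0
```

As in the bandit case, the procedure evaluates candidate functions only on logged transitions, so its output does not depend on how the state is described.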
5.2 Classical Assumptions
Similar to the CB case, there will be two assumptions, one that requires the data to be exploratory, and one that requires $\mathcal{F}$ to satisfy certain representation conditions.
Definition 2 (Admissible distributions (MDP)).
A state-action distribution is admissible if it takes the form of $d_t^\pi$ for any $t$ and any (stochastic and/or non-stationary) policy $\pi$.
Assumption 5 ( is exploratory).
There exists a universal constant such that for any admissible , .
This guarantees that $\mu$ covers all admissible distributions well. The upper bound is known as the concentrability coefficient (Munos, 2003), and here we use the simplified version from a recent analysis by Chen and Jiang (2019). See Farahmand et al. (2010) for a more fine-grained characterization of this quantity.
Assumption 6 (No inherent Bellman error).
, .
This assumption states that $\mathcal{F}$ is closed under the Bellman update operator $\mathcal{T}$. It automatically implies $Q^\star\in\mathcal{F}$ (for finite $\mathcal{F}$) and hence is stronger than realizability, but replacing this assumption with realizability can cause FQI to diverge (Van Roy, 1994; Gordon, 1995; Tsitsiklis and Van Roy, 1997) or have exponential sample complexity (Dann et al., 2018). We refer the readers to Chen and Jiang (2019) for further discussions on the necessity of this assumption.
It is also possible to relax the assumption and allow an approximation error in the form of , known as the inherent Bellman error (Munos and Szepesvári, 2008). Again we do not consider this extension, and incorporating it in our analysis is straightforward.
5.3 Boundary-Invariant Assumptions
Assumption 7.
There exists a universal constant such that, for any and any admissible state-action distribution ,
Assumption 8.
, there exists such that for all admissible ,
(6) 
and for any ,
(7) 
Define as the operator that maps to an arbitrary (but systematically chosen) that satisfies the above conditions.
Assumption 8 states that for every $f\in\mathcal{F}$, we can define a contextual bandit problem with random reward $r+\gamma\max_{a'} f(s',a')$, and there exists a function that is a valid reward function for this problem (Assumption 4). In the classical definitions, the true reward function for this problem is $\mathcal{T}f$, so our operator can be viewed as the boundary-invariant version of $\mathcal{T}$.
5.4 Boundary-Invariant Analysis
In Section 4 for contextual bandits, is defined directly in the assumptions, and we use it to define the optimal value in Theorem 2. In Assumptions 7 and 8, however, no counterpart of is defined. How do we even express the optimal value that we compete with?
We resolve this difficulty by relying on the operator defined in Assumption 8. Recall that in classical analyses, can be defined as the fixed point of , so we define similarly through .
Theorem 3.
Under Assumption 8, there exists s.t. for any admissible .
Lemma 4 (Boundary-invariant version of contraction).
Under Assumption 8, for any admissible , , let , and denote the distribution of generated as ,
(8) 
Although similar results are also proved in classical analyses, proving Lemma 4 under Assumption 8 is more challenging. For example, a very useful property in the classical analysis is that , and it holds in a pointwise manner for every . In our boundary-invariant analyses, however, such a handy tool is not available as we only make assumptions on the average-case properties of the functions, and their pointwise behavior is undefined. We refer the readers to Appendix E for how we overcome this technical difficulty.
With defined in Theorem 3, we state the main theorem of this section, with proof deferred to Appendix E.
Theorem 5.
Let be the sequence of functions obtained by FQI. Let be a universal upper bound on the error incurred in each iteration, that is, ,
Let be the greedy policy of . Then
6 Discussions
We conclude the paper with further discussions and open questions.
Verifiability The ability to verify the correctness of theoretical assumptions is important to the development and the debugging of RL algorithms. In Section 3 we argued that classical realizability-type assumptions are not only boundary-dependent, but also cannot be verified from naturally generated data without state resetting. One major difficulty is that quantities like are defined via conditional expectations “”, and estimating them requires reproducing the same state multiple times, which is impossible in general. This issue is eliminated in the boundary-invariant analyses, as the assumptions are stated using plain expectations over data (recall Claim 1), which can be verified (up to any accuracy) via Monte-Carlo estimation.
Of course, verifying the boundary-invariant assumptions still faces significant challenges, as the statements frequently use language like “” and “ admissible ”, making it computationally expensive to verify them exhaustively. We note, however, that this is likely to be the case for any strict theoretical assumptions, and practitioners often develop heuristics under the guidance of theory to make the verification process tractable. For example, the difficulty related to “” may be resolved by clever optimization techniques, and that related to “” may be addressed by testing the assumptions on a diverse and representative set of distributions designed with domain knowledge. We leave the design of an efficient and effective verification protocol to future work.
“Boundary 0” As we hinted in Figure 1, there exists a choice of the boundary that makes every RL problem deterministic (Ng and Jordan, 2000). This leads to a number of further paradoxes: for example, many difficulties in RL arise due to stochastic transitions, and there are algorithms designed for deterministic systems that avoid these difficulties. Why don’t we always use them since all environments are essentially deterministic? This question, among others, is discussed in Appendix F. In general, we find that investigating this extreme view is helpful in clarifying some of the confusions, and it provides justifications for certain design choices in our theory.
Should we discard boundary-dependent analyses? We do not advocate for replacing boundary-dependent analyses with their boundary-invariant counterparts, and this is not the intention of this paper. Rather, our purpose is to demonstrate the feasibility of boundary-invariant analyses, and to use the concrete mathematics to ground the discussions of the conceptual issues (which can easily go astray and become vacuous given the nature of this topic). On a related note, boundary-dependent analyses make stronger assumptions and hence are mathematically easier to work with in general.
Boundary invariance in exploration algorithms A boundary-invariant version of the Bellman equation for policy evaluation has appeared in Jiang et al. (2017), who study PAC exploration under function approximation, although they do not discuss its further implications. While our assumptions are inspired by theirs, we have to deal with additional technical difficulties due to off-policy policy optimization. In Appendix D we discuss the connections and the differences between the two papers on a concrete example.
MCTS meets value-function approximation In Section 3 we show that the issue of boundary dependence is not just a conceptual puzzle and can have real consequences, especially when MCTS and value-function approximation appear together. One can further ask: When we use MCTS to provide expert demonstrations for a value-based learner (e.g., Guo et al., 2014), how should we choose the boundary (i.e., which notion of state should we reset to in MCTS)?^{14} More generally, when the learner is of limited capability in an imitation learning scenario (Ross et al., 2011; Ross and Bagnell, 2014), how should we best design the demonstration policy? In fact, we show in Appendix G that demonstration using $\pi^\star$ for a poorly chosen boundary can be completely useless. Answering these questions is beyond the scope of this paper, and we leave the investigation to future work.
^{14} The same question can be asked about the residual algorithms (Baird, 1995), which minimize Bellman errors via the double sampling trick, i.e., drawing two i.i.d. next-states from the same $(s,a)$.
References
 Antos et al. (2008) András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.
 Baird (1995) Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pages 30–37. Elsevier, 1995.
 Bellemare et al. (2013) Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
 Chen and Jiang (2019) Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, pages 1042–1051, 2019.
 Dann et al. (2018) Christoph Dann, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. On Oracle-Efficient PAC RL with Rich Observations. In Advances in Neural Information Processing Systems, pages 1429–1439, 2018.
 Ernst et al. (2005) Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.
 Farahmand et al. (2010) Amirmassoud Farahmand, Csaba Szepesvári, and Rémi Munos. Error Propagation for Approximate Policy and Value Iteration. In Advances in Neural Information Processing Systems, pages 568–576, 2010.
 Fu et al. (2019) Justin Fu, Aviral Kumar, Matthew Soh, and Sergey Levine. Diagnosing bottlenecks in deep Q-learning algorithms. In Proceedings of the 36th International Conference on Machine Learning, pages 2021–2030, 2019.
 Gordon (1995) Geoffrey J Gordon. Stable function approximation in dynamic programming. In Proceedings of the twelfth international conference on machine learning, pages 261–268, 1995.
 Guo et al. (2014) Xiaoxiao Guo, Satinder Singh, Honglak Lee, Richard L Lewis, and Xiaoshi Wang. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In Advances in Neural Information Processing Systems, pages 3338–3346, 2014.
 Haussler (1992) David Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and computation, 1992.
 Jiang et al. (2017) Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E. Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, 2017.
 Kakade and Langford (2002) Sham Kakade and John Langford. Approximately Optimal Approximate Reinforcement Learning. In Proceedings of the 19th International Conference on Machine Learning, volume 2, pages 267–274, 2002.
 Kearns et al. (2002) Michael Kearns, Yishay Mansour, and Andrew Y Ng. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learning, 49(2-3):193–208, 2002.
 Kocsis and Szepesvári (2006) Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In Machine Learning: ECML 2006, pages 282–293. Springer Berlin Heidelberg, 2006.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 Munos (2003) Rémi Munos. Error bounds for approximate policy iteration. In ICML, volume 3, pages 560–567, 2003.
 Munos and Szepesvári (2008) Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(May):815–857, 2008.
 Ng and Jordan (2000) Andrew Y Ng and Michael Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In Proceedings of the Sixteenth conference on Uncertainty in artificial intelligence, pages 406–415. Morgan Kaufmann Publishers Inc., 2000.
 Puterman (1994) ML Puterman. Markov Decision Processes. John Wiley & Sons, New Jersey, 1994.
 Ross and Bagnell (2014) Stephane Ross and J Andrew Bagnell. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979, 2014.
 Ross et al. (2011) Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635, 2011.
 Silver and Veness (2010) David Silver and Joel Veness. Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems, pages 2164–2172, 2010.
 Sun et al. (2019) Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Model-based RL in Contextual Decision Processes: PAC bounds and Exponential Improvements over Model-free Approaches. In Conference on Learning Theory, 2019.
 Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
 Szepesvári (2010) Csaba Szepesvári. Algorithms for reinforcement learning. Synthesis lectures on artificial intelligence and machine learning, 4(1):1–103, 2010.
 Szepesvári and Munos (2005) Csaba Szepesvári and Rémi Munos. Finite time bounds for sampling based fitted value iteration. In Proceedings of the 22nd international conference on Machine learning, pages 880–887. ACM, 2005.
 Tsitsiklis and Van Roy (1997) John N Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5), 1997.
 Van Roy (1994) Benjamin Van Roy. Feature-based methods for large scale dynamic programming. PhD thesis, Massachusetts Institute of Technology, 1994.
 Wolpert (1996) David H Wolpert. The lack of a priori distinctions between learning algorithms. Neural computation, 8(7):1341–1390, 1996.
Appendix A A Minimal Example of Boundary Dependence
For concreteness we provide a minimal example where changes with the boundary. Consider a contextual bandit problem with two contexts/states, and , and the context distribution is . There is only one action, and yields a deterministic reward of , and yields a deterministic reward of , so and . However, if the agent ignores the context, the problem is equivalent to a multi-armed bandit with one arm yielding a random reward that is Bernoulli distributed, so predicting value for both states gives another valid function.
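The example can be checked numerically. The sketch below uses concrete values that are assumptions made for illustration, since the symbols are not recoverable from the text above: a uniform context distribution and deterministic rewards of 1 and 0. The context names `x1` and `x2` are hypothetical. It compares the prediction when the agent observes the context with the prediction when the context is treated as hidden.

```python
# Assumed values for illustration only: uniform contexts, rewards 1 and 0.
context_prob = {"x1": 0.5, "x2": 0.5}
reward = {"x1": 1.0, "x2": 0.0}


def value_with_context(x):
    """Value of the single action when the agent observes the context."""
    return reward[x]


def value_without_context():
    """Value of the single action when the context is ignored.

    The reward then looks like a Bernoulli draw, so the value is its mean.
    """
    return sum(context_prob[x] * reward[x] for x in context_prob)


# Two different predictions for the same underlying problem.
q_seen = {x: value_with_context(x) for x in context_prob}
q_hidden = {x: value_without_context() for x in context_prob}

print("context observed:", q_seen)
print("context ignored: ", q_hidden)
```

Both dictionaries describe the same bandit; they differ only in whether the context is placed on the agent side or the environment side of the boundary.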
Appendix B Partial Observability
RL environments are partially observable in general, and even if the original environment is Markovian, partial observability can arise if we treat a lossy/noisy preprocessing step as part of the environment. While we restrict ourselves to MDPs in the main text for ease of exposition, we note that our results and discussions still apply to the partially observable case. We warn, however, that it is easy to get confused on this topic, so we provide the following comments to clarify some potential points of confusion.
1. All POMDPs have well-defined value-functions, as a POMDP can always be treated as an MDP over histories (alternating observations and actions).
2. Let be the environment without a preprocessing step , and be the environment that includes . Discussions in Section 1 are based on the fact that functions that operate on (e.g., the neural net that takes the downsampled images as input in Figure 1) can also be treated as functions of (e.g., the original image), as .
Now what happens if is an MDP, but is a POMDP, and functions in take past observations (and actions) as inputs? In this case, may not be a function of , as may have “forgotten” the previous states.
As usual, this can be fixed if we re-express by treating histories as states (let the new MDP be ), even if this is redundant as is Markovian in . By doing so, is always a function of .
3. Similarly, if is not deterministic but rather a noisy process that depends on exogenous randomness, one can include such randomness in the state (or history) of (again this is redundant for ).
4. While the redundancy introduced in 2 and 3 keeps most properties of intact (including , , , , , ), it does affect the concentratability coefficient defined in Assumption 5.
As an extreme example, consider an MDP whose state is always drawn i.i.d. from a fixed distribution independent of the time step or actions taken. Let be generated from taking actions uniformly at random, and generated by always taking the same action. In this case, as the marginals of and on the states are exactly the same. However, when we treat histories (which include past actions) as states, becomes exponentially large in the horizon. So in this sense, Assumption 5 is also boundary-dependent. In contrast, Assumption 7 is invariant to such a transformation.
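The blow-up in this extreme example can be made concrete by brute-force enumeration. The sketch below assumes the coefficient behaves like a worst-case density ratio between the fixed-action distribution and the uniform-action distribution; the exact form in Assumption 5 may differ. The function names, the uniform state distribution, and the choice of action 0 as the fixed action are all hypothetical.

```python
from itertools import product


def state_ratio(num_states):
    """Worst-case density ratio when only the current state is the 'state'.

    States are i.i.d. (assumed uniform) regardless of the actions taken, so
    both data-collection schemes induce the same state marginal.
    """
    p = 1.0 / num_states
    return max(p / p for _ in range(num_states))


def history_ratio(num_states, num_actions, horizon):
    """Worst-case density ratio when whole histories are treated as states.

    A history holds `horizon` states and `horizon - 1` past actions.
    """
    p_state = 1.0 / num_states
    worst = 0.0
    for _states in product(range(num_states), repeat=horizon):
        for actions in product(range(num_actions), repeat=horizon - 1):
            p_states = p_state ** horizon
            # Data distribution: actions chosen uniformly at random.
            p_uniform = p_states * (1.0 / num_actions) ** (horizon - 1)
            # Comparison distribution: always take action 0.
            p_fixed = p_states if all(a == 0 for a in actions) else 0.0
            worst = max(worst, p_fixed / p_uniform)
    return worst


for h in range(1, 5):
    print(h, state_ratio(2), history_ratio(2, 2, h))
```

Under these assumptions the state-level ratio stays constant while the history-level ratio grows as the number of actions raised to the number of past actions, which is the exponential dependence on the horizon described above.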
As a related side comment, our main text focuses on how the agent processes the sensory information, but there is another place where the agent interfaces with the environment: actions. While we do not discuss the boundary dependence of actions, we simply note that our boundary-invariant analyses are likely immune to any possible issues.
Appendix C Realizability is Not Verifiable
To show that realizability is not verifiable in general, it suffices to give an example in contextual bandits (Section 4). We further simplify the problem by restricting the number of actions to , which becomes a standard regression problem, and is realizable if it contains the Bayes-optimal predictor. We provide an argument below, inspired by that of the No Free Lunch theorem [Wolpert, 1996].
Consider a regression problem with finite feature space and label space . The hypothesis class only consists of one function, , that takes a constant value . We will construct multiple data distributions in the form of , and is the Bayes-optimal regressor for one of them (hence realizable) but not for the others, and in the latter case realizability will be violated by a large margin. An adversary chooses the distribution in a randomized manner, and the learner draws a finite dataset from the chosen distribution and needs to decide whether is realizable or not. We show that no learner can answer this question better than a random guess when goes to infinity.
In all distributions, the marginal of is always uniform, and it remains to specify . For the realizable case, is distributed as a Bernoulli random variable independent of the value of . It will be convenient to refer to a data distribution by its Bayes-optimal regressor, so this distribution is labeled .
For the remaining distributions, the label is always a deterministic and binary function of , and there are in total such functions. When the adversary chooses a distribution from this family, it always draws uniformly at random, and we refer to the drawn function (and distribution) . Note that regardless of which function is drawn, always violates realizability by a large constant margin:
The adversary chooses with probability, and with probability. Since the learner only receives a finite sample, as long as there is no collision in , there is no way to distinguish between and . This is because can be drawn in two steps: we first draw all the i.i.d. from Unif, and this step does not reveal any information about the identity of the distribution. The second step generates conditioned on . Assuming no collision in , it is easy to verify that the joint distribution over is i.i.d. Bernoulli for both and . Furthermore, fixing the sample size, the collision probability goes to as increases, and the learner cannot do better than a random guess.
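The last step of the argument is a birthday-problem computation. As a minimal sketch, the helper below (a hypothetical name, not from the paper) evaluates the exact probability that a fixed number of i.i.d. uniform draws from a finite feature space contains a repeated feature, and shows how it shrinks as the feature space grows.

```python
def collision_probability(sample_size, num_features):
    """Probability that i.i.d. uniform draws from a finite set contain a repeat.

    Uses the exact birthday-problem product: the i-th draw (0-indexed) avoids
    the previous i distinct values with probability 1 - i / num_features.
    """
    no_collision = 1.0
    for i in range(sample_size):
        no_collision *= 1.0 - i / num_features
    return 1.0 - no_collision


# Fixed sample size, growing feature space.
for num_features in (10**3, 10**4, 10**6, 10**8):
    print(num_features, collision_probability(100, num_features))
```

With the sample size held fixed, the printed probabilities decrease toward zero, so for a large enough feature space the learner almost never sees the same feature twice.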
Note that this hardness result does not apply when the learner has access to the resetting operations discussed in Section 3, as the learner can draw multiple ’s from the same to verify if is stochastic () or deterministic () and succeed with high probability.
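A resetting-based check of this kind can be sketched as follows. The two label samplers are hypothetical stand-ins for the two families of distributions: the Bernoulli parameter of 1/2 is an assumption (the value is not recoverable from the text above), and parity is an arbitrary choice of deterministic binary function.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible


def bernoulli_label(x):
    """Realizable case (assumed parameter 1/2): the label ignores x."""
    return int(rng.random() < 0.5)


def deterministic_label(x):
    """Non-realizable case: a fixed binary function of x (parity, arbitrary)."""
    return x % 2


def looks_stochastic(sample_label, x, num_repeats=30):
    """Reset to the same x repeatedly; any disagreement reveals randomness."""
    labels = {sample_label(x) for _ in range(num_repeats)}
    return len(labels) > 1


print(looks_stochastic(bernoulli_label, x=3))
print(looks_stochastic(deterministic_label, x=3))
```

In this sketch the test never misclassifies a deterministic label, and it misses a Bernoulli(1/2) label only when all repeated draws happen to agree, which has probability 2 to the power of (1 - `num_repeats`).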
Appendix D Necessity of the Squared-loss Decomposition Condition
Here we provide an example showing the necessity of Eq.(5) in Assumption 4. In particular, if Eq.(5) is completely removed, the algorithm may fail to learn a valid value function in the limit of infinite data even when contains one.
Consider a simple contextual bandit problem with two contexts (states), and , and . The problem is uncontrolled (i.e., there is only one action and one policy), and both states yield deterministic reward . Let , where , , and . By Assumption 4 (with Eq.(5) removed), is a valid reward function while is not. However, and , and the regression algorithm will pick with accurate estimation of the losses.
Further Comments. In the above example, we note that there is nothing wrong with calling a valid reward function (though it is counterintuitive). In fact, if this bandit problem is a part of a larger MDP (say it appears at the end of an episodic task, and is the only possible distribution that can be induced over and ), then may well be part of an optimal value function, and a near-optimal policy can be learned via active exploration using the OLIVE algorithm [Jiang et al., 2017]. (Footnote 15: Formally, does not violate the validity condition in their Definition 3.) The reason that should not be considered a valid reward function in the context of Section 4 lies in the batch learning setting and the squared-loss regression algorithm. So Eq.(5) is a condition that is specific to the setting and the algorithm, and not inherent in our boundary-invariant definition of reward/value-functions.
Appendix E Proofs of Sections 4 and 5
Proof of Theorem 2.
Proof of Lemma 4.
Let be a shorthand for . Also recall that is short for , and for . The first step is to show that
(9) 
To prove this, we start with Eq.(7):
Now we have
(10) 
and by symmetry
(11) 
We are ready to prove Eq.(9): its RHS is
The first term is nonnegative, the second term is the LHS of Eq.(9), and the remaining two terms are according to Eqs.(10) and (11). So Eq.(9) holds.
Now from the RHS of Eq.(9):
∎ 
Proof of Theorem 3.
Since for any by our definition, we can apply repeatedly to a function. Indeed, picking any , we show that for large enough , for any admissible , so will satisfy the definition of . This is because
(Lemma 4)  
where is some admissible distribution. (Its detailed form is not important, but the reader can infer from the derivation above.) Given the boundedness of , becomes arbitrarily close to for all uniformly as increases. Now for each , define . Since is finite, there exists a minimum nonzero value for , so with large enough , will be smaller than such a minimum value and must be . ∎
Comment
In the proof of Theorem 3 we used the fact that is finite to show that . This is the only place in this paper where we need the finiteness of . Even if is continuous, we can still use a large enough to upper bound with an arbitrarily small number, which reduces the elegance of the theorem statements but otherwise has no impact on our results.
Proof of Theorem 5.
We first show that is a value-function of on any admissible . The easiest way to prove this is to introduce the classical (boundary-dependent) notion of as a bridge. Note that we always have
So it suffices to show that , . We prove this using (a slight variant of) the value difference decomposition lemma [Jiang et al., 2017, Lemma 1]:
Here with a slight abuse of notation we use to denote the distribution over induced by , . For each term on the RHS,
The second term is because by the definition of (Assumption 8): is a reward function for random reward under any admissible distribution, including . The first term is also because
(Theorem 3) 
Now
(see e.g., Kakade and Langford [2002, Lemma 6.1])  
()  