Kinematic State Abstraction and Provably Efficient Rich-Observation Reinforcement Learning


Dipendra Misra Mikael Henaff Akshay Krishnamurthy John Langford
Abstract

We present an algorithm, HOMER, for exploration and reinforcement learning in rich observation environments that are summarizable by an unknown latent state space. The algorithm interleaves representation learning to identify a new notion of kinematic state abstraction with strategic exploration to reach new states using the learned abstraction. The algorithm provably explores the environment with sample complexity scaling polynomially in the number of latent states and the time horizon, and, crucially, with no dependence on the size of the observation space, which could be infinitely large. This exploration guarantee further enables sample-efficient global policy optimization for any reward function. On the computational side, we show that the algorithm can be implemented efficiently whenever certain supervised learning problems are tractable. Empirically, we evaluate HOMER on a challenging exploration problem, where we show that the algorithm is exponentially more sample efficient than standard reinforcement learning baselines.


Microsoft Research, New York, NY. {dimisra, mihenaff, akshaykr, jcl}@microsoft.com

1 Introduction

Modern reinforcement learning applications call for agents that operate directly from rich sensory information such as megapixel camera images. This rich information enables representation of detailed, high-quality policies and obviates the need for hand-engineered features. However, exploration in such settings is notoriously difficult and, in fact, statistically intractable in general (Jaksch et al., 2010; Lattimore and Hutter, 2012; Krishnamurthy et al., 2016). Despite this, many environments are highly structured and do admit sample efficient algorithms (Jiang et al., 2017); indeed, we may be able to summarize the environment with a simple state space and extract these states from raw observations. With such structure, we can leverage techniques from the well-studied tabular setting to explore the environment (Hazan et al., 2018), efficiently recover the underlying dynamics (Strehl and Littman, 2008), and optimize any reward function (Kearns and Singh, 2002; Brafman and Tennenholtz, 2002; Strehl et al., 2006; Dann et al., 2017; Azar et al., 2017; Jin et al., 2018). But can we learn to decode a simpler state from raw observations alone?

The main difficulty is that learning a state decoder, or a compact representation, is intrinsically coupled with exploration. On one hand, we cannot learn a high-quality decoder without gathering comprehensive information from the environment, which may require a sophisticated exploration strategy. On the other hand, we cannot tractably explore the environment without an accurate decoder. These interlocking problems constitute a central challenge in reinforcement learning, and a provably effective solution remains elusive despite decades of research (McCallum, 1996; Ravindran, 2004; Jong and Stone, 2005; Li et al., 2006; Bellemare et al., 2016; Nachum et al., 2019); see \prefsec:related for a discussion of related work.

In this paper, we provide a solution for a significant sub-class of problems known as Block Markov Decision Processes (MDPs) Du et al. (2019), in which the agent operates directly on rich observations that are generated from a small number of unobserved latent states. Our algorithm, HOMER, learns a new reward-free state abstraction called kinematic inseparability, which it uses to drive exploration of the environment. Informally, kinematic inseparability aggregates observations that have the same forward and backward dynamics. Shared backward dynamics crucially implies that a single policy simultaneously maximizes the probability of observing a set of kinematically inseparable observations, which is useful for exploration. Shared forward dynamics is naturally useful for recovering the latent state space and model. Perhaps most importantly, we show that a kinematic inseparability abstraction can be recovered from a bottleneck in a regressor trained on a contrastive estimation problem derived from raw observations.

HOMER performs strategic exploration by training policies to visit each kinematically inseparable abstract state, resulting in a policy cover. These policies are constructed via a reduction to contextual bandits (Bagnell et al., 2004), using a dynamic-programming approach and a synthetic reward function that incentivizes reaching an abstract state. Crucially, HOMER interleaves learning the state abstraction and policy cover in an inductive manner: we use the policies learned from a coarse abstraction to reach new states, which enables us to refine the state abstraction and learn new policies (see \preffig:interleaving-process for a schematic). Each process is essential to the other. Once the policy cover is constructed, it can be used to efficiently gather the information necessary to find a near-optimal policy for any reward function.

Figure 1: HOMER learns a set of exploration policies and a state abstraction function by iterating between exploring using the current state abstraction and refining the state abstraction based on the new experience.

We analyze the statistical and computational properties of HOMER in episodic Block MDPs. We prove that HOMER learns to visit every latent state and also learns a near-optimal policy for any given reward function with a number of trajectories that is polynomial in the number of latent states, actions, horizon, and the complexity of two function classes used by the algorithm. There is no explicit dependence on the observation space size. The main assumptions are that the latent states are reachable and that the function classes are sufficiently expressive. There are no identifiability or determinism assumptions beyond decodability of the Block MDP, resulting in significantly greater scope than prior work (Du et al., 2019; Dann et al., 2018). On the computational side, HOMER operates in a reductions model and can be implemented efficiently whenever certain supervised learning problems are tractable.

Empirically, we evaluate HOMER on a challenging RL problem with high-dimensional observations, precarious dynamics, and sparse, misleading rewards. The problem is googol-sparse: the probability of encountering an optimal reward through random search is on the order of one in a googol ($10^{-100}$). HOMER recovers the underlying state abstraction for this problem and consistently finds a near-optimal policy, outperforming popular RL baselines that use naive exploration strategies (Mnih et al., 2016; Schulman et al., 2017) or more sophisticated exploration bonuses (Burda et al., 2019), as well as the recent PAC-RL algorithm of Du et al. (2019).

2 Preliminaries

We consider reinforcement learning (RL) in episodic Block Markov Decision Processes (Block MDPs), first introduced by Du et al. (2019). A Block MDP is described by a large (possibly infinite) observation space $\mathcal{X}$, a finite unobservable state space $\mathcal{S}$, a finite set of actions $\mathcal{A}$, and a time horizon $H$. The process has a starting state distribution $\mu \in \Delta(\mathcal{S})$ (Du et al. (2019) assume the starting state is deterministic, which we generalize here), a transition function $T : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$, an emission function $q : \mathcal{S} \to \Delta(\mathcal{X})$, and a reward function $R : \mathcal{X} \times \mathcal{A} \to [0,1]$. The agent interacts with the environment by repeatedly generating $H$-step trajectories $(x_1, a_1, r_1, \ldots, x_H, a_H, r_H)$ where $s_1 \sim \mu$, $x_h \sim q(\cdot \mid s_h)$, $r_h = R(x_h, a_h)$, and $s_{h+1} \sim T(\cdot \mid s_h, a_h)$ for all $h \in [H]$, and all actions are chosen by the agent. We assume that $\sum_{h=1}^{H} r_h \in [0,1]$ for any trajectory. The agent does not observe the states $s_{1:H}$. As notation, we often denote sequences using the ":" operator, e.g., $x_{1:h} = (x_1, \ldots, x_h)$.

Without loss of generality, we partition $\mathcal{S}$ into disjoint subsets $\mathcal{S}_1, \ldots, \mathcal{S}_H$, where $\mathcal{S}_h$ contains exactly the states reachable at time step $h$. We similarly partition $\mathcal{X}$ based on time step into $\mathcal{X}_1, \ldots, \mathcal{X}_H$. Formally, $\mathcal{S} = \cup_h \mathcal{S}_h$ and $\mathcal{S}_h \cap \mathcal{S}_{h'} = \emptyset$ when $h \neq h'$, and likewise for $\mathcal{X}$. This partitioning may be internal to the agent, as we can simply concatenate the time step to the states and observations. Let $\tau : \mathcal{X} \to [H]$ be the time step function, associating an observation to the time point where it is reachable.

A policy $\pi : \mathcal{X} \to \Delta(\mathcal{A})$ chooses actions on the basis of observations and defines a distribution over trajectories. We use $\mathbb{E}_\pi[\cdot]$ and $\mathbb{P}_\pi[\cdot]$ to denote expectation and probability with respect to this distribution. The goal of the agent is to find a policy that achieves high expected reward. We define the value function and policy value as:
$$V(x; \pi) := \mathbb{E}_\pi\Big[\sum_{h = \tau(x)}^{H} r_h \,\Big|\, x_{\tau(x)} = x\Big], \qquad V(\pi) := \mathbb{E}_\pi\Big[\sum_{h=1}^{H} r_h\Big].$$

As the observation space is extremely large, we consider a function approximation setting, where the agent has access to a policy class $\Pi \subseteq \{\mathcal{X} \to \mathcal{A}\}$. We further define the class of non-stationary policies $\Pi_{\mathrm{NS}} := \Pi^H$ to enable the agent to use a different policy for each time step: a policy $\pi = (\pi_1, \ldots, \pi_H) \in \Pi_{\mathrm{NS}}$ takes action $a_h$ according to $\pi_h(x_h)$. (We will often consider an $h$-step non-stationary policy $\pi_{1:h} \in \Pi^h$ when we only plan to execute this policy for $h$ steps.) The optimal policy in this class is $\pi^\star := \arg\max_{\pi \in \Pi_{\mathrm{NS}}} V(\pi)$, and our goal is to find a policy with value close to the optimal value, $V(\pi^\star)$.

Environment assumptions.

The key difference between Block MDPs and general Partially-Observed MDPs is a disjointness assumption, which removes partial observability effects and enables tractable learning.

Assumption 1.

The emission distributions for any two states $s \neq s'$ are disjoint, that is, $\mathrm{supp}(q(\cdot \mid s)) \cap \mathrm{supp}(q(\cdot \mid s')) = \emptyset$ whenever $s \neq s'$.

This disjointness assumption is used by Du et al. (2019), who argue that it is a natural fit for visual grid-world scenarios such as in \preffig:mdp-example-right, which are common in empirical RL research. The name "Block MDP" arises since each hidden state $s$ emits observations from a disjoint block $\mathcal{X}_s \subseteq \mathcal{X}$. The assumption allows us to define an inverse mapping $g^\star : \mathcal{X} \to \mathcal{S}$ such that for each $s \in \mathcal{S}$ and $x \in \mathrm{supp}(q(\cdot \mid s))$, we have $g^\star(x) = s$. The agent does not have access to $g^\star$.
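To make the Block MDP structure and the disjointness property concrete, here is a minimal Python sketch of a toy environment in which each latent state emits noisy observations on its own disjoint block of coordinates, so the inverse mapping $g^\star$ is well defined. The class name, dimensions, and emission scheme are illustrative assumptions (and rewards are omitted), not the paper's construction.

```python
import numpy as np

class ToyBlockMDP:
    """Minimal Block MDP sketch: a few latent states whose emissions live on disjoint
    coordinate blocks, so the decoder g* (unknown to the agent) is well defined."""

    def __init__(self, n_states=3, n_actions=2, obs_dim=9, seed=0):
        self.rng = np.random.default_rng(seed)
        self.S, self.A, self.obs_dim = n_states, n_actions, obs_dim
        self.actions = list(range(n_actions))
        # Random latent transition tensor: T[s, a] is a distribution over next states.
        self.T = self.rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

    def _emit(self, s):
        # Each state owns a disjoint block of coordinates; noise stays inside the block,
        # so emission supports of different states never overlap (Assumption 1).
        x = np.zeros(self.obs_dim)
        x[3 * s: 3 * s + 3] = 1.0 + 0.1 * self.rng.standard_normal(3)
        return x

    def decode(self, x):
        # The inverse mapping g*: the nonzero block identifies the latent state.
        return int(np.argmax([np.abs(x[3 * s: 3 * s + 3]).sum() for s in range(self.S)]))

    def reset(self):
        self.s = 0
        return self._emit(self.s)

    def step(self, a):
        self.s = self.rng.choice(self.S, p=self.T[self.s, a])
        return self._emit(self.s)
```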

Apart from disjointness, the main environment assumption is that all states are reachable with reasonable probability. To formalize this, we define a maximum visitation probability and a reachability parameter:
$$\eta(s) := \max_{\pi \in \Pi_{\mathrm{NS}}} \mathbb{P}_\pi(s), \qquad \eta_{\min} := \min_{s \in \mathcal{S}} \eta(s).$$

Here $\mathbb{P}_\pi(s)$ is the probability of visiting $s$ along the trajectory taken by $\pi$. As in Du et al. (2019), our sample complexity scales polynomially with $1/\eta_{\min}$, so this quantity should be reasonably large. In contrast with prior work (Du et al., 2019; Dann et al., 2018), we do not require any further identifiability or determinism assumptions on the environment.

We call the policies that visit a particular state with maximum probability homing policies.

Definition (Homing Policy). The homing policy for an observation $x$ is $\pi_x := \arg\max_{\pi \in \Pi_{\mathrm{NS}}} \mathbb{P}_\pi(x)$. Similarly, the homing policy for a state $s$ is $\pi_s := \arg\max_{\pi \in \Pi_{\mathrm{NS}}} \mathbb{P}_\pi(s)$.

In \prefapp:appendix-homing, we prove some interesting properties of these policies. One key property is their non-compositional nature: we cannot, in general, extend homing policies for states in $\mathcal{S}_h$ to obtain homing policies for states in $\mathcal{S}_{h+1}$. For example, in the Block MDP in \preffig:mdp-example-left, the homing policy for one of the states at the final time step must take an action at the first time step that the homing policies for the other states do not take. Non-compositionality implies that we must take a global policy optimization approach for learning homing policies, which we do in the sequel.

Reward-free learning.

In addition to reward-sensitive learning, where the goal is to identify a policy with near-optimal value $V(\pi^\star)$, we also consider a reward-free objective. In this setting, the goal is to find a small set of policies that can be used to visit the entire state space. If we had access to the set of homing policies for every state, then this set would suffice. However, in practice we can only hope to learn an approximation. We capture this idea with an $\alpha$-policy cover.

Definition (Policy Cover). A finite set of non-stationary policies $\Psi \subseteq \Pi_{\mathrm{NS}}$ is called an $\alpha$-policy cover if for every state $s \in \mathcal{S}$ we have $\max_{\pi \in \Psi} \mathbb{P}_\pi(s) \geq \alpha\, \eta(s)$.

Intuitively, we hope to find a policy cover of size $O(|\mathcal{S}|)$. By executing each policy in turn, we can collect a dataset of observations and rewards from all states, at which point it is relatively straightforward to maximize any reward (Kakade and Langford, 2002; Munos, 2003; Bagnell et al., 2004; Munos and Szepesvári, 2008; Antos et al., 2008; Farahmand et al., 2010; Chen and Jiang, 2019; Agarwal et al., 2019). Thus, constructing a policy cover can be viewed as an intermediate objective that facilitates reward-sensitive learning.

Figure 2: Two example Block MDPs. Left: An example from the Minecraft domain (Johnson et al., 2016). The agent's observation is given by the history of all observed raw images. The grid on the left shows the latent state space structure. Right: The agent starts deterministically in a fixed state and can take three different actions. Dashed lines denote stochastic transitions while solid lines are deterministic. The numbers on each dashed arrow depict the transition probabilities. We do not show observations for every state for brevity.

Function classes.

In the Block MDP setting we are considering, the agent may never see the same observation twice, so it must use function approximation to generalize across observations. Our algorithm uses two function classes. The first is simply the policy class $\Pi$, which was used above to define the optimal value and the maximum visitation probabilities. For a simpler analysis, we assume $\Pi$ is finite and measure statistical complexity via $\ln|\Pi|$. However, our results only involve standard uniform convergence arguments, so extensions to infinite classes with other statistical complexity notions are straightforward. We emphasize that $\Pi$ is typically not fully expressive.

We also use a family of regression functions $\mathcal{F}_N : \mathcal{X} \times \mathcal{A} \times \mathcal{X} \to [0,1]$ with a specific form. To define $\mathcal{F}_N$, first define an abstraction class $\Phi_N \subseteq \{\mathcal{X} \to [N]\}$, which maps observations into $N$ discrete abstract states. Second, define $\mathcal{W}_N := \{[N] \times \mathcal{A} \times [N] \to [0,1]\}$ as another "tabular" regressor class which consists of all functions of the specified type. Then, we set $\mathcal{F}_N := \{(x, a, x') \mapsto w(\phi(x), a, \phi'(x')) : w \in \mathcal{W}_N,\ \phi, \phi' \in \Phi_N\}$. For the analysis, we assume that $\Phi_N$ is finite and our bounds scale with $\ln|\Phi_N|$, which allows us to search over an exponentially large space of abstraction functions. As $\mathcal{W}_N$ is all functions over a discrete domain, it has a pointwise entropy growth rate that is polynomial in $N$ and $|\mathcal{A}|$ (see \prefapp:supporting for a formal definition), and these two considerations determine the complexity of the regression class $\mathcal{F}_N$. As above, we remark that our results use standard uniform convergence arguments, so it is straightforward to extend to other notions.
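The following sketch shows one member of $\mathcal{F}_N$ under the bottleneck form described above: observations are quantized by abstraction functions into $N$ abstract states, and a tabular head assigns a value in $[0,1]$ to every (abstract state, action, abstract state) triple. The class name and interface are our own illustrative choices.

```python
import numpy as np

class BottleneckRegressor:
    """Sketch of one member of F_N: f(x, a, x2) = w(phi(x), a, phi2(x2)), where the
    abstraction functions quantize observations into N abstract states and w is tabular."""

    def __init__(self, phi, phi2, n_abstract, n_actions):
        self.phi, self.phi2 = phi, phi2  # members of Phi_N: observation -> {0, ..., N-1}
        # Tabular head in [0, 1]; initialized to an uninformative constant.
        self.w = np.full((n_abstract, n_actions, n_abstract), 0.5)

    def __call__(self, x, a, x2):
        # Evaluate f(x, a, x2) by looking up the quantized triple in the table.
        return self.w[self.phi(x), a, self.phi2(x2)]
```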

Computational oracles.

As we are working with large function classes, we consider an oracle model of computation where we assume that these classes support natural optimization primitives. This “reductions” approach abstracts away computational issues and addresses the desired situation where these classes are efficiently searchable. Note that the oracle model provides no statistical benefit as the oracles can always be implemented via enumeration; the model simply serves to guide the design of practical algorithms.

Specifically, for the policy class $\Pi$, we assume access to an offline contextual bandit optimization routine:
$$\mathrm{CB}(D) := \arg\max_{\pi \in \Pi} \sum_{(x, a, p, r) \in D} \frac{r \cdot \mathbf{1}\{\pi(x) = a\}}{p}.$$

This is a one-step importance-weighted reward maximization problem, which takes as input a dataset $D$ of $(x, a, p, r)$ quads, where $x \in \mathcal{X}$, $a \in \mathcal{A}$, and $r$ is the reward for the action $a$, which was chosen with probability $p$. This optimization arises in contextual bandit settings (Strehl et al., 2010; Bottou et al., 2013; Swaminathan and Joachims, 2015), and is routinely implemented via a further reduction to cost-sensitive classification (Agarwal et al., 2014). (We call this a contextual bandit oracle rather than a cost-sensitive classification oracle because the dataset is specified in contextual bandit format even though the oracle formally solves a cost-sensitive classification problem. The advantage is that in practice, we can leverage statistical improvements developed for contextual bandits, such as doubly-robust estimators (Dudík et al., 2014).)

For the regression class $\mathcal{F}_N$, we assume that we can solve square loss minimization problems:
$$\mathrm{REG}(D, \mathcal{F}_N) := \arg\min_{f \in \mathcal{F}_N} \sum_{(x, a, x', y) \in D} \big(f(x, a, x') - y\big)^2.$$

Here, the dataset $D$ consists of $(x, a, x', y)$ quads where $x, x' \in \mathcal{X}$, $a \in \mathcal{A}$, and $y \in \{0, 1\}$ is a binary label. Square loss minimization is a standard optimization problem arising in supervised learning, but note that our function class $\mathcal{F}_N$ is somewhat non-standard. In particular, even though square loss regression is computationally tractable for convex classes, our class is nonconvex as it involves quantization. On the other hand, latent categorical models are widely used in practice (Jang et al., 2016; Hu et al., 2017), which suggests that these optimization problems are empirically tractable.

We emphasize that these oracle assumptions are purely computational and simply guide the algorithm design. In our experiments, we instantiate both and with neural networks, so both oracles solve nonconvex problems. This nonconvexity does not hinder the empirical effectiveness of the algorithm.

For running time calculations, we assume that a single call to $\mathrm{CB}$ or $\mathrm{REG}$ with $n$ examples can be solved in $\mathrm{Time}_{\mathrm{CB}}(n)$ and $\mathrm{Time}_{\mathrm{REG}}(n)$ time, respectively.
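Since the oracles can always be implemented by enumeration, the sketch below spells out both primitives in that naive form for finite classes; the function names cb_oracle and reg_oracle are ours, and the dataset formats follow the quad descriptions above.

```python
def cb_oracle(policy_class, dataset):
    """Offline contextual bandit oracle: importance-weighted reward maximization.
    dataset: list of (x, a, p, r) quads, where action a was chosen with probability p."""
    def iw_value(pi):
        return sum(r / p for (x, a, p, r) in dataset if pi(x) == a)
    return max(policy_class, key=iw_value)

def reg_oracle(regressor_class, dataset):
    """Square-loss regression oracle over the bottleneck class F_N.
    dataset: list of (x, a, x2, y) quads with binary labels y."""
    def sq_loss(f):
        return sum((f(x, a, x2) - y) ** 2 for (x, a, x2, y) in dataset)
    return min(regressor_class, key=sq_loss)
```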

3 Kinematic Inseparability State Abstraction

The foundational concept for our approach is a new form of state abstraction, called kinematic inseparability. This abstraction has two key properties. First, it can be learned via a reduction to supervised learning, which we will discuss in detail in \prefsec:learning-alg. Second, it enables reward-free exploration of the environment, studied in \prefsec:oracle-kinematic-insep. In this section, we present the key definitions, some interesting properties, and some intuition.

For exploration, a coarser state abstraction, called backward kinematic inseparability, is sufficient.

Definition (Backward Kinematic Inseparability). Two observations $x'_1, x'_2 \in \mathcal{X}$ are backward kinematically inseparable (KI) if for all distributions $u \in \Delta(\mathcal{X} \times \mathcal{A})$ supported on $\mathcal{X} \times \mathcal{A}$ and all $(x, a) \in \mathcal{X} \times \mathcal{A}$ we have
$$\mathbb{P}_u(x, a \mid x'_1) = \mathbb{P}_u(x, a \mid x'_2), \quad \text{where} \quad \mathbb{P}_u(x, a \mid x') := \frac{T(x' \mid x, a)\, u(x, a)}{\sum_{\tilde{x}, \tilde{a}} T(x' \mid \tilde{x}, \tilde{a})\, u(\tilde{x}, \tilde{a})}.$$

Here $\mathbb{P}_u(x, a \mid x')$ is the backward dynamics, measuring the probability that the previous observation and action were $(x, a)$ given that the current observation is $x'$ and the prior over $(x, a)$ is $u$, and we overload $T(x' \mid x, a)$ to denote the probability of observing $x'$ after taking action $a$ from observation $x$.

The significance of backward KI is evident from the following lemma.

Lemma. If $x'_1, x'_2$ are backward kinematically inseparable, then for all policies $\pi_1, \pi_2 \in \Pi_{\mathrm{NS}}$ we have $\mathbb{P}_{\pi_1}(x'_1) \geq \mathbb{P}_{\pi_2}(x'_1)$ if and only if $\mathbb{P}_{\pi_1}(x'_2) \geq \mathbb{P}_{\pi_2}(x'_2)$.

The proof of this lemma and of all mathematical statements in this paper are deferred to the appendices. At a high level, the lemma shows that backward KI observations induce the same ordering over policies with respect to visitation probability. This property is useful for exploration, since a policy that maximizes the probability of visiting an abstract state also simultaneously maximizes the probability of visiting each individual observation in that abstract state. Thus, if we train a policy to visit each backward KI abstract state (which we can do in an inductive manner with synthetic reward functions, as we will see in the next subsection), we guarantee that all observations are visited with high probability, so we have a policy cover. In this way, the lemma helps establish that a backward KI abstraction enables sample-efficient reward-free exploration.

While backward KI is sufficient for exploration, it ignores the forward dynamics, which are useful for learning a model. This motivates the definition of forward kinematic inseparability.

Definition (Forward Kinematic Inseparability). Two observations $x_1, x_2 \in \mathcal{X}$ are forward kinematically inseparable (KI) if for every $x' \in \mathcal{X}$ and $a \in \mathcal{A}$ we have
$$T(x' \mid x_1, a) = T(x' \mid x_2, a).$$

Finally, observations are kinematically inseparable if they satisfy both of these definitions.

Definition (Kinematic Inseparability). Two observations $x_1, x_2 \in \mathcal{X}$ are kinematically inseparable if for every distribution $u \in \Delta(\mathcal{X} \times \mathcal{A})$ with support over $\mathcal{X} \times \mathcal{A}$, and for every $(\tilde{x}, \tilde{a}) \in \mathcal{X} \times \mathcal{A}$ and $(x', a) \in \mathcal{X} \times \mathcal{A}$, we have
$$\mathbb{P}_u(\tilde{x}, \tilde{a} \mid x_1) = \mathbb{P}_u(\tilde{x}, \tilde{a} \mid x_2) \quad \text{and} \quad T(x' \mid x_1, a) = T(x' \mid x_2, a).$$

It is straightforward to verify that all three of these notions are equivalence relations, and hence they partition the observation space. The backward kinematic inseparability dimension, denoted $N_{\mathrm{B}}$, is the size of the coarsest partition generated by the backward KI equivalence relation, with $N_{\mathrm{F}}$ and $N_{\mathrm{KI}}$ defined similarly for the forward KI and KI relations. We also use mappings $\phi_{\mathrm{B}}, \phi_{\mathrm{F}}, \phi_{\mathrm{KI}} : \mathcal{X} \to \mathbb{N}$ to denote these abstractions; for example, $\phi_{\mathrm{B}}(x_1) = \phi_{\mathrm{B}}(x_2)$ if and only if $x_1$ and $x_2$ are backward KI.
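For intuition, in the tabular case forward KI asks that two observations have identical next-observation distributions, while backward KI is equivalent (under full-support priors) to their incoming transition vectors being proportional. The sketch below groups the observations of a small tabular model accordingly; the tolerance and the proportionality test are our own illustrative choices.

```python
import numpy as np

def ki_classes(T, tol=1e-9):
    """Group observations of a tabular model into forward-KI, backward-KI, and KI classes.
    T[x, a, x2] = probability of observing x2 after taking action a from observation x."""
    n_obs = T.shape[0]

    def same_forward(i, j):
        # Forward KI: identical next-observation distributions for every action.
        return np.allclose(T[i], T[j], atol=tol)

    def same_backward(i, j):
        # Backward KI (for full-support priors): the incoming probability vectors
        # T[:, :, i] and T[:, :, j] are proportional to each other.
        vi, vj = T[:, :, i].ravel(), T[:, :, j].ravel()
        si, sj = vi.sum(), vj.sum()
        if si < tol or sj < tol:
            return si < tol and sj < tol
        return np.allclose(vi / si, vj / sj, atol=tol)

    def partition(same):
        # Greedily assign each observation to the first class whose representative matches.
        classes = []
        for i in range(n_obs):
            for c in classes:
                if same(i, c[0]):
                    c.append(i)
                    break
            else:
                classes.append([i])
        return classes

    fwd = partition(same_forward)
    bwd = partition(same_backward)
    both = partition(lambda i, j: same_forward(i, j) and same_backward(i, j))
    return fwd, bwd, both
```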

In \prefapp:appendix-kinematic-inseparability, we collect and prove several useful properties of these state abstractions. Importantly, we show that for Block MDPs, observations emitted from the same state are kinematically inseparable and, hence, $N_{\mathrm{KI}} \leq |\mathcal{S}|$. Ideally, we would like $N_{\mathrm{KI}} = |\mathcal{S}|$ so that the abstract states are in correspondence with the real states of the environment, and we could recover the true model by learning the dynamics between abstract states. However, we may have $N_{\mathrm{KI}} < |\mathcal{S}|$, but only in cases where the true state space is unidentifiable from observations. \preffig:mdp-canonical depicts such an example. From the left panel, if we split one of the states into two states with the same forward dynamics and proportional backward dynamics, we obtain the structure in the right panel. Note that these two Block MDPs are indistinguishable from observations, so we say that the simpler one is the canonical form.

Definition (Canonical Form). A Block MDP is in canonical form if for all $x_1, x_2 \in \mathcal{X}$: $g^\star(x_1) = g^\star(x_2)$ if and only if $x_1$ and $x_2$ are kinematically inseparable.

Figure 3: Left: A Block MDP with 3 states and 1 action (observations are not depicted). Right: We take the MDP on the left and split one of its states into two, where the first copy contains the observations that are emitted 80% of the time from the original state and the second copy contains the rest. There is no way to distinguish between these two MDPs, and we call the left MDP the canonical form.

Note that canonical form is simply a way to identify the notion of state for a Block MDP. It does not restrict this class of environments whatsoever.

3.1 Exploration with an Oracle Kinematic Inseparability Abstraction

We now show that backward KI enables reward-free strategic exploration and reward-sensitive learning, by developing an algorithm, ExpOracle, that assumes oracle access to a backward KI abstraction function $\phi_{\mathrm{B}}$. The pseudocode is displayed in \prefalg:oracle-exp.

ExpOracle takes as input a policy class $\Pi$, a backward KI abstraction $\phi_{\mathrm{B}} : \mathcal{X} \to [N]$, and three hyperparameters $(\eta, \epsilon, \delta)$. The hyperparameter $\eta$ is an estimate of the reachability parameter (we want $\eta \leq \eta_{\min}$), while $\epsilon, \delta$ are the standard PAC parameters. The algorithm operates in two phases: a reward-free phase in which it learns a policy cover (\prefalg:oracle-exp, \prefline:oracle-exp-loop-start-\prefline:oracle-exp-loop-end) and a reward-sensitive phase in which it learns a near-optimal policy for the given reward function (\prefalg:oracle-exp, \prefline:oracle-reward-sensitive). In the reward-free phase, the algorithm proceeds inductively, using policy covers for time steps $1, \ldots, h-1$ to learn a policy cover for time step $h$. In the $h$-th iteration, we define internal reward functions $R_1, \ldots, R_N$ corresponding to each output of $\phi_{\mathrm{B}}$. The reward function $R_i$ gives a reward of $1$ if the agent observes $x$ at time step $h$ satisfying $\phi_{\mathrm{B}}(x) = i$, and $0$ otherwise. The internal reward functions incentivize the agent to reach different backward KI abstract states.

1: Set $\Psi_1 \leftarrow \emptyset$, $N \leftarrow |\mathrm{range}(\phi_{\mathrm{B}})|$, and the PSDP sample size $n_{\mathrm{psdp}}$
2: for $h = 2, \ldots, H$ do
3:     for $i = 1$ to $N$ do
4:         Define $R_i(x, a) := \mathbf{1}\{\phi_{\mathrm{B}}(x) = i \text{ and } \tau(x) = h\}$  // Define internal reward functions
5:         $\pi_{h,i} \leftarrow \mathrm{PSDP}(\Psi_{1:h-1}, R_i, h-1, \Pi, n_{\mathrm{psdp}})$  // Learn an exploration policy using PSDP
6:     $\Psi_h \leftarrow \{\pi_{h,1}, \ldots, \pi_{h,N}\}$
7: $\hat{\pi} \leftarrow \mathrm{PSDP}(\Psi_{1:H}, R, H, \Pi, n_{\mathrm{psdp}})$  // Reward Sensitive Learning
8: return $\Psi_{1:H}$, $\hat{\pi}$
Algorithm 1 ExpOracle. Reinforcement learning in a Block MDP with oracle access to a backward KI abstraction.
1: for $t = h, h-1, \ldots, 1$ do
2:     $D \leftarrow \emptyset$
3:     for $n$ times do
4:         Collect $(x, a, p, r)$ via the sampling procedure (see text): roll in to time step $t$ with a uniform policy from $\Psi_t$, take $a \sim \mathrm{Unif}(\mathcal{A})$ with $p = 1/|\mathcal{A}|$, execute $\hat{\pi}_{t+1:h}$, and record the reward $r$ under $R$
5:         $D \leftarrow D \cup \{(x, a, p, r)\}$
6:     $\hat{\pi}_t \leftarrow \mathrm{CB}(D)$  // Solve a contextual bandit problem given dataset $D$
7: return $\hat{\pi}_{1:h}$
Algorithm 2 PSDP$(\Psi_{1:h}, R, h, \Pi, n)$. Optimizing reward function $R$ given policy covers $\Psi_{1:h}$.

We find a policy that optimizes each internal reward function using the subroutine PSDP, displayed in \prefalg:rl_via_homing_policies. This subroutine is based on Policy Search by Dynamic Programming (Bagnell et al., 2004), which, using an exploratory data-collection policy, optimizes a reward function by solving a sequence of contextual bandit problems (Langford and Zhang, 2008) in a dynamic-programming fashion. In our case, we use the policy covers for time steps $1, \ldots, h-1$ to construct the exploratory policy (\prefalg:rl_via_homing_policies, \prefline:psdp-sampling-proc).

Formally, at time step $t$ of PSDP, we solve the following optimization problem
$$\hat{\pi}_t := \arg\max_{\pi \in \Pi}\ \mathbb{E}_{x \sim D_t,\ a = \pi(x)}\big[V(x, a; \hat{\pi}_{t+1:h}, R)\big],$$
where $V(x, a; \hat{\pi}_{t+1:h}, R)$ denotes the expected reward under $R$ accumulated from time step $t$ onward when taking action $a$ in $x$ and following the previously computed solutions $\hat{\pi}_{t+1:h}$ for future time steps. The context distribution $D_t$ is obtained by uniformly sampling a policy in $\Psi_t$ and rolling in with it until time step $t$. To solve this problem, we first collect a dataset $D$ of $(x, a, p, r)$ tuples of size $n_{\mathrm{psdp}}$ by (1) sampling $x$ by rolling in with a uniformly selected policy in $\Psi_t$ until time step $t$, (2) taking an action $a$ uniformly at random, (3) setting $p = 1/|\mathcal{A}|$, (4) executing $\hat{\pi}_{t+1:h}$ for the remaining time steps, and (5) setting $r$ to the total reward under $R$ accrued from time step $t$ onward. Then we invoke the contextual bandit oracle $\mathrm{CB}$ with dataset $D$ to obtain $\hat{\pi}_t$. Repeating this process for $t = h, h-1, \ldots, 1$, we obtain the non-stationary policy $\hat{\pi}_{1:h}$ returned by PSDP.

The learned policy cover $\Psi_h$ for time step $h$ is simply the set of policies identified by optimizing each of the internal reward functions $R_1, \ldots, R_N$. Once we find the policy covers $\Psi_1, \ldots, \Psi_H$, we perform reward-sensitive learning via a single invocation of PSDP using the external reward function $R$ (\prefalg:oracle-exp, \prefline:oracle-reward-sensitive). Of course, in a purely reward-free setting, we can omit this last step and simply return the policy covers $\Psi_{1:H}$.
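The sketch below mirrors the PSDP data-collection loop described above: roll in with a uniformly chosen cover policy, take a uniform action, roll out with the already-learned suffix policies, and hand the resulting $(x, a, p, r)$ quads to the contextual bandit oracle. The environment interface (reset/step/actions) and the convention that the reward depends only on the observation (as the internal rewards above do) are illustrative assumptions.

```python
import random

def psdp(env, policy_covers, reward_fn, h, policy_class, n_samples, cb_oracle):
    """Sketch of PSDP: learn a non-stationary policy (pi_1, ..., pi_h) by solving one
    contextual bandit problem per time step, backwards in time. For simplicity the
    reward is a function of the observation alone.
    policy_covers[t] holds roll-in policies for time step t, each a list of per-step maps."""
    suffix = []  # already-learned policies for time steps t+1, ..., h
    for t in range(h, 0, -1):
        dataset = []
        for _ in range(n_samples):
            # (1) Roll in to time step t with a uniformly chosen cover policy.
            x = env.reset()
            roll_in = random.choice(policy_covers[t]) if policy_covers[t] else []
            for step in range(t - 1):
                x = env.step(roll_in[step](x))
            # (2) Take an action uniformly at random; (3) its propensity is 1/|A|.
            x_t, a_t = x, random.choice(env.actions)
            p = 1.0 / len(env.actions)
            # (4) Execute the learned suffix policies; (5) accumulate reward from step t on.
            r = reward_fn(x_t)
            x = env.step(a_t)
            for pi in suffix:
                r += reward_fn(x)
                x = env.step(pi(x))
            r += reward_fn(x)  # the observation reached after the last action also scores
            dataset.append((x_t, a_t, p, r))
        # Solve the induced contextual bandit problem and prepend the solution.
        pi_t = cb_oracle(policy_class, dataset)
        suffix = [pi_t] + suffix
    return suffix  # the non-stationary policy (pi_1, ..., pi_h)
```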

4 Learning Kinematic Inseparability for Strategic Exploration

We now present our main algorithm, HOMER, displayed in \prefalg:learn_homing_policy, which learns a kinematic inseparability abstraction while performing reward-free strategic exploration. Given hypothesis classes $\Pi$ and $\mathcal{F}_N$, a positive integer $N$, and three hyperparameters $(\eta, \epsilon, \delta)$, HOMER learns a policy cover of size $N$ and a state abstraction function for each time step. For our theoretical analysis, we assume that $N$ is at least the number of kinematically inseparable abstract states and that $\eta \leq \eta_{\min}$.

1: Set $\Psi_1, \ldots, \Psi_H \leftarrow \emptyset$, $D \leftarrow \emptyset$, and the sample sizes $n_{\mathrm{reg}}$ and $n_{\mathrm{psdp}}$
2: for $h = 2, \ldots, H$ do
3:     $D \leftarrow \emptyset$
4:     for $n_{\mathrm{reg}}$ times do
5:         Sample two independent transitions $(x^{(1)}, a^{(1)}, x'^{(1)})$ and $(x^{(2)}, a^{(2)}, x'^{(2)})$ via the roll-in procedure (see text)
6:         Sample $y \sim \mathrm{Bern}(1/2)$
7:         if $y = 1$ then
8:             $D \leftarrow D \cup \{(x^{(1)}, a^{(1)}, x'^{(1)}, 1)\}$  // Add a real transition
9:         else
10:            $D \leftarrow D \cup \{(x^{(1)}, a^{(1)}, x'^{(2)}, 0)\}$  // Add an imposter transition
11:     $(\hat{w}, \hat{\phi}_{\mathrm{f}}, \hat{\phi}_{\mathrm{b},h}) \leftarrow \mathrm{REG}(D, \mathcal{F}_N)$  // Supervised learning on $D$ (learn state abstraction)
12:     for $i = 1$ to $N$ do
13:         Define $R_i(x, a) := \mathbf{1}\{\hat{\phi}_{\mathrm{b},h}(x) = i \text{ and } \tau(x) = h\}$  // Internal reward functions
14:         $\pi_{h,i} \leftarrow \mathrm{PSDP}(\Psi_{1:h-1}, R_i, h-1, \Pi, n_{\mathrm{psdp}})$
15:     $\Psi_h \leftarrow \{\pi_{h,1}, \ldots, \pi_{h,N}\}$  // Learn policy cover
16: Set the combined abstractions $\hat{\phi}_{1:H}$ from the learned forward and backward abstractions (see text)
17: $\hat{\pi} \leftarrow \mathrm{PSDP}(\Psi_{1:H}, R, H, \Pi, n_{\mathrm{psdp}})$  // Reward Sensitive Learning
18: return $\Psi_{1:H}$, $\hat{\phi}_{\mathrm{b},1:H}$, $\hat{\phi}_{\mathrm{f},1:H}$, $\hat{\pi}$
Algorithm 3 HOMER. Reinforcement and abstraction learning in a Block MDP.

The overall structure of HOMER is similar to ExpOracle, with a reward-free phase preceding the reward-sensitive one. As with ExpOracle, HOMER proceeds inductively, learning a policy cover for observations reachable at time step $h$ given the learned policy covers for previous steps (\prefalg:learn_homing_policy, \prefline:homer-time-step-start-\prefline:homer-time-step-end).

The key difference from ExpOracle is that, for each iteration $h$, we first learn an abstraction function over $\mathcal{X}_h$. This is done using a form of contrastive estimation and our function class $\mathcal{F}_N$. Specifically, in the $h$-th iteration, HOMER collects a dataset $D$ of size $n_{\mathrm{reg}}$ containing real and imposter transitions. We define a sampling procedure in which the observation $x$ is obtained by rolling in with a uniformly sampled policy in $\Psi_{h-1}$ until time step $h-1$, the action $a$ is taken uniformly at random, and $x'$ is the resulting observation at time step $h$ (\prefalg:learn_homing_policy, \prefline:homer-sampling-procedure). We sample two independent transitions $(x^{(1)}, a^{(1)}, x'^{(1)})$ and $(x^{(2)}, a^{(2)}, x'^{(2)})$ using this procedure, and we also sample a Bernoulli random variable $y \sim \mathrm{Bern}(1/2)$. If $y = 1$ then we add the observed transition $(x^{(1)}, a^{(1)}, x'^{(1)})$ with label $1$ to $D$, and otherwise we add the imposter transition $(x^{(1)}, a^{(1)}, x'^{(2)})$ with label $0$ (\prefalg:learn_homing_policy, \prefline:homer-sample-add-start-\prefline:homer-sample-add-end). Note that the imposter transition may or may not correspond to a feasible environment outcome.
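A minimal sketch of the contrastive data collection just described, using the same illustrative environment interface as the earlier sketches: draw two independent transitions at the current time step, flip a fair coin, and keep either the real or the imposter triple.

```python
import random

def sample_transition(env, policy_cover_prev):
    """Roll in with a uniformly chosen policy from the previous cover, then take a uniform action."""
    x = env.reset()
    roll_in = random.choice(policy_cover_prev) if policy_cover_prev else []
    for step_policy in roll_in:
        x = env.step(step_policy(x))
    a = random.choice(env.actions)
    x_next = env.step(a)
    return x, a, x_next

def contrastive_dataset(env, policy_cover_prev, n_reg):
    """Build the real-vs-imposter dataset used to fit the bottleneck regressor."""
    dataset = []
    for _ in range(n_reg):
        x1, a1, xn1 = sample_transition(env, policy_cover_prev)
        x2, a2, xn2 = sample_transition(env, policy_cover_prev)  # independent copy
        if random.random() < 0.5:
            dataset.append((x1, a1, xn1, 1))  # real transition
        else:
            dataset.append((x1, a1, xn2, 0))  # imposter: next obs from an independent draw
    return dataset
```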

We call the subroutine $\mathrm{REG}$ to solve the supervised learning problem induced by $D$ with model family $\mathcal{F}_N$ (\prefalg:learn_homing_policy, \prefline:homer-learn-encoding-function), and we obtain a predictor $\hat{f}_h = (\hat{w}, \hat{\phi}_{\mathrm{f}}, \hat{\phi}_{\mathrm{b}})$. As we will show, the bottleneck $\hat{\phi}_{\mathrm{b}}$ applied to the third argument (the observation at time step $h$) is closely related to a backward KI abstraction, so we can use it to drive exploration at time step $h$, just as in ExpOracle. Empirically, we will see that the bottleneck $\hat{\phi}_{\mathrm{f}}$ applied to the first argument (the observation at time step $h-1$) is closely related to a forward KI abstraction, which is useful for auxiliary tasks such as learning the transition dynamics or visualization.

With the abstraction in hand, the rest of the algorithm proceeds similarly to ExpOracle. We invoke PSDP with the internal reward functions induced by $\hat{\phi}_{\mathrm{b}}$ to find the policy cover $\Psi_h$ (\prefalg:learn_homing_policy, \prefline:homer-policy-c-ver-psdp). Once we have found policy covers for all time steps, we perform reward-sensitive optimization just as before (\prefalg:learn_homing_policy, \prefline:homer-reward-sensitive). As with ExpOracle, we can ignore the reward-sensitive phase and operate in a purely reward-free setting by simply returning the policy covers $\Psi_{1:H}$.

For each time step $h$, we combine the backward abstraction learned in iteration $h$ with the forward abstraction learned in iteration $h+1$ (both act on $\mathcal{X}_h$) to form the learned KI abstraction $\hat{\phi}_h$, where for any $x_1, x_2 \in \mathcal{X}_h$, $\hat{\phi}_h(x_1) = \hat{\phi}_h(x_2)$ if and only if the two observations agree under both the backward and the forward abstraction. We define $\hat{\phi}_1$ using only the forward abstraction and $\hat{\phi}_H$ using only the backward abstraction, as there is no backward and forward dynamics information available at these time steps, respectively. Empirically, we use $\hat{\phi}_h$ for learning the transition dynamics and for visualization (see \prefsec:exp).

5 Theoretical Analysis

Our main theoretical contribution is to show that both ExpOracle and HOMER compute a policy cover and a near-optimal policy with high probability in a sample-efficient and computationally tractable manner. The result requires an additional expressivity assumption on our model classes $\Pi$ and $\mathcal{F}_N$, which we now state.

Assumption 2 (Realizability).

Let $\mathcal{R}$ be the set of external and internal reward functions. We assume that $\Pi$ satisfies policy completeness for every $R \in \mathcal{R}$: for every $h \in [H]$ and every non-stationary policy $\pi'$ for time steps $h+1, \ldots, H$, there exists $\pi \in \Pi$ such that for all $x \in \mathcal{X}_h$,
$$\pi(x) = \arg\max_{a \in \mathcal{A}} \mathbb{E}\Big[R(x, a) + \sum_{h' = h+1}^{H} R(x_{h'}, a_{h'}) \,\Big|\, x_h = x,\ a_h = a,\ a_{h'} = \pi'(x_{h'})\Big].$$

We also assume that $\mathcal{F}_N$ is realizable: for any $h \in [H]$ and any prior distribution $\rho \in \Delta(\mathcal{X}_h)$ with $\mathrm{supp}(\rho) = \mathcal{X}_h$, there exists $f \in \mathcal{F}_N$ such that for all $(x, a, x') \in \mathcal{X}_{h-1} \times \mathcal{A} \times \mathcal{X}_h$,
$$f(x, a, x') = \frac{T(x' \mid x, a)}{T(x' \mid x, a) + \rho(x')}.$$

Completeness assumptions are common in the analysis of dynamic-programming-style algorithms in the function approximation setting (Antos et al., 2008); see Chen and Jiang (2019) for a detailed discussion. Our exact completeness assumption appears in the work of Dann et al. (2018), who use it to derive an efficient algorithm for a restricted version of our setting with deterministic latent state transitions.

The realizability assumption on $\mathcal{F}_N$ is adapted to our learning approach: as we use $\mathcal{F}_N$ to distinguish between real and imposter transitions, $\mathcal{F}_N$ should contain the Bayes optimal classifier for these problems. In HOMER, the sampling procedure used to collect data for the learning problem in the $h$-th iteration induces a marginal distribution $\rho$ over $\mathcal{X}_h$, and the Bayes optimal predictor for this problem is precisely the function displayed in \prefassum:realizability (see \preflem:p-form in \prefapp:appendix-sample-complexity). It is not hard to see that if $x_1, x_2$ are kinematically inseparable then the Bayes optimal predictor takes the same value whether its first argument is $x_1$ or $x_2$, and the same claim holds for the third argument. Therefore, by the bottleneck structure of $\mathcal{F}_N$, a sufficient condition to ensure realizability is that $\Phi_N$ contains a kinematic inseparability abstraction, which is reasonable as this is precisely what we are trying to learn.

Theoretical Guarantees.

We first present the guarantee for ExpOracle.

Theorem (ExpOracle Result). For any Block MDP, given a backward KI abstraction $\phi_{\mathrm{B}}$ with range $[N]$ and parameters $(\eta, \epsilon, \delta)$ with $\eta \leq \eta_{\min}$, ExpOracle outputs policy covers $\Psi_{1:H}$ and a reward-sensitive policy $\hat{\pi}$ such that the following holds, with probability at least $1 - \delta$:

  1. For each $h \in [H]$, $\Psi_h$ is a $1/2$-policy cover for $\mathcal{S}_h$;

  2. $V(\hat{\pi}) \geq V(\pi^\star) - \epsilon$.

The sample complexity is polynomial in $N$, $|\mathcal{A}|$, $H$, $1/\eta$, $1/\epsilon$, $\ln|\Pi|$, and $\ln(1/\delta)$, and the running time is polynomial in the sample complexity and the oracle running time $\mathrm{Time}_{\mathrm{CB}}(\cdot)$, where the sample size $n_{\mathrm{psdp}}$ is defined in \prefalg:oracle-exp.

\prefthm:oracle-exp-statement shows that we can learn a policy cover and use it to learn a near-optimal policy, assuming access to a backward KI abstraction. The sample complexity is polynomial in all of the relevant parameters and has no dependence on the size of the observation space, which at a coarse level is our desired scaling. We have not attempted to optimize the exponents in the sample complexity or running time.

We also remark that we may be able to achieve this oracle guarantee with other algorithmic approaches. Two natural candidates are (1) a model-based approach where we learn the dynamics models over the backward KI abstract states and plan to visit abstract states, and (2) a value-based approach like Fitted Q-Iteration with the same synthetic reward structure as we use here (Munos and Szepesvári, 2008; Antos et al., 2008; Farahmand et al., 2010; Chen and Jiang, 2019). We have not analyzed these approaches in our precise setting, and they may actually be more sample efficient than our policy-based approach. Despite this, \prefthm:oracle-exp-statement suffices to establish our main conceptual takeaway: that a backward KI abstraction can be used for sample-efficient exploration and policy optimization.

We next present the result for HOMER. Here we show that HOMER achieves a similar guarantee to ExpOracle, without prior access to a backward KI abstraction. In other words, HOMER provably learns a backward KI abstraction and uses it for exploration and policy optimization.

Theorem (Main Result). For any Block MDP and hyperparameters $(N, \eta, \epsilon, \delta)$ satisfying $N \geq N_{\mathrm{KI}}$ and $\eta \leq \eta_{\min}$, HOMER outputs exploration policies $\Psi_{1:H}$ and a reward-sensitive policy $\hat{\pi}$ satisfying the following, with probability at least $1 - \delta$:

  1. $\Psi_h$ is a $1/2$-policy cover of $\mathcal{S}_h$ for every $h \in [H]$;

  2. $V(\hat{\pi}) \geq V(\pi^\star) - \epsilon$.

The sample complexity of HOMER is determined by the sample sizes $n_{\mathrm{reg}}$ and $n_{\mathrm{psdp}}$ specified in \prefalg:learn_homing_policy; it is polynomial in $N$, $|\mathcal{A}|$, $H$, $1/\eta$, $1/\epsilon$, $\ln(1/\delta)$, $\ln|\Pi|$, and $\ln|\Phi_N|$, with no dependence on $|\mathcal{X}|$. The running time is polynomial in the same quantities and the oracle running times $\mathrm{Time}_{\mathrm{CB}}(\cdot)$ and $\mathrm{Time}_{\mathrm{REG}}(\cdot)$.

In comparison with the guarantee for ExpOracle, the main qualitative difference is that the guarantee for HOMER also depends on $\ln|\Phi_N|$, which is natural as HOMER attempts to learn the backward KI abstraction. Crucially, the logarithmic dependence on $|\Phi_N|$ implies that we can take $\Phi_N$ to be exponentially large and still achieve a guarantee that is qualitatively comparable to that of ExpOracle. This demonstrates that we can learn a backward KI abstraction function from an exponentially large class and then use it for exploration and policy optimization.

In terms of computation, the running time is polynomial in our oracle model, where we assume we can solve contextual bandit problems over $\Pi$ and regression problems over $\mathcal{F}_N$. In \prefsec:exp, we will see that these problems can be solved effectively in practice.

The closest related result is for the algorithm of Du et al. (2019), which provably finds a policy cover in a restricted class of Block MDPs in a sample- and computationally-efficient manner. The precise details of the guarantee differ from ours in several ways (e.g., additive versus multiplicative error in the policy cover definition, different computational and expressivity assumptions), so the sample complexity bounds are incomparable. However, \prefthm:main-theorem represents a significant conceptual advance as it eliminates the identifiability assumptions required by that algorithm and therefore greatly increases the scope for tractable reinforcement learning.

Why does HOMER learn kinematic inseparability?

A detailed proof of both theorems is deferred to \prefapp:psdp-appendix-\prefapp:appendix-sample-complexity, but for intuition, we provide a sketch of how HOMER learns a kinematic inseparability abstraction. For this discussion only, we focus on asymptotic behavior and ignore sampling issues.

Inductively, assume that $\Psi_{h-1}$ is a policy cover of $\mathcal{S}_{h-1}$, let $D$ be the marginal distribution over real and imposter transitions sampled by HOMER in the $h$-th iteration (\prefline:homer-dataset-step-start–\prefline:homer-dataset-step-end), and let $\rho$ be the induced marginal distribution over $\mathcal{X}_h$. First observe that, as $\Psi_{h-1}$ is a policy cover, we have $\mathrm{supp}(\rho) = \mathcal{X}_h$, which is crucial for our analysis. Let $\hat{f} = (\hat{w}, \hat{\phi}_{\mathrm{f}}, \hat{\phi}_{\mathrm{b}})$ be the output of the regression oracle in this iteration. The first observation is that the Bayes optimal regressor for this problem is the function $f^\star(x, a, x') = \frac{T(x' \mid x, a)}{T(x' \mid x, a) + \rho(x')}$ defined in \prefassum:realizability, and, with realizability, in this asymptotic discussion we have $\hat{f} = f^\star$.

Next, we show that for any two observations $x'_1, x'_2 \in \mathcal{X}_h$, if $\hat{\phi}_{\mathrm{b}}(x'_1) = \hat{\phi}_{\mathrm{b}}(x'_2)$ then $x'_1$ and $x'_2$ are backward kinematically inseparable. If this precondition holds, then for all $(x, a)$,
$$f^\star(x, a, x'_1) = \hat{w}\big(\hat{\phi}_{\mathrm{f}}(x), a, \hat{\phi}_{\mathrm{b}}(x'_1)\big) = \hat{w}\big(\hat{\phi}_{\mathrm{f}}(x), a, \hat{\phi}_{\mathrm{b}}(x'_2)\big) = f^\star(x, a, x'_2).$$

Then, by inspection of the form of $f^\star$, we have
$$\frac{T(x'_1 \mid x, a)}{\rho(x'_1)} = \frac{T(x'_2 \mid x, a)}{\rho(x'_2)} \quad \text{for all } (x, a).$$

As this identity holds for all $(x, a)$, and trivially when both transition probabilities are zero, it is easy to see that $x'_1, x'_2$ satisfy \prefdef:bki, so they are backward KI. Formally, taking expectations with respect to any prior $u \in \Delta(\mathcal{X}_{h-1} \times \mathcal{A})$, we have
$$\mathbb{P}_u(x, a \mid x'_1) = \frac{T(x'_1 \mid x, a)\, u(x, a)}{\sum_{\tilde{x}, \tilde{a}} T(x'_1 \mid \tilde{x}, \tilde{a})\, u(\tilde{x}, \tilde{a})} = \frac{T(x'_2 \mid x, a)\, u(x, a)}{\sum_{\tilde{x}, \tilde{a}} T(x'_2 \mid \tilde{x}, \tilde{a})\, u(\tilde{x}, \tilde{a})} = \mathbb{P}_u(x, a \mid x'_2),$$
where the middle equality follows by substituting $T(x'_1 \mid \cdot, \cdot) = \frac{\rho(x'_1)}{\rho(x'_2)}\, T(x'_2 \mid \cdot, \cdot)$ and canceling the common factor. This implies that $\hat{\phi}_{\mathrm{b}}$ is a backward KI abstraction over $\mathcal{X}_h$.

Efficient Implementation of HOMER.

As stated, the most computationally expensive component of HOMER is the calls to PSDP for learning the policy covers, which require $N \cdot H$ invocations in total. In practice, this cost can be significantly reduced. Empirically, we use two important optimizations. First, we parallelize the calls to PSDP for optimizing the internal reward functions in each iteration of the algorithm (\prefline:homer-learn-policy-start–\prefline:homer-learn-policy-end). Second, and perhaps more significantly, we can attempt to find compositional policies using a greedy search procedure. Here, rather than performing full dynamic programming to optimize reward $R_i$, we find a policy for the last time step only, and then we append this policy to the best one from our cover $\Psi_{h-1}$. Formally, we compute the best composition $\arg\max_{\pi' \in \Psi_{h-1},\ \pi \in \Pi} V(\pi' \circ \pi; R_i)$, where $V(\cdot; R_i)$ is the policy value with respect to the reward function $R_i$ and $\circ$ denotes policy composition (execute $\pi'$ for the first $h-1$ time steps and $\pi$ at the last step). Then, since the optimal value with respect to $R_i$ is bounded by the maximum visitation probability of the corresponding abstract state, we check whether the composed policy attains a sufficiently large fraction of this bound. If it does, we need not perform a full dynamic programming backup; otherwise, we revert to PSDP. The greedy search may succeed even though the policies we are trying to find are non-compositional in general. In the favorable situation where it succeeds, no further samples are required, since we can re-use the real transitions from the regression step, and we need only solve one contextual bandit problem instead of $h-1$. Empirically, both of these optimizations may yield significant statistical and computational savings.
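The sketch below illustrates the greedy shortcut under stated assumptions: re-use the real transitions from the regression step to fit a last-step policy with a single contextual bandit call, compose it with each cover policy, and fall back to full PSDP when no composition clears a value threshold. The threshold, the Monte Carlo value estimator, and the function names are illustrative choices rather than the paper's exact criterion.

```python
def greedy_policy_search(real_transitions, abstract_id, phi_b, n_actions,
                         cover_prev, policy_class, cb_oracle,
                         value_estimate, threshold, psdp_fallback):
    """Greedy shortcut sketch: re-use the real transitions gathered for the regression
    step to solve a single contextual bandit problem for the last time step, append the
    result to the best cover policy, and fall back to full PSDP if the value is too low."""
    # Reward is 1 when the next observation lands in the target abstract state.
    cb_data = [(x, a, 1.0 / n_actions, float(phi_b(x_next) == abstract_id))
               for (x, a, x_next) in real_transitions]
    last_step = cb_oracle(policy_class, cb_data)

    # Append the last-step policy to each cover policy and keep the best composition.
    candidates = [list(prefix) + [last_step] for prefix in cover_prev]
    values = [value_estimate(c) for c in candidates]  # e.g., Monte Carlo rollouts
    best_value = max(values)
    if best_value >= threshold:  # e.g., a constant fraction of the reachability bound
        return candidates[values.index(best_value)]
    return psdp_fallback()       # revert to full dynamic programming
```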

6 Can We Use Existing State Abstraction Oracles?

Our analysis so far verifies the utility of the backward KI state abstraction: it enables efficient exploration via