Near Optimal Behavior via Approximate State Abstraction
Footnote: A previous version of this paper was published in the Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).
The combinatorial explosion that plagues planning and reinforcement learning (RL) algorithms can be moderated using state abstraction. Prohibitively large task representations can be condensed such that essential information is preserved, and consequently, solutions are tractably computable. However, exact abstractions, which treat only fully-identical situations as equivalent, fail to present opportunities for abstraction in environments where no two situations are exactly alike. In this work, we investigate approximate state abstractions, which treat nearly-identical situations as equivalent. We present theoretical guarantees of the quality of behaviors derived from four types of approximate abstractions. Additionally, we empirically demonstrate that approximate abstractions lead to reduction in task complexity and bounded loss of optimality of behavior in a variety of environments.
- Markov Decision Process
- reinforcement learning
Abstraction plays a fundamental role in learning. Through abstraction, intelligent agents may reason about only the salient features of their environment while ignoring what is irrelevant. Consequently, agents are able to solve considerably more complex problems than they would be able to without the use of abstraction. However, exact abstractions, which treat only fully-identical situations as equivalent, require complete knowledge that is computationally intractable to obtain. Furthermore, often no two situations are identical, so exact abstractions are often ineffective. To overcome these issues, we investigate approximate abstractions that enable agents to treat sufficiently similar situations as identical. This work characterizes the impact of equating “sufficiently similar” states in the context of planning and RL in Markov Decision Processes. The remainder of our introduction contextualizes these intuitions in MDPs.
Solving for optimal behavior in MDPs in a planning setting is known to be P-Complete in the size of the state space [28, 25]. Similarly, many RL algorithms for solving MDPs are known to require a number of samples polynomial in the size of the state space. Although polynomial runtime or sample complexity may seem like a reasonable constraint, the size of the state space of an MDP grows super-polynomially with the number of variables that characterize the domain, a result of Bellman’s curse of dimensionality. Thus, solutions polynomial in state space size are often ineffective for sufficiently complex tasks. For instance, a robot involved in a pick-and-place task might be able to employ planning algorithms to solve for how to manipulate some objects into a desired configuration in time polynomial in the number of states, but the number of states it must consider grows exponentially with the number of objects with which it is working.
Thus, a key research agenda for planning and RL is leveraging abstraction to reduce large state spaces [2, 21, 10, 12, 6]. This agenda has given rise to methods that reduce ground MDPs with large state spaces to abstract MDPs with smaller state spaces by aggregating states according to some notion of equality or similarity. In the context of MDPs, we understand exact abstractions as those that aggregate states with equal values of particular quantities, for example, optimal $Q$-values. Existing work has characterized how exact abstractions can fully maintain optimality in MDPs [24, 8].
The thesis of this work is that performing approximate abstraction in MDPs by relaxing the state aggregation criteria from equality to similarity achieves polynomially bounded error in the resulting behavior while offering three benefits. First, approximate abstractions employ the sort of knowledge that we expect a planning or learning algorithm to compute without fully solving the MDP. In contrast, exact abstractions often require solving for optimal behavior, thereby defeating the purpose of abstraction. Second, because of their relaxed criteria, approximate abstractions can achieve greater degrees of compression than exact abstractions. This difference is particularly important in environments where no two states are identical. Third, because the state aggregation criteria are relaxed to near equality, approximate abstractions are able to tune the aggressiveness of abstraction by adjusting what they consider sufficiently similar states.
We support this thesis by describing four different types of approximate abstraction functions that preserve near-optimal behavior by aggregating states on different criteria: $\widetilde{\phi}_{Q^*,\varepsilon}$, on similar optimal $Q$-values, $\widetilde{\phi}_{\text{model},\varepsilon}$, on similarity of rewards and transitions, $\widetilde{\phi}_{\text{bolt},\varepsilon}$, on similarity of a Boltzmann distribution over optimal $Q$-values, and $\widetilde{\phi}_{\text{mult},\varepsilon}$, on similarity of a multinomial distribution over optimal $Q$-values. Furthermore, we empirically demonstrate the relationship between the degree of compression and error incurred on a variety of MDPs.
This paper is organized as follows. In the next section, we introduce the necessary terminology and background of MDPs and state abstraction. Section 3 surveys existing work on state abstraction applied to sequential decision making. Section 5 introduces our primary result: bounds on the error guaranteed by four classes of approximate state abstraction. Section 6 introduces the simulated domains used in experiments, and Section 7 discusses experiments in which we apply one class of approximate abstraction to a variety of tasks to empirically illustrate the relationship between degree of compression and error incurred.
An MDP is a problem representation for sequential decision making agents, represented by a five-tuple: $\langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma \rangle$. Here, $\mathcal{S}$ is a finite state space; $\mathcal{A}$ is a finite set of actions available to the agent; $\mathcal{T}$ denotes $\mathcal{T}(s, a, s')$, the probability of an agent transitioning to state $s' \in \mathcal{S}$ after applying action $a \in \mathcal{A}$ in state $s \in \mathcal{S}$; $\mathcal{R}(s, a)$ denotes the reward received by the agent for executing action $a$ in state $s$; $\gamma \in [0, 1)$ is a discount factor that determines how the agent weighs future rewards against immediate rewards. We assume without loss of generality that the range of all reward functions is normalized to $[0, 1]$. The solution to an MDP is called a policy, denoted $\pi : \mathcal{S} \mapsto \mathcal{A}$.
The objective of an agent is to solve for the policy that maximizes its expected discounted reward from any state, denoted $\pi^*$. We denote the expected discounted reward for following policy $\pi$ from state $s$ as the value of the state under that policy, $V^\pi(s)$. We similarly denote the expected discounted reward for taking action $a \in \mathcal{A}$ and then following policy $\pi$ from state $s$ forever after as $Q^\pi(s, a)$, defined by the Bellman Equation as:
$$Q^\pi(s, a) = \mathcal{R}(s, a) + \gamma \sum_{s' \in \mathcal{S}} \mathcal{T}(s, a, s')\, Q^\pi(s', \pi(s')).$$
We let RMax denote the maximum reward (which is 1), and QMax denote the maximum $Q$-value, which is $\frac{\text{RMax}}{1 - \gamma}$. The value function, $V$, defined under a given policy, denoted $V^\pi(s)$, is defined as:
$$V^\pi(s) = Q^\pi(s, \pi(s)).$$
Lastly, we denote the value and $Q$ functions under the optimal policy as $V^*$ or $V^{\pi^*}$ and $Q^*$ or $Q^{\pi^*}$, respectively. For further background, see Kaelbling et al.
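As a rough illustration of these definitions, the sketch below computes optimal $Q$-values for a small tabular MDP by repeatedly applying the Bellman optimality backup (the Bellman Equation above with the policy's action replaced by a maximizing action). The nested-dictionary encoding, the function name `q_value_iteration`, and the two-state example are assumptions made for illustration; they are not taken from the paper.

```python
from typing import Dict

# Hypothetical tabular encoding: T[s][a] maps next state -> probability,
# and R[s][a] is a reward in [0, 1].
Transitions = Dict[str, Dict[str, Dict[str, float]]]
Rewards = Dict[str, Dict[str, float]]


def q_value_iteration(T: Transitions, R: Rewards, gamma: float,
                      tol: float = 1e-10) -> Dict[str, Dict[str, float]]:
    """Apply the Bellman optimality backup until the largest update is below tol."""
    Q = {s: {a: 0.0 for a in R[s]} for s in R}
    while True:
        delta = 0.0
        for s in R:
            for a in R[s]:
                backup = R[s][a] + gamma * sum(
                    p * max(Q[s2].values()) for s2, p in T[s][a].items())
                delta = max(delta, abs(backup - Q[s][a]))
                Q[s][a] = backup
        if delta < tol:
            return Q


# Two-state example: choosing "stay" in s1 earns the maximum reward forever.
T = {"s0": {"stay": {"s0": 1.0}, "go": {"s1": 1.0}},
     "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}}}
R = {"s0": {"stay": 0.0, "go": 0.0},
     "s1": {"stay": 1.0, "go": 0.0}}
Q_star = q_value_iteration(T, R, gamma=0.9)
V_star = {s: max(Q_star[s].values()) for s in Q_star}
```

With $\gamma = 0.9$, the value of `Q_star["s1"]["stay"]` approaches QMax $= \frac{1}{1 - \gamma} = 10$, which matches the definition of QMax given above.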
3 Related Work
Several other projects have addressed similar topics.
3.1 Approximate State Abstraction
Dean et al. leverage the notion of bisimulation to investigate partitioning an MDP’s state space into clusters of states whose transition model and reward function are within $\varepsilon$ of each other. They develop an algorithm called Interval Value Iteration (IVI) that converges to the correct bounds on a family of abstract MDPs called Bounded MDPs.
Several approaches build on Dean et al. Ferns et al. [14, 15] investigated state similarity metrics for MDPs; they bounded the value difference of ground states and abstract states for several bisimulation metrics that induce an abstract MDP. This differs from our work, which develops a theory of abstraction that bounds the suboptimality of applying the optimal policy of an abstract MDP to its ground MDP, covering four types of state abstraction, one of which closely parallels bisimulation. Even-Dar and Mansour analyzed different distance metrics used in identifying state space partitions subject to $\varepsilon$-similarity, also providing value bounds (their Lemma 4) for $\varepsilon$-homogeneity subject to the $L_\infty$ norm, which parallels our Claim 2. Ortner developed an algorithm for learning partitions in an online setting by taking advantage of the confidence bounds for $\mathcal{T}$ and $\mathcal{R}$ provided by UCRL.
Hutter [18, 17] investigates state aggregation beyond the MDP setting, presenting a variety of results for aggregation functions in reinforcement learning. Most relevant to our investigation is Hutter’s Theorem 8, which illustrates properties of aggregating states based on similar $Q$-values. Part (a) of Hutter’s Theorem 8 parallels our Claim 1: both bound the value difference between ground and abstract states. Part (b) is analogous to our Lemma 1: both bound the value difference of applying the optimal abstract policy in the ground MDP. Part (c) repeats the comment given by Li et al. that $\phi_{Q^*}$ abstractions preserve the optimal value function. For Lemma 1, our proof strategies differ from Hutter’s, but the result is the same.
Approximate state abstraction has also been applied to the planning problem, in which the agent is given a model of its environment and must compute a plan that satisfies some goal. Hostetler et al.  apply state abstraction to Monte Carlo Tree Search and expectimax search, giving value bounds of applying the optimal abstract action in the ground tree(s), similarly to our setting. Dearden and Boutilier  also formalize state-abstraction for planning, focusing on abstractions that are quickly computed and offer bounded value. Their primary analysis is on abstractions that remove negligible literals from the planning domain description, yielding value bounds for these abstractions and a means of incrementally improving abstract solutions to planning problems. Jiang et al.  analyze a similar setting, applying abstractions to the Upper Confidence Bound applied to Trees algorithm adapted for planning, introduced by Kocsis and Szepesvári .
Mandel et al. advance Bayesian aggregation in RL to define Thompson Clustering for Reinforcement Learning (TCRL), an extension of which achieves near-optimal Bayesian regret bounds. Jiang analyze the problem of choosing between two candidate abstractions. They develop an algorithm based on statistical tests that trades off the approximation error against the estimation error of the two abstractions, yielding a loss bound on the quality of the chosen policy.
3.2 Specific Abstraction Algorithms
Many previous works have targeted the creation of algorithms that enable state abstraction for MDPs. Andre and Russell investigated a method for state abstraction in hierarchical reinforcement learning leveraging a programming language called ALISP that promotes the notion of safe state abstraction. Agents programmed using ALISP can ignore irrelevant parts of the state, achieving abstractions that maintain optimality. Dietterich developed MAXQ, a framework for composing tasks into an abstracted hierarchy where state aggregation can be applied. Bakker and Schmidhuber also target hierarchical abstraction, focusing on subgoal discovery. Jong and Stone introduced a method called policy-irrelevance in which agents identify (online) which state variables may be safely abstracted away in a factored-state MDP. Dayan and Hinton develop “Feudal Reinforcement Learning”, which presents an early form of hierarchical RL that restructures $Q$-Learning to manage the decomposition of a task into subtasks. For a more complete survey of algorithms that leverage state abstraction in past reinforcement-learning papers, see Li et al., and for a survey of early works on hierarchical reinforcement learning, see Barto and Mahadevan.
3.3 Exact Abstraction Framework
Li et al. developed a framework for exact state abstraction in MDPs. In particular, the authors defined five types of state aggregation functions, inspired by existing methods for state aggregation in MDPs. We generalize two of these five types, $\phi_{Q^*}$ and $\phi_{\text{model}}$, to the approximate abstraction case. Our generalizations are equivalent to theirs when exact criteria are used (i.e., $\varepsilon = 0$). Additionally, when exact criteria are used our bounds indicate that no value is lost, which is one of the core results of Li et al. Walsh et al. build on the framework they previously developed by showing empirically how to transfer abstractions between structurally related MDPs.
4 Abstraction Notation
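($G$): Given a state aggregation function $\phi$ that maps each state of the ground MDP, $s \in \mathcal{S}_G$, to a state of the abstract MDP, $\phi(s) \in \mathcal{S}_A$, each ground state has associated with it the ground states with which it is aggregated. Similarly, each abstract state has its constituent ground states. We let $G$ be the function that retrieves these states:
$$G(s) = \begin{cases} \{g \in \mathcal{S}_G \mid \phi(g) = \phi(s)\} & \text{if } s \in \mathcal{S}_G, \\ \{g \in \mathcal{S}_G \mid \phi(g) = s\} & \text{if } s \in \mathcal{S}_A. \end{cases}$$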
The abstract reward function and abstract transition dynamics for each abstract state are a weighted combination of the rewards and transitions for each ground state in the abstract state.
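($\omega(s)$): We refer to the weight associated with a ground state, $s \in \mathcal{S}_G$, by $\omega(s)$. The only restriction placed on the weighting scheme is that it induces a probability distribution on the ground states of each abstract state:
$$\forall s \in \mathcal{S}_G: \quad \sum_{g \in G(s)} \omega(g) = 1 \quad \text{and} \quad \omega(s) \in [0, 1].$$
($\mathcal{R}_A$): The abstract reward function is a weighted sum of the rewards of each of the ground states that map to the same abstract state:
$$\mathcal{R}_A(s, a) = \sum_{g \in G(s)} \mathcal{R}_G(g, a)\, \omega(g).$$
($\mathcal{T}_A$): The abstract transition function is a weighted sum of the transitions of each of the ground states that map to the same abstract state:
$$\mathcal{T}_A(s, a, s') = \sum_{g \in G(s)} \sum_{g' \in G(s')} \mathcal{T}_G(g, a, g')\, \omega(g).$$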
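The weighted-sum definitions of the abstract reward and transition functions can be sketched in code as follows. This is a minimal illustration under assumed conventions: `phi` and `omega` are plain dictionaries over ground states, the tabular encoding of the model is hypothetical, and `build_abstract_mdp` is a name introduced here, not one from the paper.

```python
from typing import Dict, List, Tuple

Transitions = Dict[str, Dict[str, Dict[str, float]]]
Rewards = Dict[str, Dict[str, float]]


def build_abstract_mdp(T_G: Transitions, R_G: Rewards,
                       phi: Dict[str, str],
                       omega: Dict[str, float]) -> Tuple[Transitions, Rewards]:
    """Form abstract transitions and rewards as omega-weighted sums over each cluster."""
    # G[s_A] lists the ground states that phi maps to abstract state s_A.
    G: Dict[str, List[str]] = {}
    for g, s_a in phi.items():
        G.setdefault(s_a, []).append(g)

    T_A: Transitions = {}
    R_A: Rewards = {}
    for s_a, members in G.items():
        # The weights must form a probability distribution within each cluster.
        assert abs(sum(omega[g] for g in members) - 1.0) < 1e-9
        R_A[s_a], T_A[s_a] = {}, {}
        for a in R_G[members[0]]:
            R_A[s_a][a] = sum(R_G[g][a] * omega[g] for g in members)
            row: Dict[str, float] = {}
            for g in members:
                for g_next, p in T_G[g][a].items():
                    # Probability mass flows to the abstract state of g_next.
                    row[phi[g_next]] = row.get(phi[g_next], 0.0) + p * omega[g]
            T_A[s_a][a] = row
    return T_A, R_A


# Ground states s0 and s1 share abstract state "A"; s2 is alone in "B".
T_G = {"s0": {"a": {"s2": 1.0}},
       "s1": {"a": {"s1": 0.5, "s2": 0.5}},
       "s2": {"a": {"s2": 1.0}}}
R_G = {"s0": {"a": 0.2}, "s1": {"a": 0.4}, "s2": {"a": 1.0}}
phi = {"s0": "A", "s1": "A", "s2": "B"}
omega = {"s0": 0.5, "s1": 0.5, "s2": 1.0}
T_A, R_A = build_abstract_mdp(T_G, R_G, phi, omega)
```

In this example the abstract reward of "A" is the average of 0.2 and 0.4, and the abstract transition row for "A" still sums to one because the weights form a distribution over the cluster.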
5 Approximate State Abstraction
Here, we introduce our formal analysis of approximate state abstraction, including results bounding the error associated with these abstraction methods. In particular, we demonstrate that abstractions based on approximate $Q^*$ similarity (5.2), approximate model similarity (5.3), and approximate similarity between distributions over $Q^*$, for both Boltzmann (5.4) and multinomial (5.5) distributions, induce abstract MDPs for which the optimal policy has bounded error in the ground MDP.
We first introduce some additional notation.
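($\pi_A^*$, $\pi_G^*$): We let $\pi_A^* : \mathcal{S}_A \to \mathcal{A}$ and $\pi_G^* : \mathcal{S}_G \to \mathcal{A}$ stand for the optimal policies in the abstract and ground MDPs, respectively.
($\pi_{GA}$): Given a state $s \in \mathcal{S}_G$ and a state aggregation function, $\phi$,
$$\pi_{GA}(s) = \pi_A^*(\phi(s)).$$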
We now define types of abstraction based on functions of state–action pairs.
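($\widetilde{\phi}_{f,\varepsilon}$): Given a function $f : \mathcal{S}_G \times \mathcal{A} \to \mathbb{R}$ and a fixed non-negative $\varepsilon \in \mathbb{R}$, we define $\widetilde{\phi}_{f,\varepsilon}$ as a type of approximate state aggregation function that satisfies the following for any two ground states $s_1$, $s_2$:
$$\widetilde{\phi}_{f,\varepsilon}(s_1) = \widetilde{\phi}_{f,\varepsilon}(s_2) \implies \forall a: \left| f(s_1, a) - f(s_2, a) \right| \leq \varepsilon. \tag{9}$$
That is, when $\widetilde{\phi}_{f,\varepsilon}$ aggregates states, all aggregated states have values of $f$ within $\varepsilon$ of each other for all actions.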
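One simple way to obtain an aggregation that satisfies this criterion is to discretize each value of the function into bins of width $\varepsilon$ and group states whose bin indices agree for every action. The sketch below does this; it is an assumed construction for illustration (the function name `aggregate_by_f` and the dictionary encoding are hypothetical), and it is sufficient but not necessary, since two states that straddle a bin boundary are kept apart even when their values are within $\varepsilon$.

```python
import math
from typing import Dict


def aggregate_by_f(f: Dict[str, Dict[str, float]], epsilon: float) -> Dict[str, int]:
    """Map each state to a cluster id so that co-clustered states agree on f within epsilon.

    f[s][a] is the value of the aggregation function for state s and action a.
    States sharing floor(f / epsilon) for every action differ by less than epsilon.
    """
    cluster_ids: Dict[tuple, int] = {}
    phi: Dict[str, int] = {}
    for s in sorted(f):
        if epsilon > 0:
            key = tuple(math.floor(f[s][a] / epsilon) for a in sorted(f[s]))
        else:
            # Exact abstraction: only identical value vectors are merged.
            key = tuple(f[s][a] for a in sorted(f[s]))
        phi[s] = cluster_ids.setdefault(key, len(cluster_ids))
    return phi


f = {"s0": {"a": 0.12, "b": 0.52},
     "s1": {"a": 0.18, "b": 0.55},
     "s2": {"a": 0.12, "b": 0.95}}
phi_approx = aggregate_by_f(f, epsilon=0.1)   # s0 and s1 share a cluster
phi_exact = aggregate_by_f(f, epsilon=0.0)    # every state is its own cluster
```

Larger values of `epsilon` merge more states, which is the knob the text refers to when it describes tuning the aggressiveness of abstraction.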
Finally, we establish notation to distinguish between the ground and abstract value ($V$) and action value ($Q$) functions.
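($Q_G$, $V_G$): Let $Q_G = Q^{\pi_G^*} : \mathcal{S}_G \times \mathcal{A} \to \mathbb{R}$ and $V_G = V^{\pi_G^*} : \mathcal{S}_G \to \mathbb{R}$ denote the optimal $Q$ and optimal value functions in the ground MDP.
($Q_A$, $V_A$): Let $Q_A = Q^{\pi_A^*} : \mathcal{S}_A \times \mathcal{A} \to \mathbb{R}$ and $V_A = V^{\pi_A^*} : \mathcal{S}_A \to \mathbb{R}$ stand for the optimal $Q$ and optimal value functions in the abstract MDP.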
5.1 Main Result
We now introduce the main result of the paper.
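There exist at least four types of approximate state aggregation functions, $\widetilde{\phi}_{Q^*,\varepsilon}$, $\widetilde{\phi}_{\text{model},\varepsilon}$, $\widetilde{\phi}_{\text{bolt},\varepsilon}$ and $\widetilde{\phi}_{\text{mult},\varepsilon}$, for which the optimal policy in the abstract MDP, applied to the ground MDP, has suboptimality bounded polynomially in $\varepsilon$:
$$\forall s \in \mathcal{S}_G: \quad V_G^{\pi_G^*}(s) - V_G^{\pi_{GA}}(s) \leq 2 \varepsilon \eta_f. \tag{10}$$
Where $\eta_f$ differs between abstraction function families:
$$\eta_{Q^*} = \frac{1}{(1-\gamma)^2}, \qquad \eta_{\text{model}} = \frac{1 + \gamma\left(|\mathcal{S}_G| - 1\right)}{(1-\gamma)^3},$$
$$\eta_{\text{bolt}} = \frac{\frac{|\mathcal{A}|}{1-\gamma} + \varepsilon k_{\text{bolt}} + k_{\text{bolt}}}{(1-\gamma)^2}, \qquad \eta_{\text{mult}} = \frac{\frac{|\mathcal{A}|}{1-\gamma} + k_{\text{mult}}}{(1-\gamma)^2}.$$
For $\eta_{\text{bolt}}$ and $\eta_{\text{mult}}$, we also assume that the difference in the normalizing terms of each distribution is bounded by some non-negative constant, $k_{\text{mult}}, k_{\text{bolt}} \in \mathbb{R}$, times $\varepsilon$:
$$\left| \sum_{i} Q_G(s_1, a_i) - \sum_{j} Q_G(s_2, a_j) \right| \leq k_{\text{mult}} \times \varepsilon, \qquad \left| \sum_{i} e^{Q_G(s_1, a_i)} - \sum_{j} e^{Q_G(s_2, a_j)} \right| \leq k_{\text{bolt}} \times \varepsilon.$$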
Naturally, the value bound of Equation 10 is meaningless for $2 \varepsilon \eta_f \geq \frac{\text{RMax}}{1-\gamma} = \frac{1}{1-\gamma}$, since this is the maximum possible value in any MDP (and we assumed the range of $\mathcal{R}$ is $[0, 1]$). In light of this, observe that for $\varepsilon = 0$, all of the above bounds are exactly 0. Any value of $\varepsilon$ interpolated between these two points achieves different degrees of abstraction, with different degrees of bounded loss.
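To make the scale of the bound concrete, the sketch below evaluates it for the optimal $Q$-value family, assuming the bound $\frac{2\varepsilon}{(1-\gamma)^2}$ that Lemma 1 establishes for that family, and compares it against the largest possible value $\frac{1}{1-\gamma}$. The function names are hypothetical and the numbers are illustrative only.

```python
def q_star_loss_bound(epsilon: float, gamma: float) -> float:
    """Assumed bound for the optimal-Q family: 2 * epsilon / (1 - gamma)^2."""
    return 2.0 * epsilon / (1.0 - gamma) ** 2


def is_informative(bound: float, gamma: float) -> bool:
    """The bound carries information only below the maximum value 1 / (1 - gamma)."""
    return bound < 1.0 / (1.0 - gamma)


tight = q_star_loss_bound(epsilon=0.01, gamma=0.9)  # roughly 2, below QMax = 10
loose = q_star_loss_bound(epsilon=0.1, gamma=0.9)   # roughly 20, above QMax = 10
exact = q_star_loss_bound(epsilon=0.0, gamma=0.9)   # no loss under exact abstraction
```

With $\gamma = 0.9$, the bound stops being informative once $\varepsilon$ reaches about $0.05$, which illustrates how strongly the discount factor limits the usable range of $\varepsilon$.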
We now introduce each approximate aggregation family and prove the theorem by proving the specific value bound for each function type.
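5.2 Optimal Q Function: $\widetilde{\phi}_{Q^*,\varepsilon}$
We consider an approximate version of Li et al.’s $\phi_{Q^*}$. In our abstraction, states are aggregated together when their optimal $Q$-values are within $\varepsilon$.
($\widetilde{\phi}_{Q^*,\varepsilon}$): An approximate $Q$ function abstraction has the same form as Equation 9:
$$\widetilde{\phi}_{Q^*,\varepsilon}(s_1) = \widetilde{\phi}_{Q^*,\varepsilon}(s_2) \implies \forall a: \left| Q_G(s_1, a) - Q_G(s_2, a) \right| \leq \varepsilon.$$
When a $\widetilde{\phi}_{Q^*,\varepsilon}$ type abstraction is used to create the abstract MDP:
$$\forall s \in \mathcal{S}_G: \quad V_G^{\pi_G^*}(s) - V_G^{\pi_{GA}}(s) \leq \frac{2\varepsilon}{(1-\gamma)^2}.$$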
Proof of Lemma 1:
We first demonstrate that $Q$-values in the abstract MDP are close to $Q$-values in the ground MDP (Claim 1). We next leverage Claim 1 to demonstrate that the optimal action in the abstract MDP is nearly optimal in the ground MDP (Claim 2). Lastly, we use Claim 2 to conclude Lemma 1 (Claim 3).
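Consider a non-Markovian decision process of the same form as an MDP, $M_T = \langle \mathcal{S}_T, \mathcal{A}, \mathcal{R}_T, \mathcal{T}_T, \gamma \rangle$, parameterized by an integer $T$, such that for the first $T$ time steps the reward function, transition dynamics and state space are those of the abstract MDP, $M_A$, and after $T$ time steps the reward function, transition dynamics and state space are those of $M_G$. Thus,
$$\mathcal{S}_T = \begin{cases} \mathcal{S}_G & \text{if } T = 0, \\ \mathcal{S}_A & \text{otherwise,} \end{cases} \qquad \mathcal{R}_T(s, a) = \begin{cases} \mathcal{R}_G(s, a) & \text{if } T = 0, \\ \mathcal{R}_A(s, a) & \text{otherwise,} \end{cases}$$
$$\mathcal{T}_T(s, a, s') = \begin{cases} \mathcal{T}_G(s, a, s') & \text{if } T = 0, \\ \sum_{g \in G(s)} \mathcal{T}_G(g, a, s')\, \omega(g) & \text{if } T = 1, \\ \mathcal{T}_A(s, a, s') & \text{otherwise.} \end{cases}$$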
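The $Q$-value of state $s$ in $\mathcal{S}_T$ for action $a$ is:
$$Q_T(s, a) = \begin{cases} Q_G(s, a) & \text{if } T = 0, \\ \sum_{g \in G(s)} Q_G(g, a)\, \omega(g) & \text{if } T = 1, \\ \mathcal{R}_A(s, a) + \gamma \sum_{s_A' \in \mathcal{S}_A} \mathcal{T}_A(s, a, s_A') \max_{a'} Q_{T-1}(s_A', a') & \text{otherwise.} \end{cases}$$
We proceed by induction on $T$ to show that:
$$\forall T, s_G, a: \quad \left| Q_T(s_T, a) - Q_G(s_G, a) \right| \leq \sum_{t=0}^{T-1} \varepsilon \gamma^t,$$
where $s_T = s_G$ if $T = 0$ and $s_T = \widetilde{\phi}_{Q^*,\varepsilon}(s_G)$ otherwise.
When $T = 0$, $Q_T = Q_G$, so this base case trivially follows.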
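By definition of $Q_T$, we have that $Q_1$ is
$$Q_1(s, a) = \sum_{g \in G(s)} Q_G(g, a)\, \omega(g).$$
Since all co-aggregated states have $Q$-values within $\varepsilon$ of one another and $\omega$ induces a convex combination,
$$\left| Q_1(s_T, a) - Q_G(s_G, a) \right| \leq \varepsilon = \sum_{t=0}^{0} \varepsilon \gamma^t.$$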
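We assume as our inductive hypothesis that:
$$\forall s_G, a: \quad \left| Q_{T-1}\big(\widetilde{\phi}_{Q^*,\varepsilon}(s_G), a\big) - Q_G(s_G, a) \right| \leq \sum_{t=0}^{T-2} \varepsilon \gamma^t.$$
Consider a fixed but arbitrary state, $s_G \in \mathcal{S}_G$, and fixed but arbitrary action $a$. Since $T > 1$, $s_T$ is $\widetilde{\phi}_{Q^*,\varepsilon}(s_G)$. By definition of $Q_T(s_T, a)$, $\mathcal{R}_A$, $\mathcal{T}_A$:
$$Q_T(s_T, a) = \sum_{g \in G(s_T)} \omega(g) \left[ \mathcal{R}_G(g, a) + \gamma \sum_{g' \in \mathcal{S}_G} \mathcal{T}_G(g, a, g') \max_{a'} Q_{T-1}\big(\widetilde{\phi}_{Q^*,\varepsilon}(g'), a'\big) \right].$$
Applying our inductive hypothesis yields:
$$Q_T(s_T, a) \leq \sum_{g \in G(s_T)} \omega(g) \left[ \mathcal{R}_G(g, a) + \gamma \sum_{g' \in \mathcal{S}_G} \mathcal{T}_G(g, a, g') \max_{a'} \Big( Q_G(g', a') + \sum_{t=0}^{T-2} \varepsilon \gamma^t \Big) \right] = \gamma \sum_{t=0}^{T-2} \varepsilon \gamma^t + \sum_{g \in G(s_T)} \omega(g)\, Q_G(g, a).$$
Since all aggregated states have $Q$-values within $\varepsilon$ of one another:
$$Q_T(s_T, a) \leq \gamma \sum_{t=0}^{T-2} \varepsilon \gamma^t + \varepsilon + Q_G(s_G, a) = \sum_{t=0}^{T-1} \varepsilon \gamma^t + Q_G(s_G, a).$$
The matching lower bound follows by a symmetric argument.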
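Consider a fixed but arbitrary state, $s_G \in \mathcal{S}_G$, and its corresponding abstract state $s_A = \widetilde{\phi}_{Q^*,\varepsilon}(s_G)$. Let $a_G^*$ stand for the optimal action in $s_G$, and $a_A^*$ stand for the optimal action in $s_A$:
$$a_G^* = \arg\max_{a} Q_G(s_G, a), \qquad a_A^* = \arg\max_{a} Q_A(s_A, a).$$
The optimal action in the abstract MDP has a $Q$-value in the ground MDP that is nearly optimal:
$$V_G(s_G) \leq Q_G(s_G, a_A^*) + \frac{2\varepsilon}{1-\gamma}. \tag{16}$$
By Claim 1, which bounds $\left| Q_G(s_G, a) - Q_A(s_A, a) \right|$ by $\frac{\varepsilon}{1-\gamma}$,
$$V_G(s_G) = Q_G(s_G, a_G^*) \leq Q_A(s_A, a_G^*) + \frac{\varepsilon}{1-\gamma}.$$
By the definition of $a_A^*$, we know that
$$Q_A(s_A, a_G^*) + \frac{\varepsilon}{1-\gamma} \leq Q_A(s_A, a_A^*) + \frac{\varepsilon}{1-\gamma}.$$
Lastly, again by Claim 1, we know
$$Q_A(s_A, a_A^*) + \frac{\varepsilon}{1-\gamma} \leq Q_G(s_G, a_A^*) + \frac{2\varepsilon}{1-\gamma}.$$
Therefore, Equation 16 follows.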
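Consider the policy for $M_G$ of following the optimal abstract policy $\pi_A^*$ for $t$ steps and then following the optimal ground policy $\pi_G^*$ in $M_G$:
$$\pi_{A,t}(s) = \begin{cases} \pi_G^*(s) & \text{if } t = 0, \\ \pi_{GA}(s) & \text{if } t > 0. \end{cases}$$
For $t > 0$, the value of this policy for $s_G \in \mathcal{S}_G$ in the ground MDP is:
$$V_G^{\pi_{A,t}}(s_G) = \mathcal{R}_G\big(s_G, \pi_{A,t}(s_G)\big) + \gamma \sum_{s_G' \in \mathcal{S}_G} \mathcal{T}_G\big(s_G, \pi_{A,t}(s_G), s_G'\big)\, V_G^{\pi_{A,t-1}}(s_G').$$
For $t = 0$, $V_G^{\pi_{A,t}}(s_G)$ is simply $V_G(s_G)$.
We now show by induction on $t$ that
$$\forall t, s_G: \quad V_G(s_G) \leq V_G^{\pi_{A,t}}(s_G) + \sum_{i=0}^{t-1} \gamma^i \frac{2\varepsilon}{1-\gamma}.$$
By definition, when $t = 0$, $V_G^{\pi_{A,t}} = V_G$, so our bound trivially holds in this case.
Consider a fixed but arbitrary state $s_G \in \mathcal{S}_G$. We assume for our inductive hypothesis that
$$\forall s \in \mathcal{S}_G: \quad V_G(s) \leq V_G^{\pi_{A,t}}(s) + \sum_{i=0}^{t-1} \gamma^i \frac{2\varepsilon}{1-\gamma}.$$
Applying our inductive hypothesis yields:
$$V_G^{\pi_{A,t+1}}(s_G) \geq \mathcal{R}_G\big(s_G, \pi_{GA}(s_G)\big) + \gamma \sum_{s_G' \in \mathcal{S}_G} \mathcal{T}_G\big(s_G, \pi_{GA}(s_G), s_G'\big)\, V_G(s_G') - \gamma \sum_{i=0}^{t-1} \gamma^i \frac{2\varepsilon}{1-\gamma} = Q_G\big(s_G, \pi_{GA}(s_G)\big) - \gamma \sum_{i=0}^{t-1} \gamma^i \frac{2\varepsilon}{1-\gamma}.$$
Applying Claim 2 yields:
$$V_G^{\pi_{A,t+1}}(s_G) \geq V_G(s_G) - \frac{2\varepsilon}{1-\gamma} - \gamma \sum_{i=0}^{t-1} \gamma^i \frac{2\varepsilon}{1-\gamma} = V_G(s_G) - \sum_{i=0}^{t} \gamma^i \frac{2\varepsilon}{1-\gamma}.$$
Since $s_G$ was arbitrary, we conclude that our bound holds for all states in $\mathcal{S}_G$ for the inductive case. Thus, from our base case and induction, we conclude that
$$\forall t, s_G: \quad V_G(s_G) \leq V_G^{\pi_{A,t}}(s_G) + \sum_{i=0}^{t-1} \gamma^i \frac{2\varepsilon}{1-\gamma}.$$
Note that as $t \to \infty$, $\sum_{i=0}^{t-1} \gamma^i \frac{2\varepsilon}{1-\gamma} \to \frac{2\varepsilon}{(1-\gamma)^2}$ by the sum of infinite geometric series and $\pi_{A,t}(s) \to \pi_{GA}(s)$. Thus, we conclude Lemma 1. ∎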
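As a small sanity check of Lemma 1, the sketch below evaluates two fixed policies in a three-state ground MDP and compares the resulting loss with $\frac{2\varepsilon}{(1-\gamma)^2}$. Everything in the example is an assumption made for illustration: the MDP is invented, and the policy `pi_ga` was worked out by hand by aggregating states `a1` and `a2` (whose optimal $Q$-values differ by at most $\varepsilon = 0.04$) with uniform weights, under which the abstract MDP prefers `wait`.

```python
from typing import Dict


def evaluate_policy(T, R, gamma: float, policy: Dict[str, str],
                    tol: float = 1e-10) -> Dict[str, float]:
    """Iterative policy evaluation for a fixed deterministic policy."""
    V = {s: 0.0 for s in R}
    while True:
        delta = 0.0
        for s in R:
            a = policy[s]
            new_v = R[s][a] + gamma * sum(p * V[n] for n, p in T[s][a].items())
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V


def max_policy_loss(T, R, gamma, pi_opt, pi_ga) -> float:
    """Largest value gap between the optimal policy and the lifted abstract policy."""
    V_opt = evaluate_policy(T, R, gamma, pi_opt)
    V_ga = evaluate_policy(T, R, gamma, pi_ga)
    return max(V_opt[s] - V_ga[s] for s in R)


gamma, epsilon = 0.9, 0.04
T_G = {s: {"go": {"goal": 1.0}, "wait": {"goal": 1.0}}
       for s in ("a1", "a2", "goal")}
R_G = {"a1": {"go": 0.50, "wait": 0.48},
       "a2": {"go": 0.49, "wait": 0.52},
       "goal": {"go": 1.0, "wait": 1.0}}
pi_opt = {"a1": "go", "a2": "wait", "goal": "go"}   # optimal in the ground MDP
pi_ga = {"a1": "wait", "a2": "wait", "goal": "go"}  # hand-derived lifted policy
loss = max_policy_loss(T_G, R_G, gamma, pi_opt, pi_ga)
bound = 2.0 * epsilon / (1.0 - gamma) ** 2
```

Here the only loss comes from taking `wait` instead of `go` in `a1`, a gap of $0.02$, which sits well inside the bound of $8$. The example suggests the bound can be loose in practice, though a single hand-built MDP is not evidence of that in general.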
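5.3 Model Similarity: $\widetilde{\phi}_{\text{model},\varepsilon}$
Now, consider an approximate version of Li et al.’s $\phi_{\text{model}}$, where states are aggregated together when their rewards and transitions are within $\varepsilon$.
($\widetilde{\phi}_{\text{model},\varepsilon}$): We let $\widetilde{\phi}_{\text{model},\varepsilon}$ define a type of abstraction that, for fixed $\varepsilon$, satisfies:
$$\widetilde{\phi}_{\text{model},\varepsilon}(s_1) = \widetilde{\phi}_{\text{model},\varepsilon}(s_2) \implies \forall a: \left| \mathcal{R}_G(s_1, a) - \mathcal{R}_G(s_2, a) \right| \leq \varepsilon \ \text{ and } \ \forall s_A' \in \mathcal{S}_A: \left| \sum_{s_G' \in G(s_A')} \left[ \mathcal{T}_G(s_1, a, s_G') - \mathcal{T}_G(s_2, a, s_G') \right] \right| \leq \varepsilon.$$
When $\mathcal{S}_A$ is created using a $\widetilde{\phi}_{\text{model},\varepsilon}$ type:
$$\forall s \in \mathcal{S}_G: \quad V_G^{\pi_G^*}(s) - V_G^{\pi_{GA}}(s) \leq \frac{2\varepsilon + 2\gamma\varepsilon\left(|\mathcal{S}_G| - 1\right)}{(1-\gamma)^3}.$$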
Proof of Lemma 2:
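Let $B$ be the maximum $Q$-value difference between any pair of ground states in the same abstract state for $\widetilde{\phi}_{\text{model},\varepsilon}$:
$$B = \max_{s_1, s_2, a} \left| Q_G(s_1, a) - Q_G(s_2, a) \right|,$$
where $s_1, s_2 \in G(s_A)$. First, we expand:
$$B = \max_{s_1, s_2, a} \left| \mathcal{R}_G(s_1, a) - \mathcal{R}_G(s_2, a) + \gamma \sum_{s_G' \in \mathcal{S}_G} \left[ \big( \mathcal{T}_G(s_1, a, s_G') - \mathcal{T}_G(s_2, a, s_G') \big) \max_{a'} Q_G(s_G', a') \right] \right|.$$
Since difference of rewards is bounded by $\varepsilon$:
$$B \leq \varepsilon + \gamma \max_{s_1, s_2, a} \left| \sum_{s_A' \in \mathcal{S}_A} \sum_{s_G' \in G(s_A')} \left[ \big( \mathcal{T}_G(s_1, a, s_G') - \mathcal{T}_G(s_2, a, s_G') \big) \max_{a'} Q_G(s_G', a') \right] \right|.$$
By similarity of transitions under $\widetilde{\phi}_{\text{model},\varepsilon}$:
$$B \leq \varepsilon + \gamma\, \text{QMax} \sum_{s_A' \in \mathcal{S}_A} \varepsilon \leq \varepsilon + \gamma\, |\mathcal{S}_G|\, \varepsilon\, \text{QMax}.$$
Recall that QMax $= \frac{\text{RMax}}{1-\gamma}$, and we defined RMax $= 1$:
$$B \leq \varepsilon + \frac{\gamma\, |\mathcal{S}_G|\, \varepsilon}{1-\gamma} = \frac{\varepsilon + \gamma \left( |\mathcal{S}_G| - 1 \right) \varepsilon}{1-\gamma}.$$
Since the $Q$-values of ground states grouped under $\widetilde{\phi}_{\text{model},\varepsilon}$ differ by at most $B$, we can understand $\widetilde{\phi}_{\text{model},\varepsilon}$ as a type of $\widetilde{\phi}_{Q^*,B}$. Applying Lemma 1 yields Lemma 2. ∎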
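The model similarity criterion can be checked directly from the ground model without solving for $Q$-values. The sketch below tests whether a given aggregation keeps rewards within $\varepsilon$ and keeps the total transition mass into each abstract state within $\varepsilon$ for every pair of co-aggregated states. The function name, the dictionary encoding, and the example MDP are assumptions for illustration rather than the paper's own procedure.

```python
from typing import Dict, List


def satisfies_model_similarity(T_G, R_G, phi: Dict[str, str], epsilon: float) -> bool:
    """Return True if every pair of co-aggregated states is epsilon-similar in model."""
    clusters: Dict[str, List[str]] = {}
    for s, c in phi.items():
        clusters.setdefault(c, []).append(s)

    for members in clusters.values():
        for i, s1 in enumerate(members):
            for s2 in members[i + 1:]:
                for a in R_G[s1]:
                    # Rewards must agree within epsilon.
                    if abs(R_G[s1][a] - R_G[s2][a]) > epsilon:
                        return False
                    # Transition mass into each abstract state must agree within epsilon.
                    for target in clusters.values():
                        mass1 = sum(T_G[s1][a].get(g, 0.0) for g in target)
                        mass2 = sum(T_G[s2][a].get(g, 0.0) for g in target)
                        if abs(mass1 - mass2) > epsilon:
                            return False
    return True


# a1 and a2 "wait" in different ground states, but both stay inside cluster "c".
T_G = {"a1": {"go": {"goal": 1.0}, "wait": {"a1": 1.0}},
       "a2": {"go": {"goal": 1.0}, "wait": {"a2": 1.0}},
       "goal": {"go": {"goal": 1.0}, "wait": {"goal": 1.0}}}
R_G = {"a1": {"go": 0.53, "wait": 0.0},
       "a2": {"go": 0.55, "wait": 0.0},
       "goal": {"go": 1.0, "wait": 1.0}}
phi = {"a1": "c", "a2": "c", "goal": "g"}
ok_loose = satisfies_model_similarity(T_G, R_G, phi, epsilon=0.05)
ok_tight = satisfies_model_similarity(T_G, R_G, phi, epsilon=0.01)
```

Note that transitions are compared on mass into abstract states, not on individual ground successors, so `a1` and `a2` count as similar under `wait` even though they move to different ground states.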
5.4 Boltzmann over Optimal Q: $\widetilde{\phi}_{\text{bolt},\varepsilon}$
Here, we introduce $\widetilde{\phi}_{\text{bolt},\varepsilon}$, which aggregates states with similar Boltzmann distributions on $Q$-values. This type of abstraction is appealing as Boltzmann distributions balance exploration and exploitation. We find this type particularly interesting for abstraction purposes as, unlike $\widetilde{\phi}_{Q^*,\varepsilon}$, it allows for aggregation when $Q$-value ratios are similar but their magnitudes are different.
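($\widetilde{\phi}_{\text{bolt},\varepsilon}$): We let $\widetilde{\phi}_{\text{bolt},\varepsilon}$ define a type of abstraction that, for fixed $\varepsilon$, satisfies:
$$\widetilde{\phi}_{\text{bolt},\varepsilon}(s_1) = \widetilde{\phi}_{\text{bolt},\varepsilon}(s_2) \implies \forall a: \left| \frac{e^{Q_G(s_1, a)}}{\sum_{b} e^{Q_G(s_1, b)}} - \frac{e^{Q_G(s_2, a)}}{\sum_{b} e^{Q_G(s_2, b)}} \right| \leq \varepsilon.$$
We also assume that the difference in normalizing terms is bounded by some non-negative constant, $k_{\text{bolt}} \in \mathbb{R}$, times $\varepsilon$:
$$\left| \sum_{b} e^{Q_G(s_1, b)} - \sum_{b} e^{Q_G(s_2, b)} \right| \leq k_{\text{bolt}} \times \varepsilon.$$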
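For intuition, the Boltzmann criterion can be sketched as follows: compute the Boltzmann distribution over each state's optimal $Q$-values and compare the two distributions action by action. The helper names and the example $Q$-values are hypothetical. The example also shows the property highlighted in the text: two states whose $Q$-values differ in magnitude by a constant offset induce the same distribution and can be aggregated, even though their raw $Q$-values are far apart.

```python
import math
from typing import Dict


def boltzmann(q_values: Dict[str, float]) -> Dict[str, float]:
    """Boltzmann (softmax) distribution over actions for one state's Q-values."""
    normalizer = sum(math.exp(q) for q in q_values.values())
    return {a: math.exp(q) / normalizer for a, q in q_values.items()}


def boltzmann_similar(q1: Dict[str, float], q2: Dict[str, float],
                      epsilon: float) -> bool:
    """True if the two Boltzmann distributions differ by at most epsilon per action."""
    p1, p2 = boltzmann(q1), boltzmann(q2)
    return all(abs(p1[a] - p2[a]) <= epsilon for a in p1)


q_low = {"left": 1.0, "right": 2.0}
q_high = {"left": 5.0, "right": 6.0}     # same preferences, larger magnitude
q_flipped = {"left": 2.0, "right": 1.0}  # opposite preference

same_shape = boltzmann_similar(q_low, q_high, epsilon=1e-6)
opposite = boltzmann_similar(q_low, q_flipped, epsilon=0.1)
```

The normalizers of `q_low` and `q_high` differ considerably, which is why the analysis also needs the separate assumption bounding the difference in normalizing terms.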
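When $\mathcal{S}_A$ is created using a function of the $\widetilde{\phi}_{\text{bolt},\varepsilon}$ type, for some non-negative constant $k_{\text{bolt}} \in \mathbb{R}$:
$$\forall s \in \mathcal{S}_G: \quad V_G^{\pi_G^*}(s) - V_G^{\pi_{GA}}(s) \leq \frac{2\varepsilon \left( \frac{|\mathcal{A}|}{1-\gamma} + \varepsilon k_{\text{bolt}} + k_{\text{bolt}} \right)}{(1-\gamma)^2}.$$
We use the approximation for $e^x$, with $\delta$ error:
$$e^x = 1 + x + \delta \approx 1 + x.$$
We let $\delta_1$ denote the error in approximating $e^{Q_G(s_1, a)}$ and $\delta_2$ denote the error in approximating $e^{Q_G(s_2, a)}$.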
Proof of Lemma 3:
Either term is positive or negative. First suppose the former. It follows by algebra that:
When is the negative case, it follows that:
By similar algebra that yielded Equation 34: