Optimal and Approximate Q-value Functions
for Decentralized POMDPs
Abstract
Decision-theoretic planning is a popular approach to sequential decision making problems, because it treats uncertainty in sensing and acting in a principled way. In single-agent frameworks like MDPs and POMDPs, planning can be carried out by resorting to Q-value functions: an optimal Q-value function $Q^*$ is computed in a recursive manner by dynamic programming, and then an optimal policy is extracted from $Q^*$. In this paper we study whether similar Q-value functions can be defined for decentralized POMDP models (Dec-POMDPs), and how policies can be extracted from such value functions. We define two forms of the optimal Q-value function for Dec-POMDPs: one that gives a normative description as the Q-value function of an optimal pure joint policy and another one that is sequentially rational and thus gives a recipe for computation. This computation, however, is infeasible for all but the smallest problems. Therefore, we analyze various approximate Q-value functions that allow for efficient computation. We describe how they relate, and we prove that they all provide an upper bound to the optimal Q-value function $Q^*$. Finally, unifying some previous approaches for solving Dec-POMDPs, we describe a family of algorithms for extracting policies from such Q-value functions, and perform an experimental evaluation on existing test problems, including a new firefighting benchmark problem.
Journal of Artificial Intelligence Research 32 (2008) 289–353. Submitted 09/07; published 05/08. Short heading: Optimal and Approximate Q-Value Functions for Dec-POMDPs. Oliehoek, Spaan & Vlassis.
1 Introduction
One of the main goals in artificial intelligence (AI) is the development of intelligent agents, which perceive their environment through sensors and influence the environment through their actuators. In this setting, an essential problem is how an agent should decide which action to perform in a certain situation. In this work, we focus on planning: constructing a plan that specifies which action to take in each situation the agent might encounter over time. In particular, we will focus on planning in a cooperative multiagent system (MAS): an environment in which multiple agents coexist and interact in order to perform a joint task. We will adopt a decision-theoretic approach, which allows us to tackle uncertainty in sensing and acting in a principled way.
Decision-theoretic planning has roots in control theory and in operations research. In control theory, one or more controllers control a stochastic system with a specific output as goal. Operations research considers tasks related to scheduling, logistics and work flow and tries to optimize the systems concerned. Decision-theoretic planning problems can be formalized as Markov decision processes (MDPs), which have been frequently employed in both control theory and operations research, and have also been adopted by AI for planning in stochastic environments. In all these fields the goal is to find a (conditional) plan, or policy, that is optimal with respect to the desired behavior. Traditionally, the main focus has been on systems with only one agent or controller, but in the last decade interest in systems with multiple agents or decentralized control has grown.
A different, but related, field is that of game theory. Game theory considers agents, called players, interacting in a dynamic, potentially stochastic process, the game. The goal here is to find optimal strategies for the agents, which specify how they should play and therefore correspond to policies. In contrast to decision-theoretic planning, game theory has always considered multiple agents, and as a consequence several ideas and concepts from game theory are now being applied in decentralized decision-theoretic planning. In this work we apply game-theoretic models to decision-theoretic planning for multiple agents.
1.1 Decision-Theoretic Planning
In the last decades, the Markov decision process (MDP) framework has gained in popularity in the AI community as a model for planning under uncertainty [Boutilier et al., 1999; Guestrin et al., 2003]. MDPs can be used to formalize a discrete-time planning task of a single agent in a stochastically changing environment, on the condition that the agent can observe the state of the environment. Every time step the state changes stochastically, but the agent chooses an action that selects a particular transition function. Taking an action $a$ from a particular state $s^t$ at time step $t$ induces a probability distribution over states $s^{t+1}$ at time step $t+1$.
The agent’s objective can be formulated in several ways. The first type of objective of an agent is reaching a specific goal state, for example in a maze in which the agent’s goal is to reach the exit. A different formulation is given by associating a certain cost with the execution of a particular action in a particular state, in which case the goal will be to minimize the expected total cost. Alternatively, one can associate rewards with actions performed in a certain state, the goal being to maximize the total reward.
When the agent knows the probabilities of the state transitions, i.e., when it knows the model, it can contemplate the expected transitions over time and construct a plan that is most likely to reach a specific goal state, minimizes the expected costs or maximizes the expected reward. This stands in some contrast to reinforcement learning (RL) [Sutton & Barto, 1998], where the agent does not have a model of the environment, but has to learn good behavior by repeatedly interacting with the environment. Reinforcement learning can be seen as the combined task of learning the model of the environment and planning, although in practice it is often not necessary to explicitly recover the environment model. In this article we focus only on planning, but consider two factors that complicate computing successful plans: the inability of the agent to observe the state of the environment as well as the presence of multiple agents.
In the real world an agent might not be able to determine exactly what the state of the environment is, because the agent's sensors are noisy and/or limited. When sensors are noisy, an agent can receive faulty or inaccurate observations with some probability. When sensors are limited the agent is unable to observe the differences between states that cannot be detected by the sensor, e.g., the presence or absence of an object outside a laser range-finder's field of view. When the same sensor reading might require different action choices, this phenomenon is referred to as perceptual aliasing. In order to deal with the introduced sensor uncertainty, a partially observable Markov decision process (POMDP) extends the MDP model by incorporating observations and their probability of occurrence conditional on the state of the environment [Kaelbling et al., 1998].
The other complicating factor we consider is the presence of multiple agents. Instead of planning for a single agent we now plan for a team of cooperative agents. We assume that communication within the team is not possible. (Footnote 1: As it turns out, the framework we consider can also model communication with a particular cost that is subject to minimization [Pynadath & Tambe, 2002; Goldman & Zilberstein, 2004]. The non-communicative setting can be interpreted as the special case with infinite cost.) A major problem in this setting is how the agents should coordinate their actions. In particular, as the agents are not assumed to observe the state (each agent only knows its own observations received and actions taken), there is no common signal they can condition their actions on. Note that this problem comes in addition to the problem of partial observability, and is not a substitute for it; even if the agents could freely and instantaneously communicate their individual observations, the joint observations would not disambiguate the true state.
One option is to consider each agent separately, and have each such agent maintain an explicit model of the other agents. This is the approach chosen in the Interactive POMDP (I-POMDP) framework [Gmytrasiewicz & Doshi, 2005]. A problem in this approach, however, is that the other agents also model the considered agent, leading to an infinite recursion of beliefs regarding the behavior of agents. We will adopt the decentralized partially observable Markov decision process (Dec-POMDP) model for this class of problems [Bernstein et al., 2002]. A Dec-POMDP is a generalization to multiple agents of a POMDP and can be used to model a team of cooperative agents that are situated in a stochastic, partially observable environment.
The single-agent MDP setting has received much attention, and many results are known. In particular it is known that an optimal plan, or policy, can be extracted from the optimal action-value, or Q-value, function $Q^*(s,a)$, and that the latter can be calculated efficiently. For POMDPs, similar results are available, although finding an optimal solution is harder (PSPACE-complete for finite-horizon problems, Papadimitriou & Tsitsiklis, 1987).
On the other hand, for Dec-POMDPs relatively little is known except that they are provably intractable (NEXP-complete, Bernstein et al., 2002). In particular, an outstanding issue is whether Q-value functions can be defined for Dec-POMDPs just as in (PO)MDPs, and whether policies can be extracted from such Q-value functions. Currently most algorithms for planning in Dec-POMDPs are based on some version of policy search [Nair et al., 2003b; Hansen et al., 2004; Szer et al., 2005; Varakantham et al., 2007], and a proper theory for Q-value functions in Dec-POMDPs is still lacking. Given the wide range of applications of value functions in single-agent decision-theoretic planning, we expect that such a theory for Dec-POMDPs can have great benefits, both in terms of providing insight as well as guiding the design of solution algorithms.
1.2 Contributions
In this paper we develop theory for Q-value functions in Dec-POMDPs, showing that an optimal Q-function $Q^*$ can be defined for a Dec-POMDP. We define two forms of the optimal Q-value function for Dec-POMDPs: one that gives a normative description as the Q-value function of an optimal pure joint policy and another one that is sequentially rational and thus gives a recipe for computation. We also show that given $Q^*$, an optimal policy can be computed by forward-sweep policy computation, solving a sequence of Bayesian games forward through time (i.e., from the first to the last time step), thereby extending the solution technique of Emery-Montemerlo et al. (2004) to the exact setting.
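Computation of $Q^*$ is infeasible for all but the smallest problems. Therefore, we analyze three different approximate Q-value functions, $Q_{\mathrm{MDP}}$, $Q_{\mathrm{POMDP}}$ and $Q_{\mathrm{BG}}$, that can be computed more efficiently and which constitute upper bounds to $Q^*$. We also describe a generalized form of $Q_{\mathrm{BG}}$ that includes $Q_{\mathrm{POMDP}}$, $Q_{\mathrm{BG}}$ and $Q^*$. This is used to prove a hierarchy of upper bounds: $Q^* \leq Q_{\mathrm{BG}} \leq Q_{\mathrm{POMDP}} \leq Q_{\mathrm{MDP}}$.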
Next, we show how these approximate Q-value functions can be used to compute optimal or suboptimal policies. We describe a generic policy search algorithm, which we dub Generalized MAA* (GMAA*) as it is a generalization of the MAA* algorithm by Szer et al. (2005), that can be used for extracting a policy from an approximate Q-value function. By varying the implementation of a subroutine, this algorithm unifies MAA* and forward-sweep policy computation, and thus the approach of Emery-Montemerlo et al. (2004).
Finally, in an experimental evaluation we examine the differences between $Q_{\mathrm{MDP}}$, $Q_{\mathrm{POMDP}}$, $Q_{\mathrm{BG}}$ and $Q^*$ for several problems. We also experimentally verify the potential benefit of tighter heuristics, by testing different settings of GMAA* on some well-known test problems and on a new benchmark problem involving firefighting agents.
This article is based on previous work by Oliehoek and Vlassis (2007), abbreviated OV here, and contains several new contributions: (1) Contrary to the OV work, the current work includes a section on the sequentially rational description of $Q^*$ and suggests a way to compute $Q^*$ in practice (OV only provided a normative description of $Q^*$). (2) The current work provides a formal proof of the hierarchy of upper bounds to $Q^*$ (which was only qualitatively argued in the OV paper). (3) The current article additionally contains a proof that the solutions for the Bayesian games with identical payoffs given by equation (4.2) constitute Pareto optimal Nash equilibria of the game (which was not proven in the OV paper). (4) This article contains a more extensive experimental evaluation of the derived bounds of $Q^*$, and introduces a new benchmark problem (firefighting). (5) Finally, the current article provides a more complete introduction to Dec-POMDPs and existing solution methods, as well as Bayesian games, hence it can serve as a self-contained introduction to Dec-POMDPs.
1.3 Applications
Although the field of multiagent systems in a stochastic, partially observable environment seems quite specialized and thus narrow, the application area is actually very broad. The real world is practically always partially observable due to sensor noise and perceptual aliasing. Also, in most of these domains communication is not free, but consumes resources and thus has a particular cost. Therefore models such as Dec-POMDPs, which do consider partially observable environments, are relevant for essentially all teams of embodied agents.
Example applications of this type are given by Emery-Montemerlo (2005), who considered multi-robot navigation in which a team of agents with noisy sensors has to act to find/capture a goal. Becker et al. (2004) use a multi-robot space exploration example. Here, the agents are Mars rovers and have to decide on how to proceed with their mission: whether to collect particular samples at specific sites or not. The rewards of particular samples can be sub- or superadditive, making this task non-trivial. An overview of application areas in cooperative robotics is presented by Arai et al. (2002), among which is robotic soccer, as applied in RoboCup [Kitano et al., 1997]. Another application that is investigated within this project is crisis management: RoboCup Rescue [Kitano et al., 1999] models a situation where rescue teams have to perform a search and rescue task in a crisis situation. This task has also been modeled as a partially observable system [Nair et al., 2002; Nair et al., 2003; Nair et al., 2003a; Oliehoek & Visser, 2006; Paquet et al., 2005].
There are also many other types of applications. Nair et al. (2005) and Lesser et al. (2003) give applications for distributed sensor networks (typically used for surveillance). An example of load balancing among queues is presented by Cogill et al. (2004). Here agents represent queues and can only observe the queue sizes of themselves and their immediate neighbors. They have to decide whether to accept new jobs or pass them to another queue. Another frequently considered application domain is communication networks. Peshkin (2001) treated a packet routing application in which agents are routers and have to minimize the average transfer time of packets. They are connected to immediate neighbors and have to decide at each time step to which neighbor to send each packet. Other approaches to communication networks using decentralized, stochastic, partially observable systems are given by Ooi and Wornell (1996), Tao et al. (2001) and Altman (2002).
1.4 Overview of Article
The rest of this article is organized as follows. In Section 2 we will first formally introduce the Dec-POMDP model and provide background on its components. Some existing solution methods are treated in Section 3. Then, in Section 4 we show how a Dec-POMDP can be modeled as a series of Bayesian games and how this constitutes a theory of Q-value functions for BGs. We also treat two forms of optimal Q-value functions, $Q^*$, here. Approximate Q-value functions are described in Section 5 and one of their applications is discussed in Section 6. Section 7 presents the results of the experimental evaluation. Finally, Section 8 concludes.
2 Decentralized POMDPs
In this section we define the Dec-POMDP model and discuss some of its properties. Intuitively, a Dec-POMDP models a number of agents that inhabit a particular environment, which is considered at discrete time steps, also referred to as stages [Boutilier et al., 1999] or (decision) epochs [Puterman, 1994]. The number of time steps the agents will interact with their environment is called the horizon of the decision problem, and will be denoted by $h$. In this paper the horizon is assumed to be finite. At each stage every agent takes an action and the combination of these actions influences the environment, causing a state transition. At the next time step, each agent first receives an observation of the environment, after which it has to take an action again. The probabilities of state transitions and observations are specified by the Dec-POMDP model, as are rewards received for particular actions in particular states. The transition and observation probabilities specify the dynamics of the environment, while the rewards specify what behavior is desirable. Hence, the reward model defines the agents' goal or task: the agents have to come up with a plan that maximizes the expected long-term reward signal. In this work we assume that planning takes place off-line, after which the computed plans are distributed to the agents, who then merely execute the plans on-line. That is, computation of the plan is centralized, while execution is decentralized. In the centralized planning phase, the entire model as detailed below is available. During execution each agent only knows the joint policy as found by the planning phase and its individual history of actions and observations.
2.1 Formal Model
In this section we more formally treat the basic components of a Dec-POMDP. We start by giving a mathematical definition of these components.
Definition 2.1.
A decentralized partially observable Markov decision process (Dec-POMDP) is defined as a tuple $\langle n, \mathcal{S}, \mathcal{A}, T, R, \mathcal{O}, O, h, b^0 \rangle$ where:
$n$ is the number of agents.
$\mathcal{S}$ is a finite set of states.
$\mathcal{A}$ is the set of joint actions.
$T$ is the transition function.
$R$ is the immediate reward function.
$\mathcal{O}$ is the set of joint observations.
$O$ is the observation function.
$h$ is the horizon of the problem.
$b^0 \in \mathcal{P}(\mathcal{S})$ is the initial state distribution at time $t = 0$. (Footnote 2: $\mathcal{P}(\cdot)$ denotes the set of probability distributions over $(\cdot)$.)
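The components of Definition 2.1 can be written down as a plain data structure. The following is a minimal sketch, not the authors' implementation: the class name `DecPOMDP`, the dictionary-based encoding of the transition, reward and observation functions, the helpers `sample` and `step`, and the two-state toy model are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = str
JointAction = Tuple[str, ...]       # one individual action per agent
JointObservation = Tuple[str, ...]  # one individual observation per agent


@dataclass
class DecPOMDP:
    """Hypothetical container mirroring the tuple of Definition 2.1."""
    n: int                                    # number of agents
    states: List[State]                       # finite set of states
    joint_actions: List[JointAction]          # set of joint actions
    # transition[(s, a)] maps each next state to its probability
    transition: Dict[Tuple[State, JointAction], Dict[State, float]]
    # reward[(s, a)] is the immediate reward
    reward: Dict[Tuple[State, JointAction], float]
    joint_observations: List[JointObservation]
    # observation[(a, s_next)] maps each joint observation to its probability
    observation: Dict[Tuple[JointAction, State], Dict[JointObservation, float]]
    horizon: int                              # number of stages
    b0: Dict[State, float]                    # initial state distribution


def sample(dist, rng):
    """Draw one outcome from a dict that maps outcomes to probabilities."""
    outcomes = list(dist)
    weights = [dist[x] for x in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def step(model, state, joint_action, rng):
    """Simulate one stage: reward, then next state, then joint observation."""
    reward = model.reward[(state, joint_action)]
    next_state = sample(model.transition[(state, joint_action)], rng)
    joint_obs = sample(model.observation[(joint_action, next_state)], rng)
    return next_state, joint_obs, reward


# Toy two-agent model (illustrative only): one joint action that
# deterministically flips the state, and fully informative observations.
A = ("a", "a")
toy = DecPOMDP(
    n=2,
    states=["s0", "s1"],
    joint_actions=[A],
    transition={("s0", A): {"s1": 1.0}, ("s1", A): {"s0": 1.0}},
    reward={("s0", A): 1.0, ("s1", A): 0.0},
    joint_observations=[("o0", "o0"), ("o1", "o1")],
    observation={(A, "s0"): {("o0", "o0"): 1.0},
                 (A, "s1"): {("o1", "o1"): 1.0}},
    horizon=2,
    b0={"s0": 1.0},
)
```

During execution each agent would only see its own component of the joint observation returned by `step`; the full tuple is available only to the off-line planner or a simulator.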
The Dec-POMDP model extends single-agent (PO)MDP models by considering joint actions and observations. In particular, we define $\mathcal{A} = \times_i \mathcal{A}_i$ as the set of joint actions. Here, $\mathcal{A}_i$ is the set of actions available to agent $i$. Every time step, one joint action $\mathbf{a} = \langle a_1, \ldots, a_n \rangle$ is taken. In a Dec-POMDP, agents only know their own individual action; they do not observe each other's actions. We will assume that any action $a_i \in \mathcal{A}_i$ can be selected at any time, so the set $\mathcal{A}_i$ does not depend on the stage or state of the environment. In general, we will denote the stage using superscripts, so $\mathbf{a}^t$ denotes the joint action taken at stage $t$, and $a_i^t$ is the individual action of agent $i$ taken at stage $t$. Also, we write $\mathbf{a}_{\neq i} = \langle a_1, \ldots, a_{i-1}, a_{i+1}, \ldots, a_n \rangle$ for a profile of actions for all agents but $i$.
Similarly to the set of joint actions, $\mathcal{O} = \times_i \mathcal{O}_i$ is the set of joint observations, where $\mathcal{O}_i$ is a set of observations available to agent $i$. Every time step the environment emits one joint observation $\mathbf{o} = \langle o_1, \ldots, o_n \rangle$, from which each agent $i$ only observes its own component $o_i$, as illustrated by Figure 1. Notation with respect to time and indices for observations is analogous to the notation for actions. In this paper, we will assume that the action and observation sets are finite. Infinite action and observation sets are very difficult to deal with even in the single-agent case, and to the authors' knowledge no research has been performed on this topic in the partially observable, multiagent case.
Actions and observations are the interface between the agents and their environment. The Dec-POMDP framework describes this environment by its states and transitions. This means that rather than considering a complex, typically domain-dependent model of the environment that explains how this environment works, a descriptive stance is taken: a Dec-POMDP specifies an environment model simply as the set of states $\mathcal{S}$ the environment can be in, together with the probabilities of state transitions that are dependent on executed joint actions. In particular, the transition from some state $s$ to a next state $s'$ depends stochastically on the past states and actions. This probabilistic dependence models outcome uncertainty: the fact that the outcome of an action cannot be predicted with full certainty.
An important characteristic of Dec-POMDPs is that the states possess the Markov property. That is, the probability of a particular next state depends on the current state and joint action, but not on the whole history:
$$P(s^{t+1} \mid s^t, \mathbf{a}^t, s^{t-1}, \mathbf{a}^{t-1}, \ldots, s^0, \mathbf{a}^0) = P(s^{t+1} \mid s^t, \mathbf{a}^t). \qquad (2.1)$$
Also, we will assume that the transition probabilities are stationary, meaning that they are independent of the stage $t$.
In a way similar to how the transition model $T$ describes the stochastic influence of actions on the environment, the observation model $O$ describes how the state of the environment is perceived by the agents. Formally, $O$ is the observation function, a mapping from joint actions and successor states to probability distributions over joint observations: $O : \mathcal{A} \times \mathcal{S} \rightarrow \mathcal{P}(\mathcal{O})$. I.e., it specifies
$$P(\mathbf{o}^t \mid \mathbf{a}^{t-1}, s^t). \qquad (2.2)$$
This implies that the observation model also satisfies the Markov property (there is no dependence on the history). Also the observation model is assumed stationary: there is no dependence on the stage $t$.
The literature has identified different categories of observability [Pynadath & Tambe, 2002; Goldman & Zilberstein, 2004]. When the observation function is such that the individual observation of every agent always uniquely identifies the true state, the problem is considered fully or individually observable. In such a case, a Dec-POMDP effectively reduces to a multiagent MDP (MMDP) as described by Boutilier (1996). The other extreme is when the problem is non-observable, meaning that none of the agents observes any useful information. This is modeled by having the agents always receive a null observation. Under non-observability agents can only employ an open-loop plan. Between these two extremes there are partially observable problems. One more special case has been identified, namely the case where not the individual, but the joint observation identifies the true state. This case is referred to as jointly or collectively observable. A jointly observable Dec-POMDP is referred to as a Dec-MDP.
The reward function $R$ is used to specify the goal of the agents and is a function of states and joint actions. In particular, a desirable sequence of joint actions should correspond to a high 'long-term' reward, formalized as the return.
Definition 2.2.
Let the return or cumulative reward of a Dec-POMDP be defined as the total of the rewards received during an execution:
$$r(0) + r(1) + \cdots + r(h-1), \qquad (2.3)$$
where $r(t)$ is the reward received at time step $t$.
When, at stage $t$, the state is $s^t$ and the taken joint action is $\mathbf{a}^t$, we have that $r(t) = R(s^t, \mathbf{a}^t)$. Therefore, given the sequence of states and taken joint actions, it is straightforward to determine the return by substitution of $r(t)$ by $R(s^t, \mathbf{a}^t)$ in (2.3).
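As a small illustration of this substitution, the return of one execution can be computed by summing the immediate rewards along the sequence of states and joint actions. This is a sketch under assumptions: the function name `cumulative_reward` and the reward table below are hypothetical and do not come from any benchmark in the paper.

```python
def cumulative_reward(reward_table, states, joint_actions):
    """Sum the immediate rewards over one execution.

    reward_table[(s, a)] gives the immediate reward for taking joint
    action a in state s; states[t] and joint_actions[t] are the state
    and the joint action at stage t, for t = 0, ..., h-1.
    """
    if len(states) != len(joint_actions):
        raise ValueError("need exactly one joint action per stage")
    return sum(reward_table[(s, a)] for s, a in zip(states, joint_actions))


# Hypothetical reward table for a two-agent problem with one joint action.
A = ("a", "a")
reward_table = {("s0", A): 1.0, ("s1", A): -2.0}

# One execution of horizon h = 3 visiting s0, s1, s0.
total = cumulative_reward(reward_table, ["s0", "s1", "s0"], [A, A, A])
```

The planning criterion used in the paper is the expectation of this quantity over state and joint-action sequences, not the value for any single execution.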
In this paper we consider as optimality criterion the expected cumulative reward, where the expectation refers to the expectation over sequences of states and executed joint actions. The planning problem is to find a conditional plan, or policy, for each agent to maximize the optimality criterion. In the Dec-POMDP case this amounts to finding a tuple of policies, called a joint policy, that maximizes the expected cumulative reward.
Note that, in a Dec-POMDP, the agents are assumed not to observe the immediate rewards: observing the immediate rewards could convey information regarding the true state which is not present in the received observations, which is undesirable as all information available to the agents should be modeled in the observations. When planning for Dec-POMDPs the only thing that matters is the expectation of the cumulative future reward, which is available in the off-line planning phase, not the actual reward obtained. Indeed, it is not even assumed that the actual reward can be observed at the end of the episode.
Summarizing, in this work we consider Dec-POMDPs with finite action and observation sets and a finite planning horizon. Furthermore, we consider the general Dec-POMDP setting, without any simplifying assumptions on the observation, transition, or reward models.
2.2 Example: Decentralized Tiger Problem
Here we will describe the decentralized tiger problem introduced by Nair et al. (2003b). This test problem has been frequently used [Nair et al., 2003b; Emery-Montemerlo et al., 2004, 2005; Szer et al., 2005] and is a modification of the (single-agent) tiger problem [Kaelbling et al., 1998]. It concerns two agents that are standing in a hallway with two doors. Behind one of the doors is a tiger, behind the other a treasure. Therefore there are two states: the tiger is behind the left door ($s_l$) or behind the right door ($s_r$). Both agents have 3 actions at their disposal: open the left door ($a_{OL}$), open the right door ($a_{OR}$) and listen ($a_{Li}$). But they cannot observe each other's actions. In fact, they can only receive 2 observations: either they hear a sound left ($o_{HL}$) or right ($o_{HR}$).
At $t = 0$ the state is $s_l$ or $s_r$ with probability $0.5$. As long as no agent opens a door the state does not change; when a door is opened, the state resets to $s_l$ or $s_r$ with probability $0.5$. The full transition, observation and reward model are listed by Nair et al. (2003b). The observation probabilities are independent, and identical for both agents. For instance, when the state is $s_l$ and both perform action $a_{Li}$, both agents have an 85% chance of observing $o_{HL}$, and the probability of both hearing the tiger left is $0.85 \cdot 0.85 \approx 0.72$.
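The product rule behind this number can be checked with a few lines of code. This is only a sketch of the listening case described above: the 85% accuracy is taken from the text, while the function names and the string labels for states and observations are assumptions, and the remaining parts of the model (rewards, resets after opening a door) are deliberately left out.

```python
from itertools import product

P_CORRECT = 0.85  # chance that a listening agent hears the tiger on its true side


def individual_obs_prob(obs, tiger_side):
    """Probability that one listening agent hears `obs` given the tiger's side."""
    return P_CORRECT if obs == tiger_side else 1.0 - P_CORRECT


def joint_obs_prob(joint_obs, tiger_side):
    """Joint observation probability when both agents listen.

    Observations are independent across agents, so the joint probability
    is the product of the individual probabilities.
    """
    p = 1.0
    for obs in joint_obs:
        p *= individual_obs_prob(obs, tiger_side)
    return p


# All four joint observations when the tiger is behind the left door.
sides = ("left", "right")
table = {jo: joint_obs_prob(jo, "left") for jo in product(sides, repeat=2)}
```

Here `table[("left", "left")]` is the probability that both agents hear the tiger on the left, and the four entries of `table` sum to one.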
When the agents open the door for the treasure they receive a positive reward, while they receive a penalty for opening the wrong door. When opening the wrong door jointly, the penalty is less severe. Opening the correct door jointly leads to a higher reward.
Note that, when the wrong door is opened by one or both agents, they are attacked by the tiger and receive a penalty. However, neither of the agents observes this attack or the penalty, and the episode continues. Arguably, a more natural representation would be to have the episode end after a door is opened or to let the agents observe whether they encountered the tiger or treasure; however, this is not considered in this test problem.
2.3 Histories
As mentioned, the goal of planning in a Dec-POMDP is to find a (near-)optimal tuple of policies, and these policies specify for each agent how to act in a specific situation. Therefore, before we define a policy, we first need to define exactly what these specific situations are. In essence such situations are those parts of the history of the process that the agents can observe.
Let us first consider what the history of the process is. A Dec-POMDP with horizon $h$ specifies $h$ time steps or stages $t = 0, \ldots, h-1$. At each of these stages, there is a state $s^t$, joint observation $\mathbf{o}^t$ and joint action $\mathbf{a}^t$. Therefore, when the agents have to select their $k$-th actions (at $t = k-1$), the history of the process is a sequence of states, joint observations and joint actions, which has the following form:
$$(s^0, \mathbf{o}^0, \mathbf{a}^0, s^1, \mathbf{o}^1, \mathbf{a}^1, \ldots, s^{k-1}, \mathbf{o}^{k-1}).$$
Here $s^0$ is the initial state, drawn according to the initial state distribution $b^0$. The initial joint observation $\mathbf{o}^0$ is assumed to be the empty joint observation.
From this history of the process, the states remain unobserved and agent $i$ can only observe its own actions and observations. Therefore an agent will have to base its decision regarding which action to select on the sequence of actions and observations observed up to that point.
Definition 2.3.
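We define the action-observation history for agent $i$, $\vec{\theta}_i$, as the sequence of actions taken by and observations received by agent $i$. At a specific time step $t$, this is:
$$\vec{\theta}_i^t = (o_i^0, a_i^0, o_i^1, \ldots, a_i^{t-1}, o_i^t).$$
The joint action-observation history, $\vec{\boldsymbol{\theta}}$, is the action-observation history for all agents:
$$\vec{\boldsymbol{\theta}}^t = \langle \vec{\theta}_1^t, \ldots, \vec{\theta}_n^t \rangle.$$
Agent $i$'s set of possible action-observation histories at time $t$ is $\vec{\Theta}_i^t = \times_t (\mathcal{O}_i \times \mathcal{A}_i)$. The set of all possible action-observation histories for agent $i$ is $\vec{\Theta}_i = \cup_{t=0}^{h-1} \vec{\Theta}_i^t$. (Footnote 3: Note that in a particular Dec-POMDP, it may be the case that not all of these histories can actually be realized, because of the probabilities specified by the transition and observation model.) Finally the set of all possible joint action-observation histories is given by $\vec{\boldsymbol{\Theta}} = \cup_{t=0}^{h-1} (\vec{\Theta}_1^t \times \cdots \times \vec{\Theta}_n^t)$. At $t = 0$, the action-observation history is empty, denoted by $\vec{\theta}_\emptyset$.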
We will also use a notion of history only using the observations of an agent.
Definition 2.4.
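Formally, we define the observation history for agent $i$, $\vec{o}_i$, as the sequence of observations an agent has received. At a specific time step $t$, this is:
$$\vec{o}_i^t = (o_i^0, o_i^1, \ldots, o_i^t).$$
The joint observation history, $\vec{\mathbf{o}}$, is the observation history for all agents:
$$\vec{\mathbf{o}}^t = \langle \vec{o}_1^t, \ldots, \vec{o}_n^t \rangle.$$
The set of observation histories for agent $i$ at time $t$ is denoted $\vec{\mathcal{O}}_i^t$. Similar to the notation for action-observation histories, we also use $\vec{\mathcal{O}}_i$ and $\vec{\mathcal{O}}$, and the empty observation history is denoted $\vec{o}_\emptyset$.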
Similarly we can define the action history as follows.
Definition 2.5.
The action history for agent $i$, $\vec{a}_i$, is the sequence of actions an agent has performed. At a specific time step $t$, we write:
$$\vec{a}_i^t = (a_i^0, a_i^1, \ldots, a_i^{t-1}).$$
Notation for joint action histories and sets is analogous to that for observation histories. We also write $\vec{o}_{\neq i}$, $\vec{\theta}_{\neq i}$, etc. to denote a tuple of observation histories, action-observation histories, etc. for all agents except $i$. Finally we note that, clearly, a (joint) action-observation history consists of a (joint) action history and a (joint) observation history: $\vec{\boldsymbol{\theta}}^t = \langle \vec{\mathbf{o}}^t, \vec{\mathbf{a}}^t \rangle$.
2.4 Policies
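As discussed in the previous section, the action-observation history of an agent specifies all the information the agent has when it has to decide upon an action. For the moment we assume that an individual policy $\pi_i$ for agent $i$ is a deterministic mapping from action-observation sequences to actions.
The number of possible action-observation histories is usually very large as this set grows exponentially with the horizon of the problem. At time step $t$, there are $(|\mathcal{A}_i| \cdot |\mathcal{O}_i|)^t$ action-observation histories for agent $i$. As a consequence there are a total of
$$\sum_{t=0}^{h-1} (|\mathcal{A}_i| \cdot |\mathcal{O}_i|)^t = \frac{(|\mathcal{A}_i| \cdot |\mathcal{O}_i|)^h - 1}{(|\mathcal{A}_i| \cdot |\mathcal{O}_i|) - 1}$$
of such sequences for agent $i$. Therefore the number of policies for agent $i$ becomes:
$$|\mathcal{A}_i|^{\frac{(|\mathcal{A}_i| \cdot |\mathcal{O}_i|)^h - 1}{(|\mathcal{A}_i| \cdot |\mathcal{O}_i|) - 1}}, \qquad (2.4)$$
which is doubly exponential in the horizon $h$.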
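To get a feeling for this growth, the counts can be evaluated directly. The sketch below uses hypothetical helper names; the example numbers (3 individual actions, 2 individual observations) are the ones given for the decentralized tiger problem in Section 2.2.

```python
def num_histories_at(num_actions, num_obs, t):
    """Number of action-observation histories of one agent at stage t."""
    return (num_actions * num_obs) ** t


def total_histories(num_actions, num_obs, horizon):
    """Total number of action-observation histories over stages 0..h-1."""
    return sum(num_histories_at(num_actions, num_obs, t) for t in range(horizon))


def total_histories_closed_form(num_actions, num_obs, horizon):
    """Geometric-series closed form; requires num_actions * num_obs > 1."""
    base = num_actions * num_obs
    return (base ** horizon - 1) // (base - 1)


def num_policies(num_actions, num_obs, horizon):
    """Number of deterministic mappings from action-observation histories to actions."""
    return num_actions ** total_histories(num_actions, num_obs, horizon)


# Decentralized tiger: 3 individual actions and 2 individual observations.
for h in range(1, 5):
    print(h, total_histories(3, 2, h), num_policies(3, 2, h))
```

Already for horizon 3 there are 1 + 6 + 36 = 43 histories per agent and hence 3 to the power 43 such mappings, which illustrates why enumeration over action-observation histories is impractical.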
2.4.1 Pure and Stochastic Policies
It is possible to reduce the number of policies under consideration by realizing that many policies specify the same behavior. This is illustrated by the left side of Figure 2, which shows that under a deterministic policy only a subset of the possible action-observation histories is reached. Policies that only differ with respect to an action-observation history that is not reached in the first place manifest the same behavior. The consequence is that in order to specify a deterministic policy, the observation history suffices: when an agent takes its actions deterministically, it will be able to infer which actions it took from the observation history alone, as illustrated by the right side of Figure 2.
Definition 2.6.
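A pure or deterministic policy, $\pi_i$, for agent $i$ in a Dec-POMDP is a mapping from observation histories to actions, $\pi_i : \vec{\mathcal{O}}_i \rightarrow \mathcal{A}_i$. The set of pure policies of agent $i$ is denoted $\Pi_i$.
Note that also for pure policies we sometimes write $\pi_i(\vec{\theta}_i)$. In this case we mean the action that $\pi_i$ specifies for the observation history contained in $\vec{\theta}_i$. For instance, let $\vec{\theta}_i = \langle \vec{o}_i, \vec{a}_i \rangle$, then $\pi_i(\vec{\theta}_i) = \pi_i(\vec{o}_i)$. We use $\boldsymbol{\pi} = \langle \pi_1, \ldots, \pi_n \rangle$ to denote a joint policy, a profile specifying a policy for each agent. We say that a pure joint policy is an induced or implicit mapping from joint observation histories to joint actions, $\boldsymbol{\pi} : \vec{\mathcal{O}} \rightarrow \mathcal{A}$. That is, the mapping is induced by the individual policies $\pi_i$ that make up the joint policy. Also we use $\boldsymbol{\pi}_{\neq i}$ to denote a profile of policies for all agents but $i$.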
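The relation between observation histories and action-observation histories under a pure policy can be made concrete. In this sketch a pure individual policy is a dictionary keyed by observation histories; the representation, the function names and the particular horizon-2 policy are assumptions for illustration, and the empty initial observation is simply omitted from the tuples.

```python
from typing import Dict, Tuple

ObsHistory = Tuple[str, ...]
ActionHistory = Tuple[str, ...]

# Hypothetical pure policy for one agent over two stages.
policy: Dict[ObsHistory, str] = {
    (): "listen",                   # stage 0: nothing observed yet
    ("hear_left",): "open_right",   # stage 1
    ("hear_right",): "open_left",   # stage 1
}


def act(policy, obs_history):
    """Action the pure policy prescribes for an observation history."""
    return policy[obs_history]


def act_from_aoh(policy, aoh):
    """Evaluate the policy on an action-observation history.

    Only the observation part is used, since a pure policy maps
    observation histories to actions.
    """
    obs_history, _action_history = aoh
    return policy[obs_history]


def reconstruct_actions(policy, obs_history):
    """Recover the action history implied by an observation history.

    The action at stage t is the one prescribed for the first t observations.
    """
    return tuple(policy[obs_history[:t]] for t in range(len(obs_history)))


aoh = (("hear_left",), ("listen",))
assert act_from_aoh(policy, aoh) == act(policy, ("hear_left",))
```

`reconstruct_actions` shows why storing the actions is redundant for a pure policy: the observation history together with the policy determines them.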
Apart from pure policies, it is also possible to have the agents execute randomized policies, i.e., policies that do not always specify the same action for the same situation, but in which there is an element of chance that decides which action is performed. There are two types of randomized policies: mixed policies and stochastic policies.
Definition 2.7.
A mixed policy for an agent is a set of pure policies, along with a probability distribution over this set. Thus a mixed policy is an element of the set of probability distributions over that set of pure policies.
Definition 2.8.
A stochastic or behavioral policy for an agent is a mapping from action-observation histories to probability distributions over actions.
When considering stochastic policies, keeping track of only the observations is insufficient, as in general all action-observation histories can be realized. That is why stochastic policies are defined as a mapping from the full space of action-observation histories to probability distributions over actions. Note that we use $\pi_i$ and $\Pi_i$ to denote a policy (space) in general, so also for randomized policies. We use separate symbols for pure, mixed and stochastic policies only when there is a need to discriminate between the different types of policies.
A common way to represent the temporal structure in a policy is to split it into decision rules $\delta_i$ that specify the policy for each stage. An individual policy is then represented as a sequence of decision rules $\pi_i = (\delta_i^0, \dots, \delta_i^{h-1})$. In the case of a deterministic policy, the decision rule for stage $t$ is a mapping from length-$t$ observation histories to actions, $\delta_i^t : \vec{\mathcal{O}}_i^t \rightarrow \mathcal{A}_i$.
2.4.2 Special Cases with Simpler Policies.
There are some special cases of Dec-POMDPs in which the policy can be specified in a simpler way. Here we treat three such cases: the case in which the state is observable, the single-agent case, and the case that combines the previous two: a single agent in an environment of which it can observe the state.
The last case, a single agent in a fully observable environment, corresponds to the regular MDP setting. Because the agent can observe the state, which is Markovian, the agent does not need to remember any history, but can simply specify the decision rules of its policy as mappings from states to actions: $\delta^t : \mathcal{S} \rightarrow \mathcal{A}$. The complexity of the policy representation reduces even further in the infinite-horizon case, where an optimal policy is known to be stationary. As such, there is only one decision rule $\delta$ that is used for all stages.
The same is true for multiple agents that can observe the state, i.e., a fully observable Dec-POMDP as defined in Section 2.1. This is essentially the same setting as the multiagent Markov decision process (MMDP) introduced by Boutilier (1996). In this case, the decision rules for agent $i$'s policy are mappings from states to actions, $\delta_i^t : \mathcal{S} \rightarrow \mathcal{A}_i$, although some care needs to be taken to make sure no coordination errors occur when searching for these individual policies.
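In a POMDP, a Dec-POMDP with a single agent, the agent cannot observe the state, so it is not possible to specify a policy as a mapping from states to actions. However, it turns out that maintaining a probability distribution over states, called belief, $b \in \mathcal{P}(\mathcal{S})$, is a Markovian signal:
$$P(s^{t+1} \mid a^t, o^t, a^{t-1}, o^{t-1}, \dots, o^1, a^0, b^0) = P(s^{t+1} \mid b^t, a^t),$$
where the belief $b^t$ is defined as
$$\forall_{s^t} \quad b^t(s^t) \equiv P(s^t \mid o^t, a^{t-1}, o^{t-1}, \dots, o^1, a^0, b^0) = P(s^t \mid b^{t-1}, a^{t-1}, o^t).$$
As a result, a single agent in a partially observable environment can specify its policy as a series of mappings from the set of beliefs to actions, $\delta^t : \mathcal{P}(\mathcal{S}) \rightarrow \mathcal{A}$.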
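The belief can be maintained incrementally with Bayes' rule. The following is a minimal sketch for a single-agent POMDP, assuming (as an illustration only) that the transition model is stored as `T[s][a][s_next]` and the observation model as `O[a][s_next][o]`; these array layouts and the function name are not part of the original text.

```python
def belief_update(b, a, o, T, O):
    """Return the belief after taking action a in belief b and observing o.

    b: list of state probabilities.
    T[s][a][s_next]: transition probability (assumed layout).
    O[a][s_next][o]: observation probability (assumed layout).
    """
    n_states = len(b)
    unnormalized = []
    for s_next in range(n_states):
        # Prediction step: probability of reaching s_next from belief b.
        predicted = sum(b[s] * T[s][a][s_next] for s in range(n_states))
        # Correction step: weight by the likelihood of the observation.
        unnormalized.append(O[a][s_next][o] * predicted)
    norm = sum(unnormalized)
    if norm == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [p / norm for p in unnormalized]


# Toy model: the single action leaves the state unchanged,
# and the observation reveals the state with probability 0.85.
T_toy = [[[1.0, 0.0]], [[0.0, 1.0]]]
O_toy = [[[0.85, 0.15], [0.15, 0.85]]]
posterior = belief_update([0.5, 0.5], a=0, o=0, T=T_toy, O=O_toy)
```

The update requires the action taken and the observation received, which is exactly the information a single agent has access to during execution.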
Unfortunately, in the general case we consider, no such space-saving simplifications of the policy are possible. Even though the transition and observation model can be used to compute a joint belief, this computation requires knowledge of the joint actions and observations. During execution, the agents simply have no access to this information and thus cannot compute a joint belief.
2.4.3 The Quality of Joint Policies
Clearly, policies differ in how much reward they can expect to accumulate, which will serve as a criterion of a joint policy’s quality. Formally, we consider the expected cumulative reward of a joint policy, also referred to as its value.
Definition 2.9.
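The value of a joint policy $\pi$ is defined as
$$V(\pi) \equiv E\left[ \sum_{t=0}^{h-1} R(s^t, a^t) \,\Big|\, \pi, b^0 \right], \tag{2.5}$$
where the expectation is over states, observations and (in the case of a randomized $\pi$) actions.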
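In particular we can calculate this expectation as
$$V(\pi) = \sum_{t=0}^{h-1} \sum_{\vec{\theta}^t \in \vec{\Theta}^t} \sum_{s^t \in \mathcal{S}} P(s^t, \vec{\theta}^t \mid \pi, b^0) \sum_{a^t \in \mathcal{A}} R(s^t, a^t) \, P_\pi(a^t \mid \vec{\theta}^t), \tag{2.6}$$
where $P_\pi(a^t \mid \vec{\theta}^t)$ is the probability of $a^t$ as specified by $\pi$, and where $P(s^t, \vec{\theta}^t \mid \pi, b^0)$ is recursively defined as
$$P(s^t, \vec{\theta}^t \mid \pi, b^0) = \sum_{s^{t-1} \in \mathcal{S}} P(s^t, \vec{\theta}^t \mid s^{t-1}, \vec{\theta}^{t-1}, \pi) \, P(s^{t-1}, \vec{\theta}^{t-1} \mid \pi, b^0), \tag{2.7}$$
with
$$P(s^t, \vec{\theta}^t \mid s^{t-1}, \vec{\theta}^{t-1}, \pi) = P(o^t \mid a^{t-1}, s^t) \, P(s^t \mid s^{t-1}, a^{t-1}) \, P_\pi(a^{t-1} \mid \vec{\theta}^{t-1}), \tag{2.8}$$
a term that is completely specified by the transition and observation model and the joint policy. Here $\vec{\theta}^t = (\vec{\theta}^{t-1}, a^{t-1}, o^t)$. For stage $t = 0$ we have that $P(s^0, \vec{\theta}^0 \mid \pi, b^0) = b^0(s^0)$, with $\vec{\theta}^0$ the empty history.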
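Because of the recursive nature of $P(s^t, \vec{\theta}^t \mid \pi, b^0)$ it is more intuitive to specify the value recursively:
$$V_\pi^t(s^t, \vec{\theta}^t) = \sum_{a^t \in \mathcal{A}} P_\pi(a^t \mid \vec{\theta}^t) \left[ R(s^t, a^t) + \sum_{s^{t+1} \in \mathcal{S}} \sum_{o^{t+1} \in \mathcal{O}} P(s^{t+1}, o^{t+1} \mid s^t, a^t) \, V_\pi^{t+1}(s^{t+1}, \vec{\theta}^{t+1}) \right], \tag{2.9}$$
with $\vec{\theta}^{t+1} = (\vec{\theta}^t, a^t, o^{t+1})$ and no future term at the last stage $t = h-1$. The value of joint policy $\pi$ is then given by
$$V(\pi) = \sum_{s^0 \in \mathcal{S}} V_\pi^0(s^0, \vec{\theta}^0) \, b^0(s^0). \tag{2.10}$$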
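For the special case of evaluating a pure joint policy $\pi$, eq. (2.6) can be written as:
$$V(\pi) = \sum_{t=0}^{h-1} \sum_{\vec{\theta}^t \in \vec{\Theta}^t} P(\vec{\theta}^t \mid \pi, b^0) \, R(\vec{\theta}^t, \pi(\vec{\theta}^t)), \tag{2.11}$$
where
$$R(\vec{\theta}^t, a) = \sum_{s^t \in \mathcal{S}} R(s^t, a) \, P(s^t \mid \vec{\theta}^t, b^0) \tag{2.12}$$
denotes the expected immediate reward. In this case, the recursive formulation (2.9) reduces to
$$V_\pi^t(s^t, \vec{o}^t) = R(s^t, \pi(\vec{o}^t)) + \sum_{s^{t+1} \in \mathcal{S}} \sum_{o^{t+1} \in \mathcal{O}} P(s^{t+1}, o^{t+1} \mid s^t, \pi(\vec{o}^t)) \, V_\pi^{t+1}(s^{t+1}, \vec{o}^{t+1}). \tag{2.13}$$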
Note that, when computing the value of a joint policy recursively, intermediate results should be cached. A particular pair of state and joint observation history at stage $t+1$ (or pair of state and joint action-observation history, for a stochastic joint policy) can be reached from $|\mathcal{S}|$ states of the previous stage. The value of that pair is the same in each case, however, and should be computed only once.
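The recursive evaluation of a pure joint policy with caching can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' implementation: joint actions and joint observations are flattened to integer indices, the model is stored in nested lists `T[s][a][s_next]`, `O[a][s_next][o]` and `R[s][a]`, and the joint policy is given as a function of the joint observation history. All names are hypothetical.

```python
from functools import lru_cache


def evaluate_pure_joint_policy(policy, b0, T, O, R, horizon):
    """Expected cumulative reward of a pure joint policy over `horizon` stages."""
    n_states = len(b0)
    n_obs = len(O[0][0])

    @lru_cache(maxsize=None)  # each (stage, state, history) value is computed once
    def value(t, s, obs_history):
        a = policy(obs_history)
        total = R[s][a]
        if t == horizon - 1:
            return total
        for s_next in range(n_states):
            for o in range(n_obs):
                p = T[s][a][s_next] * O[a][s_next][o]
                if p > 0.0:
                    total += p * value(t + 1, s_next, obs_history + (o,))
        return total

    return sum(b0[s] * value(0, s, ()) for s in range(n_states))


# Toy model: action 0 keeps the state, action 1 flips it; the observation
# reveals the new state with probability 0.8; reward 1 when action == state.
T_toy = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
O_toy = [[[0.8, 0.2], [0.2, 0.8]], [[0.8, 0.2], [0.2, 0.8]]]
R_toy = [[1.0, 0.0], [0.0, 1.0]]


def follow_last_observation(obs_history):
    """Act 0 initially, afterwards repeat the last observation as the action."""
    return obs_history[-1] if obs_history else 0


v = evaluate_pure_joint_policy(follow_last_observation, [0.5, 0.5],
                               T_toy, O_toy, R_toy, horizon=2)
```

Without the cache, the same state and history pair would be re-evaluated once for every predecessor state that can reach it.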
2.4.4 Existence of an Optimal Pure Joint Policy
Although randomized policies may be useful, we can restrict our attention to pure policies without sacrificing optimality, as shown by the following.
3 Overview of Dec-POMDP Solution Methods
In order to provide some background on solving Dec-POMDPs, this section gives an overview of some recently proposed methods. We will limit this review to a number of finite-horizon methods for general Dec-POMDPs that are related to our own approach.
We will not review the work performed on infinite-horizon Dec-POMDPs, such as the work by Peshkin et al. (2000), Bernstein et al. (2005), Szer and Charpillet (2005), and Amato et al. (2006, 2007). In this setting policies are usually represented by finite state controllers (FSCs). Since solving an infinite-horizon Dec-POMDP is undecidable (Bernstein, Givan, Immerman, & Zilberstein, 2002), this line of work focuses on finding approximate solutions (Bernstein, 2005) or (near-)optimal policies for a given controller size.
There is also a substantial amount of work on methods exploiting particular independence assumptions. In particular, transition and observation independent Dec-MDPs (Becker, Zilberstein, Lesser, & Goldman, 2004b; Wu & Durfee, 2006) and Dec-POMDPs (Kim, Nair, Varakantham, Tambe, & Yokoo, 2006; Varakantham, Marecki, Yabu, Tambe, & Yokoo, 2007) have received quite some attention. These models assume that each agent has an individual state space and that the actions of one agent do not influence the transitions between the local states of another agent. Although such models are easier to solve, the independence assumptions severely restrict their applicability. Other special cases that have been considered are, for instance, goal-oriented Dec-POMDPs (Goldman & Zilberstein, 2004), event-driven Dec-MDPs (Becker, Zilberstein, & Lesser, 2004a), Dec-MDPs with time and resource constraints (Beynier & Mouaddib, 2005, 2006; Marecki & Tambe, 2007), Dec-MDPs with local interactions (Spaan & Melo, 2008) and factored Dec-POMDPs with additive rewards (Oliehoek, Spaan, Whiteson, & Vlassis, 2008).
A final body of related work, which is beyond the scope of this article, consists of models and techniques for explicit communication in Dec-POMDP settings (Ooi & Wornell, 1996; Pynadath & Tambe, 2002; Goldman & Zilberstein, 2003; Nair, Roth, & Yokoo, 2004; Becker, Lesser, & Zilberstein, 2005; Roth, Simmons, & Veloso, 2005; Oliehoek, Spaan, & Vlassis, 2007b; Roth, Simmons, & Veloso, 2007; Goldman, Allen, & Zilberstein, 2007). The Dec-POMDP model itself can model communication actions as regular actions, in which case the semantics of the communication actions becomes part of the optimization problem (Xuan, Lesser, & Zilberstein, 2001; Goldman & Zilberstein, 2003; Spaan, Gordon, & Vlassis, 2006). In contrast, most approaches mentioned typically assume that communication happens outside the Dec-POMDP model and with predefined semantics. A typical assumption is that at every time step the agents communicate their individual observations before selecting an action. Pynadath and Tambe (2002) showed that, under assumptions of instantaneous and cost-free communication, sharing individual observations in such a way is optimal.
3.1 Brute Force Policy Evaluation
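Because there exists an optimal pure joint policy for a finite-horizon Dec-POMDP, it is in theory possible to enumerate all different pure joint policies, evaluate them using equations (2.10) and (2.13) and choose the best one. The number of pure joint policies to be evaluated is:
$$O\left( |\mathcal{A}_*|^{\frac{n \left( |\mathcal{O}_*|^h - 1 \right)}{|\mathcal{O}_*| - 1}} \right), \tag{3.1}$$
where $|\mathcal{A}_*|$ and $|\mathcal{O}_*|$ denote the sizes of the largest individual action and observation sets. The cost of evaluating each policy is $O(|\mathcal{S}| \cdot |\mathcal{O}|^h)$, with $\mathcal{O}$ the set of joint observations. The resulting total cost of brute-force policy evaluation is
$$O\left( |\mathcal{A}_*|^{\frac{n \left( |\mathcal{O}_*|^h - 1 \right)}{|\mathcal{O}_*| - 1}} \cdot |\mathcal{S}| \cdot |\mathcal{O}_*|^{nh} \right), \tag{3.2}$$
which is doubly exponential in the horizon $h$.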
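The count in (3.1) can be checked on tiny instances by explicit enumeration. The sketch below is illustrative only: it assumes every agent has the same number of actions and observations, represents a pure policy as a dictionary from observation histories to actions, and uses hypothetical function names.

```python
from itertools import product


def observation_histories(n_obs, horizon):
    """All observation histories of lengths 0 .. horizon-1 for one agent."""
    histories = []
    for t in range(horizon):
        histories.extend(product(range(n_obs), repeat=t))
    return histories


def pure_policies(n_actions, n_obs, horizon):
    """Yield every pure policy as a dict: observation history -> action."""
    histories = observation_histories(n_obs, horizon)
    for choice in product(range(n_actions), repeat=len(histories)):
        yield dict(zip(histories, choice))


def count_pure_joint_policies(n_agents, n_actions, n_obs, horizon):
    """Count pure joint policies by enumerating one agent's policies."""
    per_agent = sum(1 for _ in pure_policies(n_actions, n_obs, horizon))
    return per_agent ** n_agents


joint_count = count_pure_joint_policies(n_agents=2, n_actions=2, n_obs=2, horizon=3)
```

For two agents with two actions and two observations each, horizon 3 already gives 16384 pure joint policies, which matches $2^{2 \cdot (2^3 - 1)/(2 - 1)}$.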
3.2 Alternating Maximization
Nair et al. (2003) introduced Joint Equilibrium based Search for Policies (JESP). This method is guaranteed to find a locally optimal joint policy, more specifically, a Nash equilibrium: a tuple of policies such that for each agent $i$ its policy $\pi_i$ is a best response to the policies employed by the other agents, $\pi_{\neq i}$. It relies on a process we refer to as alternating maximization. This is a procedure that computes a policy for one agent that maximizes the joint reward, while keeping the policies of the other agents fixed. Next, another agent is chosen to maximize the joint reward by finding its best response to the fixed policies of the other agents. This process is repeated until the joint policy converges to a Nash equilibrium, which is a local optimum. The main idea of fixing some agents and having others improve their policy was presented before by Chadès et al. (2002), but they used a heuristic approach for memoryless agents. The process of alternating maximization is also referred to as hill-climbing or coordinate ascent.
Nair et al. (2003) describe two variants of JESP. The first, Exhaustive-JESP, implements the above idea in a very straightforward fashion: starting from a random joint policy, the first agent is chosen. This agent then selects its best-response policy by evaluating the joint reward obtained for all of its individual policies when the other agents follow their fixed policies.
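The alternating-maximization loop itself is easy to illustrate on a much simpler problem. The sketch below is not JESP: it replaces the Dec-POMDP by a one-shot, two-agent game with an identical payoff matrix, so that each agent's "policy" is a single action. The function name and the payoff matrix are assumptions made for this example.

```python
def alternating_maximization(payoff, start):
    """Coordinate ascent on a shared payoff matrix payoff[a1][a2].

    Agents take turns switching to a best response while the other agent's
    choice is held fixed; the loop stops when neither agent can improve.
    """
    joint = list(start)
    n_actions = (len(payoff), len(payoff[0]))
    improved = True
    while improved:
        improved = False
        for agent in (0, 1):
            def value(action):
                trial = joint.copy()
                trial[agent] = action
                return payoff[trial[0]][trial[1]]

            best = max(range(n_actions[agent]), key=value)
            # Switch only on strict improvement, so the loop must terminate.
            if value(best) > value(joint[agent]):
                joint[agent] = best
                improved = True
    return tuple(joint)


# Hypothetical payoffs with three equilibria of different quality.
shared_payoff = [[5, 0, 0],
                 [0, 10, 0],
                 [0, 0, 1]]
local_opt = alternating_maximization(shared_payoff, start=(0, 0))
global_opt = alternating_maximization(shared_payoff, start=(2, 1))
```

Started at `(0, 0)` the procedure stays there with payoff 5, although `(1, 1)` yields 10: the result is a Nash equilibrium but only a local optimum, which is why the starting point matters.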
The second variant, DP-JESP, uses a dynamic programming approach to compute the best-response policy for a selected agent $i$. In essence, fixing the policies of all other agents allows for a reformulation of the problem as an augmented POMDP. In this augmented POMDP a state consists of a nominal state $s$ and the observation histories of the other agents. Given the fixed deterministic policies of the other agents, $\pi_{\neq i}$, such an augmented state is a Markovian state, and all transition and observation probabilities can easily be derived from $\pi_{\neq i}$.
Like most methods proposed for Dec-POMDPs, JESP exploits the knowledge of the initial belief by only considering reachable beliefs in the solution of the POMDP. However, in some cases the initial belief might not be available. As demonstrated by Varakantham et al. (2006), JESP can be extended to plan for the entire space of initial beliefs, overcoming this problem.
3.3 MAA*
Szer et al. (2005) introduced a heuristically guided policy search method called multiagent A* (MAA*). It performs a guided A*-like search over partially specified joint policies, using an admissible heuristic to prune joint policies that are guaranteed to be worse than the best (fully specified) joint policy found so far.
In particular, MAA* considers joint policies that are partially specified with respect to time: a partial joint policy $\varphi^t = (\delta^0, \delta^1, \dots, \delta^{t-1})$ specifies the joint decision rules for the first $t$ stages. For such a partial joint policy $\varphi^t$ a heuristic value $\widehat{V}(\varphi^t)$ is calculated by taking $V^{0 \dots t-1}(\varphi^t)$, the actual expected reward $\varphi^t$ achieves over the first $t$ stages, and adding $\widehat{V}^{t \dots h-1}$, a heuristic value for the remaining $h-t$ stages. Clearly, when $\widehat{V}^{t \dots h-1}$ is an admissible heuristic (a guaranteed overestimation), so is $\widehat{V}(\varphi^t)$.
MAA* starts by placing the completely unspecified joint policy $\varphi^0$ in an open list. Then, it proceeds by selecting partial joint policies $\varphi^t$ from the list and 'expanding' them: generating all $\varphi^{t+1} = (\varphi^t, \delta^t)$ by appending all possible joint decision rules $\delta^t$ for the next time step (stage $t$). The left side of Figure 3 illustrates the expansion process. After expansion, all created children are heuristically evaluated and placed in the open list. Any partial joint policy $\varphi^{t+1}$ with $\widehat{V}(\varphi^{t+1})$ less than the expected value $V(\pi)$ of some earlier found (fully specified) joint policy $\pi$ can be pruned. The search ends when the list becomes empty, at which point we have found an optimal fully specified joint policy.
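The search skeleton described above (an open list ordered by an optimistic value, expansion by one stage, and pruning against the best complete solution) can be sketched on a toy problem. This is not MAA* itself: the "partial policy" is just a tuple of per-stage choices, the reward function is invented for the example, and the heuristic simply assumes the maximum per-stage reward for all remaining stages. The horizon is assumed to be at least 1.

```python
import heapq


def best_first_search(horizon, choices, step_reward, max_step_reward):
    """Best-first search over partial plans with an overestimating heuristic."""
    # Entries are (-heuristic value, reward achieved so far, partial plan).
    open_list = [(-horizon * max_step_reward, 0.0, ())]
    best_value, best_plan = float("-inf"), None
    while open_list:
        neg_f, g, partial = heapq.heappop(open_list)
        if -neg_f <= best_value:
            break  # no remaining node can beat the best complete plan
        for c in choices:
            child = partial + (c,)
            g_child = g + step_reward(partial, c)
            if len(child) == horizon:
                if g_child > best_value:
                    best_value, best_plan = g_child, child
            else:
                f_child = g_child + (horizon - len(child)) * max_step_reward
                if f_child > best_value:  # prune children that cannot improve
                    heapq.heappush(open_list, (-f_child, g_child, child))
    return best_value, best_plan


def toy_reward(partial, c):
    """Hypothetical reward: start with choice 0, then alternate choices."""
    if not partial:
        return 1.0 if c == 0 else 0.0
    return 1.0 if c != partial[-1] else 0.5


value, plan = best_first_search(3, (0, 1), toy_reward, max_step_reward=1.0)
```

Because the heuristic never underestimates, the first time the best open node is no better than the incumbent, the incumbent is optimal and the search can stop. This sketch also prunes nodes whose bound merely equals the incumbent, which still returns an optimal plan.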
3.4 Dynamic Programming for Dec-POMDPs
MAA* incrementally builds policies from the first stage $t = 0$ to the last $t = h-1$. Prior to this work, Hansen et al. (2004) introduced dynamic programming (DP) for Dec-POMDPs, which constructs policies the other way around: starting with a set of '1-step policies' (actions) that can be executed at the last stage, they construct a set of 2-step policies to be executed at $h-2$, etc.
It should be stressed that the policies maintained are quite different from those used by MAA*. In particular a partial policy in MAA* has the form $\varphi^t = (\delta^0, \delta^1, \dots, \delta^{t-1})$. The policies maintained by DP do not have such a correspondence to decision rules. We define the time-to-go $\tau$ at stage $t$ as
$$\tau = h - t. \tag{3.3}$$
Now $q_i^{\tau=k}$ denotes a $k$-steps-to-go subtree policy for agent $i$. That is, $q_i^{\tau=k}$ is a policy tree that has the same form as a full policy for the horizon-$k$ problem. Within the original horizon-$h$ problem $q_i^{\tau=k}$ is a candidate for execution starting at stage $t = h-k$. The set of $k$-steps-to-go subtree policies maintained for agent $i$ is denoted $Q_i^{\tau=k}$. Dynamic programming for Dec-POMDPs is based on backup operations: constructing a set of subtree policies $Q_i^{\tau=k+1}$ from a set $Q_i^{\tau=k}$. For instance, the right side of Figure 3 shows how $q_i^{\tau=3}$, a 3-steps-to-go subtree policy, is constructed from two $q_i^{\tau=2} \in Q_i^{\tau=2}$. Also illustrated is the difference between this process and MAA* expansion (on the left side).
Dynamic programming consecutively constructs $Q_i^{\tau=1}, Q_i^{\tau=2}, \dots, Q_i^{\tau=h}$ for all agents $i$. However, the size of the set $Q_i^{\tau=k+1}$ is given by
$$|Q_i^{\tau=k+1}| = |\mathcal{A}_i| \cdot |Q_i^{\tau=k}|^{|\mathcal{O}_i|},$$
and as a result the sizes of the maintained sets grow doubly exponentially with $k$. To counter this source of intractability, Hansen et al. (2004) propose to eliminate dominated subtree policies. The expected reward of a particular subtree policy $q_i^{\tau=k}$ depends on the probability over states when $q_i^{\tau=k}$ is started (at stage $t = h-k$) as well as the probability with which the other agents select their subtree policies. If we let $q_{\neq i}^{\tau=k}$ denote a subtree profile for all agents but $i$, and $Q_{\neq i}^{\tau=k}$ the set of such profiles, we can say that $q_i^{\tau=k}$ is dominated if it is not maximizing at any point in the multiagent belief space: the simplex over $\mathcal{S} \times Q_{\neq i}^{\tau=k}$. Hansen et al. test for dominance over the entire multiagent belief space by linear programming. Removal of a dominated subtree policy of one agent may cause a subtree policy of another agent to become dominated. Therefore Hansen et al. propose to iterate over agents until no further pruning is possible, a procedure known as iterated elimination of dominated policies (Osborne & Rubinstein, 1994).
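The exhaustive backup, without any pruning, can be sketched as follows. The tree representation (an action paired with one child subtree per observation) and the function name are assumptions made for illustration; dominance testing by linear programming is deliberately left out.

```python
from itertools import product


def exhaustive_backup(prev_trees, actions, observations):
    """Build all (k+1)-steps-to-go subtree policies from k-steps-to-go ones.

    A subtree policy is represented as (action, children), where children is
    a tuple of (observation, subtree) pairs; 1-step policies have no children.
    """
    new_trees = []
    for a in actions:
        for children in product(prev_trees, repeat=len(observations)):
            new_trees.append((a, tuple(zip(observations, children))))
    return new_trees


actions = ("a0", "a1")
observations = ("o0", "o1")

# 1-step-to-go subtree policies are just the actions themselves.
q1 = [(a, ()) for a in actions]
q2 = exhaustive_backup(q1, actions, observations)
q3 = exhaustive_backup(q2, actions, observations)
sizes = (len(q1), len(q2), len(q3))
```

With two actions and two observations the sets contain 2, 8 and 128 trees, in line with the size formula above, which is why pruning is needed for anything beyond very short horizons.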
Finally, when the last backup step is completed, the optimal policy can be found by evaluating all joint policies $\pi \in Q_1^{\tau=h} \times \dots \times Q_n^{\tau=h}$ for the initial belief $b^0$.
3.5 Extensions on DP for Dec-POMDPs
In the last few years several extensions to the dynamic programming algorithm for Dec-POMDPs have been proposed. The first of these extensions is due to Szer and Charpillet (2006). Rather than testing for dominance over the entire multiagent belief space, Szer and Charpillet propose to perform point-based dynamic programming (PBDP). In order to prune the set of subtree policies $Q_i^{\tau=k}$, the set of all the multiagent belief points that can possibly be reached by deterministic joint policies is generated. Only the subtree policies that maximize the value at some such belief point are kept. The proposed algorithm is optimal, but intractable because it needs to generate all the multiagent belief points that are reachable through all joint policies. To overcome this bottleneck, Szer and Charpillet propose to randomly sample one or more joint policies and use those to generate the belief points.
Seuken and Zilberstein (2007) also proposed a point-based extension of the DP algorithm, called memory-bounded dynamic programming (MBDP). Rather than using a randomly selected policy to generate the belief points, they propose to use heuristic policies. A more important difference, however, lies in the pruning step. Rather than pruning dominated subtree policies, MBDP prunes all subtree policies except a few in each iteration. More specifically, for each agent maxTrees subtree policies are retained, where maxTrees is a parameter of the planning method. As a result, MBDP has only linear space and time complexity with respect to the horizon. The MBDP algorithm still depends on the exhaustive generation of the sets $Q_i^{\tau=k+1}$, which now contain $|\mathcal{A}_i| \cdot \text{maxTrees}^{|\mathcal{O}_i|}$ subtree policies. Moreover, in each iteration all joint subtree policies formed from these sets have to be evaluated for each of the sampled belief points. To counter this growth, Seuken and Zilberstein proposed an extension that limits the considered observations during the backup step to the maxObs most likely observations.
Finally, a further extension of the DP for Dec-POMDPs algorithm is given by Amato, Carlin, and Zilberstein (2007). Their approach, bounded DP (BDP), establishes a bound not on the used memory, but on the quality of approximation. In particular, BDP uses $\epsilon$-pruning in each iteration. That is, a subtree policy that is maximizing in some region of the multiagent belief space, but improves the value in this region by at most $\epsilon$, is also pruned. Because iterated elimination using $\epsilon$-pruning can still lead to an unbounded reduction in value, Amato et al. propose to perform one iteration of $\epsilon$-pruning, followed by iterated elimination using normal pruning.
3.6 Other Approaches for Finite-Horizon Dec-POMDPs
There are a few other approaches for finite-horizon Dec-POMDPs, which we will only briefly describe here. Aras et al. (2007) proposed a mixed integer linear programming formulation for the optimal solution of finite-horizon Dec-POMDPs. Their approach is based on representing the set of possible policies for each agent in sequence form (Romanovskii, 1962; Koller, Megiddo, & von Stengel, 1994; Koller & Pfeffer, 1997). In sequence form, a single policy for an agent is represented as a subset of the set of 'sequences' (roughly corresponding to action-observation histories) for the agent. As such the problem can be interpreted as a combinatorial optimization problem, which Aras et al. propose to solve with a mixed integer linear program.
Oliehoek, Kooij, and Vlassis (2007) also recognize that finding a solution for Dec-POMDPs is in essence a combinatorial optimization problem and propose to apply the Cross-Entropy method (de Boer, Kroese, Mannor, & Rubinstein, 2005), a method for combinatorial optimization that has recently become popular because of its ability to find near-optimal solutions in large optimization problems. The resulting algorithm performs a sampling-based policy search for approximately solving Dec-POMDPs. It operates by sampling pure policies from an appropriately parameterized stochastic policy, and then evaluates these policies either exactly or approximately in order to define the next stochastic policy to sample from, and so on until convergence.
Finally, Emery-Montemerlo et al. (2004, 2005) proposed to approximate Dec-POMDPs through series of Bayesian games. Since our work in this article is based on the same representation, we defer a detailed explanation to the next section. We do mention here that while Emery-Montemerlo et al. assume that the algorithm is run on-line (interleaving planning and execution), no such assumption is necessary. Rather, we will apply the same framework during an off-line planning phase, just like the other algorithms covered in this overview.
4 Optimal Q-value Functions
In this section we will show how a Dec-POMDP can be modeled as a series of Bayesian games (BGs). A BG is a game-theoretic model that can deal with uncertainty (Osborne & Rubinstein, 1994). Bayesian games are similar to the more well-known normal form, or matrix, games, but make it possible to model agents that have some private information. The idea of using a series of BGs to find policies for a Dec-POMDP has been proposed in an approximate setting by Emery-Montemerlo et al. (2004). In particular, they showed that using series of BGs and an approximate payoff function, they were able to obtain approximate solutions on the Dec-Tiger problem, comparable to results for JESP (see Section 3.2).
The main result of this section is that an optimal Dec-POMDP policy can be computed from the solution of a sequence of Bayesian games, if the payoff function of those games coincides with the Q-value function of an optimal policy $\pi^*$, i.e., with the optimal Q-value function $Q^*$. Thus, we extend the results of Emery-Montemerlo et al. (2004) to include the optimal setting. Also, we conjecture that this form of $Q^*$ cannot be computed without already knowing an optimal policy $\pi^*$. By transferring the game-theoretic concept of sequential rationality to Dec-POMDPs, we find a description of $Q^*$ that is computable without knowing $\pi^*$ up front.
4.1 Game-Theoretic Background
Before we can explain how Dec-POMDPs can be modeled using Bayesian games, we will first introduce them together with some other necessary game-theoretic background.
4.1.1 Strategic Form Games and Nash Equilibria
At the basis of the concept of a Bayesian game lies a simpler form of game: the strategic or normal form game. A strategic game consists of a set of agents or players, each of which has a set of actions (or strategies). The combination of selected actions specifies a particular outcome. When a strategic game consists of two agents, it can be visualized as a matrix as shown in Figure 4. The first game shown is called 'Chicken' and involves two teenagers who are driving head on. Both have the option to drive on or chicken out. Each teenager's payoff is maximal when he drives on and his opponent chickens out. However, if both drive on, a collision follows, which gives both of them their lowest payoff. The second game is the meeting location problem. Both agents want to meet in location A or B. They have no preference over which location, as long as both pick the same location. This game is fully cooperative, which is modeled by the fact that the agents receive identical payoffs.
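Figure 4: Two strategic games in matrix form. First: the game 'Chicken', with actions D (drive on) and C (chicken out). Second: the meeting location problem, with actions A and B.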
Definition 4.1.
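Formally, a strategic game is a tuple $\langle n, \mathcal{A}, u \rangle$, where $n$ is the number of agents, $\mathcal{A} = \times_i \mathcal{A}_i$ is the set of joint actions, and $u = \langle u_1, \dots, u_n \rangle$ with $u_i : \mathcal{A} \rightarrow \mathbb{R}$ is the payoff function of agent $i$.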
Game theory tries to specify for each agent how to play. That is, a game-theoretic solution should suggest a policy for each agent. In a strategic game we write $\alpha_i$ to denote a policy for agent $i$ and $\alpha = \langle \alpha_1, \dots, \alpha_n \rangle$ for a joint policy. A policy for agent $i$ is simply one of its actions (i.e., a pure policy), or a probability distribution over its actions (i.e., a mixed policy). Also, the policy suggested to each agent should be rational given the policies suggested to the other agents; it would be undesirable to suggest a particular policy to an agent if it can get a better payoff by switching to another policy. Rather, the suggested policies should form an equilibrium, meaning that it is not profitable for an agent to unilaterally deviate from its suggested policy. This notion is formalized by the concept of Nash equilibrium.
Definition 4.2.
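A pure policy profile $\alpha = \langle \alpha_1, \dots, \alpha_i, \dots, \alpha_n \rangle$ specifying a pure policy for each agent is a Nash Equilibrium (NE) if and only if
$$u_i(\langle \alpha_1, \dots, \alpha_i, \dots, \alpha_n \rangle) \geq u_i(\langle \alpha_1, \dots, \alpha_i', \dots, \alpha_n \rangle), \quad \forall_{i : 1 \leq i \leq n}, \; \forall_{\alpha_i' \in \mathcal{A}_i}. \tag{4.1}$$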
This definition can be easily extended to incorporate mixed policies by defining
$$u_i(\langle \alpha_1, \dots, \alpha_n \rangle) = \sum_{\langle a_1, \dots, a_n \rangle} u_i(\langle a_1, \dots, a_n \rangle) \prod_{j=1}^{n} \Pr(a_j \mid \alpha_j),$$
where $\alpha_j$ denotes the (mixed) policy of agent $j$.
Nash (1950) proved that when allowing mixed policies, every (finite) strategic game contains at least one NE, making it a proper solution concept for a game. However, it is unclear how such a NE should be found. In particular, there may be multiple NEs in a game, making it unclear which one to select. In order to discriminate between Nash equilibria, we can consider NEs such that there is no other NE that is better for everyone.
Definition 4.3.
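A Nash Equilibrium $\alpha$ is referred to as Pareto Optimal (PO) when there is no other NE $\alpha'$ that specifies at least the same payoff for all agents and a higher payoff for at least one agent:
$$\neg \exists_{\alpha'} \left( \forall_i \; u_i(\alpha') \geq u_i(\alpha) \;\wedge\; \exists_i \; u_i(\alpha') > u_i(\alpha) \right).$$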
In the case when multiple Pareto optimal Nash equilibria exist, the agents can agree beforehand on a particular ordering, to ensure the same NE is chosen.
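For small matrix games, pure Nash equilibria and the Pareto optimal ones among them can be found by direct enumeration. The following sketch assumes a game given as a dictionary from joint actions to payoff tuples; the function names and the example payoffs are hypothetical and are not taken from Figure 4. Mixed equilibria are not considered.

```python
from itertools import product


def pure_nash_equilibria(payoffs, n_actions):
    """Return all joint actions from which no agent gains by deviating alone.

    payoffs[joint_action] is a tuple with one payoff per agent.
    n_actions[i] is the number of actions of agent i.
    """
    equilibria = []
    for joint in product(*(range(k) for k in n_actions)):
        stable = True
        for i, k in enumerate(n_actions):
            for alt in range(k):
                deviation = joint[:i] + (alt,) + joint[i + 1:]
                if payoffs[deviation][i] > payoffs[joint][i]:
                    stable = False
        if stable:
            equilibria.append(joint)
    return equilibria


def pareto_optimal(equilibria, payoffs):
    """Keep the equilibria that are not Pareto dominated by another equilibrium."""
    def dominates(x, y):
        pairs = list(zip(payoffs[x], payoffs[y]))
        return all(p >= q for p, q in pairs) and any(p > q for p, q in pairs)

    return [e for e in equilibria
            if not any(dominates(other, e) for other in equilibria)]


# Hypothetical coordination game: meeting at location 0 is better than at 1.
game = {(0, 0): (2, 2), (0, 1): (0, 0), (1, 0): (0, 0), (1, 1): (1, 1)}
nes = pure_nash_equilibria(game, n_actions=(2, 2))
po_nes = pareto_optimal(nes, game)
```

In this example both coordinated outcomes are Nash equilibria, but only the one with the higher payoff is Pareto optimal, which illustrates how Pareto optimality narrows down the selection.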
4.1.2 Bayesian Games
A Bayesian game (Osborne & Rubinstein, 1994) is an augmented normal form game in which the players hold some private information. This private information defines the type of the agent, i.e., a particular type of an agent corresponds to that agent knowing some particular information. The payoff the agents receive now depends not only on their actions, but also on their private information. Formally, a BG is defined as follows:
Definition 4.4.
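A Bayesian game (BG) is a tuple $\langle n, \mathcal{A}, \Theta, P(\Theta), \langle u_1, \dots, u_n \rangle \rangle$, where $n$ is the number of agents, $\mathcal{A}$ is the set of joint actions, $\Theta = \times_i \Theta_i$ is the set of joint types over which a probability function $P(\Theta)$ is specified, and $u_i : \Theta \times \mathcal{A} \rightarrow \mathbb{R}$ is the payoff function of agent $i$.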
In a normal form game the agents select an action. In a BG, by contrast, the agents can condition their action on their private information. This means that in BGs the agents use a different type of policy. For a BG, we denote a joint policy $\beta = \langle \beta_1, \dots, \beta_n \rangle$, where the individual policies are mappings from types to actions: $\beta_i : \Theta_i \rightarrow \mathcal{A}_i$. In the case of identical payoffs for the agents, the solution of a BG is given by the following theorem:
Theorem 4.1.
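For a BG with identical payoffs, i.e., $\forall_{i,j} \; u_i = u_j = u$, the solution is given by:
$$\beta^* = \arg\max_{\beta} \sum_{\theta \in \Theta} P(\theta) \, u(\theta, \beta(\theta)), \tag{4.2}$$
where $\beta(\theta) = \langle \beta_1(\theta_1), \dots, \beta_n(\theta_n) \rangle$ is the joint action specified by $\beta$ for joint type $\theta$. This solution constitutes a Pareto optimal Nash equilibrium.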
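For very small games the maximization in (4.2) can be carried out by brute force over all joint policies. The sketch below is only an illustration under stated assumptions: types and actions are given as per-agent lists, the type distribution and payoff function as dictionaries, and the example game (one informed agent, one uninformed agent) is invented for this purpose.

```python
from itertools import product


def solve_identical_payoff_bg(types, actions, prob, payoff):
    """Enumerate all joint BG policies and return (best value, best policies).

    types[i], actions[i]: lists for agent i.
    prob[joint_type]: probability of a joint type (tuple of individual types).
    payoff[(joint_type, joint_action)]: the shared payoff.
    """
    n = len(types)
    # An individual policy assigns one action to each of the agent's types.
    individual = [list(product(actions[i], repeat=len(types[i])))
                  for i in range(n)]
    best_value, best_policies = float("-inf"), None
    for joint in product(*individual):
        policies = [dict(zip(types[i], joint[i])) for i in range(n)]
        value = 0.0
        for theta in product(*types):
            a = tuple(policies[i][theta[i]] for i in range(n))
            value += prob[theta] * payoff[(theta, a)]
        if value > best_value:
            best_value, best_policies = value, policies
    return best_value, best_policies


# Hypothetical game: agent 0 knows the target (type "x" -> 0, "y" -> 1),
# agent 1 has a single type; payoff 1 only if both agents pick the target.
types = [["x", "y"], ["z"]]
actions = [[0, 1], [0, 1]]
prob = {("x", "z"): 0.7, ("y", "z"): 0.3}
target = {"x": 0, "y": 1}
payoff = {(theta, a): 1.0 if a[0] == a[1] == target[theta[0]] else 0.0
          for theta in prob for a in product([0, 1], repeat=2)}

bg_value, bg_policies = solve_identical_payoff_bg(types, actions, prob, payoff)
```

Because the uninformed agent cannot condition on the target, the best joint policy commits to the more likely target and achieves an expected payoff of 0.7. The number of joint policies grows exponentially with the number of types, so this enumeration is only feasible for toy games.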
Proof.
The proof consists of two parts: the first shows that is a Nash equilibrium, the second shows it is Pareto optimal.
Nash equilibrium proof.
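It is clear that $\beta^*$ satisfying (4.2) is a Nash equilibrium by rewriting from the perspective of an arbitrary agent $i$ as follows:
$$\beta_i^* = \arg\max_{\beta_i} \left[ \max_{\beta_{\neq i}} \sum_{\theta \in \Theta} P(\theta) \, u(\theta, \beta(\theta)) \right] = \arg\max_{\beta_i} \sum_{\theta_i} P(\theta_i) \sum_{\theta_{\neq i}} P(\theta_{\neq i} \mid \theta_i) \, u(\theta, \langle \beta_i(\theta_i), \beta^*_{\neq i}(\theta_{\neq i}) \rangle),$$
which means that $\beta_i^*$ is a best response for $\beta^*_{\neq i}$. Since no special assumptions were made on $i$, it follows that $\beta^*$ is a Nash equilibrium.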
Pareto optimality proof.
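Let us write $V_{\theta_i}(a_i, \beta_{\neq i})$ for the payoff agent $i$ expects for type $\theta_i$ when performing $a_i$ while the other agents use policy profile $\beta_{\neq i}$. We have that
$$V_{\theta_i}(a_i, \beta_{\neq i}) = \sum_{\theta_{\neq i}} P(\theta_{\neq i} \mid \theta_i) \, u(\langle \theta_i, \theta_{\neq i} \rangle, \langle a_i, \beta_{\neq i}(\theta_{\neq i}) \rangle).$$
Now, a joint policy $\beta^*$ satisfying (4.2) is not Pareto optimal if and only if there is another Nash equilibrium $\beta'$ that attains at least the same payoff for all agents and for all types and strictly more for at least one agent and type. Formally, $\beta^*$ is not Pareto optimal when $\exists \beta'$ such that:
$$\forall_i \forall_{\theta_i} \; V_{\theta_i}(\beta_i'(\theta_i), \beta'_{\neq i}) \geq V_{\theta_i}(\beta_i^*(\theta_i), \beta^*_{\neq i}) \;\wedge\; \exists_i \exists_{\theta_i} \; V_{\theta_i}(\beta_i'(\theta_i), \beta'_{\neq i}) > V_{\theta_i}(\beta_i^*(\theta_i), \beta^*_{\neq i}). \tag{4.3}$$
We prove by contradiction that no such $\beta'$ can exist. Suppose that $\beta'$ is a NE such that (4.3) holds (and thus $\beta^*$ is not Pareto optimal). Because $\beta^*$ satisfies (4.2) we know that:
$$\sum_{\theta \in \Theta} P(\theta) \, u(\theta, \beta^*(\theta)) \geq \sum_{\theta \in \Theta} P(\theta) \, u(\theta, \beta'(\theta)), \tag{4.4}$$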
and therefore, for all agents
holds. However, by assumption that satisfies (4.3) we get that
Therefore it must be that
and thus that