Team Behavior in Interactive Dynamic Influence Diagrams with Applications to Ad Hoc Teams
Abstract
Planning for ad hoc teamwork is challenging because it involves agents collaborating without any prior coordination or communication. The focus is on principled methods for a single agent to cooperate with others. This motivates investigating the ad hoc teamwork problem in the context of individual decision-making frameworks. However, individual decision making in multiagent settings requires reasoning about other agents' actions, which in turn involves reasoning about how those agents reason about others. An established approximation that operationalizes this approach is to bound the infinite nesting from below by introducing level 0 models. We show that a consequence of this finitely-nested modeling is that we may not obtain optimal team solutions in cooperative settings. We address this limitation by including models at level 0 whose solutions involve learning. We demonstrate that learning integrated into planning in the context of interactive dynamic influence diagrams facilitates optimal team behavior, and is applicable to ad hoc teamwork.
Muthukumaran Chandrasekaran 
THINC Lab 
University of Georgia 
Athens, GA, USA 30602 
mkran@uga.edu 
Prashant Doshi 
THINC Lab 
University of Georgia 
Athens, GA, USA 30602 
pdoshi@cs.uga.edu 
Yifeng Zeng 
School of Computing 
Teesside University 
Middlesbrough, Tees Valley, UK TS13BA 
y.zeng@tees.ac.uk
Yingke Chen 
Dept. of Computer Science 
Queen’s University Belfast 
Belfast, Northern Ireland, UK 
y.chen@qub.ac.uk 
Categories and Subject Descriptors: I.2.11 [Distributed Artificial Intelligence]: Intelligent agents, Multiagent systems

General Terms: Algorithms, Experimentation

Keywords: multiagent systems, ad hoc teamwork, sequential decision making and planning, reinforcement learning
Ad hoc teamwork involves a team of agents coming together to cooperate without any prior coordination or communication protocols [?]. The preclusion of prior commonality makes planning in ad hoc settings challenging. For example, well-known cooperative planning frameworks such as the communicative multiagent team decision problem [?] and the decentralized partially observable Markov decision process (DEC-POMDP) [?] utilize centralized planning and distribution of local policies among agents, which are assumed to have common initial beliefs. These assumptions make the frameworks unsuitable for ad hoc teamwork.
A focus on how an agent should behave online as an ad hoc teammate informs previous approaches toward planning. These include an algorithm for online planning in ad hoc teams (OPAT) [?], which solves a series of stage games assuming that other agents are optimal, with the utility at each stage computed using Monte Carlo tree search. Albrecht and Ramamoorthy [?] model the uncertainty about other agents' types and construct a Harsanyi-Bayesian ad hoc game that is solved online using learning. While these approaches take important steps, they assume that the physical states and actions of others are perfectly observable, which often may not hold.
The focus on individual agents' behaviors in ad hoc teams motivates situating the problem in the context of individual decision-making frameworks. In this regard, recognized frameworks include the interactive POMDP (I-POMDP) [?], its graphical counterpart, the interactive dynamic influence diagram (I-DID) [?], and I-POMDP Lite [?]. These frameworks allow consideration of partial observability of the state and uncertainty over the types of other agents with minimal prior assumptions, at the expense of increased computational complexity. Indeed, Albrecht and Ramamoorthy [?] note the suitability of these frameworks to the problem of ad hoc teamwork but find the complexity challenging.
While recent advances on model equivalence [?] allow frameworks such as I-DIDs to scale, another significant challenge that merits attention is due to the finitely-nested modeling used in these frameworks, which assumes the presence of level 0 models that do not explicitly reason about others [?, ?, ?, ?]. A consequence of this approximation is that we may not obtain optimal solutions in cooperative settings. To address this, we augment the I-DID framework by additionally attributing a new type of level 0 model. This type distinguishes itself by utilizing reinforcement learning (RL), either online or in simulation, to discover possible collaborative policies that a level 0 agent may execute.
The contributions of this paper are twofold. First, we show the emergence of true team behavior when the reasoning ability of lower-level agents is enhanced via learning. We demonstrate globally optimal teammate solutions when agents are modeled in finitely-nested augmented I-DIDs (Aug. I-DIDs), while traditional I-DIDs fail. Second, we demonstrate the applicability of Aug. I-DIDs to ad hoc settings and show their effectiveness for varying types of teammates. For this, we utilize the ad hoc setting of Wu et al. [?], and experiment with multiple well-known cooperative domains. We also perform a baseline comparison of Aug. I-DIDs with an implementation of a generalized version of OPAT that accounts for partial observability.
We sketch I-DIDs below and refer readers to [?] for more details.
A traditional DID models sequential decision making for a single agent by linking a set of chance, decision and utility nodes over multiple time steps. To consider multiagent interactions, I-DIDs introduce a new type of node called the model node (the hexagonal node, $M_{j,l-1}$, in Fig. ?) that represents how another agent $j$ acts as the subject agent $i$ reasons about its own decisions at level $l$. The model node contains a set of $j$'s candidate models at level $l-1$ ascribed by $i$. A link from the chance node, $S$, to the model node, $M_{j,l-1}$, represents agent $i$'s beliefs over $j$'s models. Specifically, it is a probability distribution in the conditional probability table (CPT) of the chance node, $Mod[M_j]$ (in Fig. ?). An individual model of $j$ is $m_{j,l-1} = \langle b_{j,l-1}, \hat{\theta}_j \rangle$, where $b_{j,l-1}$ is the level $l-1$ belief, and $\hat{\theta}_j$ is the agent's frame encompassing the decision, observation and utility nodes. Each model, $m_{j,l-1}$, could be either a level $l-1$ I-DID or a DID at level 0. Solutions to the model are the predicted behavior of $j$ and are encoded into the chance node, $A_j$, through a dashed link, called a policy link. Connecting $A_j$ with other nodes in the I-DID structures how agent $j$'s actions influence $i$'s decision-making process.
Expansion of an I-DID involves the update of the model node over time as indicated by the model update link, a dotted link from $M^{t}_{j,l-1}$ to $M^{t+1}_{j,l-1}$ in Fig. ?. As agent $j$ acts and receives observations over time, its models should be updated. For each model $m^{t}_{j,l-1}$ at time $t$, its optimal solutions may include all actions, and agent $j$ may receive any of the possible observations. Consequently, the set of updated models at $t+1$ contains up to $|M^{t}_{j,l-1}||A_j||\Omega_j|$ models. Here, $|M^{t}_{j,l-1}|$ is the number of models at time $t$, and $|A_j|$ and $|\Omega_j|$ are the largest spaces of actions and observations, respectively, among all the models. The CPT of $Mod[M^{t+1}_{j,l-1}]$ specifies the function $\tau(b^{t}_{j,l-1}, a^{t}_j, o^{t+1}_j, b^{t+1}_{j,l-1})$, which is 1 if the belief $b^{t}_{j,l-1}$ in the model $m^{t}_{j,l-1}$, using the action $a^{t}_j$ and observation $o^{t+1}_j$, updates to $b^{t+1}_{j,l-1}$ in a model $m^{t+1}_{j,l-1}$; otherwise, it is 0. We may implement the model update link using standard dependency links and chance nodes, as shown in Fig. ?, and transform an I-DID into a traditional DID.
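The growth bound in the preceding paragraph can be illustrated with a small sketch. The function names and the toy belief update are hypothetical and are not part of the I-DID formalism; the sketch only shows that enumerating every (model, action, observation) triple yields at most the product of the three set sizes, and fewer when updates coincide.

```python
from itertools import product


def max_updated_models(num_models, num_actions, num_observations):
    """Upper bound on the number of models after one model update."""
    return num_models * num_actions * num_observations


def expand_models(beliefs, actions, observations, update):
    """Apply a belief update to every (belief, action, observation) triple.

    `update` is a caller-supplied function (an assumption of this sketch).
    Identical updated beliefs collapse because a set is returned.
    """
    return {update(b, a, o) for b, a, o in product(beliefs, actions, observations)}


def toy_update(belief, action, observation):
    """Hypothetical update: move the belief halfway toward the observation."""
    return round((belief + observation) / 2, 3)


beliefs = [0.2, 0.8]
updated = expand_models(beliefs, ["stay", "move"], [0, 1], toy_update)
bound = max_updated_models(len(beliefs), 2, 2)
```

In this toy case the action does not affect the update, so only four distinct beliefs result even though the bound is eight.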
A level $l$ I-DID of agent $i$ expanded over $T$ time steps is solved in a bottom-up manner. To solve agent $i$'s level $l$ I-DID, all lower level $l-1$ models of agent $j$ must be solved. The solution to a level $l-1$ model, $m_{j,l-1}$, is $j$'s policy, which is a mapping from $j$'s observations in $\Omega_j$ to the optimal decision in $A_j$ given its belief, $b_{j,l-1}$. Subsequently, we may enter $j$'s optimal decisions into the chance node, $A_j$, at each time step and expand $j$'s models in $Mod[M_{j,l-1}]$ corresponding to each pair of $j$'s optimal action and observation. We perform this process for each of the level $l-1$ models of $j$ at each time step, and obtain the fully expanded level $l$ model. We outline the algorithm for exactly solving I-DIDs in Fig. ?.
The computational complexity of solving I-DIDs is mainly due to the exponential growth of the lower-level ($l-1$) models of $j$ over time. Although the space of possible models is very large, not all models need to be considered in the model node. Models that are behaviorally equivalent (BE) [?], that is, models whose behavioral predictions for the other agent are identical, could be pruned and a single representative model retained. This is because the solution of the subject agent's I-DID is affected by the behavior of the other agent only; thus we need not distinguish between BE models. Let PruneBehavioralEq($M_{j,l-1}$) be the procedure that prunes BE models from $M_{j,l-1}$, returning the representative models (line 6).
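A minimal sketch of pruning by behavioral equivalence, under the assumption that each model can be solved to a hashable policy. The name `prune_behavioral_eq` echoes the procedure named above, but this body is illustrative only and is not the authors' implementation; the `solve` callback and the model identifiers are hypothetical.

```python
def prune_behavioral_eq(models, solve):
    """Keep one representative model per distinct solved policy.

    models: iterable of model identifiers.
    solve:  function mapping a model to a hashable policy (an assumption).
    The first model encountered for each policy is kept as representative.
    """
    representatives = {}
    for model in models:
        policy = solve(model)
        representatives.setdefault(policy, model)
    return list(representatives.values())


# Toy usage: two of the three models prescribe the same behavior.
solutions = {
    "m1": ("left", "down"),
    "m2": ("left", "down"),
    "m3": ("right", "up"),
}
kept = prune_behavioral_eq(["m1", "m2", "m3"], solutions.get)
```

Grouping by solved policy rather than by belief reflects the point made above: only the predicted behavior of the other agent matters to the subject agent.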
Note that lines 4-5 (in Fig. ?) solve level $l-1$ I-DIDs or DIDs and then supply the policies to the level $l$ I-DID. Due to the bounded rationality of level $l-1$ agents, the solutions lead to a suboptimal policy of agent $j$, which compromises agent $i$'s performance in the interactions, particularly in a team setting. Also, note that the level 0 models are DIDs that do not involve learning. We will show in the coming sections that solving I-DIDs integrated with RL may generate the expected team behavior between agents $i$ and $j$.
Ad hoc teamwork involves multiple agents working collaboratively in order to optimize the team reward. Each ad hoc agent in the team behaves according to a policy, which maps the agent's observation history or beliefs to the action(s) it should perform. We begin by showing that the finitely-nested hierarchy in I-DIDs does not facilitate ad hoc teamwork. However, augmenting the traditional model space with models whose solution is obtained via reinforcement learning provides a way for team behavior to emerge.
Fig. ? shows an ad hoc team setting of a two-agent grid meeting problem [?]. Each agent can detect the presence of a wall on its right or on its left, or the absence of a wall on both sides. Given a specific observation, the agent may choose to move in one of four directions (south, north, east, or west) or to stay in the same cell. Each ad hoc agent, $i$ or $j$, moves in the grid and collects the reward indicated by the number in the occupied cell. If they move to different cells, the agents get their own individual rewards. However, if they move to the same cell, allowing them to hold an ad hoc meeting, they are rewarded with twice the sum of their individual rewards. Initial positions of the two agents are shown in color and we focus on their immediate actions.
If each agent deliberates at its own level, agent $i$ modeled at level 0 will choose to move left while a level 0 agent $j$ chooses to move down. Each agent will obtain a reward of 15 while the whole team gets 30. Agent $i$ modeled at level 1 and modeling $j$ at level 0 thinks that $j$ will move down, and its own best response to $j$'s predicted behavior is to move left. Analogously, a level 1 agent $j$ would choose to move down. A level 2 agent $i$ will predict that a level 1 $j$ moves down, as mentioned previously, due to which it decides to move left. Analogously, a level 2 agent $j$ continues to decide to move down. We may apply this reasoning inductively to conclude that level $l$ agents $i$ and $j$ would move left and down, respectively, thereby earning a joint reward of 30. However, the optimal team behavior in this setting is for $i$ to move right and $j$ to move up, thereby obtaining a team reward of 40.
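The inductive argument above can be checked on a small one-shot abstraction of the grid. This is a sketch under stated assumptions: the individual reward of 10 for the meeting cell is hypothetical (chosen so that twice the sum gives the team reward of 40 quoted above), each agent is restricted to two actions, and all names are illustrative.

```python
# Individual rewards for each agent's immediate action (10 is an assumption).
INDIVIDUAL = {
    "i": {"left": 15, "right": 10},
    "j": {"down": 15, "up": 10},
}
MEETING = ("right", "up")  # the joint action that puts both agents in one cell


def team_reward(a_i, a_j):
    """Sum of individual rewards, doubled when the agents meet."""
    total = INDIVIDUAL["i"][a_i] + INDIVIDUAL["j"][a_j]
    return 2 * total if (a_i, a_j) == MEETING else total


def level_k_actions(level):
    """Actions chosen by level-k agents i and j.

    Level 0 maximizes its own individual reward; level k best-responds
    to the other agent's level k-1 action with respect to the team reward.
    """
    a_i = max(INDIVIDUAL["i"], key=INDIVIDUAL["i"].get)
    a_j = max(INDIVIDUAL["j"], key=INDIVIDUAL["j"].get)
    for _ in range(level):
        next_i = max(INDIVIDUAL["i"], key=lambda a: team_reward(a, a_j))
        next_j = max(INDIVIDUAL["j"], key=lambda a: team_reward(a_i, a))
        a_i, a_j = next_i, next_j
    return a_i, a_j


joint_optimum = max(
    ((a_i, a_j) for a_i in INDIVIDUAL["i"] for a_j in INDIVIDUAL["j"]),
    key=lambda pair: team_reward(*pair),
)
```

Under these assumed numbers, every nesting level reproduces (left, down), because each agent's best response to the other's level 0 choice is its own level 0 choice, while the joint optimum is (right, up).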
Clearly, these finite hierarchical systems preclude the agents’ optimal teamwork due to the bounded reasoning of the lowest level (level 0) agents. The following observation states this more formally:
Observation
There exist cooperative multiagent settings in which rational intentional agents, each modeled using the finitely-nested I-DID (or I-POMDP), may not choose the jointly optimal behavior of working together as a team.
Notice that an offline specification of level 0 models in cooperative settings is necessarily incomplete. This is because the true benefit of cooperative actions often hinges on others performing supporting actions, which by themselves may not be highly rewarding to the agent. Thus, despite solving the level 0 models optimally, the agent may not engage in optimal team behavior.
In general, this observation holds for cooperative settings where the self-maximizing level 0 models result in predictions that are not consistent with team behavior. Of course, settings may exist where the level 0 model's solution coincides with the policy of a teammate, thereby leading to joint teamwork. Nevertheless, the significance of this observation is that we cannot rely on finitely-nested I-DIDs to generate optimal teammate policies.
We observe that team behavior is challenging in the context we study above because of the bounded rationality imposed by assuming the existence of a level 0. The boundedness precludes modeling others at the same level as one's own, that is, as an equal teammate. At the same time, this imposition is motivated by reasons of computability, which allow us to operationalize such a paradigm, and it allows us to avoid some self-contradicting, and therefore impossible, beliefs that exist when infinite belief hierarchies are considered [?]. Consequently, this work is of significance because it may provide a way of generating optimal team behavior in finitely-nested frameworks, which so far have been utilized for noncooperative settings, and provides a principled approach to solving ad hoc teamwork problems.
We present a principled way to induce team behavior by enhancing the reasoning ability of lower-level agents. While it is difficult to discern a priori the benefit of moving up for agent $j$ in Fig. ?, it could be experienced by the agent. Specifically, it may explore moving in different directions, including moving up, and learn about its benefit from the ensuing, possibly indirect, team reward.
Subsequently, we may expect an agent to learn policies that are consistent with optimal teammate behavior because the corresponding actions provide large reinforcements. For example, given that agent $i$ moves right in Fig. ?, $j$ may choose to move up in its exploration, and thereby receive a large reinforcing reward. This observation motivates formulating level 0 models that utilize RL to generate the predicted policy for the modeled agent. Essentially, we expect that RL with its explorations would compensate for the lack of teamwork caused by bounded reasoning in finitely-nested I-DIDs.
Because the level 0 models generate policies for the modeled agent only, we focus on the modeled agent's learning problem. However, the rewards in the multiagent setting usually depend on the actions of all agents, due to which the other agent must be simulated as well. The other agent's actions are a part of the environment and its presence is hidden at level 0, thereby making the problem one of single-agent learning as opposed to one of multiagent learning.
We augment the level 0 model space, denoted as $M_{j,0}$, by additionally attributing a new type of level 0 model to the other agent $j$: $m'_{j,0} = \langle b_{j,0}, \hat{\theta}'_j \rangle$, where $b_{j,0}$ is $j$'s belief and $\hat{\theta}'_j$ is the frame of the learning model. The frame, $\hat{\theta}'_j$, consists of the learning rate, $\alpha$; a seed policy, $\pi'_j$, of planning horizon, $T$, which includes a fair amount of exploration; and the chance and utility nodes of the DID along with a candidate policy of agent $i$, which could be an arbitrary policy from $i$'s policy space, $\Pi_i$, as agent $i$'s actual behavior is not known. This permits a proper simulation of the environment.
This type of model, $m'_{j,0}$, differs from a traditional DID-based level 0 model in that it does not describe the offline planning process of how agent $j$ optimizes its decisions, but allows $j$ to learn an optimal policy, $\pi_j$, with the learning rate, $\alpha$, either online or in a simulated setting. Different models of agent $j$ differ not only in their learning rates and seed policies, but also in the candidate policy of $i$ that is used. In principle, while the learning rate and seed policies may be held fixed, $j$'s model space could be as large as $i$'s policy space. Consequently, our augmented model space becomes extremely large.
Learning has been applied to solve decision-making problems in both single and multiagent settings. Both model-based [?] and model-free [?, ?] learning approaches exist for solving POMDPs. In the multiagent context, Banerjee et al. [?] utilized a semi-model-based distributed RL for finite horizon DEC-POMDPs. Recently, Ng et al. [?] incorporated model learning in the context of I-POMDPs where the subject agent learns the transition and observation probabilities by augmenting the interactive states with frequency counts.
Because the setting in which the learning takes place is partially observable, RL approaches that compute a table of values for state-action pairs do not apply. We adapt Perkins' Monte Carlo Exploring Starts for POMDPs (MCESP) [?], which has been shown to learn good policies in fewer iterations while making no prior assumptions about the agent's models in order to achieve convergence. MCESP maintains a $Q$ table indexed by observation, $o$, and action, $a$, that gives the value of following a policy, $\pi$, except when observation, $o$, is obtained, at which point action, $a$, is performed. An agent's policy in MCESP maps a single observation to actions over time horizons. We generalize MCESP so that observation histories of length up to $T$, denoted as $\vec{o}_j$, are mapped to actions. A table entry, $Q^{\pi_j}_{\vec{o}_j, a_j}$, is updated over every simulated trajectory of agent $j$, $\tau = \langle o_j^0, a_j^0, r^0, o_j^1, a_j^1, r^1, \ldots, o_j^{T-1}, a_j^{T-1}, r^{T-1} \rangle$, where $r^t$ is the team reward received at step $t$. Specifically, the value is updated as:
$$Q^{\pi_j}_{\vec{o}_j, a_j} \leftarrow (1 - \alpha)\, Q^{\pi_j}_{\vec{o}_j, a_j} + \alpha\, R_{post\text{-}\vec{o}_j}(\tau) \qquad (1)$$
where $\alpha$ is the learning rate and $R_{post\text{-}\vec{o}_j}(\tau)$ is the sum of rewards of the portion of the observation-action sequence, $\tau$, following the first occurrence of $\vec{o}_j$ in $\tau$, say at $t'$: $R_{post\text{-}\vec{o}_j}(\tau) = \sum_{t=t'}^{T-1} \gamma^{t} r^{t}$, where $\gamma$ is the discount factor. Alternate policies are considered by perturbing the action for randomly selected observation histories.
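The incremental update of a table entry can be sketched as follows. This is an illustration rather than the authors' code: the function names are hypothetical, the table is a plain dictionary keyed by (observation history, action), and the return is assumed to discount each reward by the discount factor raised to its absolute time step, counting from the first occurrence of the history.

```python
def post_observation_return(rewards, first_occurrence, gamma):
    """Discounted sum of rewards from the first occurrence onward.

    Assumes reward t is discounted by gamma ** t (absolute time index).
    """
    return sum(
        gamma ** t * rewards[t] for t in range(first_occurrence, len(rewards))
    )


def mcesp_update(q_table, history, action, rewards, first_occurrence, alpha, gamma):
    """Move the entry for (history, action) toward the sampled return."""
    key = (history, action)
    sampled = post_observation_return(rewards, first_occurrence, gamma)
    q_table[key] = (1 - alpha) * q_table.get(key, 0.0) + alpha * sampled
    return q_table[key]


# Toy usage: one trajectory with three team rewards, history first seen at t=1.
q = {}
history = ("wall_left",)
first = mcesp_update(q, history, "east", [0, 1, 2], 1, alpha=0.5, gamma=0.5)
second = mcesp_update(q, history, "east", [0, 1, 2], 1, alpha=0.5, gamma=0.5)
```

With a constant learning rate the entry is an exponentially weighted average of sampled returns, so repeated identical samples pull it toward the sampled value.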
Level 0 agent $j$ learns its policy while agent $i$'s actions are hidden in the environment. In other words, agent $j$ needs to reason with the unknown behavior of $i$ while it learns its level 0 policy using the generalized MCESP algorithm. Agent $j$ considers the entire policy space of agent $i$, $\Pi_i$, and each fixed policy of $i$, $\pi_i$ ($\pi_i \in \Pi_i$), results in one learned policy of $j$, $\pi_j$.
We show the algorithm for solving level 0 models using the generalized MCESP in Fig. ?. The algorithm takes as input agent $j$'s model whose solution is needed and the policy of $i$, which becomes a part of the environment. We repeatedly obtain a trajectory, $\tau$, of length $T$ either by running the agent online or by simulating the environment, sampling the states, actions and observations from the appropriate CPTs (lines 5-10). The trajectory is used in evaluating the value of the current policy, $\pi_j$, of agent $j$ (line 11). Initially, we utilize the seed policy contained in agent $j$'s model. If another action, $a'_j$, for the observation sequence, $\vec{o}_j$, is optimal, we update $\pi_j$ to conditionally use this action; otherwise the policy remains unchanged (lines 12-13). This is followed by generating a perturbed policy in the neighborhood of the previous one (line 14), and the evaluation cycle is repeated. If the perturbation is discarded several times, we may terminate the iterations and return the current policy.
Level 0 agent $j$ learns its policy while agent $i$'s actions are a part of the environment. As we mentioned previously, agent $j$'s level 0 model space is inclusive of $i$'s policy space, $\Pi_i$. As the space of $i$'s policies becomes large, particularly for a large planning horizon, it is intractable for $j$ to learn a policy for every policy of $i$. In addition, considering that few of $i$'s policies are actually collaborative, we formulate a principled way to reduce the full space to those policies of $i$, denoted as $\hat{\Pi}_i$, that could be collaborative.
We begin by picking a random initial policy of $i$ and using it in the frame of a new model of $j$. We apply generalized MCESP to this frame to obtain a candidate policy of agent $j$, $\pi_j$. Next, both the initial policy of $j$ used by MCESP and $i$'s policy are set to $\pi_j$. MCESP then checks the neighbors of $\pi_j$ for policies that would improve on the joint utility of ($\pi_j$, $\pi_j$). If successful, an improved neighboring policy, say $\pi'_j$, is returned. This ensures that the joint utility of ($\pi_j$, $\pi'_j$) is greater than that of ($\pi_j$, $\pi_j$). We continue these iterations, setting $i$'s policy to $\pi'_j$ and using $\pi'_j$ as the seed policy. MCESP may not improve on $\pi'_j$ if it is the (local) best response to $i$'s policy. Otherwise, both policies are added to the set of candidate predictions of level 0 behavior of $j$. The process is restarted with a different random policy of agent $i$. We demonstrate this method on the 3×3 Grid domain in Fig. ?.
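The restart-and-improve loop can be illustrated on a one-shot team game. This is a loose analogue only: exact best responses are swapped in for MCESP learning, single actions stand in for policies, the two agents are given separate action sets, and the reward numbers (including the value 10 for the meeting cell) and all names are hypothetical.

```python
INDIVIDUAL_I = {"left": 15, "right": 10}
INDIVIDUAL_J = {"down": 15, "up": 10}


def team_reward(a_i, a_j):
    """Sum of individual rewards, doubled when the agents meet."""
    total = INDIVIDUAL_I[a_i] + INDIVIDUAL_J[a_j]
    return 2 * total if (a_i, a_j) == ("right", "up") else total


def improve_jointly(seed_i, i_actions, j_actions, reward):
    """Alternate improvements until the joint utility stops increasing.

    j first responds to the seed action of i; i then looks for a strictly
    better action against j, and j responds again. The loop terminates
    because the joint utility strictly increases on every pass.
    """
    a_i = seed_i
    a_j = max(j_actions, key=lambda a: reward(a_i, a))
    while True:
        best_i = max(i_actions, key=lambda a: reward(a, a_j))
        if reward(best_i, a_j) <= reward(a_i, a_j):
            return a_i, a_j
        a_i = best_i
        a_j = max(j_actions, key=lambda a: reward(a_i, a))


i_actions = list(INDIVIDUAL_I)
j_actions = list(INDIVIDUAL_J)
# Restart from every seed of i and collect the resulting joint behaviors.
candidates = {improve_jointly(s, i_actions, j_actions, team_reward) for s in i_actions}
```

Different restarts can end in different local optima, which is why several seeds are tried and every resulting pair is kept as a candidate.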
Solving augmented I-DIDs is similar to solving traditional I-DIDs except that the candidate models of the agent at level 0 may be learning models. We show the revised algorithm in Fig. ?. When the level 0 model is a learning model, the algorithm invokes the method Level 0 RL shown in Fig. ?. Otherwise, it follows the same procedure as shown in Fig. ? to recursively solve the lower level models.
While we consider a reduced space of agent $i$'s policies in a principled way, and therefore of agent $j$'s learning models, we may further reduce agent $j$'s policy space by keeping the top $K$ policies of $j$, $K > 0$, in terms of their expected utilities (line 11 in Fig. ?). Observe that across models that differ in $i$'s policy and have the same initial belief, the team behavior(s) is guaranteed to generate the largest utility in a cooperative problem. This motivates focusing on models with higher utilities. Hence, the filtering of $j$'s policy space may not compromise the quality of I-DID solutions at level 1. However, because MCESP may converge to a local optimum, the resulting top $K$ policies are not guaranteed in theory to include $j$'s optimal collaborative policies, although, as our experiments demonstrate, we often obtain the optimal team behavior. As the number of optimal policies is unknown, we normally use a sufficiently large value of $K$.
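Keeping the highest-utility policies amounts to a simple selection step, sketched below. The function name, the dictionary representation and the example utilities are assumptions made for illustration.

```python
import heapq


def top_k_policies(expected_utility, k):
    """Return the k policies with the largest expected utility, best first.

    expected_utility: dict mapping a policy identifier to its expected utility.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return heapq.nlargest(k, expected_utility, key=expected_utility.get)


# Toy usage with hypothetical utilities for three learned policies.
utilities = {"p1": 30.0, "p2": 40.0, "p3": 25.0}
kept = top_k_policies(utilities, 2)
```

If `k` exceeds the number of policies, `heapq.nlargest` simply returns all of them in descending order of utility.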
Agent $j$'s policy space will be additionally reduced because behaviorally equivalent models (learning and other models with identical solutions) will be clustered [?]. In summary, we take several steps to limit the impact of the increase in $j$'s model space. Using a subset of $i$'s policies preempts solving all of $j$'s models at level 0, while the top $K$ technique removes possibly noncollaborative policies.
Our experiments show that I-DIDs augmented with level 0 models that learn facilitate team behavior, which was previously implausible. In addition, we show the applicability of Aug. I-DIDs to ad hoc teamwork in a setting similar to the one used by Wu et al. [?]. We empirically evaluate the performance in three well-known cooperative domains involving two agents, $i$ and $j$: grid meeting (Grid) [?], box-pushing (BP) [?], and multi-access broadcast channel (MABC) [?]. In the BP domain, each agent intends to push either a small or large box into the goal region. The agents' rewards are maximum when both of them cooperatively push the large box into the goal. In the MABC problem, nodes need to broadcast messages to each other over a channel. Only one node may broadcast at a time; otherwise a collision occurs. The nodes attempt to maximize the throughput of the channel.
| Domain | T | Initial level 0 models of $j$ | Reduced policy subset of $i$ | Dimension |
|---|---|---|---|---|
| Grid | 3 | 100 | 32 | =9, =81, =3, =5 |
| | 4 | 200 | 100 | |
| BP | 3 | 100 | 32 | =50, =5, =4 |
| MABC | 3 | 100 | 32 | =2, =4, =2, =2 |
| | 4 | 100 | 64 | |
| | 5 | 200 | 64 | |

Table: Domain Dimension and Experimental Settings

We summarize the domain properties and parameter settings of the Aug. I-DID in Table ?. Note that the third column gives the number of initial models of agent $j$ at level 0 and the fourth gives the size of the subset of $i$'s policies generated using the approach mentioned earlier, which allows us to reduce the full space of $i$'s policies to those that are possibly collaborative.
Experimental Settings. We implemented the augmented I-DID algorithm shown in Fig. ?, including an implementation of the generalized MCESP for performing level 0 RL.
We demonstrate the performance of the augmented framework toward generating team behavior. We compare the expected utility of agent $i$'s policies with the values of the optimal team policies obtained using an exact algorithm, GMAA*-ICE, for DEC-POMDP formulations of the same problem domains [?]. We also compare with the values obtained by traditional I-DIDs. All I-DIDs are solved using the exact discriminative model update (DMU) method [?]. For both traditional and Aug. I-DIDs, we utilized models at level 0 that differ in the initial beliefs or in the frame. We adopt two model weighting schemes: (i) Uniform: all policies are uniformly weighted; (ii) Diverse: policies with larger expected utility are assigned proportionally larger weights. Note that we maintain only the top $K$ learning and nonlearning models by expected utility (out of the initial level 0 models) while solving Aug. I-DIDs. Though the model space is significantly enlarged by the learning policies, Aug. I-DIDs become tractable when both the top $K$ and equivalence techniques are applied.
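The two weighting schemes can be sketched as below. Reading "proportionally larger weights" as weights exactly proportional to expected utility is an assumption of this sketch, as is the restriction to non-negative utilities; the function names are hypothetical.

```python
def uniform_weights(policies):
    """Assign the same weight to every policy."""
    policies = list(policies)
    return {p: 1.0 / len(policies) for p in policies}


def diverse_weights(expected_utility):
    """Assign weights proportional to expected utility.

    Assumes non-negative utilities with a positive sum; other cases would
    need a shift or another normalization, which is not specified here.
    """
    total = sum(expected_utility.values())
    if total <= 0 or min(expected_utility.values()) < 0:
        raise ValueError("utilities must be non-negative with a positive sum")
    return {p: u / total for p, u in expected_utility.items()}


# Toy usage with two hypothetical policies.
uniform = uniform_weights(["a", "b"])
diverse = diverse_weights({"a": 30.0, "b": 10.0})
```

Both functions return a probability distribution over the models, so either can serve as the initial belief of the subject agent over the other agent's models.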
| Domain | K | Aug. I-DID (Uniform) | Aug. I-DID (Diverse) | Trad. I-DID (Uniform) | DEC-POMDP (GMAA*-ICE) |
|---|---|---|---|---|---|
| Grid (T=3) | 32 | 41.875 | 41.93 | 25.70 | 43.86 |
| | 16 | 40.95 | 41.93 | | |
| | 8 | 40.95 | 41.93 | | |
| Grid (T=4) | 100 | 37.15 | 53.26 | 21.55 | 58.75 |
| | 64 | 35.33 | 53.26 | | |
| | 32 | 35.33 | 53.26 | | |
| BP (T=3) | 32 | 73.45 | 76.51 | 4.75 | 85.18 |
| | 16 | 73.45 | 76.51 | | |
| | 8 | 71.36 | 76.51 | | |
| MABC (T=3) | 32 | 2.12 | 2.30 | 1.79 | 2.99 |
| | 16 | 2.12 | 2.30 | | |
| | 8 | 2.12 | 2.30 | | |
| MABC (T=4) | 64 | 3.13 | 3.17 | 2.80 | 3.89 |
| | 32 | 3.13 | 3.17 | | |
| | 16 | 3.13 | 3.17 | | |
| MABC (T=5) | 64 | 4.08 | 4.16 | 3.29 | 4.79 |
| | 32 | 3.99 | 4.16 | | |
| | 16 | 3.99 | 4.16 | | |

Table: Performance comparison between the trad. I-DID, aug. I-DID, and GMAA*-ICE in terms of the expected utility. The traditional I-DID and DEC-POMDP values are reported once per domain and horizon.

Performance Evaluation. In Table ?, we observe that the Aug. I-DID significantly outperforms the traditional I-DID, where the level 0 agent does not learn. The Aug. I-DID's solutions approach the globally optimal team behavior as generated by GMAA*-ICE. In cooperative games, the globally optimal solution is the Pareto-optimal Nash equilibrium. We observe that larger weights on the learned policies lead to better-quality policies for $i$. This restates the importance of the augmented level 0 models of $j$ that learn. The small gap from the optimal DEC-POMDP value is due to the uncertainty over different models of $j$. Note that DEC-POMDPs are informed about the initial belief setting (and do not face the issue of bounded rationality), whereas I-DIDs are not and consider the entire candidate model space of $j$. Furthermore, the Aug. I-DID generates optimal team behavior identical to that provided by GMAA*-ICE when $i$'s belief places probability 1 on the true model of $j$, as is the case for DEC-POMDPs. Increasing $K$ does not have a significant impact on the performance, as it is already large enough to cover a large fraction of the collaborative policies of agent $j$, including the optimal teammate.
In Fig. ?, we illustrate the reduction of the model space that occurs due to smaller values of $K$, which facilitates efficiency in the solution of the Aug. I-DID. Though the augmented level 0 model space is much larger than that of its traditional counterpart, the growth in the number of models is limited due to the top $K$ heuristic.
Experimental Settings. We test the performance of the augmented I-DIDs in ad hoc applications involving different teammate types, particularly when the teammates' policies are not effective in advancing the joint goal (i.e., ad hoc), and compare it with a well-known ad hoc planner, OPAT. Teammate types include: (i) Random, when the teammate plays according to a randomly generated action sequence for the entire length of the run. Some predefined random seeds are used to guarantee that each test has the same action sequences. (ii) Predefined, when the teammate plays according to some predefined patterns, which are sequences of random actions with some fixed repeated lengths that are randomly chosen at the beginning of each run. For example, if the action pattern is "1324" and the repetition value is 2, the resulting action sequence will be "11332244". (iii) Optimal, when the teammate plays rationally and adaptively. OPAT uses an optimal teammate policy for simulations, which is computed offline with the help of a generative model by value iteration. Note that OPAT in its original form assumes perfect observability of the state and joint actions. For comparison, we generalized OPAT to partially observable settings by considering observation sequences.
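The predefined teammate's action sequence from the "1324" example can be generated as follows. The expansion rule matches the example given above; cycling the expanded pattern to fill a longer run is an assumption of this sketch, and the function names are hypothetical.

```python
def expand_pattern(pattern, repetition):
    """Repeat each action symbol of the pattern `repetition` times."""
    return "".join(symbol * repetition for symbol in pattern)


def teammate_actions(pattern, repetition, steps):
    """Action sequence of the given length, cycling the expanded pattern.

    Cycling is an assumption; the text only specifies the expansion itself.
    """
    expanded = expand_pattern(pattern, repetition)
    return [expanded[t % len(expanded)] for t in range(steps)]


sequence = expand_pattern("1324", 2)
run = teammate_actions("1324", 2, 10)
```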
Additionally, in order to speed up the generation of RL models at level 0, we implemented an approximate version of our generalized MCESP called the Sample Average Approximation (MCESP-SAA), which estimates action values by taking a fixed number of sample trajectories and then comparing the sample averages [?]. For MCESP-SAA, we used a sample size of 25 trajectories to compute the approximate value of the policy that generated them. We set =0.9, and terminate the RL (line 15 in Fig. ?) if no policy changes are recommended after taking samples of the value of each observation sequence-action pair [?]. We also tested with some domain-specific seed policies to investigate speedup in the convergence of MCESP.
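The sample-average idea can be sketched as a sweep that compares averaged returns of alternative actions for each observation history and stops when no change is recommended. This is an illustration under assumptions, not the MCESP-SAA implementation used in the experiments: the function names, the `sample_return` callback, the sweep cap and the toy reward are all hypothetical.

```python
import random
from statistics import mean


def saa_policy_search(policy, histories, actions, sample_return, k=25,
                      max_sweeps=100, seed=0):
    """Improve a policy by comparing sample-average returns.

    policy:        dict mapping observation history -> action (seed policy).
    sample_return: function (policy, rng) -> sampled return of one trajectory.
    k:             number of sample trajectories averaged per candidate.
    Stops when a full sweep recommends no policy change.
    """
    rng = random.Random(seed)
    policy = dict(policy)
    for _ in range(max_sweeps):
        changed = False
        for history in histories:
            estimates = {}
            for action in actions:
                trial = dict(policy)
                trial[history] = action
                estimates[action] = mean(sample_return(trial, rng) for _ in range(k))
            best = max(estimates, key=estimates.get)
            if estimates[best] > estimates[policy[history]]:
                policy[history] = best
                changed = True
        if not changed:
            break
    return policy


# Toy usage: a deterministic return that counts matches with a target policy.
target = {"wall_left": "east", "wall_right": "west"}


def toy_return(pol, rng):
    return float(sum(pol[h] == target[h] for h in target))


seed_policy = {"wall_left": "stay", "wall_right": "stay"}
learned = saa_policy_search(seed_policy, list(target), ["stay", "east", "west"], toy_return)
```

With noisy returns, a larger `k` reduces the chance that a worse action is adopted because of sampling error, at the cost of more simulated trajectories.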
Simulations were run for 20 steps and the average of the cumulative rewards over 10 trials is reported for similar teammate settings for the 3 problems. We show that the augmented I-DID solution significantly outperforms OPAT solutions in all problem domains for random and predefined teammates while performing comparably for optimal ones.
| Domain | Ad hoc teammate | OPAT | Aug. I-DID |
|---|---|---|---|
| Grid (T=20, lookahead=3) | Random | 12.25 ± 1.26 | 14.2 ± 0.84 |
| | Predefined | 11.7 ± 1.63 | 16.85 ± 1.35 |
| | Optimal | 28.35 ± 2.4 | 27.96 ± 1.92 |
| BP (T=20, lookahead=3) | Random | 29.26 ± 2.17 | 36.15 ± 1.95 |
| | Predefined | 41.1 ± 1.55 | 54.43 ± 3.38 |
| | Optimal | 52.11 ± 0.48 | 59.2 ± 1.55 |
| MABC (T=20, lookahead=3) | Random | 9.68 ± 1.37 | 12.13 ± 1.08 |
| | Predefined | 12.8 ± 0.65 | 13.22 ± 0.21 |
| | Optimal | 16.64 ± 0.28 | 15.97 ± 1.31 |

Table: Baseline comparison with OPAT with different types of teammates. Each data point is the average of 10 runs.

Performance Evaluation. Table ? shows that Aug. I-DIDs significantly outperform OPAT, especially when the other agents are of the random or predefined types, in all three problem domains (Student's t-test, significant at the 0.001 level for both), except when the teammate is of type predefined in the MABC problem, where the improvement over OPAT was not significant at the 0.05 level (p-value = 0.0676). The Aug. I-DID's better performance is in part due to the sophisticated belief update that gradually increases the probability on the true model if it is present in agent $i$'s model space, as shown in Fig. ? for MABC. As expected, both the OPAT and Aug. I-DID based ad hoc agents perform better when the other agent in the team is optimal in comparison to the random or predefined type. Aug. I-DIDs perform significantly better than OPAT when faced with optimal teammates in the BP domain, while the results for the other domains are similar.
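The effect of a belief update that shifts probability toward the true teammate model can be illustrated with a plain Bayesian update over a small set of candidate models. This is a simplified stand-in for the I-DID belief update, which also tracks the physical state; the function name, the model labels and the likelihood values are hypothetical.

```python
def update_model_belief(belief, likelihood, observation):
    """Posterior over teammate models after one observation.

    belief:     dict mapping model -> prior probability.
    likelihood: function (model, observation) -> probability of the
                observation if the teammate follows that model (assumed given).
    """
    unnormalized = {m: p * likelihood(m, observation) for m, p in belief.items()}
    total = sum(unnormalized.values())
    if total == 0:
        return dict(belief)  # observation impossible under every model: keep prior
    return {m: v / total for m, v in unnormalized.items()}


# Toy usage: the observation is more likely under the "optimal" model.
def toy_likelihood(model, observation):
    return {"random": 0.25, "optimal": 0.85}[model]


prior = {"random": 0.5, "optimal": 0.5}
after_one = update_model_belief(prior, toy_likelihood, "meeting")
after_two = update_model_belief(after_one, toy_likelihood, "meeting")
```

Repeated observations that are better explained by one model concentrate the belief on it, which mirrors the gradual increase described above.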
In summary, the Aug. IDID maintains a probability distribution over different types of teammates and updates both the distribution and the types over time, which differs from OPAT's focus on a single optimal teammate behavior during planning. Consequently, Aug. IDIDs adapt better, as examined above. Further experiments on the robustness of Aug. IDIDs in dynamic ad hoc settings showed that the ad hoc agent obtained significantly better average rewards than OPAT (p-value = 0.042) in the MABC domain, for the setting where the other agent is of type predefined and, after 15 steps, is substituted by an optimal type for the remaining 15 steps.
In Fig. ?, we show the online run times for the Aug. IDID and generalized OPAT approaches on the three problem domains. As expected, OPAT takes significantly less time because it approximates the problem by solving a series of stage games that model the other agent using a single type. For Aug. IDIDs, we observe that generating and solving the added learning models consumes the major portion of the total time. We show the learning overhead for the Grid, BP, and MABC domains in red in the figure. To reduce this overhead and speed up Aug. IDIDs, an avenue for future work is to try other RL methods in place of MCESP.
Although we recognize that the learning component (MCESP) is the bottleneck to scaling augmented IDIDs to larger problems, we were successful in obtaining the optimal teammate policies using augmented IDIDs (the same as those computed by GMAA*-ICE) in Grid, BP for T=4, and MABC for T=5. For these larger problems, we also noticed a significant improvement in the values obtained by augmented IDIDs over their traditional counterparts, as shown in Table ?. In the larger Grid domain for =3, the augmented IDID achieves the optimal team value of 29.6, compared to 19.82 obtained by solving the traditional IDID. A better substitute for MCESP and other approximation techniques will allow us to further scale up augmented IDIDs.
Self-interested individual decision makers face hierarchical (or nested) belief systems in their multiagent planning problems. In this paper, we explicate one negative consequence of bounded rationality: the agent may not behave as an optimal teammate. In the IDID framework, which models individual decision makers who recursively model other agents, we show that reinforcement learning integrated with planning allows the models to produce sophisticated policies. For the first time, we see the principled and comprehensive emergence of team behavior in IDID solutions, which facilitates the application of IDIDs to ad hoc team settings, for which they are naturally well suited. We show that integrating learning in the context of IDIDs helps address a few fundamental challenges in ad hoc teamwork: building a single autonomous agent that can plan individually in partially observable environments by adapting to different kinds of teammates, while making no assumptions about its teammates' behavior or beliefs and seeking to converge to their true types. Augmented IDIDs compare well with a standard baseline algorithm, OPAT.
While individual decision-making frameworks such as I-POMDPs and IDIDs are thought to be well suited for noncooperative domains, we show that they may be applied to cooperative domains as well. Integrating learning while planning provides a middle ground (or a bridge) between multiagent planning frameworks such as Dec-POMDPs and joint learning for cooperative domains [?]. Additionally, augmented IDIDs differentiate themselves from other centralized cooperative frameworks by focusing on the behavior of an individual agent in a multiagent setting. While we recognize that the introduction of learning-based models adds a significant challenge to scaling IDIDs to larger problems, we successfully obtained optimal teammate policies using Aug. IDIDs in Grid and BP using a combination of intuitive pruning techniques. By allowing models formalized as IDIDs or DIDs to vary in their beliefs and frames, we considered an exhaustive and general space of models during planning. The convergence of RL is not predicated on any prior assumptions about others' models. Immediate lines of future work involve improving the scalability of the framework, particularly its learning component, and applying it to larger problems.
 1 A. Brandenburger and E. Dekel. Hierarchies of beliefs and common knowledge. Journal of Economic Theory, 1993.
 2 S. Albrecht and S. Ramamoorthy. A game-theoretic model and best-response learning method for ad hoc coordination in multiagent systems (extended abstract). In AAMAS, pages 1155–1156, 2013.
 3 R. J. Aumann. Interactive epistemology II: Probability. International Journal of Game Theory, 28:301–314, 1999.
 4 B. Banerjee, J. Lyle, L. Kraemer, and R. Yellamraju. Solving finite horizon decentralized POMDPs by distributed reinforcement learning. In AAMAS Workshop on MSDM, pages 9–16, 2012.
 5 D. S. Bernstein, R. Givan, N. Immerman, and S. Zilberstein. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4):819–840, 2002.
 6 D. S. Bernstein, E. A. Hansen, and S. Zilberstein. Bounded policy iteration for decentralized POMDPs. In IJCAI, 2005.
 7 K. Binmore. Essays on Foundations of Game Theory. Pittman, 1982.
 8 C. F. Camerer, T.-H. Ho, and J.-K. Chong. A cognitive hierarchy model of games. The Quarterly Journal of Economics, 119(3):861–898, 2004.
 9 L. Chrisman. Reinforcement learning with perceptual aliasing: the perceptual distinctions approach. In AAAI, pages 183–188, 1992.
 10 P. Doshi, Y. Zeng, and Q. Chen. Graphical models for interactive POMDPs: Representations and solutions. JAAMAS, 18(3):376–416, 2009.
 11 P. Gmytrasiewicz and P. Doshi. A framework for sequential planning in multiagent settings. JAIR, 24:49–79, 2005.
 12 E. A. Hansen, D. S. Bernstein, and S. Zilberstein. Dynamic programming for partially observable stochastic games. In AAAI, pages 709–715, 2004.
 13 T. N. Hoang and K. H. Low. Interactive POMDP Lite: Towards practical planning to predict and exploit intentions for interacting with self-interested agents. In IJCAI, pages 2298–2305, 2013.
 14 J. Mertens and S. Zamir. Formulation of Bayesian analysis for games with incomplete information. International Journal of Game Theory, 14:1–29, 1985.
 15 N. Meuleau, L. Peshkin, K.-E. Kim, and L. P. Kaelbling. Learning finite-state controllers for partially observable environments. In UAI, pages 427–436, 1999.
 16 B. Ng, K. Boakye, C. Meyers, and A. Wang. Bayes-adaptive interactive POMDPs. In AAAI, 2012.
 17 L. Panait and S. Luke. Cooperative multiagent learning: The state of the art. JAAMAS, 11(3):387–434, 2005.
 18 T. J. Perkins. Reinforcement learning for POMDPs based on action values and stochastic optimization. In AAAI, pages 199–204, 2002.
 19 D. Pynadath and S. Marsella. Minimal mental models. In AAAI, pages 1038–1044, 2007.
 20 D. V. Pynadath and M. Tambe. The communicative multiagent team decision problem: Analyzing teamwork theories and models. Journal of Artificial Intelligence Research, 16:389–423, 2002.
 21 S. Seuken and S. Zilberstein. Improved memory-bounded dynamic programming for decentralized POMDPs. In UAI, 2007.
 22 M. Spaan and F. Oliehoek. The multiagent decision process toolbox: Software for decision-theoretic planning in multiagent systems. In AAMAS Workshop on MSDM, pages 107–121, 2008.
 23 P. Stone, G. A. Kaminka, S. Kraus, and J. S. Rosenschein. Ad hoc autonomous agent teams: Collaboration without pre-coordination. In AAAI, 2010.
 24 F. Wu, S. Zilberstein, and X. Chen. Online planning for ad hoc autonomous agent teams. In IJCAI, pages 439–445, 2011.
 25 Y. Zeng and P. Doshi. Exploiting model equivalences for solving interactive dynamic influence diagrams. JAIR, 43:211–255, 2012.
