Towards Deployment of Robust AI Agents
for Human-Machine Partnerships
Abstract
We study the problem of designing AI agents that can robustly cooperate with people in human-machine partnerships. Our work is inspired by real-life scenarios in which an AI agent, e.g., a virtual assistant, has to cooperate with new users after its deployment. We model this problem via a parametric MDP framework where the parameters correspond to a user’s type and characterize her behavior. In the test phase, the AI agent has to interact with a user of unknown type. Our approach to designing a robust AI agent relies on observing the user’s actions to make inferences about the user’s type and adapting its policy to facilitate efficient cooperation. We show that without being adaptive, an AI agent can end up performing arbitrarily badly in the test phase. We develop two algorithms for computing policies that automatically adapt to the user in the test phase. We demonstrate the effectiveness of our approach in solving a two-agent collaborative task.
1 Introduction
An increasing number of AI systems are deployed in human-facing applications like autonomous driving, medicine, and education Yu et al. (2017). In these applications, the human user and the AI system (agent) form a partnership, necessitating mutual awareness for achieving optimal results Hadfield-Menell et al. (2016); Wilson and Daugherty (2018); Amershi et al. (2019). For instance, to provide high utility to a human user, it is important that an AI agent can account for a user’s preferences defining her behavior and act accordingly, thereby being adaptive to the user’s type Nikolaidis et al. (2015, 2017a); Amershi et al. (2019); Tschiatschek et al. (2019); Haug et al. (2018). As a concrete example, an AI agent for autonomous driving applications should account for a user’s preference to take scenic routes instead of the fastest route and account for the user’s need for more AI support when driving manually in confusing situations.
AI agents that do not account for the user’s preferences and behavior typically degrade the utility for their human users. Accounting for them is challenging, however, because the AI agent needs to (a) infer information about the interacting user and (b) be able to interact efficiently with a large number of different human users, each possibly showing different behaviors. In particular, during development of an AI agent, it is often only possible to interact with a limited number of human users, and the AI agent needs to generalize to new users after deployment (or quickly acquire the information needed to do so). This resembles multi-agent reinforcement learning settings in which an AI agent faces unknown agents at test time Grover et al. (2018) and the cold-start problem in recommender systems Bobadilla et al. (2012).
In this paper, we study the problem of designing AI agents that, after deployment, can robustly cooperate with new unknown users for human-machine partnerships in reinforcement learning (RL) settings. In these problems, the AI agent often only has access to the reward information during its development while no (explicit) reward information is available once the agent is deployed. As shown in this paper, an AI agent can only achieve high utility in this setting if it is adaptive to its user, while a non-adaptive AI agent can perform arbitrarily badly. We propose two adaptive policies for our considered setting, one of which comes with strong theoretical robustness guarantees at test time, while the other is inspired by recent deep-learning approaches for RL and is easier to scale to larger problems. Both policies build upon inferring the human user’s properties and leverage these inferences to act robustly.
Our approach is related to ideas of multi-task learning, meta-learning, and generalization in reinforcement learning. However, most of these approaches require access to reward information at test time and rarely offer theoretical guarantees for robustness (see the discussion of related work in Section 7).
Below, we highlight our main contributions:

We provide an algorithmic framework for designing robust policies for interacting with agents of unknown behavior. Furthermore, we prove robustness guarantees for approaches building on our framework.

We propose two policies according to our framework: AdaptPool, which pre-computes a set of best-response policies and executes them adaptively based on inferences of the type of human user; and AdaptDQN, which implements adaptive policies by a neural network in combination with an inference module.

We empirically demonstrate that our proposed policies achieve significantly better worst-case performance than non-adaptive baselines when facing an unknown user.
2 The Problem Setup
We formalize the problem through a reinforcement learning (RL) framework. The agents are hereafter referred to as agent and agent : here, agent represents the AI agent whereas agent could be a person, i.e., human user. Our goal is to develop a learning algorithm for agent that leads to high utility even in cases when the behavior of agent and its committed policy is unknown.
2.1 The model
We model the preferences and induced behavior of agent via a parametric space . From agent ’s perspective, each leads to a parameterized MDP consisting of the following:

a set of states , with denoting a generic state.

a set of actions , with denoting a generic action of agent .

a transition kernel parameterized by as , which is a tensor with indices defined by the current state , the agent ’s action , and the next state . In particular, , where is sampled from agent ’s policy in state . That is, corresponds to the transition dynamics derived from a two-agent MDP with transition dynamics and agent ’s policy .

a reward function parameterized by as for . This captures the preferences of agent that agent should account for.

a discount factor weighing short-term rewards against long-term rewards.

an initial state distribution .
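As a minimal sketch of the transition kernel described in the list above, the dynamics perceived by the AI agent can be obtained by averaging a two-agent transition tensor over the user's policy. The array layout and the names `joint_T`, `user_policy`, and `induced_kernel` are assumptions made for illustration; they do not come from the paper.

```python
import numpy as np

def induced_kernel(joint_T, user_policy):
    """Marginalize the user's action out of a two-agent transition tensor.

    joint_T[s, a, u, t] : P(next state t | state s, AI action a, user action u)
    user_policy[s, u]   : P(user action u | state s) for one fixed user type
    returns T[s, a, t]  : dynamics as perceived by the AI agent
    (index layout is a hypothetical choice for this sketch)
    """
    return np.einsum("su,saut->sat", user_policy, joint_T)

# Random toy instance: 3 states, 2 AI actions, 2 user actions.
rng = np.random.default_rng(0)
n_states, n_ai_actions, n_user_actions = 3, 2, 2
joint_T = rng.random((n_states, n_ai_actions, n_user_actions, n_states))
joint_T /= joint_T.sum(axis=-1, keepdims=True)   # normalize over next states

user_policy = rng.random((n_states, n_user_actions))
user_policy /= user_policy.sum(axis=-1, keepdims=True)

T_theta = induced_kernel(joint_T, user_policy)
```

Each user type induces a different `user_policy`, and hence a different single-agent kernel for the AI agent, even though the underlying two-agent dynamics are shared.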
Our goal is to develop a learning algorithm that achieves high utility even in cases when is unknown. In line with the motivating applications discussed above, we consider the following two phases:

Training (development) phase. During development, our learning algorithm can iteratively interact with a limited number of different MDPs for : here, agent can observe rewards as well as agent ’s actions needed for learning purposes.

Test (deployment) phase. After deployment, our learning algorithm interacts with a parameterized MDP as described above for unknown : here, agent only observes agent ’s actions but not rewards.
2.2 Utility of agent
For a fixed policy of agent , we define its total expected reward in the MDP as follows:
(1) 
where the expectation is over the stochasticity of policy and the transition dynamics . Here is the state at time . For , this comes from the distribution .
For known .
When the underlying parameter is known, the task of finding the best response policy of agent reduces to the following:
(2) 
where defines the set of stationary Markov policies.
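Under notation assumed here for illustration ($\theta$ for the user's type, $R_\theta$ and $T_\theta$ for the reward function and transition kernel, $\gamma$ for the discount factor, $D_0$ for the initial state distribution, and $\Pi$ for the set of stationary Markov policies), the total expected reward and the best response for a known type would take the standard discounted form:

```latex
J_{\theta}(\pi) \;=\; \mathbb{E}\Big[\sum_{\tau=1}^{\infty} \gamma^{\tau-1}\, R_{\theta}(s_\tau, a_\tau) \,\Big|\, D_0,\, T_{\theta},\, \pi\Big],
\qquad
\pi^{*}_{\theta} \;\in\; \operatorname*{arg\,max}_{\pi \in \Pi} J_{\theta}(\pi).
```

The symbols $J_\theta$ and $\pi^*_\theta$ are labels chosen for this sketch; the indexing from $\tau = 1$ follows the remark that the first state is drawn from the initial state distribution.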
For unknown .
However, when the underlying parameter is unknown, we define the best response (in a min-max sense) policy of agent as:
(3) 
Clearly, . In general, this gap can be arbitrarily large, as formally stated in the following theorem.
Theorem 1.
There exists a problem instance where the performance of agent can be arbitrarily worse when agent ’s type is unknown. In other words, the gap is arbitrarily large.
The proof is presented in the supplementary material. Theorem 1 shows that the performance of agent can be arbitrarily bad when it does not know and is restricted to executing a fixed stationary Markov policy. In the next section, we present an algorithmic framework for designing robust policies for agent for unknown .
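To make the statement of Theorem 1 concrete, the following toy example shows how the gap between a type-aware policy and the best fixed policy can grow without bound. It is not the construction used in the proof: the one-state problem, the two user types, and the `penalty` parameter are all assumptions made for illustration.

```python
import numpy as np

def fixed_policy_value(p, theta, penalty, gamma):
    """Discounted return of playing action 1 with probability p forever.

    One-state toy problem: matching the user's type (0 or 1) pays +1,
    a mismatch pays -penalty.
    """
    p_match = p if theta == 1 else 1.0 - p
    per_step = p_match * 1.0 + (1.0 - p_match) * (-penalty)
    return per_step / (1.0 - gamma)

def minmax_gap(penalty, gamma=0.9, grid=1001):
    """Gap between knowing the type and the best fixed policy in the worst case."""
    ps = np.linspace(0.0, 1.0, grid)
    worst_case = np.array([
        min(fixed_policy_value(p, 0, penalty, gamma),
            fixed_policy_value(p, 1, penalty, gamma))
        for p in ps
    ])
    best_fixed = worst_case.max()          # min-max fixed stationary policy
    best_known = 1.0 / (1.0 - gamma)       # always match the known type
    return best_known - best_fixed

for penalty in (1.0, 10.0, 100.0):
    print(penalty, minmax_gap(penalty))
```

In this toy problem the best fixed policy randomizes evenly between the two actions, so its worst-case return decreases linearly in `penalty` while the type-aware return stays constant.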
3 Designing Robust Policies
In this section, we introduce our algorithmic framework for designing robust policies for the AI agent .
3.1 Algorithmic framework
Our approach relies on observing the behavior (i.e., actions taken) to make inferences about the agent ’s type and adapting agent ’s policy accordingly to facilitate efficient cooperation. This is inspired by how people make decisions in uncertain situations (e.g., ability to safely drive a car even if the other driver on the road is driving aggressively). The key intuition is that at test time, the agent can observe agent ’s actions which are taken as when in state to infer , and in turn use this additional information to make an improved decision on which actions to take. More formally, we define the observation history available at the beginning of timestep as and use it to infer the type of agent and act appropriately.
In particular, we will make use of an Inference procedure (details provided in Section 5). Given , this procedure returns an estimate of the type of agent at time given by . Then, we consider stochastic policies of the form . The space of these policies is given by . For a fixed policy of agent and fixed, unknown , we define its total expected reward in the MDP as follows:
(4) 
Note that at any time , we have and is generated according to .
We seek to find the policy for agent given by the following optimization problem:
(5) 
In the next two sections, we will design algorithms to optimize the objective in Equation (5) following the framework outlined in Algorithm 1. In particular, we will discuss two possible architectures for policy and corresponding Training procedures in Section 4. Then, in Section 5, we describe ways to implement the Inference procedure for inferring agent ’s type using observed actions. Below, we provide theoretical insights into the robustness of the proposed algorithmic framework.
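The test-phase interaction described in this subsection can be sketched as a simple loop: estimate the user's type from the observation history, then act with a policy conditioned on that estimate. The callables `env_reset`, `env_step`, `infer`, and `policy`, and the toy stand-ins below them, are hypothetical placeholders and not the paper's Algorithm 1.

```python
from typing import Callable, List, Tuple

Observation = Tuple[int, int]  # (state, user action observed in that state)

def run_test_phase(env_reset: Callable[[], int],
                   env_step: Callable[[int, int], Tuple[int, int]],
                   infer: Callable[[List[Observation]], float],
                   policy: Callable[[int, float], int],
                   horizon: int) -> List[int]:
    """Act for `horizon` steps, re-estimating the user's type from observed actions only."""
    history: List[Observation] = []
    state = env_reset()
    actions = []
    for _ in range(horizon):
        theta_hat = infer(history)            # estimate from history so far; no rewards used
        action = policy(state, theta_hat)     # adaptive policy conditioned on the estimate
        next_state, user_action = env_step(state, action)
        history.append((state, user_action))
        actions.append(action)
        state = next_state
    return actions

# Toy stand-ins (hypothetical): a single state, and a user who always plays its type.
TRUE_TYPE = 1

def toy_reset() -> int:
    return 0

def toy_step(state: int, action: int) -> Tuple[int, int]:
    return 0, TRUE_TYPE

def toy_infer(history: List[Observation]) -> float:
    # Initial guess 0.0, then the empirical mean of observed user actions.
    return 0.0 if not history else sum(u for _, u in history) / len(history)

def toy_policy(state: int, theta_hat: float) -> int:
    return int(theta_hat >= 0.5)

actions = run_test_phase(toy_reset, toy_step, toy_infer, toy_policy, horizon=5)
```

The agent's first action uses the initial estimate; from the second step on, its actions reflect the user's observed behavior.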
3.2 Performance analysis
We begin by specifying three technical questions that are important for gaining theoretical insights into the robustness of the proposed framework:

Q.1 Independent of the specific procedures used for Training and Inference, the first question to tackle is the following: When agent ’s true type is and agent uses a best response policy for such that , what are the performance guarantees on the total utility achieved by agent ? (see Theorem 2).

Q.2 Regarding the Training procedure: When agent ’s type is and the inference procedure outputs such that , what is the performance of policy ? (see Section 4).

Q.3 Regarding the Inference procedure: When agent ’s type is , can we infer such that either is small, or agent ’s policies and are approximately equivalent? (see Section 5).
3.2.1 Smoothness properties
To address Q.1, we introduce a number of properties characterizing our problem setting. These properties are essentially smoothness conditions on MDPs that enable us to make statements about the following intermediate issue: For two types , how “similar” are the corresponding MDPs from agent ’s point of view?
The first property characterizes the smoothness of rewards for agent w.r.t. parameter . Formally, the parametric MDP is smooth with respect to the rewards if for any and we have
(6) 
The second property characterizes the smoothness of policies for agent w.r.t. parameter ; this in turn implies that the MDP’s transition dynamics as perceived by agent are smooth. Formally, the parametric MDP is smooth in the behavior of agent if for any and we have
(7) 
For instance, one setting where this property holds naturally is when is a soft Bellman policy computed w.r.t. a reward function for agent which is smooth in Ziebart (2010); Kamalaruban et al. (2019).
The third property is a notion of influence as introduced by Dimitrakakis et al. (2017): This notion captures how much one agent can affect the probability distribution of the next state with her actions as perceived by the second agent. Formally, we capture the influence of agent on agent as follows:
(8) 
where represents the action of agent , represents two distinct actions of agent , and is the transition dynamics of the two-agent MDP (see Section 2.1). Note that and allows us to do fine-grained performance analysis: for instance, when , then agent does not affect the transition dynamics as perceived by agent and we can expect to have better performance for agent .
3.2.2 Guarantees
Putting this together, we can provide the following guarantees as an answer for Q.1:
Theorem 2.
Let be the type of agent at test time, and suppose agent uses a policy such that . The parameters characterize the smoothness as defined above. Then, the total reward achieved by agent satisfies the following guarantee
The proof of the theorem is provided in the supplementary material and builds on the theory of approximate equivalence of MDPs by Even-Dar and Mansour (2003). In the next two sections, we provide specific instantiations of Training and Inference procedures.
4 Training Procedures
In this section, we present two procedures to train adaptive policies (see Training in Algorithm 1).
4.1 Training procedure AdaptPool
The basic idea of AdaptPool is to maintain a pool Pool of best response policies for and, in the test phase, switch between these policies based on inference of the type .
4.1.1 Architecture of the policy
The adaptive pool based policy (AdaptPool) consists of a pool (Pool) of best response policies corresponding to different possible agent ’s types , and a nearest-neighbor policy selection mechanism. In particular, when invoking AdaptPool for state and inferred agent ’s type , the policy first identifies the most similar agent in Pool, i.e., , and then executes an action using the best response policy .
4.1.2 Training process
During training we compute a pool of best response policies Pool for a set of possible agent ’s types , see Algorithm 2.
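A minimal sketch of the AdaptPool idea follows, assuming tabular MDPs that can be solved by value iteration and a Euclidean nearest-neighbor rule. The names `best_response`, `AdaptPool.act`, `make_mdp`, and the toy MDP family are hypothetical; the paper's Algorithm 2 may differ in its details.

```python
import numpy as np

def best_response(T, R, gamma=0.9, iters=500):
    """Greedy policy from value iteration. T[s, a, t] transitions, R[s, a] rewards."""
    V = np.zeros(T.shape[0])
    for _ in range(iters):
        Q = R + gamma * T @ V      # Q[s, a]
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

class AdaptPool:
    """Pool of best responses, one per training type, with nearest-neighbor selection."""

    def __init__(self, thetas, make_mdp, gamma=0.9):
        self.thetas = np.asarray(thetas, dtype=float)          # shape (n_types, dim)
        self.policies = [best_response(*make_mdp(th), gamma=gamma)
                         for th in self.thetas]

    def act(self, state, theta_hat):
        # Distance measure is an assumption: Euclidean distance in parameter space.
        dists = np.linalg.norm(self.thetas - np.asarray(theta_hat, dtype=float), axis=1)
        return int(self.policies[int(dists.argmin())][state])

def toy_mdp(theta):
    """Hypothetical one-state MDP: action 0 pays theta, action 1 pays -theta."""
    T = np.ones((1, 2, 1))
    R = np.array([[theta[0], -theta[0]]])
    return T, R

pool = AdaptPool(thetas=[[-1.0], [1.0]], make_mdp=toy_mdp)
action_for_positive = pool.act(0, [0.8])    # nearest training type is +1
action_for_negative = pool.act(0, [-0.3])   # nearest training type is -1
```

Because selection only depends on the inferred type, the pool can be built entirely during training, where rewards are available, and queried at test time without them.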
4.1.3 Guarantees
It turns out that if the set of possible agent ’s types is chosen appropriately, Algorithm 1 instantiated with AdaptPool enjoys strong performance guarantees. In particular, choosing as a sufficiently fine cover of the parameter space ensures that for any that we might encounter at test time, we have considered a sufficiently similar agent during training and hence can execute a best response policy which achieves good performance; see the corollary below.
Corollary 3.
Let be an cover for , i.e., for all . Let be the type of agent , and suppose the Inference procedure outputs such that . Let . Then, at time , the policy used by agent has the following guarantees:
Corollary 3 follows from the result of Theorem 2 given that the pool Pool of policies trained by AdaptPool is sufficiently rich. Note that the accuracy of Inference would typically improve over time and hence the performance of the algorithm is expected to improve over time in practice, see Section 6.2. Building on the idea of AdaptPool, we next provide a more practical implementation of the training procedure which does not require maintaining an explicit pool of best response policies and is therefore easier to scale to larger problems.
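One simple way to obtain a cover of the kind required by the corollary is a uniform grid. The sketch below assumes the parameter space is a box and that distances are measured in the Euclidean norm; both are assumptions made for illustration, as are the function name `grid_cover` and the example box.

```python
import itertools
import math
import numpy as np

def grid_cover(low, high, eps, dim=2):
    """Uniform grid on [low, high]^dim with every point within eps (L2) of a grid point."""
    spacing = 2.0 * eps / math.sqrt(dim)            # half-diagonal of a grid cell equals eps
    n = max(2, math.ceil((high - low) / spacing) + 1)
    axis = np.linspace(low, high, n)
    return np.array(list(itertools.product(axis, repeat=dim)))

# Hypothetical 2D parameter box.
cover = grid_cover(-1.0, 1.0, eps=0.25)
```

The number of grid points grows as `eps` shrinks and exponentially with `dim`, which is one reason an explicit pool becomes harder to maintain for larger problems.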
4.2 Training procedure AdaptDQN
AdaptDQN builds on the ideas of AdaptPool: Here, instead of explicitly maintaining a pool of best response policies for agent , we have a policy network trained on an augmented state space . This policy network resembles the Deep Q-Network (DQN) architecture Mnih et al. (2015), but operates on an augmented state space and takes as input a tuple . A similar architecture was used by Hessel et al. (2019), where one policy network was trained to play 57 Atari games, and the state space was augmented with the index of the game. In the test phase, agent selects actions given by this policy network.
4.2.1 Architecture of the policy
The adaptive policy (AdaptDQN) consists of a neural network trained on an augmented state space . In particular, when invoking AdaptDQN for state and inferred agent ’s type , we use the augmented state space as input to the neural network. The output layer of the network computes the Q-values of all possible actions corresponding to the augmented input state. Agent selects the action with the maximum Q-value.
4.2.2 Training process
Here, we provide a description of how we train the policy network using the augmented state space, see Algorithm 3. During one iteration of training the policy network, we first sample a parameter . We then obtain the optimal best response policy of agent for the MDP . We compute the vector of all Q-values corresponding to this policy, i.e., (represented by in Algorithm 3), using the standard Bellman equations Sutton and Barto (1998). In our setting, we use these pre-computed Q-values to serve as the target values for the associated parameter for training the policy network. The loss function used for training is the standard squared error loss between the target Q-values computed using the procedure described above and those given by the network under training. The gradient of this loss function is used for backpropagation through the network. Multiple such iterations are carried out during training, until a convergence criterion is met. For more details on Deep Q-Networks, we refer the reader to Mnih et al. (2015).
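The training process described above amounts to supervised regression onto pre-computed Q-values over an augmented input. The sketch below illustrates this with a small NumPy network standing in for the policy network; the toy MDP family, the single hidden layer, the layer width, the learning rate, and all names are assumptions made for illustration and do not reproduce the paper's Algorithm 3 or architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_values(T, R, gamma=0.5, iters=300):
    """Optimal Q-values via value iteration. T[s, a, t] transitions, R[s, a] rewards."""
    Q = np.zeros_like(R)
    for _ in range(iters):
        Q = R + gamma * T @ Q.max(axis=1)
    return Q

def augmented_input(state, theta, n_states):
    """One-hot state concatenated with the type parameter vector."""
    return np.concatenate([np.eye(n_states)[state], theta])

def leaky_relu(x, slope=0.1):
    return np.where(x > 0, x, slope * x)

# Hypothetical toy family: 2 states, 2 actions, rewards linear in a 1-D type.
n_states, n_actions = 2, 2
T = np.full((n_states, n_actions, n_states), 0.5)
thetas = [np.array([t]) for t in (-1.0, 0.0, 1.0)]
targets = [q_values(T, np.array([[t[0], -t[0]], [0.5 * t[0], 0.0]])) for t in thetas]

# One hidden layer standing in for the policy network.
n_in, n_hidden = n_states + 1, 16
W1 = rng.normal(0, 0.5, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.5, (n_hidden, n_actions)); b2 = np.zeros(n_actions)

def forward(x):
    pre = x @ W1 + b1
    return pre, leaky_relu(pre) @ W2 + b2

def mse_over_all():
    errs = [forward(augmented_input(s, th, n_states))[1] - targets[i][s]
            for i, th in enumerate(thetas) for s in range(n_states)]
    return float(np.mean(np.square(errs)))

loss_before = mse_over_all()
lr = 0.01
for _ in range(5000):
    i = rng.integers(len(thetas))             # sample a type
    s = rng.integers(n_states)                # sample a state
    x = augmented_input(s, thetas[i], n_states)
    pre, q_pred = forward(x)
    grad_out = 2.0 * (q_pred - targets[i][s]) / n_actions     # gradient of squared error
    grad_hidden = (W2 @ grad_out) * np.where(pre > 0, 1.0, 0.1)
    W2 -= lr * np.outer(leaky_relu(pre), grad_out); b2 -= lr * grad_out
    W1 -= lr * np.outer(x, grad_hidden);            b1 -= lr * grad_hidden
loss_after = mse_over_all()
```

At test time, such a network would be queried with the current state and the inferred type, and the action with the largest predicted Q-value would be selected.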
5 Inference Procedure
In the test phase, the inference of agent ’s type from an observation history is a key component of our framework, and crucial for facilitating efficient collaboration. Concretely, Theorem 2 implies that a best response policy also achieves good performance for agent with true parameter if is small and MDP is smooth w.r.t. parameter as described in Section 3.2.
There are several different approaches that one can consider for inference, depending on the application setting. For instance, we can use probabilistic approaches as proposed in the work of Everett and Roberts (2018) where a pool of agent ’s policies is maintained and inference is done at run time via simple probabilistic methods. Based on the work by Grover et al. (2018), we can also maintain a more compact representation of agent ’s policies and then apply probabilistic methods on this representation.
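As an illustration of the pool-based probabilistic approach mentioned above, the following sketch maintains a posterior over a finite set of candidate user policies and updates it from observed actions. It is a generic Bayesian update, not the specific method of Everett and Roberts (2018); the function name and the two candidate policies are hypothetical.

```python
import numpy as np

def posterior_over_types(history, user_policies, prior=None):
    """Posterior over candidate user types given observed (state, user action) pairs.

    user_policies[k][s, u] = P(user action u | state s, type k).
    """
    n_types = len(user_policies)
    if prior is None:
        prior = np.full(n_types, 1.0 / n_types)
    log_post = np.log(np.asarray(prior, dtype=float))
    for state, user_action in history:
        for k, pol in enumerate(user_policies):
            log_post[k] += np.log(pol[state, user_action] + 1e-12)
    log_post -= log_post.max()             # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

# Two hypothetical user types in a one-state problem with two user actions.
user_policies = [np.array([[0.9, 0.1]]), np.array([[0.2, 0.8]])]
history = [(0, 1), (0, 1), (0, 0)]
posterior = posterior_over_types(history, user_policies)
most_likely_type = int(posterior.argmax())
```

When two candidate types induce nearly identical policies, the posterior cannot separate them, but in that case acting on either estimate leads to similar perceived dynamics.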
We can also do inference based on ideas of inverse reinforcement learning (IRL) where the observation history serves the purpose of demonstrations Abbeel and Ng (2004); Ziebart (2010). This is particularly suitable when the parameter exactly corresponds to the rewards used by agent when computing its policy . In fact, this is the approach that we follow for our inference module; in particular, we employ the popular Maximum Causal Entropy (MCE) IRL algorithm Ziebart (2010). We refer the reader to Section 6.1 for more details.
6 Experiments
We evaluate the performance of our algorithms using a gathering game environment, see Figure 2. Below, we provide details of the experimental setup and then discuss results.
6.1 Experimental setup
6.1.1 Environment details
For our experiments, we consider an episodic setting where two agents play the game repeatedly for multiple episodes enumerated as . Each episode of the game lasts for 500 steps. Now, to translate the episode count to time steps as used in Algorithm 1 (line 3), we have at the end of episode.
For any fixed , agent ’s policy is computed first by ignoring the presence of agent as described below; this is in line with our motivating applications where agent is the human agent with a pre-specified policy. In order to compute agent ’s policy , we consider agent operating in a single-agent MDP denoted as where (i) corresponds to the location of agent in the grid-space, (ii) the action space is as described in Figure 2, (iii) the reward function corresponds to reward associated with two fruit types given by , (iv) corresponds to transition dynamics of agent alone in the environment, (v) discount factor , and (vi) corresponds to agent starting in the upper-left corner (see Figure 2). Given , we compute as a soft Bellman policy, which is suitable to capture sub-optimal human behaviour in applications Ziebart (2010).
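A soft Bellman policy of the kind referred to above can be computed by soft value iteration, in which the hard maximum over actions is replaced by a log-sum-exp. The sketch below uses a tabular MDP; the function name, the discount value, and the one-state example are assumptions made for illustration.

```python
import numpy as np

def soft_bellman_policy(T, R, gamma=0.9, iters=500):
    """Soft value iteration. T[s, a, t] transitions, R[s, a] rewards -> pi[s, a]."""
    V = np.zeros(T.shape[0])
    for _ in range(iters):
        Q = R + gamma * T @ V                                   # soft Q-values
        Q_max = Q.max(axis=1, keepdims=True)                    # for numerical stability
        V = (Q_max + np.log(np.exp(Q - Q_max).sum(axis=1, keepdims=True))).ravel()
    return np.exp(Q - V[:, None])                               # pi(a | s) = exp(Q - V)

# Hypothetical one-state example: action 0 pays 1, action 1 pays 0.
T = np.ones((1, 2, 1))
R = np.array([[1.0, 0.0]])
pi = soft_bellman_policy(T, R)
```

The resulting policy prefers higher-value actions but still assigns positive probability to every action, which is what makes it a convenient model of sub-optimal behaviour.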
From agent ’s point of view, each gives rise to a parametric MDP in which agent is operating in the game along with the corresponding agent , see Figure 2.
6.1.2 Baselines and implementation details
We use three baselines to compare the performance of our algorithms: (i) Rand corresponds to picking a random and using best response policy , (ii) FixedMM corresponds to the fixed best response (in a min-max sense) policy in Eq. 3, and (iii) FixedBest is a variant of FixedMM and corresponds to the fixed best response (in an average sense) policy.
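The difference between the FixedMM and FixedBest selection criteria can be illustrated on a small table of returns. The sketch assumes the fixed policy is chosen from a finite candidate set, and the numbers in `utility` are made up for illustration; they are not experimental results.

```python
import numpy as np

# utility[i, j]: return of candidate fixed policy i against user type j (made-up numbers).
utility = np.array([[5.0, -4.0, 3.0],
                    [1.0,  1.0, 1.0],
                    [4.0,  0.0, 2.5]])

fixed_mm = int(utility.min(axis=1).argmax())     # best worst-case return (min-max sense)
fixed_best = int(utility.mean(axis=1).argmax())  # best average return
```

The two criteria can select different policies: a policy that is safe against every type need not be the one with the highest average return.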
We implemented two variants of AdaptPool which store policies corresponding to and covers of (see Corollary 3), denoted as and in Figure 3. Next, we give specifications of the trained policy network used in AdaptDQN. We used to be a level discretization of . The trained network has 3 hidden layers with leaky ReLU units (with ) having , , and hidden units respectively, and a linear output layer with units (corresponding to the size of action set ) (see Mnih et al. (2015) for more details on training Deep Q-Networks). The input to the neural network is a concatenation of the locations of the agents and the parameter vector , where (this corresponds to the augmented state space described in Section 4.2). The location of each agent is represented as a one-hot encoding of a vector of length corresponding to the number of grid cells. Hence the length of the input vector to the neural network is . During training, agent implemented epsilon-greedy exploratory policies (with exploration rate decaying linearly over training iterations from 1.0 to 0.01). Training lasted for about 50 million iterations.
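The input encoding and exploration schedule described above can be sketched as follows. The number of grid cells, the helper names, and the example values are hypothetical; only the one-hot-plus-parameter layout and the linear decay from 1.0 to 0.01 are taken from the text.

```python
import numpy as np

def encode_input(ai_cell, user_cell, theta, n_cells):
    """Concatenate one-hot locations of both agents with the type parameter vector."""
    one_hot = np.zeros(2 * n_cells)
    one_hot[ai_cell] = 1.0
    one_hot[n_cells + user_cell] = 1.0
    return np.concatenate([one_hot, np.asarray(theta, dtype=float)])

def epsilon_at(step, total_steps, start=1.0, end=0.01):
    """Exploration rate decaying linearly from `start` to `end` over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)

def epsilon_greedy(q_values, epsilon, rng):
    """Random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

# Hypothetical sizes: 4 grid cells and a 2-D type parameter.
x = encode_input(ai_cell=1, user_cell=2, theta=[0.5, -0.5], n_cells=4)
greedy_action = epsilon_greedy(np.array([0.1, 0.7, 0.3]), epsilon=0.0,
                               rng=np.random.default_rng(0))
```

With `n_cells` grid cells and a 2-D parameter vector, the encoded input has length `2 * n_cells + 2`.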
Our inference module is based on the MCE-IRL approach Ziebart (2010) to infer by observing actions taken by agent ’s policy. Note that we are using MCE-IRL to infer the reward function parameters used by agent for computing its policy in the MDP (see “Environment details” above). At the beginning, the inference module is initialized with , and its output at time given by is based on history . In particular, we implemented a sequential variant of the MCE-IRL algorithm which updates the estimate only at the end of every episode using stochastic gradient descent with learning rate . We refer the reader to Ziebart (2010) for details on the original MCE-IRL algorithm and to Kamalaruban et al. (2019) for the sequential variant.
6.2 Results: Worst-case and average performance
We evaluate the performance of algorithms on different obtained by a level discretization of the 2D parametric space . For a given , the results were averaged over runs. Results are shown in Figure 3. As can be seen in Figure 2(a), the worst-case performance of both AdaptDQN and AdaptPool is significantly better than that of the three baselines (FixedBest, Rand and FixedMM), indicating robustness of our algorithmic framework. In our experiments, the FixedMM and FixedBest baselines correspond to best response policies for and respectively. Under both these policies, agent ’s behavior is qualitatively similar to the one shown in Figure 1(c). As can be seen, under these policies, agent avoids both fruits and avoids any collision; however, this does not allow agent to assist agent in collecting fruits even in scenarios where fruits have positive rewards.
In Figure 2(c), we show the convergence behavior of the inference module. Here, Worst shows the worst-case performance: As can be seen in the Worst line, there are cases where the performance of the inference procedure is bad, i.e., is large. This usually happens when different parameter values of result in agent having equivalent policies. In these cases, estimating the exact without any additional information is difficult. In our experiments, we noted that even if is large, it is often the case that agent ’s policies and are approximately equivalent, which is important for getting a good approximation of the transition dynamics . Despite the poor performance of the inference module in such cases, the performance of our algorithms is significantly better than the baselines (as is evident in Figure 2(a)). In the supplementary material, we provide additional experimental results corresponding to the algorithms’ performance for each individual to gain further insights.
6.3 Results: Performance heatmaps for each
Here, we provide additional experimental results to gain further insights into the performance of our algorithms. These results are presented in Figure 4 in the form of heat maps for each individual : Heat maps either represent the performance of the algorithms (in terms of the total reward ) or the performance of the inference procedure (in terms of the norm ). These results are plotted in the episode (cf. Figure 3 where the performance was plotted over time with increasing ).
It is important to note that there are cases where the performance of the inference procedure is bad, i.e., is large. This usually happens when different parameter values of result in agent having equivalent policies. In these cases, estimating the exact without any additional information is difficult. In our experiments, we noted that even if is large, it is often the case that agent ’s policies and are approximately equivalent, which is important for getting a good approximation of the transition dynamics . Despite the poor performance of the inference module in such cases, the performance of our algorithms (see and AdaptDQN in the figure) is significantly better than the baselines (see FixedBest in the figure).
7 Related Work
Modeling and making inferences about other agents. The inference problem has been considered in the literature in various forms. For instance, Grover et al. (2018) consider the problem of learning policy representations that can be used for interacting with unseen agents when using representation-conditional policies. They also consider the case of inferring another agent’s representation (parameters) during test time. Macindoe et al. (2012) consider planners for collaborative domains that can take actions to learn about the intent of another agent or hedge against its uncertainty. Nikolaidis et al. (2015) cluster human users into types and aim to infer the type of a new user online, with the goal of executing the policy for that type. They test their approach in robot-human interaction but do not provide any theoretical analysis of their approach. Beyond reinforcement learning, the problem of modeling and making inferences about other agents has been studied in other applications such as personalization of web search ranking results by inferring users’ preferences based on their online activity White et al. (2013, 2014); Singla et al. (2014).
Multi-task and meta-learning. Our problem setting can be interpreted as a multi-task RL problem in which each possible agent corresponds to a different task, or as a meta-learning RL problem in which the goal is to learn a policy that can quickly adapt to new partners. Hessel et al. (2019) study the problem of multi-task learning in the RL setting in which a single agent has to solve multiple tasks, e.g., solve all Atari games. However, they do not consider a separate test set to measure generalization of trained agents but rather train and evaluate on the same tasks. Sæmundsson et al. (2018) consider the problem of meta-learning for RL in the context of changing dynamics of the environment and approach it using Gaussian processes and a hierarchical latent variable model.
Robust RL. The idea of robust RL is to learn policies that are robust to certain types of errors or mismatches. In the context of our paper, mismatch occurs in the sense of encountering human agents that have not been encountered at training time, and the learned policies should be robust in this situation. Pinto et al. (2017) consider training of policies in the context of a destabilizing adversary with the goal of coping with model mismatch and data scarcity. Roy et al. (2017) study the problem of RL under model mismatch such that the learning agent cannot interact with the actual test environment but only with a reasonably close approximation. The authors develop robust model-free learning algorithms for this setting.
More complex interactions, teaching, and steering. In our paper, the type of interaction between two agents is limited as agent does not affect agent ’s behaviour, allowing us to gain a deeper theoretical understanding of this setting. There is also a related literature on “steering” the behavior of another agent. For example, (i) the environment design framework of Zhang et al. (2009), where one agent tries to steer the behavior of another agent by modifying its reward function, (ii) the cooperative inverse reinforcement learning of Hadfield-Menell et al. (2016), where the human uses demonstrations to reveal a proper reward function to the AI agent, and (iii) the advice-based interaction model of Amir et al. (2016), where the goal is to communicate advice to a sub-optimal agent on how to act.
Dealing with non-stationary agents. The work of Everett and Roberts (2018) is closely related to ours: they design a Switching Agent Model (SAM) that combines deep reinforcement learning with opponent modelling to robustly switch between multiple policies. Zheng et al. (2018) also consider a similar setting of detecting non-stationarity and reusing policies on the fly, and introduce a distilled policy network that serves as the policy library. Our algorithmic framework is similar in spirit to these two papers. However, in our setting, the focus is on acting optimally against an unknown agent whose behavior is stationary, and we provide theoretical guarantees on the performance of our algorithms. Singla et al. (2018) have considered the problem of learning with expert advice where experts are not stationary and are learning agents themselves. However, their focus is on designing a meta-algorithm on how to coordinate with these experts and is technically very different from ours. A few other recent papers have also considered repeated human-AI interaction where the human agent is non-stationary and is evolving its behavior in response to the AI agent (see Radanovic et al. (2019); Nikolaidis et al. (2017b)). Prior work also considers a learner that is aware of the presence of other actors (see Foerster et al. (2018); Raileanu et al. (2018)).
8 Conclusions
Inspired by real-world applications like virtual personal assistants, we studied the problem of designing AI agents that can robustly cooperate with new people in human-machine partnerships. In line with our motivating applications, we focused on an important practical aspect: there is often a clear distinction between the training and test phases, in that explicit reward information is only available during training, yet adaptation is also needed during testing. We provided a framework for designing adaptive policies and gave theoretical insights into its robustness. In experiments, we demonstrated that these policies can achieve good performance when interacting with previously unseen agents.
Acknowledgements
This work was supported by Microsoft Research through its PhD Scholarship Programme.
References
Apprenticeship learning via inverse reinforcement learning. In ICML.
Guidelines for human-AI interaction. In CHI, pp. 3:1–3:13.
Interactive teaching strategies for agent training. In IJCAI.
A collaborative filtering approach to mitigate the new user cold start problem. Knowledge-Based Systems 26, pp. 225–238. ISSN 0950-7051.
Multi-view decision processes: The helper-AI problem. In Advances in Neural Information Processing Systems.
Approximate equivalence of Markov decision processes. In Learning Theory and Kernel Machines, B. Schölkopf and M. K. Warmuth (Eds.), Berlin, Heidelberg, pp. 581–594.
Learning against non-stationary agents with opponent modelling and deep reinforcement learning. In AAAI Spring Symposia 2018.
Learning with opponent-learning awareness. In AAMAS, pp. 122–130.
Learning policy representations in multiagent systems. In ICML, pp. 1797–1806.
Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems.
Teaching inverse reinforcement learners via features and demonstrations. In Advances in Neural Information Processing Systems, pp. 8464–8473.
Multi-task deep reinforcement learning with PopArt. In AAAI, pp. 3796–3803.
Interactive teaching algorithms for inverse reinforcement learning. In IJCAI.
Multi-agent reinforcement learning in sequential social dilemmas. In AAMAS, pp. 464–473.
POMCoP: belief space planning for sidekicks in cooperative games. In AIIDE.
Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533.
Mathematical models of adaptation in human-robot collaboration. CoRR abs/1707.02586.
Game-theoretic modeling of human adaptation in human-robot collaboration. In Proceedings of the International Conference on Human-Robot Interaction, pp. 323–331.
Efficient model learning from joint-action demonstrations for human-robot collaborative tasks. In HRI, pp. 189–196.
Information and information stability of random variables and processes. Holden-Day.
Robust adversarial reinforcement learning. In ICML.
Learning to collaborate in Markov decision processes. In ICML.
Modeling others using oneself in multi-agent reinforcement learning. In ICML, pp. 4254–4263.
Reinforcement learning under model mismatch. In Advances in Neural Information Processing Systems, pp. 3043–3052.
Meta reinforcement learning with latent variable Gaussian processes. In UAI.
Learning to interact with learning agents. In AAAI, pp. 4083–4090.
Enhancing personalization via search activity attribution. In SIGIR, pp. 1063–1066.
Reinforcement learning: an introduction. Adaptive computation and machine learning, MIT Press.
Learner-aware teaching: inverse reinforcement learning with preferences and constraints. In Advances in Neural Information Processing Systems.
Enhancing personalized search by mining and modeling task behavior. In WWW, pp. 1411–1420.
From devices to people: Attribution of search activity in multi-user settings. In WWW, pp. 431–442.
Collaborative intelligence: Humans and AI are joining forces. Harvard Business Review 96 (4), pp. 114–123.
Towards AI-powered personalization in MOOC learning. npj Science of Learning 2 (1), pp. 15.
Policy teaching through reward function learning. In EC, pp. 295–304.
A deep Bayesian policy reuse approach against non-stationary agents. In Advances in Neural Information Processing Systems, pp. 962–972.
Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Ph.D. Thesis.
Appendix A Proof of Theorem 1
Proof.
We provide a proof via constructing a problem instance. Let the parametric space be . Next, we define two MDPs and below:

set of states is given by , with denoting a generic state. Here, state represents a state where reward can be accumulated, and state is a terminal state.

set of actions is given by with denoting a generic action for agent and denoting a generic action for agent .

for , we have if and if where . Note that the reward function depends only on the state and not on the action taken. Also, the reward function is the same for both and .

discount factor and initial state distribution is given by .

most crucial part of this problem instance is the transition dynamics and that we specify below. Note that, for , , where , i.e., corresponds to the transition dynamics derived from a two-agent MDP for which agent ’s policy is . We define transition dynamics below in Figure 5, and policies and below in Figure 6.




Next, in Figure 7, we show two MDPs and as perceived by agent . It is easy to see that the best response policies for agent are given as follows: (i) for , , and (ii) for , . Any action can be taken from state as it brings zero reward and agent continues to stay in . Also, these best response policies have a total reward given by and .
However, when the underlying parameter is unknown, the best-response policy (in a max-min sense) of agent as defined in Equation 3 is given by:
Next, we compute for our problem instance. When considering the space of policies , it is enough to focus only on the state and consider policies which take action from state with probability where . For any such policy such that , we can compute the following:
It can easily be shown that is the policy given by , i.e., . Next we focus on the primary quantity of interest in the theorem, i.e.,
As mentioned earlier, we have for both and . Also, it is easy to compute that for both and . Hence, we can show that
which is arbitrarily large when is close to 1 or for large values of .
∎
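To make the flavor of this construction concrete, the following minimal sketch evaluates a hypothetical two-type instance; it is not claimed to reproduce the instance used in the proof. The assumptions are: a rewarding state that yields a fixed reward per step, an absorbing zero-reward state, and two user types that differ only in which of the agent's two actions keeps the process in the rewarding state. The function names, the grid search, and the numerical values are illustrative.

```python
def value_in_reward_state(p_stay, reward, gamma):
    """Discounted return from the rewarding state when the process stays
    there with probability p_stay at every step (V = reward + gamma * p_stay * V)."""
    return reward / (1.0 - gamma * p_stay)


def max_min_gap(reward, gamma, grid=1001):
    """Grid-search the fixed policy with the best worst-case value over two
    hypothetical user types and return (p, gap), where p is the probability
    of playing the first action and gap is the loss relative to knowing the type."""
    best_p, best_worst = 0.0, float("-inf")
    for i in range(grid):
        p = i / (grid - 1)
        # Assumed type 1: the first action keeps the process in the rewarding state.
        v_type1 = value_in_reward_state(p, reward, gamma)
        # Assumed type 2: the second action keeps the process in the rewarding state.
        v_type2 = value_in_reward_state(1.0 - p, reward, gamma)
        worst = min(v_type1, v_type2)
        if worst > best_worst:
            best_p, best_worst = p, worst
    # With a known type, the agent always stays in the rewarding state.
    best_response_value = reward / (1.0 - gamma)
    return best_p, best_response_value - best_worst


if __name__ == "__main__":
    for gamma in (0.5, 0.9, 0.99, 0.999):
        p, gap = max_min_gap(reward=1.0, gamma=gamma)
        print(f"gamma={gamma}: max-min p={p:.2f}, gap={gap:.2f}")
```

Under these assumptions, the worst-case-optimal fixed policy randomizes uniformly over the two actions, and its gap to the type-specific best response, reward/(1 - gamma) - reward/(1 - gamma/2), grows without bound as gamma approaches 1 or as the reward grows.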
Appendix B Proof of Theorem 2
In this section, we provide a proof of Theorem 2. The proof builds on a few technical lemmas that we introduce first.
B.1 Approximately-equivalent MDPs
First, we introduce a generic notion of approximately-equivalent MDPs and derive a few technical results for them that are useful for proving Theorem 2. This notion and the technical results are adapted from the work of Even-Dar and Mansour (2003).
Definition B.1 (approximately-equivalent MDPs, adapted from Even-Dar and Mansour (2003)).
Suppose we have two MDPs and , and rewards are bounded in . We call and approximately-equivalent if the following holds:
Next, we state a useful technical lemma, which is adapted from the results of Even-Dar and Mansour (2003).
Lemma 4.
Suppose we have two approximately-equivalent MDPs and . Let and denote optimal policies (not necessarily unique) for and respectively. Let denote the vector of per-state values for policy in MDP ; similarly, denotes the vector of per-state values for policy in MDP . We can bound these two vectors of value functions as follows:
(9) 
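As a numerical illustration of the kind of statement made in Lemma 4, the sketch below evaluates one fixed policy in two small MDPs whose transition probabilities differ by a small amount and reports the largest per-state difference in value. The two-state chain, the perturbation size eps, and the helper names are assumptions made for illustration; rewards depend only on the state.

```python
def evaluate_policy(P, R, gamma, iters=1000):
    """Iterative policy evaluation for a fixed policy.

    P[s][t] is the probability of moving from state s to state t under the
    policy, and R[s] is the state-only reward. Each sweep applies
    V(s) <- R(s) + gamma * sum_t P[s][t] * V(t)."""
    n = len(R)
    V = [0.0] * n
    for _ in range(iters):
        V = [R[s] + gamma * sum(P[s][t] * V[t] for t in range(n)) for s in range(n)]
    return V


def value_gap(eps, gamma=0.9):
    """Largest per-state value difference between two hypothetical MDPs whose
    transition rows differ by eps in the first state (requires 0 <= eps <= 0.8)."""
    R = [1.0, 0.0]  # state 0 is rewarding, state 1 is absorbing with zero reward
    P1 = [[0.8, 0.2], [0.0, 1.0]]
    P2 = [[0.8 - eps, 0.2 + eps], [0.0, 1.0]]
    V1 = evaluate_policy(P1, R, gamma)
    V2 = evaluate_policy(P2, R, gamma)
    return max(abs(a - b) for a, b in zip(V1, V2))


if __name__ == "__main__":
    for eps in (0.0, 0.01, 0.05, 0.1):
        print(f"eps={eps}: max value difference = {value_gap(eps):.4f}")
```

In this toy chain the value difference vanishes when the two MDPs coincide and grows with the size of the perturbation, which is the qualitative behavior a bound of this type captures.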
Proof.
For ease of presentation of the key ideas, we will write this proof considering reward functions that only depend on the current state and not on actions taken; the proof can be easily extended to generic reward functions.
The proof idea is based on looking at the intermediate outputs of a policy-iteration algorithm Sutton and Barto (1998) when evaluating in and . Let us consider an iteration of the policy-iteration algorithm and use to denote the vector of value functions when evaluating in . Similarly, we use to denote the vector of value functions when evaluating in at iteration of the policy-iteration algorithm. Note that as , the vectors and converge to and