# Towards Deployment of Robust AI Agents for Human-Machine Partnerships

Ahana Ghosh
MPI-SWS
gahana@mpi-sws.org

Sebastian Tschiatschek
Microsoft Research
setschia@microsoft.com

Hamed Mahdavi
MPI-SWS

###### Abstract

We study the problem of designing AI agents that can robustly cooperate with people in human-machine partnerships. Our work is inspired by real-life scenarios in which an AI agent, e.g., a virtual assistant, has to cooperate with new users after its deployment. We model this problem via a parametric MDP framework where the parameters correspond to a user’s type and characterize her behavior. In the test phase, the AI agent has to interact with a user of unknown type. Our approach to designing a robust AI agent relies on observing the user’s actions to make inferences about the user’s type and adapting its policy to facilitate efficient cooperation. We show that without being adaptive, an AI agent can end up performing arbitrarily badly in the test phase. We develop two algorithms for computing policies that automatically adapt to the user in the test phase. We demonstrate the effectiveness of our approach in solving a two-agent collaborative task.

## 1 Introduction

An increasing number of AI systems are deployed in human-facing applications like autonomous driving, medicine, and education Yu et al. (2017). In these applications, the human-user and the AI system (agent) form a partnership, necessitating mutual awareness for achieving optimal results Hadfield-Menell et al. (2016); Wilson and Daugherty (2018); Amershi et al. (2019). For instance, to provide high utility to a human-user, it is important that an AI agent can account for a user’s preferences defining her behavior and act accordingly, thereby being adaptive to the user’s type Nikolaidis et al. (2015, 2017a); Amershi et al. (2019); Tschiatschek et al. (2019); Haug et al. (2018). As a concrete example, an AI agent for autonomous driving applications should account for a user’s preference to take scenic routes instead of the fastest route and account for the user’s need for more AI support when driving manually in confusing situations.

AI agents that do not account for the user’s preferences and behavior typically degrade the utility for their human users. However, accounting for them is challenging because the AI agent needs to (a) infer information about the interacting user and (b) be able to interact efficiently with a large number of different human users, each possibly showing different behaviors. In particular, during development of an AI agent, it is often only possible to interact with a limited number of human users, and the AI agent needs to generalize to new users after deployment (or quickly acquire the information needed to do so). This resembles multi-agent reinforcement learning settings in which an AI agent faces unknown agents at test time Grover et al. (2018) and the cold-start problem in recommender systems Bobadilla et al. (2012).

In this paper, we study the problem of designing AI agents that, after deployment, can robustly cooperate with new unknown users in human-machine partnerships in reinforcement learning (RL) settings. In these problems, the AI agent often only has access to the reward information during its development, while no (explicit) reward information is available once the agent is deployed. As shown in this paper, an AI agent can only achieve high utility in this setting if it is adaptive to its user, while a non-adaptive AI agent can perform arbitrarily badly. We propose two adaptive policies for our considered setting, one of which comes with strong theoretical robustness guarantees at test time, while the other is inspired by recent deep-learning approaches for RL and is easier to scale to larger problems. Both policies build upon inferring the human user’s properties and leverage these inferences to act robustly.

Our approach is related to ideas of multi-task, meta-learning, and generalization in reinforcement learning. However, most of these approaches require access to reward information at test time and rarely offer theoretical guarantees for robustness (see discussion on related work in Section 7). Below, we highlight our main contributions:

• We provide an algorithmic framework for designing robust policies for interacting with agents of unknown behavior. Furthermore, we prove robustness guarantees for approaches building on our framework.

• We propose two policies according to our framework: AdaptPool which pre-computes a set of best-response policies and executes them adaptively based on inferences of the type of human-user; and AdaptDQN which implements adaptive policies by a neural network in combination with an inference module.

• We empirically demonstrate the excellent performance of our proposed policies when facing an unknown user.

## 2 The Problem Setup

We formalize the problem through a reinforcement learning (RL) framework. The agents are hereafter referred to as agent $A^x$ and agent $A^y$: here, agent $A^y$ represents the AI agent whereas agent $A^x$ could be a person, i.e., a human user. Our goal is to develop a learning algorithm for agent $A^y$ that leads to high utility even in cases when the behavior of agent $A^x$ and its committed policy are unknown.

### 2.1 The model

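We model the preferences and induced behavior of agent $A^x$ via a parametric space $\Theta$. From agent $A^y$’s perspective, each $\theta \in \Theta$ leads to a parameterized MDP $\mathcal{M}(\theta) := (S, A, T_\theta, R_\theta, \gamma, D_0)$ consisting of the following:

• a set of states $S$, with $s \in S$ denoting a generic state.

• a set of actions $A$, with $a \in A$ denoting a generic action of agent $A^y$.

• a transition kernel parameterized by $\theta$ as $T_\theta(s' \mid s, a)$, which is a tensor with indices defined by the current state $s$, the agent $A^y$’s action $a$, and the next state $s'$. In particular, $T_\theta(s' \mid s, a) = \mathbb{E}_{a^x}\big[T^{x,y}(s' \mid s, a, a^x)\big]$, where $a^x \sim \pi^x_\theta(\cdot \mid s)$ is sampled from agent $A^x$’s policy in state $s$. That is, $T_\theta(s' \mid s, a)$ corresponds to the transition dynamics derived from a two-agent MDP with transition dynamics $T^{x,y}$ and agent $A^x$’s policy $\pi^x_\theta$.

• a reward function parameterized by $\theta$ as $R_\theta(s, a)$, bounded in magnitude by $r_{\max}$ for some $r_{\max} > 0$. This captures the preferences of agent $A^x$ that agent $A^y$ should account for.

• a discount factor $\gamma \in [0, 1)$ weighing short-term rewards against long-term rewards.

• an initial state distribution $D_0$.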
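The relation between the two-agent dynamics and the single-agent transition kernel seen by the AI agent can be sketched numerically. The snippet below is a minimal illustration under assumed conventions: the array layouts, the function name `induced_transition`, and the random toy instance are hypothetical and not part of the original formulation.

```python
import numpy as np

def induced_transition(T_xy, pi_x):
    """Marginalize the user's action out of the two-agent dynamics.

    T_xy[s, a, b, s2]: probability of next state s2 given state s,
                       AI action a, and user action b.
    pi_x[s, b]:        probability that the user plays b in state s.
    Returns T_theta[s, a, s2] = sum_b pi_x[s, b] * T_xy[s, a, b, s2].
    """
    return np.einsum("sb,sabt->sat", pi_x, T_xy)

# Hypothetical toy instance: 3 states, 2 AI actions, 2 user actions.
rng = np.random.default_rng(0)
T_xy = rng.random((3, 2, 2, 3))
T_xy /= T_xy.sum(axis=-1, keepdims=True)   # normalize over next states
pi_x = rng.random((3, 2))
pi_x /= pi_x.sum(axis=-1, keepdims=True)   # normalize over user actions

T_theta = induced_transition(T_xy, pi_x)
```

Because the user's policy is a probability distribution in every state, each row of the resulting kernel is again a distribution over next states.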

Our goal is to develop a learning algorithm that achieves high utility even in cases when $\theta$ is unknown. In line with the motivating applications discussed above, we consider the following two phases:

• Training (development) phase. During development, our learning algorithm can iteratively interact with a limited number of different MDPs $\mathcal{M}(\theta)$ for $\theta \in \Theta^{\text{train}} \subseteq \Theta$: here, agent $A^y$ can observe rewards as well as agent $A^x$’s actions needed for learning purposes.

• Test (deployment) phase. After deployment, our learning algorithm interacts with a parameterized MDP as described above for unknown $\theta^{\text{test}} \in \Theta$: here, agent $A^y$ only observes agent $A^x$’s actions but not rewards.

### 2.2 Utility of agent $A^y$

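For a fixed policy $\pi$ of agent $A^y$, we define its total expected reward in the MDP $\mathcal{M}(\theta)$ as follows:

$$J_\theta(\pi) = \mathbb{E}\left[\sum_{\tau=1}^{\infty} \gamma^{\tau-1} R_\theta(s_\tau, a_\tau) \;\middle|\; D_0, T_\theta, \pi\right], \tag{1}$$

where the expectation is over the stochasticity of policy $\pi$ and the transition dynamics $T_\theta$. Here $s_\tau$ is the state at time $\tau$. For $\tau = 1$, this comes from the distribution $D_0$.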
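For a small tabular MDP, the total expected discounted reward of a fixed stationary policy can be computed in closed form by solving a linear system rather than by sampling. The following is a minimal sketch; the function name `expected_return` and the array layouts are assumptions made for illustration.

```python
import numpy as np

def expected_return(T, R, pi, d0, gamma):
    """Total expected discounted reward of a stationary policy.

    T[s, a, s2]: transition kernel, R[s, a]: rewards,
    pi[s, a]: action probabilities, d0[s]: initial state distribution.
    Solves v = r_pi + gamma * P_pi v and returns d0 . v.
    """
    n_states = T.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, T)   # state-to-state kernel under pi
    r_pi = np.einsum("sa,sa->s", pi, R)     # expected one-step reward under pi
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    return float(d0 @ v)

# Hypothetical check: two absorbing states, reward 1 per step in state 0.
T = np.zeros((2, 1, 2))
T[0, 0, 0] = 1.0
T[1, 0, 1] = 1.0
R = np.array([[1.0], [0.0]])
pi = np.ones((2, 1))
d0 = np.array([1.0, 0.0])
J = expected_return(T, R, pi, d0, gamma=0.5)   # geometric series 1 / (1 - 0.5)
```

The first reward is undiscounted, matching the exponent $\tau - 1$ in the definition of the total expected reward.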

##### For known θ.

When the underlying parameter $\theta$ is known, the task of finding the best response policy of agent $A^y$ reduces to the following:

$$\pi^*_\theta = \arg\max_{\pi \in \Pi} J_\theta(\pi) \tag{2}$$

where $\Pi$ defines the set of stationary Markov policies.

##### For unknown θ.

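However, when the underlying parameter $\theta \in \Theta$ is unknown, we define the best response (in a minmax sense) policy $\pi^*_\Theta$ of agent $A^y$ as:

$$\pi^*_\Theta = \arg\min_{\pi \in \Pi} \max_{\theta \in \Theta} \big(J_\theta(\pi^*_\theta) - J_\theta(\pi)\big) \tag{3}$$

Clearly, $J_\theta(\pi^*_\Theta) \le J_\theta(\pi^*_\theta)$ for all $\theta \in \Theta$. In general, this gap can be arbitrarily large, as formally stated in the following theorem.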

###### Theorem 1.

There exists a problem instance where the performance of agent $A^y$ can be arbitrarily worse when agent $A^x$’s type $\theta^{\text{test}}$ is unknown. In other words, the gap $\max_{\theta \in \Theta} \big(J_\theta(\pi^*_\theta) - J_\theta(\pi^*_\Theta)\big)$ is arbitrarily high.

The proof is presented in the supplementary material. Theorem 1 shows that the performance of agent $A^y$ can be arbitrarily bad when it doesn’t know $\theta^{\text{test}}$ and is restricted to execute a fixed stationary Markov policy. In the next section, we present an algorithmic framework for designing robust policies for agent $A^y$ for unknown $\theta^{\text{test}}$.
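To build intuition for why a fixed policy can suffer a large gap, consider a toy instance chosen here purely for illustration (it is not the construction used in the authors' proof): a single state, two actions, and two user types, where type 0 rewards only action 0 and type 1 rewards only action 1, each with reward `r` per step. Any fixed policy that plays action 0 with probability `p` has worst-case regret `max(p, 1 - p) * r / (1 - gamma)`, which is at least `r / (2 * (1 - gamma))` and grows without bound as `gamma` approaches 1. The sketch below searches over `p` on a grid; all names are hypothetical.

```python
import numpy as np

def minimax_regret_fixed_policy(r, gamma, grid=101):
    """Grid search over p = P(action 0) for the smallest worst-case regret."""
    best_value = r / (1.0 - gamma)                        # value of the best response for either type
    p = np.linspace(0.0, 1.0, grid)
    regret_type_0 = best_value - p * best_value           # type 0 rewards action 0 only
    regret_type_1 = best_value - (1.0 - p) * best_value   # type 1 rewards action 1 only
    worst = np.maximum(regret_type_0, regret_type_1)
    k = int(np.argmin(worst))
    return float(p[k]), float(worst[k])

p_star, gap = minimax_regret_fixed_policy(r=1.0, gamma=0.9)
```

In this instance an adaptive agent that identifies the type after a few observations would lose only the reward of those first steps, whereas the best fixed policy loses half of the achievable value.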

## 3 Designing Robust Policies

In this section, we introduce our algorithmic framework for designing robust policies for the AI agent $A^y$.

### 3.1 Algorithmic framework

Our approach relies on observing the behavior (i.e., actions taken) to make inferences about the agent $A^x$’s type $\theta$ and adapting agent $A^y$’s policy accordingly to facilitate efficient cooperation. This is inspired by how people make decisions in uncertain situations (e.g., the ability to drive a car safely even if another driver on the road is driving aggressively). The key intuition is that at test time, the agent $A^y$ can observe agent $A^x$’s actions, which are taken as $a^x \sim \pi^x_\theta(\cdot \mid s)$ when in state $s$, to infer $\theta$, and in turn use this additional information to make an improved decision on which actions to take. More formally, we define the observation history available at the beginning of timestep $t$ as $O_{t-1} = (s_\tau, a^x_\tau)_{\tau = 1, \ldots, t-1}$ and use it to infer the type of agent $A^x$ and act appropriately.

In particular, we will make use of an Inference procedure (details provided in Section 5). Given $O_{t-1}$, this procedure returns an estimate of the type of agent $A^x$ at time $t$ given by $\hat{\theta}_t \in \Theta$. Then, we consider stochastic policies of the form $\psi : S \times A \times \Theta \to [0, 1]$. The space of these policies is given by $\Psi$. For a fixed policy $\psi$ of agent $A^y$ and fixed, unknown $\theta$, we define its total expected reward in the MDP $\mathcal{M}(\theta)$ as follows:

$$J_\theta(\psi) = \mathbb{E}\left[\sum_{\tau=1}^{\infty} \gamma^{\tau-1} R_\theta(s_\tau, a_\tau) \;\middle|\; D_0, T_\theta, \psi\right]. \tag{4}$$

Note that at any time $\tau$, we have $\hat{\theta}_\tau \leftarrow \text{Inference}(O_{\tau-1})$ and $a_\tau$ is generated according to $\psi(\cdot \mid s_\tau, \hat{\theta}_\tau)$.

We seek to find the policy for agent $A^y$ given by the following optimization problem:

$$\min_{\psi \in \Psi} \max_{\theta \in \Theta} \big(J_\theta(\pi^*_\theta) - J_\theta(\psi)\big) \tag{5}$$

In the next two sections, we will design algorithms to optimize the objective in Equation (5) following the framework outlined in Algorithm 1. In particular, we will discuss two possible architectures for policy $\psi$ and corresponding Training procedures in Section 4. Then, in Section 5, we describe ways to implement the Inference procedure for inferring agent $A^x$’s type using observed actions. Below, we provide theoretical insights into the robustness of the proposed algorithmic framework.
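The test-phase interaction described in this section (infer the type from the history, act with the adaptive policy, observe the other agent's action, extend the history) can be sketched as a simple loop. This is a hedged reading of the framework, not a reproduction of Algorithm 1: the function name `run_test_phase`, the callable interfaces, and the toy stand-ins for the policy and the estimator are all assumptions.

```python
import numpy as np

def run_test_phase(T_xy, pi_x_true, psi, inference, s0, horizon, seed=0):
    """Interact for `horizon` steps without observing rewards.

    T_xy[s, a, b, s2]:  two-agent dynamics (a: AI action, b: user action).
    pi_x_true[s, b]:    the user's policy, unknown to the AI agent.
    psi(s, theta_hat):  returns a probability vector over AI actions.
    inference(history): returns the current type estimate.
    """
    rng = np.random.default_rng(seed)
    n_states = T_xy.shape[0]
    history, estimates, s = [], [], s0
    for _ in range(horizon):
        theta_hat = inference(history)                 # estimate from past observations only
        probs = psi(s, theta_hat)
        a = int(rng.choice(len(probs), p=probs))       # AI agent acts
        b = int(rng.choice(pi_x_true.shape[1], p=pi_x_true[s]))  # user acts
        history.append((s, b))                         # record (state, user action); no reward
        estimates.append(theta_hat)
        s = int(rng.choice(n_states, p=T_xy[s, a, b]))
    return history, estimates

# Hypothetical stand-ins: a uniform policy and an estimator that averages user actions.
T_xy = np.full((2, 2, 2, 2), 0.5)
pi_x_true = np.array([[0.8, 0.2], [0.8, 0.2]])
psi = lambda s, theta_hat: np.array([0.5, 0.5])
inference = lambda history: float(np.mean([b for _, b in history])) if history else 0.5
history, estimates = run_test_phase(T_xy, pi_x_true, psi, inference, s0=0, horizon=50)
```

Note that the loop never touches a reward signal, which mirrors the constraint that rewards are unavailable after deployment.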

### 3.2 Performance analysis

We begin by specifying three technical questions that are important to gain theoretical insights into the robustness of the proposed framework, see below:

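1. Independent of the specific procedures used for Training and Inference, the first question to tackle is the following: When agent $A^x$’s true type is $\theta^{\text{test}}$ and agent $A^y$ uses a best response policy $\pi^*_{\hat{\theta}}$ for $\hat{\theta}$ such that $\|\theta^{\text{test}} - \hat{\theta}\|_2 \le \epsilon$, what are the performance guarantees on the total utility achieved by agent $A^y$? (see Theorem 2).

2. Regarding Training procedure: When agent $A^x$’s type is $\theta^{\text{test}}$ and the inference procedure outputs $\hat{\theta}$ such that $\|\theta^{\text{test}} - \hat{\theta}\|_2 \le \epsilon$, what is the performance of policy $\psi$? (see Section 4).

3. Regarding Inference procedure: When agent $A^x$’s type is $\theta^{\text{test}}$, can we infer $\hat{\theta}$ such that either $\|\theta^{\text{test}} - \hat{\theta}\|_2$ is small, or agent $A^x$’s policies $\pi^x_{\hat{\theta}}$ and $\pi^x_{\theta^{\text{test}}}$ are approximately equivalent? (see Section 5)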

#### 3.2.1 Smoothness properties

For addressing Q.1, we introduce a number of properties characterizing our problem setting. These properties are essentially smoothness conditions on MDPs that enable us to make statements about the following intermediate issue: For two types $\theta, \theta'$, how “similar” are the corresponding MDPs $\mathcal{M}(\theta), \mathcal{M}(\theta')$ from agent $A^y$’s point of view?

The first property characterizes the smoothness of rewards for agent $A^y$ w.r.t. parameter $\theta$. Formally, the parametric MDP $\mathcal{M}(\theta)$ is $\alpha$-smooth with respect to the rewards if for any $\theta$ and $\theta'$ we have

$$\max_{s \in S, a \in A} \big|R_\theta(s, a) - R_{\theta'}(s, a)\big| \le \alpha \cdot r_{\max} \cdot \|\theta - \theta'\|_2 \tag{6}$$

The second property characterizes the smoothness of policies for agent $A^x$ w.r.t. parameter $\theta$; this in turn implies that the MDP’s transition dynamics as perceived by agent $A^y$ are smooth. Formally, the parametric MDP $\mathcal{M}(\theta)$ is $\beta$-smooth in the behavior of agent $A^x$ if for any $\theta$ and $\theta'$ we have

$$\max_{s \in S} \mathrm{KL}\big(\pi^x_\theta(\cdot \mid s);\, \pi^x_{\theta'}(\cdot \mid s)\big) \le \beta \cdot \|\theta - \theta'\|_2. \tag{7}$$

For instance, one setting where this property holds naturally is when $\pi^x_\theta$ is a soft Bellman policy computed w.r.t. a reward function for agent $A^x$ which is smooth in $\theta$ Ziebart (2010); Kamalaruban et al. (2019).
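A soft Bellman policy of the kind mentioned above can be computed by soft value iteration, where the hard maximum over actions is replaced by a log-sum-exp. The sketch below assumes a tabular single-agent MDP and unit temperature; the function name and the toy instance are hypothetical.

```python
import numpy as np

def soft_bellman_policy(T, R, gamma, n_iters=500):
    """Soft value iteration: V(s) = log sum_a exp Q(s, a), pi(a|s) = exp(Q(s, a) - V(s)).

    T[s, a, s2]: transition kernel, R[s, a]: rewards.
    """
    v = np.zeros(T.shape[0])
    for _ in range(n_iters):
        q = R + gamma * (T @ v)                    # soft Q-values, shape (S, A)
        q_max = q.max(axis=1, keepdims=True)       # subtract the max for numerical stability
        v = (q_max + np.log(np.exp(q - q_max).sum(axis=1, keepdims=True))).ravel()
    return np.exp(q - v[:, None])                  # each row sums to one

# Hypothetical check: one state, two actions, no discounting of the future needed.
T = np.ones((1, 2, 1))
R = np.array([[1.0, 0.0]])
pi_soft = soft_bellman_policy(T, R, gamma=0.0)     # reduces to a softmax over rewards
```

Because the policy depends on the rewards through a smooth softmax-like map, small changes in the reward parameters lead to small changes in the action distribution, which is the intuition behind the smoothness condition in Equation (7).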

The third property is a notion of influence as introduced by Dimitrakakis et al. (2017): This notion captures how much one agent can affect the probability distribution of the next state with her actions as perceived by the second agent. Formally, we capture the influence of agent $A^x$ on agent $A^y$ as follows:

$$I_x := \max_{s \in S} \Big(\max_{a, b, b'} \big\|T^{x,y}(\cdot \mid s, a, b) - T^{x,y}(\cdot \mid s, a, b')\big\|_1\Big), \tag{8}$$

where $a$ represents the action of agent $A^y$, $b, b'$ represent two distinct actions of agent $A^x$, and $T^{x,y}$ is the transition dynamics of the two-agent MDP (see Section 2.1). Note that $I_x$ is nonnegative and bounded, and allows us to do fine-grained performance analysis: for instance, when $I_x = 0$, then agent $A^x$ doesn’t affect the transition dynamics as perceived by agent $A^y$ and we can expect to have better performance for agent $A^y$.
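The influence quantity in Equation (8) is straightforward to evaluate for a tabular two-agent model by comparing, for every state and AI action, the next-state distributions induced by each pair of user actions. A minimal sketch, assuming the (hypothetical) array layout `T_xy[s, a, b, s2]`:

```python
import numpy as np

def influence(T_xy):
    """Largest L1 distance between next-state distributions when only the
    user's action changes. T_xy[s, a, b, s2] with b the user's action."""
    # Broadcast to compare every pair (b, b2) of user actions.
    diff = T_xy[:, :, :, None, :] - T_xy[:, :, None, :, :]
    return float(np.abs(diff).sum(axis=-1).max())

# User action has no effect on the dynamics: influence is zero.
T_none = np.full((2, 1, 2, 2), 0.5)

# User action fully determines the next state: influence is maximal.
T_full = np.zeros((2, 1, 2, 2))
T_full[:, 0, 0, 0] = 1.0
T_full[:, 0, 1, 1] = 1.0

i_none, i_full = influence(T_none), influence(T_full)
```

The second case attains the value 2, which is the largest possible $\ell_1$ distance between two probability distributions.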

#### 3.2.2 Guarantees

Putting this together, we can provide the following guarantees as an answer for Q.1:

###### Theorem 2.

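Let $\theta^{\text{test}} \in \Theta$ be the type of agent $A^x$ at test time and let agent $A^y$ use a policy $\pi^*_{\hat{\theta}}$ such that $\|\theta^{\text{test}} - \hat{\theta}\|_2 \le \epsilon$. The parameters $(\alpha, \beta, I_x)$ characterize the smoothness as defined above. Then, the total reward achieved by agent $A^y$ satisfies the following guarantee

$$J_{\theta^{\text{test}}}(\pi^*_{\hat{\theta}}) \ge J_{\theta^{\text{test}}}(\pi^*_{\theta^{\text{test}}}) - \frac{\epsilon \cdot \alpha \cdot r_{\max}}{1 - \gamma} - \frac{I_x \cdot \sqrt{2 \cdot \beta \cdot \epsilon} \cdot r_{\max}}{(1 - \gamma)^2}$$

The proof of the theorem is provided in the supplementary material and builds upon the theory of approximate equivalence of MDPs by Even-Dar and Mansour (2003). In the next two sections, we provide specific instantiations of Training and Inference procedures.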
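To get a feel for the magnitude of the two penalty terms in Theorem 2, the bound can be evaluated numerically. The helper below simply transcribes the right-hand side of the guarantee; the function name and the example values are hypothetical and chosen only for illustration.

```python
import math

def theorem2_gap(eps, alpha, beta, influence, r_max, gamma):
    """Upper bound on the loss relative to the best response for the true type."""
    reward_term = eps * alpha * r_max / (1.0 - gamma)
    dynamics_term = influence * math.sqrt(2.0 * beta * eps) * r_max / (1.0 - gamma) ** 2
    return reward_term + dynamics_term

# With zero influence only the reward-mismatch term remains.
gap_no_influence = theorem2_gap(eps=0.02, alpha=1.0, beta=1.0, influence=0.0, r_max=1.0, gamma=0.9)
# With nonzero influence the dynamics term scales as sqrt(eps) / (1 - gamma)^2.
gap_with_influence = theorem2_gap(eps=0.02, alpha=1.0, beta=1.0, influence=0.5, r_max=1.0, gamma=0.9)
```

The example shows that the dynamics term can dominate: it decays only with the square root of the inference error and carries an extra factor of $1/(1-\gamma)$.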

## 4 Training Procedures

In this section, we present two procedures to train adaptive policies (see Training in Algorithm 1).

### 4.1 Training procedure AdaptPool

The basic idea of AdaptPool is to maintain a pool Pool of best response policies for $A^y$ and, in the test phase, switch between these policies based on inference of the type $\theta^{\text{test}}$.

#### 4.1.1 Architecture of the policy ψ

The adaptive pool based policy $\psi$ (AdaptPool) consists of a pool (Pool) of best response policies corresponding to different possible agent $A^x$’s types $\theta$, and a nearest-neighbor policy selection mechanism. In particular, when invoking AdaptPool for state $s_t$ and inferred agent $A^x$’s type $\hat{\theta}_t$, the policy first identifies the most similar agent in Pool, i.e., $\theta' = \arg\min_{\theta \in \text{Pool}} \|\theta - \hat{\theta}_t\|_2$, and then executes an action using the best response policy $\pi^*_{\theta'}$.

#### 4.1.2 Training process

During training we compute a pool of best response policies Pool for a set of possible agent $A^x$’s types $\Theta^{\text{train}}$, see Algorithm 2.
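The two ingredients of AdaptPool, precomputing best responses for a set of training types and selecting the nearest one at test time, can be sketched for tabular MDPs as follows. This is an illustrative sketch rather than Algorithm 2 itself: best responses are obtained with plain value iteration, and `build_mdp`, `train_pool`, and `adapt_pool_action` are hypothetical names.

```python
import numpy as np

def value_iteration(T, R, gamma, n_iters=500):
    """Greedy (deterministic) best response for a tabular MDP."""
    v = np.zeros(T.shape[0])
    for _ in range(n_iters):
        q = R + gamma * (T @ v)      # Q-values, shape (S, A)
        v = q.max(axis=1)
    return q.argmax(axis=1)          # one action index per state

def train_pool(theta_train, build_mdp, gamma):
    """Store one best response policy per training type."""
    return [(np.asarray(theta, dtype=float), value_iteration(*build_mdp(theta), gamma))
            for theta in theta_train]

def adapt_pool_action(pool, s, theta_hat):
    """Nearest-neighbor lookup in parameter space, then act with that policy."""
    dists = [np.linalg.norm(theta - np.asarray(theta_hat)) for theta, _ in pool]
    _, policy = pool[int(np.argmin(dists))]
    return int(policy[s])

# Hypothetical one-state MDP whose two action rewards are the entries of theta.
def build_mdp(theta):
    T = np.ones((1, 2, 1))
    R = np.array([[theta[0], theta[1]]], dtype=float)
    return T, R

pool = train_pool([(1.0, 0.0), (0.0, 1.0)], build_mdp, gamma=0.9)
action_a = adapt_pool_action(pool, s=0, theta_hat=(0.9, 0.2))
action_b = adapt_pool_action(pool, s=0, theta_hat=(0.1, 0.8))
```

The estimate does not need to match a training type exactly; it only needs to fall closer to the right pool entry than to any other.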

#### 4.1.3 Guarantees

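It turns out that if the set of possible agent $A^x$’s types $\Theta^{\text{train}}$ is chosen appropriately, Algorithm 1 instantiated with AdaptPool enjoys strong performance guarantees. In particular, choosing $\Theta^{\text{train}}$ as a sufficiently fine $\epsilon'$-cover of the parameter space $\Theta$ ensures that for any $\theta^{\text{test}} \in \Theta$ that we might encounter at test time, we have considered a sufficiently similar agent during training and hence can execute a best response policy which achieves good performance, see corollary below.

###### Corollary 3.

Let $\Theta^{\text{train}}$ be an $\epsilon'$-cover for $\Theta$, i.e., for all $\theta \in \Theta$ there exists $\theta' \in \Theta^{\text{train}}$ such that $\|\theta - \theta'\|_2 \le \epsilon'$. Let $\theta^{\text{test}} \in \Theta$ be the type of agent $A^x$ and let the Inference procedure output $\hat{\theta}_t$ such that $\|\hat{\theta}_t - \theta^{\text{test}}\|_2 \le \epsilon''$. Let $\epsilon := \epsilon' + \epsilon''$. Then, at time $t$, the policy $\pi^*_{\hat{\theta}_t}$ used by agent $A^y$ has the following guarantees:

$$J_{\theta^{\text{test}}}(\pi^*_{\hat{\theta}_t}) \ge J_{\theta^{\text{test}}}(\pi^*_{\theta^{\text{test}}}) - \frac{\epsilon \cdot \alpha \cdot r_{\max}}{1 - \gamma} - \frac{I_x \cdot \sqrt{2 \cdot \beta \cdot \epsilon} \cdot r_{\max}}{(1 - \gamma)^2}$$

Corollary 3 follows from the result of Theorem 2 given that the pool Pool of policies trained by AdaptPool is sufficiently rich. Note that the accuracy of Inference would typically improve over time and hence the performance of the algorithm is expected to improve over time in practice, see Section 6.2. Building on the idea of AdaptPool, we next provide a more practical implementation of the training procedure which does not require maintaining an explicit pool of best response policies and is therefore easier to scale to larger problems.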
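One simple way to obtain a cover of a box-shaped parameter space is a uniform grid whose spacing is chosen so that every point lies within the desired Euclidean distance of some grid point. The sketch below assumes, purely for illustration, a parameter space of the form $[\text{low}, \text{high}]^d$; the function name `grid_cover` and the numeric values are hypothetical.

```python
import itertools
import math
import numpy as np

def grid_cover(low, high, dim, eps):
    """Uniform grid on [low, high]^dim with L2 covering radius at most eps.

    A grid with per-axis spacing h has covering radius h * sqrt(dim) / 2,
    so the spacing must not exceed 2 * eps / sqrt(dim).
    """
    max_spacing = 2.0 * eps / math.sqrt(dim)
    n = max(2, math.ceil((high - low) / max_spacing) + 1)   # points per axis
    axis = np.linspace(low, high, n)
    return np.array(list(itertools.product(axis, repeat=dim)))

cover = grid_cover(low=-1.0, high=1.0, dim=2, eps=0.25)

# Empirical check: distance from random points to their nearest cover element.
rng = np.random.default_rng(0)
samples = rng.uniform(-1.0, 1.0, size=(1000, 2))
dists = np.linalg.norm(samples[:, None, :] - cover[None, :, :], axis=-1).min(axis=1)
```

The number of grid points grows exponentially with the dimension of the parameter space, which is one reason a pool-based approach becomes harder to scale.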

### 4.2 Training procedure AdaptDQN

AdaptDQN builds on the ideas of AdaptPool: Here, instead of explicitly maintaining a pool of best response policies for agent $A^y$, we have a policy network trained on an augmented state space $S \times \Theta$. This policy network resembles the Deep Q-Network (DQN) architecture Mnih et al. (2015), but operates on an augmented state space and takes as input a tuple $(s, \theta)$. A similar architecture was used by Hessel et al. (2019), where one policy network was trained to play 57 Atari games, and the state space was augmented with the index of the game. In the test phase, agent $A^y$ selects actions given by this policy network.

#### 4.2.1 Architecture of the policy ψ

The adaptive policy $\psi$ (AdaptDQN) consists of a neural network trained on an augmented state space $S \times \Theta$. In particular, when invoking AdaptDQN for state $s_t$ and inferred agent $A^x$’s type $\hat{\theta}_t$, we use the augmented state $(s_t, \hat{\theta}_t)$ as input to the neural network. The output layer of the network computes the Q-values of all possible actions corresponding to the augmented input state. Agent $A^y$ selects the action with the maximum Q-value.

#### 4.2.2 Training process

Here, we provide a description of how we train the policy network using the augmented state space, see Algorithm 3. During one iteration of training the policy network, we first sample a parameter $\theta$ from $\Theta^{\text{train}}$. We then obtain the optimal best response policy $\pi^*_\theta$ of agent $A^y$ for the MDP $\mathcal{M}(\theta)$. We compute the vector of all Q-values corresponding to this policy (see Algorithm 3) using the standard Bellman equations Sutton and Barto (1998). In our setting, we use these pre-computed Q-values to serve as the target values for the associated parameter $\theta$ for training the policy network. The loss function used for training is the standard squared error loss between the target Q-values computed using the procedure described above and those given by the network under training. The gradient of this loss function is used for back-propagation through the network. Multiple such iterations are carried out during training, until a convergence criterion is met. For more details on Deep Q-Networks, we refer the reader to Mnih et al. (2015).
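The training recipe described above (sample a type, compute the best response Q-values for that type, regress the model's output on those targets with a squared error loss) can be sketched compactly. To keep the example short and dependency-free, a linear model on the augmented input stands in for the neural network; this substitution, the function names, and the toy MDP are assumptions made for illustration and do not reproduce Algorithm 3.

```python
import numpy as np

def encode(s, theta, n_states):
    """Augmented input: one-hot state concatenated with the type vector."""
    x = np.zeros(n_states + len(theta))
    x[s] = 1.0
    x[n_states:] = theta
    return x

def q_targets(T, R, gamma, n_iters=500):
    """Optimal Q-values of a tabular MDP via value iteration."""
    v = np.zeros(T.shape[0])
    for _ in range(n_iters):
        q = R + gamma * (T @ v)
        v = q.max(axis=1)
    return q

def train_q_model(theta_train, build_mdp, gamma, n_states, n_actions,
                  steps=2000, lr=0.05, seed=0):
    """SGD on the squared error between predicted and precomputed Q-values."""
    rng = np.random.default_rng(seed)
    W = np.zeros((n_states + len(theta_train[0]), n_actions))
    targets = [q_targets(*build_mdp(th), gamma) for th in theta_train]
    losses = []
    for _ in range(steps):
        i = int(rng.integers(len(theta_train)))     # sample a training type
        s = int(rng.integers(n_states))             # sample a state
        x = encode(s, theta_train[i], n_states)
        err = x @ W - targets[i][s]                 # prediction error per action
        W -= lr * np.outer(x, err)                  # gradient step (up to a constant factor)
        losses.append(float(err @ err))
    return W, losses

# Hypothetical one-state MDP whose two action rewards are the entries of theta;
# gamma = 0 keeps the targets linear in theta so the linear stand-in can fit them.
def build_mdp(theta):
    return np.ones((1, 2, 1)), np.array([[theta[0], theta[1]]], dtype=float)

theta_train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
W, losses = train_q_model(theta_train, build_mdp, gamma=0.0, n_states=1, n_actions=2)
q_pred = encode(0, (1.0, 0.0), 1) @ W               # predicted Q-values for type (1, 0)
```

At test time the greedy action is the argmax of the predicted Q-values for the input formed from the current state and the inferred type. A nonlinear network is needed in general, since optimal Q-values are typically not linear in the type parameter.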

## 5 Inference Procedure

In the test phase, the inference of agent $A^x$’s type $\theta^{\text{test}}$ from an observation history $O_{t-1}$ is a key component of our framework, and crucial for facilitating efficient collaboration. Concretely, Theorem 2 implies that a best response policy $\pi^*_{\hat{\theta}}$ also achieves good performance for agent $A^x$ with true parameter $\theta^{\text{test}}$ if $\|\hat{\theta} - \theta^{\text{test}}\|_2$ is small and the MDP $\mathcal{M}(\theta)$ is smooth w.r.t. parameter $\theta$ as described in Section 3.2.

There are several different approaches that one can consider for inference, depending on the application setting. For instance, we can use probabilistic approaches as proposed in the work of Everett and Roberts (2018) where a pool of agent $A^x$’s policies is maintained and inference is done at run time via simple probabilistic methods. Based on the work by Grover et al. (2018), we can also maintain a more compact representation of agent $A^x$’s policies and then apply probabilistic methods on this representation.
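As an illustration of the simple probabilistic route, the sketch below keeps a finite set of candidate types together with the policies they induce and returns the candidate with the highest likelihood of the observed (state, action) pairs under a uniform prior. It is a generic maximum-likelihood estimator written for this example, not the method of Everett and Roberts (2018) and not the MCE-IRL module used in the experiments; all names are hypothetical.

```python
import numpy as np

def infer_type(history, candidate_thetas, candidate_policies):
    """Maximum-likelihood type among a finite set of candidates.

    history:            list of (state, user_action) pairs.
    candidate_policies: one array pi[s, b] per candidate type.
    With an empty history, the first candidate is returned.
    """
    log_lik = np.zeros(len(candidate_thetas))
    for s, b in history:
        for k, pi in enumerate(candidate_policies):
            log_lik[k] += np.log(pi[s, b] + 1e-12)   # small constant avoids log(0)
    return candidate_thetas[int(np.argmax(log_lik))]

# Two hypothetical types that prefer opposite actions in the single state.
candidate_thetas = [(1.0, 0.0), (0.0, 1.0)]
candidate_policies = [np.array([[0.9, 0.1]]), np.array([[0.1, 0.9]])]
theta_hat = infer_type([(0, 0), (0, 0), (0, 1)], candidate_thetas, candidate_policies)
```

If two candidates induce the same policy, their likelihoods coincide and the estimator cannot separate them, although acting on either estimate then yields the same perceived dynamics.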

We can also do inference based on ideas of inverse reinforcement learning (IRL) where the observation history $O_{t-1}$ serves the purpose of demonstrations Abbeel and Ng (2004); Ziebart (2010). This is particularly suitable when the parameter $\theta$ exactly corresponds to the rewards used by agent $A^x$ when computing its policy $\pi^x_\theta$. In fact, this is the approach that we follow for our inference module, and in particular, we employ a popular IRL algorithm, namely the Maximum Causal Entropy (MCE) IRL algorithm Ziebart (2010). We refer the reader to Section 6.1 for more details.

## 6 Experiments

We evaluate the performance of our algorithms using a gathering game environment, see Figure 2. Below, we provide details of the experimental setup and then discuss results.

### 6.1 Experimental setup

#### 6.1.1 Environment details

For our experiments, we consider an episodic setting where two agents play the game repeatedly for multiple episodes enumerated as $e = 1, 2, \ldots$. Each episode of the game lasts for 500 steps. Now, to translate the episode count to time steps $t$ as used in Algorithm 1 (line 3), we have $t = 500 \times e$ at the end of the $e$-th episode.

For any fixed $\theta$, agent $A^x$’s policy $\pi^x_\theta$ is computed first by ignoring the presence of agent $A^y$ as described below; this is in line with our motivating applications where agent $A^x$ is the human agent with a pre-specified policy. In order to compute agent $A^x$’s policy $\pi^x_\theta$, we consider agent $A^x$ operating in a single-agent MDP where (i) the state space corresponds to the location of agent $A^x$ in the grid-space, (ii) the action space is as described in Figure 2, (iii) the reward function corresponds to the reward associated with two fruit types given by $\theta$, (iv) the transition dynamics are those of agent $A^x$ alone in the environment, (v) there is a discount factor, and (vi) the initial state distribution corresponds to agent $A^x$ starting in the upper-left corner (see Figure 2). Given this MDP, we compute $\pi^x_\theta$ as a soft Bellman policy, which is suitable to capture sub-optimal human behaviour in applications Ziebart (2010).

From agent $A^y$’s point of view, each $\theta$ gives rise to a parametric MDP $\mathcal{M}(\theta)$ in which agent $A^y$ is operating in the game along with the corresponding agent $A^x$, see Figure 2.

#### 6.1.2 Baselines and implementation details.

We use three baselines to compare the performance of our algorithms: (i) Rand corresponds to picking a random $\theta \in \Theta$ and using the best response policy $\pi^*_\theta$, (ii) FixedMM corresponds to the fixed best response (in a minmax sense) policy in Eq. 3, and (iii) FixedBest is a variant of FixedMM and corresponds to the fixed best response (in an average sense) policy.

We implemented two variants of AdaptPool which store policies corresponding to two covers of $\Theta$ of different granularity (see Corollary 3); both variants are shown in Figure 3. Next, we give specifications of the trained policy network used in AdaptDQN. We used $\Theta^{\text{train}}$ to be a discretization of $\Theta$. The trained network has 3 hidden layers with leaky ReLU units and a linear output layer with one unit per action in the action set $A$ (see Mnih et al. (2015) for more details on training Deep Q-Networks). The input to the neural network is a concatenation of the locations of the agents and the parameter vector $\hat{\theta}_t$ (this corresponds to the augmented state space described in Section 4.2). The location of each agent is represented as a one-hot vector whose length equals the number of grid cells. Hence the length of the input vector to the neural network is twice the number of grid cells plus the dimension of the parameter vector. During training, agent $A^y$ implemented epsilon-greedy exploratory policies (with the exploration rate decaying linearly over training iterations from 1.0 to 0.01). Training lasted for about 50 million iterations.

Our inference module is based on the MCE-IRL approach Ziebart (2010) to infer $\theta^{\text{test}}$ by observing actions taken by agent $A^x$’s policy. Note that we are using MCE-IRL to infer the reward function parameters $\theta^{\text{test}}$ used by agent $A^x$ for computing its policy in the single-agent MDP (see “Environment details” above). At the beginning, the inference module is initialized with an initial estimate, and its output at time $t$ given by $\hat{\theta}_t$ is based on the history $O_{t-1}$. In particular, we implemented a sequential variant of the MCE-IRL algorithm which updates the estimate only at the end of every episode using stochastic gradient descent. We refer the reader to Ziebart (2010) for details on the original MCE-IRL algorithm and to Kamalaruban et al. (2019) for the sequential variant.

### 6.2 Results: Worst-case and average performance

We evaluate the performance of algorithms on different $\theta^{\text{test}}$ obtained by a discretization of the 2-D parametric space $\Theta$. For a given $\theta^{\text{test}}$, the results were averaged over multiple runs. Results are shown in Figure 3. As can be seen in Figure 3(a), the worst-case performance of both AdaptDQN and AdaptPool is significantly better than that of the three baselines (FixedBest, Rand and FixedMM), indicating robustness of our algorithmic framework. In our experiments, the FixedMM and FixedBest baselines correspond to best response policies for two particular values of $\theta$. Under both these policies, agent $A^y$’s behavior is qualitatively similar to the one shown in Figure 1(c). As can be seen, under these policies, agent $A^y$ avoids both fruits and avoids any collision; however, this does not allow agent $A^y$ to assist agent $A^x$ in collecting fruits even in scenarios where fruits have positive rewards.

In Figure 3(c), we show the convergence behavior of the inference module. Here, Worst shows the worst case performance: As can be seen in the Worst line, there are cases where the performance of the inference procedure is bad, i.e., $\|\hat{\theta}_t - \theta^{\text{test}}\|$ is large. This usually happens when different parameter values of $\theta$ result in agent $A^x$ having equivalent policies. In these cases, estimating the exact $\theta^{\text{test}}$ without any additional information is difficult. In our experiments, we noted that even if $\|\hat{\theta}_t - \theta^{\text{test}}\|$ is large, it is often the case that agent $A^x$’s policies $\pi^x_{\hat{\theta}_t}$ and $\pi^x_{\theta^{\text{test}}}$ are approximately equivalent, which is important for getting a good approximation of the transition dynamics $T_{\theta^{\text{test}}}$. Despite the poor performance of the inference module in such cases, the performance of our algorithms is significantly better than the baselines (as is evident in Figure 3(a)). In the supplementary material, we provide additional experimental results corresponding to the algorithms’ performance for each individual $\theta^{\text{test}}$ to gain further insights.

### 6.3 Results: Performance heatmaps for each $\theta^{\text{test}}$

Here, we provide additional experimental results to gain further insights into the performance of our algorithms. These results are presented in Figure 4 in the form of heat maps for each individual $\theta^{\text{test}}$: Heat maps either represent the performance of algorithms (in terms of the total reward $J_{\theta^{\text{test}}}$) or the performance of the inference procedure (in terms of the norm $\|\hat{\theta}_t - \theta^{\text{test}}\|$). These results are plotted at a fixed episode (cf. Figure 3, where the performance was plotted over time with an increasing episode count).

It is important to note that there are cases where the performance of the inference procedure is bad, i.e., $\|\hat{\theta}_t - \theta^{\text{test}}\|$ is large. This usually happens when different parameter values of $\theta$ result in agent $A^x$ having equivalent policies. In these cases, estimating the exact $\theta^{\text{test}}$ without any additional information is difficult. In our experiments, we noted that even if $\|\hat{\theta}_t - \theta^{\text{test}}\|$ is large, it is often the case that agent $A^x$’s policies $\pi^x_{\hat{\theta}_t}$ and $\pi^x_{\theta^{\text{test}}}$ are approximately equivalent, which is important for getting a good approximation of the transition dynamics $T_{\theta^{\text{test}}}$. Despite the poor performance of the inference module in such cases, the performance of our algorithms (see AdaptPool and AdaptDQN in the figure) is significantly better than the baselines (see FixedBest in the figure).

## 7 Related Work

Modeling and inferring about other agents. The inference problem has been considered in the literature in various forms. For instance, Grover et al. (2018) consider the problem of learning policy representations that can be used for interacting with unseen agents when using representation-conditional policies. They also consider the case of inferring another agent’s representation (parameters) during test time. Macindoe et al. (2012) consider planners for collaborative domains that can take actions to learn about the intent of another agent or hedge against its uncertainty. Nikolaidis et al. (2015) cluster human users into types and aim to infer the type of a new user online, with the goal of executing the policy for that type. They test their approach in robot-human interaction but do not provide any theoretical analysis of their approach. Beyond reinforcement learning, the problem of modeling and inferring about other agents has been studied in other applications such as personalization of web search ranking results by inferring users’ preferences based on their online activity White et al. (2013, 2014); Singla et al. (2014).

Multi-task and meta-learning. Our problem setting can be interpreted as a multi-task RL problem in which each possible agent $A^x$ corresponds to a different task, or as a meta-learning RL problem in which the goal is to learn a policy that can quickly adapt to new partners. Hessel et al. (2019) study the problem of multi-task learning in the RL setting in which a single agent has to solve multiple tasks, e.g., solve all Atari games. However, they do not consider a separate test set to measure generalization of trained agents but rather train and evaluate on the same tasks. Sæmundsson et al. (2018) consider the problem of meta-learning for RL in the context of changing dynamics of the environment and approach it using Gaussian processes and a hierarchical latent variable model.

Robust RL. The idea of robust RL is to learn policies that are robust to certain types of errors or mismatches. In the context of our paper, mismatch occurs in the sense of encountering human agents that have not been encountered at training time, and the learned policies should be robust in this situation. Pinto et al. (2017) consider training of policies in the context of a destabilizing adversary with the goal of coping with model mismatch and data scarcity. Roy et al. (2017) study the problem of RL under model mismatch such that the learning agent cannot interact with the actual test environment but only with a reasonably close approximation. The authors develop robust model-free learning algorithms for this setting.

More complex interactions, teaching, and steering. In our paper, the type of interaction between two agents is limited as agent $A^y$ does not affect agent $A^x$’s behaviour, allowing us to gain a deeper theoretical understanding of this setting. There is also a related literature on “steering” the behavior of another agent. Examples include (i) the environment design framework of Zhang et al. (2009), where one agent tries to steer the behavior of another agent by modifying its reward function, (ii) the cooperative inverse reinforcement learning of Hadfield-Menell et al. (2016), where the human uses demonstrations to reveal a proper reward function to the AI agent, and (iii) the advice-based interaction model of Amir et al. (2016), where the goal is to communicate advice to a sub-optimal agent on how to act.

Dealing with non-stationary agents. The work of Everett and Roberts (2018) is closely related to ours: they design a Switching Agent Model (SAM) that combines deep reinforcement learning with opponent modelling to robustly switch between multiple policies. Zheng et al. (2018) also consider a similar setting of detecting non-stationarity and reusing policies on the fly, and introduce a distilled policy network that serves as the policy library. Our algorithmic framework is similar in spirit to these two papers. However, in our setting, the focus is on acting optimally against an unknown agent whose behavior is stationary, and we provide theoretical guarantees on the performance of our algorithms. Singla et al. (2018) have considered the problem of learning with expert advice where experts are not stationary and are learning agents themselves. However, their focus is on designing a meta-algorithm on how to coordinate with these experts and is technically very different from ours. A few other recent papers have also considered repeated human-AI interaction where the human agent is non-stationary and is evolving its behavior in response to the AI agent (see Radanovic et al. (2019); Nikolaidis et al. (2017b)). Prior work also considers a learner that is aware of the presence of other actors (see Foerster et al. (2018); Raileanu et al. (2018)).

## 8 Conclusions

Inspired by real-world applications like virtual personal assistants, we studied the problem of designing AI agents that can robustly cooperate with new people in human-machine partnerships. In line with our motivating applications, we focused on an important practical aspect: there is often a clear distinction between the training and test phases, in that explicit reward information is only available during training while adaptation is also needed during testing. We provided a framework for designing adaptive policies and gave theoretical insights into its robustness. In experiments, we demonstrated that these policies can achieve good performance when interacting with previously unseen agents.

#### Acknowledgements

This work was supported by Microsoft Research through its PhD Scholarship Programme.

## References

• P. Abbeel and A. Y. Ng (2004) Apprenticeship learning via inverse reinforcement learning. In ICML, Cited by: §5.
• S. Amershi, D. Weld, M. Vorvoreanu, A. Fourney, B. Nushi, P. Collisson, J. Suh, S. Iqbal, P. N. Bennett, K. Inkpen, J. Teevan, R. Kikin-Gil, and E. Horvitz (2019) Guidelines for human-AI interaction. In CHI, pp. 3:1–3:13. Cited by: §1.
• O. Amir, E. Kamar, A. Kolobov, and B. Grosz (2016) Interactive teaching strategies for agent training. In IJCAI, Cited by: §7.
• J. Bobadilla, F. Ortega, A. Hernando, and J. Bernal (2012) A collaborative filtering approach to mitigate the new user cold start problem. Knowledge-Based Systems 26, pp. 225–238. ISSN 0950-7051. Cited by: §1.
• C. Dimitrakakis, D. C. Parkes, G. Radanovic, and P. Tylkin (2017) Multi-view decision processes: The helper-AI problem. In Advances in Neural Information Processing Systems, Cited by: §B.2, §B.2, §B.2, §3.2.1, Lemma 6.
• E. Even-Dar and Y. Mansour (2003) Approximate equivalence of Markov decision processes. In Learning Theory and Kernel Machines, B. Schölkopf and M. K. Warmuth (Eds.), Berlin, Heidelberg, pp. 581–594. Cited by: §B.1, §B.1, Definition B.1, §3.2.2.
• R. Everett and S. J. Roberts (2018) Learning against non-stationary agents with opponent modelling and deep reinforcement learning. In AAAI Spring Symposia 2018, Cited by: §5, §7.
• J. N. Foerster, R. Y. Chen, M. Al-Shedivat, S. Whiteson, P. Abbeel, and I. Mordatch (2018) Learning with opponent-learning awareness. In AAMAS, pp. 122–130. Cited by: §7.
• A. Grover, M. Al-Shedivat, J. K. Gupta, Y. Burda, and H. Edwards (2018) Learning policy representations in multiagent systems. In ICML, pp. 1797–1806. Cited by: §1, §5, §7.
• D. Hadfield-Menell, S. J. Russell, P. Abbeel, and A. D. Dragan (2016) Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, Cited by: §1, §7.
• L. Haug, S. Tschiatschek, and A. Singla (2018) Teaching inverse reinforcement learners via features and demonstrations. In Advances in Neural Information Processing Systems, pp. 8464–8473. Cited by: §1.
• M. Hessel, H. Soyer, L. Espeholt, W. Czarnecki, S. Schmitt, and H. van Hasselt (2019) Multi-task deep reinforcement learning with PopArt. In AAAI, pp. 3796–3803. Cited by: §4.2, §7.
• P. Kamalaruban, R. Devidze, V. Cevher, and A. Singla (2019) Interactive teaching algorithms for inverse reinforcement learning. In IJCAI, Cited by: §3.2.1, §6.1.2.
• J. Z. Leibo, V. F. Zambaldi, M. Lanctot, J. Marecki, and T. Graepel (2017) Multi-agent reinforcement learning in sequential social dilemmas. In AAMAS, pp. 464–473. Cited by: Figure 2.
• O. Macindoe, L. P. Kaelbling, and T. Lozano-Pérez (2012) POMCoP: belief space planning for sidekicks in cooperative games. In AIIDE, Cited by: §7.
• V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §4.2.2, §4.2, §6.1.2.
• S. Nikolaidis, J. Forlizzi, D. Hsu, J. A. Shah, and S. S. Srinivasa (2017a) Mathematical models of adaptation in human-robot collaboration. CoRR abs/1707.02586. Cited by: §1.
• S. Nikolaidis, S. Nath, A. D. Procaccia, and S. Srinivasa (2017b) Game-theoretic modeling of human adaptation in human-robot collaboration. In Proceedings of the International conference on human-robot interaction, pp. 323–331. Cited by: §7.
• S. Nikolaidis, R. Ramakrishnan, K. Gu, and J. A. Shah (2015) Efficient model learning from joint-action demonstrations for human-robot collaborative tasks. In HRI, pp. 189–196. Cited by: §1, §7.
• M. S. Pinsker (1964) Information and information stability of random variables and processes. Holden-Day. Cited by: 2nd item.
• L. Pinto, J. Davidson, R. Sukthankar, and A. Gupta (2017) Robust adversarial reinforcement learning. In ICML, Cited by: §7.
• G. Radanovic, R. Devidze, D. Parkes, and A. Singla (2019) Learning to collaborate in Markov decision processes. In ICML, Cited by: §B.2, §7.
• R. Raileanu, E. Denton, A. Szlam, and R. Fergus (2018) Modeling others using oneself in multi-agent reinforcement learning. In ICML, pp. 4254–4263. Cited by: Figure 2, §7.
• A. Roy, H. Xu, and S. Pokutta (2017) Reinforcement learning under model mismatch. In Advances in Neural Information Processing Systems, pp. 3043–3052. Cited by: §7.
• S. Sæmundsson, K. Hofmann, and M. P. Deisenroth (2018) Meta reinforcement learning with latent variable Gaussian processes. In UAI, Cited by: §7.
• A. Singla, S. H. Hassani, and A. Krause (2018) Learning to interact with learning agents. In AAAI, pp. 4083–4090. Cited by: §7.
• A. Singla, R. W. White, A. Hassan, and E. Horvitz (2014) Enhancing personalization via search activity attribution. In SIGIR, pp. 1063–1066. Cited by: §7.
• R. S. Sutton and A. G. Barto (1998) Reinforcement learning - an introduction. Adaptive computation and machine learning, MIT Press. Cited by: §B.1, §4.2.2.
• S. Tschiatschek, A. Ghosh, L. Haug, R. Devidze, and A. Singla (2019) Learner-aware teaching: inverse reinforcement learning with preferences and constraints. In Advances in Neural Information Processing Systems, Cited by: §1.
• R. W. White, W. Chu, A. Hassan, X. He, Y. Song, and H. Wang (2013) Enhancing personalized search by mining and modeling task behavior. In WWW, pp. 1411–1420. Cited by: §7.
• R. W. White, A. Hassan, A. Singla, and E. Horvitz (2014) From devices to people: Attribution of search activity in multi-user settings. In WWW, pp. 431–442. Cited by: §7.
• H. J. Wilson and P. R. Daugherty (2018) Collaborative intelligence: Humans and AI are joining forces. Harvard Business Review 96 (4), pp. 114–123. Cited by: §1.
• H. Yu, C. Miao, C. Leung, and T. J. White (2017) Towards AI-powered personalization in MOOC learning. npj Science of Learning 2 (1), pp. 15. Cited by: §1.
• H. Zhang, D. C. Parkes, and Y. Chen (2009) Policy teaching through reward function learning. In EC, pp. 295–304. Cited by: §7.
• Y. Zheng, Z. Meng, J. Hao, Z. Zhang, T. Yang, and C. Fan (2018) A deep Bayesian policy reuse approach against non-stationary agents. In Advances in Neural Information Processing Systems, pp. 962–972. Cited by: §7.
• B. D. Ziebart (2010) Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Ph.D. Thesis. Cited by: §3.2.1, §5, §6.1.1, §6.1.2.

## Appendix A Proof of Theorem 1

###### Proof.

We provide a proof via constructing a problem instance. Let the parametric space be . Next, we define two MDPs and below:

• set of states is given by , with denoting a generic state. Here, state represents a state where reward can be accumulated, and state is a terminal state.

• set of actions is given by with denoting a generic action for agent  and denoting a generic action for agent .

• for , we have if and if where . Note that the reward function only depends on the state and not on the action taken. Also, the reward function is same for both and .

• discount factor and initial state distribution is given by .

• most crucial part of this problem instance is the transition dynamics and that we specify below. Note that, for , , where , i.e., corresponds to the transition dynamics derived from a two agent MDP for which agent ’s policy is . We define transition dynamics below in Figure 5, and policies and below in Figure 6.

Next, in Figure 7, we show two MDPs and as perceived by agent . It is easy to see that the best response policies for agent  are given as follows: (i) for , , and (ii) for , . Any action can be taken from state as it brings zero reward and agent continues to stay in . Also, these best response policies have a total reward given by and .

However, when the underlying parameter is unknown, the best response (in a maxmin sense) policy of agent as defined in Equation 3 is given by:

$$\pi^*_\Theta = \arg\min_{\pi \in \Pi} \max_{\theta \in \Theta} \big( J_\theta(\pi^*_\theta) - J_\theta(\pi) \big)$$

Next, we compute for our problem instance. When considering the space of policies , it is enough to focus only on the state and consider policies which take action from state with probability where . For any such policy such that , we can compute the following:

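$$J_{\theta_1}(\pi) = \frac{r_{\max}}{1 - p \cdot \gamma} \quad \text{and} \quad J_{\theta_2}(\pi) = \frac{r_{\max}}{1 - (1 - p) \cdot \gamma}$$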

It can easily be shown that is the policy given by , i.e., . Next we focus on the primary quantity of interest in the theorem, i.e.,

$$\max_{\theta \in \Theta} \big( J_\theta(\pi^*_\theta) - J_\theta(\pi^*_\Theta) \big)$$

As mentioned earlier, we have for both and . Also, it is easy to compute that for both and . Hence, we can show that

$$\max_{\theta \in \Theta} \big( J_\theta(\pi^*_\theta) - J_\theta(\pi^*_\Theta) \big) \geq \frac{r_{\max}}{1 - \gamma} - 2$$

which is arbitrarily large when $\gamma$ is close to 1 or for large values of $r_{\max}$.
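
The growth of this gap can be illustrated numerically. The sketch below is only an illustration built on the two closed-form expressions for $J_{\theta_1}(\pi)$ and $J_{\theta_2}(\pi)$ given above; the function names (`total_reward`, `worst_case_regret`, `minmax_policy`) are hypothetical, the best-response value $r_{\max}/(1-\gamma)$ is an assumption obtained by setting $p=1$ (respectively $p=0$) in those expressions, and the grid search stands in for the analytic argument.

```python
def total_reward(p_stay: float, gamma: float, r_max: float) -> float:
    """Discounted reward r_max / (1 - p_stay * gamma), following the closed form in the proof.

    p_stay is the per-step probability of picking the action that keeps the
    agent in the rewarding state (an interpretation assumed for this sketch).
    """
    return r_max / (1.0 - p_stay * gamma)


def worst_case_regret(p: float, gamma: float, r_max: float) -> float:
    """Largest regret over the two types when the action is chosen with probability p."""
    # Assumption: the best response earns r_max / (1 - gamma) under either type.
    best = total_reward(1.0, gamma, r_max)
    regret_theta1 = best - total_reward(p, gamma, r_max)
    regret_theta2 = best - total_reward(1.0 - p, gamma, r_max)
    return max(regret_theta1, regret_theta2)


def minmax_policy(gamma: float, r_max: float, grid: int = 1001) -> float:
    """Grid search for the p that minimizes the worst-case regret."""
    candidates = [i / (grid - 1) for i in range(grid)]
    return min(candidates, key=lambda p: worst_case_regret(p, gamma, r_max))


if __name__ == "__main__":
    for gamma in (0.9, 0.99, 0.999):
        p_star = minmax_policy(gamma, r_max=1.0)
        print(gamma, p_star, worst_case_regret(p_star, gamma, 1.0))
```

Under these assumptions the search settles on the symmetric policy $p = 1/2$, and the worst-case regret of that policy grows without bound as $\gamma$ approaches 1.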

## Appendix B Proof of Theorem 2

In this section, we provide a proof of Theorem 2. The proof builds on a few technical lemmas that we introduce first.

### B.1 Approximately-equivalent MDPs

First, we introduce a generic notion of approximately-equivalent MDPs and derive a few technical results for them that are useful to prove Theorem 2. This notion and technical results are adapted from the work by Even-Dar and Mansour (2003).

###### Definition B.1 (approximately-equivalent MDPs, adapted from Even-Dar and Mansour (2003)).

Suppose we have two MDPs and , and rewards are bounded in . We call and as approximately-equivalent if the following holds:

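$$\max_{a \in A,\, s \in S} \big\| T_1(\cdot \mid s, a) - T_2(\cdot \mid s, a) \big\|_1 \leq \epsilon_p$$

$$\max_{a \in A,\, s \in S} \big| R_1(s, a) - R_2(s, a) \big| \leq \epsilon_r \cdot r_{\max}$$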
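
For tabular MDPs, the smallest constants satisfying the two conditions above can be read off directly from the transition and reward arrays. The following is a minimal sketch under assumed conventions: transitions are stored as an array of shape (states, actions, next states), rewards as an array of shape (states, actions), and the helper name `equivalence_gaps` as well as the example numbers are hypothetical.

```python
import numpy as np


def equivalence_gaps(T1, T2, R1, R2, r_max):
    """Return the smallest (eps_p, eps_r) for which the two conditions hold.

    T1, T2: arrays of shape (S, A, S) with T[s, a, :] a distribution over next states.
    R1, R2: arrays of shape (S, A).
    """
    # Largest L1 distance between next-state distributions over all (s, a).
    eps_p = np.abs(T1 - T2).sum(axis=-1).max()
    # Largest absolute reward difference over all (s, a), in units of r_max.
    eps_r = np.abs(R1 - R2).max() / r_max
    return float(eps_p), float(eps_r)


# Illustrative two-state, two-action MDPs (made-up numbers).
T1 = np.array([[[0.9, 0.1], [0.2, 0.8]],
               [[0.5, 0.5], [0.0, 1.0]]])
T2 = np.array([[[0.8, 0.2], [0.2, 0.8]],
               [[0.5, 0.5], [0.1, 0.9]]])
R1 = np.array([[1.0, 0.0], [0.0, 0.5]])
R2 = np.array([[0.9, 0.0], [0.0, 0.5]])

eps_p, eps_r = equivalence_gaps(T1, T2, R1, R2, r_max=1.0)
print(eps_p, eps_r)
```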

Next, we state a useful technical lemma, which is adapted from the results of Even-Dar and Mansour (2003).

###### Lemma 4.

Suppose we have two approximately-equivalent MDPs and . Let and denote optimal policies (not necessarily unique) for and respectively. Let denote the vector of value function per state for policy in MDP ; similarly, denotes the vector of value function per state for policy in MDP . We can bound these two vectors of value functions as follows:

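$$\big\| V^{\pi_1}_{M_1} - V^{\pi_1}_{M_2} \big\|_\infty \leq \frac{\epsilon_r \cdot r_{\max}}{1 - \gamma} + \frac{\gamma \cdot \epsilon_p \cdot r_{\max}}{(1 - \gamma)^2} \tag{9}$$
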
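
As a numerical sanity check of the bound in Equation (9), one can evaluate a fixed policy in two nearby tabular MDPs and compare the resulting value gap with the right-hand side. The sketch below makes several assumptions that are not taken from the text: rewards lie in $[0, r_{\max}]$, the MDPs are randomly generated, an arbitrary deterministic policy is used in place of $\pi_1$, and policy evaluation is done by solving the linear Bellman system rather than by iteration. All function names are hypothetical.

```python
import numpy as np


def evaluate_policy(T, R, policy, gamma):
    """Exact value of a deterministic policy: solve (I - gamma * P_pi) V = r_pi."""
    n_states = T.shape[0]
    idx = np.arange(n_states)
    P_pi = T[idx, policy]  # (S, S) transition matrix induced by the policy
    r_pi = R[idx, policy]  # (S,) reward vector induced by the policy
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)


def lemma_bound(eps_p, eps_r, r_max, gamma):
    """Right-hand side of Equation (9)."""
    return eps_r * r_max / (1 - gamma) + gamma * eps_p * r_max / (1 - gamma) ** 2


def random_mdp(rng, n_states, n_actions, r_max):
    """Random tabular MDP with rewards in [0, r_max] (assumed reward range)."""
    T = rng.random((n_states, n_actions, n_states))
    T /= T.sum(axis=-1, keepdims=True)
    R = rng.random((n_states, n_actions)) * r_max
    return T, R


rng = np.random.default_rng(0)
gamma, r_max = 0.9, 1.0
T1, R1 = random_mdp(rng, 4, 2, r_max)
T_noise, R_noise = random_mdp(rng, 4, 2, r_max)
# A convex mixture keeps T2 a valid transition kernel and R2 within [0, r_max].
T2 = 0.95 * T1 + 0.05 * T_noise
R2 = 0.95 * R1 + 0.05 * R_noise

policy = np.array([0, 1, 0, 1])  # arbitrary fixed policy for illustration
eps_p = np.abs(T1 - T2).sum(axis=-1).max()
eps_r = np.abs(R1 - R2).max() / r_max

gap = np.abs(evaluate_policy(T1, R1, policy, gamma)
             - evaluate_policy(T2, R2, policy, gamma)).max()
bound = lemma_bound(eps_p, eps_r, r_max, gamma)
print(gap, bound)
```

In this sketch the measured gap stays below the bound; the bound is typically loose because it charges the worst-case transition and reward mismatch at every step.
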
###### Proof.

For ease of presentation of the key ideas, we will write this proof considering reward functions that only depend on the current state and not on actions taken; the proof can be easily extended to generic reward functions.

The proof idea is based on looking at intermediate outputs of a policy-iteration algorithm Sutton and Barto (1998) when evaluating in and . Let us consider an iteration of policy-iteration algorithm and let us use to denote the vector of value functions when evaluating in . Similarly, we use to denote the vector of value functions when evaluating in at iteration of policy-iteration algorithm. Note that as , the vectors and converge to and