Interactive Teaching Algorithms for Inverse Reinforcement Learning


Parameswaran Kamalaruban¹*, Rati Devidze²*, Volkan Cevher¹, Adish Singla²
¹LIONS, EPFL    ²Max Planck Institute for Software Systems (MPI-SWS)
{kamalaruban.parameswaran,volkan.cevher}@epfl.ch, {rdevidze,adishs}@mpi-sws.org
*Authors contributed equally to this work.
Abstract

We study the problem of inverse reinforcement learning (IRL) with the added twist that the learner is assisted by a helpful teacher. More formally, we tackle the following algorithmic question: How could a teacher provide an informative sequence of demonstrations to an IRL learner to speed up the learning process? We present an interactive teaching framework where a teacher adaptively chooses the next demonstration based on the learner's current policy. In particular, we design teaching algorithms for two concrete settings: an omniscient setting where the teacher has full knowledge of the learner's dynamics and a blackbox setting where the teacher has minimal knowledge. We then study a sequential variant of the popular MCE-IRL learner and prove convergence guarantees of our teaching algorithm in the omniscient setting. Extensive experiments with a car driving simulator environment show that the learning progress can be sped up drastically as compared to an uninformative teacher.

1 Introduction

Imitation Learning, also known as Learning from Demonstrations, enables a learner to acquire new skills by observing a teacher's behavior. It plays an important role in many real-life learning settings, including human-to-human interaction [Buchsbaum et al., 2011; Shafto et al., 2014] and human-to-robot interaction [Schaal, 1997; Billard et al., 2008; Argall et al., 2009; Chernova and Thomaz, 2014].

Inverse reinforcement learning (IRL) is one of the popular approaches to imitation learning: IRL algorithms operate by first inferring an intermediate reward function that explains the demonstrated behavior, and then obtaining a policy corresponding to the inferred reward [Russell, 1998; Abbeel and Ng, 2004]. IRL has been extensively studied in the context of designing efficient learning algorithms for a given set of demonstrations [Abbeel and Ng, 2004; Ratliff et al., 2006; Ziebart et al., 2008; Boularias et al., 2011; Wulfmeier et al., 2015; Finn et al., 2016]. There has also been some recent work on designing active/interactive IRL algorithms that focus on reducing the number of demonstrations that need to be requested from a teacher [Amin et al., 2017; Sadigh et al., 2017]. Despite these advances, the problem of generating an optimal sequence of demonstrations to teach an IRL agent is still not well understood.

Motivated by applications of intelligent tutoring systems for teaching sequential decision-making tasks, such as surgical training (e.g., https://www.virtamed.com/en/) or car driving (e.g., https://www.driverinteractive.com/), we study the IRL framework from the viewpoint of a "teacher" in order to best assist an IRL "learner". Cakmak and Lopes [2012] and Brown and Niekum [2019] have studied the problem of teaching an IRL learner in a batch setting, i.e., the teacher has to provide a near-optimal set of demonstrations at once. Their teaching algorithms are non-interactive, i.e., they construct their teaching sequences without incorporating any feedback from the learner, and hence are unable to adapt the teaching to the learner's progress.

In real-life pedagogical settings, it is evident that a teacher can leverage the learner's feedback to adaptively choose the next demonstrations/tasks and accelerate the learning progress. For instance, consider a scenario where a driving instructor wants to teach a student certain driving skills. The instructor can easily identify the mistakes/weaknesses of the student (e.g., being unable to do rear parking), and then carefully choose the tasks that this student should perform, along with specific demonstrations to rectify any mistakes. In this paper, we study the problem of designing an interactive teaching algorithm for an IRL learner.

1.1 Overview of Our Approach

We consider an interactive teaching framework where at any given time: (i) the teacher observes the learner’s current policy, (ii) then, the teacher provides the next teaching task/demonstration, and (iii) the learner performs an update. We design interactive teaching algorithms for two settings:

  • an "omniscient" teaching setting where the teacher has full knowledge of the learner's dynamics and can fully observe the learner's current policy.

  • a "blackbox" teaching setting where the teacher does not know the learner's dynamics and has only a noisy estimate of the learner's current policy.

In the omniscient teaching setting, we study a sequential variant of a popular IRL algorithm, namely the Maximum Causal Entropy (MCE) IRL algorithm [Ziebart et al., 2008; Rhinehart and Kitani, 2017]. The main idea behind our omniscient teaching algorithm, OmniTeacher, is to first reduce the problem of teaching a target policy to that of teaching a corresponding hyperparameter (see Section 3.2); the teacher then greedily steers the learner towards this hyperparameter. We prove convergence guarantees for the OmniTeacher algorithm and show that it can significantly reduce the number of demonstrations required to achieve a desired performance of the learner (see Theorem 1, Section 4).

While the omniscient teacher yields strong theoretical guarantees, its applicability is limited given that the learner's dynamics are unknown and difficult to infer in practical applications. Based on insights from the omniscient teacher, we develop a simple greedy teaching algorithm, BboxTeacher, for the more practical blackbox setting (see Section 5).

We perform extensive experiments in a synthetic learning environment (with both linear and non-linear reward settings) inspired by a car driving simulator [Ng and Russell, 2000; Levine et al., 2010]. We demonstrate that our teaching algorithms can bring significant improvements in speeding up the learning progress compared to an uninformative teaching strategy that picks demonstrations at random. Furthermore, the performance of the BboxTeacher algorithm is close to that of the OmniTeacher algorithm even though it operates with limited information about the learner.

2 Problem Setup

We now formalize the problem addressed in this paper.

2.1 Environment

The environment is formally represented by an MDP $M := (\mathcal{S}, \mathcal{A}, T, \gamma, P_0, R^*)$. The sets of possible states and actions are denoted by $\mathcal{S}$ and $\mathcal{A}$ respectively. $T : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ captures the state transition dynamics, i.e., $T(s' \mid s, a)$ denotes the probability of landing in state $s'$ by taking action $a$ from state $s$. Here $\gamma \in [0, 1)$ is the discounting factor, and $P_0$ is an initial distribution over the states $\mathcal{S}$. We denote a policy $\pi : \mathcal{S} \to \Delta(\mathcal{A})$ as a mapping from a state to a probability distribution over actions. The underlying reward function is given by $R^* : \mathcal{S} \to \mathbb{R}$.
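A tabular MDP of this form can be represented concretely; the following is a minimal sketch (the class and field names are our own, not from the paper), encoding the transition dynamics as a tensor `T[s, a, s']`.

```python
import numpy as np

# Minimal container for an MDP M = (S, A, T, gamma, P0, R*); field names
# are illustrative, not from the paper.
class MDP:
    def __init__(self, n_states, n_actions, T, gamma, P0, reward):
        assert T.shape == (n_states, n_actions, n_states)
        assert np.allclose(T.sum(axis=2), 1.0)   # each T[s, a, :] is a distribution
        self.n_states, self.n_actions = n_states, n_actions
        self.T, self.gamma, self.P0, self.reward = T, gamma, P0, reward

# Toy 2-state MDP: action a deterministically moves the agent to state a.
T = np.zeros((2, 2, 2))
T[:, 0, 0] = 1.0
T[:, 1, 1] = 1.0
mdp = MDP(2, 2, T, gamma=0.9, P0=np.array([1.0, 0.0]),
          reward=np.array([0.0, 1.0]))
```

The tensor layout makes policy evaluation a matter of simple `numpy` contractions, which the later sketches rely on.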

2.2 Interaction between Learner and Teacher

In our setting, we have two entities: a teacher and a sequential IRL learner. The teacher has access to the full MDP $M$ and has a target policy $\pi^T$ computed as an optimal policy w.r.t. $R^*$. The learner knows the MDP but not the reward function $R^*$, i.e., has only access to $M \setminus R^*$. The teacher's goal is to provide an informative sequence of demonstrations to teach the policy $\pi^T$ to the learner. Here, a teacher's demonstration $\xi = \{(s_\tau, a_\tau)\}_\tau$ is obtained by first choosing an initial state $s_0$ (where $P_0(s_0) > 0$) and then choosing a trajectory of state-action pairs obtained by executing the policy $\pi^T$ in the MDP $M$.

We consider an interactive teaching framework with three key steps formally described in Algorithm 1. At any time $t$, the teacher observes an estimate of the learner's current policy $\pi^L_t$ and accordingly provides an informative demonstration $\xi_t$ to assist the learner. Then, the learner updates its policy to $\pi^L_{t+1}$ based on the demonstration $\xi_t$ and its internal learning dynamics.

Algorithm 1 Interactive Teaching Framework
1: for $t = 1, 2, \ldots$ do
2:     teacher observes an estimate of the learner's policy $\pi^L_t$
3:     teacher provides a demonstration $\xi_t$ to the learner
4:     learner updates the policy to $\pi^L_{t+1}$ using $\xi_t$

2.3 Occupancy Measure and Expected Reward

We introduce the following two notions to formally define the teaching objective. For any policy $\pi$, the occupancy measure $\rho^\pi$ and the total expected reward $R(\pi)$ of $\pi$ in the MDP $M$ are defined as follows respectively:

$$\rho^\pi(s) = \sum_{\tau=0}^{\infty} \gamma^\tau\, \mathbb{P}\left[s_\tau = s \mid \pi, M\right] \qquad (1)$$
$$R(\pi) = \sum_{s} \rho^\pi(s)\, R^*(s) \qquad (2)$$

Here, $\mathbb{P}[s_\tau = s \mid \pi, M]$ denotes the probability of visiting the state $s$ after $\tau$ steps by following the policy $\pi$. Similarly, for any demonstration $\xi = \{(s_\tau, a_\tau)\}_\tau$, we define $\rho^\xi(s) = \sum_\tau \gamma^\tau\, \mathbb{1}\{s_\tau = s\}$.

Then for a collection of demonstrations $\{\xi_1, \ldots, \xi_n\}$, we have the average $\rho(s) = \frac{1}{n} \sum_{i=1}^{n} \rho^{\xi_i}(s)$.
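As a sketch, the occupancy measure and total expected reward defined above can be computed for a tabular MDP by propagating the start distribution under the policy-induced state kernel (the truncation horizon `n_iter` and all names are illustrative, not the paper's code):

```python
import numpy as np

# Sketch: rho^pi(s) = sum_t gamma^t P(s_t = s | pi), approximated by a
# truncated sum, and the total expected reward sum_s rho^pi(s) R(s).
def occupancy_measure(T, policy, P0, gamma, n_iter=2000):
    # P_pi[s, s'] = sum_a pi(a|s) T(s'|s, a): state kernel under the policy
    P_pi = np.einsum('sa,sap->sp', policy, T)
    rho, d = np.zeros_like(P0, float), P0.astype(float)
    for t in range(n_iter):
        rho += (gamma ** t) * d   # accumulate discounted state distribution
        d = d @ P_pi              # advance one step under the policy
    return rho

def expected_reward(T, policy, P0, gamma, reward):
    return occupancy_measure(T, policy, P0, gamma) @ reward

# Toy 2-state MDP where action a moves deterministically to state a;
# the "always take action 1" policy then spends all time t >= 1 in state 1.
T = np.zeros((2, 2, 2)); T[:, 0, 0] = 1.0; T[:, 1, 1] = 1.0
always_right = np.array([[0.0, 1.0], [0.0, 1.0]])
r = expected_reward(T, always_right, np.array([1.0, 0.0]), 0.9,
                    np.array([0.0, 1.0]))
```

With this start distribution, the occupancy of state 1 is $\sum_{t \ge 1} 0.9^t = 9$, so `r` evaluates to 9.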

2.4 Teaching Objective

Let $\pi^L$ denote the learner's final policy at the end of teaching. The performance of the policy $\pi^L$ (w.r.t. $\pi^T$) in $M$ can be evaluated via the following measures (for some fixed $\epsilon > 0$):

  1. $\left|R(\pi^T) - R(\pi^L)\right| \le \epsilon$, ensuring high reward [Abbeel and Ng, 2004; Ziebart, 2010].

  2. $\mathrm{TV}\left(\rho^{\pi^T}, \rho^{\pi^L}\right) \le \epsilon$, ensuring that the learner's behavior induced by the policy matches that of the teacher [Ho and Ermon, 2016]. Here $\mathrm{TV}(\mu, \nu)$ is the total variation distance between two probability measures $\mu$ and $\nu$.

The IRL learner's goal is to $\epsilon$-approximate the teacher's policy w.r.t. one of these performance measures [Ziebart, 2010; Ho and Ermon, 2016]. In this paper, we study this problem from the viewpoint of a teacher whose goal is to provide a near-optimal sequence of demonstrations to the learner to achieve the desired objective. The teacher's performance is then measured by the number of demonstrations required to achieve the above objective.
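The two performance measures above are simple to evaluate once the occupancy measures of both policies are in hand; a minimal sketch (function names are our own):

```python
import numpy as np

# Measure 1: gap in total expected reward between teacher and learner.
def reward_gap(rho_teacher, rho_learner, reward):
    return abs(rho_teacher @ reward - rho_learner @ reward)

# Measure 2: total variation distance between the occupancy measures,
# normalized here so that both are probability measures.
def tv_distance(rho_teacher, rho_learner):
    p = rho_teacher / rho_teacher.sum()
    q = rho_learner / rho_learner.sum()
    return 0.5 * np.abs(p - q).sum()
```

For example, `tv_distance(np.array([1.0, 0.0]), np.array([0.5, 0.5]))` gives 0.5.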

3 Omniscient Teaching Setting

In this section, we consider the omniscient teaching setting where the teacher knows the learner’s dynamics including the learner’s current parameters at any given time. We begin by introducing a specific learner model that we study for this setting.

3.1 Learner Model

We consider a learner implementing an IRL algorithm based on the Maximum Causal Entropy approach (MCE-IRL) [Ziebart et al., 2008; Ziebart, 2010; Wulfmeier et al., 2015; Rhinehart and Kitani, 2017; Zhou et al., 2018]. First, we discuss the parametric reward function of the learner and then introduce a sequential variant of the MCE-IRL algorithm used in our interactive teaching framework.

Algorithm 2 Sequential MCE-IRL
1: Initialization: parameter $\theta_1$, policy $\pi_{\theta_1}$
2: for $t = 1, 2, \ldots, T$ do
3:     receives a demonstration $\xi_t$ with starting state $s_0^t$
4:     update $\theta_{t+1} \leftarrow \Pi_\Theta\left(\theta_t - \eta_t g_t\right)$
5:     compute $\pi_{\theta_{t+1}}$ via Soft-Value-Iteration
6: Output: policy $\pi^L = \pi_{\theta_{T+1}}$

Parametric reward function.

We consider the learner model with a parametric reward function $R_\theta : \mathcal{S} \to \mathbb{R}$, where $\theta \in \Theta \subseteq \mathbb{R}^d$ is a parameter. The reward function also depends on the learner's feature representation $\phi : \mathcal{S} \to \mathbb{R}^m$. For a linear reward function $R_\theta(s) = \langle \theta, \phi(s) \rangle$, $\theta$ represents the weights. As an example of non-linear rewards, the function $R_\theta$ could be a high-order polynomial in the features $\phi(s)$ (see Section 6.2). As a more powerful non-linear reward model, $\theta$ could be the weights of a neural network with $\phi(s)$ as input layer and $R_\theta(s)$ as output [Wulfmeier et al., 2015].

Soft Bellman policies.

Given a fixed parameter $\theta$, we model the learner's behavior via the following soft Bellman policy:

$$\pi_\theta(a \mid s) = \exp\left(Q_\theta(s, a) - V_\theta(s)\right), \qquad (3)$$

where $Q_\theta(s, a) = R_\theta(s) + \gamma \sum_{s'} T(s' \mid s, a)\, V_\theta(s')$ and $V_\theta(s) = \log \sum_a \exp\left(Q_\theta(s, a)\right)$. (For the case of linear reward functions, soft Bellman policies are obtained as the solution to the MCE-IRL optimization problem [Ziebart, 2010; Zhou et al., 2018]. Here, we extend this idea to model the learner's policy with the general parametric reward function. However, we note that for our learner model, it is not always possible to formulate a corresponding optimization problem that leads to the policy form mentioned above.)

For any given $\theta$, the corresponding policy $\pi_\theta$ can be efficiently computed via the Soft-Value-Iteration procedure (see [Ziebart, 2010, Algorithm 9.1; Zhou et al., 2018]).
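Soft value iteration for a tabular MDP can be sketched as follows (a minimal illustrative implementation with a log-sum-exp "soft max" over actions; not the paper's code):

```python
import numpy as np

# Sketch: iterate Q(s,a) = R(s) + gamma * sum_s' T(s'|s,a) V(s') and
# V(s) = log sum_a exp Q(s,a); the soft Bellman policy is exp(Q - V).
def soft_value_iteration(T, reward, gamma, n_iter=500):
    n_states = T.shape[0]
    V = np.zeros(n_states)
    for _ in range(n_iter):
        Q = reward[:, None] + gamma * (T @ V)          # shape (S, A)
        m = Q.max(axis=1, keepdims=True)               # for numerical stability
        V = (m + np.log(np.exp(Q - m).sum(axis=1, keepdims=True))).ravel()
    policy = np.exp(Q - V[:, None])                    # rows sum to 1
    return policy, V

# Toy 2-state MDP (action a moves to state a); state 1 carries the reward,
# so the soft Bellman policy should favor action 1 in every state.
T = np.zeros((2, 2, 2)); T[:, 0, 0] = 1.0; T[:, 1, 1] = 1.0
policy, V = soft_value_iteration(T, reward=np.array([0.0, 1.0]), gamma=0.9)
```

Since the soft Bellman operator is a $\gamma$-contraction, the iteration converges for $\gamma < 1$.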

Sequential MCE-IRL and gradient update.

We consider a sequential MCE-IRL learner for our interactive setting where, at time step $t$, the learner receives the next demonstration $\xi_t$ with starting state $s_0^t$. Given the learner's current parameter $\theta_t$ and policy $\pi_{\theta_t}$, the learner updates its parameter via an online gradient descent update rule given by $\theta_{t+1} = \theta_t - \eta_t g_t$ with $\eta_t$ as the learning rate. The gradient at time $t$ is computed as

$$g_t = \sum_s \left( \rho^{\pi_{\theta_t}}_{s_0^t}(s) - \rho^{\xi_t}(s) \right) \nabla_\theta R_{\theta_t}(s),$$

where $\rho^{\pi_{\theta_t}}_{s_0^t}$ is given by Eq. (1) and computed with $s_0^t$ as the only initial state, i.e., $P_0(s_0^t) = 1$. As shown in Appendix A, $g_t$ can be seen as an empirical counterpart of the gradient of the following loss function:

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{\xi \sim \pi^T}\left[\sum_\tau \gamma^\tau \log \pi_\theta(a_\tau \mid s_\tau)\right],$$

capturing the discounted negative log-likelihood of the teacher's demonstrations.

For brevity, we define the following quantities:

$$\mu^{\pi_{\theta_t}}_{s_0^t} = \sum_s \rho^{\pi_{\theta_t}}_{s_0^t}(s)\, \nabla_\theta R_{\theta_t}(s) \qquad (4)$$
$$\mu^{\xi_t} = \sum_s \rho^{\xi_t}(s)\, \nabla_\theta R_{\theta_t}(s) \qquad (5)$$

and we write the gradient compactly as $g_t = \mu^{\pi_{\theta_t}}_{s_0^t} - \mu^{\xi_t}$. Algorithm 2 presents the proposed sequential MCE-IRL learner. In particular, we use the online projected gradient descent update given by $\theta_{t+1} = \Pi_\Theta\left(\theta_t - \eta_t g_t\right)$, where $\Theta = \{\theta : \|\theta\|_2 \le D\}$ for large enough $D$ (cf. Section 4.3).
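For the linear reward case $R_\theta(s) = \langle \theta, \phi(s) \rangle$, one projected gradient step of the sequential learner can be sketched as follows. Here `mu_learner` (the learner's discounted feature expectation from the demonstration's start state) is taken as pre-computed, and all names are our own simplifications:

```python
import numpy as np

# Discounted empirical feature count of a demonstration (state indices).
def feature_count(demo_states, phi, gamma):
    return sum((gamma ** t) * phi[s] for t, s in enumerate(demo_states))

# One sequential MCE-IRL step for a linear reward: the gradient is the gap
# between the learner's feature expectation and the demonstration's feature
# count, followed by projection onto the ball ||theta|| <= radius.
def mce_irl_step(theta, mu_learner, demo_states, phi, gamma, eta, radius):
    g = mu_learner - feature_count(demo_states, phi, gamma)
    theta = theta - eta * g
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

# Tiny example: one-hot state features, a one-step demo visiting state 1.
phi = np.array([[1.0, 0.0], [0.0, 1.0]])
theta_next = mce_irl_step(np.zeros(2), np.array([1.0, 0.0]), [1],
                          phi, gamma=0.9, eta=0.5, radius=10.0)
```

In the example, the gradient is `[1, -1]`, so the step moves `theta` to `[-0.5, 0.5]`, pushing reward mass toward the demonstrated state.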

3.2 Omniscient Teacher

Next, we present our omniscient teaching algorithm, OmniTeacher, for sequential MCE-IRL learner.

Algorithm 3 OmniTeacher for sequential MCE-IRL
1: for $t = 1, 2, \ldots$ do
2:     teacher picks $(s_0^t, \xi_t)$ by solving Eq. (6)
3:     learner receives $\xi_t$ and updates $\theta_{t+1}$ using Algorithm 2

Policy to hyperparameter teaching.

The main idea in designing our algorithm, OmniTeacher, is to first reduce the problem of teaching a target policy $\pi^T$ to that of teaching a corresponding hyperparameter, denoted below as $\theta^*$. Then, under certain technical conditions, teaching this $\theta^*$ to the learner ensures that the learner's policy $\epsilon$-approximates the target policy $\pi^T$. We defer the technical details to Section 4.

Omniscient teaching algorithm.

Now, we design a teaching strategy to greedily steer the learner's parameter towards the target hyperparameter $\theta^*$. The main idea is to pick a demonstration which minimizes the distance between the learner's updated parameter $\theta_{t+1}$ and $\theta^*$, i.e., to minimize $\|\theta_{t+1} - \theta^*\|$. We note that Liu et al. [2017; 2018] have used this idea in their iterative machine teaching framework to teach regression and classification tasks. In particular, we consider the following optimization problem for selecting an informative demonstration at time $t$:

$$\min_{s_0^t,\, \xi_t}\ \left\| \theta_t - \eta_t \left( \mu^{\pi_{\theta_t}}_{s_0^t} - \mu^{\xi_t} \right) - \theta^* \right\| \qquad (6)$$

where $s_0^t$ is an initial state (where $P_0(s_0^t) > 0$) and $\xi_t$ is a trajectory obtained by executing the policy $\pi^T$ starting from $s_0^t$. (Note that $\xi_t$ is not constructed synthetically by the teacher, but is obtained by executing the policy $\pi^T$ starting from $s_0^t$. However, $\xi_t$ is not just a random rollout of the policy $\pi^T$: when there are multiple possible trajectories from $s_0^t$ using $\pi^T$, the teacher can choose the most desirable one as per the joint optimization problem in Eq. (6).) The resulting teaching strategy is given in Algorithm 3.
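Once the gradient induced by each candidate demonstration has been pre-computed, the greedy selection above reduces to a small search over candidates; a sketch (all names are illustrative):

```python
import numpy as np

# Sketch of the greedy selection: among candidate demonstrations (rollouts
# of the teacher's policy from each initial state), pick the one whose
# induced gradient step moves theta closest to the target theta_star.
def pick_demonstration(theta, theta_star, candidate_gradients, eta):
    best, best_dist = None, np.inf
    for idx, g in enumerate(candidate_gradients):
        dist = np.linalg.norm(theta - eta * g - theta_star)
        if dist < best_dist:
            best, best_dist = idx, dist
    return best

# Example: candidate 0's gradient steps theta exactly onto the target.
g_candidates = [np.array([-1.0, 0.0]), np.array([0.0, -1.0])]
best = pick_demonstration(np.zeros(2), np.array([1.0, 0.0]),
                          g_candidates, eta=1.0)
```

In the example, candidate 0 gives $\theta - \eta g = (1, 0) = \theta^*$, so it is selected.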

4 Omniscient Teaching Setting: Analysis

In this section, we analyze the teaching complexity of our algorithm OmniTeacher; the proofs are provided in Appendix B.

4.1 Convergence to Hyperparameter

In this section, we analyze the teaching complexity, i.e., the total number of time steps required to steer the learner towards the target hyperparameter. For this analysis, we quantify the "richness" of demonstrations (i.e., solutions to Eq. (6)) in terms of providing desired gradients to the learner.

For the state $s_0^t$ picked by the teacher at time step $t$, let us first consider the objective function $\|\theta_t - \eta_t g - \theta^*\|$ over gradients $g$. Note that the optimal solution takes the closed form $g_t^* = \frac{\theta_t - \theta^*}{\eta_t}$.

Ideally, the teacher would like to provide a demonstration $\xi_t$ (starting from $s_0^t$) for which $g_t = g_t^*$. In this case, the learner's hyperparameter converges to $\theta^*$ after this time step. However, in our setting, the teacher is restricted to only provide demonstrations obtained by executing the teacher's policy $\pi^T$ in the MDP $M$. We say that a teacher is $(\lambda, c)$-rich if, for every $t$, the teacher can provide a demonstration $\xi_t$ that satisfies the following:

$$g_t = \lambda_t\, g_t^* + \epsilon_t$$

for some $\lambda_t, \epsilon_t$ s.t. $\lambda_t \ge \lambda > 0$ and $\|\epsilon_t\| \le c$. The main intuition behind these two quantities is the following: (i) $\lambda_t$ bounds the magnitude of the gradient in the desired direction of $g_t^*$, and (ii) $\epsilon_t$ accounts for the deviation of the gradient from the desired direction of $g_t^*$.

The following lemma provides a bound on the teaching complexity of OmniTeacher to steer the learner’s parameter to the target .

Lemma 1.

Given $\epsilon' > 0$, let the OmniTeacher be $(\lambda, c)$-rich with $c = O(\epsilon')$. Then, running the OmniTeacher algorithm with suitable learning rates $\eta_t$ for $T = O\left(\log \frac{1}{\epsilon'}\right)$ time steps, we have $\|\theta_{T+1} - \theta^*\| \le \epsilon'$.

Note that the above lemma only guarantees convergence to the target hyperparameter $\theta^*$. To provide guarantees on our teaching objective, we need additional technical conditions, which we discuss below.

4.2 Convergence to Policy $\pi_{\theta^*}$

We require a smoothness condition on the learner's reward function to ensure that the learner's output policy is close to the policy $\pi_{\theta^*}$ in terms of the total expected reward (see Eq. (2)). Formally, $R_\theta$ is $L$-smooth if the following holds for all $\theta, \theta' \in \Theta$ and $s \in \mathcal{S}$:

$$\left| R_\theta(s) - R_{\theta'}(s) \right| \le L\, \|\theta - \theta'\| \qquad (7)$$

Note that in the linear reward case, the smoothness parameter is $L = \max_s \|\phi(s)\|$. Given that the reward function is smooth, the following lemma illustrates the inherent smoothness of the soft Bellman policies given in Eq. (3).

Lemma 2.

Consider an MDP $M$, and a learner model with a parametric reward function $R_\theta$ which is $L$-smooth. Then for any $\theta, \theta' \in \Theta$, the associated soft Bellman policies $\pi_\theta$ and $\pi_{\theta'}$ satisfy the following:

$$\left| R(\pi_\theta) - R(\pi_{\theta'}) \right| \le \kappa\, \|\theta - \theta'\|,$$

where $\kappa$ is a constant depending on $L$ and $\gamma$.

The above lemma suggests that convergence to the target hyperparameter also guarantees convergence to the policy associated with the target hyperparameter in terms of total expected reward.

4.3 Convergence to Policy $\pi^T$

Finally, we need to ensure that the learner's model is powerful enough to capture the teacher's reward $R^*$ and policy $\pi^T$. Intuitively, this would imply that there exists a target hyperparameter for which the soft Bellman policy has a total expected reward close to that of the teacher's policy w.r.t. the reward function $R^*$. Formally, we say that a learner is $\epsilon''$-learnable (where $\epsilon'' \ge 0$) if there exists a $\theta \in \Theta$ such that the following holds:

$$\left| R(\pi^T) - R(\pi_\theta) \right| \le \epsilon''.$$

In particular, for achieving the desired teaching objective (see Theorem 1), we require that the learner is $O(\epsilon)$-learnable. This in turn implies that there exists a $\theta^*$ such that $|R(\pi^T) - R(\pi_{\theta^*})| = O(\epsilon)$. (For the case of linear rewards, such a $\theta^*$ is guaranteed to exist and it can be computed efficiently; further details are provided in Appendix C.) Then, combining Lemma 1 and Lemma 2, we obtain our main result:

Theorem 1.

Given $\epsilon > 0$, let $R_\theta$ be $L$-smooth. In addition, suppose we have the following:

  • The teacher is $(\lambda, c)$-rich with $c = O(\epsilon)$.

  • The learner is $O(\epsilon)$-learnable.

  • The teacher has access to a target hyperparameter $\theta^*$ such that $\left| R(\pi^T) - R(\pi_{\theta^*}) \right| = O(\epsilon)$.

Then, for the OmniTeacher algorithm run for $T = O\left(\log \frac{1}{\epsilon}\right)$ time steps, we have $\left| R(\pi^T) - R(\pi^L) \right| \le \epsilon$.

The above result states that the number of demonstrations required to achieve the desired objective is only $O\left(\log \frac{1}{\epsilon}\right)$. In practice, this can lead to a drastic speed-up in teaching compared to an "agnostic" teacher (referred to as Agnostic) that provides demonstrations at random, i.e., by randomly choosing the initial state and then picking a random rollout of the policy $\pi^T$ in the MDP $M$. In fact, for teaching regression and classification tasks, Liu et al. [2017] showed that an omniscient teacher achieves an exponential improvement in teaching complexity as compared to an Agnostic teacher that picks the examples at random.

Algorithm 4 BboxTeacher for a sequential IRL learner
1: Initialization: probing parameters (frequency $k$, number of tests $m$)
2: for $t = 1, 2, \ldots$ do
3:     if $t \bmod k = 0$: teacher estimates $\hat{\rho}^{\pi^L_t}$ using $m$ tests
4:     teacher picks $\xi_t$ by solving Eq. (8)
5:     learner receives $\xi_t$ and updates its policy using its own algorithm

5 Blackbox Teaching Setting

In this section, we study a more practical setting where the teacher (i) cannot directly observe the learner’s policy at any time and (ii) does not know the learner’s dynamics.

Limited observability.

We first address the challenge of limited observability. The main idea is that in real-world applications, the teacher could approximately infer the learner's policy by probing the learner, for instance, by asking the learner to perform certain tasks or "tests". Here, a test consists of picking an initial state $s_0$ (where $P_0(s_0) > 0$) and then asking the learner to execute its current policy from $s_0$. Formally, we characterize this probing via two parameters $(k, m)$: after every interval of $k$ time steps of teaching, the teacher asks the learner to perform $m$ "tests" for every initial state.

Then, based on the observed demonstrations of the learner's policy, the teacher can approximately estimate the occupancy measure $\hat{\rho}^{\pi^L_t}$.
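A simple Monte Carlo sketch of this estimation step, assuming the teacher records the state sequences of the learner's test rollouts (names are illustrative):

```python
import numpy as np

# Estimate the learner's occupancy measure by averaging discounted
# state-visit counts over the observed test rollouts.
def estimate_occupancy(rollouts, n_states, gamma):
    rho_hat = np.zeros(n_states)
    for states in rollouts:               # one rollout = list of state indices
        for t, s in enumerate(states):
            rho_hat[s] += gamma ** t
    return rho_hat / max(len(rollouts), 1)

# Two identical test rollouts visiting state 0 then state 1.
rho_hat = estimate_occupancy([[0, 1], [0, 1]], n_states=2, gamma=0.5)
```

Each rollout contributes $\gamma^t$ to the state visited at step $t$, so the estimate here is `[1.0, 0.5]`.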

Unknown learner’s dynamics.

To additionally deal with the second challenge of unknown learner's dynamics, we propose a simple greedy strategy for picking an informative demonstration. In particular, we pick the demonstration at time $t$ by solving the following optimization problem (cf. the corresponding Eq. (6) for the omniscient teacher):

$$\max_{s_0^t,\, \xi_t}\ \sum_s \left( \rho^{\xi_t}(s) - \hat{\rho}^{\pi^L_t}_{s_0^t}(s) \right) R^*(s) \qquad (8)$$

where $s_0^t$ is an initial state (where $P_0(s_0^t) > 0$) and $\xi_t$ is a trajectory obtained by executing the policy $\pi^T$ starting from $s_0^t$. Note that we have used the estimate $\hat{\rho}^{\pi^L_t}$ instead of $\rho^{\pi^L_t}$. This strategy is inspired by insights from the omniscient teaching setting and can be seen as picking a demonstration with maximal discrepancy between the learner's current policy and the teacher's policy in terms of expected reward.

The resulting teaching strategy for the blackbox setting, namely BboxTeacher, is given in Algorithm 4.

6 Experimental Evaluation

In this section, we demonstrate the performance of our algorithms in a synthetic learning environment (with both linear and non-linear reward settings) inspired by a car driving simulator [\citeauthoryearNg and Russell2000, \citeauthoryearLevine et al.2010].

Environment setup.

Fig. 1 illustrates a car driving simulator environment consisting of different lane types (henceforth referred to as tasks), denoted as T0, T1, ..., T8. Each of these tasks is associated with different driving skills. For instance, task T0 corresponds to a basic setup representing a traffic-free highway: it has a very small probability of the presence of another car. In contrast, task T1 represents a crowded highway with a much higher probability of encountering a car. Task T2 has stones on the right lane, whereas task T3 has a mix of both cars and stones. Similarly, task T4 has grass on the right lane, and T5 has a mix of both grass and cars. Tasks T6, T7, and T8 introduce more complex features such as pedestrians, HOV, and police.

Figure 1: Car environment with different lane types (tasks). In any given lane, an agent starts from the bottom-left corner and the goal is to reach the top of the lane. Arrows represent the path taken by the teacher's policy.

Figure 2: $\phi(s)$ represents the features for a state $s$. The weight vector $w^*$ (stone: -1, grass: -0.5, car: -5, ped: -10, HOV: -1, police: 0, car-in-f: -2, ped-in-f: -5) is used to define the teacher's reward function $R^*_L$.

Figure 3: Linear setting of Section 6.1. (a) Convergence of the learner's parameter $\theta_t$ to the target $\theta^*$. (b) Difference of total expected reward of the learner's policy w.r.t. the teacher's policy on different tasks.

Figure 4: Non-linear setting of Section 6.2. (a) Results for a learner model with a linear reward function unable to represent the teacher's reward. (b) Results for a learner model with a non-linear reward function.

The agent's goal is to navigate from an initial state at the bottom-left to the top of the lane. The agent can take three different actions given by {left, straight, right}. Action left steers the agent to the left of the current lane. If the agent is already in the leftmost lane when taking action left, then the lane is chosen randomly with a fixed probability. We define similar dynamics for taking action right; action straight means no change in the lane. Irrespective of the action taken, the agent always moves forward. W.l.o.g., we consider that only the agent moves in the environment.
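The lane dynamics above can be sketched as follows (the function names and the slip probability `p_random` are assumptions on our part; the paper's exact value was not recovered here). Lanes are indexed 0 (leftmost) to `n_lanes - 1`, and the agent always advances one row:

```python
import random

# Illustrative sketch of the lane dynamics: moving off the edge of the road
# triggers a random lane with probability p_random, else the lane is kept.
def step(lane, row, action, n_lanes, p_random, rng=random):
    if action == 'left':
        lane = lane - 1 if lane > 0 else (
            rng.randrange(n_lanes) if rng.random() < p_random else lane)
    elif action == 'right':
        lane = lane + 1 if lane < n_lanes - 1 else (
            rng.randrange(n_lanes) if rng.random() < p_random else lane)
    # 'straight' keeps the lane; the agent always moves forward one row
    return lane, row + 1
```

With `p_random=0.0` the dynamics are deterministic, which makes the function easy to unit-test.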

6.1 Linear Reward Setting

First, we study a linear reward setting, and use the subscript $L$ to denote the corresponding MDP $M_L$, the teacher's reward $R^*_L$, and the teacher's policy $\pi^T_L$.

MDP.

We consider lanes corresponding to the first eight tasks in the environment, namely T0, T1, ..., T7. We have several lanes of each given task, each generated randomly according to the task characteristics described above. Each cell represents a state, so the total number of states in our MDP is the number of lanes times the number of cells per lane (see Fig. 1). There is one initial state for each lane, corresponding to the bottom-left cell of that lane.

Teacher’s reward and policy.

Next, we define the reward function $R^*_L$ (i.e., the teacher's reward function). We consider a state-dependent reward that depends on the underlying features of a state $s$ given by the feature vector $\phi(s)$ as follows:

  • features indicating the type of the current grid cell as stone, grass, car, ped, HOV, and police.

  • features providing some look-ahead information such as whether there is a car or pedestrian in the immediate front cell (denoted as car-in-f and ped-in-f).

Given this, we define the teacher's reward function of linear form as $R^*_L(s) = \langle w^*, \phi(s) \rangle$, where the values of the weight vector $w^*$ are given in Fig. 2. The teacher's policy $\pi^T_L$ is then computed as the optimal policy w.r.t. this reward function and is illustrated via the arrows in Fig. 1 (T0 to T7), representing the path taken by the teacher when driving in this environment.
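The linear teacher reward can be sketched directly from the weights in Fig. 2 (the feature ordering below is an assumption of ours):

```python
import numpy as np

# Feature names and weights from Fig. 2; ordering is our own choice.
FEATURES = ['stone', 'grass', 'car', 'ped', 'HOV', 'police',
            'car-in-f', 'ped-in-f']
W_STAR = np.array([-1.0, -0.5, -5.0, -10.0, -1.0, 0.0, -2.0, -5.0])

# Linear teacher reward R*_L(s) = <w*, phi(s)> for a binary feature vector.
def reward(phi):
    return float(W_STAR @ phi)

# Example cell: grass underfoot, with a car in the cell immediately in front.
phi = np.zeros(len(FEATURES))
phi[FEATURES.index('grass')] = 1.0
phi[FEATURES.index('car-in-f')] = 1.0
```

For this cell the reward is $-0.5 - 2 = -2.5$, so the teacher's policy prefers to steer away from it.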

Learner model.

We consider the learner model with a linear reward function that depends only on the state, i.e., $R_\theta(s) = \langle \theta, \phi(s) \rangle$, where $\phi$ is as given in Fig. 2. The learner implements the sequential MCE-IRL procedure in Algorithm 2, where the learner's prior knowledge is captured by the initial policy $\pi_{\theta_1}$ (corresponding to hyperparameter $\theta_1$).

In the experiments, we consider the following prior knowledge of the learner: $\pi_{\theta_1}$ is initially trained based on demonstrations of $\pi^T_L$ sampled only from the lanes associated with the tasks T0, T1, T2, and T3. Intuitively, the learner initially possesses the skills to avoid collisions with cars and to avoid hitting stones while driving. We expect to teach three major skills to the learner, i.e., avoiding grass while driving (tasks T4, T5, and T6), maintaining distance to pedestrians (task T6), and not driving on HOV (task T7).

(a) Learner's initial skills: T0
(b) Learner's initial skills: T0–T3
(c) Learner's initial skills: T0–T5
Figure 5: Teaching curriculum (i.e., the task associated with the state picked by BboxTeacher in Algorithm 4) for three different settings depending on the learner's initial skills trained on (a) T0, (b) T0–T3, and (c) T0–T5.

Experimental results.

We evaluate the performance of the different teaching algorithms, and report the results by averaging over repeated runs. We use an equal number of lanes for each task. OmniTeacher in Algorithm 3 computes the target hyperparameter $\theta^*$ as per the construction for linear rewards (Appendix C). For BboxTeacher in Algorithm 4, we fix the probing parameters $k$ and $m$.

Fig. 3(a) shows the convergence of $\|\theta_t - \theta^*\|$, where $\theta^*$ is the target computed by OmniTeacher. As expected, OmniTeacher outperforms the other teaching algorithms (BboxTeacher and the Agnostic teacher), which have no knowledge of the learner's dynamics and do not directly focus on teaching this hyperparameter $\theta^*$. Interestingly, the convergence of BboxTeacher is still significantly faster than that of the Agnostic teacher. In Fig. 3(b), we consider the teaching objective of the total expected reward difference. We separately plot this objective with starting states limited to task T4 and task T7. The results suggest that convergence to the target $\theta^*$ leads to a reduction in the expected reward difference (one of the teaching objectives). The performance of BboxTeacher (BB-T7) is even better than that of OmniTeacher (Om-T7) for task T7. This is because BboxTeacher (Eq. (8)) picks tasks on which the learner's total expected reward difference is highest.

Teaching curriculum.

In Fig. 5, we compare the teaching curriculum of BboxTeacher for three different settings, where the learner's initial policy $\pi_{\theta_1}$ is trained based on demonstrations of $\pi^T_L$ sampled only from the lanes associated with the tasks (a) T0, (b) T0–T3, and (c) T0–T5. The curriculum here refers to the task associated with the state picked by the teacher at time $t$ to provide the demonstration. In these plots, we can see specific structures and temporal patterns emerging in the curriculum. In particular, we can see that the teaching curriculum focuses on tasks that help the learner acquire new skills. For instance, in Fig. 5(b), the teacher primarily picks tasks that provide new skills corresponding to the features grass, HOV, and ped. Recall that, for BboxTeacher (Algorithm 4), the learner is probed only every $k$ time steps in our experiments. As a result, the curriculum plots show blocks of length $k$, as the algorithm ends up picking the same task until new tests are performed to get a fresh estimate of the learner's policy.

6.2 Non-linear Reward Setting

Here, we study a non-linear reward setting, and use the subscript $N$ for the corresponding MDP $M_N$, the teacher's reward $R^*_N$, and the teacher's policy $\pi^T_N$.

MDP.

We consider the MDP $M_N$, which consists of lanes corresponding to five tasks: T0, T1, T2, T4, and T8, with the state space constructed as described above.

Teacher’s reward and policy.

Here, we define the teacher's reward function $R^*_N$. The key difference between $R^*_N$ and $R^*_L$ is that the teacher in this setting prefers to drive on HOV (it yields a positive reward instead of a penalty). However, if police is present while driving on HOV, there is a large penalty. The teacher's optimal policy for T8 is given in Fig. 1. For the other tasks (apart from T7 and T8), the teacher's optimal policy is the same as $\pi^T_L$.

Learner model.

We consider two different learner models: (i) with a linear reward function $R_\theta(s) = \langle \theta, \phi(s) \rangle$, and (ii) with a non-linear reward function given by a higher-order polynomial in the features $\phi(s)$. Here, $\phi$ is as in Section 6.1. To capture prior knowledge in $\pi_{\theta_1}$ at time $t = 1$, we train the learner initially based on demonstrations of $\pi^T_N$ sampled only from the lanes associated with the tasks T0, T1, and T2.

Experimental results.

We use experimental settings similar to those in Section 6.1 (averaging over repeated runs, etc.). We separately report the results for teaching a learner with the linear reward model (see Fig. 4(a)) and the non-linear reward model (see Fig. 4(b)).

Fig. 4(a) shows that both BboxTeacher and the Agnostic teacher are unable to make progress in teaching task T8 to the learner. Interestingly, the overall performance measuring the total expected reward difference on the whole MDP for BboxTeacher (BB-all) is worse compared to Agnostic (Ag-all): this is an artifact of BboxTeacher's strategy in Eq. (8) getting stuck in always picking task T8 (even though the learner is not making any progress). Fig. 4(b) illustrates that the above-mentioned limitations vanish when teaching a learner using a non-linear reward function. Here, the rate of reduction in the total expected reward difference is significantly faster for BboxTeacher as compared to Agnostic, as was also observed in Fig. 3(b).

These results demonstrate that the learning progress can be sped up significantly by an adaptive teacher, even one with limited knowledge about the learner, as compared to an uninformative teacher. These results also highlight that the learner's feature representation and reward function must be powerful enough to learn the desired behavior.

7 Related Work

Imitation learning.

The two popular approaches to imitation learning are (i) behavioral cloning, which directly replicates the desired behavior [Bain and Sammut, 1999], and (ii) inverse reinforcement learning (IRL), which infers the reward function explaining the desired behavior [Russell, 1998]. We refer the reader to a recent survey by Osa et al. [2018] on imitation learning.

Amin et al. [2017] have studied interactive IRL algorithms that actively request the teacher to provide suitable demonstrations with the goal of reducing the number of interactions. However, the key difference in our approach is that we take the viewpoint of a teacher in how to best assist the learning agent by providing an optimal sequence of demonstrations. Our approach is inspired by real-life pedagogical settings where carefully choosing the teaching demonstrations and tasks can accelerate the learning progress [Ho et al., 2016]. Hadfield-Menell et al. [2016] have studied the value alignment problem in a game-theoretic setup, and provided an approximate scheme to generate instructive demonstrations for an IRL agent. In our work, we devise a systematic procedure (with convergence guarantees for the omniscient setting) to choose an optimal sequence of demonstrations, taking into account the learner's dynamics.

Steering and teaching in reinforcement learning.

A somewhat different but related problem setting is that of reward shaping and environment design where the goal is to modify/design the reward function to guide/steer the behavior of a learning agent [\citeauthoryearNg et al.1999, \citeauthoryearZhang et al.2009, \citeauthoryearSorg et al.2010]. Another related problem setting is considered in the advice-based interaction framework (e.g., [\citeauthoryearTorrey and Taylor2013, \citeauthoryearAmir et al.2016]), where the goal is to communicate advice to a suboptimal agent on how to act in the world.

Algorithmic teaching.

Another line of research relevant to our work is that of algorithmic teaching. Here, one studies the interaction between a teacher and a learner where the teacher’s objective is to find an optimal training sequence to steer the learner towards a desired goal [\citeauthoryearGoldman and Kearns1995, \citeauthoryearLiu et al.2017, \citeauthoryearZhu et al.2018]. Algorithmic teaching provides a rigorous formalism for a number of real-world applications such as personalized education and intelligent tutoring systems [\citeauthoryearPatil et al.2014, \citeauthoryearRafferty et al.2016, \citeauthoryearHunziker et al.2018], social robotics [\citeauthoryearCakmak and Thomaz2014], and human-in-the-loop systems [\citeauthoryearSingla et al.2014, \citeauthoryearSingla et al.2013]. Most of the work in machine teaching is in a batch setting where the teacher provides a batch of teaching examples at once without any adaptation. The question of how a teacher should adaptively select teaching examples for a learner has been addressed recently but only in the supervised learning setting [\citeauthoryearMelo et al.2018, \citeauthoryearLiu et al.2018, \citeauthoryearChen et al.2018, \citeauthoryearYeo et al.2019].

Teaching sequential tasks.

\citeauthorDBLP:conf/uai/WalshG12,cakmak2012algorithmic (\citeyearDBLP:conf/uai/WalshG12,cakmak2012algorithmic) have studied algorithmic teaching for sequential decision-making tasks. \citeauthorcakmak2012algorithmic (\citeyearcakmak2012algorithmic) studied the problem of teaching an IRL agent in the batch setting, i.e., the teacher has to provide a near-optimal set of demonstrations at once. They considered the IRL algorithm by \citeauthorng2000algorithms (\citeyearng2000algorithms), which can only infer an equivalence class of reward weight parameters for which the observed behavior is optimal. In recent work, \citeauthordanielbrown2018irl (\citeyeardanielbrown2018irl) have extended the work of \citeauthorcakmak2012algorithmic (\citeyearcakmak2012algorithmic) by showing that the teaching problem can be formulated as a set cover problem. However, their teaching strategy does not take into account how the learner progresses (i.e., it is non-interactive). In contrast, we study an interactive teaching setting for a sequential MCE-IRL algorithm [\citeauthoryearZiebart et al.2008, \citeauthoryearRhinehart and Kitani2017]. This interactive setting, in turn, allows the teacher to design a personalized and adaptive curriculum, which is important for efficient learning [\citeauthoryearTadepalli2008]. \citeauthorhaug_teaching_2018 (\citeyearhaug_teaching_2018) have studied the problem of teaching an IRL agent adaptively; however, they consider a very different setting where the teacher and the learner have a mismatch in their worldviews.

8 Conclusions

We studied the problem of designing interactive teaching algorithms that provide an informative sequence of demonstrations to a sequential IRL learner. In the omniscient teaching setting, we presented OmniTeacher, which achieves the teaching objective with a provably bounded number of demonstrations (cf. Theorem 1). Then, utilizing the insights from OmniTeacher, we proposed BboxTeacher for the more practical blackbox setting. We demonstrated the effectiveness of our algorithms via extensive experiments in a learning environment inspired by a car driving simulator.

As future work, we will investigate extensions of our ideas to more complex environments; we hope that, ultimately, such extensions will provide a basis for designing teaching strategies for intelligent tutoring systems (see Footnote 1 and Footnote 2). It would also be interesting to benchmark active imitation methods for the MCE-IRL learner against our omniscient teacher (see, e.g., [\citeauthoryearBrown and Niekum2019]). Our results are also relevant to understanding the robustness of the MCE-IRL learner against adversarial training-set poisoning attacks: the fast convergence established in Theorem 1 suggests that the MCE-IRL learner is brittle to such attacks, and designing a robust variant of MCE-IRL is an important direction for future work.

Acknowledgments. This work was supported in part by the Swiss National Science Foundation (SNSF) under grant number 407540_167319.

References

  • [\citeauthoryearAbbeel and Ng2004] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004.
  • [\citeauthoryearAmin et al.2017] Kareem Amin, Nan Jiang, and Satinder P. Singh. Repeated inverse reinforcement learning. In NIPS, pages 1813–1822, 2017.
  • [\citeauthoryearAmir et al.2016] Ofra Amir, Ece Kamar, Andrey Kolobov, and Barbara J. Grosz. Interactive teaching strategies for agent training. In IJCAI, pages 804–811, 2016.
  • [\citeauthoryearArgall et al.2009] Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and autonomous systems, 2009.
  • [\citeauthoryearBain and Sammut1999] Michael Bain and Claude Sammut. A framework for behavioural cloning. Machine Intelligence 15, 15:103, 1999.
  • [\citeauthoryearBillard et al.2008] Aude Billard, Sylvain Calinon, Ruediger Dillmann, and Stefan Schaal. Robot programming by demonstration. In Springer handbook of robotics. 2008.
  • [\citeauthoryearBoularias et al.2011] Abdeslam Boularias, Jens Kober, and Jan Peters. Relative entropy inverse reinforcement learning. In AISTATS, pages 182–189, 2011.
  • [\citeauthoryearBrown and Niekum2019] Daniel S. Brown and Scott Niekum. Machine teaching for inverse reinforcement learning: Algorithms and applications. In AAAI, 2019.
  • [\citeauthoryearBuchsbaum et al.2011] Daphna Buchsbaum, Alison Gopnik, Thomas L Griffiths, and Patrick Shafto. Children’s imitation of causal action sequences is influenced by statistical and pedagogical evidence. Cognition, 2011.
  • [\citeauthoryearCakmak and Lopes2012] Maya Cakmak and Manuel Lopes. Algorithmic and human teaching of sequential decision tasks. In AAAI, 2012.
  • [\citeauthoryearCakmak and Thomaz2014] Maya Cakmak and Andrea L Thomaz. Eliciting good teaching from humans for machine learners. Artificial Intelligence, 217:198–215, 2014.
  • [\citeauthoryearChen et al.2018] Yuxin Chen, Adish Singla, Oisin Mac Aodha, Pietro Perona, and Yisong Yue. Understanding the role of adaptivity in machine teaching: The case of version space learners. In NeurIPS, 2018.
  • [\citeauthoryearChernova and Thomaz2014] Sonia Chernova and Andrea L Thomaz. Robot learning from human teachers. Synthesis Lectures on Artificial Intelligence and Machine Learning, 2014.
  • [\citeauthoryearDorsa Sadigh et al.2017] Dorsa Sadigh, Anca D. Dragan, Shankar Sastry, and Sanjit A. Seshia. Active preference-based learning of reward functions. In RSS, 2017.
  • [\citeauthoryearFinn et al.2016] Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In ICML, pages 49–58, 2016.
  • [\citeauthoryearGoldman and Kearns1995] Sally A Goldman and Michael J Kearns. On the complexity of teaching. Journal of Computer and System Sciences, 50(1):20–31, 1995.
  • [\citeauthoryearHadfield-Menell et al.2016] Dylan Hadfield-Menell, Stuart J Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. In NIPS, 2016.
  • [\citeauthoryearHaug et al.2018] Luis Haug, Sebastian Tschiatschek, and Adish Singla. Teaching Inverse Reinforcement Learners via Features and Demonstrations. In NeurIPS, 2018.
  • [\citeauthoryearHo and Ermon2016] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In NIPS, 2016.
  • [\citeauthoryearHo et al.2016] Mark K Ho, Michael Littman, James MacGlashan, Fiery Cushman, and Joseph L Austerweil. Showing versus doing: Teaching by demonstration. In NIPS, 2016.
  • [\citeauthoryearHunziker et al.2018] Anette Hunziker, Yuxin Chen, Oisin Mac Aodha, Manuel Gomez-Rodriguez, Andreas Krause, Pietro Perona, Yisong Yue, and Adish Singla. Teaching multiple concepts to a forgetful learner. CoRR, abs/1805.08322, 2018.
  • [\citeauthoryearLevine et al.2010] Sergey Levine, Zoran Popovic, and Vladlen Koltun. Feature construction for inverse reinforcement learning. In NIPS, pages 1342–1350, 2010.
  • [\citeauthoryearLiu et al.2017] Weiyang Liu, Bo Dai, Ahmad Humayun, Charlene Tay, Chen Yu, Linda B. Smith, James M. Rehg, and Le Song. Iterative machine teaching. In ICML, 2017.
  • [\citeauthoryearLiu et al.2018] Weiyang Liu, Bo Dai, Xingguo Li, Zhen Liu, James Rehg, and Le Song. Towards black-box iterative machine teaching. In ICML, 2018.
  • [\citeauthoryearMelo et al.2018] Francisco S Melo, Carla Guerra, and Manuel Lopes. Interactive optimal teaching with unknown learners. In IJCAI, pages 2567–2573, 2018.
  • [\citeauthoryearNg and Russell2000] Andrew Y Ng and Stuart J Russell. Algorithms for inverse reinforcement learning. In ICML, 2000.
  • [\citeauthoryearNg et al.1999] Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, 1999.
  • [\citeauthoryearOsa et al.2018] Takayuki Osa, Joni Pajarinen, Gerhard Neumann, J Andrew Bagnell, Pieter Abbeel, Jan Peters, et al. An algorithmic perspective on imitation learning. Foundations and Trends® in Robotics, 7(1-2):1–179, 2018.
  • [\citeauthoryearPatil et al.2014] Kaustubh R Patil, Xiaojin Zhu, Łukasz Kopeć, and Bradley C Love. Optimal teaching for limited-capacity human learners. In NIPS, pages 2465–2473, 2014.
  • [\citeauthoryearRafferty et al.2016] Anna N Rafferty, Emma Brunskill, Thomas L Griffiths, and Patrick Shafto. Faster teaching via pomdp planning. Cognitive science, 2016.
  • [\citeauthoryearRatliff et al.2006] Nathan D Ratliff, J Andrew Bagnell, and Martin A Zinkevich. Maximum margin planning. In ICML, pages 729–736, 2006.
  • [\citeauthoryearRhinehart and Kitani2017] Nicholas Rhinehart and Kris M Kitani. First-person activity forecasting with online inverse reinforcement learning. In ICCV, pages 3696–3705, 2017.
  • [\citeauthoryearRussell1998] Stuart Russell. Learning agents for uncertain environments. In COLT, pages 101–103, 1998.
  • [\citeauthoryearSchaal1997] Stefan Schaal. Learning from demonstration. In NIPS, pages 1040–1046, 1997.
  • [\citeauthoryearShafto et al.2014] Patrick Shafto, Noah D Goodman, and Thomas L Griffiths. A rational account of pedagogical reasoning: Teaching by, and learning from, examples. Cognitive psychology, 71:55–89, 2014.
  • [\citeauthoryearSingla et al.2013] Adish Singla, Ilija Bogunovic, Gábor Bartók, Amin Karbasi, and Andreas Krause. On actively teaching the crowd to classify. In NIPS Workshop on Data Driven Education, 2013.
  • [\citeauthoryearSingla et al.2014] Adish Singla, Ilija Bogunovic, Gábor Bartók, Amin Karbasi, and Andreas Krause. Near-optimally teaching the crowd to classify. In ICML, 2014.
  • [\citeauthoryearSorg et al.2010] Jonathan Sorg, Satinder P. Singh, and Richard L. Lewis. Reward design via online gradient ascent. In NIPS, pages 2190–2198, 2010.
  • [\citeauthoryearSun et al.2018] Wen Sun, Geoffrey J Gordon, Byron Boots, and J Andrew Bagnell. Dual policy iteration. arXiv:1805.10755, 2018.
  • [\citeauthoryearTadepalli2008] Prasad Tadepalli. Learning to solve problems from exercises. Computational Intelligence, 2008.
  • [\citeauthoryearTorrey and Taylor2013] Lisa Torrey and Matthew Taylor. Teaching on a budget: Agents advising agents in reinforcement learning. In AAMAS, pages 1053–1060, 2013.
  • [\citeauthoryearWalsh and Goschin2012] Thomas J. Walsh and Sergiu Goschin. Dynamic teaching in sequential decision making environments. In UAI, 2012.
  • [\citeauthoryearWulfmeier et al.2015] Markus Wulfmeier, Peter Ondruska, and Ingmar Posner. Maximum entropy deep inverse reinforcement learning. CoRR, abs/1507.04888, 2015.
  • [\citeauthoryearYeo et al.2019] Teresa Yeo, Parameswaran Kamalaruban, Adish Singla, Arpit Merchant, Thibault Asselborn, Louis Faucon, Pierre Dillenbourg, and Volkan Cevher. Iterative classroom teaching. In AAAI, 2019.
  • [\citeauthoryearZhang et al.2009] Haoqi Zhang, David C Parkes, and Yiling Chen. Policy teaching through reward function learning. In EC, pages 295–304, 2009.
  • [\citeauthoryearZhou et al.2018] Zhengyuan Zhou, Michael Bloem, and Nicholas Bambos. Infinite time horizon maximum causal entropy inverse reinforcement learning. IEEE Trans. Automat. Contr., 63(9):2787–2802, 2018.
  • [\citeauthoryearZhu et al.2018] Xiaojin Zhu, Adish Singla, Sandra Zilles, and Anna N. Rafferty. An overview of machine teaching. CoRR, abs/1801.05927, 2018.
  • [\citeauthoryearZiebart et al.2008] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, 2008.
  • [\citeauthoryearZiebart2010] Brian D Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Carnegie Mellon University, 2010.

Appendix A Gradient for Sequential MCE-IRL

In this section, we show that the learner’s update direction $g$ can be seen as an empirical counterpart of the gradient of the following loss function:

$$\mathcal{L}(w) := -\mathbb{E}\left[\sum_{\tau=0}^{\infty} \gamma^{\tau} \log \pi_w(a_\tau \mid s_\tau) \,\middle|\, \pi^T, P_0\right],$$

capturing the discounted negative log-likelihood of the teacher’s demonstrations. In Proposition 1, we show that the gradient $\nabla_w \mathcal{L}(w)$ is given by a difference of discounted feature expectations (see Eq. (9)).

Given the teacher’s demonstration $\xi = \{(s_\tau, a_\tau)\}_{\tau \geq 0}$ with starting state $s_0$, we compute $g$ as follows:

  • In Eq. (9), consider the gradient component corresponding to the teacher’s policy, i.e., $\mathbb{E}\left[\sum_{\tau} \gamma^{\tau} \phi(s_\tau) \mid \pi^T, P_0\right]$. We replace it with its empirical counterpart computed using $\xi$. This results in the following component in $g$: $-\sum_{\tau} \gamma^{\tau} \phi(s_\tau)$.

  • In Eq. (9), consider the gradient component corresponding to the learner’s policy, i.e., $\mathbb{E}\left[\sum_{\tau} \gamma^{\tau} \phi(s_\tau) \mid \pi_w, P_0\right]$. We compute it with $s_0$ as the only initial state, i.e., $P_0(s) = \mathbb{1}\{s = s_0\}$. This results in the following component in $g$: $\mathbb{E}\left[\sum_{\tau} \gamma^{\tau} \phi(s_\tau) \mid \pi_w, s_0\right]$.

Hence, $g$ is given by

$$g = \mathbb{E}\left[\sum_{\tau=0}^{\infty} \gamma^{\tau} \phi(s_\tau) \,\middle|\, \pi_w, s_0\right] - \sum_{\tau} \gamma^{\tau} \phi(s_\tau).$$
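For concreteness, this empirical gradient can be sketched in a few lines of Python. This is a minimal illustration, not the paper’s implementation; the names `feats` (feature matrix with row $\phi(s)$) and `rho_learner` (the learner’s per-step state-visitation distributions from the demonstration’s start state, e.g., obtained by a forward pass through the dynamics) are hypothetical:

```python
import numpy as np

def mce_irl_gradient(demo_states, feats, rho_learner, gamma):
    """Empirical gradient of the discounted negative log-likelihood:
    expected discounted feature counts of the learner's policy (started
    from the demonstration's initial state) minus the empirical discounted
    feature counts of the teacher's demonstration.

    demo_states: sequence of states visited by the demonstration.
    feats:       array of shape (n_states, d); feats[s] is phi(s).
    rho_learner: list of arrays; rho_learner[tau][s] is the learner's
                 probability of visiting state s at time tau.
    gamma:       discount factor in (0, 1).
    """
    # Empirical discounted feature counts of the teacher's demonstration.
    mu_teacher = sum(gamma**tau * feats[s] for tau, s in enumerate(demo_states))
    # Expected discounted feature counts under the learner's policy.
    mu_learner = sum(gamma**tau * rho @ feats for tau, rho in enumerate(rho_learner))
    return mu_learner - mu_teacher
```

When the learner’s visitation probabilities match the demonstration exactly, the two feature counts coincide and the gradient vanishes, so the reward-weight update leaves the learner unchanged.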

Proposition 1.

Consider the loss function $\mathcal{L}(w)$ defined as follows:

$$\mathcal{L}(w) := -\mathbb{E}\left[\sum_{\tau=0}^{\infty} \gamma^{\tau} \log \pi_w(a_\tau \mid s_\tau) \,\middle|\, \pi^T, P_0\right].$$

Then the gradient of $\mathcal{L}(w)$ w.r.t. $w$ is given by

$$\nabla_w \mathcal{L}(w) = \mathbb{E}\left[\sum_{\tau=0}^{\infty} \gamma^{\tau} \phi(s_\tau) \,\middle|\, \pi_w, P_0\right] - \mathbb{E}\left[\sum_{\tau=0}^{\infty} \gamma^{\tau} \phi(s_\tau) \,\middle|\, \pi^T, P_0\right]. \quad (9)$$
Proof.

We will make use of the following quantities as part of the proof:

  • $\rho^{\pi^T}_\tau(s)$ is defined as the probability of visiting state $s$ at time $\tau$ by the policy $\pi^T$.

  • $\rho^{\pi^T}_\tau(s, a)$ is defined as the probability of taking action $a$ from state $s$ at time $\tau$ by the policy $\pi^T$.

  • $\rho^{\pi_w}_\tau(s)$ is defined as the probability of visiting state $s$ at time $\tau$ by the policy $\pi_w$.

  • $\rho^{\pi_w}_\tau(s, a)$ is defined as the probability of taking action $a$ from state $s$ at time $\tau$ by the policy $\pi_w$.
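These visitation probabilities can be computed by a standard forward pass through the MDP dynamics. Below is a minimal sketch; the array layout `P[s, a, s2]` and the tabular policy `pi[s, a]` are illustrative assumptions, not the paper’s notation:

```python
import numpy as np

def visitation_probs(P, pi, p0, horizon):
    """Forward computation of rho_tau(s) and rho_tau(s, a).

    P:       transition tensor, P[s, a, s2] = Pr(s2 | s, a).
    pi:      policy matrix, pi[s, a] = Pr(a | s).
    p0:      initial state distribution.
    horizon: number of time steps to unroll.
    Returns lists rho_s (state visitation) and rho_sa (state-action visitation).
    """
    rho_s, rho_sa = [np.asarray(p0, dtype=float)], []
    for _ in range(horizon):
        # rho_tau(s, a) = rho_tau(s) * pi(a | s)
        sa = rho_s[-1][:, None] * pi
        rho_sa.append(sa)
        # rho_{tau+1}(s2) = sum_{s, a} rho_tau(s, a) * P(s2 | s, a)
        rho_s.append(np.einsum("sa,sab->b", sa, P))
    return rho_s, rho_sa
```

Each forward step propagates the visitation distribution one time step through the policy and the transition kernel, so the discounted feature expectations in Eq. (9) can be accumulated by weighting `rho_s` with powers of the discount factor.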

First, we rewrite $\mathcal{L}(w)$ as follows:

$$\mathcal{L}(w) = -\sum_{\tau=0}^{\infty} \gamma^{\tau} \sum_{s, a} \rho^{\pi^T}_\tau(s, a) \log \pi_w(a \mid s) = -\sum_{\tau=0}^{\infty} \gamma^{\tau} \sum_{s, a} \rho^{\pi^T}_\tau(s, a) \left( Q^{\mathrm{soft}}_w(s, a) - V^{\mathrm{soft}}_w(s) \right).$$

Now, we compute the gradient of the first term: