Teaching Inverse Reinforcement Learners via Features and Demonstrations



Learning near-optimal behaviour from an expert’s demonstrations typically relies on the assumption that the learner knows the features that the true reward function depends on. In this paper, we study the problem of learning from demonstrations in the setting where this is not the case, i.e., where there is a mismatch between the worldviews of the learner and the expert. We introduce a natural quantity, the teaching risk, which measures the potential suboptimality of policies that look optimal to the learner in this setting. We show that bounds on the teaching risk guarantee that the learner is able to find a near-optimal policy using standard algorithms based on inverse reinforcement learning. Based on these findings, we suggest a teaching scheme in which the expert can decrease the teaching risk by updating the learner’s worldview, and thus ultimately enable her to find a near-optimal policy.

1 Introduction

Reinforcement learning has recently led to impressive and widely recognized results in several challenging application domains, including game-play, e.g., of classical games (Go) and Atari games. In these applications, a clearly defined reward function, i.e., whether a game is won or lost in the case of Go or the number of achieved points in the case of Atari games, is optimized by a reinforcement learning agent interacting with the environment.

However, in many applications it is very difficult to specify a reward function that captures all important aspects. For instance, in an autonomous driving application, the reward function of an autonomous vehicle should capture many different desiderata, including the time to reach a specified goal, safe driving characteristics, etc. In such situations, learning from demonstrations can be a remedy, transforming the need of specifying the reward function into the task of providing an expert’s demonstrations of desired behaviour; we will refer to this expert as the teacher. Based on these demonstrations, a learning agent, or simply learner, attempts to infer a (stochastic) policy that approximates the feature counts of the teacher’s demonstrations. Examples of algorithms that can be used to that end are those in Abbeel and Ng (2004) and Ziebart et al. (2008), which use inverse reinforcement learning (IRL) to estimate a reward function for which the demonstrated behaviour is optimal, and then derive a policy based on that.

For this strategy to be successful, i.e., for the learner to find a policy that achieves good performance with respect to the reward function set by the teacher, the learner has to know which features the teacher considers, i.e., which features the reward function depends on. However, as we argue, this assumption does not hold in many real-world applications. For instance, in the autonomous driving application, the teacher, e.g., a human driver, might consider very different features, including high-level semantic features, while the learner, i.e., the autonomous car, only has sensory inputs in the form of distance sensors and cameras providing low-level semantic features. In such a case, there is a mismatch between the teacher’s and the learner’s features, which can lead to degraded performance and unexpected behaviour of the learner.

In this paper we investigate exactly this setting. We assume that the true reward function is a linear combination of a set of features known to the teacher. The learner also assumes that the reward function is linear, but in features which are different from the truly relevant ones; e.g., the learner could only observe a subset of those features. In this setting, we study the potential decrease in performance of the learner as a function of the learner’s worldview. We introduce a natural and easily computable quantity, the teaching risk, which bounds the maximum possible performance gap of the teacher and the learner.

We continue our investigation by considering a teaching scenario in which the teacher can provide additional features to the learner, e.g., add additional sensors to an autonomous vehicle. This naturally raises the question which features should be provided to the learner to maximize her performance. To this end, we propose an algorithm that greedily minimizes the teaching risk, thereby shrinking the maximal gap in performance that policies optimized with respect to the learner’s resp. teacher’s worldview can have.

Our main contributions are:

  1. We formalize the problem of worldview mismatch for reward computation and policy optimization based on demonstrations.

  2. We introduce the concept of teaching risk, bounding the maximal performance gap of the teacher and the learner as a function of the learner’s worldview and the true reward function.

  3. We formally analyze the teaching risk and its properties, giving rise to an algorithm for teaching a learner with an incomplete worldview.

  4. We substantiate our findings in a large set of experiments.

2 Related Work

Our work is related to the area of algorithmic machine teaching, where the objective is to design effective teaching algorithms to improve the learning process of a learner Zhu et al. (2018); Zhu (2015). Machine teaching has recently been studied in the context of diverse real-world applications such as personalized education and intelligent tutoring systems Hunziker et al. (2018); Rafferty et al. (2016); Patil et al. (2014), social robotics Cakmak and Thomaz (2014), adversarial machine learning Mei and Zhu (2015), program synthesis Mayer et al. (2017), and human-in-the-loop crowdsourcing systems Singla et al. (2014, 2013). However, different from ours, most of the current work in machine teaching is limited to supervised learning settings, and to a setting where the teacher has full knowledge about the learner’s model.

Going beyond supervised learning, Cakmak et al. (2012); Brown and Niekum (2018) have studied the problem of teaching an IRL agent, similar in spirit to what we do in our work. Our work differs from theirs in several aspects: Cakmak et al. (2012); Brown and Niekum (2018) assume that the teacher has full knowledge of the learner’s feature space, and then provide a near-optimal set of demonstrations in a single batch; we consider a more realistic setting where there is a mismatch between the teacher’s and the learner’s feature spaces. Furthermore, in our setting, the teaching signal is a mixture of demonstrations and features.

Our work is also related to teaching via explanations and features as explored recently by Aodha et al. (2018) in a supervised learning setting. However, we explore the space of teaching by explanations when teaching an IRL agent, which makes it technically very different from Aodha et al. (2018). Another important aspect of our teaching algorithm is that it is adaptive in nature, in the sense that the next teaching signal accounts for the current performance of the learner (i.e., worldview in our setting). Recent work of Chen et al. (2018); Liu et al. (2017) have studied adaptive teaching algorithms, however only in a supervised learning setting.

Apart from machine teaching, our work is related to Stadie et al. (2017) and Sermanet et al. (2018), which also study imitation learning problems in which the teacher and the learner view the world differently. However, these two works are technically very different from ours, as we consider the problem of providing teaching signals under worldview mismatch from the perspective of the teacher.

3 The Model

Basic definitions.

Our environment is described by a Markov decision process $M = (\mathcal{S}, \mathcal{A}, P, D_0, r, \gamma)$, where $\mathcal{S}$ is a finite set of states, $\mathcal{A}$ is a finite set of available actions, $P = (P(\cdot \mid s, a))_{s \in \mathcal{S}, a \in \mathcal{A}}$ is a family of distributions on $\mathcal{S}$ indexed by $\mathcal{S} \times \mathcal{A}$, with $P(s' \mid s, a)$ describing the probability of transitioning from state $s$ to state $s'$ when action $a$ is taken, $D_0$ is the initial-state distribution on $\mathcal{S}$ describing the probability of starting in a given state, $r: \mathcal{S} \to \mathbb{R}$ is a reward function and $\gamma \in (0, 1)$ is a discount factor. We assume that there exists a feature map $\phi: \mathcal{S} \to \mathbb{R}^d$ such that the reward function is linear in the features given by $\phi$, i.e.,

$$r(s) = \langle w^*, \phi(s) \rangle$$

for some $w^* \in \mathbb{R}^d$ which we assume to satisfy $\|w^*\| = 1$.

By a policy we mean a family $\pi = (\pi(\cdot \mid s))_{s \in \mathcal{S}}$ of distributions on $\mathcal{A}$ indexed by $\mathcal{S}$, where $\pi(a \mid s)$ describes the probability of taking action $a$ in state $s$. We denote by $\Pi$ the set of all such policies. The performance measure for policies we are interested in is the expected discounted reward $R(\pi) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t)\right]$, where the expectation is taken with respect to the distribution over trajectories induced by $\pi$ together with the transition probabilities $P$ and the initial-state distribution $D_0$. We call a policy $\pi^*$ optimal for the reward function $r$ if $\pi^* \in \operatorname{argmax}_{\pi \in \Pi} R(\pi)$. Note that

$$R(\pi) = \langle w^*, \mu(\pi) \rangle,$$

where $\mu: \Pi \to \mathbb{R}^d$, $\mu(\pi) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \phi(s_t)\right]$, is the map taking a policy to its vector of (discounted) feature expectations. Note also that the image of this map is a bounded subset of $\mathbb{R}^d$ due to the finiteness of $\mathcal{S}$ and the presence of the discounting factor $\gamma < 1$; we denote by $D$ its diameter. Here and in what follows, $\|\cdot\|$ denotes the Euclidean norm.
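For a finite MDP, the feature expectation map $\mu$ can be computed in closed form by solving a linear system rather than by sampling trajectories. The following sketch (our own notation; the policy is assumed to be given as a state-to-state transition matrix) illustrates this.

```python
import numpy as np

def feature_expectations(P_pi, Phi, d0, gamma):
    """Discounted feature expectations mu(pi) = sum_t gamma^t E[phi(s_t)].

    P_pi  : (S, S) transition matrix under the policy, P_pi[s, s'] = P(s' | s)
    Phi   : (S, d) feature matrix whose row s is phi(s)
    d0    : (S,) initial-state distribution
    gamma : discount factor in (0, 1)
    """
    S = P_pi.shape[0]
    # Discounted state-occupancy vector nu = sum_t gamma^t d_t,
    # where d_{t+1} = P_pi^T d_t; it solves (I - gamma * P_pi^T) nu = d0.
    nu = np.linalg.solve(np.eye(S) - gamma * P_pi.T, d0)
    return Phi.T @ nu
```

The true performance is then $R(\pi) = \langle w^*, \mu(\pi) \rangle$, and the learner's view of the feature expectations is simply $A\,\mu(\pi)$.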

Problem formulation.

We consider two agents, a learner $L$ and a teacher $T$, whose ultimate objective is that $L$ finds a near-optimal policy with the help of $T$.

The challenge we address in this paper is that of achieving this objective under the assumption that there is a mismatch between the worldviews of $L$ and $T$, by which we mean the following: Instead of the “true” feature vectors $\phi(s)$, $L$ observes feature vectors $A\phi(s)$, where

$$A: \mathbb{R}^d \to \mathbb{R}^k$$

is a linear map (i.e., a $k \times d$ matrix) that we interpret as $L$’s worldview. The simplest case is that $A$ selects a subset of the features given by $\phi$, thus modelling the situation where $L$ only has access to a subset of the features relevant for the true reward, which is a reasonable assumption for many real-world situations. More generally, $A$ could encode different weightings of those features.
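Concretely, a worldview that selects a subset of the true features is a 0/1 row-selection matrix. A toy sketch with made-up numbers (all names ours):

```python
import numpy as np

d = 5
# Hypothetical worldview: the learner observes only features 0 and 3
# of the 5-dimensional "true" feature vector.
A = np.zeros((2, d))
A[0, 0] = 1.0
A[1, 3] = 1.0

phi_s = np.array([0.2, -1.0, 0.7, 0.5, 0.0])  # phi(s) for some state s
observed = A @ phi_s                           # the learner's view A phi(s)
```

Here `observed` equals `[0.2, 0.5]`; the remaining three features are invisible to the learner, and $\ker A$ is spanned by the corresponding coordinate directions.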

The question we ask is whether and how $T$ can provide demonstrations or perform other teaching interventions so as to ensure that $L$ achieves the goal of finding a policy with near-optimal performance, or at least gets as close to that goal as possible.

Assumptions on the teacher and on the learner.

We assume that $T$ knows the full specification of the MDP as well as $L$’s worldview $A$, and that she can help $L$ to learn in two different ways:

  1. By providing $L$ with demonstrations of behaviour in the MDP;

  2. By updating $L$’s worldview $A$.

Demonstrations can be provided in the form of trajectories sampled from a (not necessarily optimal) policy $\pi_T$, or in the form of feature expectations $\mu(\pi_T)$ of such a policy. The method by which $T$ can update $A$ will be discussed in Section 5. Based on $T$’s instructions, $L$ then attempts to train a policy $\pi_L$ whose feature expectations approximate those of $\pi_T$. Note that, if this is successful, the performance of $\pi_L$ is close to that of $\pi_T$ due to the linear form of the reward function.

We assume that $L$ has access to an algorithm that enables her to do the following: Whenever she is given sufficiently many demonstrations sampled from a policy $\pi_T$, she is able to find a policy $\pi_L$ whose feature expectations in her worldview approximate those of $\pi_T$, i.e., $\|A\mu(\pi_L) - A\mu(\pi_T)\| \le \varepsilon$. Examples of algorithms that $L$ could use to that end are the algorithms in Abbeel and Ng (2004) and Ziebart et al. (2008), which are based on IRL. The following discussion does not require any further specification of the precise algorithm $L$ uses in order to match feature expectations.

Figure 1: A simple example to illustrate the challenges arising when teaching under worldview mismatch. We consider an MDP in which $\mathcal{S}$ is the set of cells in the gridworld displayed, $\mathcal{A} = \{\mathrm{left}, \mathrm{right}\}$, and with feature map $\phi$ taking cell $i$ to the one-hot vector $e_i$. The initial state distribution is uniform and the transition dynamics are deterministic. More specifically, when the agent takes action $\mathrm{right}$ (resp. $\mathrm{left}$), it moves to the neighbouring cell to the right (resp. left); when the agent is in the rightmost (resp. leftmost) cell, the action $\mathrm{right}$ (resp. $\mathrm{left}$) is not permitted. The reward weights (up to normalization) are encoded by the colors of the cells. The policy $\pi^*$ in (\subreffig:1d-policies-optimal-policy) is the optimal policy with respect to the true reward function. Assuming that the learner only observes the feature corresponding to the central cell, the policy $\pi_0$ in (\subreffig:1d-policies-policy0) is a better teaching policy in the worst-case sense. See the main text for a detailed description.

Challenges when teaching under worldview mismatch.

If there were no mismatch in the worldview (i.e., if $A$ were the identity matrix in $\mathbb{R}^{d \times d}$), then the teacher could simply provide demonstrations from the optimal policy $\pi^*$ to achieve the desired objective. However, the example in Figure 1 illustrates that this is no longer true when there is a mismatch between the worldviews.

For the MDP in Figure 1, assume that the teacher provides demonstrations using the optimal policy $\pi^*$, which moves to the rightmost cell as quickly as possible and then alternates between cells 4 and 5 (see Figure 1). Note that the policy $\bar{\pi}$ which moves to the leftmost cell as quickly as possible and then alternates between cells 1 and 2 has the same feature expectations as $\pi^*$ in the learner’s worldview; in fact, $\bar{\pi}$ is the unique policy other than $\pi^*$ with that property (provided we restrict to deterministic policies). As the teacher is unaware of the internal workings of the learner, she has no control over which of these two policies the learner will eventually learn by matching feature expectations.

However, the teacher can ensure that the learner achieves a better performance in a worst-case sense by providing demonstrations tailored to the problem at hand. In particular, assume that the teacher uses $\pi_0$, the policy shown in Figure 1, which moves to the central cell as quickly as possible and then alternates between cells 3 and 4. The only other policy with which the learner could match the feature expectations of $\pi_0$ in her worldview (restricting again to deterministic policies) is the policy $\bar{\pi}_0$ that moves to the central cell as quickly as possible and then alternates between cells 2 and 3.

Note that $R(\bar{\pi}_0) > R(\bar{\pi})$, and hence $\pi_0$ is a better teaching policy than $\pi^*$ regarding the performance that a learner matching feature expectations in her worldview achieves in the worst case. In particular, this example shows that providing demonstrations from the truly optimal policy does not, in general, guarantee that the learner’s policy achieves good performance.

4 Teaching Risk

Definition of teaching risk.

The fundamental problem in the setting described in Section 3 is that two policies that perform equally well with respect to any estimate that $L$ may have of the reward function may perform very differently with respect to the true reward function. Hence, even if $L$ is able to imitate the behaviour of the teacher well in her worldview, there is generally no guarantee on how good her performance is with respect to the true reward function. For an illustration, see Figure 2.

To address this problem, we define the following quantity: The teaching risk for a given worldview $A$ with respect to reward weights $w^*$ is

$$\rho(A; w^*) := \max_{v \in \ker A, \, \|v\| = 1} \langle w^*, v \rangle. \qquad (1)$$

Here $\ker A$ denotes the kernel of the linear map $A$. Geometrically, $\rho(A; w^*)$ is the cosine of the angle between $w^*$ and $\ker A$; in other words, it measures the degree to which $A$ deviates from satisfying $\ker A \perp w^*$.
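Since $\rho(A; w^*)$ equals the norm of the orthogonal projection of the unit vector $w^*$ onto $\ker A$, it can be computed from an SVD of $A$. A minimal sketch (the function name is ours):

```python
import numpy as np

def teaching_risk(A, w, tol=1e-10):
    """rho(A; w): max of <w, v> over unit vectors v in ker(A).

    Equals the norm of the orthogonal projection of the unit vector w
    onto ker(A), i.e. the cosine of the angle between w and the kernel.
    """
    _, sing, Vt = np.linalg.svd(A)
    rank = int(np.sum(sing > tol))
    N = Vt[rank:].T                   # orthonormal basis of ker(A)
    if N.shape[1] == 0:
        return 0.0                    # trivial kernel: zero teaching risk
    return float(np.linalg.norm(N.T @ w))
```

For example, if $A$ observes only the first of three features and $w^*$ is the first standard basis vector, the risk is $0$; if $w^*$ points along an unobserved feature, the risk is $1$.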

Significance of teaching risk.

To understand the significance of the teaching risk in our context, assume that $L$ is able to find a policy $\pi_L$ which matches the feature expectations of $T$’s (not necessarily optimal) policy $\pi_T$ perfectly in her worldview, which is equivalent to $A\mu(\pi_L) = A\mu(\pi_T)$. Then $\mu(\pi_T) - \mu(\pi_L) \in \ker A$, and directly from the definition of the teaching risk, we see that the gap between their performances with respect to the true reward function satisfies

$$R(\pi_T) - R(\pi_L) = \langle w^*, \mu(\pi_T) - \mu(\pi_L) \rangle \le \rho(A; w^*) \, \|\mu(\pi_T) - \mu(\pi_L)\| \le \rho(A; w^*) \cdot D, \qquad (2)$$

with equality in the first inequality if $\mu(\pi_T) - \mu(\pi_L)$ is proportional to a vector realizing the maximum in (1). If the teaching risk is large, this performance gap can generally be large as well. This motivates the interpretation of $\rho(A; w^*)$ as a measure of the risk incurred when teaching the task modelled by an MDP with reward weights $w^*$ to a learner whose worldview is represented by $A$.

Figure 2: Two policies $\pi_0$ and $\pi_1$ in the environment introduced in Figure 1 (the policy $\pi_0$ here is identical to the one in Figure 1(\subreffig:1d-policies-policy0)). We assume again that $L$ can only observe the feature corresponding to the central cell. Provided that the initial state distribution is uniform, the feature expectations of $\pi_0$ and $\pi_1$ in $L$’s worldview are equal, and hence these policies perform equally well with respect to any estimate of the reward function that $L$ may have. In fact, both look optimal to $L$ if she assumes that the central cell carries positive reward. However, their performance with respect to the true reward function is positive for $\pi_0$ but negative for $\pi_1$. This illustrates that, if all we know about $L$ is that she matches feature counts in her worldview, we can generally not give good performance guarantees for the policy she finds.

On the other hand, smallness of the teaching risk implies that this performance gap cannot be too large. The following theorem generalizes the bound in (2) to the situation in which $\pi_L$ only approximately matches the feature expectations of $\pi_T$.

Theorem 1.

Assume that $\|A\mu(\pi_T) - A\mu(\pi_L)\| \le \varepsilon$. Then the gap between the true performances of $\pi_T$ and $\pi_L$ satisfies

$$R(\pi_T) - R(\pi_L) \le \rho(A; w^*) \cdot D + C\varepsilon,$$

with $C = \|A^+\|$, the spectral norm of the Moore–Penrose pseudoinverse of $A$.

Theorem 1 shows the following: If $\pi_L$ imitates $\pi_T$’s behaviour well in $L$’s worldview (meaning that $\varepsilon$ can be chosen small) and if the teaching risk is sufficiently small, then $\pi_L$ will perform nearly as well as $\pi_T$ with respect to the true reward. In particular, if $T$’s policy is optimal, $\pi_T = \pi^*$, then $L$’s policy is guaranteed to be near-optimal.

The quantity $C$ appearing in Theorem 1 is a bound on the amount to which $A$ distorts lengths of vectors in the orthogonal complement of $\ker A$. Note that $C$ is independent of the teaching risk, in the sense that one can change it, e.g., by rescaling $A$ by some $\lambda > 0$, without changing the teaching risk.
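The claim that rescaling $A$ changes the distortion constant but not the teaching risk is easy to verify numerically. In the sketch below (random data, all names ours), the constant $C$ is taken to be the spectral norm of the Moore–Penrose pseudoinverse, which bounds the length distortion on $(\ker A)^\perp$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 4))          # a generic 2x4 worldview matrix
w = rng.normal(size=4)
w /= np.linalg.norm(w)               # unit reward vector

def rho(A, w, tol=1e-10):
    """Teaching risk: norm of the projection of w onto ker(A)."""
    _, sing, Vt = np.linalg.svd(A)
    N = Vt[int(np.sum(sing > tol)):].T
    return float(np.linalg.norm(N.T @ w))

C = np.linalg.norm(np.linalg.pinv(A), 2)  # 1 / (smallest positive singular value)

# Rescaling A leaves the teaching risk unchanged but rescales C.
lam = 10.0
assert np.isclose(rho(lam * A, w), rho(A, w))
assert np.isclose(np.linalg.norm(np.linalg.pinv(lam * A), 2), C / lam)
```

This reflects the fact that $\ker(\lambda A) = \ker A$ for any $\lambda > 0$, while the pseudoinverse scales as $(\lambda A)^+ = A^+/\lambda$.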

Teaching risk as obstruction to recognizing optimality.

We now provide a second motivation for the consideration of the teaching risk, by interpreting it as a quantity that measures the degree to which truly optimal policies deviate from looking optimal to $L$. We make the technical assumption that the set of feature expectations $F := \mu(\Pi) \subset \mathbb{R}^d$ is the closure of a bounded open set with smooth boundary (this will only be needed for the proofs). Our first observation is the following:

Proposition 1.

Let $\pi$ be a policy which is optimal for the true reward function $r = \langle w^*, \phi(\cdot) \rangle$. If $\rho(A; w^*) > 0$, then $\pi$ is suboptimal with respect to any choice of reward function $r^L = \langle w, A\phi(\cdot) \rangle$ with $w \neq 0$.

In view of Proposition 1, a natural question is whether we can bound the suboptimality, in $L$’s view, of a truly optimal policy in terms of the teaching risk. The following theorem provides such a bound:

Theorem 2.

Let $\pi$ be a policy which is optimal for the true reward function $r = \langle w^*, \phi(\cdot) \rangle$. There exists a unit vector $w \in \mathbb{R}^k$ such that

$$\langle w, A\mu(\pi') \rangle - \langle w, A\mu(\pi) \rangle \le \frac{\rho \, \|A\| \, D}{\sqrt{1 - \rho^2}} \quad \text{for all } \pi' \in \Pi,$$

where $\rho = \rho(A; w^*)$.

Note that the expression on the right-hand side of the inequality in Theorem 2 tends to $0$ as $\rho(A; w^*) \to 0$, provided $\|A\|$ is bounded. Theorem 2 therefore implies that, if $\rho(A; w^*)$ is small, a truly optimal policy $\pi$ is near-optimal for some choice of reward function linear in the features $L$ observes, namely, the reward function $r^L = \langle w, A\phi(\cdot) \rangle$ with $w$ the vector whose existence is claimed by the theorem.

5 Teaching

Reward vector $w^*$, set of teachable features $F$, feature budget $B$, initial worldview $A_0$, teacher policy $\pi_T$, initial learner policy $\pi_0$, performance threshold $\eta$.
for $t = 1, \dots, B$ do
     if $R(\pi_{t-1}) < \eta$ then
          $T$ selects feature $f_t \in \operatorname{argmin}_{f \in F} \rho([A_{t-1}; f^\top]; w^*)$ to teach
          $L$’s worldview gets updated: $A_t \leftarrow [A_{t-1}; f_t^\top]$
          $L$ trains a new policy $\pi_t$ by matching the feature expectations $A_t \mu(\pi_T)$
     end if
end for
Algorithm 1 TRGreedy: Feature- and demo-based teaching with TR-greedy feature selection
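In code, the TR-greedy feature-selection step of Algorithm 1 might be sketched as follows (the Learning subroutine, the budget loop and the stopping test are omitted; all names are ours):

```python
import numpy as np

def teaching_risk(A, w, tol=1e-10):
    """rho(A; w): norm of the projection of the unit vector w onto ker(A)."""
    _, sing, Vt = np.linalg.svd(A)
    N = Vt[int(np.sum(sing > tol)):].T   # orthonormal basis of ker(A)
    return float(np.linalg.norm(N.T @ w)) if N.shape[1] else 0.0

def tr_greedy_step(A, w_star, candidates):
    """Pick the teachable feature whose addition to the worldview reduces
    the teaching risk the most; return the updated worldview and the index."""
    risks = [teaching_risk(np.vstack([A, f]), w_star) for f in candidates]
    best = int(np.argmin(risks))
    return np.vstack([A, candidates[best]]), best
```

After each such step, the learner would retrain a policy against her updated view $A_t \mu(\pi_T)$ of the teacher's feature expectations.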

Feature teaching.

The discussion in the last section shows that, under our assumptions on how $L$ learns, a teaching scheme in which $T$ solely provides demonstrations to $L$ can generally, i.e., without any assumption on the teaching risk, not lead to reasonable guarantees on the learner’s performance with respect to the true reward. A natural strategy is to introduce additional teaching operations by which the teacher can update $L$’s worldview $A$ and thereby decrease the teaching risk.

The simplest way by which $T$ can change $L$’s worldview is by informing her about features that are relevant to performing well in the task, thus causing her to update her worldview to

$$A' = \begin{pmatrix} A \\ f^\top \end{pmatrix}.$$

Viewing $A$ as a matrix, this operation appends $f^\top$ as a row to $A$. (Strictly speaking, the feature that is thus provided is the map $s \mapsto \langle f, \phi(s) \rangle$; we identify this map with the vector $f \in \mathbb{R}^d$ in the following and thus keep calling $f$ a “feature”.)

This operation has simple interpretations in the settings we are interested in: If $L$ is a human learner, “teaching a feature” could mean making $L$ aware that a certain quantity, which she might not have taken into account so far, is crucial to achieving high performance. If $L$ is a machine, such as an autonomous car or a robot, it could mean installing an additional sensor.

Teachable features.

Note that if $T$ could provide arbitrary vectors $f \in \mathbb{R}^d$ as new features, she could always, no matter what $A$ is, decrease the teaching risk to zero in a single teaching step by choosing $f = w^*$, which amounts to telling $L$ the true reward function. We assume that this is not possible, and that instead only the elements of a fixed finite set of teachable features

$$F \subset \mathbb{R}^d$$

can be taught. In real-world applications, such constraints could come from the limited availability of sensors and their costs; in the case that $L$ is a human, they could reflect the requirement that features need to be interpretable, i.e., that they can only be simple combinations of basic observable quantities.

Greedy minimization of teaching risk.

Our basic teaching algorithm TRGreedy (Algorithm 1) works as follows: $T$ and $L$ interact in rounds, in each of which $T$ provides $L$ with the feature $f \in F$ which reduces the teaching risk of $L$’s worldview with respect to $w^*$ by the largest amount. $L$ then trains a policy with the goal of imitating her current view of the feature expectations of the teacher’s policy; the Learning algorithm she uses could be, e.g., the apprenticeship learning algorithm from Abbeel and Ng (2004).

Computation of the teaching risk.

The computation of the teaching risk required of $T$ in every round of Algorithm 1 can be performed as follows: One first computes the orthogonal complement of the row space of $A$ in $\mathbb{R}^d$, i.e., $\ker A$, and intersects it with $\operatorname{span}(w^*) + (\ker A)^\perp$, thus obtaining (generically) a 1-dimensional subspace $\ell$ of $\mathbb{R}^d$; this can be done using SVD. The teaching risk is then $\rho(A; w^*) = \langle w^*, v \rangle$ with $v$ the unique unit vector in $\ell$ with $\langle w^*, v \rangle \ge 0$.

6 Experiments

Figure 3: Performance vs. teaching risk. Each point in the plots shows the relative performance that a learner with a random worldview matrix $A$ achieved after one round of learning, plotted against the teaching risk $\rho(A; w^*)$ of $A$. The same gridworld was used for all plots. The reward vector $w^*$ was sampled randomly in each round. (\subreffig:tr-vs-perf.a)–(\subreffig:tr-vs-perf.c) correspond to different values of the discount factor $\gamma$.

Our experimental setup is similar to the one in Abbeel and Ng (2004), i.e., we use $n \times n$ gridworlds in which non-overlapping square regions of $m \times m$ neighbouring cells are grouped together to form macrocells, for some $m$ dividing $n$. The state set $\mathcal{S}$ is the set of gridpoints, the action set is $\mathcal{A} = \{\mathrm{up}, \mathrm{down}, \mathrm{left}, \mathrm{right}\}$, and the feature map $\phi$ maps a gridpoint belonging to macrocell $j$ to the one-hot vector $e_j$; the dimension of the “true” feature space is therefore $d = (n/m)^2$. Note that these gridworlds satisfy the quite special property that for states $s, s'$, we either have $\phi(s) = \phi(s')$ (if $s, s'$ belong to the same macrocell), or $\|\phi(s) - \phi(s')\| = \sqrt{2}$. The reward weights $w^*$ are sampled randomly for all experiments unless mentioned otherwise. As the Learning algorithm within Algorithm 1, we use the projection version of the apprenticeship learning algorithm from Abbeel and Ng (2004).
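The macrocell feature map described above can be built as follows (a sketch under the stated conventions; the function name is ours):

```python
import numpy as np

def macrocell_features(n, m):
    """One-hot macrocell features for an n x n gridworld whose cells are
    grouped into m x m macrocells (m must divide n).

    Returns Phi of shape (n*n, (n//m)**2); row s is phi(s).
    """
    assert n % m == 0
    k = n // m                               # macrocells per side
    Phi = np.zeros((n * n, k * k))
    for x in range(n):
        for y in range(n):
            macro = (x // m) * k + (y // m)  # index of the macrocell containing (x, y)
            Phi[x * n + y, macro] = 1.0
    return Phi
```

Any two states then either share a feature vector or differ by a vector of norm $\sqrt{2}$, matching the property noted above.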

Performance vs. teaching risk.

The plots in Figure 3 illustrate the significance of the teaching risk for the problem of teaching a learner under worldview mismatch. To obtain these plots, we used a fixed gridworld; for each value considered, we sampled five random worldview matrices $A$, and let $L$ train a policy using the projection algorithm in Abbeel and Ng (2004), with the goal of matching the feature expectations corresponding to an optimal policy for a reward vector $w^*$ that was sampled randomly in each round. Each point in the plots corresponds to one such experiment and shows the relative performance of $\pi_L$ after the training round vs. the teaching risk of $L$’s worldview matrix $A$.

All plots in Figure 3 show that the variance of the learner’s performance decreases as the teaching risk decreases. This supports our interpretation of the teaching risk as a measure of the potential gap between the performances of $\pi_T$ and $\pi_L$ when $L$ matches the feature expectations of $\pi_T$ in her worldview. The plots also show that the bound for this gap provided in Theorem 1 is overly conservative in general, given that $L$’s performance is often high and has small variance even if the teaching risk is relatively large.

The plots indicate that for larger $\gamma$ (i.e., less discounting), it is easier for $L$ to achieve high performance even if the teaching risk is large. This makes intuitive sense: If there is a lot of discounting, it is important to reach high-reward states quickly in order to perform well, which necessitates being able to recognize where these states are located, which in turn requires the teaching risk to be small. If there is little discounting, it is sufficient to know the location of some, possibly distant, reward state, and hence even a learner with a very deficient worldview (i.e., high teaching risk) can do well in that case.

Figure 4: Gridworld with high-reward and obstacle macrocells. The colors in (\subreffig:gridworld-interpretable) indicate the reward of the corresponding macrocell, with blue meaning positive and red meaning negative reward. The numbers within each macrocell correspond to the feature index. The histograms in (\subreffig:histogram1) and (\subreffig:histogram2) show how often, in a series of 100 experiments, each feature was selected as the first resp. second feature to be taught to a learner with a random 5-dimensional initial worldview.

Small gridworlds with high reward states and obstacles.

We tested TRGreedy (Algorithm 1) on gridworlds such as the one in Figure 4, with a small number of states with high positive rewards, some obstacle states with high negative rewards, and all other states having rewards close to zero. The histograms in Figures 4(\subreffig:histogram1) and 4(\subreffig:histogram2) show how often each of the features was selected by the algorithm as the first resp. second feature to be taught to the learner in 100 experiments, in each of which the learner was initialized with a random 5-dimensional worldview. In most cases, the algorithm first selected the features corresponding to one of the high-reward cells 4 and 9 or to one of the obstacle cells 18 and 23, which are clearly those that the learner must be most aware of in order to achieve high performance.

Comparison of algorithms.

We compared the performance of TRGreedy (Algorithm 1) to two variants of the algorithm which differ only in how the features to be taught in each round are selected: The first variant, Random, simply selects a random feature from the set of all teachable features. The second variant, PerfGreedy, greedily selects, among all teachable features, the feature that will lead to the best performance in the next round (computed by simulating the teaching process for each feature and evaluating the corresponding learner).

The plots in Figure 5 show, for each of the three algorithms, the relative performance with respect to the true reward function that the learner achieved after each round of feature teaching and training a policy $\pi_t$, as well as the corresponding teaching risks and runtimes, plotted over the number of features taught. The relative performance of the learner’s policy was computed with respect to an optimal policy for the true reward function.

We observed in all our experiments that TRGreedy performed significantly better than Random. While the comparison between TRGreedy and PerfGreedy was slightly in favour of the latter, one should note that a teacher running PerfGreedy must simulate a learning round of $L$ for all features not yet taught, which presupposes that $T$ knows $L$’s learning algorithm, and which also leads to very high runtime. If $T$ only knows that $L$ is able to match (her view of) the feature expectations of $T$’s demonstrations and simulates $L$ using some algorithm capable of this, there is no guarantee that $L$ will perform as well as her simulated counterpart, as there may be a large discrepancy between the true performances of two policies which in $L$’s view have the same feature expectations. In contrast, TRGreedy relies on much less information, namely the kernel of $A$, and in particular is agnostic to the precise learning algorithm that $L$ uses to approximate feature counts.

\thesubsubfigure Relative performance
\thesubsubfigure Teaching risk
\thesubsubfigure Runtime
Figure 5: Comparison of TRGreedy vs. PerfGreedy vs. Random. The plots show (\subreffig:experimental-comparison-a) the relative performance that the learner achieved after each round of feature teaching and training a policy, (\subreffig:experimental-comparison-b) the teaching risk after each such step, and (\subreffig:experimental-comparison-c) the runtime required to perform each step. We averaged over 100 experiments, in each of which a new random gridworld and a new set of randomly selected teachable features were sampled; the bars in the relative performance plot indicate the standard deviations. The same discount factor $\gamma$ was used in all cases.

7 Conclusions and Outlook

We presented an approach to dealing with the problem of worldview mismatch in situations in which a learner attempts to find a policy matching the feature counts of a teacher’s demonstrations. We introduced the teaching risk, a quantity that depends on the worldview of the learner and the true reward function and which (1) measures the degree to which policies which are optimal from the point of view of the learner can be suboptimal from the point of view of the teacher, and (2) is an obstruction for truly optimal policies to look optimal to the learner. We showed that under the condition that the teaching risk is small, a learner matching feature counts using e.g. standard IRL-based methods is guaranteed to learn a near-optimal policy from demonstrations of the teacher even under worldview mismatch.

Based on these findings, we presented our teaching algorithm TRGreedy, in which the teacher updates the learner’s worldview by teaching her features which are relevant for the true reward function in a way that greedily minimizes the teaching risk, and then provides her with demonstrations based on which she learns a policy using any suitable algorithm. We tested our algorithm in gridworld settings and compared it to other ways of selecting features to be taught. Experimentally, we found that TRGreedy performed comparably to a variant which selected features based on greedily maximizing performance, and consistently better than a variant with randomly selected features.

We plan to investigate extensions of our ideas to nonlinear settings and to test them in more complex environments in future work. We hope that, ultimately, such extensions will be applicable in real-world scenarios, for example in systems in which human expert knowledge is represented as a reward function, and where the goal is to teach this expert knowledge to human learners.

Appendix A Proof of Theorem 1

Proof of Theorem 1.

Denote by $P$ the orthogonal projection onto $\ker A$ and let $\delta = \mu(\pi_T) - \mu(\pi_L)$. Note that we have $P\delta \in \ker A$ and $(\mathrm{id} - P)\delta = A^+ A \delta$. It follows that

$$\langle w^*, P\delta \rangle \le \rho(A; w^*) \, \|P\delta\| \le \rho(A; w^*) \cdot D
\qquad \text{and} \qquad
\|(\mathrm{id} - P)\delta\| \le \|A^+\| \, \|A\delta\| \le C\varepsilon,$$

using the definition of $\rho(A; w^*)$, the fact that $\|P\delta\| \le \|\delta\| \le D$, and the assumption that $\|A\delta\| \le \varepsilon$. We then obtain

$$R(\pi_T) - R(\pi_L) = \langle w^*, \delta \rangle = \langle w^*, P\delta \rangle + \langle w^*, (\mathrm{id} - P)\delta \rangle \le \rho(A; w^*) \cdot D + C\varepsilon,$$

using the triangle inequality, the Cauchy–Schwarz inequality and the definition of $C$, and the estimates above. ∎

Appendix B Proof of Proposition 1

Figure 6: A situation in which $\rho(A; w^*) > 0$: Here, $A$ is the projection onto the horizontal axis. The points which $A$ maps to the boundary of $A(F)$, and which therefore appear optimal to $L$, are the two points marked by $\times$, at which the normal vector to $\partial F$ is contained in $(\ker A)^\perp$; these are precisely the points which are optimal for some choice of $w^*$ with $\rho(A; w^*) = 0$ (namely, $w^*$ orthogonal to $\ker A$). All other points in $\partial F$ get mapped by $A$ to the interior of $A(F)$ and therefore appear suboptimal for any choice of reward function $L$ might consider.
Proof of Proposition 1.

As mentioned in the main text, we assume that the set $F = \mu(\Pi)$ is the closure of a bounded open set and has a smooth boundary $\partial F$.

Note that for a policy to be optimal with respect to , the following conditions need to be satisfied:

  1. lies in the boundary ,

  2. is normal to at .

The second statement is equivalent to saying that the tangent space to at is .

Assume now that . This is equivalent to saying that , i.e., to saying that is not tangent to at . That implies that there exist some such that is contained in the interior of , which means that a sufficiently small ball around is contained in . In particular, a small ball around in the affine space is entirely contained in . This implies that is contained in the interior of , i.e., not in the boundary . Therefore is suboptimal with respect to any choice of reward function with . ∎
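The geometric content of Proposition 1, as pictured in Figure 6, can be illustrated with a small numerical example: when the learner’s worldview is the projection onto the horizontal axis, only points whose projection lies on the boundary of the projected set (here, its minimum or maximum) can appear optimal to her. The candidate feature-count vectors below are hypothetical values chosen purely for illustration.

```python
import numpy as np

# Hypothetical feature-count vectors in R^2 (vertices of the achievable set).
points = np.array([[0., 0.], [1., 2.], [3., 1.], [2., -1.]])

# The learner only perceives the first feature: her worldview is the
# projection onto the horizontal axis.
proj = points[:, 0]

# A point can look optimal to the learner only if its projection lies on
# the boundary of the projected set, i.e., attains the min or max.
looks_optimal = (proj == proj.min()) | (proj == proj.max())
print(points[looks_optimal])  # the two extreme points
```

All other points project to the interior of the projected interval and hence appear suboptimal to the learner for every reward function she might consider, exactly as in the proof above.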

Appendix C Proof of Theorem 2

Proof of Theorem 2.

The assumption that is optimal for the reward function implies that for all . By decomposing as , where denotes the orthogonal projection onto and the orthogonal projection onto , we obtain


The first summand can be bounded as follows:


using the Cauchy-Schwarz inequality and the fact that . By combining estimates (3) and (4), we obtain


Denote now by the Moore-Penrose pseudoinverse of , and by its transpose. We have


where the second equality uses the fact that the restriction of to is the identity (in fact, , a general property of Moore-Penrose pseudoinverses). Setting and combining inequality (5) with (6), we obtain


We now estimate the term :


Since , combining (7) and (8) yields

This holds for all , and hence we can maximize over to obtain the statement claimed in Theorem 2. ∎
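The Moore-Penrose identity invoked in the proof above (the restriction of a matrix times its pseudoinverse to the matrix’s column space is the identity, a consequence of the general property A A⁺ A = A) can be checked numerically; the random matrix below is for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))   # tall matrix with full column rank
A_pinv = np.linalg.pinv(A)

# General Moore-Penrose identity: A A^+ A = A, i.e. A A^+ acts as the
# identity on the column space of A.
assert np.allclose(A @ A_pinv @ A, A)

# Equivalently, A A^+ maps any vector of the form A x to itself.
x = rng.standard_normal(3)
v = A @ x
assert np.allclose(A @ A_pinv @ v, v)
```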


References

  1. Abbeel, P. and Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. In ICML.
  2. Mac Aodha, O., Su, S., Chen, Y., Perona, P., and Yue, Y. (2018). Teaching categories to human learners with visual explanations. In CVPR.
  3. Brown, D. S. and Niekum, S. (2018). Machine teaching for inverse reinforcement learning: Algorithms and applications. CoRR, abs/1805.07687.
  4. Cakmak, M., Lopes, M., et al. (2012). Algorithmic and human teaching of sequential decision tasks. In AAAI.
  5. Cakmak, M. and Thomaz, A. L. (2014). Eliciting good teaching from humans for machine learners. Artificial Intelligence, 217:198–215.
  6. Chen, Y., Singla, A., Mac Aodha, O., Perona, P., and Yue, Y. (2018). Understanding the role of adaptivity in machine teaching: The case of version space learners. In NIPS.
  7. Hunziker, A., Chen, Y., Mac Aodha, O., Gomez-Rodriguez, M., Krause, A., Perona, P., Yue, Y., and Singla, A. (2018). Teaching multiple concepts to a forgetful learner. CoRR, abs/1805.08322.
  8. Liu, W., Dai, B., Humayun, A., Tay, C., Yu, C., Smith, L. B., Rehg, J. M., and Song, L. (2017). Iterative machine teaching. In ICML, pages 2149–2158.
  9. Mayer, M., Hamza, J., and Kuncak, V. (2017). Proactive synthesis of recursive tree-to-string functions from examples (artifact). In DARTS-Dagstuhl Artifacts Series, volume 3. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
  10. Mei, S. and Zhu, X. (2015). Using machine teaching to identify optimal training-set attacks on machine learners. In AAAI, pages 2871–2877.
  11. Patil, K. R., Zhu, X., Kopeć, Ł., and Love, B. C. (2014). Optimal teaching for limited-capacity human learners. In NIPS, pages 2465–2473.
  12. Rafferty, A. N., Brunskill, E., Griffiths, T. L., and Shafto, P. (2016). Faster teaching via POMDP planning. Cognitive Science, 40(6):1290–1332.
  13. Sermanet, P., Lynch, C., Chebotar, Y., Hsu, J., Jang, E., Schaal, S., Levine, S., and Brain, G. (2018). Time-contrastive networks: Self-supervised learning from video. In ICRA, pages 1134–1141.
  14. Singla, A., Bogunovic, I., Bartók, G., Karbasi, A., and Krause, A. (2013). On actively teaching the crowd to classify. In NIPS Workshop on Data Driven Education.
  15. Singla, A., Bogunovic, I., Bartók, G., Karbasi, A., and Krause, A. (2014). Near-optimally teaching the crowd to classify. In ICML, pages 154–162.
  16. Stadie, B. C., Abbeel, P., and Sutskever, I. (2017). Third-person imitation learning. CoRR, abs/1703.01703.
  17. Zhu, X. (2015). Machine teaching: An inverse problem to machine learning and an approach toward optimal education. In AAAI, pages 4083–4087.
  18. Zhu, X., Singla, A., Zilles, S., and Rafferty, A. N. (2018). An overview of machine teaching. CoRR, abs/1801.05927.
  19. Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. (2008). Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA.