Teaching Inverse Reinforcement Learners via Features and Demonstrations
Abstract
Learning near-optimal behaviour from an expert’s demonstrations typically relies on the assumption that the learner knows the features that the true reward function depends on. In this paper, we study the problem of learning from demonstrations in the setting where this is not the case, i.e., where there is a mismatch between the worldviews of the learner and the expert. We introduce a natural quantity, the teaching risk, which measures the potential suboptimality of policies that look optimal to the learner in this setting. We show that bounds on the teaching risk guarantee that the learner is able to find a near-optimal policy using standard algorithms based on inverse reinforcement learning. Based on these findings, we suggest a teaching scheme in which the expert can decrease the teaching risk by updating the learner’s worldview, and thus ultimately enable her to find a near-optimal policy.
1 Introduction
Reinforcement learning has recently led to impressive and widely recognized results in several challenging application domains, including gameplay, e.g., of classical games (Go) and Atari games. In these applications, a clearly defined reward function, i.e., whether a game is won or lost in the case of Go or the number of achieved points in the case of Atari games, is optimized by a reinforcement learning agent interacting with the environment.
However, in many applications it is very difficult to specify a reward function that captures all important aspects. For instance, in an autonomous driving application, the reward function of an autonomous vehicle should capture many different desiderata, including the time to reach a specified goal, safe driving characteristics, etc. In such situations, learning from demonstrations can be a remedy, replacing the need to specify the reward function with the task of providing an expert’s demonstrations of desired behaviour; we will refer to this expert as the teacher. Based on these demonstrations, a learning agent, or simply learner, attempts to infer a (stochastic) policy that approximates the feature counts of the teacher’s demonstrations. Examples of algorithms that can be used to that end are those in Abbeel and Ng (2004) and Ziebart et al. (2008), which use inverse reinforcement learning (IRL) to estimate a reward function for which the demonstrated behaviour is optimal, and then derive a policy based on that.
For this strategy to be successful, i.e., for the learner to find a policy that achieves good performance with respect to the reward function set by the teacher, the learner has to know which features the teacher considers and the reward function depends on. However, as we argue, this assumption does not hold in many real-world applications. For instance, in the autonomous driving application, the teacher, e.g., a human driver, might consider very different features, including high-level semantic features, while the learner, i.e., the autonomous car, only has sensory inputs in the form of distance sensors and cameras providing low-level semantic features. In such a case, there is a mismatch between the teacher’s and the learner’s features which can lead to degraded performance and unexpected behaviour of the learner.
In this paper we investigate exactly this setting. We assume that the true reward function is a linear combination of a set of features known to the teacher. The learner also assumes that the reward function is linear, but in features which are different from the truly relevant ones; e.g., the learner could only observe a subset of those features. In this setting, we study the potential decrease in performance of the learner as a function of the learner’s worldview. We introduce a natural and easily computable quantity, the teaching risk, which bounds the maximum possible performance gap of the teacher and the learner.
We continue our investigation by considering a teaching scenario in which the teacher can provide additional features to the learner, e.g., add additional sensors to an autonomous vehicle. This naturally raises the question of which features should be provided to the learner to maximize her performance. To this end, we propose an algorithm that greedily minimizes the teaching risk, thereby shrinking the maximal performance gap between policies optimized with respect to the learner’s worldview and policies optimized with respect to the teacher’s worldview.
Our main contributions are:

We formalize the problem of worldview mismatch for reward computation and policy optimization based on demonstrations.

We introduce the concept of teaching risk, bounding the maximal performance gap of the teacher and the learner as a function of the learner’s worldview and the true reward function.

We formally analyze the teaching risk and its properties, giving rise to an algorithm for teaching a learner with an incomplete worldview.

We substantiate our findings in a large set of experiments.
2 Related Work
Our work is related to the area of algorithmic machine teaching, where the objective is to design effective teaching algorithms to improve the learning process of a learner Zhu et al. (2018); Zhu (2015). Machine teaching has recently been studied in the context of diverse real-world applications such as personalized education and intelligent tutoring systems Hunziker et al. (2018); Rafferty et al. (2016); Patil et al. (2014), social robotics Cakmak and Thomaz (2014), adversarial machine learning Mei and Zhu (2015), program synthesis Mayer et al. (2017), and human-in-the-loop crowdsourcing systems Singla et al. (2014, 2013). However, different from ours, most of the current work in machine teaching is limited to supervised learning settings, and to a setting where the teacher has full knowledge about the learner’s model.
Going beyond supervised learning, Cakmak et al. (2012); Brown and Niekum (2018) have studied the problem of teaching an IRL agent, similar in spirit to what we do in our work. Our work differs from theirs in several aspects: Cakmak et al. (2012) and Brown and Niekum (2018) assume that the teacher has full knowledge of the learner’s feature space, and then provide a near-optimal set of demonstrations in a single batch; we consider a more realistic setting where there is a mismatch between the teacher’s and the learner’s feature space. Furthermore, in our setting, the teaching signal is a mixture of demonstrations and features.
Our work is also related to teaching via explanations and features as explored recently by Aodha et al. (2018) in a supervised learning setting. However, we explore the space of teaching by explanations when teaching an IRL agent, which makes it technically very different from Aodha et al. (2018). Another important aspect of our teaching algorithm is that it is adaptive in nature, in the sense that the next teaching signal accounts for the current performance of the learner (i.e., worldview in our setting). Recent works by Chen et al. (2018) and Liu et al. (2017) have studied adaptive teaching algorithms, but only in a supervised learning setting.
Apart from machine teaching, our work is related to Stadie et al. (2017) and Sermanet et al. (2018), which also study imitation learning problems in which the teacher and the learner view the world differently. However, these two works are technically very different from ours, as we consider the problem of providing teaching signals under worldview mismatch from the perspective of the teacher.
3 The Model
Basic definitions.
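Our environment is described by a Markov decision process $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, D, R, \gamma)$, where $\mathcal{S}$ is a finite set of states, $\mathcal{A}$ is a finite set of available actions, $P$ is a family of distributions on $\mathcal{S}$ indexed by $\mathcal{S} \times \mathcal{A}$ with $P_{s,a}(s')$ describing the probability of transitioning from state $s$ to state $s'$ when action $a$ is taken, $D$ is the initial-state distribution on $\mathcal{S}$ describing the probability of starting in a given state, $R: \mathcal{S} \to \mathbb{R}$ is a reward function and $\gamma \in (0, 1)$ is a discount factor. We assume that there exists a feature map $\phi: \mathcal{S} \to \mathbb{R}^k$ such that the reward function is linear in the features given by $\phi$, i.e.
$$R(s) = \langle w^*, \phi(s) \rangle$$
for some $w^* \in \mathbb{R}^k$ which we assume to satisfy $\|w^*\| = 1$.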
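By a policy we mean a family of distributions $\pi$ on $\mathcal{A}$ indexed by $\mathcal{S}$, where $\pi_s(a)$ describes the probability of taking action $a$ in state $s$. We denote by $\Pi$ the set of all such policies. The performance measure for policies we are interested in is the expected discounted reward $R(\pi) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t)\right]$, where the expectation is taken with respect to the distribution over trajectories $(s_0, s_1, \ldots)$ induced by $\pi$ together with the transition probabilities $P$ and the initial-state distribution $D$. We call a policy $\pi$ optimal for the reward function $R$ if $\pi \in \arg\max_{\pi' \in \Pi} R(\pi')$. Note that
$$R(\pi) = \langle w^*, \mu(\pi) \rangle,$$
where $\mu: \Pi \to \mathbb{R}^k$, $\pi \mapsto \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \phi(s_t)\right]$, is the map taking a policy to its vector of (discounted) feature expectations. Note also that the image $\mu(\Pi)$ of this map is a bounded subset of $\mathbb{R}^k$ due to the finiteness of $\mathcal{S}$ and the presence of the discounting factor $\gamma$; we denote by $\operatorname{diam} \mu(\Pi) := \sup_{\pi, \pi' \in \Pi} \|\mu(\pi) - \mu(\pi')\|$ its diameter. Here and in what follows, $\|\cdot\|$ denotes the Euclidean norm.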
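As a minimal sketch of how the feature expectations of a policy can be computed in a finite MDP, the following NumPy snippet solves the linear system for the discounted state occupancy and then weights the features by it. The array layout (`P[s, a, s2]`, `policy[s, a]`, rows of `phi` as feature vectors) and the toy numbers are assumptions made for illustration only, not part of the paper.

```python
import numpy as np

def feature_expectations(P, policy, D, phi, gamma):
    """Discounted feature expectations E[sum_t gamma^t phi(s_t)].

    P:      (S, A, S) array, P[s, a, s2] = transition probability (assumed layout)
    policy: (S, A) array, policy[s, a] = probability of action a in state s
    D:      (S,) initial-state distribution
    phi:    (S, k) feature matrix, row s is the feature vector of state s
    """
    n_states = P.shape[0]
    # State-to-state transition matrix induced by the policy.
    P_pi = np.einsum("sa,sat->st", policy, P)
    # The discounted occupancy d solves d = D + gamma * P_pi^T d.
    d = np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, D)
    return phi.T @ d

# Hypothetical 2-state chain: a single action that always moves to state 1.
P = np.zeros((2, 1, 2))
P[:, 0, 1] = 1.0
policy = np.ones((2, 1))
D = np.array([1.0, 0.0])
phi = np.eye(2)
gamma = 0.5
w_star = np.array([0.0, 1.0])  # unit-norm reward weights

mu = feature_expectations(P, policy, D, phi, gamma)
performance = float(w_star @ mu)  # expected discounted reward of the policy
```

Because the reward is linear in the features, the performance of a policy is obtained from its feature expectations by a single inner product with the reward weights.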
Problem formulation.
We consider two agents, a learner $L$ and a teacher $T$, whose ultimate objective is that $L$ finds a near-optimal policy with the help of $T$.
The challenge we address in this paper is that of achieving this objective under the assumption that there is a mismatch between the worldviews of $L$ and $T$, by which we mean the following: Instead of the “true” feature vectors $\phi(s)$, $L$ observes feature vectors $A^L \phi(s)$, where
$$A^L: \mathbb{R}^k \to \mathbb{R}^\ell$$
is a linear map (i.e., a matrix) that we interpret as $L$’s worldview. The simplest case is that $A^L$ selects a subset of the features given by $\phi(s)$, thus modelling the situation where $L$ only has access to a subset of the features relevant for the true reward, which is a reasonable assumption for many real-world situations. More generally, $A^L$ could encode different weightings of those features.
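To illustrate the simplest kind of worldview, the sketch below builds a matrix that selects a subset of the true features and applies it to one feature vector. The helper name `selection_worldview` and the numbers are hypothetical and serve only as an example.

```python
import numpy as np

def selection_worldview(k, observed):
    """Worldview matrix whose rows pick out the observed feature indices."""
    return np.eye(k)[list(observed)]

# Hypothetical example: 4 true features, the learner sees features 0 and 2.
A_L = selection_worldview(4, [0, 2])
phi_s = np.array([0.3, -1.0, 2.0, 0.5])
learner_view = A_L @ phi_s  # the feature vector as the learner observes it
```

A worldview with different feature weightings would correspond to replacing the rows of the identity by arbitrary row vectors.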
The question we ask is whether and how $T$ can provide demonstrations or perform other teaching interventions in such a way that $L$ achieves the goal of finding a policy with near-optimal performance, or at least gets as close to that goal as possible.
Assumptions on the teacher and on the learner.
We assume that $T$ knows the full specification of the MDP as well as $L$’s worldview $A^L$, and that she can help $L$ to learn in two different ways:

By providing $L$ with demonstrations of behaviour in the MDP;

By updating $L$’s worldview $A^L$.
Demonstrations can be provided in the form of trajectories sampled from a (not necessarily optimal) policy $\pi^T$, or in the form of feature expectations of such a policy. The method by which $T$ can update $A^L$ will be discussed in Section 5. Based on $T$’s instructions, $L$ then attempts to train a policy $\pi^L$ whose feature expectations approximate those of $\pi^T$. Note that, if this is successful, the performance of $\pi^L$ is close to that of $\pi^T$ due to the form of the reward function.
We assume that $L$ has access to an algorithm that enables her to do the following: Whenever she is given sufficiently many demonstrations sampled from a policy $\pi^T$, she is able to find a policy $\pi^L$ whose feature expectations in her worldview approximate those of $\pi^T$, i.e., $A^L \mu(\pi^L) \approx A^L \mu(\pi^T)$. Examples of algorithms that $L$ could use to that end are the algorithms in Abbeel and Ng (2004) and Ziebart et al. (2008), which are based on IRL. The following discussion does not require any further specification of what precise algorithm $L$ uses in order to match feature expectations.
Challenges when teaching under worldview mismatch.
If there were no mismatch in the worldview (i.e., if $A^L$ were the identity matrix in $\mathbb{R}^{k \times k}$), then the teacher could simply provide demonstrations from the optimal policy to achieve the desired objective. However, the example in Figure 1 illustrates that this is not the case when there is a mismatch between the worldviews.
For the MDP in Figure 1, assume that the teacher provides demonstrations using , which moves to the rightmost cell as quickly as possible and then alternates between cells 4 and 5 (see Figure 1). Note that the policy which moves to the leftmost cell as quickly as possible and then alternates between cells 1 and 2, has the same feature expectations as in the learner’s worldview; in fact, is the unique policy other than with that property (provided we restrict to deterministic policies). As the teacher is unaware of the internal workings of the learner, she has no control over which of these two policies the learner will eventually learn by matching feature expectations.
However, the teacher can ensure that the learner achieves a better performance in a worst case sense by providing demonstrations tailored to the problem at hand. In particular, assume that the teacher uses , the policy shown in Figure 1, which moves to the central cell as quickly as possible and then alternates between cells 3 and 4. The only other policy with which the learner could match the feature expectations of in her worldview (restricting again to deterministic policies) is the one that moves to the central cell as quickly as possible and then alternates between states 2 and 3.
Note that , and hence is a better teaching policy than regarding the performance that a learner matching feature expectations in her worldview achieves in the worst case. In particular, this example shows that providing demonstrations from the truly optimal policy does not guarantee that the learner’s policy achieves good performance in general.
4 Teaching Risk
Definition of teaching risk.
The fundamental problem in the setting described in Section 3 is that two policies that perform equally well with respect to any estimate that $L$ may have of the reward function may perform very differently with respect to the true reward function. Hence, even if $L$ is able to imitate the behaviour of the teacher well in her worldview, there is generally no guarantee on how good her performance is with respect to the true reward function. For an illustration, see Figure 2.
To address this problem, we define the following quantity: The teaching risk for a given worldview $A^L$ with respect to reward weights $w^*$ is
$$\rho(A^L; w^*) := \max_{v \in \ker A^L,\ \|v\| \le 1} \langle w^*, v \rangle. \tag{1}$$
Here $\ker A^L$ and $\ker w^*$ denote the kernels of $A^L$ and of the linear functional $\langle w^*, \cdot \rangle$, respectively. Geometrically, $\rho(A^L; w^*)$ is the cosine of the angle between $\ker A^L$ and $w^*$; in other words, $\rho(A^L; w^*)$ measures the degree to which $A^L$ deviates from satisfying $\ker A^L \subset \ker w^*$.
Significance of teaching risk.
To understand the significance of the teaching risk in our context, assume that $L$ is able to find a policy $\pi^L$ which matches the feature expectations of $T$’s (not necessarily optimal) policy $\pi^T$ perfectly in her worldview, which is equivalent to $\mu(\pi^T) - \mu(\pi^L) \in \ker A^L$. Directly from the definition of the teaching risk, we see that the gap between their performances with respect to the true reward function satisfies
$$|R(\pi^T) - R(\pi^L)| = |\langle w^*, \mu(\pi^T) - \mu(\pi^L) \rangle| \le \rho(A^L; w^*) \cdot \|\mu(\pi^T) - \mu(\pi^L)\|, \tag{2}$$
with equality if $\mu(\pi^T) - \mu(\pi^L)$ is proportional to a vector $v$ realizing the maximum in (1). If the teaching risk is large, this performance gap can generally be large as well. This motivates the interpretation of $\rho(A^L; w^*)$ as a measure of the risk when teaching the task modelled by an MDP with reward weights $w^*$ to a learner whose worldview is represented by $A^L$.
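As a small worked example (an illustration under assumed notation, writing $\rho$ for the teaching risk, $A^L$ for the learner's worldview, $w^*$ for the unit reward weights and $\mu$ for feature expectations), consider two true features of which the learner observes only the first:

```latex
% Assumed setup: k = 2, worldview A^L = (1 \;\; 0), reward weights w^* = (\cos\theta, \sin\theta).
\ker A^L = \operatorname{span}\{e_2\}, \qquad
\rho(A^L; w^*) = \max_{|t| \le 1} \langle w^*, t\, e_2 \rangle = |\sin\theta| .
% Two policies whose feature expectations differ by \Delta e_2 look identical to the learner, yet
|R(\pi^T) - R(\pi^L)| = |\langle w^*, \Delta e_2 \rangle| = |\sin\theta| \, |\Delta|
  = \rho(A^L; w^*) \, \|\mu(\pi^T) - \mu(\pi^L)\| .
```

For $\theta = 0$ the unobserved feature carries no reward and the risk vanishes; for $\theta = \pi/2$ the reward depends only on the unobserved feature and the risk attains its maximal value $1$.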
On the other hand, smallness of the teaching risk implies that this performance gap cannot be too large. The following theorem generalizes the bound in (2) to the situation in which $\pi^L$ only approximates the feature expectations of $\pi^T$ in $L$’s worldview.
Theorem 1.
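Assume that $\|A^L (\mu(\pi^T) - \mu(\pi^L))\| < \varepsilon$. Then the gap between the true performances of $\pi^L$ and $\pi^T$ satisfies
$$|R(\pi^T) - R(\pi^L)| < \frac{\varepsilon}{\sigma(A^L)} + \rho(A^L; w^*) \cdot \operatorname{diam} \mu(\Pi)$$
with $\sigma(A^L) := \min_{v \perp \ker A^L,\ \|v\| = 1} \|A^L v\|$.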
Theorem 1 shows the following: If $L$ imitates $T$’s behaviour well in her worldview (meaning that $\varepsilon$ can be chosen small) and if the teaching risk $\rho(A^L; w^*)$ is sufficiently small, then $L$ will perform nearly as well as $T$ with respect to the true reward. In particular, if $T$’s policy is optimal, $\pi^T = \pi^*$, then $L$’s policy $\pi^L$ is guaranteed to be near-optimal.
The quantity $\sigma(A^L)$ appearing in Theorem 1 is a bound on the amount to which $A^L$ distorts lengths of vectors in the orthogonal complement of $\ker A^L$. Note that $\sigma(A^L)$ is independent of the teaching risk, in the sense that one can change it, e.g., by rescaling $A^L \mapsto \lambda A^L$ by some $\lambda > 0$, without changing the teaching risk.
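The following sketch checks numerically, on random data, a bound of the form suggested by the proof in Appendix A: for any difference vector, the true gap is at most the observed discrepancy divided by the distortion constant, plus the teaching risk times the length of the vector. It assumes that the distortion constant is the smallest nonzero singular value of the worldview matrix and that the teaching risk equals the norm of the projection of the reward weights onto the kernel; all function names are hypothetical.

```python
import numpy as np

def teaching_risk(A, w, tol=1e-10):
    """Norm of the projection of w onto ker(A), via the SVD of A."""
    _, s, Vt = np.linalg.svd(A)
    rank = int(np.sum(s > tol))
    N = Vt[rank:]  # rows form an orthonormal basis of ker(A)
    return float(np.linalg.norm(N @ w))

def sigma(A, tol=1e-10):
    """Smallest nonzero singular value of A (assumed distortion constant)."""
    s = np.linalg.svd(A, compute_uv=False)
    return float(s[s > tol].min())

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 5))      # random worldview with a 3-dimensional kernel
w = rng.normal(size=5)
w /= np.linalg.norm(w)           # unit-norm reward weights
rho, sig = teaching_risk(A, w), sigma(A)

# Random stand-ins for differences of feature expectations.
deltas = rng.normal(size=(1000, 5))
gaps = np.abs(deltas @ w)
bounds = (np.linalg.norm(deltas @ A.T, axis=1) / sig
          + rho * np.linalg.norm(deltas, axis=1))
bound_holds = bool(np.all(gaps <= bounds + 1e-9))

# Rescaling the worldview changes sigma but not the teaching risk.
rho_scaled, sig_scaled = teaching_risk(3.0 * A, w), sigma(3.0 * A)
```

The last two lines illustrate the remark on rescaling: the kernel of the worldview, and hence the teaching risk, is unaffected by a positive scalar factor.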
Teaching risk as obstruction to recognizing optimality.
We now provide a second motivation for the consideration of the teaching risk, by interpreting it as a quantity that measures the degree to which truly optimal policies deviate from looking optimal to $L$. We make the technical assumption that $\mu(\Pi)$ is the closure of a bounded open set with smooth boundary (this will only be needed for the proofs). Our first observation is the following:
Proposition 1.
Let $\pi^*$ be a policy which is optimal for $R$. If $\rho(A^L; w^*) > 0$, then $\pi^*$ is suboptimal with respect to any choice of reward function $R^L(s) = \langle w^L, A^L \phi(s) \rangle$ with $w^L \in \mathbb{R}^\ell$.
In view of Proposition 1, a natural question is whether we can bound the suboptimality, in $L$’s view, of a truly optimal policy in terms of the teaching risk. The following theorem provides such a bound:
Theorem 2.
Let be a policy which is optimal for . There exists a unit vector such that
where .
Note that the expression on the right hand side of the inequality in Theorem 2 tends to as , provided is bounded. Theorem 2 therefore implies that, if is small, a truly optimal policy is near-optimal for some choice of reward function linear in the features observes, namely, the reward function with the vector whose existence is claimed by the theorem.
5 Teaching
Feature teaching.
The discussion in the last section shows that, under our assumptions on how $L$ learns, a teaching scheme in which $T$ solely provides demonstrations to $L$ can generally, i.e., without any assumption on the teaching risk, not lead to reasonable guarantees on the learner’s performance with respect to the true reward. A natural strategy is to introduce additional teaching operations by which the teacher can update $L$’s worldview $A^L$ and thereby decrease the teaching risk.
The simplest way by which the teacher can change $L$’s worldview is by informing her about features $f \in \mathbb{R}^k$ that are relevant to performing well in the task, thus causing her to update her worldview to
$$A^L \oplus f: \mathbb{R}^k \to \mathbb{R}^{\ell + 1}, \quad v \mapsto (A^L v, \langle f, v \rangle).$$
Viewing $A^L$ as a matrix, this operation appends $f$ as a row to $A^L$. (Strictly speaking, the feature that is thus provided is $s \mapsto \langle f, \phi(s) \rangle$; we identify this map with the vector $f$ in the following and thus keep calling $f$ a “feature”.)
This operation has simple interpretations in the settings we are interested in: If $L$ is a human learner, “teaching a feature” could mean making $L$ aware that a certain quantity, which she might not have taken into account so far, is crucial to achieving high performance. If $L$ is a machine, such as an autonomous car or a robot, it could mean installing an additional sensor.
Teachable features.
Note that if $T$ could provide arbitrary vectors $f \in \mathbb{R}^k$ as new features, she could always, no matter what $A^L$ is, decrease the teaching risk to zero in a single teaching step by choosing $f = w^*$, which amounts to telling $L$ the true reward function. We assume that this is not possible, and that instead only the elements of a fixed finite set of teachable features
$$\mathcal{F} = \{f_1, \ldots, f_n\} \subset \mathbb{R}^k$$
can be taught. In real-world applications, such constraints could come from the limited availability of sensors and their costs; in the case that $L$ is a human, they could reflect the requirement that features need to be interpretable, i.e., that they can only be simple combinations of basic observable quantities.
Greedy minimization of teaching risk.
Our basic teaching algorithm TRGreedy (Algorithm 1) works as follows: $T$ and $L$ interact in rounds, in each of which $T$ provides $L$ with the feature $f \in \mathcal{F}$ which reduces the teaching risk of $L$’s worldview with respect to $w^*$ by the largest amount. $L$ then trains a policy $\pi^L$ with the goal of imitating her current view of the feature expectations of the teacher’s policy; the Learning algorithm she uses could be the apprenticeship learning algorithm from Abbeel and Ng (2004).
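The feature-selection part of this greedy scheme can be sketched as follows. This is an illustration only, not a reproduction of Algorithm 1: the learner's training step is omitted, the teaching risk is assumed to be the norm of the projection of the reward weights onto the kernel of the worldview matrix, the worldview is assumed to have at least one row, and all names are hypothetical.

```python
import numpy as np

def teaching_risk(A, w, tol=1e-10):
    """Norm of the projection of w onto ker(A), via the SVD of A."""
    _, s, Vt = np.linalg.svd(A)
    rank = int(np.sum(s > tol))
    N = Vt[rank:]  # rows form an orthonormal basis of ker(A)
    return float(np.linalg.norm(N @ w))

def tr_greedy_features(A, w, teachable, n_rounds):
    """Each round, append the teachable feature that minimises the teaching risk."""
    remaining = list(range(len(teachable)))
    taught, risks = [], []
    for _ in range(n_rounds):
        if not remaining:
            break
        best = min(remaining,
                   key=lambda i: teaching_risk(np.vstack([A, teachable[i]]), w))
        A = np.vstack([A, teachable[best]])  # worldview update: append a row
        remaining.remove(best)
        taught.append(best)
        risks.append(teaching_risk(A, w))
        # In the full scheme the learner would now train a policy on
        # demonstrations seen through the updated worldview (omitted here).
    return A, taught, risks

# Hypothetical example: unit vectors as teachable features.
w = np.array([0.1, 0.2, 0.9, 0.3])
w /= np.linalg.norm(w)
A0 = np.array([[1.0, 0.0, 0.0, 0.0]])  # learner initially sees feature 0 only
A_final, taught, risks = tr_greedy_features(A0, w, np.eye(4), n_rounds=3)
```

In this toy instance the features are taught in order of decreasing reward weight among those not yet observed, and the teaching risk reaches zero once all relevant coordinates are visible to the learner.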
Computation of the teaching risk.
The computation of the teaching risk required of $T$ in every round of Algorithm 1 can be performed as follows: One first computes the orthogonal complement of $\ker A^L \cap \ker w^*$ in $\mathbb{R}^k$ and intersects that with $\ker A^L$, thus obtaining (generically) a 1-dimensional subspace $\lambda$ of $\ker A^L$; this can be done using SVD. The teaching risk is then $\rho(A^L; w^*) = \langle w^*, v \rangle$ with $v$ the unique unit vector in $\lambda$ with $\langle w^*, v \rangle > 0$.
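One way to carry out such a computation with an SVD is sketched below. It uses an equivalent shortcut rather than the two-step construction described above: the reward weights are projected onto the kernel of the worldview matrix, the norm of the projection is the teaching risk, and the normalised projection is a unit vector in the kernel attaining it. Function and variable names are hypothetical.

```python
import numpy as np

def teaching_risk_with_witness(A, w, tol=1e-10):
    """Return (risk, v): v is a unit vector in ker(A) attaining the risk,
    or None when the risk is (numerically) zero."""
    _, s, Vt = np.linalg.svd(A)
    rank = int(np.sum(s > tol))
    N = Vt[rank:]                # rows: orthonormal basis of ker(A)
    p = N.T @ (N @ w)            # orthogonal projection of w onto ker(A)
    risk = float(np.linalg.norm(p))
    v = p / risk if risk > tol else None
    return risk, v

# Hypothetical example: the learner observes only the first of three features.
A = np.array([[1.0, 0.0, 0.0]])
w = np.array([0.6, 0.0, 0.8])    # unit-norm reward weights
risk, v = teaching_risk_with_witness(A, w)
```

Here the kernel is spanned by the second and third coordinate directions, so the projection of the reward weights is `(0, 0, 0.8)` and the teaching risk is `0.8`.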
6 Experiments
Our experimental setup is similar to the one in Abbeel and Ng (2004), i.e., we use gridworlds in which non-overlapping square regions of neighbouring cells are grouped together to form macrocells for some dividing . The state set is the set of gridpoints, the action set is , and the feature map maps a gridpoint belonging to macrocell to the one-hot vector ; the dimension of the “true” feature space is therefore . Note that these gridworlds satisfy the quite special property that for states , we either have (if belong to the same macrocell), or . The reward weights are sampled randomly for all experiments unless mentioned otherwise. As the Learning algorithm within Algorithm 1, we use the projection version of the apprenticeship learning algorithm from Abbeel and Ng (2004).
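A one-hot macrocell feature map of the kind described above can be sketched as follows. The row-major state indexing, the function name and the grid sizes in the example are assumptions made for illustration; they are not the settings used in the experiments.

```python
import numpy as np

def macrocell_features(grid_size, cell_size):
    """One-hot macrocell features for a grid_size x grid_size gridworld.

    States are indexed row-major (an assumption); each state is mapped to the
    one-hot vector of the cell_size x cell_size macrocell containing it.
    """
    assert grid_size % cell_size == 0, "cell_size must divide grid_size"
    per_side = grid_size // cell_size
    phi = np.zeros((grid_size * grid_size, per_side ** 2))
    for row in range(grid_size):
        for col in range(grid_size):
            state = row * grid_size + col
            macro = (row // cell_size) * per_side + (col // cell_size)
            phi[state, macro] = 1.0
    return phi

phi = macrocell_features(grid_size=4, cell_size=2)  # 16 states, 4 macrocells
gram = phi @ phi.T  # inner products between the feature vectors of all state pairs
```

The Gram matrix makes the special structure visible: the inner product of two feature vectors is 1 if the states share a macrocell and 0 otherwise.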
Performance vs. teaching risk.
The plots in Figure 3 illustrate the significance of the teaching risk for the problem of teaching a learner under worldview mismatch. To obtain these plots, we used a gridworld with , ; for each value , we sampled five random worldview matrices , and let train a policy using the projection algorithm in Abbeel and Ng (2004), with the goal of matching the feature expectations corresponding to an optimal policy for a reward vector that was sampled randomly in each round. Each point in the plots corresponds to one such experiment and shows the relative performance of after the training round vs. the teaching risk of ’s worldview matrix .
All plots in Figure 3 show that the variance of the learner’s performance decreases as the teaching risk decreases. This supports our interpretation of the teaching risk as a measure of the potential gap between the performances of $L$ and $T$ when $L$ matches the feature expectations of $T$ in her worldview. The plots also show that the bound for this gap provided in Theorem 1 is overly conservative in general, given that $L$’s performance is often high and has small variance even if the teaching risk is relatively large.
The plots indicate that for larger $\gamma$ (i.e., less discounting), it is easier for $L$ to achieve high performance even if the teaching risk is large. This makes intuitive sense: If there is a lot of discounting, it is important to reach high reward states quickly in order to perform well, which necessitates being able to recognize where these states are located, which in turn requires the teaching risk to be small. If there is little discounting, it is sufficient to know the location of some maybe distant reward state, and hence even a learner with a very deficient worldview (i.e., high teaching risk) can do well in that case.
Small gridworlds with high reward states and obstacles.
We tested TRGreedy (Algorithm 1) on gridworlds such as the one in Figure 4, with a small number of states with high positive rewards, some obstacle states with high negative rewards, and all other states having rewards close to zero. The histograms in Figure 4 show how often each of the features was selected by the algorithm as the first and as the second feature to be taught to the learner in 100 experiments, in each of which the learner was initialized with a random 5-dimensional worldview. In most cases, the algorithm first selected the features corresponding to one of the high reward cells 4 and 9 or to one of the obstacle cells 18 and 23, which are clearly those that the learner must be most aware of in order to achieve high performance.
Comparison of algorithms.
We compared the performance of TRGreedy (Algorithm 1) to two variants of the algorithm which differ only in how the features to be taught in each round are selected: The first variant, Random, simply selects a random feature from the set of all teachable features. The second variant, PerfGreedy, greedily selects the feature that will lead to the best performance in the next round among all teachable features (computed by simulating the teaching process for each feature and evaluating the corresponding learner).
The plots in Figure 5 show, for each of the three algorithms, the relative performance with respect to the true reward function that the learner achieved after each round of feature teaching and training a policy , as well as the corresponding teaching risks and runtimes, plotted over the number of features taught. The relative performance of the learner’s policy was computed as .
We observed in all our experiments that TRGreedy performed significantly better than Random. While the comparison between TRGreedy and PerfGreedy was slightly in favour of the latter, one should note that a teacher $T$ running PerfGreedy must simulate a learning round of $L$ for all features not yet taught, which presupposes that $T$ knows $L$’s learning algorithm, and which also leads to very high runtime. If $T$ only knows that $L$ is able to match (her view of) the feature expectations of $T$’s demonstrations and simulates $L$ using some algorithm capable of this, there is no guarantee that $L$ will perform as well as $T$’s simulated counterpart, as there may be a large discrepancy between the true performances of two policies which in $L$’s view have the same feature expectations. In contrast, TRGreedy relies on much less information, namely the kernel of $A^L$, and in particular is agnostic to the precise learning algorithm that $L$ uses to approximate feature counts.
7 Conclusions and Outlook
We presented an approach to dealing with the problem of worldview mismatch in situations in which a learner attempts to find a policy matching the feature counts of a teacher’s demonstrations. We introduced the teaching risk, a quantity that depends on the worldview of the learner and the true reward function and which (1) measures the degree to which policies which are optimal from the point of view of the learner can be suboptimal from the point of view of the teacher, and (2) is an obstruction for truly optimal policies to look optimal to the learner. We showed that under the condition that the teaching risk is small, a learner matching feature counts using e.g. standard IRL-based methods is guaranteed to learn a near-optimal policy from demonstrations of the teacher even under worldview mismatch.
Based on these findings, we presented our teaching algorithm TRGreedy, in which the teacher updates the learner’s worldview by teaching her features which are relevant for the true reward function in a way that greedily minimizes the teaching risk, and then provides her with demonstrations based on which she learns a policy using any suitable algorithm. We tested our algorithm in gridworld settings and compared it to other ways of selecting features to be taught. Experimentally, we found that TRGreedy performed comparably to a variant which selected features based on greedily maximizing performance, and consistently better than a variant with randomly selected features.
We plan to investigate extensions of our ideas to nonlinear settings and to test them in more complex environments in future work. We hope that, ultimately, such extensions will be applicable in realworld scenarios, for example in systems in which human expert knowledge is represented as a reward function, and where the goal is to teach this expert knowledge to human learners.
Appendix A Proof of Theorem 1
Proof of Theorem 1.
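Denote by $\pi_\perp$ the orthogonal projection onto $(\ker A^L)^\perp$ and let $\Delta := \mu(\pi^T) - \mu(\pi^L)$. Note that we have $A^L \Delta = A^L \pi_\perp \Delta$ and $\Delta - \pi_\perp \Delta \in \ker A^L$. It follows that
$$\|\pi_\perp \Delta\| \le \frac{\|A^L \pi_\perp \Delta\|}{\sigma(A^L)} = \frac{\|A^L \Delta\|}{\sigma(A^L)} < \frac{\varepsilon}{\sigma(A^L)},$$
using the definition of $\sigma(A^L)$, the fact that $A^L \Delta = A^L \pi_\perp \Delta$, and the assumption that $\|A^L \Delta\| < \varepsilon$. We then obtain
$$|R(\pi^T) - R(\pi^L)| = |\langle w^*, \Delta \rangle| \le |\langle w^*, \pi_\perp \Delta \rangle| + |\langle w^*, \Delta - \pi_\perp \Delta \rangle| \le \|\pi_\perp \Delta\| + \rho(A^L; w^*) \cdot \|\Delta - \pi_\perp \Delta\| < \frac{\varepsilon}{\sigma(A^L)} + \rho(A^L; w^*) \cdot \operatorname{diam} \mu(\Pi),$$
using the triangle inequality, the Cauchy-Schwarz inequality together with $\|w^*\| = 1$ and the definition of $\rho(A^L; w^*)$, and the estimates above, where $\|\Delta - \pi_\perp \Delta\| \le \|\Delta\| \le \operatorname{diam} \mu(\Pi)$. ∎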
Appendix B Proof of Proposition 1
Proof of Proposition 1.
As mentioned in the main text, we assume that the set is the closure of a bounded open set and has a smooth boundary .
Note that for a policy to be optimal with respect to , the following conditions need to be satisfied:

lies in the boundary ,

is normal to at .
The second statement is equivalent to saying that the tangent space to at is .
Assume now that . This is equivalent to saying that , i.e., to saying that is not tangent to at . That implies that there exist some such that is contained in the interior of , which means that a sufficiently small ball around is contained in . In particular, a small ball around in the affine space is entirely contained in . This implies that is contained in the interior of , i.e., not in the boundary . Therefore is suboptimal with respect to any choice of reward function with . ∎
Appendix C Proof of Theorem 2
Proof of Theorem 2.
The assumption that is optimal for the reward function implies that for all . By decomposing as , where denotes the orthogonal projection onto and the orthogonal projection onto , we obtain
(3) 
The first summand can be bounded as follows:
(4) 
using the Cauchy-Schwarz inequality and the fact that . By combining estimates (3) and (4), we obtain
(5) 
References
 Abbeel, P. and Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. In ICML.
 Aodha, O. M., Su, S., Chen, Y., Perona, P., and Yue, Y. (2018). Teaching categories to human learners with visual explanations. In CVPR.
 Brown, D. S. and Niekum, S. (2018). Machine teaching for inverse reinforcement learning: Algorithms and applications. CoRR, abs/1805.07687.
 Cakmak, M., Lopes, M., et al. (2012). Algorithmic and human teaching of sequential decision tasks. In AAAI.
 Cakmak, M. and Thomaz, A. L. (2014). Eliciting good teaching from humans for machine learners. Artificial Intelligence, 217:198–215.
 Chen, Y., Singla, A., Mac Aodha, O., Perona, P., and Yue, Y. (2018). Understanding the role of adaptivity in machine teaching: The case of version space learners. In NIPS.
 Hunziker, A., Chen, Y., Mac Aodha, O., Gomez-Rodriguez, M., Krause, A., Perona, P., Yue, Y., and Singla, A. (2018). Teaching multiple concepts to a forgetful learner. CoRR, abs/1805.08322.
 Liu, W., Dai, B., Humayun, A., Tay, C., Yu, C., Smith, L. B., Rehg, J. M., and Song, L. (2017). Iterative machine teaching. In ICML, pages 2149–2158.
 Mayer, M., Hamza, J., and Kuncak, V. (2017). Proactive synthesis of recursive tree-to-string functions from examples (artifact). In DARTS-Dagstuhl Artifacts Series, volume 3. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
 Mei, S. and Zhu, X. (2015). Using machine teaching to identify optimal training-set attacks on machine learners. In AAAI, pages 2871–2877.
 Patil, K. R., Zhu, X., Kopeć, Ł., and Love, B. C. (2014). Optimal teaching for limited-capacity human learners. In NIPS, pages 2465–2473.
 Rafferty, A. N., Brunskill, E., Griffiths, T. L., and Shafto, P. (2016). Faster teaching via POMDP planning. Cognitive Science, 40(6):1290–1332.
 Sermanet, P., Lynch, C., Chebotar, Y., Hsu, J., Jang, E., Schaal, S., Levine, S., and Brain, G. (2018). Time-contrastive networks: Self-supervised learning from video. In ICRA, pages 1134–1141.
 Singla, A., Bogunovic, I., Bartók, G., Karbasi, A., and Krause, A. (2013). On actively teaching the crowd to classify. In NIPS Workshop on Data Driven Education.
 Singla, A., Bogunovic, I., Bartók, G., Karbasi, A., and Krause, A. (2014). Near-optimally teaching the crowd to classify. In ICML, pages 154–162.
 Stadie, B. C., Abbeel, P., and Sutskever, I. (2017). Third-person imitation learning. CoRR, abs/1703.01703.
 Zhu, X. (2015). Machine teaching: An inverse problem to machine learning and an approach toward optimal education. In AAAI, pages 4083–4087.
 Zhu, X., Singla, A., Zilles, S., and Rafferty, A. N. (2018). An overview of machine teaching. CoRR, abs/1801.05927.
 Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. (2008). Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA.