Including Uncertainty when Learning from Human Corrections
Abstract
It is difficult for humans to efficiently teach robots how to correctly perform a task. One intuitive solution is for the robot to iteratively learn the human’s preferences from corrections, where the human improves the robot’s current behavior at each iteration. When learning from corrections, we argue that while the robot should estimate the most likely human preferences, it should also know what it does not know, and integrate this uncertainty when making decisions. We advance the state-of-the-art by introducing a Kalman filter for learning from corrections: this approach also maintains the uncertainty of the estimated human preferences. Next, we demonstrate how uncertainty can be leveraged for active learning and risk-sensitive deployment. Our results indicate that maintaining and leveraging uncertainty leads to faster learning from human corrections.
Keywords: human-robot interaction (HRI), inverse reinforcement learning (IRL)
1 Introduction
While robots can be preprogrammed by an expert designer to execute a wide range of behaviors, each robot user has different preferences for how their robot should behave. Recent work has focused on learning the human end-user’s preferences from corrections: here the robot shows the human how it is preprogrammed to perform the task, and the human corrects the robot’s behavior to suit their personal preferences. Importantly, these corrections do not need to be perfect; instead, a human correction is simply a noisy improvement of the robot’s current behavior.
Consider a robotic manipulator carrying a cup of coffee for its human end-user. The robot knows to avoid obstacles, but is not sure about the human’s preferences: e.g., should the robot carry coffee over a laptop, across a table, or avoid both regions? When learning from corrections, the robot shows the human its current estimate of the optimal trajectory. The human then corrects this trajectory (using physical human-robot interaction or a virtual interface) and, for instance, pushes the robot farther away from the laptop. The robot learns iteratively (i.e., online), and updates its understanding of the human’s preferences after each correction. See Fig. 1 for an overview of this process.
Human corrections indicate which preferences are more probable. For instance, pushing a robot away from the table suggests that all preferences which result in the robot moving farther from the table could be the human’s preferences. Within the state-of-the-art, algorithms estimate the most likely human preferences given the human’s corrections [1, 2]. However, these methods miss out on the uncertainty of this estimate: i.e., in practice, the robot may not understand the human’s preferences with much confidence. Our insight is that, because human corrections imply a probability distribution over their preferences, the robot should not only estimate the most likely human preferences from these corrections, but should also recognize which estimates it is not confident about.
Let us return to our example, where the user wants the robot to avoid carrying its coffee over their laptop. Because the human pushed the robot away from their laptop, and the table is nearby, the robot has learned to avoid both laptop and table. If the robot only estimates the most likely human preferences, it will avoid carrying coffee over the table. But the human may actually want the robot to move over the table! A robot which knows the uncertainty of its estimate is confident that it should avoid the laptop, but unsure whether it should avoid the table. We can leverage this uncertainty to elicit informative corrections (that will teach the robot about the human’s table preference) and for risk-sensitive deployment (that moves across the table, or avoids the table entirely).
Contributions. First, we show how iterative inverse reinforcement learning can be performed using a Kalman filter, where human corrections are noisy observations. This approach extends the state-of-the-art to now track the uncertainty over the estimated preferences. Next, we leverage uncertainty within our setting to actively learn from human corrections, so that the robot can elicit corrections that will reduce uncertainty. We also describe how uncertainty can be leveraged for risk-sensitive deployment, i.e., to avoid preferences about which the robot is most uncertain.
2 Related Work
Inverse reinforcement learning. Also known as inverse optimal control, IRL attempts to recover an agent’s objective function from demonstrations which are optimal [3, 4, 5]. In practice, however, it is challenging for humans to provide optimal demonstrations: consider an end-user trying to guide the motion of a multi-degree-of-freedom (DoF) robotic manipulator [6]. One solution is presented by probabilistic IRL approaches [7, 8], which assume that the human is noisily optimal, and maintain a distribution over the space of possible reward functions.
Alternatively, the robot can learn from human corrections. At each iteration the robot maximizes its current estimate of the human’s objective, and the human responds by slightly improving, or correcting, the robot’s behavior. The human’s corrections do not need to be optimal. Shivaswamy and Joachims [2] model learning from human corrections as Coactive Learning, and derive online IRL algorithms which are similar to [1]. In particular, the Preference Perceptron from [2] has been applied to robotic manipulators by [9, 10]. While these works learn a maximum a posteriori estimate of the human’s objective, we note that they do not maintain an uncertainty over that estimate.
Active learning. Active learning improves an agent’s learning rate by allowing that agent to select queries: the robot (i.e., learner) chooses queries, which the human (i.e., oracle) labels [11]. Active learning has previously been applied to improve IRL [12]. Most relevant to our research are [13, 14, 15], which learn the human’s objective from preferences. The robot leverages active learning to choose an informative set of candidate policies, and the human then provides their preference—i.e., a ranking or score—to indicate how the policies match their objective. In [13, 14, 15], the robot selects the set of possible corrections for the human to compare; by contrast, in our work the human makes their own corrections to the robot’s current behavior. Consider [6, 9, 10] for example, where the human physically interacts with a robotic manipulator to define their corrected trajectory.
In addition to active learning, we also point out the related topic of algorithmic teaching [16]. Given that the robot is using IRL, for instance, we can apply active teaching to find the best batch of human demonstrations, or the best demonstration at each iteration [17]. Cakmak and Lopes [18] propose one such method, where a teacher selects the start of each demonstrated trajectory to minimize the robot’s uncertainty over the teacher’s objective. Next, the authors leverage the intuition from these algorithms to instruct actual end-users how to provide better demonstrations for an IRL agent.
3 Background
Within this section we will briefly overview iterative IRL from human corrections. We derive the Preference Perceptron for Coactive Learning [2], or, equivalently, online Maximum Margin Planning (MMP) without the loss function [1]. Both approaches can also be thought of as the maximum a posteriori (MAP) estimate of the human’s objective given the human’s corrections [9, 19].
Notation. Consider a robot with state $x \in \mathcal{X}$, action $u \in \mathcal{U}$, and dynamics $\mathcal{T}$. These dynamics define the probability distribution over the robot’s next state given its current state and action: i.e., $\mathcal{T}(x^{t+1} \mid x^t, u^t)$, where $t$ denotes the current timestep. The task ends after $T$ timesteps.
Trajectory and Environment. Although the robot’s state $x$ describes part of the overall system state, there are still aspects of the task not captured by $x$. Accordingly, let us introduce the world state $w$. To give an example, consider a robotic manipulator performing a pick and place task; the robot’s state is the robot’s joint position and velocity, and the world state includes the goal position and the positions of obstacles. When viewed together, the robot and world state form the overall system state $s = (x, w)$. Here we find it advantageous to think about $x$ and $w$ separately:

$\xi$: the sequence of robot states forms the robot’s trajectory, such that $\xi = \{x^0, x^1, \ldots, x^T\}$

$E$: the sequence of world states forms the environment, such that $E = \{w^0, w^1, \ldots, w^T\}$
We can think of $E$ as the “world description” [20, 15] or as the “context” [2]. We will assume that each environment is constant: i.e., the robot’s trajectory $\xi$ cannot alter the environment $E$. It is often useful to augment $E$ with the prior distribution over the robot’s initial state $x^0$.
Reward. The human end-user has in mind a reward function $R$, which determines how they want the robot to behave. Like previous IRL works [3, 5, 1, 7], we will assume that this reward function is a linear combination of features $\phi$ weighted by a parameter vector $\theta$:
$$R(\xi, E; \theta) = \theta \cdot \Phi(\xi, E), \qquad \Phi(\xi, E) = \sum_{t=0}^{T} \phi(x^t, w^t) \tag{1}$$
Here $\Phi(\xi, E)$ denotes the feature counts, i.e., the sum of the features along the trajectory. The features $\phi$ are known by both the human and the robot. Given $\theta$, we have described an instance of a Markov decision process (MDP) [21], which can be solved to find the optimal robot policy that maximizes the human’s reward $R$. In practice, the true reward parameter $\theta$ is known only by the human. The choice of $\theta$ is user-specific: we can think of $\theta$ as encoding the human’s preferences over the robot’s trajectory [10]. Note that $\theta$ lies in a continuous space, and so it is often challenging to use Bayesian IRL [8] to update a belief over $\theta$ (i.e., find the normalizing constant).
Corrections. Previously, we used $t$ to denote a timestep within a task. Now, we will generally treat $t$ as the current iteration. At each iteration $t$, the robot observes an environment $E^t$ and chooses a desired trajectory $\xi_R^t$. The human end-user knows both $E^t$ and $\xi_R^t$, and corrects the robot’s trajectory to $\xi_H^t$. We assume that the corrected trajectory has a higher reward than the original trajectory:
$$R(\xi_H^t, E^t; \theta) > R(\xi_R^t, E^t; \theta) \tag{2}$$
$$\theta \cdot \Phi(\xi_H^t, E^t) > \theta \cdot \Phi(\xi_R^t, E^t) \tag{3}$$
where we have applied the reward function (1). Intuitively, here we are claiming that the human sees the robot’s behavior, and then modifies that behavior so that the robot’s actions better align with their preferences. Notice that the human does not have to correct the robot to the optimal trajectory (i.e., provide an optimal demonstration), but only needs to improve the robot’s trajectory.
Preference Perceptron. Within the setting we have described, the current state-of-the-art algorithm for iteratively learning about $\theta$ from human corrections is the Preference Perceptron [2], summarized here. Consider the following cost function $J(\theta)$, which (as in MMP [1]) expresses the margin between $\xi_H^t$ and $\xi_R^t$:
$$J(\theta) = \theta \cdot \Phi(\xi_R^t, E^t) - \theta \cdot \Phi(\xi_H^t, E^t) \tag{4}$$
Since $J$ is differentiable with respect to $\theta$, we will leverage online gradient descent [22] to get:
$$\hat{\theta}^{t+1} = \hat{\theta}^t - \alpha \nabla_\theta J = \hat{\theta}^t + \alpha \left( \Phi(\xi_H^t, E^t) - \Phi(\xi_R^t, E^t) \right) \tag{5}$$
In the above, $\hat{\theta}$ is the robot’s MAP estimate of the human’s true preferences [9, 19], and $\alpha$ is the learning rate. Intuitively, the update rule (5) maximally decreases the cost $J$ at every iteration, and therefore increases the margin between the human’s corrections and the robot’s demonstrations. We point out that (5) is the Preference Perceptron for Coactive Learning [2].
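As a minimal sketch of the Preference Perceptron step, the snippet below shifts the estimate by the learning rate times the difference between the corrected and original feature counts. The function name, the two features, and the numeric values are illustrative assumptions and do not come from the paper.

```python
import numpy as np


def preference_perceptron_update(theta_hat, phi_human, phi_robot, alpha):
    """One Preference Perceptron step on the preference estimate."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    correction = np.asarray(phi_human, dtype=float) - np.asarray(phi_robot, dtype=float)
    # Gradient step on the margin cost: shift the estimate toward the
    # corrected feature counts and away from the robot's feature counts.
    return theta_hat + alpha * correction


# Hypothetical feature counts: [time spent over laptop, time spent over table].
theta_hat = np.zeros(2)
phi_robot = np.array([3.0, 1.0])  # robot's trajectory passes over the laptop
phi_human = np.array([1.0, 1.0])  # human pushes the robot away from the laptop
theta_next = preference_perceptron_update(theta_hat, phi_human, phi_robot, alpha=0.5)
print(theta_next)
```

In this toy case only the laptop weight changes, because the correction leaves the table feature count untouched.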
Optimal Trajectory. Given $E^t$ and $\hat{\theta}^t$, the robot can identify an optimal trajectory $\xi_R^t$ which maximizes its current estimate of the human’s reward. We obtain this trajectory by solving:
$$\xi_R^t = \arg\max_{\xi} \; \hat{\theta}^t \cdot \Phi(\xi, E^t) \tag{6}$$
For robotic manipulators, a trajectory optimizer such as [23, 24] can be leveraged to solve (6).
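To make (6) concrete without a trajectory optimizer, the sketch below assumes the robot chooses among a small, pre-computed set of candidate trajectories, each summarized by its feature counts. Enumerating candidates is a simplification introduced here for illustration; the names and numbers are hypothetical.

```python
import numpy as np


def best_trajectory(theta_hat, feature_counts):
    """Index of the candidate trajectory with the highest estimated reward."""
    Phi = np.asarray(feature_counts, dtype=float)   # shape (n_candidates, n_features)
    rewards = Phi @ np.asarray(theta_hat, dtype=float)
    return int(np.argmax(rewards)), rewards


# Hypothetical candidates with feature counts [laptop, table].
candidates = [[3.0, 1.0], [1.0, 1.0], [0.0, 4.0]]
index, rewards = best_trajectory([-1.0, 0.5], candidates)
print(index, rewards)
```

With a negative laptop weight and a positive table weight, the candidate that stays over the table and away from the laptop is selected.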
Summary. The robot observes an environment $E^t$ at each iteration $t$. Based on $E^t$ and the robot’s current estimate of $\theta$, the robot solves (6) for a trajectory $\xi_R^t$. The human then corrects the robot’s trajectory, and provides a better trajectory $\xi_H^t$. Finally, the robot updates its MAP estimate of $\theta$ using (5), and the process repeats at the next iteration. We can alternatively think of $\xi_R^t$ as the label which the robot assigns to the input $E^t$, while $\xi_H^t$ is the improved label provided by the human’s correction.
Uncertainty. We recognize that (5) provides the MAP estimate of $\theta$, but this Preference Perceptron does not maintain a probability distribution over $\theta$. Thus, when the robot learns using (5), we do not know the uncertainty of our estimate $\hat{\theta}$. Put another way, the robot falsely believes that it has fully observed the human’s preferences after each correction.
4 Kalman Filter for Inverse Reinforcement Learning
Our first contribution is to recognize that (5), the standard estimation rule for online IRL, can be rewritten as a Kalman filter [25]. We argue that the key advantage to using a Kalman filter is that it not only provides an online estimate similar to (5), but also maintains the uncertainty over this estimate. Here we will explain how to apply an extended Kalman filter for IRL.
Transition Model. Let us model the human’s preferences as constant between iterations, like previous IRL works [3, 5, 1, 7]. Then, we can write the transition function:
$$\theta^{t+1} = \theta^t + \omega^t \tag{7}$$
where $\omega^t$ is the process noise at iteration $t$. We assume that $\omega^t$ is drawn from a zero-mean Gaussian distribution with covariance $W$. Introducing process noise enables the robot to capture how the imperfect human may unintentionally alter their preferences between iterations, or may not know exactly what they want: (7) implies that the end-user’s preferences are noisily constant over time.
Observation Model. Of course, the robot does not directly observe $\theta$. Instead, the robot observes the human’s corrected trajectory $\xi_H^t$, and, more specifically, the corrected feature counts $\Phi(\xi_H^t, E^t)$. Recall that feature counts are the sum of features along a trajectory (1). The robot has an observation model:
$$\Phi(\xi_H^t, E^t) = \Phi(\xi_\theta^t, E^t) + \nu^t \tag{8}$$
where $\Phi(\xi_\theta^t, E^t)$ is the true correction the human is attempting to give, and $\nu^t$ is the observation noise. We again assume that $\nu^t$ is drawn from a zero-mean Gaussian distribution with covariance $V$. Here observation noise indicates that end-users do not give exactly those feature counts which they have in mind, and thus (8) is similar to previous IRL works that assume that the human approximately provides their intended optimal correction [8, 7].
Intended Correction. The most difficult aspect of our observation model (8) is determining $\xi_\theta^t$, i.e., determining what correction the human intends to provide given that their current preferences are $\theta^t$ and the environment is $E^t$. Finding this mapping is challenging because it requires that we model the end-user’s policy. We will here assume that the human intends to give the optimal correction:
$$\xi_\theta^t = \arg\max_{\xi} \; \theta^t \cdot \Phi(\xi, E^t) \tag{9}$$
Recalling (6), notice that $\xi_\theta^t$ is now the optimal trajectory. We recognize that modeling the human as intending to provide optimal corrections (9) and then incorporating Gaussian observation noise over the resulting feature counts (8) is analogous to noisy optimal demonstrations [7]. While it may seem like this contradicts our original definition of corrections as improvements, we will find that modeling the human’s policy with (9) results in an estimate similar to (5).
IRL as a Dynamical System. Together (7) and (8) express IRL as a dynamical system, where the human’s preferences are noisily constant, and the robot observes approximately corrected feature counts as a function of the human’s hidden preferences. We want to estimate these preferences.
Extended Kalman Filter. Given the transition model (7) and the observation model (8), we can leverage a Kalman filter to obtain an optimal estimate of the human’s preferences [25]. To be more precise, since the observation model is here nonlinear, we apply an extended Kalman filter (EKF). The EKF linearizes the observation model around the current estimate, and then acts as a standard Kalman filter. We point out that recent developments, such as the unscented Kalman filter (UKF) [26], may outperform an EKF [27]. For simplicity of exposition—as well as the insight it provides—we will here present the EKF, while noting that we apply the UKF in our simulations.
Objective Estimate and Covariance. Let $\hat{\theta}$ be the mean estimate of the human’s preferences, and let $P$ be the covariance (uncertainty) of this estimate. Here we list the steps to update the estimate and covariance matrix via an EKF. First, we use a Taylor series expansion to linearize the observation model (8) around the current estimate, and reach the following observation Jacobian $H$:
$$H^t = \left. \frac{\partial \Phi(\xi_\theta^t, E^t)}{\partial \theta} \right|_{\theta = \hat{\theta}^t} \tag{10}$$
Intuitively, $H$ tells us how the intended feature counts will vary as the human’s preferences change. We expect the performance and optimality of our EKF to improve when the observation model linearized in (10) is approximately linear. Applying $H$, we can now write the EKF update rule for IRL:
$$\hat{\theta}^{t+1} = \hat{\theta}^t + K^t \left( \Phi(\xi_H^t, E^t) - \Phi(\xi_R^t, E^t) \right) \tag{11}$$
To see how we obtain the right side of (11), note that the predicted correction is actually the optimal trajectory given $\hat{\theta}^t$ and (9), and so it equals $\xi_R^t$ from (6). Comparing (11) to the Preference Perceptron (5), we have replaced the learning rate $\alpha$ with the Kalman gain matrix $K$:
$$K^t = P^t (H^t)^\top \left( H^t P^t (H^t)^\top + V \right)^{-1} \tag{12}$$
Finally, the covariance matrix of the estimate, $P$, is updated according to:
$$P^{t+1} = \left( I - K^t H^t \right) P^t + W \tag{13}$$
Notice that our notation shifts the iteration associated with $\hat{\theta}$ and $P$ one step forward as compared to the standard notation for an EKF [25]. We should read $\hat{\theta}^{t+1}$ as the mean estimate of $\theta$ at iteration $t+1$, given the observed feature counts after $t$ iterations.
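The EKF step described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it uses the common one-step-predictor form (gain from the current covariance, process noise added after the measurement update), and the names `W` and `V` for the process and observation noise covariances, the identity Jacobian, and all numeric values are hypothetical.

```python
import numpy as np


def ekf_preference_update(theta_hat, P, H, phi_human, phi_robot, W, V):
    """One EKF step for the preference estimate (one-step-predictor form)."""
    S = H @ P @ H.T + V                      # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)           # Kalman gain replaces the learning rate
    theta_next = theta_hat + K @ (phi_human - phi_robot)
    # Measurement update of the covariance, then add process noise.
    P_next = (np.eye(len(theta_hat)) - K @ H) @ P + W
    return theta_next, P_next, K


# Hypothetical setup: the robot is unsure about feature 0, confident about feature 1.
theta_hat = np.zeros(2)
P = np.diag([1.0, 0.01])
H = np.eye(2)                                # assumed Jacobian, for illustration only
W, V = 0.01 * np.eye(2), 0.5 * np.eye(2)
phi_robot = np.array([3.0, 3.0])
phi_human = np.array([1.0, 1.0])
theta_next, P_next, K = ekf_preference_update(theta_hat, P, H, phi_human, phi_robot, W, V)
print(theta_next)
print(np.diag(P_next))
```

Although the correction changes both feature counts equally, the estimate moves mostly along the uncertain feature: the gain acts as a per-parameter learning rate scaled by the current covariance. When the corrected and original feature counts match, the estimate is left unchanged.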
Summary. Provided that both the transition and observation models for IRL have Gaussian noise, we can use an EKF or UKF to estimate the human’s preferences online, while also maintaining the covariance matrix $P$ to track the uncertainty of this estimate. It is interesting to consider that the update rule (11) does not change the estimated preferences when the corrected feature counts match the robot’s feature counts: this is analogous to iteratively matching feature expectations [3]. In practice, our Kalman filter approach has extended the Preference Perceptron (5) by adding the Kalman gain matrix, $K$ from (12), and the covariance matrix, $P$ from (13).
5 Leveraging Uncertainty when Learning from Corrections
We showed that the Preference Perceptron can be extended to include uncertainty via a Kalman filter: but how should we leverage this uncertainty? In this section, we explore how the covariance can be used to actively learn from human corrections, and then safely deploy a resultant trajectory. We demonstrate that (a) the robot can select environments to elicit more informative human corrections, and (b) we can deploy the robot with risk-averse or risk-sensitive behavior.
Minimizing Covariance. One reasonable goal for a robot that is learning the human’s preferences is to minimize its uncertainty over those preferences. When uncertainty is high, the robot is unsure about how it should behave, and when uncertainty is low, the robot is confident that it understands the human’s preferences. Intuitively, corrections which reduce the robot’s uncertainty over $\theta$ should be encouraged [13, 14, 15]. Within our Kalman filter approach, the robot should therefore elicit corrections that minimize the covariance matrix $P$.
Greedy Environment Selection. Although the robot here cannot directly control the human’s correction, the robot can select the environment in which that correction is provided. At each iteration $t$, the robot greedily minimizes its uncertainty regardless of the human’s actual correction by selecting the environment $E^t$ according to:
$$E^t = \arg\min_{E} \; \left\| P^{t+1} \right\|_F \tag{14}$$
In the above, $\|\cdot\|_F$ is the Frobenius norm (although other norms can be used). Note that the Jacobian $H^t$ from (10) depends on the environment, and so the Kalman gain $K^t$ from (12) and the updated covariance $P^{t+1}$ also depend on the environment. As pointed out by [28], we can evaluate the uncertainty in advance (i.e., before the human provides a correction), and hence we can solve (14) without knowing what correction the human will actually give.
Intuition. Fig. 2 demonstrates how we can leverage greedy environment selection. Inspecting these results, we see that environments where small changes in $\theta$ lead to large changes in $\Phi$ better reduce uncertainty; put another way, we generally want to maximize the magnitude of $H$. For example, consider the middle simulation in Fig. 2. When the laptop is too far away from the robot’s current optimal trajectory $\xi_R$, local corrections do not alter the feature counts, and so the robot cannot learn from this environment. Intuitively, we should prefer environments where the robot’s current trajectory interacts with the relevant features: in this case, by moving directly over the laptop.
Multiple Features. When the robot is uncertain about the human’s preferences over multiple features, , the robot must trade off between learning these features (see Fig. 3). We find that, if the covariance over each feature is equal, interacting with all features is optimal. By contrast, if the robot has greater uncertainty over a specific feature, the greedy robot favors environments that elicit corrections on that feature. In Fig. 3, the robot chooses a start location between the laptop and table when the initial uncertainty is equal, but biases its starting location towards the table when it has greater uncertainty about the table feature. Intuitively, a robot using (14) will select environments where the current optimal trajectory interacts with the most uncertain features.
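The greedy selection rule (14) can be sketched as below. The sketch assumes each candidate environment is summarized by its observation Jacobian and that the next covariance follows the standard Kalman form with noise covariances named `W` and `V`; the environment names, Jacobians, and numbers are hypothetical. Because the covariance update does not use the observed correction, every candidate can be scored before the human acts.

```python
import numpy as np


def next_covariance(P, H, W, V):
    """Covariance after one Kalman step; independent of the actual correction."""
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + V)
    return (np.eye(P.shape[0]) - K @ H) @ P + W


def select_environment(P, jacobians, W, V):
    """Pick the environment whose predicted covariance has the smallest Frobenius norm."""
    scores = {name: float(np.linalg.norm(next_covariance(P, H, W, V), ord="fro"))
              for name, H in jacobians.items()}
    return min(scores, key=scores.get), scores


# Hypothetical Jacobians for three candidate environments, features [laptop, table].
jacobians = {
    "laptop_far": np.zeros((2, 2)),          # corrections do not change feature counts
    "laptop_only": np.diag([2.0, 0.0]),      # trajectory interacts with the laptop only
    "laptop_and_table": np.diag([1.5, 1.5]), # trajectory interacts with both features
}
P = np.eye(2)                                # equal initial uncertainty
W, V = 0.01 * np.eye(2), 0.5 * np.eye(2)
best, scores = select_environment(P, jacobians, W, V)
print(best, scores)
```

In this toy case the environment with a zero Jacobian scores worst, and with equal initial uncertainty the environment that touches both features scores best, mirroring the intuition described above.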
Risk-Sensitive Deployment. After the robot has learned from the human’s corrections and is deployed to perform the task (without human oversight), we can leverage the covariance matrix to select safer robotic behavior. Recall that the robot’s trajectory optimizes (6) based on the estimated preferences $\hat{\theta}$. Planning only with $\hat{\theta}$ fails to account for the covariance over this estimate: the robot might be confident about some learned preferences, but unsure about others. Hence, we will use a risk-averse trajectory planning approach similar to [29]. First, we generate a set of preferences $\Theta$:
$$\Theta = \left\{ \hat{\theta}, \; \hat{\theta} \pm (\sqrt{P})_i \right\}, \quad i = 1, \ldots, n \tag{15}$$
Here $(\sqrt{P})_i$ is the $i$th column of the matrix square root of $P$, and $n$ is the dimension of $\theta$. The robot now has $2n+1$ estimates of $\theta$, where $\hat{\theta}$ is the Kalman filter estimate, and $\hat{\theta} \pm (\sqrt{P})_i$ are one standard deviation away (as defined by the current covariance $P$). Our risk-averse robot optimizes the worst-case reward over $\Theta$:
$$\xi_R = \arg\max_{\xi} \min_{\theta \in \Theta} \; \theta \cdot \Phi(\xi, E) \tag{16}$$
By extension, a risk-seeking robot optimizes the best-case reward over $\Theta$, i.e., uses $\max$ in place of $\min$ in (16), and a risk-neutral robot simply optimizes with the mean estimate $\hat{\theta}$, which reduces to (6).
Simplification. In practice, solving (16) for multiple DoF robotic manipulators moving in continuous spaces is challenging. One particular concern is local minima, which naturally occur during trajectory optimization [30, 24]; this problem is now compounded in (16) by a nested optimization. To make risk-sensitive planning more tractable, we will reverse the order of optimization:
$$\theta_{\text{risk}} = \arg\min_{\theta \in \Theta} \max_{\xi} \; \theta \cdot \Phi(\xi, E) \tag{17}$$
In the above, we first find the best possible reward for each preference in $\Theta$, and then choose the worst-case preference $\theta_{\text{risk}}$. Finally, we use $\theta_{\text{risk}}$ in (6), and obtain the risk-averse trajectory. We demonstrate the results of risk-sensitive deployment using this simplification in Fig. 4.
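The simplified risk-sensitive procedure can be sketched over a finite set of candidate trajectories. Several choices here are assumptions made for illustration: the preference set contains the mean estimate plus and minus each column of a matrix square root, the square root is taken as the Cholesky factor, and the candidate feature counts and numbers are hypothetical.

```python
import numpy as np


def candidate_preferences(theta_hat, P):
    """Mean estimate plus and minus each column of a square root of P."""
    sqrt_P = np.linalg.cholesky(P)  # assumption: Cholesky factor as the matrix square root
    columns = [sqrt_P[:, i] for i in range(P.shape[0])]
    return [theta_hat] + [theta_hat + c for c in columns] + [theta_hat - c for c in columns]


def risk_sensitive_choice(theta_hat, P, feature_counts, risk_averse=True):
    """Best reward per preference first, then the worst (or best) case preference."""
    Phi = np.asarray(feature_counts, dtype=float)
    thetas = candidate_preferences(np.asarray(theta_hat, dtype=float),
                                   np.asarray(P, dtype=float))
    best_rewards = np.array([np.max(Phi @ theta) for theta in thetas])
    pick = np.argmin(best_rewards) if risk_averse else np.argmax(best_rewards)
    theta_risk = thetas[int(pick)]
    # Plan with the selected preference, exactly as with the mean estimate.
    return int(np.argmax(Phi @ theta_risk)), theta_risk


# Hypothetical case: confident the laptop is bad, unsure about the table.
theta_hat = np.array([-1.0, 0.0])
P = np.diag([0.01, 1.0])
trajectories = [[0.0, 2.0],   # 0: straight across the table
                [1.0, 0.0],   # 1: over the laptop
                [0.2, 0.5]]   # 2: skirts both regions
averse_idx, averse_theta = risk_sensitive_choice(theta_hat, P, trajectories, risk_averse=True)
seeking_idx, seeking_theta = risk_sensitive_choice(theta_hat, P, trajectories, risk_averse=False)
print(averse_idx, seeking_idx)
```

Here the risk-averse robot plans as if the table were undesirable and skirts it, while the risk-seeking robot plans as if the table were desirable and crosses it.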
Summary. When we extend the Preference Perceptron (5) with our Kalman filter approach, we can use the covariance matrix (13) to improve learning and deployment. At each iteration, the robot selects an environment based on (14) to greedily minimize uncertainty for the next iteration. We intuitively find that this leads to environments where the robot’s current trajectory interacts with features about which the robot is unsure (see Figs. 2 and 3). When the robot is eventually deployed, the robot exploits uncertainty for risk-sensitive planning: we describe a simplified approach in (15) and (17). Rather than just planning with the mean estimate, which is equivalent to risk-neutral planning, the robot can now avoid (or increase) features with larger uncertainty (see Fig. 4). In conclusion, the robot leverages $P$ to select environments which will elicit the most informative human corrections, as well as to include the certainty of its learned preferences during deployment.
6 Simulations
We explained how using a Kalman filter (11) allows for active learning and risk-sensitive deployment. To illustrate learning and deployment, we performed single-iteration simulations, and displayed the results in Figs. 2, 3, and 4. This is sufficient for deployment, since no more corrections will be provided, but not for learning. During learning, the human provides corrections over multiple iterations: in this section, we explore how the robot iteratively learns from a sequence of corrections. We will compare (a) Active Learning (AL) with a Kalman filter, (b) using just a Kalman Filter (KF), and (c) learning with the Preference Perceptron (PP). We hypothesize that AL will result in more accurate learning from the same number of corrections.
Setup. Consider the setting from Fig. 3, where a 2-DoF robotic manipulator is unsure whether it should carry a cup of coffee over the laptop or across the table. At each iteration $t$, the robot selects an environment $E^t$, and solves for the optimal trajectory $\xi_R^t$ given its current preference estimate $\hat{\theta}^t$. There are possible environments : these environments have different start, laptop, and table locations. The AL robot selects $E^t$ using (14), while the KF and PP robots select environments uniformly at random. There are total rounds of corrections (i.e., iterations).
Simulated Human. We consider a simulated user who does not provide optimal (or approximately optimal) demonstrations. Instead, the human corrects the robot’s trajectory by moving one waypoint from $\xi_R^t$ towards the equivalent waypoint along their intended trajectory, $\xi_\theta^t$ from (9). The human corrects only the waypoint with the largest error. This user is strictly informative [2].
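A simulated user of this kind can be sketched as follows. The fraction `step` by which the waypoint moves is an assumption introduced here (the text only says the waypoint moves towards the intended one), and the waypoints are hypothetical.

```python
import numpy as np


def simulated_correction(xi_robot, xi_intended, step=0.5):
    """Move only the waypoint with the largest error part of the way
    toward the matching waypoint on the intended trajectory."""
    xi_robot = np.asarray(xi_robot, dtype=float)
    xi_intended = np.asarray(xi_intended, dtype=float)
    errors = np.linalg.norm(xi_intended - xi_robot, axis=1)  # per-waypoint distance
    worst = int(np.argmax(errors))
    xi_human = xi_robot.copy()
    xi_human[worst] += step * (xi_intended[worst] - xi_robot[worst])
    return xi_human, worst


# Hypothetical 2D waypoints for the robot's plan and the human's intended path.
xi_robot = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
xi_intended = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.2]]
xi_human, worst = simulated_correction(xi_robot, xi_intended)
print(worst, xi_human[worst])
```

The corrected trajectory is an improvement rather than a demonstration: every waypoint except one is left exactly where the robot placed it.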
Implementation. We solved for the optimal trajectory using TrajOpt [24]. Rather than an EKF, we used the unscented Kalman filter (UKF) [26] for the AL and KF robots because $\Phi(\xi_\theta^t, E^t)$ in (10) is highly nonlinear. We simulated the KF and PP robots times to obtain their expected performance. To ensure that the learning rate for PP is consistent with the Kalman gain for AL and KF, we set $\alpha$ as the expected mean value of the matrix diagonal of $K$ across all KF simulations.
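One plausible reading of how the PP learning rate is matched to the Kalman gain is sketched below: average the diagonal of every gain matrix recorded during the KF runs. The helper name and the example gains are hypothetical, and the exact averaging used in the simulations may differ.

```python
import numpy as np


def learning_rate_from_gains(gains):
    """Mean of the diagonal entries over all recorded Kalman gain matrices."""
    return float(np.mean([np.mean(np.diag(K)) for K in gains]))


# Hypothetical gains recorded over two KF iterations.
recorded_gains = [np.diag([0.6, 0.2]), np.diag([0.4, 0.4])]
alpha = learning_rate_from_gains(recorded_gains)
print(alpha)
```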
Results. Our results are shown in Figs. 5 and 6. Here Estimate Error refers to the difference between the true preferences $\theta$ and the robot’s learned estimate $\hat{\theta}$. We calculated this metric using the norm: . Regret captures the difference in reward the robot would receive if it knew $\theta$, and the reward the robot actually receives using its estimate $\hat{\theta}$. In our setting, regret is found using $\xi_\theta$, the optimal trajectory under (9), and $\xi_R$, the optimal trajectory under (6). Recalling (1), regret equals: $\theta \cdot \Phi(\xi_\theta, E) - \theta \cdot \Phi(\xi_R, E)$. We summed this regret across all environments $E$.
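The two metrics can be sketched for a single environment with a finite candidate set. The Euclidean norm for the estimate error is an assumption (the norm is not recoverable from the text here), as are the candidate trajectories and numbers.

```python
import numpy as np


def estimate_error(theta_true, theta_hat):
    """Distance between true and estimated preferences (Euclidean norm assumed)."""
    return float(np.linalg.norm(np.asarray(theta_true, dtype=float)
                                - np.asarray(theta_hat, dtype=float)))


def regret(theta_true, theta_hat, feature_counts):
    """True reward of the truly optimal trajectory minus the true reward
    of the trajectory chosen with the robot's estimate."""
    Phi = np.asarray(feature_counts, dtype=float)
    true_rewards = Phi @ np.asarray(theta_true, dtype=float)
    chosen = int(np.argmax(Phi @ np.asarray(theta_hat, dtype=float)))
    return float(np.max(true_rewards) - true_rewards[chosen])


# Hypothetical case: the human likes the table, the robot believes the opposite.
theta_true = [-1.0, 1.0]
theta_hat = [-1.0, -1.0]
trajectories = [[0.0, 2.0], [1.0, 0.0], [0.2, 0.5]]
err = estimate_error(theta_true, theta_hat)
reg = regret(theta_true, theta_hat, trajectories)
print(err, reg)
```

Summing `regret` over every candidate environment gives the aggregate measure described above; regret is zero whenever the estimate selects the same trajectory as the true preferences, even if the estimate error is not zero.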
Discussion. From Fig. 5, we see that AL leads to faster learning than either KF or PP when the robot’s initial uncertainty over the human’s preferences is uniform. In Fig. 6, we see that AL can be especially advantageous when some aspects of the human’s preferences are initially well understood, but the robot is uncertain about others. As we might expect, PP and KF are very similar: when we do not utilize the covariance matrix $P$, the only difference between PP and KF is the learning rate, $\alpha$ in PP (5), and the Kalman gain, $K$ in KF (11). Our simulations show that using (14) to select the environment that greedily minimizes uncertainty can lead to faster learning.
7 Conclusion
We have proposed a Kalman filter approach for online IRL, so that the robot knows which estimates it is certain about, and which estimates it is not confident about. This approach is particularly suited to iterative learning from human corrections, and extends the existing Preference Perceptron to track uncertainty. We demonstrated two different ways uncertainty could be leveraged within this setting: active learning and risk-sensitive deployment. Our simulations show how we can use the Kalman filter covariance to select more informative real or virtual environments.
This project was funded in part by the NSF GRFP-1450681.
References
[1] N. D. Ratliff, J. A. Bagnell, and M. A. Zinkevich. Maximum margin planning. In Proc. International Conference on Machine Learning (ICML), pages 729–736, 2006.
[2] P. Shivaswamy and T. Joachims. Coactive learning. Journal of Artificial Intelligence Research, 53:1–40, 2015.
[3] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proc. International Conference on Machine Learning (ICML), 2004.
[4] A. Y. Ng and S. J. Russell. Algorithms for inverse reinforcement learning. In Proc. International Conference on Machine Learning (ICML), pages 663–670, 2000.
[5] T. Osa, J. Pajarinen, G. Neumann, J. A. Bagnell, P. Abbeel, and J. Peters. An algorithmic perspective on imitation learning. Foundations and Trends in Robotics, 7(1-2):1–179, 2018.
[6] B. Akgun, M. Cakmak, K. Jiang, and A. L. Thomaz. Keyframe-based learning from demonstration. International Journal of Social Robotics, 4(4):343–355, 2012.
[7] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In Proc. Association for the Advancement of Artificial Intelligence (AAAI), volume 8, pages 1433–1438, 2008.
[8] D. Ramachandran and E. Amir. Bayesian inverse reinforcement learning. Urbana, 51(61801):1–4, 2007.
[9] A. Bajcsy, D. P. Losey, M. K. O’Malley, and A. D. Dragan. Learning robot objectives from physical human interaction. In Proc. Conference on Robot Learning (CoRL), pages 217–226, 2017.
[10] A. Jain, S. Sharma, T. Joachims, and A. Saxena. Learning preferences for manipulation tasks from online coactive feedback. The International Journal of Robotics Research, 34(10):1296–1313, 2015.
[11] B. Settles. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1):1–114, 2012.
[12] M. Lopes, F. Melo, and L. Montesano. Active learning for reward estimation in inverse reinforcement learning. In Proc. Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 31–46, 2009.
[13] R. Akrour, M. Schoenauer, and M. Sebag. APRIL: Active preference learning-based reinforcement learning. In Proc. Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 116–131, 2012.
[14] C. Daniel, M. Viering, J. Metz, O. Kroemer, and J. Peters. Active reward learning. In Proc. Robotics: Science and Systems (RSS), 2014.
[15] D. Sadigh, A. D. Dragan, S. Sastry, and S. A. Seshia. Active preference-based learning of reward functions. In Proc. Robotics: Science and Systems (RSS), 2017.
[16] X. Zhu. Machine teaching: An inverse problem to machine learning and an approach toward optimal education. In Proc. Association for the Advancement of Artificial Intelligence (AAAI), pages 4083–4087, 2015.
[17] W. Liu, B. Dai, A. Humayun, C. Tay, C. Yu, L. B. Smith, J. M. Rehg, and L. Song. Iterative machine teaching. In Proc. International Conference on Machine Learning (ICML), pages 2149–2158, 2017.
[18] M. Cakmak and M. Lopes. Algorithmic and human teaching of sequential decision tasks. In Proc. Association for the Advancement of Artificial Intelligence (AAAI), 2012.
[19] J. Choi and K.-E. Kim. MAP inference for Bayesian inverse reinforcement learning. In Proc. Advances in Neural Information Processing Systems (NIPS), pages 1989–1997, 2011.
[20] S. H. Huang, D. Held, P. Abbeel, and A. D. Dragan. Enabling robots to communicate their objectives. In Proc. Robotics: Science and Systems (RSS), 2017.
[21] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
[22] L. Bottou. Online learning and stochastic approximations. In Online Learning in Neural Networks, volume 17, pages 9–42, 1998.
[23] S. Karaman and E. Frazzoli. Sampling-based algorithms for optimal motion planning. The International Journal of Robotics Research, 30(7):846–894, 2011.
[24] J. Schulman, Y. Duan, J. Ho, A. Lee, I. Awwal, H. Bradlow, J. Pan, S. Patil, K. Goldberg, and P. Abbeel. Motion planning with sequential convex optimization and convex collision checking. The International Journal of Robotics Research, 33(9):1251–1270, 2014.
[25] H. M. Choset. Principles of Robot Motion: Theory, Algorithms, and Implementation. MIT Press, 2005.
[26] E. A. Wan and R. Van Der Merwe. The unscented Kalman filter for nonlinear estimation. In Proc. Adaptive Systems for Signal Processing, Communications, and Control Symposium (AS-SPCC), pages 153–158, 2000.
[27] R. Kandepu, B. Foss, and L. Imsland. Applying the unscented Kalman filter for nonlinear state estimation. Journal of Process Control, 18(7-8):753–768, 2008.
[28] J. Van Den Berg, P. Abbeel, and K. Goldberg. LQG-MP: Optimized path planning for robots with motion uncertainty and imperfect state information. The International Journal of Robotics Research, 30(7):895–913, 2011.
[29] D. Hadfield-Menell, S. Milli, P. Abbeel, S. J. Russell, and A. Dragan. Inverse reward design. In Proc. Advances in Neural Information Processing Systems (NIPS), pages 6768–6777, 2017.
[30] J. Pan, Z. Chen, and P. Abbeel. Predicting initialization effectiveness for trajectory optimization. In Proc. IEEE International Conference on Robotics and Automation (ICRA), pages 5183–5190, 2014.