Shared Autonomy via Hindsight Optimization
Abstract
In shared autonomy, user input and robot autonomy are combined to control a robot to achieve a goal. Often, the robot does not know a priori which goal the user wants to achieve, and must both predict the user’s intended goal and assist in achieving that goal. We formulate the problem of shared autonomy as a Partially Observable Markov Decision Process with uncertainty over the user’s goal. We utilize maximum entropy inverse optimal control to estimate a distribution over the user’s goal based on the history of inputs. Ideally, the robot assists the user by solving for an action which minimizes the expected cost-to-go for the (unknown) goal. As solving the POMDP to select the optimal action is intractable, we use hindsight optimization to approximate the solution. In a user study, we compare our method to a standard predict-then-blend approach. We find that our method enables users to accomplish tasks more quickly while utilizing less input. However, when asked to rate each system, users were mixed in their assessment, citing a tradeoff between maintaining control authority and accomplishing tasks quickly.
1 Introduction
Robotic teleoperation enables a user to achieve their intended goal by providing inputs into a robotic system. In direct teleoperation, user inputs are mapped directly to robot actions, putting the burden of control entirely on the user. However, input interfaces are often noisy, and may have fewer degrees of freedom than the robot they control. This makes operation tedious, and many goals impossible to achieve. Shared Autonomy seeks to alleviate this by combining teleoperation with autonomous assistance.
A key challenge in shared autonomy is that the system may not know a priori which goal the user wants to achieve. Thus, many prior works [14, 1, 27, 7] split shared autonomy into two parts: 1) predict the user’s goal, and 2) assist for that single goal, potentially using prediction confidence to regulate assistance. We refer to this approach as predict-then-blend.
In contrast, we follow more recent work [11] which assists for an entire distribution over goals, enabling assistance even when the confidence for any particular goal is low. This is particularly important in cluttered environments, where it is difficult, and sometimes impossible, to predict a single goal.
We formalize shared autonomy by modeling the system’s task as a Partially Observable Markov Decision Process (POMDP) [21, 12] with uncertainty over the user’s goal. We assume the user is executing a policy for their known goal without knowledge of assistance. In contrast, the system models both the user input and robot action, and solves for an assistance action that minimizes the total expected cost-to-go of both systems. See Fig. 1.
The result is a system that will assist for any distribution over goals. When the system is able to make progress for all goals, it does so automatically. When a good assistance strategy is ambiguous (e.g. the robot is in between two goals), the output can be interpreted as a blending between user input and robot autonomy based on confidence in a particular goal, which has been shown to be effective [7]. See Fig. 4.
Solving for the optimal action in our POMDP is intractable. Instead, we approximate using QMDP [18], also referred to as hindsight optimization [5, 24]. This approximation has many properties suitable for shared autonomy: it is computationally efficient, works well when information is gathered easily [16], and will not oppose the user to gather information.
Additionally, we assume each goal consists of multiple targets (e.g. an object has multiple grasp poses), of which any are acceptable to a user with that goal. Given a known cost function for each target, we derive an efficient computation scheme for goals that decomposes over targets.
To evaluate our method, we conducted a user study where users teleoperated a robotic arm to grasp objects using our method and a standard predict-then-blend approach. Our results indicate that users accomplished tasks significantly more quickly and with less control input with our system. However, when surveyed, users tended towards preferring the simpler predict-then-blend approach, citing a tradeoff between control authority and efficiency. We found this surprising, as prior work indicates that task completion time correlates strongly with user satisfaction, even at the cost of control authority [7, 11, 9]. We discuss potential ways to alter our model to take this into account.
2 Related Works
We separate related works into goal prediction and assistance strategies.
2.1 Goal Prediction
Maximum entropy inverse optimal control (MaxEnt IOC) methods have been shown to be effective for goal prediction [28, 29, 30, 7]. In this framework, the user is assumed to be an intent-driven agent approximately optimizing a cost function. By minimizing the worst-case predictive loss, Ziebart et al. [28] derive a model where trajectory probability decreases exponentially with cost, and show how this cost function can be learned efficiently from user demonstrations. They then derive a method for inferring a distribution over goals from user inputs, where probabilities correspond to how efficiently the inputs achieve each goal [29]. While our framework allows for any prediction method, we choose to use MaxEnt IOC, as we can directly optimize for the user’s cost in our POMDP.
Others have approached the prediction problem by utilizing various machine learning methods. Koppula and Saxena [15] extend conditional random fields (CRFs) with object affordances to predict potential human motions. Wang et al. [23] learn a generative predictor by extending Gaussian Process Dynamical Models (GPDMs) with a latent variable for intention. Hauser [11] utilizes a Gaussian mixture model over task types (e.g. reach, grasp), and predicts both the task type and continuous parameters for that type (e.g. movements) using Gaussian mixture autoregression.
2.2 Assistance Methods
Many prior works assume the user’s goal is known, and study how methods such as potential fields [2, 6] and motion planning [26] can be utilized to assist for that goal.
For multiple goals, many works follow a predict-then-blend approach of predicting the most likely goal, then assisting for that goal. These methods range from taking over when confident [8, 14], to virtual fixtures to help follow paths [1], to blending with a motion planner [7]. Many of these methods can be thought of as an arbitration between the user’s policy and a fully autonomous policy for the most likely goal [7]. These two policies are blended, where prediction confidence regulates the amount of assistance.
Recently, Hauser [11] presented a system which provides assistance while reasoning about the entire distribution over goals. Given the current distribution, the planner optimizes for a trajectory that minimizes the expected cost, assuming that no further information will be gained. After executing the plan for some time, the distribution is updated by the predictor, and a new plan is generated for the new distribution. In order to efficiently compute the trajectory, it is assumed that the cost function corresponds to squared distance, resulting in the calculation decomposing over goals. In contrast, our model is more general, enabling any cost function for which a value function can be computed. Furthermore, our POMDP model enables us to reason about future human actions.
Planning with human intention models has been used to avoid moving pedestrians. Ziebart et al. [29] use MaxEnt IOC to learn a predictor of pedestrian motion, and use this to predict the probability a location will be occupied at each time step. They build a time-varying cost map, penalizing locations likely to be occupied, and optimize trajectories for this cost. Bandyopadhyay et al. [4] use fixed models for pedestrian motions, and focus on utilizing a POMDP framework with SARSOP [17] for selecting good actions. Like our approach, this enables them to reason over the entire distribution of potential goals. They show this outperforms utilizing only the maximum likelihood estimate of goal prediction for avoidance.
Outside of robotics, Fern and Tadepalli [22] have studied MDP and POMDP models for assistance. Their study focuses on an interactive assistant which suggests actions to users, who then accept or reject the action. They show that optimal action selection even in this simplified model is PSPACE-complete. However, a simple greedy policy has bounded regret.
Nguyen et al. [20] and Macindoe et al. [19] apply similar models to creating agents in cooperative games, where autonomous agents simultaneously infer human intentions and take assistance actions. Here, the human player and autonomous agent each control separate characters, and thus affect different parts of state space. Like our approach, they model users as stochastically optimizing an MDP, and solve for assistance actions with a POMDP. In contrast to these works, our action space and state space are continuous.
3 Problem Statement
We assume there is a discrete set of possible goals, one of which is the user’s intended goal. The user supplies inputs through some interface to achieve their goal. Our shared autonomy system does not know the intended goal a priori, but utilizes user inputs to infer the goal. It selects actions to minimize the expected cost of achieving the user’s goal.
Formally, let $x \in X$ be the continuous robot state (e.g. position, velocity), and let $a \in A$ be the continuous actions (e.g. velocity, torque). We model the robot as a deterministic dynamical system with transition function $T: X \times A \to X$. The user supplies continuous inputs $u \in U$ via an interface (e.g. joystick, mouse). These user inputs map to robot actions through a known deterministic function $D: U \to A$, corresponding to the effect of direct teleoperation.
In our scenario, the user wants to move the robot to one goal in a discrete set of goals $g \in G$. We assume access to a stochastic user policy for each goal, $\pi^{usr}_g(x) = p(u \mid x, g)$, usually learned from user demonstrations. In our system, we model this policy using the maximum entropy inverse optimal control (MaxEnt IOC) [28] framework, which assumes the user is approximately optimizing some cost function for their intended goal $g$, $C^{usr}_g: X \times U \to \mathbb{R}$. This model corresponds to a goal-specific Markov Decision Process (MDP), defined by the tuple $(X, U, T, C^{usr}_g)$. We discuss details in Sec. 4.
Unlike the user, our system does not know the intended goal. We model this with a Partially Observable Markov Decision Process (POMDP) with uncertainty over the user’s goal. A POMDP maps a distribution over states, known as the belief $b$, to actions. Define the system state $s \in S$ as the robot state augmented by a goal, $s = (x, g)$ and $S = X \times G$. In a slight abuse of notation, we overload our transition function such that $T: S \times A \to S$, which corresponds to transitioning the robot state as above, but keeping the goal the same.
In our POMDP, we assume the robot state is known, and all uncertainty is over the user’s goal. Observations in our POMDP correspond to user inputs $u \in U$. Given a sequence of user inputs, we infer a distribution over system states (equivalently a distribution over goals) using an observation model $\Omega$. This corresponds to computing $\pi^{usr}_g(x)$ for each goal, and applying Bayes’ rule. We provide details in Sec. 4.
The system uses cost function $C^{rob}: S \times A \times U \to \mathbb{R}$, corresponding to the cost of taking robot action $a$ when in system state $s$ and the user has input $u$. Note that allowing the cost to depend on the observation $u$ is nonstandard, but important for shared autonomy, as prior works suggest that users prefer maintaining control authority [13]. This formulation enables us to penalize robot actions which deviate from $D(u)$. Our shared autonomy POMDP is defined by the tuple $(S, A, T, C^{rob}, U, \Omega)$. The optimal solution to this POMDP minimizes the expected accumulated cost $C^{rob}$. As this is intractable to compute, we utilize Hindsight Optimization to select actions, described in Sec. 5.
4 Modelling the user policy
We now discuss our model of $\pi^{usr}_g$. In principle, we could use any generative predictor [15, 23]. We choose to use maximum entropy inverse optimal control (MaxEnt IOC) [28], as it explicitly models a user cost function $C^{usr}_g$. We can then optimize this directly by defining $C^{rob}$ as a function of $C^{usr}_g$.
Define a sequence of robot states and user inputs as $\xi = \{x^0, u^0, \dots, x^T, u^T\}$. Note that sequences are not required to be trajectories, in that $x^{t+1}$ is not necessarily the result of applying $u^t$ in state $x^t$. Define the cost of a sequence as the sum of costs of all state-input pairs, $C^{usr}_g(\xi) = \sum_t C^{usr}_g(x^t, u^t)$. Let $\xi^{0 \to t}$ be a sequence from time $0$ to $t$, and $\xi^{t \to T}_x$ a sequence from time $t$ to $T$, starting at robot state $x$.
It has been shown that minimizing the worst-case predictive loss results in a model where the probability of a sequence decreases exponentially with cost, $p(\xi \mid g) \propto \exp(-C^{usr}_g(\xi))$ [28]. Importantly, one can efficiently learn a cost function consistent with this model from demonstrations of user execution [28].
Computationally, the difficulty lies in computing the normalizing factor $\int_\xi \exp(-C^{usr}_g(\xi))$, known as the partition function. Evaluating this explicitly would require enumerating all sequences and calculating their cost.
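However, as the cost of a sequence is the sum of costs of all state-action pairs, dynamic programming can be utilized to compute this through soft-minimum value iteration [29, 30]:
$$Q^{\approx}_{g,t}(x, u) = C^{usr}_g(x, u) + V^{\approx}_{g,t+1}(x')$$
$$V^{\approx}_{g,t}(x) = \operatorname{softmin}_u \, Q^{\approx}_{g,t}(x, u)$$
where $x' = T(x, D(u))$, the result of applying $u$ at state $x$, and $\operatorname{softmin}_x f(x) = -\log \int_x \exp(-f(x)) \, dx$.
The log partition function is given by the soft value function, $V^{\approx}_{g,t}(x) = -\log \int_{\xi^{t \to T}_x} \exp(-C^{usr}_g(\xi^{t \to T}_x))$, where the integral is over all sequences starting at configuration $x$ and time $t$. Furthermore, the probability of a single input at a given configuration is given by $\pi^{usr}_t(u \mid x, g) = \exp(V^{\approx}_{g,t}(x) - Q^{\approx}_{g,t}(x, u))$ [29].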
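As an illustration of the recursion, the following is a minimal sketch of soft-minimum value iteration on a small discrete problem. The 1-D chain world, the integer inputs, the per-step cost, and all function names are assumptions made for this example and are not taken from the paper; with discrete inputs, the integral in the softmin becomes a sum.

```python
import math


def softmin(values):
    """Return -log(sum(exp(-v))), shifted by the minimum for numerical stability."""
    m = min(values)
    return m - math.log(sum(math.exp(-(v - m)) for v in values))


def soft_value_iteration(n_states, inputs, goal, horizon, step_cost=1.0):
    """Finite-horizon soft-min value iteration on a 1-D chain (hypothetical toy).

    States are 0..n_states-1 and inputs are integer displacements.
    Assumed user cost: step_cost in every state except the goal, where it is 0.
    Returns (V, Q) with V[t][x] and Q[t][x][input_index].
    """
    def transition(x, u):
        # Deterministic dynamics, clipped to the ends of the chain.
        return min(max(x + u, 0), n_states - 1)

    def cost(x, u):
        return 0.0 if x == goal else step_cost

    V = [[0.0] * n_states for _ in range(horizon + 1)]  # V[horizon] = 0
    Q = [[[0.0] * len(inputs) for _ in range(n_states)] for _ in range(horizon)]
    for t in reversed(range(horizon)):
        for x in range(n_states):
            for i, u in enumerate(inputs):
                Q[t][x][i] = cost(x, u) + V[t + 1][transition(x, u)]
            V[t][x] = softmin(Q[t][x])
    return V, Q


def input_probabilities(V, Q, t, x):
    """pi(u | x, g) = exp(V(x) - Q(x, u)) for every input u."""
    return [math.exp(V[t][x] - q) for q in Q[t][x]]


V, Q = soft_value_iteration(n_states=5, inputs=[-1, 0, 1], goal=4, horizon=6)
probs = input_probabilities(V, Q, t=0, x=2)
```

Because the soft value is the log partition function, the input probabilities at each state sum to one by construction, and inputs that move towards the goal receive higher probability than inputs that move away.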
Many works derive a simplification that enables them to only look at the start and current configurations, ignoring the inputs in between [30, 7]. Key to this assumption is that $\xi$ corresponds to a trajectory, where applying action $u^t$ at $x^t$ results in $x^{t+1}$. However, if the system is providing assistance, this may not be the case. In particular, if the assistance strategy believes the user’s goal is $g$, the assistance strategy will select actions to minimize $C^{usr}_g$. Applying these simplifications will result in positive feedback, where the robot makes itself more confident about goals it already believes are likely. In order to avoid this, we ensure that the prediction probability comes from user inputs only, and not robot actions:
$$p(\xi^{0 \to t} \mid g) = \prod_{t} \pi^{usr}_t(u^t \mid x^t, g)$$
Finally, to compute the probability of a goal given the partial sequence up to $t$, we use Bayes’ rule:
$$p(g \mid \xi^{0 \to t}) = \frac{p(\xi^{0 \to t} \mid g) \, p(g)}{\sum_{g'} p(\xi^{0 \to t} \mid g') \, p(g')}$$
This corresponds to our POMDP observation model $\Omega$.
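The goal inference step can be sketched as follows, assuming the per-step input probabilities for each goal have already been computed by the user model. The function names and the numbers in the example are hypothetical; the computation is done in log space only for numerical stability.

```python
import math


def sequence_log_likelihood(step_probabilities):
    """log p(xi | g): sum of log pi(u_t | x_t, g) over the observed user inputs."""
    return sum(math.log(p) for p in step_probabilities)


def goal_posterior(log_likelihoods, priors):
    """Bayes' rule over goals: p(g | xi) is proportional to p(xi | g) p(g)."""
    log_joint = [ll + math.log(p) for ll, p in zip(log_likelihoods, priors)]
    m = max(log_joint)  # subtract the maximum before exponentiating
    weights = [math.exp(v - m) for v in log_joint]
    total = sum(weights)
    return [w / total for w in weights]


# Two goals, two observed inputs; the probabilities are made-up illustration values.
log_liks = [
    sequence_log_likelihood([0.6, 0.7]),  # inputs are efficient for goal 0
    sequence_log_likelihood([0.2, 0.1]),  # inputs are inefficient for goal 1
]
posterior = goal_posterior(log_liks, priors=[0.5, 0.5])
```

Only user inputs enter the likelihood here, which mirrors the requirement above that robot actions must not feed back into the prediction.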
5 Hindsight Optimization
Solving POMDPs, i.e. finding the optimal action for any belief state, is generally intractable. We utilize the QMDP approximation [18], also referred to as hindsight optimization [5, 24], to select actions. The idea is to estimate the cost-to-go of the belief by assuming full observability will be obtained at the next time step. The result is a system that never tries to gather information, but can plan efficiently in the deterministic subproblems. This concept has been shown to be effective in other domains [24, 25].
We believe this method is suitable for shared autonomy for many reasons. Conceptually, we assume the user will provide inputs at all times, and therefore we gain information without explicit information gathering. In this setting, works in other domains have shown that QMDP performs similarly to methods that consider explicit information gathering [16]. Computationally, QMDP is efficient to compute even with continuous state and action spaces, enabling fast reaction to user inputs. Finally, explicit information gathering where the user is treated as an oracle would likely be frustrating [10, 3], and this method naturally avoids it.
Let $Q(b, a, u)$ be the action-value function of the POMDP, estimating the cost-to-go of taking action $a$ when in belief $b$ with user input $u$, and acting optimally thereafter. In our setting, uncertainty is only over goals, $b(s) = b(g) = p(g \mid \xi^{0 \to t})$.
Let $Q_g(x, a, u)$ correspond to the action-value for goal $g$, estimating the cost-to-go of taking action $a$ when in state $x$ with user input $u$, and acting optimally for goal $g$ thereafter. The QMDP approximation is [18]:
$$Q(b, a, u) = \sum_g b(g) \, Q_g(x, a, u)$$
Finally, as we often cannot calculate $\arg\min_a Q(b, a, u)$ directly, we use a first-order approximation, which leads us to follow the gradient of $Q(b, a, u)$.
We now discuss two methods for approximating $Q_g$:
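To make the QMDP rule concrete, here is a minimal sketch that weights per-goal action-values by the belief and picks the best action from a small candidate set. The 1-D toy action-value (distance to the goal after one step), the discrete candidate actions, and the function names are assumptions for illustration; the dependence on the user input is omitted in this toy.

```python
def qmdp_action(belief, q_functions, candidate_actions):
    """Return the action minimizing sum_g b(g) * Q_g(a) over the candidates."""
    def expected_q(action):
        return sum(b * q(action) for b, q in zip(belief, q_functions))
    return min(candidate_actions, key=expected_q)


def make_goal_q(x, goal):
    """Hypothetical toy Q_g: distance to the goal after a 1-D step a from state x."""
    return lambda a: abs(x + a - goal)


x = 0.0
goals = [-1.0, 2.0]
q_functions = [make_goal_q(x, g) for g in goals]
actions = [-1.0, 0.0, 1.0]

mostly_left = qmdp_action([0.9, 0.1], q_functions, actions)
mostly_right = qmdp_action([0.1, 0.9], q_functions, actions)
```

With a belief concentrated on the left goal the selected action moves left, and with a belief concentrated on the right goal it moves right; no action is ever chosen purely to gather information.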
Robot and user both act
Estimate $u$ with $\pi^{usr}_g$ at each time step, and utilize $C^{rob}((x, g), a, u)$ for the cost. Using this cost, we could run Q-learning algorithms to compute $Q_g$. This would be the standard QMDP approach for our POMDP.
Robot takes over
Assume the user will stop supplying inputs, and the robot will complete the task. This enables us to use the cost function $C^{rob}(s, a, u) = C^{rob}(s, a, 0)$. Unlike the user, we can assume the robot will act optimally. Thus, for many cost functions we can analytically compute the value, e.g. cost of always moving towards the goal at some velocity.
An additional benefit of this method is that it makes no assumptions about the user policy $\pi^{usr}_g$, making it more robust to modelling errors. We use this method in our experiments.
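A minimal sketch of the "robot takes over" case, assuming a value function equal to the time needed to reach the goal at constant speed, scaled by a constant cost rate. Both parameters, the function names, and the 2-D example are hypothetical. The assistance direction follows the negative gradient of the belief-weighted value.

```python
import math


def value_and_gradient(x, goal, speed=1.0, cost_rate=1.0):
    """Analytic cost-to-go for moving straight to the goal, and its gradient in x."""
    dx = [xi - gi for xi, gi in zip(x, goal)]
    dist = math.sqrt(sum(d * d for d in dx))
    value = cost_rate * dist / speed
    if dist == 0.0:
        # The gradient is undefined at the goal itself; use zero there.
        return value, [0.0] * len(x)
    return value, [cost_rate / speed * d / dist for d in dx]


def assistance_direction(x, goals, belief):
    """Negative gradient of sum_g b(g) * V_g(x)."""
    grad = [0.0] * len(x)
    for goal, b in zip(goals, belief):
        _, goal_grad = value_and_gradient(x, goal)
        grad = [acc + b * gi for acc, gi in zip(grad, goal_grad)]
    return [-gi for gi in grad]


# Two equally likely goals placed symmetrically ahead of the robot.
direction = assistance_direction([0.0, 0.0], [[1.0, 1.0], [1.0, -1.0]], [0.5, 0.5])
```

In this symmetric example the lateral components cancel and the robot moves straight ahead, which makes progress towards both goals without committing to either.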
6 Multi-Goal MDP
There are often multiple ways to achieve a goal. We refer to each of these ways as a target. For a single goal (e.g. object to grasp), let the set of targets (e.g. grasp poses) be $\kappa \in K$. We assume each target has robot and user cost functions $C^{rob}_\kappa$ and $C^{usr}_\kappa$, from which we compute the corresponding value and action-value functions $V_\kappa$ and $Q_\kappa$, and soft-value functions $V^{\approx}_\kappa$ and $Q^{\approx}_\kappa$. We derive the quantities for goals, $V_g$, $Q_g$, $V^{\approx}_g$, $Q^{\approx}_g$, as functions of these target functions.
6.1 Multi-Target Assistance
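For simplicity of notation, let $C_g(x, a) = C^{rob}((x, g), a, u)$, and $C_\kappa(x, a) = C^{rob}_\kappa(x, a, u)$. We assign the cost of a state-action pair to be the cost for the target with the minimum cost-to-go after this state:
$$C_g(x, a) = C_{\kappa^*}(x, a), \qquad \kappa^* = \arg\min_\kappa V_\kappa(x')$$
where $x'$ is the robot state when action $a$ is applied at $x$.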
Theorem 1
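Let $V_\kappa$ be the value function for target $\kappa$. Define the cost for the goal as above. For an MDP with deterministic transitions, the value and action-value functions $V_g$ and $Q_g$ can be computed as:
$$Q_g(x, a) = C_{\kappa^*}(x, a) + V_{\kappa^*}(x'), \qquad \kappa^* = \arg\min_\kappa V_\kappa(x')$$
$$V_g(x) = \min_\kappa V_\kappa(x)$$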
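We show how the standard value iteration algorithm, computing $Q_g$ and $V_g$ backwards, breaks down at each time step. At the final timestep $T$, we get:
$$Q^T_g(x, a) = C_g(x, a) = C_\kappa(x, a) \quad \text{for any } \kappa$$
$$V^T_g(x) = \min_a C_g(x, a) = \min_a \min_\kappa C_\kappa(x, a) = \min_\kappa V^T_\kappa(x)$$
Since $V^T_\kappa(x) = \min_a C_\kappa(x, a)$ by definition. Now, we show the recursive step:
$$Q^{t-1}_g(x, a) = C_g(x, a) + V^t_g(x') = C_{\kappa^*}(x, a) + \min_\kappa V^t_\kappa(x') = C_{\kappa^*}(x, a) + V^t_{\kappa^*}(x'), \qquad \kappa^* = \arg\min_\kappa V^t_\kappa(x')$$
$$V^{t-1}_g(x) = \min_a Q^{t-1}_g(x, a) = \min_a \left[ C_{\kappa^*}(x, a) + V^t_{\kappa^*}(x') \right] \ge \min_a \min_\kappa \left[ C_\kappa(x, a) + V^t_\kappa(x') \right] = \min_\kappa V^{t-1}_\kappa(x)$$
Additionally, we know that $V_g(x) \le \min_\kappa V_\kappa(x)$, since $V_\kappa(x)$ measures the cost-to-go for a specific target, and the total cost-to-go is bounded by this value for a deterministic system. Therefore, $V_g(x) = \min_\kappa V_\kappa(x)$.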
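The decomposition over targets can be sketched in a few lines: the goal value is the minimum target value, and the goal action-value uses the target that is cheapest after the transition. The 1-D targets, the unit-step cost, and the function names below are hypothetical choices for illustration.

```python
def goal_value(x, target_values):
    """V_g(x) = min over targets of V_kappa(x)."""
    return min(v(x) for v in target_values)


def goal_action_value(x, a, transition, target_costs, target_values):
    """Q_g(x, a) = C_k*(x, a) + V_k*(x'), with k* the target minimizing V_k(x')."""
    x_next = transition(x, a)
    best = min(range(len(target_values)), key=lambda k: target_values[k](x_next))
    return target_costs[best](x, a) + target_values[best](x_next)


# Hypothetical 1-D example: two targets, cost equal to the distance moved.
targets = [0.0, 10.0]
target_values = [lambda x, k=k: abs(x - k) for k in targets]
target_costs = [lambda x, a: abs(a) for _ in targets]
transition = lambda x, a: x + a

v = goal_value(3.0, target_values)
q_towards = goal_action_value(3.0, -1.0, transition, target_costs, target_values)
q_away = goal_action_value(3.0, 1.0, transition, target_costs, target_values)
```

Only per-target value functions are evaluated, so the cost of computing the goal-level quantities grows linearly with the number of targets.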
6.2 Multi-Target Prediction
Here, we don’t assign the goal cost to be the cost of a single target $\kappa$, but instead use a distribution over targets.
Theorem 2
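Define the probability of a trajectory and target as $p(\xi, \kappa) \propto \exp(-C_\kappa(\xi))$. Let $V^{\approx}_\kappa$ and $Q^{\approx}_\kappa$ be the soft-value functions for target $\kappa$. The soft value functions for goal $g$, $V^{\approx}_g$ and $Q^{\approx}_g$, can be computed as:
$$V^{\approx}_g(x) = \operatorname{softmin}_\kappa \, V^{\approx}_\kappa(x)$$
$$Q^{\approx}_g(x, u) = \operatorname{softmin}_\kappa \, Q^{\approx}_\kappa(x, u)$$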
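As the cost is additive along the trajectory, we can expand out $\exp(-C_\kappa(\xi))$ and marginalize over future inputs to get the probability of an input now:
$$\pi^{usr}(u^t, \kappa \mid x^t) = \frac{\exp(-C_\kappa(x^t, u^t)) \int \exp(-C_\kappa(\xi^{t+1 \to T}_{x^{t+1}}))}{\sum_{\kappa'} \int \exp(-C_{\kappa'}(\xi^{t \to T}_{x^t}))}$$
where the integrals are over all trajectories. By definition, $\exp(-V^{\approx}_{\kappa,t}(x^t)) = \int \exp(-C_\kappa(\xi^{t \to T}_{x^t}))$:
$$\pi^{usr}(u^t, \kappa \mid x^t) = \frac{\exp(-C_\kappa(x^t, u^t)) \exp(-V^{\approx}_{\kappa,t+1}(x^{t+1}))}{\sum_{\kappa'} \exp(-V^{\approx}_{\kappa',t}(x^t))} = \frac{\exp(-Q^{\approx}_{\kappa,t}(x^t, u^t))}{\sum_{\kappa'} \exp(-V^{\approx}_{\kappa',t}(x^t))}$$
Marginalizing out $\kappa$ and simplifying:
$$\pi^{usr}(u^t \mid x^t) = \frac{\sum_\kappa \exp(-Q^{\approx}_{\kappa,t}(x^t, u^t))}{\sum_\kappa \exp(-V^{\approx}_{\kappa,t}(x^t))} = \exp\left( \operatorname{softmin}_\kappa V^{\approx}_{\kappa,t}(x^t) - \operatorname{softmin}_\kappa Q^{\approx}_{\kappa,t}(x^t, u^t) \right)$$
As $V^{\approx}_{g,t}$ and $Q^{\approx}_{g,t}$ are defined such that $\pi^{usr}_t(u \mid x, g) = \exp(V^{\approx}_{g,t}(x) - Q^{\approx}_{g,t}(x, u))$, our proof is complete.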
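The soft-minimum combination over targets can be checked numerically with a small sketch. The discrete input set (so that sums replace integrals), the made-up soft action-values, and the function names are assumptions for illustration.

```python
import math


def softmin(values):
    """Return -log(sum(exp(-v))), shifted by the minimum for numerical stability."""
    m = min(values)
    return m - math.log(sum(math.exp(-(v - m)) for v in values))


def goal_soft_q(soft_q_per_target):
    """Soft Q for the goal: softmin over targets, computed separately per input."""
    n_inputs = len(soft_q_per_target[0])
    return [softmin([q[i] for q in soft_q_per_target]) for i in range(n_inputs)]


def goal_input_probabilities(soft_q_per_target):
    """pi(u | x, g) = exp(V_g - Q_g(u)) with goal-level soft values."""
    q_goal = goal_soft_q(soft_q_per_target)
    v_goal = softmin(q_goal)
    return [math.exp(v_goal - q) for q in q_goal]


# Rows are targets, columns are discrete inputs; the values are illustrative only.
soft_q = [[1.0, 2.0, 3.0], [2.5, 0.5, 1.5]]
probs = goal_input_probabilities(soft_q)
v_targets = [softmin(row) for row in soft_q]
v_goal = softmin(goal_soft_q(soft_q))
```

Taking the softmin over inputs and then over targets gives the same goal value as the reverse order, which is why the goal-level soft value can be assembled directly from the per-target soft values.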
7 User Study
We compare two methods for shared autonomy in a user study: our method, referred to as policy, and a conventional predict-then-blend approach based on Dragan and Srinivasa [7], referred to as blend.
Both systems use the same prediction algorithm, based on the formulation described in Sec. 4. For computational efficiency, we follow Dragan and Srinivasa [7] and use a second-order approximation about the optimal trajectory. They show that, assuming a constant Hessian, we can replace the difficult-to-compute soft-min functions $V^{\approx}_g$ and $Q^{\approx}_g$ with the min value and action-value functions $V^{usr}_g$ and $Q^{usr}_g$.
Our policy approach requires specifying two cost functions, $C^{usr}_\kappa$ and $C^{rob}_\kappa$, from which everything is derived. For $C^{usr}_\kappa$, we use a simple function based on the distance $d$ between the robot state $x$ and target $\kappa$:
$$C^{usr}_\kappa(x, u) = \begin{cases} \alpha & d > \delta \\ \frac{\alpha}{\delta} d & d \le \delta \end{cases}$$
That is, a linear cost near a goal ($d \le \delta$), and a constant cost otherwise. This is by no means the best cost function, but it does provide a baseline for performance. We might expect, for example, that incorporating collision avoidance into our cost function may enable better performance [26].
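A minimal sketch of such a distance-based user cost follows. The parameter names `alpha` (the constant cost far from the target) and `delta` (the radius inside which the cost becomes linear), and their default values, are assumptions for illustration rather than values reported in the study.

```python
def user_cost(distance, alpha=1.0, delta=0.1):
    """Constant cost alpha beyond delta; linear in distance within delta of the target."""
    if distance > delta:
        return alpha
    return alpha * distance / delta


far = user_cost(1.0)    # constant region
near = user_cost(0.05)  # linear region, halfway to the target
at_target = user_cost(0.0)
```

The two branches agree at `distance == delta`, so the cost is continuous, and it drops to zero exactly at the target.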
We set $C^{rob}_\kappa(x, a, u) = C^{usr}_\kappa(x, a) + (a - D(u))^2$, penalizing the robot for deviating from the user command while optimizing their cost function.
The predict-then-blend approach of Dragan and Srinivasa requires estimating how confident the predictor is in selecting the most probable goal. This confidence measure controls how autonomy and user input are arbitrated. For this, we use the distance-based measure used in the experiments of Dragan and Srinivasa [7], $\text{conf} = \max(0, 1 - \frac{d}{D})$, where $d$ is the distance to the nearest target, and $D$ is some threshold past which confidence is zero.
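The arbitration in the blend baseline can be sketched as below. The confidence follows the distance-based measure described above; the linear interpolation between the user action and the autonomous action is an assumption made for this example, as are the function names and numbers.

```python
def confidence(distance, threshold):
    """Distance-based confidence: 1 at the target, 0 at or beyond the threshold."""
    return max(0.0, 1.0 - distance / threshold)


def blend(user_action, autonomous_action, conf):
    """Assumed linear arbitration: conf = 0 is pure teleoperation, conf = 1 is full autonomy."""
    return [(1.0 - conf) * ua + conf * aa
            for ua, aa in zip(user_action, autonomous_action)]


conf_near = confidence(distance=0.25, threshold=1.0)
conf_far = confidence(distance=2.0, threshold=1.0)
blended = blend([1.0, 0.0], [0.0, 1.0], conf_near)
```

Beyond the threshold the confidence is zero, so the blended command equals the user command and no assistance is given; this is consistent with blend not assisting from a start pose that is far from every target.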
7.1 Hypotheses
Our experiments aim to evaluate the task-completion efficiency and user satisfaction of our system compared to the predict-then-blend approach. Efficiency of the system is measured in two ways: the total execution time, a common measure of efficiency in shared teleoperation [6], and the total user input, a measure of user effort. User satisfaction is assessed through a survey.
This leads to the following hypotheses:
H1 Participants using the policy method will grasp objects significantly faster than with the blend method.
H2 Participants using the policy method will grasp objects with significantly less control input than with the blend method.
H3 Participants will agree more strongly on their preference for the policy method compared to the blend method.
7.2 Experiment setup
We recruited 10 participants (9 male, 1 female), all with experience in robotics, but none with prior exposure to our system. To counterbalance individual differences of users, we chose a within-subjects design, where each user used both systems.
We set up our experiments with three objects on a table: a canteen, a block, and a cup. See Fig. 14. Users teleoperated a robot arm using two joysticks on a Razer Hydra system. The right joystick mapped to the horizontal plane, and the left joystick mapped to the height. A button on the right joystick closed the hand. Each trial consisted of moving from the fixed start pose, shown in Fig. 14, to the target object, and ended once the hand was closed.
At the start of the study, users were told they would be using two different teleoperation systems, referred to as “method1” and “method2”. Users were not provided any information about the methods. Prior to the recorded trials, users went through a training procedure: First, they teleoperated the arm directly, without any assistance or objects in the scene. Second, they grasped each object one time with each system, repeating if they failed the grasp. Third, they were given the option of additional training trials for either system if they wished.
Users then proceeded to the recorded trials. For each system, users picked up each object one time in a random order. Half of the users did all blend trials first, and half did all policy trials first. Users were told they would complete all trials for one system before the system switched, but were not told the order. However, it was obvious immediately after the first trial started, as the policy method assists from the start pose and blend does not. Upon completing all trials for one system, they were told the system would be switching, and then proceeded to complete all trials for the other system. If users failed at grasping (e.g. they knocked the object over), the data was discarded and they repeated that trial. Execution time and total user input were measured for each trial.
Upon completing all trials, users were given a short survey. For each system, they were asked for their agreement on a 1-7 Likert scale for the following statements:
“I felt in control”
“The robot did what I wanted”
“I was able to accomplish the tasks quickly”
“If I was going to teleoperate a robotic arm, I would like to use the system”
They were also asked “which system do you prefer”, where 1 corresponded to blend, 7 to policy, and 4 to neutral. Finally, they were asked to explain their choices and provide any general comments.
7.3 Results
Users were able to successfully use both systems. There were a total of two failures while using each system: once each because the user attempted to grasp too early, and once each because the user knocked the object over. These experiments were reset and repeated.
We assess our hypotheses using a significance level of , and the Benjamini–Hochberg procedure to control the false discovery rate with multiple hypotheses.
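For readers unfamiliar with the Benjamini-Hochberg procedure, a minimal sketch follows. The function name, the default `alpha`, and the p-values in the example are hypothetical and are not results from this study.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected.

    Sort the p-values, find the largest rank k with p_(k) <= alpha * k / m,
    and reject the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= alpha * rank / m:
            max_rank = rank
    rejected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[index] = True
    return rejected


# Made-up p-values, used only to illustrate the procedure.
example = benjamini_hochberg([0.001, 0.04, 0.03, 0.5])
```

In the example only the smallest p-value passes its rank-scaled threshold, so a single hypothesis is rejected even though three of the four raw p-values are below `alpha`.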
Trial times and total control input were assessed using a two-factor repeated measures ANOVA, using the assistance method and object grasped as factors. Both trial times and total control input had a significant main effect. We found that our policy method resulted in users accomplishing tasks more quickly, supporting H1. Similarly, our policy method resulted in users grasping objects with less input, supporting H2. See Fig. 15 for more detailed results.
To assess user preference, we performed a Wilcoxon paired signed-rank test on the survey question asking if they would like to use each system, and a Wilcoxon rank-sum test on the survey question of which system they prefer against the null hypothesis of no preference (value of 4). There was no evidence to support H3.
In fact, our data suggests a trend towards the opposite: that users prefer blend over policy. When asked if they would like to use the system, there was a small difference between methods (Blend: , Policy: . However, when asked which system they preferred, users expressed a stronger preference for blend (). While these results are not statistically significant according to our Wilcoxon tests and , they do suggest a trend towards preferring blend. See Fig. 16 for results for all survey questions.
We found this surprising, as prior work indicates a strong correlation between task completion time and user satisfaction, even at the cost of control authority, in both shared autonomy [7, 11] and human-robot teaming [9] settings.
As shown in Fig. 16, users agreed more strongly that they felt in control during blend. Interestingly, when asked if the robot did what they wanted, the difference between methods was less drastic. This suggests that for some users, the robot’s autonomous actions were in line with their desired motions, even though the user was not in control.
Users also commented that they had to compensate for policy in their inputs. For example, one user stated that “(policy) did things that I was not expecting and resulted in unplanned motion”. This can perhaps be alleviated with user-specific policies, matching the behavior of particular users.
Some users suggested their preferences may change with better understanding. For example, one user stated they “disliked (policy) at first, but began to prefer it slightly after learning its behavior. Perhaps I would prefer it more strongly with more experience”. It is possible that with more training, or an explanation of how policy works, users would have preferred the policy method. We leave this for future work.
7.4 Examining trajectories
Users with different preferences had very different strategies for using each system. Some users who preferred the assistance policy changed their strategy to take advantage of the constant assistance towards all goals, applying minimal input to guide the robot to the correct goal (Fig. 19). In contrast, users who preferred blending were often opposing the actions of the autonomous policy (Fig. 22). This suggests the robot was following a strategy different from their own.
8 Conclusion and Future Work
We presented a framework for formulating shared autonomy as a POMDP. Whereas most methods in shared autonomy predict a single goal, then assist for that goal (predict-then-blend), our method assists for the entire distribution of goals, enabling more efficient assistance. We utilized the MaxEnt IOC framework to infer a distribution over goals, and Hindsight Optimization to select assistance actions. We performed a user study to compare our method to a predict-then-blend approach, and found that our system enabled faster task completion with less control input. Despite this, users were mixed in their preference, trending towards preferring the simpler predict-then-blend approach.
We found this surprising, as prior work has indicated that users are willing to give up control authority for increased efficiency in both shared autonomy [7, 11] and human-robot teaming [9] settings. Given this discrepancy, we believe more detailed studies are needed to understand precisely what is causing user dissatisfaction. Our cost function could then be modified to explicitly avoid dissatisfying behavior. Additionally, our study indicates that users with different preferences interact with the system in very different ways. This suggests a need for personalized learning of cost functions for assistance.
Implicit in our model is the assumption that users do not consider assistance when providing inputs, and in particular, that they do not adapt their strategy to the assistance. We hope to relax this assumption in both prediction and assistance by extending our model to a stochastic game.
Acknowledgments
This work was supported in part by NSF GRFP No. DGE1252522, NSF Grant No. 1227495, the DARPA Autonomous Robotic Manipulation Software Track program, the Okawa Foundation, and an Office of Naval Research Young Investigator Award.
Footnotes
 In prior works where users preferred greater control authority, task completion times were indistinguishable [13].
References
[1] Daniel Aarno, Staffan Ekvall, and Danica Kragic. Adaptive virtual fixtures for machine-assisted teleoperation tasks. In IEEE International Conference on Robotics and Automation, 2005.
[2] Peter Aigner and Brenan J. McCarragher. Human integration into robot control utilising potential fields. In IEEE International Conference on Robotics and Automation, 1997.
[3] Saleema Amershi, Maya Cakmak, W. Bradley Knox, and Todd Kulesza. Power to the people: The role of humans in interactive machine learning. AI Magazine, 2014.
[4] Tirthankar Bandyopadhyay, Kok Sung Won, Emilio Frazzoli, David Hsu, Wee Sun Lee, and Daniela Rus. Intention-aware motion planning. In Workshop on the Algorithmic Foundations of Robotics, 2012.
[5] Edwin K. P. Chong, Robert L. Givan, and Hyeong Soo Chang. A framework for simulation-based network control via hindsight optimization. In IEEE Conference on Decision and Control, 2000.
[6] Jacob W. Crandall and Michael A. Goodrich. Characterizing efficiency on human robot interaction: a case study of shared–control teleoperation. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2002.
[7] Anca Dragan and Siddhartha Srinivasa. A policy blending formalism for shared control. The International Journal of Robotics Research, May 2013.
[8] Andrew H. Fagg, Michael Rosenstein, Robert Platt, and Roderic A. Grupen. Extracting user intent in mixed initiative teleoperator control. In Proceedings of the American Institute of Aeronautics and Astronautics Intelligent Systems Technical Conference, 2004.
[9] Matthew Gombolay, Reymundo Gutierrez, Giancarlo Sturla, and Julie Shah. Decision-making authority, team efficiency and human worker satisfaction in mixed human-robot teams. In Robotics: Science and Systems, 2014.
[10] Andrew Guillory and Jeff Bilmes. Simultaneous learning and covering with adversarial noise. In International Conference on Machine Learning, 2011.
[11] Kris K. Hauser. Recognition, prediction, and planning for assisted teleoperation of freeform tasks. Autonomous Robots, 35, 2013.
[12] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101, 1998.
[13] Dae-Jin Kim, Rebekah Hazlett-Knudsen, Heather Culver-Godfrey, Greta Rucks, Tara Cunningham, David Portee, John Bricout, Zhao Wang, and Aman Behal. How autonomy impacts performance and satisfaction: Results from a study with spinal cord injured subjects using an assistive robot. IEEE Transactions on Systems, Man, and Cybernetics, Part A, 42, 2012.
[14] Jonathan Kofman, Xianghai Wu, Timothy J. Luu, and Siddharth Verma. Teleoperation of a robot manipulator using a vision-based human-robot interface. IEEE Transactions on Industrial Electronics, 2005.
[15] Hema Koppula and Ashutosh Saxena. Anticipating human activities using object affordances for reactive robotic response. In Robotics: Science and Systems, 2013.
[16] Michael Koval, Nancy Pollard, and Siddhartha Srinivasa. Pre- and post-contact policy decomposition for planar contact manipulation under uncertainty. In Robotics: Science and Systems, 2014.
[17] Hanna Kurniawati, David Hsu, and Wee Sun Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Robotics: Science and Systems, 2008.
[18] Michael L. Littman, Anthony R. Cassandra, and Leslie Pack Kaelbling. Learning policies for partially observable environments: Scaling up. In International Conference on Machine Learning, 1995.
[19] Owen Macindoe, Leslie Pack Kaelbling, and Tomás Lozano-Pérez. POMCoP: Belief space planning for sidekicks in cooperative games. In Artificial Intelligence and Interactive Digital Entertainment Conference, 2012.
[20] Truong-Huy Dinh Nguyen, David Hsu, Wee-Sun Lee, Tze-Yun Leong, Leslie Pack Kaelbling, Tomás Lozano-Pérez, and Andrew Haydn Grant. CAPIR: Collaborative action planning with intention recognition. In Artificial Intelligence and Interactive Digital Entertainment Conference, 2011.
[21] Richard D. Smallwood and Edward J. Sondik. The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21, 1973.
[22] Alan Fern and Prasad Tadepalli. A computational decision theory for interactive assistants. In Neural Information Processing Systems, 2010.
[23] Zhikun Wang, Katharina Mülling, Marc Peter Deisenroth, Heni Ben Amor, David Vogt, Bernhard Schölkopf, and Jan Peters. Probabilistic movement modeling for intention inference in human-robot interaction. The International Journal of Robotics Research, 2013.
[24] Sung Wook Yoon, Alan Fern, Robert Givan, and Subbarao Kambhampati. Probabilistic planning via determinization in hindsight. In AAAI Conference on Artificial Intelligence, 2008.
[25] Sungwook Yoon, Alan Fern, and Robert Givan. FF-Replan: A baseline for probabilistic planning. In International Conference on Automated Planning and Scheduling, 2007.
[26] Erkang You and Kris Hauser. Assisted teleoperation strategies for aggressively controlling a robot arm with 2D input. In Robotics: Science and Systems, 2011.
[27] Wentao Yu, Redwan Alqasemi, Rajiv V. Dubey, and Norali Pernalete. Telemanipulation assistance based on motion intention recognition. In IEEE International Conference on Robotics and Automation, 2005.
[28] Brian D. Ziebart, Andrew Maas, J. Andrew (Drew) Bagnell, and Anind Dey. Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence, July 2008.
[29] Brian D. Ziebart, Nathan Ratliff, Garratt Gallagher, Christoph Mertz, Kevin Peterson, J. Andrew (Drew) Bagnell, Martial Hebert, Anind Dey, and Siddhartha Srinivasa. Planning-based prediction for pedestrians. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2009.
[30] Brian D. Ziebart, Anind Dey, and J. Andrew (Drew) Bagnell. Probabilistic pointing target prediction via inverse optimal control. In International Conference on Intelligent User Interfaces, 2012.