Planning to Give Information in Partially Observed Domains with a Learned Weighted Entropy Model
Abstract
In many real-world robotic applications, an autonomous agent must act within and explore a partially observed environment that is unobserved by its human teammate. We consider such a setting in which the agent can, while acting, transmit declarative information to the human that helps them understand aspects of this unseen environment. Importantly, we should expect the human to have preferences about what information they are given and when they are given it. In this work, we adopt an information-theoretic view of the human’s preferences: the human scores a piece of information as a function of the induced reduction in weighted entropy of their belief about the environment state. We formulate this setting as a POMDP and give a practical algorithm for solving it approximately. Then, we give an algorithm that allows the agent to sample-efficiently learn the human’s preferences online. Finally, we describe an extension in which the human’s preferences are time-varying. We validate our approach experimentally in two planning domains: a 2D robot mining task and a more realistic 3D robot fetching task.
Rohan Chitnis, Leslie Pack Kaelbling, Tomás Lozano-Pérez
MIT Computer Science and Artificial Intelligence Laboratory
{ronuchit, lpk, tlp}@mit.edu
1 Introduction
As autonomous agents become increasingly capable of completing tasks at human levels of performance, it will be common to see such agents dispatched in partially observed environments considered unsafe or undesirable for humans to explore. For instance, a search-and-rescue robot may be tasked with exploring the site of a natural disaster for trapped victims, or a deep-sea navigation robot may operate in highly pressurized underwater settings to gather data about marine life. In such settings, a human collaborator may be unable to observe the environment that the agent is acting in, and instead must gather knowledge on the basis of information periodically transmitted by the agent. As the agent takes actions and receives observations in its partially observed environment, it should be able to transmit relevant information that helps the human gain insight into the true environment state.
We treat this as a sequential decision-making problem where the agent can, on each timestep, choose to transmit information to the human as it acts in the environment. An important consideration is that the human will have preferences about what information the agent gives. One could imagine only wanting information pertaining to certain entities in the environment: in the search-and-rescue setting, the human would want to be notified when the robot encounters a victim, but not every time it encounters a pile of rubble. To model this, we suppose the human gives the agent a score on each timestep based on the transmitted information (if any). The agent’s objective is to act optimally in the environment and, conditioned on this, to give information such that the total score from the human over the trajectory is maximized. We begin by formulating this problem setting as a POMDP and giving a practical algorithm for solving it approximately.
Then, we model the human’s score function information-theoretically. First, we suppose that the human maintains a belief state, a probability distribution over the set of possible environment states. This belief gets updated whenever information is received from the agent. Next, we let the score for a given piece of information be a function of the reduction in weighted entropy induced by the belief update. This weighting is crucial: it captures the intuition described earlier that the human has preferences over which entities in the environment they should be informed about.
We give an algorithm that allows the agent to sample-efficiently learn the human’s preferences online through exploration, assuming the score function follows this information-theoretic model. Here, online learning is important: the agent must explore giving a variety of information to the human in order to learn the human’s preferences. Afterward, we describe an extension of this setting in which the human’s preferences are time-varying. We validate our approach experimentally in two planning domains: a 2D robot mining task and a more realistic 3D robot fetching task. Our results demonstrate the flexibility of our model and show that our approach is feasible in practice.
2 Related Work
The problem we consider in this work, information-giving as a sequential decision task for reducing the entropy of the human’s belief, has connections to many related problems in human-robot interaction. Our work is unique in its use of a weighted measure of entropy to capture the varying degrees of importance of the entities in the environment.
Information-theoretic perspective on belief updates. The idea of taking actions that lower the entropy of a belief state (in partially observed environments) has been studied for decades. Originally, this idea was applied to robot navigation [Cassandra et al., 1996] and localization [Burgard et al., 1997]. More recently, it has also been used in human-robot interaction settings [Deits et al., 2013, Tellex et al., 2013]: the robot asks the human clarifying questions about its environment to lower the entropy of its own belief, which helps it plan more safely and robustly. By contrast, in our method the robot is concerned with estimating the entropy of the human’s belief, as in work by Roy et al. [2000]. Also, we use a weighted measure of entropy, so that not all world states are equally important.
Estimating the human’s mental state. Having a robot make decisions based on its current estimate of the human’s preferences has been studied in human-robot collaborative settings [Devin and Alami, 2016, Lemaignan et al., 2017, Trafton et al., 2013]. In these methods, the robot first estimates the human’s belief about the world state and goal, then uses this information to build a human-aware policy for the collaborative task. Explicitly representing the human’s belief also allows the robot to exhibit other desirable behaviors. For instance, it can plan to signal its intentions in order to avoid surprising the human, or it can do perspective-taking, in which the robot incorporates the human’s visual perspective of a scene (e.g., which objects are occluded from their view) in its decision-making.
Modeling user preferences with active learning. The idea of using active learning to understand the human’s preferences has received attention recently [Racca and Kyrki, 2018, Sadigh et al., 2017]. Typically in these methods, the agent gathers information from the human through some channel (e.g., pairs of states with the human-preferred one marked, or answers to queries issued by the robot), uses this information to estimate a reward function, and acts based on this estimated reward. Our method for learning the human’s preferences online works in a similar way, but we assume the score function has an information-theoretic structure, which makes learning efficient.
Explainable policies. Our work can be viewed through the lens of optimizing a policy for explainability, based on the human’s preferences. Much prior work has been devoted to this area. Hayes and Shah [2017] develop a system that answers queries about the policy, by using graph search in the induced MDP to determine states matching the query. Huang et al. [2017] use algorithmic teaching to allow a robot to communicate goals. They build an approximate-inference model of how humans learn from watching trajectories of optimal robot behavior. In contrast to these methods, our approach adaptively gives information using a learned entropy-based model of the human’s preferences.
3 Background
3.1 Weighted Entropy and Information Gain
Weighted entropy is a generalization of Shannon entropy that was first presented and analyzed by Guiaşu [1971]. We give an overview and basic intuition in this section, and refer the interested reader to the original work for more details. The entropy of a (discrete) probability distribution $p$, given by $S(p) = -\sum_i p_i \log p_i$, is a measure of the expected amount of information carried by samples from the distribution, and can also be viewed as a measure of the distribution’s uncertainty. Thus, a Kronecker delta function has zero entropy, while a uniform distribution on a bounded set has maximum entropy. The information gain in going from a distribution $p$ to another $p'$ is $S(p) - S(p')$.
Definition 1
The weighted entropy of a (discrete) probability distribution $p$ is given by:
$$S_w(p) = -\sum_i w_i \, p_i \log p_i,$$
where all $w_i \geq 0$. The weighted information gain from distribution $p$ to another $p'$ is $S_w(p) - S_w(p')$.
When all $w_i$ are equal, the original definition of Shannon entropy is recovered (to within a scaling factor). Weighted entropy captures the intuition that in some settings, one may want certain outcomes of the distribution to have more impact on the computation of its uncertainty. Of course, the earlier interpretation of entropy as the expected amount of information carried by samples has been lost.
Intuition. Figure 1 helps give intuition about weighted entropy by plotting it for the case of a 3-outcome distribution. In the figure, we only let $p_1$ vary freely and set $p_2 = p_3 = (1 - p_1)/2$, so that the plot can be visualized in two dimensions. When only one outcome is possible ($p_1 = 1$), the entropy is always 0 regardless of the setting of weights, but as $p_1$ approaches 1 from the left, the entropy drops off more quickly the higher $w_1$ is (relative to $w_2$ and $w_3$). If all weight is placed on $w_1$ (the orange curve), then when $p_1 = 0$ the entropy also goes to 0, because the setting of weights conveys that distinguishing between outcomes 2 and 3 gives no information. However, if no weight is placed on $w_1$ (the green curve), then when $p_1 = 0$ we have $p_2 = p_3 = 0.5$, and the entropy is high because the setting of weights conveys that all of the uncertainty lies in telling outcomes 2 and 3 apart.
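The intuition above can be checked numerically. The following is a minimal Python sketch of weighted entropy and weighted information gain; the function names and the two weight vectors are illustrative assumptions, not taken from the authors’ implementation. The three-outcome distribution lets the first probability vary and splits the remainder equally between the other two outcomes.

```python
import math

def weighted_entropy(p, w):
    """Weighted entropy -sum_i w_i p_i log p_i, with 0 * log 0 taken as 0."""
    if len(p) != len(w):
        raise ValueError("p and w must have the same length")
    return sum(-wi * pi * math.log(pi) for pi, wi in zip(p, w) if pi > 0.0)

def weighted_information_gain(p_old, p_new, w):
    """Reduction in weighted entropy when moving from p_old to p_new."""
    return weighted_entropy(p_old, w) - weighted_entropy(p_new, w)

def three_outcome(p1):
    """Three-outcome distribution: p1 varies, the other two split the rest."""
    rest = (1.0 - p1) / 2.0
    return [p1, rest, rest]

# Illustrative weight settings (assumptions for this sketch).
all_on_first = [1.0, 0.0, 0.0]    # all weight on outcome 1
none_on_first = [0.0, 1.0, 1.0]   # no weight on outcome 1

# With p1 = 0, only outcomes 2 and 3 are possible.
low = weighted_entropy(three_outcome(0.0), all_on_first)    # zero: 2 vs 3 carries no weight
high = weighted_entropy(three_outcome(0.0), none_on_first)  # log 2: all uncertainty is 2 vs 3

# Learning that outcome 1 is certain removes all weighted uncertainty.
gain = weighted_information_gain(three_outcome(0.0), three_outcome(1.0), none_on_first)
```

With equal weights the function reduces to Shannon entropy (in nats) up to the common scaling factor.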
3.2 Partially Observable Markov Decision Processes and Belief States
Our work considers agent-environment interaction in the presence of uncertainty, which is often formalized as a partially observable Markov decision process (POMDP) [Kaelbling et al., 1998]. An undiscounted POMDP is a tuple $\langle \mathcal{S}, \mathcal{A}, \Omega, T, O, R \rangle$: $\mathcal{S}$ is the state space; $\mathcal{A}$ is the action space; $\Omega$ is the observation space; $T(s, a, s') = P(s' \mid s, a)$ is the transition distribution with $s, s' \in \mathcal{S}$, $a \in \mathcal{A}$; $O(s, o) = P(o \mid s)$ is the observation model with $s \in \mathcal{S}$, $o \in \Omega$; and $R(s, a, s')$ is the reward function with $s, s' \in \mathcal{S}$, $a \in \mathcal{A}$. Some states in $\mathcal{S}$ are said to be terminal, ending the episode. The agent’s objective is to maximize its overall expected reward, $\mathbb{E}\left[\sum_t R(s_t, a_t, s_{t+1})\right]$. A solution to a POMDP is a policy that maps the history of observations and actions to the next action to take. Some popular approaches for generating policies in POMDPs are online planning [Silver and Veness, 2010, Somani et al., 2013, Bonet and Geffner, 2000] and finding a policy offline with a point-based solver [Kurniawati et al., 2008, Pineau et al., 2003].
The sequence of states is not seen by the agent, so it must instead maintain a belief state, a probability distribution over the space of possible states. This belief is updated on each timestep based on the received observation and taken action. Representing the full distribution exactly is prohibitively expensive for even moderately-sized POMDPs, so a typical alternative approach is to use a factored representation. Here, we assume the state can be decomposed into variables (features), each with a value; the factored belief then maps each variable to a distribution over potential values.
A Markov decision process (MDP) is a simplification of a POMDP where the states are fully observed by the agent, so the observation space $\Omega$ and the observation model $O$ are not needed. The objective remains the same.
4 General Problem Setting
In this section, we formulate our problem setting as a pomdp from the agent’s perspective, then give a practical algorithm for solving it approximately. There are three entities at play: the agent (robot), the partially observed environment, and the human. At each timestep, the agent takes an action in the environment and chooses a piece of information (or null if it chooses not to give any) to transmit to the human based on its current belief about the environment. Figure 2 shows one timestep of activity.
Assumption. Our problem formulation will make an assumption that the human’s belief state about the environment is fully observed by the agent. Alternatively, we can say that the agent knows 1) the human’s initial belief, 2) that the human makes Bayesian belief updates when given information, and 3) that only this information can induce updates. We make this assumption in order to focus on learning preferences; the belief state is an objective measure of world state probabilities computed from the information, whereas the score function is what enables the human to show unique preferences.
For ease of presentation, we begin by supposing the environment is fully observed, then afterward show how the formulation extends to a partially observed environment. Let $\langle \mathcal{S}, \mathcal{A}, T, R \rangle$ be an MDP that describes the agent-environment interaction. $\mathcal{S}$ can be continuous or discrete. The human maintains a belief state $b^H$ over $\mathcal{S}$, updated based only on information transmitted by the agent, and gives the agent a score on each timestep for this information. This score can be any real number. We model the human as a tuple $\langle \mathcal{B}^H, \mathcal{I}, \Psi, b^H_0, R^H \rangle$: $\mathcal{B}^H$ is the space of all belief states over $\mathcal{S}$; $\mathcal{I}$ is the space of information that the agent can transmit (defined by the human); $\Psi(i, s) = P(i \mid s)$ is the information model with $i \in \mathcal{I}$, $s \in \mathcal{S}$; $b^H_0 \in \mathcal{B}^H$ is the human’s initial belief; and $R^H(b^H_t, b^H_{t+1})$ is the human’s score for information $i_t$, a function of the belief update induced by $i_t$, with $b^H_t, b^H_{t+1} \in \mathcal{B}^H$. The Bayesian belief update equation is $b^H_{t+1}(s) = \Psi(i_t, s) \, b^H_t(s) / \sum_{s'} \Psi(i_t, s') \, b^H_t(s')$. Note that the information model $\Psi$ gives our formulation the ability to capture noise in the transmission of information.
We define the agent’s objective as follows: to act optimally in the environment (maximizing the expected sum of rewards $R$) and, conditioned on this, to give information such that the sum of the human’s scores $R^H$ over the trajectory is maximized. Like the human, the agent maintains its own belief state over $\mathcal{S}$, updated based on its own interactions with the environment.
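The Bayesian belief update described above can be sketched in a few lines of Python for a discrete state space. The function name, the dictionary representation, and the one-location example (including the noiseless information model) are assumptions made for illustration only.

```python
def bayes_update(belief, info, info_model):
    """Posterior over states after the human receives `info`.

    belief: dict mapping each state to its prior probability.
    info_model: function (info, state) -> P(info | state).
    """
    unnormalized = {s: info_model(info, s) * p for s, p in belief.items()}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("information has zero probability under the current belief")
    return {s: v / total for s, v in unnormalized.items()}

# Hypothetical example: what occupies a single grid cell?
prior = {"gold": 1 / 3, "quartz": 1 / 3, "empty": 1 / 3}

def noiseless_not_gold(info, state):
    """Assumed noiseless model: 'NotAt(gold)' is never sent when gold is present."""
    return 0.0 if state == "gold" else 1.0

posterior = bayes_update(prior, "NotAt(gold)", noiseless_not_gold)
# Gold is ruled out and the remaining mass is split between quartz and empty.
```

A noisy channel would be modeled by an information model that returns probabilities strictly between 0 and 1, in which case the update shifts mass without ruling states out.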
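The full MDP for this setting (from the agent’s perspective) is a tuple $\mathcal{M} = \langle \bar{\mathcal{S}}, \bar{\mathcal{A}}, \bar{T}, \bar{R} \rangle$:
$\bar{\mathcal{S}} = \mathcal{S} \times \mathcal{B}^H$. A state is a pair of environment state $s$ and human’s belief $b^H$.
$\bar{\mathcal{A}} = \mathcal{A} \times \mathcal{I}$. An action is a pair of environment action $a$ and transmitted information $i$. We require the information to be consistent with the agent’s belief about the environment.
$\bar{T}(\langle s, b^H \rangle, \langle a, i \rangle, \langle s', b^{H\prime} \rangle)$ equals $T(s, a, s')$ if $b^{H\prime}$ is the Bayesian update of $b^H$ given $i$, else 0.
$\bar{R}$, the reward, is a pair $\langle R(s, a, s'), R^H(b^H, b^{H\prime}) \rangle$ with the comparison operation $\langle r_1, r^H_1 \rangle > \langle r_2, r^H_2 \rangle \iff r_1 > r_2 \lor (r_1 = r_2 \land r^H_1 > r^H_2)$, and similar for $<$. This operation makes maximizing the expected sum of rewards correctly optimize the objective.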
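The pairwise reward and its comparison operation amount to a lexicographic ordering: environment reward first, human score as the tie-breaker. A minimal Python sketch follows, relying on the fact that Python tuples already compare lexicographically. The class name, field names, and the numbers are hypothetical.

```python
from typing import NamedTuple

class PairReward(NamedTuple):
    env: float    # reward from the environment
    score: float  # score from the human

def total(rewards):
    """Component-wise sum of a sequence of pair rewards."""
    return PairReward(sum(r.env for r in rewards), sum(r.score for r in rewards))

# Three hypothetical two-step trajectories.
plan_a = total([PairReward(-1.0, 0.0), PairReward(-5.0, 3.0)])
plan_b = total([PairReward(-1.0, 2.0), PairReward(-5.0, 3.0)])
plan_c = total([PairReward(-1.0, 9.0), PairReward(-6.0, 9.0)])

# plan_c has the largest human score but a worse environment reward, so it loses;
# plan_a and plan_b tie on environment reward and the human score decides.
best = max([plan_a, plan_b, plan_c])
```

Note that the explicit `total` helper is needed because `+` on tuples concatenates rather than adds component-wise.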
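Partially observed environment. If the agent-environment interaction is instead described by a POMDP $\langle \mathcal{S}, \mathcal{A}, \Omega, T, O, R \rangle$, then the full problem becomes a POMDP $\langle \bar{\mathcal{S}}, \bar{\mathcal{A}}, \bar{\Omega}, \bar{T}, \bar{O}, \bar{R} \rangle$, where: $\bar{\mathcal{S}}$, $\bar{\mathcal{A}}$, $\bar{T}$, and $\bar{R}$ are the same as in the fully observed case; $\bar{\Omega} = \Omega$; $\bar{O}(\langle s, b^H \rangle, o) = O(s, o)$; and the portion of the state corresponding to the human’s belief is still fully observed. Going forward, the notation $\mathcal{M}$ will refer to this POMDP.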
Practical approximation algorithm. Note that this POMDP has a continuous state space (the human’s belief is part of the state) and can thus be hard to solve optimally in non-trivial domains. Instead, we leverage the structure of the objective to design a practical determinize-and-replan strategy [Platt Jr. et al., 2010, Hadfield-Menell et al., 2015] that does not have optimality guarantees but often works well in practice: determinize the POMDP, then decompose the task into solving for a plan for acting in the environment and (conditioned on this acting plan) a plan for giving information to the human. See Algorithm 1 for pseudocode. This procedure is repeated any time the optimistic assumptions are found to have been violated.
Line 1 determinizes the agent-environment portion of the POMDP and solves it to produce an acting plan, which crucially contains no branches. Line 2 generates the sequence of the agent’s beliefs induced by this plan; if the plan had branches, the induced beliefs would form a tree, and the search process would be too computationally expensive. Line 4 uses this sequence to figure out what information the agent could legally give to the human at each timestep; information must be consistent with the agent’s belief.
Then, the algorithm constructs a directed acyclic graph (DAG) whose nodes are tuples of (human belief, timestep). An edge exists between $(b^H, t)$ and $(b^{H\prime}, t+1)$ iff the agent can legally give some information $i$ at timestep $t$ that causes the belief update from $b^H$ to $b^{H\prime}$; the edge weight is the human’s score for $i$. The longest weighted path through this DAG is precisely the score-maximizing information-giving plan. In our implementation, we do not actually build the full DAG: we prune the search for the longest weighted path using domain-specific heuristics.
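Because every edge advances the timestep by one, the longest weighted path can be found with a layer-by-layer dynamic program. The Python sketch below illustrates that idea without any pruning; it is not the authors’ Algorithm 1. The function signature, the use of a set of known facts as a stand-in for the human’s belief, and the toy scores are all assumptions.

```python
def best_info_plan(initial_belief, valid_info_per_step, update, score):
    """Return (total score, information sequence) of the highest-scoring path.

    Nodes are (belief, timestep) pairs; beliefs must be hashable. Each entry of
    valid_info_per_step lists the information the agent may legally give at
    that timestep and must include None (give nothing).
    """
    # Maps each belief reachable at the current timestep to its best (score, plan).
    layer = {initial_belief: (0.0, [])}
    for valid_info in valid_info_per_step:
        next_layer = {}
        for belief, (running, plan) in layer.items():
            for info in valid_info:
                new_belief = update(belief, info)
                candidate = running + score(belief, new_belief, info)
                best_so_far = next_layer.get(new_belief)
                if best_so_far is None or candidate > best_so_far[0]:
                    next_layer[new_belief] = (candidate, plan + [info])
        layer = next_layer
    return max(layer.values(), key=lambda entry: entry[0])

# Toy stand-in: the "belief" is the set of facts the human already knows.
VALUES = {"a": 1.0, "b": 3.0}  # hypothetical score for each new fact

def toy_update(known, info):
    return known if info is None else known | {info}

def toy_score(known, new_known, info):
    if info is None:
        return 0.0
    return VALUES[info] if info not in known else -1.0  # repeating a fact is penalized

total_score, plan = best_info_plan(
    frozenset(), [[None, "a"], [None, "a", "b"]], toy_update, toy_score)
```

In this toy instance fact "a" is only worth giving at the first timestep, because the second timestep is better spent on "b". The number of distinct beliefs per layer can grow quickly, which is why pruning heuristics matter in practice.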
5 Information-Theoretic Score Function
In this section, we model the human’s score function information-theoretically, then give an algorithm by which the agent can learn the human’s preferences online.
5.1 Model
We model the human’s score function $R^H$ as some function $f$ of the weighted information gain (Section 3.1) of the induced belief update. This update occurs at each timestep based on the information $i_t$, which could be null if the agent chooses not to give information. Thus, we have:
$$R^H(b^H_t, b^H_{t+1}) = f\left(S_w(b^H_t) - S_w(b^H_{t+1})\right) = f\left(\sum_s w_s \, b^H_{t+1}(s) \log b^H_{t+1}(s) - \sum_s w_s \, b^H_t(s) \log b^H_t(s)\right),$$
where $S_w$ denotes weighted entropy and the $w_s$ are the weights in its calculation. Note that the range of $f$ is $\mathbb{R}$, the real numbers. We begin our discussion with fixed $f$ and $w$, then explore the time-varying setting.
Assumptions. This model introduces two assumptions. 1) The human’s belief $b^H$, which ideally is over the environment state space $\mathcal{S}$, must be over a discrete space in order to use weighted entropy. If $\mathcal{S}$ is continuous, the human can achieve this by making a discrete abstraction of $\mathcal{S}$ and maintaining $b^H$ over this abstraction instead. Note that replacing the summation with integration in the formula for entropy is not valid for continuous distributions, because the interpretation of entropy as a measure of uncertainty gets lost: for instance, the integral can be negative. 2) If the belief is factored, we calculate the total entropy by summing the entropy of each factored distribution. This is an upper bound on the true entropy, arising from an assumption of independence among the factors.
Motivation. Assuming structure in the form of the score function $R^H$ makes it easier for the agent to learn the human’s preferences. The principle of weighted entropy is particularly compelling as a structural choice. Recall that the human’s belief state represents their perceived likelihood of each possible environment state in $\mathcal{S}$ (or its discrete abstraction). It is reasonable to expect that the human would care more about certain states than others. For instance, in the natural disaster setting, states in which trapped victims exist are particularly important. Each term in the entropy formula corresponds to an environment state, so the weights $w_s$ allow the human to encode preferences over which states are important.
Interpretation of $f$. Different choices of $f$ allow the human to exhibit various preferences. Choosing $f$ as identity means that the human wants the agent to greedily transmit the (valid) piece of information that maximizes the weighted information gain at each timestep. The human may instead prefer for $f$ to impose a threshold: if the gain is smaller than this threshold, then $f$ could return a negative score to penalize the agent for not being sufficiently informative (with respect to the weights $w$). A sublinear $f$ rewards the agent for splitting up information into subparts and transmitting it over multiple timesteps, while a superlinear $f$ rewards the agent for giving maximally informative statements on single timesteps.
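For the unweighted (Shannon) case with two factors $X$ and $Y$, the upper-bound claim in assumption 2 follows from the non-negativity of mutual information. The symbols $H$ and $I$ below are introduced only for this illustration:

```latex
\begin{align*}
H(X, Y) &= H(X) + H(Y) - I(X; Y), \\
I(X; Y) &\geq 0 \quad\Longrightarrow\quad H(X, Y) \leq H(X) + H(Y),
\end{align*}
```

with equality exactly when $X$ and $Y$ are independent, which is the independence assumption the factored sum makes.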
5.2 Learning Preferences Online
We now give an algorithm that allows the agent to sample-efficiently learn the human’s preferences online through exploration, using this information-theoretic score function. See Algorithm 2 for pseudocode.
For improved online learning we take inspiration from Deep Q-Networks [Mnih et al., 2015] and store transitions in a replay buffer, which breaks temporal dependencies in the minibatched data used for gradient descent. In Line 8, the agent explores the human’s preferences using an $\epsilon$-greedy policy that gives a random valid piece of information with probability $\epsilon$ and otherwise follows a policy that solves the POMDP (Section 4) under the current estimates $\hat{f}$ and $\hat{w}$. We use an exponentially decaying $\epsilon$ that starts at 1 and goes to 0 as the estimates $\hat{f}$ and $\hat{w}$ become more reliable. Also in Line 8, the agent receives a noisy model target (the human’s score) to use as a supervision signal. Our experiments implement this by having the human add sampled noise to their weighted information gain, before applying $f$. The loss for a minibatch is the mean squared error (MSE) between the predictions and these noisy targets.
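The ingredients above (a replay buffer, $\epsilon$-greedy exploration with exponential decay, and minibatch gradient descent on the MSE) can be sketched in plain Python. This is a simplified illustration, not the authors’ Algorithm 2: it assumes the score function is the identity so that only the weights are learned, it replaces the planner with a choice among a few random candidate belief updates, and the true weights, noise level, decay rate, and learning rate are all made-up values.

```python
import math
import random

def entropy_terms(p):
    """Per-outcome terms -p_i log p_i of the entropy sum."""
    return [-x * math.log(x) if x > 0.0 else 0.0 for x in p]

def gain_features(p_old, p_new):
    """Per-outcome entropy reduction; the weighted gain is its dot product with w."""
    return [a - b for a, b in zip(entropy_terms(p_old), entropy_terms(p_new))]

def predict(w, feats):
    return sum(wi * gi for wi, gi in zip(w, feats))

def mse(w, data):
    return sum((predict(w, g) - y) ** 2 for g, y in data) / len(data)

def sgd_step(w, batch, lr):
    """One gradient step on the minibatch MSE, keeping every weight non-negative."""
    grad = [0.0] * len(w)
    for feats, target in batch:
        err = predict(w, feats) - target
        for k, gk in enumerate(feats):
            grad[k] += 2.0 * err * gk / len(batch)
    return [max(0.0, wi - lr * gk) for wi, gk in zip(w, grad)]

def epsilon(episode, rate=0.3):
    """Exponentially decaying exploration probability, equal to 1 at episode 0."""
    return math.exp(-rate * episode)

def random_belief(rng, n):
    raw = [rng.random() + 1e-3 for _ in range(n)]
    norm = sum(raw)
    return [x / norm for x in raw]

rng = random.Random(0)
true_w = [2.0, 0.5, 0.0]   # hypothetical human weights, unknown to the agent
w_hat = [1.0, 1.0, 1.0]    # agent's initial estimate
replay = []                # replay buffer of (features, noisy score) pairs

for episode in range(20):
    for _ in range(25):
        prior = random_belief(rng, 3)
        # Candidate belief updates stand in for the valid pieces of information.
        candidates = [gain_features(prior, random_belief(rng, 3)) for _ in range(4)]
        if rng.random() < epsilon(episode):
            feats = rng.choice(candidates)                            # explore
        else:
            feats = max(candidates, key=lambda g: predict(w_hat, g))  # exploit
        noisy_score = predict(true_w, feats) + rng.gauss(0.0, 0.01)   # human's feedback
        replay.append((feats, noisy_score))
        batch = rng.sample(replay, min(32, len(replay)))
        w_hat = sgd_step(w_hat, batch, lr=1.0)
```

With an identity score function the prediction is linear in the weights, which is what makes this toy version learnable by plain gradient descent; a learned nonlinear $f$ would require a differentiable model in place of `predict`.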
5.3 Extensions: Time-Varying Preferences and Incorporating History into the Score Function
It is easy to extend our approach to a setting where the human’s preferences are time-varying. This is an important and realistic setting to consider: preferences are always changing, and the information the human wants to receive today may not be the same tomorrow. Algorithm 2 needs to do two things every time $f$ or $w$ changes. First, the exploration probability $\epsilon$ must be reset appropriately so that the agent is able to explore and discover information that the human now finds interesting. Second, the replay buffer should either be emptied (as in our experiments) or have its contents downweighted.
Another important extension is a setting where the human can score information based on not only their weighted information gain, but also the history of transmitted information. This would allow, for instance, the human to reward the agent for exhibiting stylistic variety in the information it transmits. To allow the score to depend on information transmitted in a fixed number of preceding timesteps, states in the POMDP (Section 4) must be augmented with this history so it can be used to calculate the score, and Algorithm 2 must store this history into the replay buffer so Line 11 can pass it into the learned model.
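The two reset steps can be captured in a small bookkeeping object. The Python sketch below is illustrative only: the class, its method names, the decay constant, and the assumption that the agent is told when preferences change are all hypothetical.

```python
import math

class ExplorationState:
    """Bookkeeping that must be reset when the human's preferences change."""

    def __init__(self, decay_rate=0.3):
        self.decay_rate = decay_rate      # hypothetical decay constant
        self.episodes_since_reset = 0
        self.replay = []                  # list of (sample, weight) pairs

    @property
    def epsilon(self):
        """Exploration probability: 1 right after a reset, decaying afterward."""
        return math.exp(-self.decay_rate * self.episodes_since_reset)

    def end_episode(self):
        self.episodes_since_reset += 1

    def add(self, sample):
        self.replay.append((sample, 1.0))

    def on_preference_change(self, downweight=None):
        """Restart exploration; empty the buffer, or scale old sample weights instead."""
        self.episodes_since_reset = 0
        if downweight is None:
            self.replay.clear()
        else:
            self.replay = [(s, w * downweight) for s, w in self.replay]

state = ExplorationState()
state.add("old sample")
for _ in range(5):
    state.end_episode()
epsilon_before = state.epsilon     # has decayed below 1
state.on_preference_change()       # exploration restarts and the buffer is emptied
```

Passing `downweight=0.5`, say, keeps old samples but halves their weight, which a weighted loss could then use to favor feedback collected after the change.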
6 Experiments
We show results for three settings of the function $f$: identity, square, and natural logarithm. All three use a threshold: if the argument is less than the threshold, then $f$ returns a negative score, penalizing the agent. Also, if the information is null (agent did not give information), $f$ returns 0. The threshold is domain-specific and should be chosen based on the weights $w$ and type of information transmitted; our experiments use a fixed value. Section 5.1 describes how these different choices of $f$ should be expected to impact the agent’s information-giving policy; note that the squared $f$ is superlinear and the logarithmic $f$ is sublinear.
We implemented the models for $f$ and $w$ (see Algorithm 2) in TensorFlow [Abadi et al., 2016] as fully connected networks with hidden layer sizes [100, 50], embedded within a differentiable module that computes the human’s score according to the equation in Section 5.1. The input to the module is the human’s belief before and after the update. We used a gradient descent optimizer with a fixed learning rate and regularization scale, and sigmoid activations (ReLU did not perform as well). We used batch size 100 and made $\epsilon$, the exploration probability, exponentially decay from 1 to nearly 0 over the first 20 episodes. The human uses a uniform information model over all valid pieces of information.
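The three thresholded score functions can be sketched as follows. The factory function, the threshold of 1.0, and the penalty of -10.0 are hypothetical values chosen for illustration, not the settings used in the experiments; a `None` gain stands for null information.

```python
import math

def make_score_fn(shape, threshold, penalty):
    """Build a score function of the weighted information gain.

    shape: "id" (identity), "sq" (square), or "log" (natural logarithm).
    threshold must be positive so that the logarithm is always defined.
    """
    shapes = {"id": lambda g: g, "sq": lambda g: g * g, "log": math.log}
    if shape not in shapes:
        raise ValueError("unknown shape: " + shape)
    if threshold <= 0.0:
        raise ValueError("threshold must be positive")
    transform = shapes[shape]

    def f(gain):
        if gain is None:       # null information: no score, no penalty
            return 0.0
        if gain < threshold:   # not informative enough: penalize
            return penalty
        return transform(gain)

    return f

f_id = make_score_fn("id", threshold=1.0, penalty=-10.0)
f_sq = make_score_fn("sq", threshold=1.0, penalty=-10.0)
f_log = make_score_fn("log", threshold=1.0, penalty=-10.0)

# Give a gain of 6 at once, or split it into two statements with gain 3 each.
at_once = {"id": f_id(6.0), "sq": f_sq(6.0), "log": f_log(6.0)}
split = {"id": 2 * f_id(3.0), "sq": 2 * f_sq(3.0), "log": 2 * f_log(3.0)}
```

In this example the identity is indifferent between the two options, the square prefers the single statement, and the logarithm prefers the split, matching the superlinear and sublinear behavior described in Section 5.1.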
| Experiment | Score from Human | # Info / Timestep | Alg. 1 Runtime (sec) |
|---|---|---|---|
| N=4, M=1, f=id | 375 | 0.34 | 6.2 |
| N=4, M=5, f=id | 715 | 0.25 | 6.7 |
| N=6, M=5, f=id | 919 | 0.24 | 24.1 |
| N=4, M=1, f=sq | 13274 | 0.25 | 4.7 |
| N=4, M=5, f=sq | 33222 | 0.2 | 6.7 |
| N=6, M=5, f=sq | 41575 | 0.19 | 23.6 |
| N=4, M=1, f=log | 68 | 0.39 | 5.6 |
| N=4, M=5, f=log | 91 | 0.32 | 5.7 |
| N=6, M=5, f=log | 142 | 0.3 | 23.8 |
| Experiment | Score from Human | # Info / Timestep | Alg. 1 Runtime (sec) |
|---|---|---|---|
| N=5, M=5, f=id | 362 | 0.89 | 0.4 |
| N=5, M=10, f=id | 724 | 1.12 | 2.0 |
| N=10, M=10, f=id | 806 | 1.56 | 48.4 |
| N=5, M=5, f=sq | 37982 | 0.52 | 0.4 |
| N=5, M=10, f=sq | 99894 | 0.67 | 1.8 |
| N=10, M=10, f=sq | 109207 | 0.71 | 39.7 |
| N=5, M=5, f=log | 19 | 1.05 | 0.4 |
| N=5, M=10, f=log | 31 | 1.39 | 1.8 |
| N=10, M=10, f=log | 39 | 1.7 | 42.7 |
6.1 Domain 1: Discrete 2D Mining Task
Our first experimental domain is a 2D gridworld in which locations are organized in a discrete grid, minerals are scattered across the environment, and the robot is tasked with detecting and mining these minerals. Each mineral is of a particular type (such as gold, calcite, or quartz); the world of mineral types is known and fixed, but all types need not be present in the environment. The actions that the robot can perform on each timestep are as follows: Move by one square in a cardinal direction, with reward -1; Detect whether a mineral of a given type is present at the current location, with reward -5; and Mine the given mineral type at the current location, which succeeds with reward -20 if there is a mineral of that type there, otherwise fails with reward -100.
A terminal state in this POMDP is one in which all minerals have been mined. To initialize an episode, we randomly assign each mineral a type and a unique grid location. The factored belief representation for both the robot and the human maps each grid location to a distribution over what mineral type (or nothing) is located there, initialized uniformly. Intuitively, the human may care more about receiving information on certain mineral types, such as gold or silver, than others. These preferences are captured by the human’s weights $w$, where the $w_i$ correspond to each mineral type and the empty location value is given weight 0. The space of information is: At($m$, $l$) for every mineral type $m$ and location $l$; NotAt($m$, $l$) for every mineral type $m$ and location $l$; and null (no information). Note that the agent is only allowed to give information consistent with its current belief. Our experiments vary the grid size $N$, the number of minerals $M$, the human’s choice of weights $w$, and the human’s choice of $f$. We also experimented with the extensions discussed in Section 5.3. Table 1 and Figure 3 show some of our results and discuss important trends.
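To make the information space concrete, the sketch below applies At and NotAt statements to a factored belief over grid locations and measures the resulting weighted information gain. It is an illustration under stated assumptions: statements are treated as noiseless, "consistent with the agent’s belief" is read as the agent being certain of the statement, and the mineral weights and two-cell grid are made up.

```python
import math

def weighted_entropy(cell, weights):
    """-sum_v w_v p_v log p_v over the possible contents v of one grid cell."""
    return sum(-weights[v] * p * math.log(p) for v, p in cell.items() if p > 0.0)

def factored_weighted_entropy(belief, weights):
    """Sum of per-location weighted entropies (factored belief)."""
    return sum(weighted_entropy(cell, weights) for cell in belief.values())

def apply_info(belief, info):
    """Belief after a noiseless At/NotAt statement; None leaves it unchanged."""
    if info is None:
        return belief
    kind, mineral, loc = info
    cell = belief[loc]
    if kind == "At":
        new_cell = {v: (1.0 if v == mineral else 0.0) for v in cell}
    elif kind == "NotAt":
        remaining = 1.0 - cell[mineral]
        if remaining <= 0.0:
            raise ValueError("statement contradicts the current belief")
        new_cell = {v: (0.0 if v == mineral else p / remaining) for v, p in cell.items()}
    else:
        raise ValueError("unknown statement type: " + kind)
    updated = dict(belief)   # copy so the original belief is not modified
    updated[loc] = new_cell
    return updated

def is_consistent(agent_belief, info):
    """Assumed reading of 'consistent': the agent is itself certain of the statement."""
    if info is None:
        return True
    kind, mineral, loc = info
    p = agent_belief[loc][mineral]
    return p == 1.0 if kind == "At" else p == 0.0

WEIGHTS = {"gold": 5.0, "quartz": 1.0, "empty": 0.0}   # hypothetical preferences
uniform = {"gold": 1 / 3, "quartz": 1 / 3, "empty": 1 / 3}
human = {(0, 0): dict(uniform), (0, 1): dict(uniform)}

def gain(info):
    """Weighted information gain of giving `info` to the human."""
    before = factored_weighted_entropy(human, WEIGHTS)
    return before - factored_weighted_entropy(apply_info(human, info), WEIGHTS)

gain_not_gold = gain(("NotAt", "gold", (0, 0)))
gain_not_quartz = gain(("NotAt", "quartz", (0, 0)))
gain_at_gold = gain(("At", "gold", (0, 0)))
```

Under these weights, ruling out gold is worth more than ruling out quartz, because the uncertainty that remains after ruling out quartz still involves the heavily weighted gold outcome.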
6.2 Domain 2: Continuous 3D Fetching Task
Our second experimental domain is a more realistic 3D robotic environment implemented in PyBullet [Coumans et al., 2018]. There are $M$ objects in the world with continuous-valued positions, scattered across $N$ “zones” which partition the position space, and the robot is tasked with fetching (picking) them all. The actions that the robot can perform on each timestep are as follows: Move to a given pose, with reward -1; Detect all objects within a cone of visibility in front of the current pose, with reward -5; and Fetch the closest object within a cone of reachability in front of the current pose, which succeeds with reward -20 if such an object exists, otherwise fails with reward -100.
A terminal state in this POMDP is one in which all objects have been fetched. To initialize an episode, we place each object at a random collision-free position. The factored belief representation for the robot maps each known object to a distribution over its position, whereas the one for the human (which must be discrete) maps each known object to a distribution over which of the $N$ zones it could be in; both are initialized uniformly. Intuitively, the human may care more about receiving information regarding certain zones than others: perhaps the zones represent different sections of the ocean floor or rooms of a building on fire. These preferences are captured by the human’s weights $w$, where the $w_i$ correspond to each zone. The space of information is: At($o$, $z$) for every object $o$ and zone $z$; NotAt($o$, $z$) for every object $o$ and zone $z$; and null (no information). Note that the agent is only allowed to give information consistent with its current belief. Our experiments vary the number of zones $N$, the number of objects $M$, the human’s choice of weights $w$, and the human’s choice of $f$. We also experimented with the extensions discussed in Section 5.3. Table 1 and Figure 4 show some of our results and discuss important trends.
7 Conclusion and Future Work
We have formulated a problem setting in which an agent must act in a partially observed environment while transmitting declarative information to a human teammate that optimizes for their preferences. Our approach was to model the human’s score as a function of the weighted information gain of their belief about the environment. We also gave an algorithm for learning the human’s preferences online.
One direction for future work is to extend our model to work with continuous distributions, which can be done using the notion of the limiting density of discrete points. This notion adjusts the formula for differential entropy (which simply replaces the summation with integration in the formula for entropy) so that it correctly retains the intuitions of the discrete setting. Another direction is to have the agent generate good candidate information intelligently, rather than naively consider all valid options. Finally, we hope to explore natural language as the medium of communication.
References
Abadi et al. [2016] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
Bonet and Geffner [2000] Blai Bonet and Hector Geffner. Planning with incomplete information as heuristic search in belief space. In Proceedings of the Fifth International Conference on Artificial Intelligence Planning Systems, pages 52–61, 2000.
Burgard et al. [1997] Wolfram Burgard, Dieter Fox, and Sebastian Thrun. Active mobile robot localization by entropy minimization. In Advanced Mobile Robots, 1997. Proceedings., Second EUROMICRO workshop on, pages 155–162. IEEE, 1997.
Cassandra et al. [1996] Anthony R Cassandra, Leslie Pack Kaelbling, and James A Kurien. Acting under uncertainty: Discrete Bayesian models for mobile-robot navigation. In Intelligent Robots and Systems ’96, IROS 96, Proceedings of the 1996 IEEE/RSJ International Conference on, volume 2, pages 963–972. IEEE, 1996.
Coumans et al. [2018] Erwin Coumans, Yunfei Bai, and Jasmine Hsu. Pybullet physics engine. 2018. URL http://pybullet.org/.
Deits et al. [2013] Robin Deits, Stefanie Tellex, Pratiksha Thaker, Dimitar Simeonov, Thomas Kollar, and Nicholas Roy. Clarifying commands with information-theoretic human-robot dialog. Journal of Human-Robot Interaction, 2(2):58–79, 2013.
Devin and Alami [2016] Sandra Devin and Rachid Alami. An implemented theory of mind to improve human-robot shared plans execution. In Human-Robot Interaction (HRI), 2016 11th ACM/IEEE International Conference on, pages 319–326. IEEE, 2016.
Guiaşu [1971] Silviu Guiaşu. Weighted entropy. Reports on Mathematical Physics, 2(3):165–179, 1971.
Hadfield-Menell et al. [2015] Dylan Hadfield-Menell, Edward Groshev, Rohan Chitnis, and Pieter Abbeel. Modular task and motion planning in belief space. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 4991–4998, 2015.
Hayes and Shah [2017] Bradley Hayes and Julie A Shah. Improving robot controller transparency through autonomous policy explanation. In Proceedings of the 2017 ACM/IEEE International Conference on Human-Robot Interaction, pages 303–312. ACM, 2017.
Huang et al. [2017] Sandy H Huang, David Held, Pieter Abbeel, and Anca D Dragan. Enabling robots to communicate their objectives. arXiv preprint arXiv:1702.03465, 2017.
Kaelbling et al. [1998] Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99–134, 1998.
Kurniawati et al. [2008] Hanna Kurniawati, David Hsu, and Wee Sun Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Robotics: Science and Systems, volume 2008. Zurich, Switzerland, 2008.
Lemaignan et al. [2017] Séverin Lemaignan, Mathieu Warnier, E Akin Sisbot, Aurélie Clodic, and Rachid Alami. Artificial cognition for social human–robot interaction: An implementation. Artificial Intelligence, 247:45–69, 2017.
Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
Pineau et al. [2003] Joelle Pineau, Geoff Gordon, Sebastian Thrun, et al. Point-based value iteration: An anytime algorithm for POMDPs. In IJCAI, volume 3, pages 1025–1032, 2003.
Platt Jr. et al. [2010] Robert Platt Jr., Russ Tedrake, Leslie Kaelbling, and Tomas Lozano-Perez. Belief space planning assuming maximum likelihood observations. 2010.
Racca and Kyrki [2018] Mattia Racca and Ville Kyrki. Active robot learning for temporal task models. In Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction, pages 123–131. ACM, 2018.
Roy et al. [2000] Nicholas Roy, Joelle Pineau, and Sebastian Thrun. Spoken dialogue management using probabilistic reasoning. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, pages 93–100. Association for Computational Linguistics, 2000.
Sadigh et al. [2017] Dorsa Sadigh, Anca D Dragan, Shankar Sastry, and Sanjit A Seshia. Active preference-based learning of reward functions. In Robotics: Science and Systems (RSS), 2017.
Silver and Veness [2010] David Silver and Joel Veness. Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems, pages 2164–2172, 2010.
Somani et al. [2013] Adhiraj Somani, Nan Ye, David Hsu, and Wee Sun Lee. DESPOT: Online POMDP planning with regularization. In Advances in Neural Information Processing Systems, pages 1772–1780, 2013.
Tellex et al. [2013] Stefanie Tellex, Pratiksha Thaker, Robin Deits, Dimitar Simeonov, Thomas Kollar, and Nicholas Roy. Toward information theoretic human-robot dialog. Robotics, page 409, 2013.
Trafton et al. [2013] Greg Trafton, Laura Hiatt, Anthony Harrison, Frank Tamborello, Sangeet Khemlani, and Alan Schultz. ACT-R/E: An embodied cognitive architecture for human-robot interaction. Journal of Human-Robot Interaction, 2(1):30–55, 2013.