Bayesian Inference of Self-intention Attributed by Observer
Most agents that learn task policies through reinforcement learning (RL) lack the ability to communicate with people, which makes human-agent collaboration challenging. We believe that, in order for RL agents to comprehend utterances from human colleagues, they must infer the mental states that people attribute to them, because people sometimes infer an interlocutor's mental states and communicate on the basis of this inference. This paper proposes the PublicSelf model, a model of a person who infers how their own behavior appears to their colleagues. We implemented the PublicSelf model for an RL agent in a simulated environment and examined the model's inference by comparing it with people's judgments. The results showed that the model correctly inferred the intention that people attributed to the agent's movement in scenes where people could discern a certain intentionality in the agent's behavior.
The development of reinforcement learning (RL) has made it possible to create agents that learn how to act instead of being programmed by designers. Because RL agents can acquire policies through learning, adaptive agents that tackle tasks by trial and error become feasible. In particular, with the introduction of deep learning, RL is expected to be applied to agents, such as industrial and domestic robots, that act in the complex real world.
In general, however, collaboration between people and RL agents has many problems. Natural language is one of the major media allowing people to interact with others. Thus, developing a mechanism for RL agents to comprehend people’s utterances is an important step toward realizing human-friendly RL agents.
Many previous studies have focused on autonomous agents that comprehend people's utterances (Dang-Vu et al., 2016; Hatori et al., 2018). People often use expressions that cannot be understood out of context; Sonja (Chapman, 1991) is an agent that can comprehend such context-dependent expressions.
Figure 1 shows an example scenario of the situation considered in this paper. There are a robot and its human colleague. The colleague has an apple and a pear, while the robot has only an apple. The robot moves toward the back of the scene, and the colleague says to the robot, “Hey, you can get this.” If we were the robot, we would think the colleague was talking about the pear he had, and not about the apple. We can interpret the utterance because we can infer that the colleague attributes a certain mental state to the robot, even if the robot does not actually have mental states like those of people, and therefore assumes, for example, that “the robot intends to get a pear” and “the robot does not want an apple anymore.” Therefore, in order to comprehend the person's utterance, the robot also has to infer the person's inference about the robot. In other words, the robot needs to infer how its behavior appears to others. Sonja's system assumes that knowledge of Sonja's action target is shared with the colleague, and does not consider mental inference by the colleague.
In this paper, we propose the PublicSelf model, a model of a person that infers another's inference of that person's mental states. We also implemented the PublicSelf model for an RL agent in a simulated environment. Implementation experiments showed that the model could estimate a human observer's inference of the agent's intention in scenes where people could discern a certain intentionality in the agent's behavior.
This paper is structured as follows. Section 2 describes and formalizes the settings of the problem, and presents the background of the PublicSelf model. Section 3 proposes the PublicSelf model. Section 4 describes our implementation of the PublicSelf model for an agent in a simulated environment, and shows some examples of the model’s inferences. Section 5 describes an experiment to compare the inference of the PublicSelf model with a human observer’s judgment. Finally, section 6 concludes this paper.
2.1. Mind-reading in human conversation
Theory of mind is the ability to read minds, attributing a mental state such as a belief, a desire, a plan, or an intention to someone so as to understand and predict their behavior (Premack and Woodruff, 1978). Mind-reading is considered to be one of the important elements for achieving social interaction.
We sometimes mind-read an interlocutor and communicate on the premise that the interlocutor has the mental state we infer. For example, Ono et al. (2000) showed that people could understand the content of a robot's unclear utterance by considering the intention that they attributed to the robot. In the example given in section 1, the person watches the behavior of the robot and attributes to it the intention to get a pear. Thus, the person says “You can get this” based on the attributed mental state. We, on the other hand, can comprehend the person's utterance by inferring the intention the person attributes to the robot and assuming that the person intends to help the robot. As seen here, people sometimes treat the results of mind-reading as context when talking and interpreting someone's utterance, whether or not the interlocutor actually has mental states like people do.
We formalize the process of people's thinking in the example scenario given in section 1 using Belief-Desire-Intention (BDI) logic (Cohen and Levesque, 1990). Here, we call the person an observer $o$, and the robot an actor $a$. The utterance of the observer is based on the idea that the actor has intention $I^{o}$:

$\mathrm{BEL}(o, \mathrm{INTEND}(a, I^{o}))$

where $\mathrm{BEL}(x, \varphi)$ means agent $x$ believes $\varphi$, and $\mathrm{INTEND}(x, \varphi)$ means agent $x$ intends to achieve $\varphi$. The superscript $o$ of $I^{o}$ indicates that the variable is an inference of another person's mental state.
The observer infers intention $I^{o}$ from the set of possible intentions for the actor. The observer chooses $I^{o}$ based on the states $s_{1:t}$ and the actor's actions $a_{1:t}$ observed up to time $t$. In the example of section 1, the observer can be considered to attribute the intention $I^{o}$ of getting a pear.
In order for the actor to comprehend the observer's utterance, the actor has to infer the intention attributed to them by the observer. The inference is based on the actor's own observations and actions. Let $I^{p}$ be the intention the actor infers that the observer attributes to them. The superscript $p$ means that the variable is an inference of the self's mental state attributed by others.
Moreover, the actor assumes that the observer's utterance was based on the observer's intention to help the actor:

$\mathrm{INTEND}(o, \mathrm{DONE}(a, I^{p}))$

where $\mathrm{DONE}(x, \varphi)$ means that agent $x$ has just achieved $\varphi$, and $I^{p}$ is the intention the actor infers that the observer attributes to them.
2.2. Adaptive action selection of reinforcement learning agents by considering people's utterances
Our final goal is to propose a mechanism for RL agents to adaptively select actions by considering people's utterances. Adaptive action selection by RL agents involves not only comprehending context-dependent expressions from people, but also deciding whether or not the agents should change their policy on the basis of people's utterances.
Understanding the internal states of an RL agent, such as its goals, plans, beliefs, desires, and intentions, is a challenging problem because the agent does not have explicit representations for them. This problem has become an even more important research domain with the introduction of deep learning (Fukuchi et al., 2017). The incomprehensibility of RL agents also raises the problem of how an RL agent should determine its actions while taking people's utterances into account.
Sonja (Chapman, 1991) is an artificial agent that can comprehend a context-dependent expression such as “No, the other one.” If Sonja's observer says “No, the other one” when Sonja moves to an amulet, Sonja will begin searching for another amulet; when Sonja moves toward a ladder, Sonja will change its direction to another ladder.
When Sonja receives the message “No, the other one,” its control logic handles the message by changing Sonja's target to another one. Sonja's approach is based on two assumptions: (i) the observer knows what the actor is going to do, and (ii) the actor knows what the actor is going to do.
Suppose an actor is choosing actions based on intention $I^{*}$. The superscript $*$ indicates that the variable is an agent's actual internal state.
An observer speaks to an actor based on $I^{o}$, the inference of the actor's intention. If the actor's decision-making process is explicit, it is reasonable to consider that the observer's inference matches the actor's true intention ($I^{o} = I^{*}$), because it is not very difficult for the observer to access the actor's actual internal states. However, when it comes to an RL agent, it is difficult for a human observer to access the agent's internal states embedded in the agent's policy. We do not even know whether the policy of an RL agent has what we call intention. All the observer can do is infer the actor's intention from knowledge, context, and observations of the actor's behavior. Thus, the observer may speak based on a false belief that the agent has intention $I^{o}$, even when $I^{o}$ differs from $I^{*}$. If the observer is misinterpreting the actor's intention, the actor may as well not take the observer at their word.
In addition, an RL agent does not know its own actual intention $I^{*}$ explicitly either. Therefore, the actor cannot judge whether the observer's belief about the actor is correct.
In order to realize RL agents that can adaptively select actions by considering people's utterances, an RL agent should be able to infer two intentions: $I^{p}$, the intention the actor infers the observer attributes to it, and $I^{s}$, the introspective inference of the agent's own intention by the agent. The superscript $s$ indicates that the variable is a self-inference of the agent's own mental states. $I^{p}$ helps the agent interpret the observer's utterance, and a comparison between $I^{p}$ and $I^{s}$ enables the agent to judge whether what the observer says has value.
2.3. Self-awareness

In the field of psychology, self-awareness is the ability of people to recognize the self as an object of attention (Duval and Wicklund, 1972).
Self-awareness is considered to have two aspects: public self-awareness and private self-awareness (Falewicz and Bak, 2015; Feningstein, 1975). Private self-awareness is a personal belief in the self acquired by the introspection of self thoughts and feelings. Public self-awareness, on the other hand, is a belief in the self as a social object, and involves how the self appears to others.
The idea of self-awareness can be associated with the self inferences of an actor’s intention:
- Private self-awareness is the inference of the self's actual mental states, and involves the inference of $I^{s}$, the agent's introspective inference of its own intention.
- Public self-awareness is the inference of the self's mental states attributed by an observer, and involves the inference of $I^{p}$, the intention the actor infers the observer attributes to it.
We believe that RL agents need to be self-aware in these two ways to determine their actions while considering people’s utterances. The PublicSelf model can be considered a model of public self-awareness for RL agents.
2.4. Bayesian modeling of theory of mind
Previous studies have modeled the computational mechanisms of the human mind-reading ability as a Bayesian inference (Fig. 2 left) (Baker et al., 2017; Frith and Frith, 2001; Pantelis et al., 2014). In this paper, we collectively call these approaches the Bayesian Theory of Mind (BToM). One of the targets in the BToM research field is modeling a human observer who attributes mental states to an actor while watching the actor's behavior. In a typical problem setting, an observer can observe the whole environment, including the actor in the environment, and attributes mental states such as the actor's belief $b_t$, desire $d_t$, and intention $i_t$ based on the environment states $s_{1:t}$ and the actor's actions $a_{1:t}$ up to time $t$. Each variable of the mental states can be estimated as a probability:

$P(b_t, d_t, i_t \mid s_{1:t}, a_{1:t})$
where the observer must additionally infer $o_t$, the observation that the observer infers the actor makes at time $t$. The probability of each variable can be calculated using a forward algorithm (Rabiner, 1989). The PublicSelf model is based on this BToM concept.
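As a minimal illustration of this kind of forward filtering, the posterior over a discrete intention variable can be updated step by step from observed actions. The two intentions, the toy action-likelihood table, and all function names below are illustrative assumptions, not the paper's implementation.

```python
# Sketch of Bayesian intention filtering in the BToM spirit: maintain
# P(i | a_1:t) for a discrete intention set, assuming a known action
# likelihood P(a_t | s_t, i). All names and numbers are illustrative.

def update_intention_posterior(prior, action, state, action_likelihood):
    """One forward step: posterior(i) is proportional to
    P(a_t | s_t, i) * prior(i), renormalized over intentions."""
    unnorm = {i: action_likelihood(action, state, i) * p
              for i, p in prior.items()}
    z = sum(unnorm.values())
    return {i: v / z for i, v in unnorm.items()}

def toy_likelihood(action, state, intention):
    # Moving "right" is three times as likely under "get_pear".
    table = {"get_apple": {"right": 0.25, "left": 0.75},
             "get_pear":  {"right": 0.75, "left": 0.25}}
    return table[intention][action]

# Start from an even prior; the actor repeatedly moves right.
posterior = {"get_apple": 0.5, "get_pear": 0.5}
for a in ["right", "right", "right"]:
    posterior = update_intention_posterior(posterior, a, None, toy_likelihood)
# posterior now strongly favors "get_pear" (27/28 ≈ 0.964).
```

The likelihood ratio 0.75/0.25 compounds over the three steps, which is why a short run of consistent actions suffices to make one intention dominate.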
We can consider the BToM to be a model of an observer who monitors an actor's behavior, and the proposed PublicSelf model to be a model of an actor that infers an observer's inference of the actor's intention. While the inference of the BToM is based on the third-person perspective of an observer, the PublicSelf model infers the actor's mental states attributed by the observer from the actor's own perspective. The intention which the BToM model infers is $I^{o}$, the intention the observer attributes to the actor, and that of the PublicSelf model is $I^{p}$, the actor's inference of $I^{o}$.
3. Bayesian Public Self-awareness Model
This paper proposes the PublicSelf model, a Bayesian inference model of public self-awareness that estimates the probability of an actor's mental states as attributed by an observer.
Figure 2 shows the Bayesian networks of the BToM model and the PublicSelf model.
In the BToM, the observable variables are the states of the environment $s_t$ and the actions of the actor $a_t$. However, because the PublicSelf model focuses on the inference by the actor, the observable variables are only what the actor itself can observe, that is, the actor's observations $o_t$ and actions $a_t$. The observer and actor each observe the environment through their own limited observation spaces.
The probabilities of the actor's attributed belief $b_t$, desire $d_t$, and intention $i_t$ can be estimated in a manner similar to that for the BToM:

$P(b_t, d_t, i_t \mid o_{1:t}, a_{1:t})$ (10)
The crucial point of the PublicSelf model is to consider the asymmetry of the information between an actor and an observer. For example, even if an actor observes an apple and moves toward it, the observer cannot read the actor’s intention unless the observer can observe both the actor and the apple. The PublicSelf model emulates how the behavior of the actor appears to the observer in order to infer the mental states attributed by the observer.
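The visibility check underlying this emulation can be sketched as follows; the rectangular field-of-view representation, the coordinates, and all names are illustrative assumptions of this sketch, not the paper's implementation.

```python
# Sketch of the information-asymmetry check: the actor only lets the
# emulated observer "see" an object if it falls inside the observer's
# fixed field of view. Rectangles and coordinates are illustrative.

def in_view(pos, view):
    """view = (xmin, ymin, xmax, ymax); pos = (x, y)."""
    x, y = pos
    xmin, ymin, xmax, ymax = view
    return xmin <= x <= xmax and ymin <= y <= ymax

def observer_visible_objects(objects, observer_view):
    """Objects the emulated observer can use for mental-state attribution."""
    return {name: pos for name, pos in objects.items()
            if in_view(pos, observer_view)}

# The pear at x = 20 lies outside the observer's view, so it must not
# influence the intention the observer is assumed to attribute.
objects = {"apple1": (2, 3), "pear1": (20, 5), "actor": (4, 3)}
observer_view = (0, 0, 10, 6)
visible = observer_visible_objects(objects, observer_view)
```

Even if the actor itself sees the pear and moves toward it, the emulated observer's attribution is computed only from `visible`, which is how the asymmetry in the apple-and-pear example is captured.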
4. Implementation for RL agent in simulated environment
4.1. Simulated environment
This paper introduces an implementation of the PublicSelf model for an RL agent (Fig. (b)) in a simulated environment (Fig. (a)). In the environment considered, there are always five objects: two apples, two pears, and the agent (actor). The objects spawn at random locations at the beginning of each episode. The actor selects its actions based on observations from a first-person-view camera (Fig. (c)). The action space of the actor consists of five actions: moving forward, moving backward, turning clockwise, turning counterclockwise, and doing nothing.
4.2. Implementation of the PublicSelf model

In the current implementation, we did not calculate all of the probabilities in equation 10; some were updated deterministically. We assumed that the observer observed the environment from a predefined fixed viewpoint (Fig. (d)), and that the actor knew the area of the observer's view. The view of the actor assumed by the observer was also predefined by the angle of view and the distance from the front of the actor.
Environmental state $s_t$ consists of the velocity and direction of the actor and the locations of the objects on the field. Belief $b_t$ is the probability that the environment state is $s_t$ given the actor's observable variables ($o_{1:t}$ and $a_{1:t}$).
We assume that both the actor and observer know there are always two apples and two pears on the field, and consider that there are two possible desires attributed by the observer to the actor: $d_{apple}$, the desire to get apples, and $d_{pear}$, which is constituted by exchanging the roles of apple and pear in $d_{apple}$. We also assume that there are two possible intentions: $i_{apple}$ and $i_{pear}$. We assume that desires $d_{apple}$ and $d_{pear}$ deterministically generate intentions $i_{apple}$ and $i_{pear}$, respectively, that is, $P(i_{apple} \mid d_{apple}) = P(i_{pear} \mid d_{pear}) = 1$.
The states of the simulation environment are continuous, so there are infinitely many possible worlds. For the calculation, we reduce the possible states by treating the environment as a grid world with seven rows and twenty-five columns, which makes the number of possible fruit configurations finite.
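For illustration, if we additionally assume that each of the two (indistinguishable) apples and two pears occupies a distinct cell of the 7 × 25 grid (a simplifying assumption of this sketch), the configuration count can be computed directly:

```python
from math import comb  # exact binomial coefficients

ROWS, COLS = 7, 25
cells = ROWS * COLS  # 175 candidate cells

# Choose 2 cells for the indistinguishable apples, then 2 of the
# remaining cells for the pears. Assumes one fruit per cell, which is
# an assumption of this sketch rather than a stated property of the
# simulator.
configs = comb(cells, 2) * comb(cells - 2, 2)
# configs == 226517550
```

The point is only that discretization turns an uncountable state space into one over which a belief distribution can be explicitly enumerated and updated.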
Belief $b_t$ is initialized as the uniform probability distribution over all the possible states, and updated by rejecting states that contradict the observations shared between the actor and the observer. This means that the observer updates the belief attributed to the actor only when the observer can see what the actor observes. If the actor passes by an apple on its right, disappears from the observer's view for a while, and returns, people can infer that there are no pears in the unobservable field to the right, but this deterministic implementation cannot handle that kind of inference.
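A minimal sketch of this rejection update is given below; the world representation (a dict from grid cell to fruit kind) and the helper names are illustrative, not our actual data structures.

```python
# Sketch of the deterministic belief update: start from candidate fruit
# placements and reject any that contradict an observation jointly
# visible to actor and observer. Representations are illustrative.

def update_belief(candidates, observed_cells, seen_fruits):
    """candidates: iterable of worlds, each a dict cell -> fruit kind.
    observed_cells: cells visible to BOTH the actor and the observer.
    seen_fruits: dict cell -> fruit actually seen there; a jointly
    observed cell absent from this dict was observed to be empty."""
    kept = []
    for world in candidates:
        consistent = all(
            world.get(c) == seen_fruits.get(c) for c in observed_cells)
        if consistent:
            kept.append(world)
    return kept

# Toy belief over two candidate worlds; cell (0, 1) was jointly
# observed and seen to be empty, so the first world is rejected.
worlds = [{(0, 1): "apple"}, {(2, 3): "pear"}]
belief = update_belief(worlds, observed_cells=[(0, 1)], seen_fruits={})
```

Restricting the check to jointly observed cells is exactly what makes the update deterministic, and also why the "no pears on the right" inference mentioned above is out of reach for it.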
Equation 10 has an element $P(a_t \mid b_t, i_t)$, which is the probability of selecting action $a_t$ under an actor's intention $i_t$ and belief $b_t$. We need to calculate these probabilities under various possible intentions and beliefs, but the actor's own policy, which takes first-person-view observations as input, makes it hard to emulate the various possibilities. Therefore, we prepared another policy $\pi'$, whose input is not visual images but numerical vectors consisting of the actor's velocity and the relative positions between the actor and the fruits, trained with A3C, a deep reinforcement learning method (Mnih et al., 2016). $\pi'$ calculates the probability of choosing each action under a given belief, reward, and intention.
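A minimal sketch of how an emulated policy can assign action probabilities of this kind is shown below. The softmax over hand-crafted logits is an illustrative stand-in for the learned A3C policy; the action names match the actor's action space, but the scoring rule and parameter names are assumptions of this sketch.

```python
# Sketch of scoring actions under an emulated policy P(a_t | b_t, i).
# A softmax over hand-crafted logits stands in for the trained network.
import math

ACTIONS = ["forward", "backward", "cw", "ccw", "noop"]

def action_probs(rel_target, temperature=1.0):
    """rel_target: signed progress toward the intended fruit that
    moving forward would achieve; positive means the fruit is ahead."""
    logits = {
        "forward":  rel_target,
        "backward": -rel_target,
        "cw": 0.0,
        "ccw": 0.0,
        "noop": -1.0,
    }
    zs = {a: math.exp(l / temperature) for a, l in logits.items()}
    total = sum(zs.values())
    return {a: z / total for a, z in zs.items()}

# With the intended fruit ahead, moving forward is most probable.
probs = action_probs(rel_target=2.0)
```

Evaluating such a function once per candidate intention yields the likelihood terms that the Bayesian update over attributed intentions requires.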
4.3. Intention estimation by implemented PublicSelf model
We verified the inference of an RL agent's intention using the PublicSelf model. The actor has two policies: $\pi_{apple}$, which was learned with A3C to move to apples, and $\pi_{pear}$, which was learned to move to pears.
In episode 1, the actor behaved in a way that made it relatively simple for an observer to infer its intention. The actor first faced the bottom left, turned counterclockwise, and moved to an apple. Suppose you were the observer inferring the actor's intention. You could not judge the actor's intention while the actor was turning, and would think the actor intended to get an apple after the actor began to move forward toward the apple on the right. The probability of the intention $i_{apple}$ inferred by the PublicSelf model rapidly increased when the actor began to move forward, which matches this intuition.
In episode 2, the actor showed misleading behavior. The actor moved toward the apples even though the actor’s actual target was the pears. Then, the actor changed its direction to the right and disappeared from the observer’s sight. The PublicSelf model first estimated that it was more likely that the actor intended to get apples, and then greatly decreased the probability when the actor turned its back on the apples.
In episode 3, the observer obtained very little information. The actor actually observed two pears and an apple in its view, and moved toward the apple. However, because the observer could not observe what the actor was moving toward, the observer could not infer the actor's intention. The actor needed to consider the asymmetry of information between the actor and the observer to estimate the observer's inference. The PublicSelf model can take into account the fact that the observer does not know what exists in the field to the right. Thus, we obtained the result that the probabilities of $i_{apple}$ and $i_{pear}$ were even.
5. Comparison with people’s judgment from observer’s perspective
5.1. Experiment settings
In order to verify whether it is possible to use the PublicSelf model to actually comprehend people’s utterances, we conducted an experiment to compare the inferences of the PublicSelf model with the judgments of people who actually observed the agent’s behavior from the observer’s perspective.
In this experiment, we prepared six episodes, including the ones shown in subsection 4.3. The six episodes could be classified into three types: simple, blind, and misleading. The simple type included episode 1. In the simple episodes, the actor soon caught sight of its target and headed directly for it; there were fewer factors for the participants to consider than in the episodes of the other types. The blind type included episode 3. The actor seemed to head for something, but the observer could not observe the object toward which the actor was heading. The misleading type included episodes 2 and 4 (Fig. 7). There, the actor showed misleading behavior by first heading for either apples or pears, but then ignoring them and moving in a different direction. Misleading behavior often appears when the actor has not yet caught sight of its target and is searching for it. Each type had two episodes; the actor's policy was $\pi_{apple}$ in one episode and $\pi_{pear}$ in the other.
Participants watched videos from the observer's view frame by frame, and were asked to estimate the probability of the agent's intention for every frame on web browsers (Fig. 8). The participants watched the six episodes in a randomized order. When the actor spawned in the observer's sight at the beginning of an episode, a red arrow indicated the direction of the actor.
The participants were 11 undergraduate and graduate students, eight male and three female.
Before the experiment, instructions were given to the participants. An experimenter first showed the participants the whole area of the field, the observable areas of the observer and actor, and the appearance of the actor. Then, the participants received an explanation of the user interface, and were given four points to note for the experiment as follows: (i) There were always one actor, two apples, and two pears on the field. (ii) The only possible intentions for the actor were either $i_{apple}$ (get an apple) or $i_{pear}$ (get a pear). (iii) The actor did not know where the fruits were on the field at the beginning of each episode. (iv) The initial positions of the actor and the fruits were randomly determined at the beginning of each episode, independently of the actor's intention. Instruction (iv) was given so that the participants would consider the initial probabilities of the actor's intentions to be even.
In spite of instruction (iv), two female participants related the actor's spawn points to the actor's intention. In other words, they thought it was more likely that the actor intended to get an apple when the actor spawned right in front of an apple. This suggests that the situation alone can strongly influence people's mind-reading even before the actor's behavior has been observed. In order to focus on the inference of an actor's intention from its actions, however, these two participants were excluded from the analysis in this paper. Considering the influence of the situation is one direction for future work.
Figure 9 shows the relationship between the probabilities estimated by the PublicSelf model and by the participants in episodes of the simple type. There was a strong, significant correlation between the two sets of probability estimates.
In the blind episodes, the PublicSelf model estimated that the probabilities were even at every step. However, in one of the blind episodes, three participants estimated with a higher probability that the actor had intention $i_{pear}$ after the actor began to move to the right of the field (Fig. 10). One of the participants explained that he estimated the probability of $i_{pear}$ to be higher because the actor seemed to have found its target on the right of the field, where a pear existed with a higher probability than an apple, since an apple already existed in the middle of the field. This is a logical inference, but it was impossible for the actor to infer the participants' thinking in the settings of this experiment, because the actor had never observed the apple at the center of the field and did not know that the apple was there. Information asymmetry between an actor and an observer can sometimes cause this kind of insoluble problem.
Figure 11 shows the probabilities estimated by both the people and the PublicSelf model in the misleading episodes. There was a significant but not strong correlation. There were two possible factors that led to the gaps between the estimations by the people and the PublicSelf model.
One factor was that people fed the actor's behavior back into their inference of the actor's beliefs. When the actor changed direction, some participants let out a gasp of astonishment and reported that the actor seemed not to have noticed the pear or to have lost sight of it. In our implementation, the update of the actor's belief was deterministic, whereas people revised their inference of the actor's belief based on its behavior. A probabilistic inference of the actor's belief could narrow the gap between the model and people.
Another possible factor is that the participants may have doubted the intentionality of the actor's behavior. In both of the misleading episodes, the actor chose the action “do nothing” much more frequently, or alternated between “turn clockwise” and “turn counterclockwise,” compared with the simple episodes. Because $\pi_{apple}$ and $\pi_{pear}$ were probabilistic, the actor sometimes, especially when it had not found its target, lost consistency and rationality of behavior. In a questionnaire, some participants indicated that they felt anxious when the actor showed such inconsistent and irrational behaviors and found it difficult to choose between the two alternatives $i_{apple}$ and $i_{pear}$. The rationality of an agent is a necessary requirement for people to consider the agent an entity to which they can attribute mental states (Gergely et al., 1995). The participants might have doubted the assumption that the actor had either intention $i_{apple}$ or $i_{pear}$, and answered with a probability of approximately fifty-fifty after losing confidence in their answers. The confidence of people's judgments could also be related to whether they would use context-dependent expressions with the actor. Thus, it is important to consider the strength of the intentionality that people perceive in the actor's behavior, as well as the probabilities of the candidates for the actor's intention.
6. Conclusion

We proposed the PublicSelf model, a model of an actor that infers the mental states that an observer watching the actor's behavior attributes to the actor. We implemented the PublicSelf model for an actor that learned its policy by deep reinforcement learning in a simulated environment. Verification of the PublicSelf model showed that the model could infer the intention attributed to the agent in scenes where people could discern a certain intentionality in the actor's behavior.
Our simulation has limitations, such as the observer's viewpoint being fixed and predefined, which can be problematic for applying the PublicSelf model to real-world scenarios. We are planning further investigation into implementing the PublicSelf model for a real-world robot. Moreover, the experiment suggested that the RL agent's intentionality had an effect on people. Investigating people's cognitive processes in understanding an RL agent's behavior is important future work for realizing effective human-RL-agent interaction.
- Baker et al. (2017) Chris L. Baker, Julian Jara-Ettinger, Rebecca Saxe, and Joshua B. Tenenbaum. 2017. Rational quantitative attribution of beliefs, desires and percepts in human mentalizing. Nature Human Behaviour 1 (2017), 0064 EP. http://dx.doi.org/10.1038/s41562-017-0064
- Chapman (1991) David Chapman. 1991. Vision, Instruction, and Action. MIT Press, Cambridge, MA, USA.
- Cohen and Levesque (1990) Philip R. Cohen and Hector J. Levesque. 1990. Intention is choice with commitment. Artificial Intelligence 42, 2 (1990), 213–261. https://doi.org/10.1016/0004-3702(90)90055-5
- Dang-Vu et al. (2016) Bao-Anh Dang-Vu, Oliver Porges, and Máximo A. Roa. 2016. Interpreting Manipulation Actions: From Language to Execution. In Robot 2015: Second Iberian Robotics Conference, Luís Paulo Reis, António Paulo Moreira, Pedro U. Lima, Luis Montano, and Victor Muñoz-Martinez (Eds.). Springer International Publishing, Cham, 175–187.
- Duval and Wicklund (1972) Shelley Duval and R. A. Wicklund. 1972. A Theory of Objective Self-Awareness. Academic Press.
- Falewicz and Bak (2015) Adam Falewicz and Waclaw Bak. 2015. Private vs. public self-consciousness and self-discrepancies. Current Issues in Personality Psychology 4, 1 (2015), 58–64. https://doi.org/10.5114/cipp.2016.55762
- Feningstein (1975) A. Feningstein. 1975. Public and private self-consciousness : Assessment and theory. Journal of Consulting and Clinical Psychology 43 (1975), 522–527. https://doi.org/10.1037/h0076760
- Frith and Frith (2001) Uta Frith and Chris Frith. 2001. The Biological Basis of Social Interaction. Current Directions in Psychological Science 10, 5 (2001), 151–155. https://doi.org/10.1111/1467-8721.00137
- Fukuchi et al. (2017) Yosuke Fukuchi, Masahiko Osawa, Hiroshi Yamakawa, and Michita Imai. 2017. Autonomous Self-Explanation of Behavior for Interactive Reinforcement Learning Agents. In Proceedings of the 5th International Conference on Human Agent Interaction (HAI ’17). ACM, New York, NY, USA, 97–101. https://doi.org/10.1145/3125739.3125746
- Gergely et al. (1995) György Gergely, Zoltán Nádasdy, Gergely Csibra, and Szilvia Bíró. 1995. Taking the intentional stance at 12 months of age. Cognition 56, 2 (1995), 165–193. https://doi.org/10.1016/0010-0277(95)00661-H
- Hatori et al. (2018) Jun Hatori, Yuta Kikuchi, Sosuke Kobayashi, Kuniyuki Takahashi, Yuta Tsuboi, Yuya Unno, Wilson Ko, and Jethro Tan. 2018. Interactively Picking Real-World Objects with Unconstrained Spoken Language Instructions. In Proceedings of the International Conference on Robotics and Automation.
- Mnih et al. (2016) Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Tim Harley, Timothy P. Lillicrap, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous Methods for Deep Reinforcement Learning. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48 (ICML’16). JMLR.org, 1928–1937. http://dl.acm.org/citation.cfm?id=3045390.3045594
- Ono et al. (2000) Tetsuo Ono, Michita Imai, and Ryohei Nakatsu. 2000. Reading a robot’s mind: a model of utterance understanding based on the theory of mind mechanism. Advanced Robotics 14, 4 (2000), 311–326. https://doi.org/10.1163/156855300741609
- Pantelis et al. (2014) Peter C. Pantelis, Chris L. Baker, Steven A. Cholewiak, Kevin Sanik, Ari Weinstein, Chia-Chien Wu, Joshua B. Tenenbaum, and Jacob Feldman. 2014. Inferring the intentional states of autonomous virtual agents. Cognition 130, 3 (2014), 360–379. https://doi.org/10.1016/j.cognition.2013.11.011
- Premack and Woodruff (1978) David Premack and Guy Woodruff. 1978. Does the chimpanzee have a theory of mind? Behavioral and Brain Sciences 1, 4 (1978), 515–526. https://doi.org/10.1017/S0140525X00076512
- Rabiner (1989) L. R. Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77, 2 (Feb 1989), 257–286. https://doi.org/10.1109/5.18626