Autonomous Self-Explanation of Behavior for Interactive Reinforcement Learning Agents
In cooperation, the workers must know how co-workers behave. However, an agent’s policy, which is embedded in a statistical machine learning model, is hard to understand, and requires much time and knowledge to comprehend. Therefore, it is difficult for people to predict the behavior of machine learning robots, which makes Human Robot Cooperation challenging. In this paper, we propose Instruction-based Behavior Explanation (IBE), a method to explain an autonomous agent’s future behavior. In IBE, an agent can autonomously acquire the expressions to explain its own behavior by reusing the instructions given by a human expert to accelerate the learning of the agent’s policy. IBE also enables a developmental agent, whose policy may change during the cooperation, to explain its own behavior with sufficient time granularity.
HAI ’17, October 17–20, 2017, Bielefeld, Germany. ISBN 978-1-4503-5113-3/17/10. DOI: https://doi.org/10.1145/3125739.3125746
Human Robot Cooperation (HRC), in which people and robots work on the same task together in a shared environment, is an effective concept for both industrial and domestic robots [amor2014interaction, dimeas2016online]. By working in a complementary manner, robots and people compensate for each other’s weaknesses and achieve difficult tasks that neither could achieve alone. Cooperative robots require the ability to deal with complicated real-world information, and machine learning techniques, such as deep reinforcement learning (DRL), are expected to make this real-world information processing possible.
In cooperation, the workers must know how the other co-workers behave, in order to avoid dangerous misunderstandings and decide what roles to take in the situation [hayes2013challenges]. However, it is difficult for people to predict the machine learning agent’s behavior. The control logic embedded in a statistical machine learning model, especially in a deep learning model, is incomprehensible for most people, and requires much time and knowledge to understand.
Previous studies have shown that interaction between people and robots develops people’s understanding of robots and improves the performance of the cooperation. Andresen et al. proposed a robot that projects its own intentions and instructions onto real-world objects. This robot detects the locations and shapes of nearby objects, and projects information directly on them. The projection improved the effectiveness and efficiency of collaboration tasks. Hayes et al. proposed an answering system that explains an autonomous agent’s policy [hayes2017improving]. This system builds a statistical model of the autonomous agent’s actions and answers certain sets of natural language questions based on the model, so that people can get insights into the control logic of the agent. The proposed method successfully summarized the policies of both hand-coded and machine learning agents in some domains, regardless of the internal representation of the agent’s control logic.
However, all the information provided by the projection robot is designed by programmers. More complex behaviors acquired autonomously by machine learning techniques make it impossible for people to design in advance the information to show to co-workers. The natural language answering system also requires designers to prepare a mapping from the agent’s action to a communicable predicate that explains the action. In addition, Hayes et al. assumed that the policy of an agent does not change after a model of the agent’s behavior is built; therefore, this method does not work on a developmental agent that renews its policy gradually in actual human-robot collaboration. Moreover, the work attempts to assign a communicable predicate to an action in one time step; however, a one-step action is usually too fine-grained to explain when we consider a machine learning model that controls the complex behavior of a robot. An agent’s behavior that people can recognize is the result of a sequence of actions. In order to explain the behavior of an agent to people, we have to consider the agent’s actions with longer time granularity.
This paper proposes Instruction-based Behavior Explanation (IBE), a method that explains the future behavior of a reinforcement learning agent in any situation. In IBE, we consider a setting of Interactive Reinforcement Learning (IRL) [thomaz2005real]. IRL is a framework in which a machine learning agent receives an expert’s instructions to accelerate the agent’s policy acquisition. IBE reuses the instructions as expressions to explain an agent’s behavior (Fig. 1). However, in contrast to IRL, the designer or the instructor of the agent does not have to explicitly give the relationship between instructions and the agent’s actions. In IBE, an agent guesses the meanings of instructions on the assumption that the more reward an agent receives, the more likely it is that the agent’s behavior followed the instruction. With this assumption, an agent can autonomously acquire the expressions to explain its behavior. In addition, IBE estimates an agent’s behavior by simulating the transitions of the environment at each time step. The successive simulations make it possible to deal with a developmental agent whose policy changes over time. Moreover, by broadening the time span of the simulation, IBE can output information with sufficient time granularity.
2.1 Reinforcement Learning (RL)
RL is a type of learning that acquires an agent’s policy autonomously in a sequential decision-making process [sutton1998reinforcement]. At time $t$, an agent observes the state of the environment $s_t$ and selects an action $a_t$. The agent’s action $a_t$ changes the state of the environment to $s_{t+1}$, and the agent receives a reward $r_t$ from the environment. An agent decides the action based on its policy $\pi$, where $\pi(a \mid s)$ is the probability that the agent takes action $a$ in environment state $s$. The goal of an RL agent is to find the optimal policy $\pi^*$ that maximizes the total reward $\sum_t r_t$. In this paper, we consider an agent that learns its policy with RL.
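The interaction loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy chain environment `toy_step`, the dictionary representation of the policy, and all function names are hypothetical and do not come from the paper.

```python
import random

def sample_action(policy, state, rng):
    """Draw an action a with probability pi(a | s)."""
    actions, probs = zip(*policy[state].items())
    return rng.choices(actions, weights=probs, k=1)[0]

def run_episode(policy, step, initial_state, max_steps, rng):
    """Roll out one episode and return the total reward (sum of r_t)."""
    state, total_reward = initial_state, 0.0
    for _ in range(max_steps):
        action = sample_action(policy, state, rng)
        state, reward, done = step(state, action)
        total_reward += reward
        if done:
            break
    return total_reward

# Hypothetical toy chain: states 0..3, reaching state 3 ends the episode.
def toy_step(state, action):
    next_state = max(0, min(3, state + (1 if action == "right" else -1)))
    done = next_state == 3
    reward = 1.0 if done else -0.1
    return next_state, reward, done

# A policy that always moves right: pi("right" | s) = 1 for every state s.
always_right = {s: {"left": 0.0, "right": 1.0} for s in range(4)}
total = run_episode(always_right, toy_step, 0, 10, random.Random(0))
```

With this policy the agent reaches the terminal state in three steps, so the total reward is two step penalties plus the final reward.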
2.2 Interactive Reinforcement Learning (IRL)
In a complex situation in which the state and action spaces are very large, the learning process of an RL agent becomes excessively long [knox2009interactively]. For a cooperative agent to acquire its policy in the real world with machine learning techniques, it is necessary to deal with this increase in search time. IRL is an approach that can address the search time problem. In IRL, a human or an agent expert instructs a beginner agent in real time so that the beginner can learn the policy efficiently [cruz2016training]. Narrowing down the search space with the instructions can also help a cooperative agent learn the policy in the real world.
Therefore, in this study, we consider a scenario in which a human expert instructs an agent (Fig. 2). An expert gives an instruction signal $i_t$ to a beginner agent at every time step $t$. $i_t$ is a real number that represents an instruction from the expert to the agent. The instruction space is much narrower than the state space. Therefore, we expect that an agent can link instructions to actions more quickly than it can link environment states to actions. At the same time, because the instruction signal is less informative than the state information, the agent can select actions more precisely by also using the environment state.
3 Instruction-based Behavior Explanation (IBE)
In this paper, we propose IBE, a method to explain an autonomous agent’s behavior with the expressions given by a human expert as instruction (Fig. 3). IBE consists of two steps: (i) estimating the target of the agent’s actions by simulation and (ii) acquiring a mapping from the target of the agent’s actions to the expressions, in order to explain the action target based on the instruction signal given by a human expert.
3.1 Estimation of the action target
In this study, we define the target of the agent’s actions at time $t$ as the change in the environment state after the agent acts for $n$ steps, $\Delta s_t = s_{t+n} - s_t$.
With the introduction of the time span $n$, IBE can output the behavior explanation with human-understandable time granularity.
IBE estimates $\Delta s_t$ with the agent’s policy $\pi$ and a predictor module (Algorithm 1). The predictor takes the current state and the agent’s action as input and predicts the environment state at the next time step.
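The estimation step can be sketched as a simple rollout: query the policy, advance a simulated state with the predictor, repeat for the chosen time span, and return the difference between the final simulated state and the current one. The sketch below is not the paper’s Algorithm 1. It assumes that states are lists of numbers, that `policy` is a callable returning one action (sampled or greedy), and that `predictor` is a callable returning the next state; the toy policy and predictor are hypothetical.

```python
def estimate_action_target(state, policy, predictor, n_steps):
    """Estimate the change in state after n_steps simulated actions."""
    simulated = list(state)
    for _ in range(n_steps):
        action = policy(simulated)                # action chosen in the simulated state
        simulated = predictor(simulated, action)  # predicted next state
    return [after - before for after, before in zip(simulated, state)]

# Hypothetical toy world: state = [x, y]; the body falls one unit per step
# and moves one unit sideways depending on the action.
def toy_policy(state):
    return "right"

def toy_predictor(state, action):
    x, y = state
    return [x + (1 if action == "right" else -1), y - 1]

target = estimate_action_target([0, 10], toy_policy, toy_predictor, 3)
```

Because the rollout queries the policy again on every call, re-running it at each time step always reflects the agent’s current policy, which is what lets the approach follow a policy that changes during cooperation.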
3.2 Mapping from the agent’s behavior to the explanation signal
Next, IBE decides an expression $e$ to explain the change in the environment $\Delta s_t$. In other words, we consider the mapping $\Delta s_t \mapsto e$.
By autonomously acquiring the relationship between instructions and states of the environment, we can also use the relationship to accelerate the learning process. For example, we will be able to judge whether the agent followed the instructions and give additional feedback to the agent’s actions.
First, we collect the history of the environment states $s_t$, rewards $r_t$, and instructions $i_t$ at each time $t$. Then we choose the episodes whose total reward is in the top $p$ percent, and calculate the changes in the environment caused by the agent’s actions, $\Delta s_t$. After that, we divide the set of $\Delta s_t$ into $k$ clusters with a clustering method. The classifier acquired in the clustering process can assign any $\Delta s_t$ to a cluster. The mapping is obtained by determining the explanation $e_j$ for each cluster $C_j$ ($j = 1, \dots, k$).
Here, $I_j$ is the set of instructions $i_t$ that accompany the changes $\Delta s_t$ in cluster $C_j$.
To decide the explanation $e_j$, we assume that the agent is more likely to have followed an expert’s instruction when the agent received more rewards. We calculate the expected value of the instructions in $I_j$ for each cluster and use it as $e_j$.
Finally, we normalize $e_j$ across the clusters to a fixed range ($-1$ to $1$ in the case study below).
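The mapping acquisition can be illustrated with a small self-contained sketch. Several simplifications are assumptions made here and are not taken from the paper: the change in the environment is reduced to a single scalar, a plain one-dimensional k-means with deterministic initialization stands in for the clustering method, the normalization is a min-max rescaling to the range from -1 to 1, and the episode data are invented purely for illustration.

```python
from statistics import mean

def select_top_episodes(episodes, top_fraction):
    """Keep the episodes whose total reward is in the top fraction."""
    ranked = sorted(episodes, key=lambda ep: ep["total_reward"], reverse=True)
    keep = max(1, round(len(ranked) * top_fraction))
    return ranked[:keep]

def nearest(value, centroids):
    """Index of the centroid closest to a scalar value."""
    return min(range(len(centroids)), key=lambda j: abs(value - centroids[j]))

def kmeans_1d(values, k, iterations=20):
    """Plain k-means on scalars with evenly spread initial centroids."""
    ordered = sorted(values)
    centroids = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            groups[nearest(v, centroids)].append(v)
        # An empty cluster keeps its previous centroid.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids

def learn_mapping(episodes, top_fraction, k):
    """Return (centroids, explanations), one explanation value per cluster."""
    top = select_top_episodes(episodes, top_fraction)
    pairs = [(d, i) for ep in top for d, i in zip(ep["deltas"], ep["instructions"])]
    centroids = kmeans_1d([d for d, _ in pairs], k)
    grouped = [[] for _ in range(k)]
    for d, i in pairs:
        grouped[nearest(d, centroids)].append(i)
    # Expected instruction per cluster (0.0 for an empty cluster: an arbitrary choice).
    raw = [mean(g) if g else 0.0 for g in grouped]
    low, high = min(raw), max(raw)
    if high == low:
        return centroids, [0.0] * k
    # Min-max rescaling to [-1, 1] (assumed form of the normalization).
    return centroids, [2 * (e - low) / (high - low) - 1 for e in raw]

def explain(delta, centroids, explanations):
    """Map a new change in state to the explanation of its cluster."""
    return explanations[nearest(delta, centroids)]

# Invented episodes: the low-reward episode ignores its "left" instruction.
episodes = [
    {"total_reward": 200.0, "deltas": [-3.0, -2.0], "instructions": [-1.0, -1.0]},
    {"total_reward": 180.0, "deltas": [0.0, 0.5], "instructions": [0.0, 0.0]},
    {"total_reward": 150.0, "deltas": [2.0, 3.0], "instructions": [1.0, 1.0]},
    {"total_reward": -100.0, "deltas": [3.0, 2.5], "instructions": [-1.0, -1.0]},
]
centroids, explanations = learn_mapping(episodes, top_fraction=0.75, k=3)
```

In this toy data, filtering by total reward discards the episode in which the agent moved right despite being told to fall left, so the rightmost cluster is explained by the "right" instruction rather than by a mixture of contradictory signals.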
4 Case study
To evaluate IBE, we constructed a game environment based on Lunar Lander v2, which was released on OpenAI Gym [brockman2016openai]. The goal of the game is to soft-land a rocket on the moon (Fig. 4). The available actions are as follows: do nothing, fire left orientation engine, fire main engine, and fire right orientation engine. The landing pad of the original Lunar Lander v2 is always in the center; however, we randomly changed the landing location among the left, center, and right, to make it more difficult for people to anticipate the agent’s behavior. In every time step, the reward for a rocket agent is calculated based on five parameters: the distance to the goal, the speed of the rocket, the degree of inclination, the amount of fuel used, and whether the legs of the rocket are grounded on the moon.
We prepared two rocket agents with deep reinforcement learning models [mnih2015human]. Agent A is in the middle of policy acquisition, and its probability of soft-landing on the goal is 63.3%. Agent B has a better policy than Agent A, and its probability is 83.3%. We prepared Agent A in order to consider the applicability of IBE to agents in an earlier stage of action acquisition. Unskilled agents can exhibit unexpected, immature behavior, so it is better to be able to explain the agent’s behavior as early as possible.
We decided the instruction for an agent as given by formula 5.
The instruction values $-1$, $0$, and $1$ mean “Fall to the left,” “Fall straight down,” and “Fall to the right,” respectively.
In the experiments, we prepared the predictor module using the same game engine as the Lunar Lander v2 to eliminate the uncertainty of the prediction.
4.1 Preliminary Experiment
First, we examined whether the assumption holds in the Lunar Lander: that the more reward an agent receives, the more likely it is that the agent’s behavior followed the instruction. The histogram in Fig. 5 shows the probability distribution of the agents’ horizontal displacement $\Delta x$ when the experts told the agents to “fall to the left” (instruction signal $-1$) and “fall straight down” (instruction signal $0$). $\Delta x$ is negative when the agent moves left, and zero when the agent falls straight down. We divided the episodes into two groups: episodes whose total reward is in the top 25%, and the others. The time span of the simulation was held fixed.
Fig. 5 shows that in the high-score episodes, agents follow the instruction more often than in the low-score episodes, especially for the low-score agent (Agent A). The result shows that considering the total reward in an episode helps in extracting the agents’ behavior that follows the expert’s instructions.
4.2 Prediction Task
We conducted an experiment to examine the effect of explaining an agent’s behavior with IBE. We flattened the ground and removed the flags so that the participants could not know where the goal was, and recorded the game scenes of Agent B. Then we selected 20 episodes that were longer than 80 frames, and cut out the last 80 frames before landing. The participants watched the first 20 frames and predicted where the agent would land, with or without the explanation by IBE, and then checked the actual behavior of the agent. Nine male students aged 21 to 28 who had never played the game participated in the experiment. We showed the output of IBE to five of the participants, and the others predicted the landing spot without the output.
The number of clusters was eight, and the time span for the simulation was 60. The outputs of IBE were normalized to the range from -1 to 1, and visualized as shown in Fig. 6. The predictor of IBE was Box2D [catto2011box2d], which is the same 2D game engine that Lunar Lander v2 uses. Before clustering the changes in the environment $\Delta s_t$, we normalized them, and then applied k-means clustering.
We compared the accuracy of the participants’ predictions. A t-test confirmed significant differences between the participants with the IBE explanation (group A) and those without it (group B) in two episodes (Fig. 7). In episode 1 (Fig. 8), group A was significantly more accurate than group B, and in episode 2 (Fig. 9), the result was reversed.
IBE generated complementary expressions from the three expressions given by the experts (Fig. 6). The complementary expressions make it possible to explain the degree of the agent’s behavior, whereas the expert’s instructions do not. In the first 20 frames of episode 1, the rocket agent fell in a straight line, but it later deviated greatly to the right. In episode 1, it is difficult for group B to predict the rocket’s behavior, because the movements of the agent in the first 20 frames and the last 60 frames are quite different. However, the output of IBE stayed at the right end; therefore, the participants in group A could anticipate that the agent would move considerably to the right.
On the other hand, Fig. 6 shows that the clusters are unevenly distributed. Since the clusters are skewed to the right, it is possible that the resolution of the explanation was low when the agent moved to the left. In other words, IBE can explain movements to the right in more detail than movements to the left. In episode 2, the rocket gently fell to the left (Fig. 9); however, IBE did not output a clearly negative value. The output of IBE gathered around zero. Therefore, the participants mistakenly believed that the rocket would fall straight down. The result suggests that we need to consider how to divide the environment change when assigning an explanation signal.
This paper proposed Instruction-based Behavior Explanation, a method to guess the meaning of an expert’s instruction and reuse the expression of the instruction to explain the agent’s behavior. With IBE, the designer of an agent does not have to prepare a mapping from the agent’s behavior to an expression in order to explain the behavior. By simulating the agent’s behavior, we can deal with a developmental agent whose policy changes during the interaction with the environment. Simulating the behavior of an agent over a long time span also makes it possible to explain the agent’s behavior with sufficient time granularity. The results of the experiments showed a partial contribution: the explanation autonomously acquired by IBE enriched people’s understanding of the agent’s future behavior.
Meanwhile, IBE still has challenges. The results of the experiments also suggested the difficulty of dividing the state space to assign an explanation signal. Predicting environmental change in a world with high complexity is still a challenging research topic. Moreover, we fixed the time span of the simulation in this paper, but in human communication, the time granularity of an explanation differs depending on the context. The problem of time granularity also occurs when an agent interprets the meaning of an instruction. In future work, we wish to consider the appropriate time span for the explanation of the behavior. In addition, we are exploring the possibility of applying IBE to the agent’s policy acquisition.
Author affiliations:
- Keio University, Yokohama, Japan
- Research Fellow of Japan Society for the Promotion of Science, Tokyo, Japan
- Dwango Artificial Intelligence Laboratory, Tokyo, Japan
- The Whole Brain Architecture Initiative, Tokyo, Japan