Autonomous Self-Explanation of Behavior for Interactive Reinforcement Learning Agents


In cooperation, the workers must know how co-workers behave. However, an agent’s policy, which is embedded in a statistical machine learning model, is hard to understand, and requires much time and knowledge to comprehend. Therefore, it is difficult for people to predict the behavior of machine learning robots, which makes Human Robot Cooperation challenging. In this paper, we propose Instruction-based Behavior Explanation (IBE), a method to explain an autonomous agent’s future behavior. In IBE, an agent can autonomously acquire the expressions to explain its own behavior by reusing the instructions given by a human expert to accelerate the learning of the agent’s policy. IBE also enables a developmental agent, whose policy may change during the cooperation, to explain its own behavior with sufficient time granularity.

Human Robot Cooperation; Interactive Reinforcement Learning; Instruction-based Behavior Explanation



HAI ’17, October 17–20, 2017, Bielefeld, Germany. ISBN 978-1-4503-5113-3/17/10


1 Introduction

Human Robot Cooperation (HRC), in which people and robots work on the same task together in a shared environment, is an effective concept for both industrial and domestic robots [amor2014interaction, dimeas2016online]. By working in a complementary manner, robots and people compensate for each other’s weaknesses and achieve difficult tasks that neither could achieve alone. Cooperative robots require the ability to deal with complicated real-world information, and machine learning techniques, such as deep reinforcement learning (DRL), are expected to provide this real-world information processing.

In cooperation, the workers must know how the other co-workers behave, in order to avoid dangerous misunderstandings and decide what roles to take in the situation [hayes2013challenges]. However, it is difficult for people to predict the machine learning agent’s behavior. The control logic embedded in a statistical machine learning model, especially in a deep learning model, is incomprehensible for most people, and requires much time and knowledge to understand.

Previous studies have shown that interaction between people and robots improves people’s understanding of robots and the performance of their cooperation. Andresen et al. proposed a robot that projects its own intentions and instructions onto real-world objects [7745145]. This robot detects the locations and shapes of nearby objects, and projects information directly onto them. The projection improved the effectiveness and efficiency of collaboration tasks. Hayes et al. proposed an answering system that explains an autonomous agent’s policy [hayes2017improving]. This system builds a statistical model of the autonomous agent’s actions, and answers several types of natural language questions based on the model, so that people can gain insight into the control logic of the agent. The method successfully summarized the policies of both hand-coded and machine learning agents in some domains, regardless of the internal representation of the agent’s control logic.

Figure 1: Instruction-based Behavior Explanation

However, all the information provided by the projection robot is designed by programmers. When behaviors become more complex and are acquired autonomously by machine learning techniques, it becomes infeasible for people to design in advance the information to show to co-workers. The natural language answering system also requires designers to prepare a mapping from the agent’s action to a communicable predicate that explains the action. In addition, Hayes et al. assumed that the policy of an agent does not change after a model of the agent’s behavior is built; therefore, this method does not work on a developmental agent that updates its policy gradually during actual human robot collaboration. Moreover, the work attempts to assign a communicable predicate to an action in one time step; however, a single-step action is usually too fine-grained to explain when we consider a machine learning model that controls the complex behavior of a robot. The behavior that people can recognize is the result of a sequence of actions. In order to explain the behavior of an agent to people, we have to consider the agent’s actions at a coarser time granularity.

This paper proposes Instruction-based Behavior Explanation (IBE), which is a method that explains the future behavior of a reinforcement learning agent in any situation. In IBE, we consider a setting of Interactive Reinforcement Learning (IRL) [thomaz2005real]. IRL is a framework in which a machine learning agent receives expert’s instructions to accelerate the agent’s policy acquisition. The IBE reuses the instruction as representations to explain an agent’s behavior (Fig. 1). However, in contrast to IRL, the designer or the instructor of agents does not have to give the relationship between instructions and agent’s actions explicitly. In IBE, an agent guesses the meanings of instructions on the assumption that, when an agent receives more rewards, it is more likely that the behavior of the agent followed the instruction. With this assumption, an agent can autonomously acquire the expression to explain the behavior. Besides, IBE estimates an agent’s behavior by simulating the transitions of the environment in each time step. The successive simulations make it possible to deal with a developmental agent whose policy is changeable. Moreover, by broadening the time span of the simulation, IBE can output information with sufficient time granularity.

Figure 2: The settings of IBE

2 Background

2.1 Reinforcement Learning (RL)

RL is a type of learning that acquires an agent’s policy autonomously in a sequential decision-making process [sutton1998reinforcement]. At time $t$, an agent observes the state of the environment $s_t$ and selects an action $a_t$. The agent’s action $a_t$ changes the state of the environment to $s_{t+1}$, and the agent receives a reward $r_t$ from the environment. An agent decides the action based on its policy $\pi$, where $\pi(a \mid s)$ is the probability that the agent takes action $a$ in environment state $s$. The goal of an RL agent is to find the optimal policy $\pi^*$ that maximizes the total reward. In this paper, we consider an agent that learns its policy with RL.
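The interaction loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy line-world environment, the dictionary representation of $\pi(a \mid s)$, the function names, and the use of an undiscounted sum as the total reward are all hypothetical choices and are not taken from the paper.

```python
import random

def sample_action(policy, state, rng):
    """Draw an action from pi(. | state), stored as {action: probability}."""
    actions, probs = zip(*policy[state].items())
    return rng.choices(actions, weights=probs, k=1)[0]

def run_episode(policy, step, initial_state, max_steps, rng):
    """Roll out one episode and return the total (undiscounted) reward."""
    state, total_reward = initial_state, 0.0
    for _ in range(max_steps):
        action = sample_action(policy, state, rng)
        state, reward, done = step(state, action)
        total_reward += reward
        if done:
            break
    return total_reward

# Hypothetical toy environment: positions 0..3 on a line, goal at position 3.
def toy_step(state, action):
    next_state = max(0, min(3, state + action))
    done = next_state == 3
    reward = 1.0 if done else -0.1  # small cost per step, bonus at the goal
    return next_state, reward, done

# A policy that moves right with probability 1 in every state.
always_right = {s: {+1: 1.0, -1: 0.0} for s in range(4)}
total = run_episode(always_right, toy_step, 0, 10, random.Random(0))
```

An RL algorithm would adjust the probabilities stored in the policy so that the value returned by `run_episode` increases on average.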

2.2 Interactive Reinforcement Learning (IRL)

In a complex situation in which the state and action spaces are very large, the learning process of an RL agent becomes excessively long [knox2009interactively]. For a cooperative agent to acquire its policy in the real world with machine learning techniques, it is necessary to deal with this increase in search time. IRL is an approach that can mitigate the search time problem. In IRL, a human or agent expert instructs a beginner agent in real time so that the beginner can learn the policy efficiently [cruz2016training]. Narrowing down the search space with instructions can also help a cooperative agent learn its policy in the real world.

Therefore, in this study, we consider a scenario in which a human expert instructs an agent (Fig. 2). An expert gives an instruction signal $i_t$ to a beginner agent at every time step $t$. $i_t$ is a real number that represents an instruction from an expert to an agent. The instruction space is much narrower than the state space. Therefore, we expect that an agent can associate instructions with actions more quickly than it can associate environment states with actions. Conversely, because the instruction signal is less informative than the state information, the agent can select actions more precisely by using the environment state.

3 Instruction-based Behavior Explanation (IBE)

In this paper, we propose IBE, a method to explain an autonomous agent’s behavior with the expressions given by a human expert as instruction (Fig. 3). IBE consists of two steps: (i) estimating the target of the agent’s actions by simulation and (ii) acquiring a mapping from the target of the agent’s actions to the expressions, in order to explain the action target based on the instruction signal given by a human expert.

Figure 3: The flow to explain an agent’s behavior in IBE

3.1 Estimation of the action target

In this study, we define the target of the agent’s actions at time $t$ as the change in the environment state after the agent’s actions over $n$ steps, $\Delta s_t$.

$$\Delta s_t = s_{t+n} - s_t$$

With the introduction of the time span $n$, IBE can output the behavior explanation with human-understandable time granularity.

IBE estimates $\Delta s_t$ with the agent’s policy $\pi$ and a predictor (Algorithm 1). The predictor predicts the environment state at the next time step, taking the current state and the agent’s action as input.

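Algorithm 1 Estimation of the action target
Require: $s_t$: current state, $n$: range of the time steps
Ensure: $\Delta s_t$: transition of the environment state
  $s \leftarrow s_t$
  for $j = 1$ to $n$ do
    $a \sim \pi(\cdot \mid s)$
    $s \leftarrow \mathrm{predictor}(s, a)$
  end for
  return $s - s_t$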
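The idea behind Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, assuming that the state is a list of numbers, that the policy is a callable returning an action for a state (sampling internally if it is stochastic), and that the predictor is a callable returning the next state. The function names and the toy world are hypothetical.

```python
def estimate_action_target(state, policy, predictor, n):
    """Simulate n steps ahead and return the state change s_{t+n} - s_t."""
    simulated = list(state)
    for _ in range(n):
        action = policy(simulated)                # choose an action for the simulated state
        simulated = predictor(simulated, action)  # predict the next state
    return [after - before for after, before in zip(simulated, state)]

# Hypothetical toy world: state = [x, y]; the action is a horizontal step,
# and the lander drops one unit per time step.
def toy_policy(state):
    return 1.0  # always drift to the right

def toy_predictor(state, action):
    x, y = state
    return [x + action, y - 1.0]

delta = estimate_action_target([0.0, 10.0], toy_policy, toy_predictor, n=3)
# delta == [3.0, -3.0]: three units to the right and three units down
```

Because the rollout is recomputed from the current policy at every call, the estimate follows a policy that changes during learning, and a larger `n` yields a coarser, longer-horizon description of the behavior.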

3.2 Mapping from the agent’s behavior to the explanation signal

Next, IBE decides an expression $e$ to explain the change in the environment $\Delta s_t$. In other words, we consider the mapping $\Delta s_t \mapsto e$.

By autonomously acquiring the relationship between instructions and states of the environment, we can also use the relationship to accelerate the learning process. For example, we will be able to judge whether the agent followed the instructions and give additional feedback to the agent’s actions.

First, we collect the history of the environment state $s_t$, the reward $r_t$, and the instruction $i_t$ at each time $t$. Then we choose the episodes whose total reward is in the top $m$ %, and calculate the changes in the environment caused by the agent’s actions, $\Delta s_t$. After that, we divide the set of $\Delta s_t$ into $K$ clusters with a clustering method. The classifier acquired in the clustering process can assign any $\Delta s$ to a cluster. The mapping is obtained by determining the explanation $e_k$ for each cluster $C_k$ ($k = 1, \dots, K$).

$I_k$ is the set of instructions accompanied by $\Delta s_t \in C_k$.

To decide the explanation $e_k$, we assume that the agent is more likely to have followed an expert’s instruction when it received more rewards. We calculate the expected value of the instructions in $I_k$ for each cluster.

Finally, we normalize $e_k$ across the clusters to lie in $[-1, 1]$.
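The mapping step can be sketched as follows. This is a minimal pure-Python illustration, not the authors’ implementation. The tiny k-means routine, its seeding with the first few points, the unweighted mean used as the expected instruction, the min-max rescaling to $[-1, 1]$, the value 0.0 for an empty cluster, and all function names are assumptions. The inputs are assumed to come from episodes that were already selected by total reward.

```python
from statistics import mean

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return dists.index(min(dists))

def kmeans(points, k, iterations=20):
    """Tiny k-means; the first k points seed the centroids (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Recompute each centroid; keep the old one if its group is empty.
        centroids = [
            [mean(dim) for dim in zip(*g)] if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids

def learn_explanations(delta_states, instructions, k):
    """Return (centroids, e), where e[c] is the rescaled mean instruction of cluster c."""
    centroids = kmeans(delta_states, k)
    per_cluster = [[] for _ in range(k)]
    for ds, instruction in zip(delta_states, instructions):
        per_cluster[nearest(ds, centroids)].append(instruction)
    raw = [mean(v) if v else 0.0 for v in per_cluster]
    lo, hi = min(raw), max(raw)
    if hi == lo:
        return centroids, [0.0] * k
    return centroids, [2 * (r - lo) / (hi - lo) - 1 for r in raw]

def explain(delta_state, centroids, explanations):
    """Look up the explanation signal for a new state change."""
    return explanations[nearest(delta_state, centroids)]

# Hypothetical one-dimensional state changes with the instruction given at that time.
delta_states = [[-2.0], [0.1], [2.0], [-1.8], [0.0], [2.2]]
instructions = [-1.0, 0.0, 1.0, -1.0, 0.0, 1.0]
centroids, e = learn_explanations(delta_states, instructions, k=3)
signal = explain([1.9], centroids, e)
```

In this toy data each cluster contains a single kind of instruction, so the learned signals coincide with the instruction values. When a cluster mixes instructions, its signal falls between them, which is how intermediate expressions can arise.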

4 Case study

To evaluate IBE, we constructed a game environment based on Lunar Lander v2, which was released on OpenAI Gym [brockman2016openai]. The goal of the game is to soft-land a rocket on the moon (Fig. 4). The available actions are as follows: do nothing, fire left orientation engine, fire main engine, and fire right orientation engine. The landing pad of the original Lunar Lander v2 is always in the center; however, we randomly changed the landing location among the left, center, and right, to make it more difficult for people to anticipate the agent’s behavior. In every time step, the reward for a rocket agent is calculated based on five parameters: the distance to the goal, the speed of the rocket, the degree of inclination, the amount of fuel used, and whether the legs of the rocket are in contact with the ground.

We prepared two rocket agents with deep reinforcement learning models [mnih2015human]. Agent A is in the middle of policy acquisition, and its probability of soft-landing on the goal is 63.3 %. Agent B has a better policy than agent A, and its probability is 83.3 %. We prepared agent A in order to examine the applicability of IBE to agents in an earlier stage of action acquisition. Unskilled agents can exhibit unexpected, immature behavior, so it is desirable to be able to explain the agent’s behavior as early as possible.

We decided the instruction for an agent as given by formula 5.


The three instruction values mean “Fall to the left,” “Fall straight down,” and “Fall to the right,” respectively.

In the experiments, we prepared the predictor module using the same game engine as the Lunar Lander v2 to eliminate the uncertainty of the prediction.

Figure 4: Modified Lunar Lander v2. An agent receives the location of the goal with the environment state, and learns to soft-land on the goal.

4.1 Preliminary Experiment

First, we tested the assumption that, when an agent receives more rewards, its behavior is more likely to have followed the instruction, using the Lunar Lander environment. The histogram in Fig. 5 shows the probability distribution of the agents’ horizontal displacement $\Delta x$ when the experts told the agents to “fall to the left” and to “fall straight down.” $\Delta x$ is negative when the agent moves left, and zero when the agent falls straight down. We divided the episodes into two groups: episodes whose total reward is in the top 25 %, and the others. The time span for the simulation, $n$, was fixed.
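The grouping used in this check can be sketched as follows. The episode records, their field names, and the numbers are placeholders invented for illustration and are not data from the experiments. The sketch only shows how episodes could be split by total reward and how the mean horizontal displacement of each group could be compared.

```python
from statistics import mean

def split_by_total_reward(episodes, top_fraction=0.25):
    """Split episodes into (high-score, others) by total reward."""
    ranked = sorted(episodes, key=lambda ep: ep["total_reward"], reverse=True)
    cut = max(1, int(len(ranked) * top_fraction))  # keep at least one episode
    return ranked[:cut], ranked[cut:]

def mean_delta_x(episodes):
    """Average horizontal displacement over all recorded steps."""
    return mean(dx for ep in episodes for dx in ep["delta_x"])

# Placeholder episodes recorded under a "fall to the left" instruction.
# These numbers are invented and are not from the paper.
episodes = [
    {"total_reward": 200.0, "delta_x": [-0.8, -0.6]},
    {"total_reward": 50.0, "delta_x": [-0.1, 0.2]},
    {"total_reward": -20.0, "delta_x": [0.3, 0.1]},
    {"total_reward": 10.0, "delta_x": [0.0, -0.2]},
]
high, low = split_by_total_reward(episodes)
high_mean, low_mean = mean_delta_x(high), mean_delta_x(low)
```

If the assumption holds, the high-score group should show a mean displacement that is more negative, that is, further to the left, than the other group.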

Figure 5: The histogram of $\Delta x$ for the steps in which the agent received the instructions. $\Delta x$ is negative when the agent moves left, and zero when the agent falls straight down. The vertical bars indicate the average values of $\Delta x$. In (a), the blue areas, which indicate the high-score episodes, are distributed further to the left than the red areas, which indicate the low-score episodes. The result shows that, with the assumption, IBE can extract the agent’s behavior that follows the expert’s instructions.

Fig. 5 shows that agents follow the instruction more often in the high-score episodes than in the low-score episodes, especially in the case of the low-score agent (agent A). The result shows that considering the total reward of an episode helps in extracting the agents’ behavior that follows the expert’s instructions.

4.2 Prediction Task

We conducted an experiment to examine the effect of explaining an agent’s behavior with IBE. We flattened the ground and removed the flags so that the participants could not know where the goal was, and recorded the game scenes of agent B. Then we selected 20 episodes that were longer than 80 frames, and extracted the last 80 frames before landing. The participants watched the first 20 frames, with or without the explanation by IBE, predicted where the agent would land, and then checked the actual behavior of the agent. Nine male students aged 21 to 28 who had never played the game participated in the experiment. We showed the output of IBE to five of the participants, and the others predicted the landing spot without the output.

The number of clusters was eight, and the time span for the simulation was 60. The outputs of IBE were normalized to the range from -1 to 1, and visualized as shown in Fig. 6. The predictor of IBE was Box2D [catto2011box2d], which is the same 2D game engine as that of Lunar Lander v2. Before clustering the changes in the environment $\Delta s$, we normalized $\Delta s$, and we then used k-means clustering.

Figure 6: $e_k$ to explain the target of the action of an agent in a cluster $C_k$. The distribution of $e_k$ is biased to the right. Because of the bias, IBE can explain the movements to the right in more detail than those to the left.

4.3 Result

We compared the accuracy of the participants’ predictions. A t-test confirmed significant differences between the participants with the IBE’s explanation (group A) and those without the explanation (group B) in two episodes (Fig. 7). In episode 1 (Fig. 8), group A was significantly more accurate than group B, whereas in episode 2 (Fig. 9) the result was reversed.
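For readers who want to reproduce this kind of comparison, the test statistic can be sketched as follows. The paper does not state which variant of the t-test was used. Welch’s unequal-variance form is shown here purely as an assumption, and the two samples are placeholder numbers, not the participants’ errors.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return (t, degrees_of_freedom) for Welch's unequal-variance t-test."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means.
    va, vb = variance(sample_a) / na, variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Placeholder prediction errors for five and four participants.
# These values are invented and are not from the paper.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [6.0, 7.0, 8.0, 9.0]
t_value, dof = welch_t(group_a, group_b)
```

The p-value then follows from the t distribution with `dof` degrees of freedom, which would need a statistics library or a table. A negative `t_value` here means that the first sample has the smaller mean error.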

Figure 7: The prediction error as a percentage of the width of the game screen in episodes 1 and 2.

IBE generated complementary expressions from the three expressions given by the experts (Fig. 6). The complementary expressions make it possible to explain the degree of the agent’s behavior, whereas the expert’s instructions do not. In the first 20 frames of episode 1, the rocket agent fell in a straight line, but it then deviated greatly to the right. In episode 1, it is difficult for group B to predict the rocket’s behavior, because the movements of the agent in the first 20 frames and in the last 60 frames are quite different. However, the outputs of IBE were concentrated on the right; therefore, the participants in group A could anticipate that the agent would move considerably to the right.

Figure 8: Explanation was effective in episode 1. The upper right portion shows agent B’s behavior in the first 20 frames. The upper left portion shows the visualization of the output of IBE in the first 20 frames. The lower portion shows agent B’s behavior in the last 60 frames. At first agent B fell in a straight line, but then veered widely to the right.
Figure 9: Explanation misled the participants in episode 2. Agent B went a little to the left, but the outputs of IBE gathered in the center, which misled the participants to think the rocket fell straight down.

On the other hand, Fig. 6 shows that the clusters are unevenly distributed. Since the clusters are skewed to the right, it is possible that the resolution of the explanation was low when the agent moved to the left. In other words, IBE can explain the movements to the right in more detail than those to the left. In episode 2, the rocket gently fell to the left (Fig. 9); however, IBE did not output a value indicating leftward movement. The outputs of IBE gathered around zero. Therefore, the participants mistakenly concluded that the rocket would fall straight down. The result suggests that we need to consider how to divide the environment change in order to assign an explanation signal.

5 Conclusion

This paper proposed Instruction-based Behavior Explanation, a method to guess the meaning of an expert’s instruction and reuse the expression of the instruction to explain the agent’s behavior. With IBE, the designer of an agent does not have to prepare a mapping from the agent’s behavior to an expression in order to explain the behavior. By simulating the agent’s behavior, we can deal with a developmental agent whose policy changes during the interaction with the environment. Simulating the behavior of an agent over a long time span also makes it possible to explain the agent’s behavior with sufficient time granularity. The results of the experiments showed that the explanation autonomously acquired by IBE partially improved people’s understanding of the agent’s future behavior.

Meanwhile, IBE still has challenges. The results of the experiments also suggested the difficulty of dividing the state space to assign an explanation signal. Predicting environmental change in a highly complex world is still a challenging research topic. Moreover, we fixed the time span of the simulation in this paper, but in human communication, the time granularity of an explanation differs depending on the context. The problem of time granularity also occurs when an agent interprets the meaning of an instruction. In future work, we wish to consider the appropriate time span for the explanation of the behavior. In addition, we are exploring the possibility of applying IBE to the agent’s policy acquisition.



  1. Keio University, Yokohama, Japan
  2. Research Fellow of Japan Society for the Promotion of Science, Tokyo, Japan
  3. Dwango Artificial Intelligence Laboratory, Tokyo, Japan
  4. The Whole Brain Architecture Initiative, Tokyo, Japan