"I’m sorry Dave, I’m afraid I can’t do that" Deep Q-Learning From Forbidden Actions

Mathieu Seurin
Université de Lille
Philippe Preux
Université de Lille, CRIStAL
Olivier Pietquin
Google Brain

The use of Reinforcement Learning (RL) is still restricted to simulation or to enhancing human-operated systems through recommendations. Real-world environments (e.g. industrial robots or power grids) are generally designed with safety constraints in mind, implemented in the form of valid-action masks or contingency controllers. For example, the range of motion and the angles of the motors of a robot can be limited to physical boundaries. Violating constraints thus results in rejected actions or in entering a safe mode driven by an external controller, making RL agents incapable of learning from their mistakes. In this paper, we propose a simple modification of a state-of-the-art deep RL algorithm (DQN), enabling learning from forbidden actions. To do so, the standard Q-learning update is enhanced with an extra safety loss inspired by structured classification. We empirically show that it reduces the number of constraint violations during the learning phase and accelerates convergence to near-optimal policies compared to standard DQN. Experiments are conducted on a visual grid world environment and on a TextWorld domain.

1 Introduction

Despite the success of Reinforcement Learning (RL) (sutton2018reinforcement) in different domains, such as games (Silver1140), resource management (mao2016resource), and chemical reactions (zhou2017optimizing), many reasons deter industry from using reinforcement learning in the real world. One of them is the agent’s unpredictability in unknown situations. This is especially harmful when the algorithm is embedded in a physical system, such as traffic-light management (el2013multiagent) or data-center cooling (lazic2018data), where a bad action can lead to catastrophic consequences (damaging equipment, putting people in danger).

On the other hand, many real-world systems are equipped with safety locks, in the form of forbidden actions or of an external controller that takes over when the system misbehaves. For example, most servo-motors present in robots are equipped with over-temperature monitoring and over-voltage detection, and cleaning robots automatically make a U-turn when an obstacle is present.

Another motivation, taking inspiration from Natural Language Processing and dialogue, is the integration of external rejection signals such as a syntax parser or an autocorrect mechanism. If integrated correctly, those tools could simplify language acquisition by removing unnecessary or wrong sentences.

However, model-free reinforcement learning is not designed to take this type of information into account. In the current Markov Decision Process (MDP) framework (puterman2014markov), a rejected action is seen, from the agent’s point of view, as a transition to the same state. The state-action value is only decreased by a factor $\gamma$ (more details in subsection 3.1), which misrepresents the fact that the action is potentially harmful. The only way to modify the agent’s behavior is to tweak the reward function, giving a negative reward when a forbidden action is taken. This technique is a form of reward shaping (ng1999policy), which is known to be hard and can change the optimal policy in unpredictable ways.

In this paper, we propose a better integration of forbidden actions into a Q-learning-like algorithm by adding a classification loss that maintains the Q-values of forbidden actions below those of valid ones. We show empirically that it reduces the number of calls to forbidden actions by the agent and that it accelerates convergence to near-optimal policies compared to standard DQN. Those experiments are conducted on two tasks: a visual grid world and a textual environment.

2 Context: Reinforcement Learning

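In reinforcement learning (sutton2018reinforcement), an agent learns to interact with an environment so as to maximize a cumulative function of rewards. At each time step $t$, the agent is in a state $s_t \in \mathcal{S}$, where it selects an action $a_t \in \mathcal{A}$ according to its policy $\pi$. It then moves to the next state $s_{t+1}$ according to a transition kernel $P(s_{t+1} \mid s_t, a_t)$ and receives a reward $r_t$ drawn from the environment’s reward function $R(s_t, a_t)$. The quality of the policy is assessed by the Q-function, defined by $Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t \geq 0} \gamma^t r_t \mid s_0 = s, a_0 = a\right]$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$, where $\gamma \in [0, 1)$ is the discount factor. The optimal Q-value is defined as $Q^*(s, a) = \max_\pi Q^\pi(s, a)$, from which the optimal policy $\pi^*(s) = \arg\max_a Q^*(s, a)$ is derived.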
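As a small illustration of the quantities defined above, the following sketch computes a discounted return for one trajectory and derives a greedy policy from a tabular Q-function. The function names and the toy numbers are assumptions made for this example; they do not come from the paper.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one trajectory (computed backwards)."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

def greedy_policy(q_table):
    """Greedy action for each state of a (num_states, num_actions) Q-table."""
    return np.argmax(q_table, axis=1)

# Hypothetical trajectory: reward only at the last step.
g = discounted_return([0.0, 0.0, 1.0], gamma=0.5)
# Hypothetical Q-table with 2 states and 2 actions.
policy = greedy_policy(np.array([[1.0, 2.0], [3.0, 0.0]]))
```

The Q-function is the expectation of such returns over trajectories; DQN replaces the table by a neural network.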

Here, we use Deep Q-learning (DQN) to approximate the optimal Q-function with neural networks and perform off-policy updates by sampling transitions from a replay buffer (mnih2015human).

3 Method

3.1 Feedback Signal and MDP-F

We augment the MDP model with a feedback signal, a Boolean indicating whether an action was accepted or rejected by the environment. An MDP-F is then defined as a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma, F)$, where $F : \mathcal{S} \times \mathcal{A} \to \{0, 1\}$ is a function mapping a state and an action to a binary value, with 0 meaning that the action is valid and 1 meaning that it is unsafe or rejected.
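To make the feedback signal concrete, here is a minimal sketch of a toy environment exposing it. The `FeedbackCorridor` class, its corridor dynamics, and its reward are hypothetical and chosen only for illustration; the one property taken from the text is that a rejected action leaves the state unchanged and returns a feedback of 1.

```python
class FeedbackCorridor:
    """Toy MDP-F: a 1-D corridor where stepping outside [0, length - 1] is rejected."""

    def __init__(self, length=4):
        self.length = length
        self.state = 0

    def step(self, action):
        """action: 0 = left, 1 = right. Returns (next_state, reward, done, feedback)."""
        target = self.state + (1 if action == 1 else -1)
        if target < 0 or target >= self.length:
            # Rejected action: state unchanged, zero reward, feedback flag set to 1.
            return self.state, 0.0, False, 1
        self.state = target
        done = self.state == self.length - 1
        return self.state, (1.0 if done else 0.0), done, 0
```

An agent interacting with this class would store the fourth element of the returned tuple alongside the usual transition in its replay buffer.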

Why Q-learning should integrate this information

Vanilla Q-learning struggles to differentiate between actions flagged as forbidden and valid ones. Consider the following example: an agent in a state $s$ takes an action $a$ flagged as forbidden, $F(s, a) = 1$. When applying the Q-learning update ($Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')$), since the action was rejected, $s' = s$ and $r = 0$. Thus the update becomes $Q(s, a) \leftarrow \gamma \max_{a'} Q(s, a')$. In current deep reinforcement learning setups, $\gamma$ is usually set close to 1 (mnih2015human; pohlen2018observe). A DQN-like algorithm therefore requires many transitions before the Q-values of forbidden actions decrease, and in the meantime it tries the forbidden action many times. We emphasize that an invalid action indicates an action that could be harmful, so rapidly identifying and avoiding those potentially dangerous situations is essential.
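The slow decay can be quantified with a short calculation. The sketch below assumes the least favorable case, in which the rejected action is itself the greedy action in the state, so that each tabular update with step size 1 simply multiplies its value by the discount factor. The helper name and the discount values tried are illustrative assumptions.

```python
import math

def updates_to_shrink(gamma, ratio):
    """Smallest k such that gamma**k <= ratio, i.e. the number of
    repeated updates Q <- gamma * Q needed to shrink Q by `ratio`."""
    return math.ceil(math.log(ratio) / math.log(gamma))

# Number of rejected transitions needed to halve the value of the action.
for gamma in (0.9, 0.99, 0.999):
    print(gamma, updates_to_shrink(gamma, 0.5))
```

With a discount factor of 0.99, this gives 69 rejected transitions just to halve the value; with function approximation and small learning rates, the decrease is slower still.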

3.2 Frontier loss

Link to Imitation Learning

In imitation learning, few expert demonstrations are available, and extracting as much information as possible from them is key. For example, hester2018deep; Piot2014 slightly modify the Q-learning update to nudge the values of expert actions above those of other actions. This is done by adding a secondary loss inspired by structured classification:

$$J_E(Q) = \max_{a \in \mathcal{A}} \left[ Q(s, a) + l(a_E, a) \right] - Q(s, a_E)$$

where $l(a_E, a) = 0$ when $a = a_E$ and $l(a_E, a) = m$ otherwise. This nudges the Q-value of actions taken by the expert above the Q-value of other actions by at least a certain margin $m$.
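The large-margin loss used in imitation learning can be sketched for a single state as follows. The sketch assumes a margin function equal to 0 for the expert action and to a constant margin otherwise; the function name and the numbers in the usage lines are hypothetical.

```python
import numpy as np

def margin_loss(q_values, expert_action, margin):
    """max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E), with l = 0 for the
    expert action and `margin` for every other action."""
    penalties = np.full(q_values.shape, margin, dtype=float)
    penalties[expert_action] = 0.0
    return float(np.max(q_values + penalties) - q_values[expert_action])

# Expert action 0 leads by less than the margin: the loss is positive.
small_gap = margin_loss(np.array([1.0, 0.5, 0.2]), expert_action=0, margin=0.8)
# Expert action 0 leads by more than the margin: the loss vanishes.
large_gap = margin_loss(np.array([2.0, 0.5, 0.2]), expert_action=0, margin=0.8)
```

The loss is zero exactly when the expert action beats every other action by at least the margin, which is the property the frontier loss transposes to forbidden actions.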

Similarly, we want to derive a loss that penalizes the Q-function when the value of a forbidden action exceeds the value of a valid one.

Frontier loss

Ideally, for every state encountered during training, we would like the Q-value of every forbidden action to be below that of every valid action by a certain margin $m$. We call the corresponding loss the frontier loss.
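One way to write this objective, given here as a sketch consistent with the description above rather than as the authors' exact formula, is a hinge on the gap between the forbidden action and the lowest-valued valid action. The notation is an assumption of this sketch: $a_f$ denotes a forbidden action taken in state $s$, $\mathcal{V}(s)$ the set of valid actions in $s$, and $J_F$ the frontier loss.

```latex
J_F(Q) = \max\left(0,\; Q(s, a_f) + m - \min_{a \in \mathcal{V}(s)} Q(s, a)\right)
```

The loss is zero as soon as $Q(s, a_f) + m \leq Q(s, a)$ holds for every valid action $a$, so Q-values that already respect the frontier are left untouched.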

Figure 1: Illustration of frontier loss
Figure 2: The training procedure is simple: the model predicts the validity of each action, and we only backpropagate for the action the agent took.

In our experiments, .

Frontier loss and classification

The main problem with this objective function is the need to know, for every state, which actions are valid. In most tasks, the agent encounters the majority of states only once, so we need to rely on function approximation to estimate which actions are valid in a given state. To achieve this, we train a neural network to predict, in each state, which actions will be valid. The training procedure is illustrated in Figure 2. To avoid early misclassifications, we apply a threshold to the sigmoid output: an action is considered valid only if its predicted value is above the threshold.
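The two ingredients described above, a loss applied only to the action actually taken and a threshold on the sigmoid output, can be sketched as follows. The sketch works on raw logits with NumPy rather than on a trained network; the function names are hypothetical, and the default threshold of 0.9 is the value listed in Appendix A.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def validity_bce(logits, action, feedback):
    """Binary cross-entropy on the taken action only.
    The label is 1 for a valid action (feedback 0) and 0 for a rejected one."""
    p = sigmoid(logits[action])
    label = 1.0 - feedback
    return float(-(label * np.log(p) + (1.0 - label) * np.log(1.0 - p)))

def predicted_valid_actions(logits, threshold=0.9):
    """Indices of the actions whose predicted validity exceeds the threshold."""
    return np.flatnonzero(sigmoid(logits) > threshold)

# Hypothetical logits for 4 actions in one state.
valid = predicted_valid_actions(np.array([5.0, 0.0, -5.0, 3.0]))
```

A high threshold means that an action is treated as valid only once the classifier is confident, so early, noisy predictions do not pull the Q-values of forbidden actions toward the wrong frontier.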

Frontier loss and DQN

Combining the frontier loss and Deep Q-learning is simple, as it only requires summing the two losses. We weight the frontier loss by a factor $\lambda$. For all the experiments described below, we use and . Not much tweaking is required regarding this hyper-parameter.

Data: minibatch from replay buffer D, Q-network Q, classification network C, margin m
Result: Frontier loss
loss = 0;
for state s, action a, feedback f in minibatch do
      if f == 1 then
            valid_actions = C(s);
            if min[Q(s, valid_actions)] < Q(s, a) + m then
                  loss = loss + Q(s, a) + m - min[Q(s, valid_actions)];
return loss;
Algorithm 1 Frontier loss and classification network
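The procedure of Algorithm 1 can be sketched in NumPy as below. This is an illustrative reading, not the authors' implementation: it assumes that the penalty is a hinge (positive only when the rejected action is not at least a margin below the lowest-valued valid action), that the classifier output is already given as a boolean mask, and that a transition is skipped when no action is predicted valid. All names are hypothetical.

```python
import numpy as np

def frontier_loss(q_batch, actions, feedbacks, valid_masks, margin):
    """Hinge-style frontier loss summed over the rejected transitions.

    q_batch:     (B, A) Q-values for each state of the minibatch
    actions:     (B,)   index of the action taken
    feedbacks:   (B,)   1 if the action was rejected, 0 otherwise
    valid_masks: (B, A) boolean mask of the actions predicted valid
    """
    loss = 0.0
    for q, a, f, mask in zip(q_batch, actions, feedbacks, valid_masks):
        if f != 1 or not mask.any():
            continue  # accepted action, or no action predicted valid yet
        min_valid = q[mask].min()
        # Penalize only when the rejected action is not `margin` below min_valid.
        loss += max(0.0, q[a] + margin - min_valid)
    return float(loss)
```

In a deep learning framework, the same computation would be written with differentiable tensor operations so that the gradient flows into the Q-network.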

4 Experiments

We designed two experiments on two different domains to assess the quality of our method.

4.1 MiniGrid Environment

The first environment is a simple visual gridworld presented in (chevalier2019baby). The goal is to reach the green zone starting from a random point. Since we want to study how the agent can integrate feedback about the validity of actions, we increase the size of the action space. The primary action space is composed of 3 actions (turning left, turning right, going forward), and each action is duplicated $n$ times, so the action space size becomes $3n$. We then create different rooms in which the agent has to navigate, the color of the background indicating which set of actions is valid. For example, in the red room, only actions 11, 12, and 13 are valid, and all the others return a "not valid" feedback. In our setup, we use $n = 5$, making a total of 15 actions.

The state space is a compact encoded representation of the agent’s point of view (more details in (chevalier2019baby)). Since the environment is partially observable, we stack the last three frames as done in mnih2015human, but we do not use frame-skipping. An episode ends when the agent reaches the green zone or after 200 environment steps.
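Frame stacking as described above can be sketched with a fixed-size buffer. The `FrameStack` class is a hypothetical helper written for this example; filling the buffer with copies of the first frame at the start of an episode is an assumption, as the paper does not say how the first steps are handled.

```python
from collections import deque
import numpy as np

class FrameStack:
    """Keeps the last `num_frames` observations and returns them stacked."""

    def __init__(self, num_frames=3):
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_frame):
        # Assumption: pad with copies of the first observation at episode start.
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return self.observation()

    def push(self, frame):
        # The deque drops the oldest frame automatically.
        self.frames.append(frame)
        return self.observation()

    def observation(self):
        # Stack along a new leading axis, oldest frame first.
        return np.stack(list(self.frames), axis=0)
```

The stacked array is what the convolutional network receives as input at each step.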

Figure 3: An instance of the MiniGrid problem. The state is a partial view of the maze (the agent’s point of view). To mitigate partial observability, we stack the last 3 frames.
Figure 4: An example of interaction in TextWorld. The agent knows what happened after its last action, the room description, and the content of its inventory.

4.2 TextWorld Environment

TextWorld (cote18textworld) is a text-based game where the agent interacts with the environment using short sentences. We generated a game composed of 3 rooms and 7 objects, with a quest of length 4. An example is shown in Figure 4. We modified the environment to fit our needs: the action space is composed of all <action> <object> pairs, for a total of 46 actions. Most of these actions are rejected by the simulator since they do not fit the situation the agent is facing. For example, the action "take sword" is rejected if no sword is available.
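The construction of such an action space, and the way a rejection can be turned into a feedback signal, can be sketched as follows. The verbs, objects, and set of admissible commands are hypothetical, and the sketch does not call the TextWorld API; it only illustrates the pairing of actions and objects and the resulting binary feedback.

```python
from itertools import product

def build_action_space(verbs, objects):
    """All '<action> <object>' command strings."""
    return [f"{verb} {obj}" for verb, obj in product(verbs, objects)]

def feedback_for(command, admissible_commands):
    """Feedback 1 (rejected) if the command does not fit the current situation."""
    return 0 if command in admissible_commands else 1

# Hypothetical vocabulary, much smaller than the 46 actions used in the paper.
actions = build_action_space(["take", "drop"], ["sword", "key"])
# Hypothetical situation: only a key is available in the room.
rejected = feedback_for("take sword", admissible_commands={"take key"})
```

Because most pairs are meaningless in a given room, most entries of this action space return a feedback of 1, which is what makes the setting a useful test for the frontier loss.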

4.3 Model and architecture

In all experiments, we use Double DQN (hasselt2016deep) with uniform experience replay and $\epsilon$-greedy exploration. In the Minigrid environment, we use a Convolutional Neural Network (lecun1995convolutional) with a fully-connected layer on top. In TextWorld, the inventory, observation, and room description are each encoded by an LSTM (hochreiter1997long); the encodings are concatenated and processed by a fully-connected layer.
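For reference, the Double DQN target and the exploration rule mentioned above can be sketched in NumPy. The array shapes, function names, and the handling of terminal states through a `dones` vector are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np

def double_dqn_targets(rewards, dones, q_online_next, q_target_next, gamma):
    """r + gamma * Q_target(s', argmax_a Q_online(s', a)),
    with no bootstrapping on terminal transitions (dones == 1)."""
    best_actions = np.argmax(q_online_next, axis=1)
    next_values = q_target_next[np.arange(len(best_actions)), best_actions]
    return rewards + gamma * (1.0 - dones) * next_values

def epsilon_greedy(q_values, epsilon, rng):
    """Uniform random action with probability epsilon, greedy action otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```

The online network selects the next action and the target network evaluates it, which is what distinguishes Double DQN from the original DQN target.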

The classification network exactly matches the architecture used by DQN, i.e. a ConvNet for Minigrid and LSTMs for TextWorld; the only difference resides in the training procedure (explained in Figure 2).

5 Results

Figure 5: DQN+Frontier (yellow) and DQN (blue). Results are averaged over five random seeds for Minigrid and nine random seeds for TextWorld. The shaded area represents one standard deviation. Left: performance on Minigrid. The first plot represents the number of times a forbidden action is taken, and the second one represents the percentage of success over time. Right: results on TextWorld. The high standard deviation indicates that DQN struggles to reach the level’s end consistently.

In Figure 5, we compare DQN with and without the frontier loss. In the Minigrid domain, DQN struggles to find the optimal policy and reaches the exit only 20% of the time. Most of the time, DQN is able to solve one room but fails to find the set of valid actions for each room, performing forbidden actions over and over. By contrast, the frontier loss guides DQN, reducing the number of rejection signals received from the environment and helping it find the optimal policy. Those results are echoed in the TextWorld experiment. DQN solves the game half of the time; the other half, it never encounters the reward and, as a result, cannot solve the game. This could be mitigated by a better exploration strategy. A visualization of the Q-values can be found in Appendix B.

6 Related Works

Action Elimination

Closely related to our work is the notion of action elimination, which was introduced in (even2006action). The main idea developed in that work, applied to multi-armed bandits (lattimore2018bandit; lai1985asymptotically; robbins1952some), is to get rid of a sub-optimal action as soon as the value of this action falls outside some confidence interval.

A similar idea was applied in Deep Reinforcement Learning by zahavy2018learn. This article shares similarities with ours, as the authors try to eliminate actions based on a signal given by the environment indicating whether the action is valid or not. They use a contextual bandit to assess the certainty of the elimination signal, and remove actions from the action set when the confidence is above a certain threshold. The main difference is in the way the elimination signal acts on Q-learning. In their case, the elimination signal does not change the Q-values but modifies the action set directly.

alshiekh2018safe define the notion of shielding, which is similar to our notion of feedback: the simulator rejects potentially harmful actions. To learn from this process, the agent outputs a set of actions ordered by preference, and the simulator picks the best allowed action.

Learning when the environment takes over

orseau2016safely design agents that do not take feedback from the environment into account. For example, for an agent operating in real time, it may be necessary for a human operator to prevent the execution of a harmful sequence of actions and to lead the agent into a safer situation. However, if the learning agent expects to receive rewards from this sequence, it may learn in the long run to avoid such interruptions, for example by disabling the off-button. Under this setup, they showed that Q-learning could be interrupted safely, supporting the hypothesis that Q-learning does not integrate the feedback signal.

Large Discrete Action Space

A benefit of our method is that it simplifies the search in policy space when using larger action spaces. Dealing with large discrete action spaces in Deep Reinforcement Learning was studied by dulac2015deep, who use an action embedding in a continuous space and map it to a discrete action. Recent methods build upon this by also learning the action embedding instead of using a pre-computed one (chandak2019learning; tennenholtz2019natural; chen2019learning; adolphs2019ledeepchef). Another body of literature explores how to reduce combinatorial action spaces (he2016combi; he2016natural; dulac2012fast).

7 Conclusion

In this paper, we proposed a frontier loss, combined with a classification network, which nudges the Q-values of rejected actions below the Q-values of valid actions. We demonstrated its effectiveness on two simple benchmarks: a visual grid world and a TextWorld domain. The frontier loss reduces the number of calls to rejected actions and guides early exploration, helping Deep Q-learning collect more rewards.

Future Work

Generalizing the frontier loss to continuous action spaces would be key to using this type of algorithm in robotics and in more realistic settings. Combining the loss with action embeddings could allow generalization to unseen actions. For example, after learning that "Take sword" is rejected, the algorithm should not consider "Grab sword" either.


The authors would like to acknowledge the stimulating research environment of the SequeL Team. Special thanks to Florian Strub and Edouard Leurent for fruitful discussions. We acknowledge the following agencies for research funding and computing support: Project BabyRobot (H2020-ICT-24-2015, grant agreement no. 687831), and CPER Nord-Pas de Calais/FEDER DATA Advanced data science and technologies 2015-2020.


Appendix A: Training details

A.1 Minigrid network and training

Convolution net, 3 layer, 16, 32, 64
Kernel size : 2,2,2
pooling : 2 on the first layer
FC 64 -> actions
1e-5 learning rate, decayed to 1e-7
weight decay 1e-4
replay buffer size 10000
target update every 10’000 steps
classif learning rate : 1e-4
sigmoid ceil classif : 0.9 (don’t consider an action as valid if the sigmoid doesn’t exceed this value)

A.2 TextWorld network and training

word embedding size : 128
inventory rnn size : 256
description rnn size : 256
obs rnn size : 256
fc hidden : 350
1e-3 learning rate
weight decay 1e-4
replay buffer size 10000
target update every 2’000 steps
classif learning rate : 1e-4
sigmoid ceil classif : 0.9 (don’t consider an action as valid if the sigmoid doesn’t exceed this value)

Appendix B: Visualisation of Q-values

Figure 6: Q-values after 150 episodes of training (30’000 env steps). Left: frontier loss. Right: DQN.
Figure 7: Q-values after 100’000 env steps. Left: frontier loss. Right: DQN.