Tactics of Adversarial Attack on Deep Reinforcement Learning Agents
We introduce two tactics, namely the strategically-timed attack and the enchanting attack, to attack reinforcement learning agents trained by deep reinforcement learning algorithms using adversarial examples. In the strategically-timed attack, the adversary aims at minimizing the agent's reward by only attacking the agent at a small subset of time steps in an episode. Limiting the attack activity to this subset helps prevent detection of the attack by the agent. We propose a novel method to determine when an adversarial example should be crafted and applied. In the enchanting attack, the adversary aims at luring the agent to a designated target state. This is achieved by combining a generative model and a planning algorithm: while the generative model predicts the future states, the planning algorithm generates a preferred sequence of actions for luring the agent. A sequence of adversarial examples is then crafted to lure the agent to take the preferred sequence of actions. We apply the proposed tactics to agents trained by the state-of-the-art deep reinforcement learning algorithms, including DQN and A3C. In 5 Atari games, our strategically-timed attack reduces the reward as much as the uniform attack (i.e., attacking at every time step) does, while attacking the agent 4 times less often. Our enchanting attack lures the agent toward designated target states with a more than 70% success rate. Example videos are available at http://yenchenlin.me/adversarial_attack_RL/.
1 Introduction
Deep neural networks (DNNs), which can extract hierarchical distributed representations from signals, are established as the de facto tool for pattern recognition, particularly for supervised learning. We, as a generation, have witnessed a trend of fast adoption of DNNs in various commercial systems performing image recognition [krizhevsky2012imagenet], speech recognition [hannun2014deep], and natural language processing [sutskever2014sequence] tasks. Recently, DNNs have also started to play a central role in reinforcement learning (RL)—a field of machine learning research where the goal is to train an agent to interact with the environment for maximizing its reward. The community has realized that DNNs are ideal function approximators for classical RL algorithms, because DNNs can extract reliable patterns from signals for constructing a more informed action determination process. For example, [mnih:human] use a DNN to model the action–value function in the Q-learning algorithm, and [mnih:asynchronous] use a DNN to directly model the policy. Reinforcement learning research powered by DNNs is generally referred to as deep reinforcement learning (Deep RL).
However, a constant question lingers while we enjoy using DNNs for function approximation in RL. Specifically, since DNNs are known to be vulnerable to the adversarial example attack [szegedy:intriguing], as a deep RL agent inherits the pattern recognition power from a DNN, does it also inherit its vulnerability to the adversarial examples? We believe the answer is yes and provide empirical evidence in the paper.
Adversarial attack on deep RL agents is different from adversarial attack on classification systems in several ways. Firstly, an RL agent interacts with the environment through a sequence of actions where each action changes the state of the environment. What the agent receives is a sequence of correlated observations. For an episode of $L$ steps, an adversary can determine whether to craft an adversarial example to attack the agent at each time step (i.e., there are $2^L$ choices). Secondly, an adversary to deep RL agents has different goals, such as reducing the final rewards of agents or maliciously luring agents to dangerous states, which is different from an adversary to classification systems that aims at lowering classification accuracy. In this paper, we focus on studying adversarial attacks specific to deep RL agents. We argue this is important: as we consider deep RL agents for controlling machines, we need to understand the vulnerability of the agents because it would limit their use in mission-critical tasks such as autonomous driving. Based on [kurakin:adversarial], which showed that adversarial examples also exist in the real world, an adversary could add maliciously-placed paint to the surface of a stop sign to confuse an autonomous car. How could we fully trust deep RL agents if their vulnerability to adversarial attacks is not fully understood and addressed?
In a contemporary work, [sandy:adversarial] proposes an adversarial attack tactic where the adversary attacks a deep RL agent at every time step in an episode. We refer to such a tactic as the uniform attack and argue it is preliminary. First, the uniform attack ignores the fact that the observations are correlated. Moreover, the spirit of adversarial attack is to apply a minimal perturbation to the observation to avoid detection. If the adversary perturbs the observation at every time instance, it is more likely to be detected. A more sophisticated strategy would be to attack at selective time steps. For example, as shown in Fig. 1, attacking the deep RL agent has no consequence when the ball is far away from the paddle. However, when the ball is close to the paddle, attacking the deep RL agent could cause it to drop the ball. Therefore, the adversarial attacks at different time instances are not equally effective. Based on this observation, we propose the strategically-timed attack, which takes into account the number of times an adversarial example is crafted and used. It intends to reduce the reward with as few adversarial examples as possible. An adversarial example is only used when the attack is expected to be effective. Our experiment results show that an adversary exercising the strategically-timed attack tactic can reduce the reward of state-of-the-art deep RL agents while attacking four times less often than an adversary exercising the uniform attack tactic.
In addition, we propose the enchanting attack for maliciously luring a deep RL agent to a certain state. While the strategically-timed attack aims at reducing the reward of a deep RL agent, the enchanting attack aims at misguiding the agent to a specified state. The enchanting attack could, for example, be used to mislead a self-driving car controlled by a deep RL agent into hitting a certain obstacle. We implement the enchanting attack using a planning algorithm and a deep generative model. To the best of our knowledge, this is the first planning-based adversarial attack on a deep RL agent. Our experiment results show that the enchanting attack has a more than 70% success rate in attacking state-of-the-art deep RL agents.
We apply our adversarial attack to the agents trained by state-of-the-art deep RL algorithms including A3C [mnih:asynchronous] and DQN [mnih:human] on 5 Atari games. We provide examples to evaluate the effectiveness of our attacks. We also compare the robustness of the agents trained by the A3C and DQN algorithms to these adversarial attacks. The contributions of the paper are summarized below:
We study adversarial example attacks on deep RL agents trained by state-of-the-art deep RL algorithms including A3C and DQN.
We propose the strategically-timed attack aiming at attacking a deep RL agent at critical moments.
We propose the enchanting attack (the first planning-based adversarial attack) aiming at maliciously luring an agent to a certain state.
We conduct extensive experiments to evaluate the vulnerability of deep RL agents to the two attacks.
2 Related Work
Following [szegedy:intriguing], several adversarial example generation methods were proposed for attacking DNNs. Most of these methods generated an adversarial example via seeking a minimal perturbation of an image that can confuse the classifier (e.g., [goodfellow:explaining, kurakin:adversarial]). [moosavi:deep] first estimated linear decision boundaries between classes of a DNN in the image space and iteratively shifted an image toward the closest of these boundaries for crafting an adversarial example.
While the existence of adversarial examples to DNNs has been demonstrated several times on various supervised learning tasks, the existence of adversarial examples to deep RL agents has remained largely unexplored. In a contemporary paper, [sandy:adversarial] proposed the uniform attack, which attacks a deep RL agent with adversarial examples at every time step in an episode to reduce the reward of the agent. Our work is different from [sandy:adversarial] in several aspects: 1) we introduce a strategically-timed attack, which can reach the same effect as the uniform attack while attacking the agent four times less often on average; and 2) we introduce an enchanting attack tactic, the first planning-based adversarial attack, to misguide the agent toward a target state.
In terms of defending DNNs from adversarial attacks, several approaches were recently proposed. [goodfellow:explaining] augmented the training data with adversarial examples to improve DNNs' robustness to adversarial examples. [zheng:improving] proposed incorporating a stability term into the objective function, encouraging DNNs to generate similar outputs for various perturbed versions of an image. Defensive distillation was proposed in [papernot:distillation] for training a network to defend against both the L-BFGS attack in [szegedy:intriguing] and the fast gradient sign attack in [goodfellow:explaining]. Interestingly, as anti-adversarial-attack approaches were proposed, stronger adversarial attack approaches also emerged. [carlini-wagner:towards] recently introduced a way to construct adversarial examples that are immune to various anti-adversarial-attack methods, including defensive distillation. A study in [rozsa:towards] showed that more accurate models tend to be more robust to adversarial examples, while adversarial examples that can fool a more accurate model can also fool a less accurate model. As the study of adversarial attacks on deep RL agents is still in its infancy, we are unaware of earlier works on defending deep RL agents against adversarial attacks.
3 Adversarial Attacks
In this section, we will first review the adversarial example attack to DNN-based classification systems. We will then generalize the attack to deep RL agents and introduce our strategically-timed and enchanting attacks.
3.1 Adversarial Attack on DNN-based Classification Systems
Let $x$ be an image and $f$ be a DNN. An adversarial example $x + \delta$ to the DNN can be crafted by solving the following optimization problem:

$$\min_{\delta} D(x, x + \delta) \quad \text{subject to} \quad f(x + \delta) \neq f(x), \qquad (1)$$

where $D$ is an image similarity metric. In words, it looks for a minimal perturbation $\delta$ of an image $x$ that can change the class assignment of the DNN to the image.
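As a minimal numerical sketch of what such a perturbation looks like, the snippet below applies a single fast-gradient-sign step to a toy linear "DNN" instead of solving the constrained problem exactly; the weights `W`, input `x`, and budget `epsilon` are all illustrative, not from the paper.

```python
import numpy as np

def predict(W, x):
    """Class assignment of a toy linear classifier f(x) = argmax(W @ x)."""
    return int(np.argmax(W @ x))

def fgsm_perturb(W, x, epsilon):
    """One fast-gradient-sign step against the currently predicted class."""
    y = predict(W, x)
    logits = W @ x
    p = np.exp(logits - logits.max())
    p /= p.sum()
    # Gradient of the softmax cross-entropy loss w.r.t. the input x.
    grad = W.T @ (p - np.eye(len(p))[y])
    # Ascend the loss, staying within an L-infinity ball and valid pixel range.
    return np.clip(x + epsilon * np.sign(grad), 0.0, 1.0)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))
x = rng.uniform(size=8)
x_adv = fgsm_perturb(W, x, epsilon=0.3)
print(predict(W, x), predict(W, x_adv))
```

The constrained formulation above instead searches for the smallest such perturbation; single-step methods like this only approximate it with a fixed budget.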
An RL agent learns to interact with the environment through the reward signal. At each time step, it performs an action based on the observation of the environment in order to maximize the accumulated future reward. The action determination is through a policy $\pi$, which maps an observation to an action. Let the current time step be $t$; the goal of an RL algorithm is then to learn a policy that maximizes the accumulated future reward $R_t = \sum_{t'=t}^{L} r_{t'}$, where $r_{t'}$ is the reward at step $t'$ and $L$ is the length of an episode.
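As a concrete illustration of the return the adversary seeks to reduce, the snippet below computes $R_t$ for a made-up reward sequence (the values are illustrative only):

```python
def accumulated_reward(rewards, t):
    """Return R_t = sum of rewards from time step t to the end of the episode."""
    return sum(rewards[t:])

rewards = [0, 0, 1, 0, 1]   # r_0 .. r_4 for a 5-step episode
print(accumulated_reward(rewards, 0))  # -> 2
print(accumulated_reward(rewards, 3))  # -> 1
```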
In a deep RL algorithm, the policy is modeled through a DNN. An adversary can attack an agent trained by the deep RL algorithm by perturbing the observations (through crafting an adversarial example) to make the agent take non-preferred actions that can result in reduction of the accumulated future rewards.
3.2 Adversarial Attacks on RL
In a recent paper, [sandy:adversarial] propose the uniform attack tactic where the adversary attacks a deep RL agent at every time step, by perturbing each image the agent observes. The perturbation to an image is computed by using the fast gradient sign method [goodfellow:explaining]. The uniform attack tactic is regarded as a direct extension of the adversarial attack on a DNN-based classification system, since the adversarial example at each time step is computed independently of the adversarial examples at other time steps.
It does not consider several unique aspects of the RL problem. For example, during learning, an RL agent is never told which actions to take but instead discovers which actions yield the most reward. This is in contrast to the classification problem, where each image has a ground truth class. Moreover, an adversarial attack on a DNN is considered a success if it makes the DNN output any wrong class, but the success of an adversarial attack on an RL agent is measured by the amount of reward that the adversary takes away from the agent. Instead of perturbing the image to make the agent take any non-optimal action, we would like to find a perturbation that makes the agent take the action that reduces the reward the most. Also, because the reward signal in many RL problems is sparse, an adversary need not attack the RL agent at every time step. Our strategically-timed attack tactic described in Section 3.3 leverages these unique characteristics to attack deep RL agents.
Another unique characteristic of the RL problem is that each action taken by the agent influences its future observations. Therefore, an adversary could plan a sequence of adversarial examples to maliciously lure the agent toward a certain state that can lead to a catastrophic outcome. Our enchanting attack tactic described in Section 3.4 leverages this characteristic to attack RL agents.
3.3 Strategically-Timed Attack
In an episode, an RL agent observes a sequence of states $s_1, \dots, s_L$. Instead of attacking at every time step in an episode, the strategically-timed attack selects a subset of time steps to attack the agent. Let $\delta_1, \dots, \delta_L$ be a sequence of perturbations and $R_1$ be the expected return at the first time step. We can formulate the above intuition as the following optimization problem:

$$\min_{b_1, \dots, b_L,\ \delta_1, \dots, \delta_L} R_1(\bar{s}_1, \dots, \bar{s}_L), \quad \bar{s}_t = s_t + b_t \delta_t, \quad b_t \in \{0, 1\}, \quad \sum_t b_t \leq \Gamma, \qquad (2)$$

The binary variables $b_1, \dots, b_L$ denote when an adversarial example is applied. If $b_t = 1$, the perturbation $\delta_t$ is applied; otherwise, we do not alter the state. The total number of attacks is limited by the constant $\Gamma$. In words, the adversary minimizes the expected accumulated reward by strategically attacking at fewer than $\Gamma$ time steps.
The optimization problem in (2) is a mixed integer programming problem, which is difficult to solve. Moreover, in an RL problem, the observation at time step $t$ depends on all the previous observations, which makes the development of a solver to (2) even more challenging since the problem size grows exponentially with $L$. In order to study adversarial attacks on deep RL agents, we bypass these limitations and propose a heuristic algorithm to compute $\{b_t\}$ (solving the when-to-attack problem) and $\{\delta_t\}$ (solving the how-to-attack problem), respectively. In the following, we first discuss our solution to the when-to-attack problem. We then discuss our solution to the how-to-attack problem.
3.3.1 When to attack
We introduce a relative action preference function $c$ for solving the when-to-attack problem. The function $c$ computes the preference of the agent in taking the most preferred action over the least preferred action at the current state (similar to [amir:action-gap]). The degree of preference for an action depends on the DNN policy. A large $c$ value implies that the agent strongly prefers one action over the other. In the case of Pong, when the ball is about to drop from the top of the screen (see Fig. 1), a well-trained RL agent would strongly prefer an up action over a down action. But when the ball is far away from the paddle (see Fig. 1), the agent has no preference for any action, resulting in a small $c$ value. We describe how to design the relative action preference function for attacking the agents trained by the A3C and DQN algorithms below.
For policy gradient-based methods such as the A3C algorithm, if the action distribution of a well-trained agent is uniform at state $s_t$, it means that taking any action is equally good. But when an agent strongly prefers a specific action (i.e., the action has a relatively high probability), it means that it is critical to perform that action; otherwise the accumulated reward will be reduced. Based on this intuition, we define the function $c$ as

$$c(s_t) = \max_{a} \pi(s_t, a) - \min_{a} \pi(s_t, a), \qquad (3)$$

where $\pi$ is the policy network, which maps a state-action pair to a probability representing the likelihood that the action is chosen. In our strategically-timed attack, the adversary attacks the deep RL agent at time step $t$ when the relative action preference function $c$ has a value greater than a threshold parameter $\beta$. In other words, $b_t = 1$ if and only if $c(s_t) > \beta$. We note the parameter $\beta$ controls how often the adversary attacks the RL agent and is related to $\Gamma$.
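The when-to-attack rule can be sketched in a few lines; the probability vectors and the threshold `beta` below are illustrative, not values from the paper.

```python
import numpy as np

def action_preference(policy_probs):
    """c(s_t): gap between the most and least preferred action probability."""
    p = np.asarray(policy_probs, dtype=float)
    return float(p.max() - p.min())

def should_attack(policy_probs, beta):
    """b_t = 1 iff c(s_t) > beta; beta trades off the attack rate."""
    return action_preference(policy_probs) > beta

# Near-uniform preferences (ball far from the paddle): do not attack.
print(should_attack([0.26, 0.25, 0.24, 0.25], beta=0.5))  # -> False
# Strong preference (ball about to be dropped): attack now.
print(should_attack([0.90, 0.04, 0.03, 0.03], beta=0.5))  # -> True
```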
For value-based methods such as DQN, the same intuition applies. We convert the computed Q-values of actions into a probability distribution over actions using the softmax function with temperature constant $T$ (similar to [sandy:adversarial]):

$$c(s_t) = \max_{a} \frac{e^{Q(s_t, a)/T}}{\sum_{a'} e^{Q(s_t, a')/T}} - \min_{a} \frac{e^{Q(s_t, a)/T}}{\sum_{a'} e^{Q(s_t, a')/T}}. \qquad (4)$$
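A corresponding sketch for a value-based agent follows, with illustrative Q-values; note that a higher temperature flattens the distribution and shrinks the preference value.

```python
import numpy as np

def softmax(q, T):
    """Temperature-T softmax over a vector of Q-values."""
    z = np.exp((np.asarray(q, dtype=float) - np.max(q)) / T)
    return z / z.sum()

def dqn_preference(q_values, T=1.0):
    """c(s_t) computed from Q-values via a temperature-T softmax."""
    p = softmax(q_values, T)
    return float(p.max() - p.min())

q = [3.2, 0.1, 0.2, 0.1]   # the agent strongly prefers action 0
print(dqn_preference(q, T=1.0) > dqn_preference(q, T=10.0))  # -> True
```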
3.3.2 How to attack
To craft an adversarial example at time step $t$, we search for a perturbation $\delta_t$ to be added to the observation that can change the preferred action of the agent from the originally (before applying the perturbation) most preferred one to the originally least preferred one. We use the attack method introduced in [carlini-wagner:towards], where we treat the least-preferred action as the misclassification target (see Sec. 4.1 for details). This approach allows us to leverage the output of a trained deep RL agent as a cue to craft effective adversarial examples for reducing the accumulated reward.
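A minimal sketch of this targeted step on a toy softmax policy follows; it substitutes a simple iterative signed-gradient method for the Carlini-Wagner optimization actually used, and `W`, `s`, and all constants are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def targeted_perturb(W, s, eps=0.05, steps=50, bound=0.5):
    """Iteratively push a toy policy pi(s) = softmax(W @ s) toward its
    originally least-preferred action (the misclassification target)."""
    target = int(np.argmin(softmax(W @ s)))      # least-preferred action
    delta = np.zeros_like(s)
    for _ in range(steps):
        p = softmax(W @ (s + delta))
        # Gradient of the cross-entropy loss toward the target action.
        grad = W.T @ (p - np.eye(len(p))[target])
        delta = np.clip(delta - eps * np.sign(grad), -bound, bound)
    return delta, target

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 16))     # toy policy logits: W @ s
s = rng.uniform(size=16)
delta, target = targeted_perturb(W, s)
print(softmax(W @ (s + delta))[target] > softmax(W @ s)[target])
```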
3.4 Enchanting Attack
The goal of the enchanting attack is to lure the deep RL agent from the current state $s_t$ at time step $t$ to a specified target state $s_g$ after $H$ steps. The adversary needs to craft a series of adversarial examples $s_{t+1} + \delta_{t+1}, \dots, s_{t+H} + \delta_{t+H}$ for this attack. The enchanting attack is therefore more challenging than the strategically-timed attack.
We break this challenging task into two subtasks. In the first subtask, we assume that we have full control of the agent to take arbitrary actions at each step. Hence, the task is reduced to planning a sequence of actions for reaching the target state $s_g$ from the current state $s_t$. In the second subtask, we craft an adversarial example $s_t + \delta_t$ to lure the agent into taking the first action of the planned action sequence using the method introduced in [carlini-wagner:towards]. After the agent observes the adversarial example and takes the first action planned by the adversary, the environment returns a new state $s_{t+1}$. We progressively craft the subsequent adversarial examples, one at a time, using the same procedure (Fig. 2) to lure the agent from state $s_{t+1}$ toward the target state $s_g$. Next, we describe an on-line planning algorithm, which makes use of a next-frame prediction model, for generating the planned action sequence.
3.4.1 Future state prediction and evaluation
We train a video prediction model $M$ based on [oh:action], which uses a generative model to predict a future video frame conditioned on a sequence of actions:

$$s_{t+H}^{M} = M(s_t, a_{t:t+H-1}), \qquad (5)$$

where $a_{t:t+H-1} = \{a_t, \dots, a_{t+H-1}\}$ is the given sequence of future actions beginning at step $t$, $s_t$ is the current state, and $s_{t+H}^{M}$ is the predicted future state. For more details about the video prediction model, please refer to the original paper.
The series of actions $a_{t:t+H-1}$ takes the agent to the predicted state $s_{t+H}^{M}$. Since the goal of the enchanting attack is to reach the target state $s_g$, we can evaluate the success of the attack based on the distance between $s_{t+H}^{M}$ and $s_g$, which is given by $D(s_g, s_{t+H}^{M})$. The distance is realized using the $L_2$-norm in our experiments. We note that other metrics can be applied as well, and that the state is given by the image observed by the agent.
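A sketch of this evaluation step follows; states are plain arrays here, and the 84x84 shape matches the Atari observation size used in our setup.

```python
import numpy as np

def plan_cost(predicted_state, target_state):
    """D(s_g, s^M_{t+H}) realized as the L2 distance between observed images."""
    diff = (np.asarray(predicted_state, dtype=float)
            - np.asarray(target_state, dtype=float))
    return float(np.linalg.norm(diff))

s_goal = np.ones((84, 84))
s_pred = np.zeros((84, 84))
print(plan_cost(s_pred, s_goal))  # -> 84.0 (sqrt(84 * 84 * 1^2))
```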
3.4.2 Sampling-based action planning
We use a sampling-based cross-entropy method [rubinstein:cross] to compute a sequence of actions to steer the RL agent toward our target state. Specifically, we sample $N$ action sequences of length $H$ and rank each of them based on the distance between the final state obtained after performing the action sequence and the target state $s_g$. After that, we keep the best $K$ action sequences and refit our categorical distributions to them. The values of $N$, $K$, and $H$ are fixed hyper-parameters in our experiments.
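The planning loop above can be sketched on a toy one-dimensional "prediction model" (the real model $M$ is the learned video predictor; the state here is an integer position moved by actions in {-1, +1}, and the values of `N`, `K`, `H`, and the iteration count are illustrative, not those used in the experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_final_state(s0, actions):
    """Stand-in for the learned prediction model M: a 1-D random walk."""
    return s0 + int(np.sum(actions))

def cem_plan(s0, s_target, H=8, N=200, K=20, iters=5):
    probs = np.full(H, 0.5)                      # P(action = +1) at each step
    best_seq, best_dist = None, float("inf")
    for _ in range(iters):
        samples = (rng.random((N, H)) < probs).astype(int)  # N binary draws
        acts = 2 * samples - 1                              # map {0,1} -> {-1,+1}
        dists = np.array([abs(predict_final_state(s0, a) - s_target)
                          for a in acts])
        order = np.argsort(dists)
        if dists[order[0]] < best_dist:                     # track best plan
            best_dist, best_seq = int(dists[order[0]]), acts[order[0]]
        probs = samples[order[:K]].mean(axis=0)             # refit to K elites
    return best_seq, best_dist

plan, dist = cem_plan(s0=0, s_target=6)
print(len(plan), dist)
```

The returned plan is the sampled action sequence whose predicted final state is closest to the target, mirroring how the planner's best sequence is taken as the plan.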
At the end of the last iteration, we take the sampled action sequence that results in a final state closest to our target state as our plan. Then, we craft an adversarial example with the first planned action as the target action using the method introduced in [carlini-wagner:towards]. Instead of directly crafting the next adversarial example with the second planned action as the target, we plan for another enchanting attack starting at the newly observed state $s_{t+1}$, in order to be robust to potential failures of the previous attack.
We note that the state-transition model is different from the policy of the deep RL agent. We use the state-transition model to propose a sequence of actions that we want the deep RL agent to follow. We also note that both the state-transition model and the future frame prediction model are learned without assuming any information from the RL agent.
4 Experiments
We evaluated our tactics of adversarial attack on deep RL agents on 5 different Atari 2600 games (i.e., MsPacman, Pong, Seaquest, Qbert, and ChopperCommand) using OpenAI Gym [brockman:gym]. These games represent a balanced collection: deep RL agents achieve above-human performance when playing Pong and below-human performance when playing Ms. Pacman. We discuss our experimental setup and results in detail below. Our implementation will be released.
4.1 Experimental Setup
For each game, the deep RL agents were trained using state-of-the-art deep RL algorithms, namely the A3C and DQN algorithms. For the agents trained by the A3C algorithm, we used the same pre-processing steps and neural network architecture as in [mnih:asynchronous]. For the agents trained by the DQN algorithm, we also used the same network architecture for the Q-function as in the original paper [mnih:human]. The input to the neural network at time $t$ was the concatenation of the last 4 images, each resized to $84 \times 84$; the pixel values were rescaled to $[0, 1]$. The output of the policy was a distribution over possible actions for A3C, and an estimate of Q-values for DQN.
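The preprocessing pipeline can be sketched as follows, assuming the standard Atari setup described above (grayscale frames resized to 84x84, rescaled to [0, 1], and the last 4 frames stacked); the strided "resize" below is a crude stand-in for proper interpolation.

```python
import numpy as np

def preprocess(frame):
    """frame: (210, 160) uint8 Atari screen -> (84, 84) float32 in [0, 1]."""
    h_idx = np.linspace(0, frame.shape[0] - 1, 84).astype(int)
    w_idx = np.linspace(0, frame.shape[1] - 1, 84).astype(int)
    small = frame[np.ix_(h_idx, w_idx)]          # subsample rows and columns
    return small.astype(np.float32) / 255.0      # rescale pixel values

def stack_last4(frames):
    """Stack the 4 most recent preprocessed frames into one network input."""
    return np.stack([preprocess(f) for f in frames[-4:]], axis=0)

frames = [np.random.randint(0, 256, (210, 160), dtype=np.uint8)
          for _ in range(5)]
obs = stack_last4(frames)
print(obs.shape, obs.min() >= 0.0, obs.max() <= 1.0)  # -> (4, 84, 84) True True
```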
Although several existing methods can be used to craft an adversarial example (e.g., the fast gradient sign method [goodfellow:explaining] and the Jacobian-based saliency map attack [papernot:limitations]), anti-adversarial-attack measures were also discovered to limit their impact [goodfellow:explaining, papernot:distillation]. We adopted the adversarial example crafting method proposed by [carlini-wagner:towards], which can break several existing anti-adversarial-attack methods. Specifically, it crafts an adversarial example by approximately optimizing (1), where the image similarity metric $D$ was realized by a vector norm on the perturbation. We early stopped the optimizer when $D(x, x + \delta) \leq \epsilon$, where $\epsilon$ is a small fixed value. The value of the temperature $T$ in Equation (4) was also fixed in the experiments.
4.2 Strategically-Timed Attack
For each game and for the agents trained by the DQN and A3C algorithms, we launched the strategically-timed attack using different $\beta$ values. Each $\beta$ value rendered a different attack rate, quantifying how often an adversary attacked an RL agent in an episode. We computed the rewards collected by the agents under different attack rates. The results are shown in Fig. 3, where the $y$-axis is the accumulated reward and the $x$-axis is the average portion of time steps in an episode at which an adversary attacks the agent (i.e., the attack rate). We show the lowest attack rate at which the reward reaches the reward of the uniform attack. From the figure, we found that on average the strategically-timed attack can reach the same effect as the uniform attack by attacking only about 25% of the time steps in an episode. We also found that an agent trained using the DQN algorithm was more vulnerable than an agent trained with the A3C algorithm in most games, with the exception of Pong. Since the A3C algorithm is known to perform better than the DQN algorithm in the Atari games, this result suggests that a stronger deep RL agent may be more robust to adversarial attack. This finding, to some extent, echoes the finding in [rozsa:towards], which suggested that a more accurate DNN-based recognition system is more robust to adversarial examples.
4.3 Enchanting Attack
The goal of the enchanting attack is to maliciously lure the agent toward a target state. In order to avoid the bias of defining target states manually, we synthesized target states randomly. Firstly, we let the agent apply its policy for $t$ steps to reach an initial state $s_t$ and saved a snapshot of the game at this state. Secondly, we randomly sampled a sequence of actions of length $H$ and considered the state reached after performing these actions as a synthesized target state $s_g$. After recording the target state, we restored the snapshot and ran the enchanting attack on the agent, then compared the normalized Euclidean distance between the target state $s_g$ and the final reached state, where the normalization constant was given by the image resolution.
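This synthesis protocol can be sketched on a toy environment with snapshot/restore support; the `ToyEnv` class and its method names are hypothetical stand-ins for the emulator interface, not a real Gym API.

```python
import numpy as np

rng = np.random.default_rng(2)

class ToyEnv:
    """A tiny deterministic stand-in for an Atari emulator with snapshots."""
    def __init__(self):
        self.state = 0
    def step(self, action):          # action in {-1, +1}
        self.state += action
        return self.state
    def snapshot(self):
        return self.state
    def restore(self, snap):
        self.state = snap

def synthesize_target(env, policy, t, H):
    for _ in range(t):               # roll the agent's own policy for t steps
        env.step(policy(env.state))
    snap = env.snapshot()            # save s_t
    actions = rng.choice([-1, 1], size=H)
    for a in actions:                # a random rollout defines s_g
        env.step(a)
    s_g = env.state
    env.restore(snap)                # rewind to s_t before running the attack
    return snap, s_g

env = ToyEnv()
s_t, s_g = synthesize_target(env, policy=lambda s: 1, t=5, H=4)
print(s_t, env.state == s_t)  # -> 5 True
```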
We considered an attack successful if the final state had a normalized Euclidean distance to the target state within a tolerance value of 1. To make sure the evaluation was not affected by the different stages of a game, we chose 10 initial time steps spread across the episode in proportion to the average length of the game episode, measured by letting the RL agents play 10 times. For each initial time step, we evaluated different values of $H$. Then, for each $H$, we computed the success rate (i.e., the number of times the adversary misguided the agent to reach the target state divided by the number of trials). We expected that a larger $H$ would correspond to a more difficult enchanting attack problem. In Fig. 4, we show the success rate ($y$-axis) as a function of $H$ in the 5 games. We found that the agents trained by both the A3C and DQN algorithms were enchanted. For the largest $H$ we tested, the success rate was more than 70% in several games (all except Seaquest and ChopperCommand). The reason the enchanting attack was less effective on Seaquest and ChopperCommand is that both games include multiple random enemies, so our trained video prediction models were less accurate.
5 Conclusion
We introduced two novel tactics of adversarial attack on deep RL agents: the strategically-timed attack and the enchanting attack. In five Atari games, we showed that the accumulated rewards collected by the agents trained using the DQN and A3C algorithms were significantly reduced when they were attacked by the strategically-timed attack, even when only about 25% of the time steps in an episode were attacked. Our enchanting attack, combining video prediction and planning, can lure a deep RL agent toward maliciously defined target states in $H$ steps with a more than 70% success rate in 3 out of 5 games. In the future, we plan to develop a more sophisticated strategically-timed attack method. We also plan to improve the video prediction accuracy of the generative model to improve the success rate of the enchanting attack on more complicated games. Another important direction of future work is developing defenses against adversarial attacks. Possible methods include augmenting training data with adversarial examples (as in [goodfellow:explaining]) or training a subnetwork to detect adversarial input at test time and deal with it properly.
Acknowledgments
We would like to thank the anonymous reviewers, Chun-Yi Lee, Jan Kautz, Bryan Catanzaro, and William Dally for their useful comments. We also thank MOST 105-2815-C-007-083-E and MediaTek for their support.