Mitigation of Policy Manipulation Attacks on Deep Q-Networks with Parameter-Space Noise
Abstract
Recent developments have established the vulnerability of deep reinforcement learning to policy manipulation attacks via intentionally perturbed inputs, known as adversarial examples. In this work, we propose a technique for mitigating such attacks based on the addition of noise to the parameter space of deep reinforcement learners during training. We experimentally verify the effect of parameter-space noise in reducing the transferability of adversarial examples, and demonstrate the promising performance of this technique in mitigating the impact of white-box and black-box attacks at both test and training times.
Keywords:
Deep Reinforcement Learning · Adversarial Attacks · Adversarial Examples · Mitigation · Parameter-Space Noise

Recent years have seen growing interest and advances in deep Reinforcement Learning (RL). By exploiting the superior feature extraction and processing capabilities of deep neural networks, deep RL enables the learning of direct mappings from raw observations of the environment to actions. This enhancement enables the application of classic RL approaches to high-dimensional and complex planning problems, and has been shown to achieve human-level or superhuman performance in various cases, such as learning to play the game of Go [21], playing Atari games [14], robotic manipulation [10], and autonomous navigation of aerial [25] and ground [26] vehicles. While interest in deep RL solutions is extending into numerous domains such as intelligent transportation systems [1], finance [6], and critical infrastructure [15], ensuring the security and reliability of such solutions in adversarial conditions is only at its preliminary stages. Recently, Behzadan and Munir [3] reported the vulnerability of deep reinforcement learning algorithms to both test-time and training-time attacks using adversarial examples [9]. This work was followed by a number of further investigations (e.g., [11], [12]) verifying the fragility of deep RL agents to such attacks. Currently, only a few reports (e.g., [4], [13], [19]) concentrate on mitigation and countermeasures, and these are mostly focused on approaches based on adversarial training and prediction.
In this work, we aim to further the research on countering attacks on deep RL by proposing a potential mitigation technique based on employing parameter-space noise exploration during the training of deep RL agents. Recent reports in [20] and [8] demonstrate that the addition of adaptive noise to the parameters of deep RL architectures greatly enhances the exploration behavior and convergence speed of such algorithms. Contrary to classical exploration heuristics such as $\epsilon$-greedy [22], parameter-space noise is iteratively and adaptively applied to the parameters of the learning model, such as the weights of the neural network. Accordingly, we hypothesize that the randomness introduced via parameter noise not only enhances the discovery of more creative and robust policies, but also reduces the effect of white-box and black-box adversarial example attacks at both test time and training time.
To this end, we evaluate the performance of Deep Q-Network (DQN) models trained with parameter noise against the test-time and training-time adversarial example attacks introduced in [3]. The main contributions of this work are:

Proposal of parameter-space noise exploration as a mitigation technique against policy manipulation attacks at both test time and training time,

Development of an open-source platform for experimenting with adversarial example attacks on deep RL agents,

Experimental analysis of parameter-space noise for mitigation of test-time white-box and black-box attacks on DQN,

Experimental analysis of parameter-space noise for mitigation of training-time policy induction attacks on DQN.
The remainder of this paper is organized as follows: Section 1 reviews the relevant background on DQN, parameter-noise training via the NoisyNet approach, and adversarial examples. Section 2 describes the attack model adopted in this study. Section 3 details the experiment setup and presents the corresponding results. Section 4 concludes the paper with remarks on the obtained results.
1 Background
In this section, we present an overview of the fundamental concepts upon which this work is based. It must be noted that this overview is not meant to be comprehensive; interested readers may refer to the suggested references for further details.
1.1 RL and Deep Q-Networks
The generic RL problem can be formally modeled as a Markov Decision Process (MDP), described by the tuple $(S, A, R, P)$, where $S$ is the set of reachable states in the process, $A$ is the set of available actions, $R$ is the mapping of transitions to the immediate reward, and $P$ represents the transition probabilities. At any given timestep $t$, the MDP is at a state $s_t \in S$. The RL agent's choice of action at time $t$, $a_t \in A$, causes a transition from $s_t$ to a state $s_{t+1}$ according to the transition probability $P(s_t, a_t, s_{t+1})$. The agent receives a reward $r_t = R(s_t, a_t) \in \mathbb{R}$, where $\mathbb{R}$ denotes the set of real numbers, for choosing the action $a_t$ at state $s_t$.
Interactions of the agent with the MDP are determined by the policy $\pi$. When such interactions are deterministic, the policy $\pi : S \rightarrow A$ is a mapping between the states and their corresponding actions. A stochastic policy $\pi(s, a)$ represents the probability of optimality for implementing action $a$ at state $s$.
The objective of RL is to find the optimal policy $\pi^{*}$ that maximizes the cumulative reward over time at time $t$, denoted by the return function $\hat{R}_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}$, where $\gamma \in [0, 1)$ is the discount factor representing the diminishing worth of rewards obtained further in time, hence ensuring that $\hat{R}_t$ is bounded.
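As a concrete illustration of the return function, the discounted sum can be computed recursively from the identity $\hat{R}_t = r_t + \gamma \hat{R}_{t+1}$. The following Python sketch (our illustration, not part of the paper's implementation) applies this to a finite episode:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute R_t = sum_k gamma^k * r_{t+k} for a finite reward sequence."""
    ret = 0.0
    # Accumulate from the last reward backwards: R_t = r_t + gamma * R_{t+1}.
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret
```

Since $\gamma < 1$, rewards obtained further in the future contribute geometrically less to the return, which is what keeps $\hat{R}_t$ bounded for bounded rewards.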
One approach to this problem is to estimate the optimal value of each action, defined as the expected sum of future rewards when taking that action and following the optimal policy thereafter. The value of an action $a$ in a state $s$ is given by the action-value function $Q$ defined as:

$Q(s, a) = R(s, a) + \gamma \max_{a'} Q(s', a') \qquad (1)$

where $s'$ is the state that emerges as a result of action $a$, and $a'$ is a possible action in state $s'$. The optimal value given a policy $\pi$ is hence defined as $Q^{*}(s, a) = \max_{\pi} \mathbb{E}[\hat{R}_t \mid s_t = s, a_t = a, \pi]$, and the optimal policy is given by $\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)$.
The Q-learning method estimates the optimal action policies by using the Bellman equation as the iterative update of a value iteration technique. Practical implementations of Q-learning are commonly based on function approximation of the parametrized Q-function $Q(s, a; \theta) \approx Q^{*}(s, a)$. A common technique for approximating the parametrized non-linear Q-function is via neural network models whose weights correspond to the parameter vector $\theta$. Such neural networks, commonly referred to as Q-networks, are trained such that at every iteration $i$, the following loss function is minimized:
$L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)}\left[(y_i - Q(s, a; \theta_i))^2\right] \qquad (2)$

where $y_i = \mathbb{E}[r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a]$, and $\rho(s, a)$ is a probability distribution over states $s$ and actions $a$. This optimization problem is typically solved using computationally efficient techniques such as Stochastic Gradient Descent (SGD) [2].
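For a single sampled transition, the loss in Equation 2 reduces to the squared difference between the bootstrapped target $y$ and the current estimate $Q(s, a; \theta)$. A minimal NumPy sketch (the function names are ours, not from the paper or any DQN library):

```python
import numpy as np

def td_target(reward, next_q_values, gamma=0.99, terminal=False):
    """y = r + gamma * max_a' Q(s', a'); terminal states have no bootstrap term."""
    if terminal:
        return reward
    return reward + gamma * np.max(next_q_values)

def q_loss(q_sa, target):
    """Squared Bellman error (y - Q(s, a; theta))^2 for a single transition."""
    return (target - q_sa) ** 2
```

In practice the expectation over $\rho$ is approximated by averaging this per-transition loss over a sampled minibatch, and SGD follows its gradient with respect to $\theta$.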
Classical Q-networks introduce a number of major problems into the Q-learning process. First, the sequential processing of consecutive observations breaks the i.i.d. (Independent and Identically Distributed) requirement of training data, as successive samples are correlated. Furthermore, slight changes to Q-values lead to rapid changes in the policy estimated by the Q-network, thus enabling policy oscillations. Also, since the scale of rewards and Q-values is unknown, the gradients of Q-networks can be sufficiently large to render the backpropagation process unstable.
A Deep Q-network (DQN) [14] is a training algorithm designed to resolve these problems. To overcome the issue of correlation between consecutive observations, DQN employs a technique called experience replay: instead of training on successive observations, experience replay samples a random batch of previous observations stored in the replay memory to train on. As a result, the correlation between successive training samples is broken and the i.i.d. setting is re-established. In order to avoid oscillations, DQN fixes the parameters of a target network $\hat{Q}$, which represents the optimization target $y$. These parameters are then updated at regular intervals by adopting the current weights of the Q-network. The issue of instability in backpropagation is also solved in DQN by normalizing the reward values to the range $[-1, +1]$, thus preventing Q-values from becoming too large.
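The experience replay mechanism described above can be sketched as a fixed-capacity buffer with uniform random sampling. This is an illustrative simplification (the class and its interface are ours, not the implementation used in this work):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size memory of (s, a, r, s', done) transitions; uniform sampling
    breaks the correlation between consecutive observations."""
    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def add(self, transition):
        self.memory.append(transition)  # oldest transitions are evicted first

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```

Because `sample` draws uniformly from the whole memory, the minibatch used for each SGD step mixes transitions from many different episodes and timesteps, approximating the i.i.d. setting that Q-network training assumes.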
Mnih et al. [14] demonstrate the application of this new Q-network technique to end-to-end learning of Q-values in playing Atari games, based on observations of pixel values in the game environment. To capture movement in the game environment, Mnih et al. use stacks of 4 consecutive image frames as the input to the network. To train the network, a random batch is sampled from the previous observation tuples $(s_t, a_t, r_t, s_{t+1})$, where $r_t$ denotes the reward at time $t$. Each observation is then processed by 2 layers of convolutional neural networks to learn the features of the input images, which are then employed by feed-forward layers to approximate the Q-function. The target network $\hat{Q}$, with parameters $\theta^{-}$, is synchronized with the parameters of the original network at fixed intervals, i.e., at every $N$th iteration, $\theta^{-} = \theta$, and is kept fixed until the next synchronization. The target value for optimization of DQN thus becomes:
$y_t \equiv r_{t+1} + \gamma \max_{a'} \hat{Q}(s_{t+1}, a'; \theta^{-}) \qquad (3)$
Accordingly, the training process can be stated as:
$\min_{\theta} \; \mathbb{E}_{s_t, a_t}\left[(y_t - Q(s_t, a_t; \theta))^2\right] \qquad (4)$
As for the exploration mechanism, the original DQN employs $\epsilon$-greedy, which monotonically decreases the probability $\epsilon$ of taking random actions as the training progresses [22].
1.2 NoisyNets
Introduced by Fortunato et al. [8], NoisyNet is a type of neural network whose biases and weights are iteratively perturbed during training by a parametric function of noise. Such a neural network can be represented by $y = f_{\theta}(x)$, parametrized by the vector of noisy parameters $\theta \doteq \mu + \Sigma \odot \epsilon$, where $\zeta = (\mu, \Sigma)$ is a set of vectors representing learnable parameters, $\epsilon$ is a vector of zero-mean noise with fixed statistics, and $\odot$ denotes element-wise multiplication. In [8], the modified DQN algorithm is proposed as follows: first, $\epsilon$-greedy is omitted, and instead the value function is greedily optimized. Second, the fully connected layers of the value function are parametrized as a NoisyNet, whose parameter values are drawn from a noisy parameter distribution after every replay step. The noise distribution used in [8] is factorized Gaussian noise. During replay, the current NoisyNet parameter samples are held constant, while at the optimization of each action step, the parameters are re-sampled. The parametrized action-value function $Q(s, a; \epsilon, \zeta)$ can be treated as a random variable, and is employed accordingly in the optimization function. Further details of this approach and a similar proposal can be found in [8] and [20], respectively.
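The noisy parametrization $\theta = \mu + \Sigma \odot \epsilon$ with factorized Gaussian noise can be sketched for a single fully connected layer as follows. This is a NumPy illustration of the forward pass only; the initialization constants follow the conventions described in [8], but the class itself is our sketch, not the paper's implementation:

```python
import numpy as np

def _f(x):
    """Noise-shaping function for factorized noise: f(x) = sign(x) * sqrt(|x|)."""
    return np.sign(x) * np.sqrt(np.abs(x))

class NoisyLinear:
    """Forward pass of a noisy fully connected layer: the effective weights are
    mu + sigma * eps, with factorized Gaussian eps re-sampled on demand."""
    def __init__(self, in_dim, out_dim, sigma0=0.5, rng=None):
        self.rng = rng or np.random.default_rng(0)
        bound = 1.0 / np.sqrt(in_dim)
        self.w_mu = self.rng.uniform(-bound, bound, (out_dim, in_dim))
        self.b_mu = self.rng.uniform(-bound, bound, out_dim)
        self.w_sigma = np.full((out_dim, in_dim), sigma0 / np.sqrt(in_dim))
        self.b_sigma = np.full(out_dim, sigma0 / np.sqrt(in_dim))
        self.resample_noise()

    def resample_noise(self):
        # Factorized noise: one vector per input, one per output, combined
        # via outer product instead of sampling a full noise matrix.
        eps_in = _f(self.rng.standard_normal(self.w_mu.shape[1]))
        eps_out = _f(self.rng.standard_normal(self.w_mu.shape[0]))
        self.w_eps = np.outer(eps_out, eps_in)
        self.b_eps = eps_out

    def forward(self, x):
        w = self.w_mu + self.w_sigma * self.w_eps
        b = self.b_mu + self.b_sigma * self.b_eps
        return w @ x + b
```

Note that $(\mu, \Sigma)$ are trained by gradient descent along with the rest of the network, so the magnitude of the injected noise is itself learned; only the noise samples $\epsilon$ are drawn externally.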
1.3 Adversarial Examples
In [23], Szegedy et al. report an intriguing discovery: several machine learning models, including deep neural networks, are vulnerable to adversarial examples. That is, these machine learning models misclassify inputs that are only slightly different from correctly classified samples drawn from the data distribution. Furthermore, it was found [18] that a wide variety of models with different architectures trained on different subsets of the training data misclassify the same adversarial example.
This suggests that adversarial examples expose fundamental blind spots in machine learning algorithms. The issue can be stated as follows: consider a machine learning system $M$ and a benign input sample $C$ which is correctly classified by the machine learning system, i.e., $M(C) = y_{true}$. According to the report of Szegedy et al. [23] and many subsequent studies [18], it is possible to construct an adversarial example $C' = C + \delta$, which is perceptually indistinguishable from $C$, but is classified incorrectly, i.e., $M(C') \neq y_{true}$.
Adversarial examples are misclassified far more often than examples that have been perturbed by random noise, even if the magnitude of the noise is much larger than the magnitude of the adversarial perturbation [9]. According to the objective of adversaries, adversarial example attacks are generally classified into the following two categories:

Misclassification attacks, which aim to generate examples that are classified incorrectly by the target network,

Targeted attacks, whose goal is to generate samples that the target misclassifies into an arbitrary class designated by the attacker.
To generate such adversarial examples, several algorithms have been proposed, such as the Fast Gradient Sign Method (FGSM) by Goodfellow et al. [9], and the Jacobian Saliency Map Algorithm (JSMA) approach by Papernot et al. [18]. A grounding assumption in many of these crafting algorithms is that the attacker has complete knowledge of the target neural network, such as its architecture, weights, and other hyperparameters. In response, Papernot et al. [17] proposed the first black-box approach to generating adversarial examples. This method exploits the transferability of adversarial examples: an adversarial example generated for a neural network classifier applies to most other neural network classifiers that perform the same classification task, regardless of their architecture, parameters, and even the distribution of training data. Accordingly, the approach of [17] is based on generating a replica of the target network. To train this replica, the attacker creates and trains over a dataset composed of a mixture of samples obtained by observing the target's interaction with the environment, and synthetically generated input and label pairs. Once trained, any of the algorithms that require knowledge of the target network for crafting adversarial examples can be applied to the replica. Due to the transferability of adversarial examples, the perturbed data points generated for the replica network can induce misclassifications in many of the other networks that perform the same task.
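The FGSM crafting step mentioned above can be illustrated in a few lines: given the gradient of the loss with respect to the input, each input dimension is perturbed by $\epsilon$ in the direction of the gradient's sign, which also bounds the $L_\infty$ norm of the perturbation by $\epsilon$. The toy linear scorer below is our illustration, not an experiment from this work:

```python
import numpy as np

def fgsm(x, grad_wrt_x, eps):
    """Fast Gradient Sign Method: x' = x + eps * sign(dJ/dx)."""
    return x + eps * np.sign(grad_wrt_x)

# Toy example: for a linear scorer s(x) = w . x, the gradient of the score
# with respect to x is simply w, so FGSM shifts every input dimension by
# +/- eps according to the sign of the corresponding weight.
w = np.array([1.0, -2.0, 0.5])
x = np.zeros(3)
x_adv = fgsm(x, grad_wrt_x=w, eps=0.1)
```

The single-step, uniform-magnitude nature of FGSM is what makes it cheap enough to run inside a per-timestep attack loop against an RL agent.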
2 Attack Model
We consider an attacker whose goal is to perturb the optimality of actions taken by a DQN agent, either by perturbing the observations of the agent at test time, or by inducing an arbitrary policy on the target DQN at training time. In white-box attacks, the attacker has complete knowledge of the target. A black-box attacker, on the other hand, has no knowledge of the target's exact architecture and parameters, but is assumed to be capable of estimating them based on the conventions applied to the input type (e.g., image and video inputs may indicate a convolutional neural network, while speech and voice data point towards a recurrent neural network, etc.).
In this model, the attacker is assumed to have minimal a priori information about the target's model and parameters, such as the type and format of inputs to the DQN, as well as its reward function and an estimate of the frequency at which the network is updated. Furthermore, the attacker has no direct influence on the target's architecture and parameters, including its reward function, parameter noise, and optimization mechanism. The only element that the attacker can directly manipulate is the configuration of the environment observed by the target. For instance, in the case of DQN agents learning to play Atari games [14], the attacker may change the pixel values of the game's frames, but not the score. We assume that the attacker is capable of changing the state before it is observed by the target by predicting future states, through approaches such as having a quicker action speed than the target's sampling rate, or by introducing a delay between the generation of the new environment and its observation by the target.
To avoid detection, we impose an extra constraint on the attack such that the magnitude of the perturbation applied to each configuration must be smaller than a constant value denoted by $\epsilon$. Also, we do not limit the attacker's domain of perturbations.
As discussed in Section 1, the DQN framework of Mnih et al. [14] can be seen as consisting of two neural networks: one is the native Q-network, which performs the image processing and function approximation, and the other is the target network $\hat{Q}$, whose architecture and parameters are copies of the native network, sampled once every $N$ iterations. DQN is trained by optimizing the loss function of Equation 4 via SGD. Behzadan and Munir [3] demonstrated that the function approximators of DQN are also vulnerable to adversarial example attacks. In other words, the set of all possible inputs to the approximated function contains elements which cause the approximated functions to generate outputs that are different from the output of the original function.
Consequently, the attacker can manipulate the learning process of DQN by crafting states such that $\hat{Q}$ identifies an incorrect choice of optimal action at $s_{t+1}$. If the attacker is capable of crafting adversarial inputs $s'_t$ and $s'_{t+1}$ such that the value of Equation 4 is minimized for a specific action $a'$, then the policy learned by DQN at this timestep is optimized towards suggesting $a'$ as the optimal action given the state $s_t$. At every timestep of training its replica, the attacker observes the interaction of its target with the environment $(s_t, a_t, r_t, s_{t+1})$. If the resulting state $s_{t+1}$ is not terminal, the attacker then calculates the perturbation vector $\hat{\delta}_{t+1}$ for the next state such that $\hat{Q}(s_{t+1} + \hat{\delta}_{t+1}, a'; \theta^{-})$ generates its maximum when $a' = \pi^{*}_{adv}(s_{t+1})$, i.e., the maximum reward at the next state is obtained when the optimal action taken at that state is determined by the attacker's policy. The attacker then reveals the perturbed state to the target, and re-trains the replica based on the new state and action.
This is procedurally similar to the targeted misclassification attacks described in Section 1, which aim to find minimal perturbations to an input sample such that the classifier assigns the maximum likelihood to an incorrect target class. Therefore, the adversarial example crafting techniques developed for classifiers, such as FGSM, can be employed to obtain the perturbation vector $\hat{\delta}_{t+1}$.
Accordingly, Behzadan and Munir [3] divide this attack into the two phases of initialization and exploitation. The initialization phase implements processes that must be performed before the target begins interacting with the environment, which are:

Train a DQN based on the attacker's reward function to obtain the adversarial policy $\pi^{*}_{adv}$,

Create a replica of the target's DQN and initialize it with random parameters.
The exploitation phase implements the attack process of crafting adversarial inputs such that the target DQN performs actions dictated by $\pi^{*}_{adv}$. This phase constitutes an attack cycle, depicted in Figure 1. The cycle is initiated with the attacker's first observation of the environment, and runs in tandem with the target's operation.
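One iteration of the exploitation cycle, for a non-terminal state, can be sketched as follows. All callables here are hypothetical stand-ins for the components described above: a crafting routine such as FGSM run against the replica, the channel through which the perturbed state is revealed to the target, and the replica's own training update.

```python
import numpy as np

def attack_step(next_state, adversarial_action, craft, reveal, update_replica):
    """One pass of the training-time attack cycle: craft a perturbation that
    steers the target's TD update toward the adversarial action, reveal the
    perturbed state, and keep the replica in sync with the target."""
    delta = craft(next_state, adversarial_action)   # e.g. FGSM on the replica
    perturbed = next_state + delta
    reveal(perturbed)                               # the target observes s'_{t+1}
    update_replica(perturbed, adversarial_action)   # imitate the target's update
    return perturbed
```

Because the replica is re-trained on the same perturbed states and adversarial actions the target sees, it tracks the target's evolving parameters, which is what keeps the transferability assumption of the black-box attack viable over time.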
3 Experimental Verification
To evaluate the effectiveness of NoisyNet in mitigating adversarial example attacks, we study the performance of this architecture in comparison to the original DQN setup. Following the standard benchmarks of DQN, our experimental environments consist of 3 Atari 2600 games, namely Enduro, Assault, and Breakout. We train 4 models for each environment: 2 models based on the original DQN with $\epsilon$-greedy exploration, and 2 models based on the NoisyNet architecture. The neural network configuration of both models follows that of the original DQN proposal by Mnih et al. [14], while the parameter-noise configuration is based on the setup presented in [8].
We implemented the experimentation platform in TensorFlow, using OpenAI Gym [5] to emulate the game environments and Cleverhans [16] to craft the adversarial examples. Our DQN implementation is a modified version of the corresponding module in OpenAI Baselines [7], while the NoisyNet implementation is based on the algorithm described in [8]. We have published our platform at [24] as open source for further research in this area.
For the purposes of this study, we consider FGSM for crafting adversarial examples, with a fixed perturbation limit $\epsilon$. Similar to the work in [12], the initiation of attacks occurs after the learned Q-function begins converging towards the optimal value.
3.1 Testtime Attacks
Parameter-noise training in NoisyNet is expected to enhance the exploration behavior of the agent and hence facilitate learning more creative and accurate policies. Accordingly, we hypothesize that the action-value function learned in NoisyNet generalizes better than the original, and can be more resilient to non-targeted adversarial example attacks at test time. Similarly, the addition of random noise to the parameters of NoisyNet can potentially impede the transferability of adversarial examples, and hence enhance the resilience of NoisyNet to black-box attacks. To test this hypothesis, we compare the performance of the NoisyNet and DQN models under white-box and black-box attacks after a fixed number of training iterations.
Figure 3.1 presents the results of this experiment. It is observed that in all three environments, the impact of adversarial example perturbations on the performance of NoisyNet is less severe than on the original DQN, thereby verifying our general hypothesis. Furthermore, a comparison of performance under black-box attacks demonstrates significant improvements in NoisyNets, as depicted in all three cases. A preliminary interpretation of this observation is that the randomization of model parameters reduces the transferability of adversarial examples generated for a replicated model.
3.2 Trainingtime Attacks
In [3] and [12], the impact of training-time adversarial example attacks on policy learning is demonstrated. Similar to the case of test-time attacks, we hypothesize that the reduced transferability and enhanced generalization of NoisyNet can potentially provide greater resilience to black-box adversarial example attacks during training. To this end, we investigated the performance of NoisyNet and DQN under the training-time attack methodology described in Section 2 [3].
Figure 3.2 presents the results of this experiment. It can be seen that in all three environments, the performance of the original DQN consistently deteriorates under training-time attacks, as reported in [3] and [12]. On the other hand, while the performance of NoisyNet is also subject to deterioration, it demonstrates significantly stronger resilience to this attack, and in the case of Assault remains almost unaffected by adversarial perturbations. These results verify the original hypothesis, and hence the efficacy of parameter noise in mitigating the impact of training-time attacks.
4 Conclusion
Through experimental analysis, we investigated the effect of parameter noise on the mitigation of adversarial example attacks on Deep Q-Networks (DQN). Considering the reported enhancing effect of parameter noise on reinforcement learning and exploration, as well as the inherent randomization of such techniques, we demonstrated that, compared to the original DQN, noisy DQN architectures provide better resilience to adversarial perturbations at test time, and reduce susceptibility to the transferability of adversarial examples. Furthermore, we demonstrated that noisy DQN is significantly more resilient to black-box attacks at training time, and learns in a considerably more robust manner in comparison to plain DQN architectures. These results present a promising starting point for further experimental and analytical studies of employing parameter-space noise exploration to enhance the resilience and robustness of deep reinforcement learning.
References
 [1] Atallah, R.: The Next Generation Intelligent Transportation System: Connected, Safe and Green. Ph.D. thesis, Concordia University (2017)
 [2] Baird III, L.C., Moore, A.W.: Gradient descent for general reinforcement learning. In: Advances in neural information processing systems. pp. 968–974 (1999)
 [3] Behzadan, V., Munir, A.: Vulnerability of deep reinforcement learning to policy induction attacks. arXiv preprint arXiv:1701.04143 (2017)
 [4] Behzadan, V., Munir, A.: Whatever does not kill deep reinforcement learning, makes it stronger. arXiv preprint arXiv:1712.09344 (2017)
 [5] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv preprint arXiv:1606.01540 (2016)
 [6] Deng, Y., Bao, F., Kong, Y., Ren, Z., Dai, Q.: Deep direct reinforcement learning for financial signal representation and trading. IEEE transactions on neural networks and learning systems 28(3), 653–664 (2017)
 [7] Dhariwal, P., Hesse, C., Plappert, M., Radford, A., Schulman, J., Sidor, S., Wu, Y.: Openai baselines. https://github.com/openai/baselines (2017)
 [8] Fortunato, M., Azar, M.G., Piot, B., Menick, J., Osband, I., Graves, A., Mnih, V., Munos, R., Hassabis, D., Pietquin, O., et al.: Noisy networks for exploration. arXiv preprint arXiv:1706.10295 (2017)
 [9] Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014)
 [10] Gu, S., Holly, E., Lillicrap, T., Levine, S.: Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In: Robotics and Automation (ICRA), 2017 IEEE International Conference on. pp. 3389–3396. IEEE (2017)
 [11] Huang, S., Papernot, N., Goodfellow, I., Duan, Y., Abbeel, P.: Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284 (2017)
 [12] Kos, J., Song, D.: Delving into adversarial attacks on deep policies. arXiv preprint arXiv:1705.06452 (2017)
 [13] Lin, Y.C., Liu, M.Y., Sun, M., Huang, J.B.: Detecting adversarial attacks on neural network policies with visual foresight. arXiv preprint arXiv:1710.00814 (2017)
 [14] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., et al.: Humanlevel control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
 [15] Mohammadi, M., Al-Fuqaha, A., Guizani, M., Oh, J.S.: Semi-supervised deep reinforcement learning in support of IoT and smart city services. IEEE Internet of Things Journal (2017)
 [16] Papernot, N., Goodfellow, I., Sheatsley, R., Feinman, R., McDaniel, P.: cleverhans v1.0.0: an adversarial machine learning library. arXiv preprint arXiv:1610.00768 (2016)
 [17] Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z.B., Swami, A.: Practical black-box attacks against deep learning systems using adversarial examples. arXiv preprint arXiv:1602.02697 (2016)
 [18] Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik, Z.B., Swami, A.: The limitations of deep learning in adversarial settings. In: Security and Privacy (EuroS&P), 2016 IEEE European Symposium on. pp. 372–387. IEEE (2016)
 [19] Pattanaik, A., Tang, Z., Liu, S., Bommannan, G., Chowdhary, G.: Robust deep reinforcement learning with adversarial attacks. arXiv preprint arXiv:1712.03632 (2017)
 [20] Plappert, M., Houthooft, R., Dhariwal, P., Sidor, S., Chen, R.Y., Chen, X., Asfour, T., Abbeel, P., Andrychowicz, M.: Parameter space noise for exploration. arXiv preprint arXiv:1706.01905 (2017)
 [21] Silver, D., Hassabis, D.: AlphaGo: Mastering the ancient game of Go with machine learning. Research Blog (2016)
 [22] Sutton, R.S., Barto, A.G.: Reinforcement learning: An introduction, vol. 1. MIT press Cambridge (1998)
 [23] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R.: Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199 (2013)
 [24] Vahid, B.: Crafting adversarial example attacks on policy learners. https://github.com/behzadanksu/rlattack (2017)
 [25] Zhang, T., Kahn, G., Levine, S., Abbeel, P.: Learning deep control policies for autonomous aerial vehicles with MPC-guided policy search. In: Robotics and Automation (ICRA), 2016 IEEE International Conference on. pp. 528–535. IEEE (2016)
 [26] Zhu, Y., Mottaghi, R., Kolve, E., Lim, J.J., Gupta, A., FeiFei, L., Farhadi, A.: Targetdriven visual navigation in indoor scenes using deep reinforcement learning. In: Robotics and Automation (ICRA), 2017 IEEE International Conference on. pp. 3357–3364. IEEE (2017)