Spatiotemporally Constrained Action Space Attacks On Deep Reinforcement Learning Agents
Robustness of Deep Reinforcement Learning (DRL) algorithms to adversarial attacks in real-world applications, such as those deployed in cyber-physical systems (CPS), is of increasing concern. Numerous studies have investigated the mechanisms of attacks on the RL agent's state space. Nonetheless, attacks on the RL agent's action space (corresponding to actuators in engineering systems) are equally pernicious, but such attacks are relatively less studied in the ML literature. In this work, we first frame the problem as an optimization problem of minimizing the cumulative reward of an RL agent, with decoupled constraints serving as the attack budget. We propose the white-box Myopic Action Space (MAS) attack algorithm that distributes the attacks across the action space dimensions. Next, we reformulate the optimization problem above with the same objective function, but with a temporally coupled constraint on the attack budget, to take into account the approximated dynamics of the agent. This leads to the white-box Look-ahead Action Space (LAS) attack algorithm that distributes the attacks across the action and temporal dimensions. Our results show that, using the same amount of resources, the LAS attack deteriorates the agent's performance significantly more than the MAS attack. This reveals that, with limited resources, an adversary can exploit the agent's dynamics to malevolently craft attacks that cause the agent to fail. Additionally, we leverage these attack strategies as a possible tool to gain insights into the potential vulnerabilities of DRL agents.
The spectrum of Reinforcement Learning (RL) applications ranges from engineering design and control [18, 29] to business and creative design. As RL-based frameworks are increasingly deployed in the real world, it is imperative that their safety and reliability are well understood. While any adversarial infiltration of these systems can be costly, the safety of DRL systems deployed in cyber-physical systems (CPS), such as industrial robotic applications and self-driving vehicles, is especially critical, since failures in such systems endanger lives.
A root cause of these safety concerns is that in certain applications, the inputs to an RL system can be accessed and modified adversarially to cause the RL agent to take sub-optimal (or even harmful) actions. This is especially true when deep neural networks (DNNs) are used as key components (e.g., to represent policies) of RL agents. Recently, a wealth of results in the ML literature has demonstrated that DNNs can be fooled into misclassifying images by perturbing the input by an imperceptible amount or by introducing specific natural-looking attributes. Such adversarial perturbations have also been shown to impact an RL agent's state space.
Besides perturbing the RL agent's state space, it is also important to consider adversarial attacks on the agent's action space, which, in engineering systems, represents physically manipulable actuators. We note that (model-based) actuator attacks have been studied in the cyber-physical security community, including the vulnerability of continuous systems to discrete-time attacks; theoretical characteristics of undetectable actuator attacks; and "defense" schemes that re-stabilize a system under actuation attacks. However, the issue of adversarial attacks on an RL agent's action space has been relatively overlooked in the DRL literature. In this work, we present a suite of novel attack strategies on an RL agent's action space.
We formulate a white-box Myopic Action Space (MAS) attack strategy as an optimization problem with decoupled constraints.
We extend the formulation above by coupling constraints to compute a non-myopic attack that is derived from the agent’s state-action dynamics and develop a white-box Look-ahead Action Space (LAS) attack strategy. Empirically, we show that LAS crafts a stronger attack than MAS using the same budget.
We illustrate how these attack strategies can be used to understand an RL agent's vulnerabilities.
We present analysis showing that our proposed attack algorithms, which apply projected gradient descent to a surrogate reward function (represented by the trained RL agent's model), converge with the same effect as applying projected gradient descent to the true reward function.
Table 1: Comparison of adversarial attacks on DRL agents (O: optimization-based, M: model-based; S: state space, A: action space).

| Attack | Includes Dynamics | Method | Space of Attack |
| --- | --- | --- | --- |
| FGSM on Policies | ✗ | O | S |
| Gradient-based Adversarial Attack | ✗ | O | S |
| Policy Induction Attacks | ✗ | O | S |
| Strategically-Timed and Enchanting Attack | ✓ | O, M | S |
| Myopic Action Space (MAS) | ✗ | O | A |
| Look-ahead Action Space (LAS) | ✓ | O | A |
Due to the large amount of recent progress in adversarial machine learning, we focus only on reviewing the most recent attack and defense mechanisms proposed for DRL models. Table 1 presents the primary landscape of this area of research to contextualize our work.
Adversarial Attacks on RL Agent
Several studies of adversarial attacks on DRL systems have been conducted recently. Huang et al. (2017) extended the idea of FGSM attacks in deep learning to an RL agent's policies to degrade the performance of a trained RL agent. Subsequent work showed that these attacks on the agent's state space are transferable to other agents. Additionally, another line of work proposed attaching an Adversarial Transformer Network (ATN) to the RL agent to learn perturbations that deceive the RL agent into pursuing an adversarial reward. While the attack strategies mentioned above are effective, they do not consider the dynamics of the agent. One exception is the work of Lin et al. (2017), which proposed two attack strategies. One strategy is to attack the agent when the difference in probability/value between the best and worst actions crosses a certain threshold. The other is to combine a video prediction model that predicts future states with a sampling-based action planning scheme to craft adversarial inputs that lead the agent to an adversarial goal, which might not be scalable.
Robustification of RL Agents
As successful attack strategies are being developed for RL models, various works on training RL agents to be robust against attacks have also been conducted. Pattanaik et al. (2018) showed that a more severe attack can be engineered by increasing the probability of the worst action rather than decreasing the probability of the best action, and that the robustness of an RL agent can be improved by training the agent on these adversarial examples. More recently, Tessler et al. (2019) presented a method to robustify an RL agent's policy against action space perturbations by formulating the problem as a zero-sum Markov game. In their formulation, separate nominal and adversary policies are trained simultaneously, with a critic network updated over the mixture of both policies to improve both the adversarial and nominal policies. Meanwhile, Havens et al. (2018) proposed a method to detect and mitigate attacks by employing a hierarchical learning framework with multiple sub-policies, and showed that the framework reduces the agent's bias, maintaining high nominal rewards in the absence of adversaries. We note that other methods to defend against adversarial attacks exist, such as ensemble adversarial training and certifiable distributional robustness. These methods were developed mainly in the context of DNNs but may be extendable to DRL agents that employ DNNs as policies; however, discussing them in detail goes beyond the scope of this work.
We focus exclusively on model-free RL approaches. Below, let $s_t$ and $a_t$ denote the (continuous, possibly high-dimensional) vector variables denoting state and action, respectively, at time $t$. We assume a state evolution function $s_{t+1} = f(s_t, a_t)$, and let $r_t = r(s_t, a_t)$ denote the reward signal the agent receives for taking the action $a_t$, given $s_t$. The goal of the RL agent is to choose actions that maximize the cumulative discounted reward, $\sum_t \gamma^t r_t$, given access to the trajectory $\tau_t = (s_0, a_0, s_1, a_1, \ldots, s_t)$ comprising all past states and actions. In value-based methods, the RL agent determines the action at each time step by finding an intermediate quantity called the value function that satisfies the recursive Bellman equations. One example of such a method is Q-learning (Watkins and Dayan 1992), where the agent discovers the Q-function, defined recursively as:

$$Q(s_t, a_t) = \mathbb{E}\left[\, r_t + \gamma \max_{a'} Q(s_{t+1}, a') \,\right].$$
The optimal action (or "policy") at each time step is to deterministically select the action that maximizes this Q-function conditioned on the observed state, i.e.,

$$a_t = \arg\max_{a} \; Q(s_t, a).$$
In DRL, the Q-function in the above formulation is approximated via a parametric neural network $Q_\theta$; methods to train these networks include Deep Q-Networks (DQN) (Mnih et al. 2015).
In policy-based methods such as policy gradients (Sutton et al. 2000), the RL agent directly maps trajectories to policies. In contrast with Q-learning, the selected action is sampled from the policy, parameterized by a probability distribution $\pi_\theta(a_t \mid s_t)$, such that the expected reward (with the expectation taken over trajectories induced by $\pi_\theta$) is maximized:

$$\pi^{*} = \arg\max_{\pi_\theta} \; \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_t \gamma^t r_t\Big].$$
In DRL, the optimal policy is the output of a parametric neural network $\pi_\theta$, and actions at each time step are sampled from it; methods to train this neural network include proximal policy optimization (PPO) (Schulman et al. 2017).
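As a concrete illustration of the two action-selection modes described above, the following minimal PyTorch-style sketch contrasts greedy selection from a Q-function with sampling from a parameterized policy. The network shapes and environment dimensions are illustrative assumptions, not the agents used in our experiments:

```python
import torch

# Hypothetical Q-network and Gaussian policy network; the specific
# architectures and dimensions are illustrative, not our trained models.
q_net = torch.nn.Sequential(torch.nn.Linear(8, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, 4))   # values for 4 discrete actions
pi_net = torch.nn.Sequential(torch.nn.Linear(8, 64), torch.nn.ReLU(),
                             torch.nn.Linear(64, 2))  # mean of a 2-D continuous action

state = torch.randn(8)  # placeholder observation

# Value-based: deterministically pick the action with the largest Q-value.
a_value = q_net(state).argmax()

# Policy-based: sample the action from the parameterized distribution.
dist = torch.distributions.Normal(loc=pi_net(state), scale=1.0)
a_policy = dist.sample()
```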
Our goal is to identify adversarial vulnerabilities in RL agents in a principled manner. To this end, we define a formal threat model, where we assume the adversary possesses the following capabilities:
Access to the RL agent's action stream. The attacker can directly perturb the agent's nominal actions adversarially (under reasonable bounds, elaborated below). The nominal agent is assumed to be a closed-loop system with no active defense mechanisms.
Access to RL agent’s training environment. This is required to perform forward simulations to design an optimal sequence of perturbations (elaborated below).
Knowledge of trained RL agent’s DNN. This is needed to understand how the RL agent acts under nominal conditions, and to compute gradients. In the adversarial ML literature, this assumption is commonly made under the umbrella of white-box attacks.
In the context of the above assumptions, the goal of the attacker is to choose a (bounded) action space perturbation that minimizes long-term discounted rewards. Based on how the attacker chooses to perturb actions, we define and construct two types of optimization-based attacks. We note that alternative approaches, such as training another RL agent to learn a sequence of attacks, are also plausible. However, an optimization-based approach is computationally more tractable for generating on-the-fly attacks against a target agent than training another RL agent (especially for the high-dimensional continuous action spaces considered here). Therefore, we restrict our focus to optimization-based approaches in this paper.
Myopic Action-Space (MAS) Attack Model
We first consider the case where the attacker is myopic, i.e., at each time step, they design perturbations in a greedy manner without regard to future considerations. Formally, let $\delta_t$ be the action space perturbation (to be determined) and $b$ be a budget constraint on the magnitude of each $\delta_t$. (Physically, the budget may reflect a real physical constraint, such as the energy required to influence an actuation, or it may reflect the degree of imperceptibility of the attack.) At each time step $t$, the attacker designs $\delta_t$ such that the anticipated future reward $R_t$ is minimized:

$$\min_{\delta_t} \; R_t(s_t, a_t + \delta_t) \quad \text{s.t.} \quad \|\delta_t\|_p \leq b, \tag{1}$$

where $\|\cdot\|_p$ denotes the $\ell_p$-norm for some $p \geq 1$. Observe that while the attacker ostensibly solves separate (decoupled) problems at each time step, the states themselves are not independent, since along any trajectory $s_{t+1} = f(s_t, a_t + \delta_t)$, where $f$ is the transition function of the environment. Therefore, $R_t$ is implicitly coupled through time, since it depends heavily on the evolution of state trajectories rather than on individual state visitations. Hence, the adversarial perturbations solved for above are strictly myopic, and we consider this a static attack on the agent's action space.
Look-ahead Action Space (LAS) Attack Model
Next, we consider the case where the attacker is able to look ahead and choose a designed sequence of future perturbations. Using the same notation as above, let $H$ denote a horizon parameter, and let $R_{t:t+H} = \sum_{k=t}^{t+H} r_k$ denote the future sum of rewards from time $t$ up to the horizon. Additionally, we consider the (concatenated) perturbation matrix $\Delta = [\delta_t, \ldots, \delta_{t+H}]$ and let $B$ denote a budget parameter. The attacker solves the optimization problem:

$$\min_{\Delta} \; R_{t:t+H} \quad \text{s.t.} \quad \|\Delta\|_{p,q} \leq B, \tag{2}$$

where $\|\cdot\|_{p,q}$ denotes the mixed $\ell_{p,q}$-norm, i.e., the $\ell_q$-norm of the vector of column-wise $\ell_p$-norms (defined precisely in Appendix B). By coupling the objective function and constraints through the temporal dimension, the solution to the optimization problem above is forced to take the dynamics of the agent into account in an explicit manner.
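For concreteness, in the hybrid case $p = 2$, $q = 1$ (the combination analyzed in Appendix B and used to mount our LAS attacks), the constraint in (2) expands as

$$\|\Delta\|_{2,1} = \sum_{k=t}^{t+H} \|\delta_k\|_2 \leq B,$$

so the attacker may spend the budget unevenly across time steps, as long as the total expenditure over the horizon does not exceed $B$.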
In this section, we present two attack algorithms based on the optimization formulations presented in the previous section.
Algorithm for Mounting MAS Attacks
We observe that (1) is a nonlinear constrained optimization problem; therefore, an immediate approach to solving it is projected gradient descent (PGD). Specifically, let $\mathcal{B}_p(b)$ denote the $\ell_p$-ball of radius $b$ in the action space. We compute the gradient of the adversarial reward w.r.t. the action space variables and obtain the unconstrained adversarial action $\hat{a}_t$ using gradient descent with step size $\eta$. Next, we calculate the unconstrained perturbation $\hat{a}_t - a_t^{\mathrm{nom}}$ and project it onto $\mathcal{B}_p(b)$ to get $\delta_t$:

$$\hat{a}_t = a_t^{\mathrm{nom}} - \eta \nabla_a R_t, \qquad \delta_t = \Pi_{\mathcal{B}_p(b)}\big(\hat{a}_t - a_t^{\mathrm{nom}}\big).$$

Here, $a_t^{\mathrm{nom}}$ represents the nominal action. We note that this approach resembles the fast gradient-sign method (FGSM) (Goodfellow et al. 2015), although we compute standard gradients here. As a variation, we can compute multiple steps of gradient descent w.r.t. the action variable prior to projection; this is analogous to the basic iterative method (or iterative FGSM) (Kurakin et al. 2016). The MAS attack algorithm is shown in the supplementary material.
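A minimal sketch of this iterated PGD step is given below, assuming a differentiable surrogate reward `surrogate(state, action)` (made precise in the next paragraph, e.g., a Q-value or the policy's log-density). The function name, the $\ell_2$ projection choice, and the step sizes are illustrative assumptions:

```python
import torch

def mas_perturbation(surrogate, state, a_nom, budget, step=0.1, n_steps=10):
    """One myopic attack: PGD on the surrogate reward w.r.t. the action.

    surrogate(state, action) -> scalar proxy of the expected reward.
    Returns the bounded perturbation delta_t to add to the nominal action.
    """
    a_adv = a_nom.clone().requires_grad_(True)
    for _ in range(n_steps):
        # Descend the surrogate reward to find a damaging action.
        loss = surrogate(state, a_adv)
        grad, = torch.autograd.grad(loss, a_adv)
        with torch.no_grad():
            a_adv -= step * grad
            # Project the perturbation back onto the l2-ball of radius `budget`.
            delta = a_adv - a_nom
            norm = delta.norm(p=2)
            if norm > budget:
                delta *= budget / norm
            a_adv.copy_(a_nom + delta)
    return (a_adv - a_nom).detach()
```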
We note that in DRL approaches, only a noisy proxy of the true reward function is available: in value-based methods, we utilize the learned Q-function (for example, a DQN) as an approximation of the true reward function, while in policy-based methods, we use the probability density returned by the optimal policy as a proxy for the reward, under the assumption that actions with high probability induce a high expected reward. Since DQN selects the action with the argmax of Q-values and policy-based methods sample the action with the highest probability, the Q-values/action probabilities remain a useful proxy for the reward in our attack formulation. Therefore, our proposed MAS attack is technically a version of noisy projected gradient descent on the policy evaluation of the nominal agent. We elaborate on this further in the theoretical analysis section.
Algorithm for Mounting LAS Attacks
The previous algorithm is myopic and can be interpreted as a purely spatial attack. In this section, we propose a spatiotemporal attack algorithm by solving Eq. (2) over a given time window $H$. Due to the coupling of constraints in time, this approach is more involved. We initialize a copy of both the nominal agent and environment, called the adversary and the adversarial environment, respectively. At time $t$, we sample a virtual roll-out trajectory up to the horizon $H$ using this pair of adversarial agent and environment. At each time step of the virtual roll-out, we compute unconstrained action space perturbations $\hat{\delta}_k$ by taking (possibly multiple) gradient updates. Next, we compute the norms of each $\hat{\delta}_k$ and project the sequence of norms back onto an $\ell_q$-ball of radius $B$. The resulting projected norms at each time point now represent the individual budgets, $b_k$, of the spatial dimension at each time step. Finally, we project each original $\hat{\delta}_k$ obtained in the previous step onto the $\ell_p$-ball of radius $b_k$ to get the final perturbations $\delta_k$. (Intuitively, these steps represent the allocation of the overall budget across different time steps.) We note that to perform virtual roll-outs at every time step $t$, the state of the adversarial environment has to match the state of the nominal environment at $t$. To accomplish this, we save the history of all previous actions to re-compute the state of the adversarial environment at time $t$. While this implementation may be time-consuming, we believe that this overhead can be avoided by giving the adversary direct access to the current state of the nominal agent through platform API-level modifications, or through explicit observations (in real-life problems).
In subsequent time steps, the procedure above is repeated with the remaining budget and a reduced horizon until the horizon reaches zero. The horizon is then reset for planning a new spatiotemporal attack. An alternative formulation could shift the window without reducing its length until the adversary decides to stop the attack. However, we adopt the first formulation so that we can compare the performance of LAS with MAS for an equal overall budget. This technique of re-planning the perturbations at every step while shrinking the window is similar to the concept of receding horizons regularly used in optimal control (Mayne and Michalska 1990; Qin and Badgwell 2003). Such dynamic re-planning mitigates the planning error that occurs when the actual and simulated state trajectories diverge due to error accumulation. Hence, we re-plan at every $t$ to account for this deviation. The pseudocode is provided in Alg. 1.
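A condensed sketch of one LAS planning step appears below, for the hybrid case with $\ell_2$ spatial and $\ell_1$ temporal projections. The helpers `agent.act`, `gradient_perturbation` (a few gradient steps on the surrogate reward, as in the MAS sketch above), and `project_l1` (an $\ell_1$-ball projection, sketched in Appendix B) are hypothetical names, and the gym-style `step` interface is an assumption:

```python
import copy
import numpy as np

def las_plan(env, agent, state, H, B):
    """One LAS planning step (receding-horizon style): roll out virtually,
    compute per-step perturbations, then allocate the budget B over time.
    """
    sim_env = copy.deepcopy(env)       # adversary's copy of the environment
    deltas, s = [], state
    for _ in range(H):                 # virtual roll-out up to the horizon
        a_nom = agent.act(s)
        deltas.append(gradient_perturbation(agent, s, a_nom))
        s, _, done, _ = sim_env.step(a_nom + deltas[-1])
        if done:
            break
    # Temporal projection: distribute B over the per-step l2 norms...
    norms = np.array([np.linalg.norm(d) for d in deltas])
    budgets = project_l1(norms, B)
    # ...then spatial projection: rescale each step to its allotted budget.
    deltas = [d * (b / n) if n > 0 else d
              for d, n, b in zip(deltas, norms, budgets)]
    return deltas[0]  # apply only the current perturbation; re-plan at t+1
```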
We can show that projected gradient descent on the surrogate reward function (modeled by the RL agent's network), used to generate both MAS and LAS attacks, provably converges; this follows because gradient descent on a surrogate function is akin to noisy gradient descent on the true adversarial reward.
As described in previous sections, our MAS/LAS algorithms are motivated in terms of the adversarial reward $R_t$. However, if we use either DQN or policy gradient networks, we do not have direct access to the reward function, only a noisy proxy of it defined via a neural network. Therefore, we need to argue that performing (projected) gradient descent using this proxy loss function is a sound procedure. To do this, we appeal to a recent result by Ge et al. (2015), who prove that noisy gradient descent approximately converges to a local minimum. More precisely, consider a general constrained nonlinear optimization problem:
$$\min_{x} \; f(x) \quad \text{s.t.} \quad c(x) \leq 0,$$

where $c(\cdot)$ is an arbitrary (differentiable, possibly vector-valued) function encoding the constraints. Define $\mathcal{C} = \{x : c(x) \leq 0\}$ as the constraint set. We attempt to minimize the objective function via noisy (projected) gradient updates:

$$x_{k+1} = \Pi_{\mathcal{C}}\big(x_k - \eta\,(\nabla f(x_k) + \zeta_k)\big),$$

where $\zeta_k$ denotes the noise in the $k$-th gradient estimate and $\Pi_{\mathcal{C}}$ denotes projection onto $\mathcal{C}$.
Theorem 1. (Convergence of noisy projected gradients.) Assume that the noise terms $\zeta_k$ are i.i.d., satisfying $\mathbb{E}[\zeta_k] = 0$ and $\|\zeta_k\| \leq \sigma$ almost surely. Assume that both the constraint function $c$ and the objective function $f$ are smooth and Lipschitz, and possess Lipschitz Hessians. Assume further that the objective function is bounded. Then, there exists a learning rate $\eta$ such that with high probability, in a polynomial number of iterations, noisy projected gradient descent converges to a point that is $\epsilon$-close to some local minimum of $f$.
In our case, the relevant smoothness and noise parameters depend on the structure of the RL agent's neural network. (Smoothness assumptions on $f$ can perhaps be justified by limiting the architecture of the network, but the i.i.d. assumption on the noise $\zeta_k$ is hard to verify.) As such, it is difficult to ascertain whether the assumptions of the above theorem are satisfied in specific cases. Nonetheless, an interesting future theoretical direction is to understand the Lipschitz properties of specific families of DQN/policy gradient agents.
We defer further analysis of the double projection step onto mixed-norm balls used in our proposed LAS algorithms to the supplementary material.
Experimental Results & Discussion
To demonstrate the effectiveness and versatility of our methods, we implemented them on RL agents in continuous action environments from OpenAI's gym (Brockman et al. 2016), as these reflect the type of action space found in most practical applications. (Code and links to the supplementary material are available at https://github.com/xylee95/Spatiotemporal-Attack-On-Deep-RL-Agents.) For policy-based methods, we trained a nominal agent using the PPO algorithm, and for value-based methods, a DoubleDQN (DDQN) agent (Van Hasselt et al. 2016). (The only difference in implementation between policy- and value-based methods is that in policy methods, we take analytical gradients of a distribution to compute the attacks, e.g., in line 10 of Algorithm 1, while for value-based methods, we randomly sample adversarial actions to compute numerical gradients.) Additionally, we utilize Normalized Advantage Functions (Gu et al. 2016) to convert the discrete nature of DDQN's output to a continuous action space. For succinctness, we present the results of the attack strategies only for the PPO agent in the Lunar Lander environment. Additional results for the DDQN agent in the Lunar Lander and Bipedal-Walker environments, and for the PPO agent in the Bipedal-Walker, Mujoco Hopper, Half-Cheetah, and Walker environments, are provided in the supplementary materials. As a baseline, we implemented a random action space attack, where a random perturbation bounded by the same budget is applied to the agent's action space at every step. For MAS attacks, we implemented two different spatial projection schemes: $\ell_1$ projection based on Condat (2016), which yields a sparser distribution of attacks, and $\ell_2$ projection, which yields a denser distribution. For LAS attacks, all combinations of spatial and temporal projections for $\ell_1$ and $\ell_2$ were implemented.
Comparison of MAS and LAS Attacks
Fig. 2 shows distributions of cumulative rewards obtained by the PPO agent across ten episodes in the Lunar Lander environment, with each subplot representing a different combination of budget, $B$, and horizon, $H$. The top three subplots show experiments with an $H$ value of 5 time steps and $B$ values of 3, 4, and 5 from left to right, respectively. The bottom row shows a similar set of experiments but with an $H$ value of 10. For a direct comparison between MAS and LAS attacks with equivalent budgets across time, we assign the corresponding MAS budget values as $b = B/H$ (e.g., $B = 5$ over $H = 5$ steps gives $b = 1$ per step). This assumes that the total budget is allocated uniformly across every time step for a given $H$, while LAS has the flexibility to allocate the attack budget non-uniformly over the same interval, conditioned on the dynamics of the agent.
We note that keeping $H$ constant while increasing $B$ provides both MAS and LAS with a higher budget to inject into the nominal actions. We observe that with a low budget of 3 (Fig. 2a), only LAS is successful in attacking the RL agent, as seen by the corresponding decrease in rewards. With a higher budget of 5 (Fig. 2c), MAS has a more apparent effect on the performance of the RL agent, while LAS reduces the performance of the agent severely.
With $B$ constant, increasing $H$ allows the allocated budget to be distributed along a longer time horizon; in other words, LAS virtually looks further into the future. In the most naive case, a longer horizon dilutes the severity of the attack at each step compared to a shorter horizon. Comparing equal budget values across different horizons (i.e., horizons 5 and 10 for budget 3; Fig. 2a and Fig. 2d, respectively), attacks with $H = 10$ are generally less severe than their $H = 5$ counterparts. For all $B$ and $H$ combinations, we observe that MAS attacks are generally less effective than LAS. We note that this is a critical result of the study, as most literature on static attacks has shown that attacks can be ineffective below a certain budget. Here, we demonstrate that while MAS attacks can seem ineffective for a given budget, a stronger and more effective attack can be crafted using LAS with the same budget.
In the following sections, we further study the difference between MAS and LAS as well as demonstrate how the attacks can be utilized to understand the vulnerabilities of the agent in different environments.
Action Dimension Decomposition of LAS Attacks
Fig. 3 shows the action dimension decomposition of LAS attacks; the example shown results from one fixed combination of spatial and temporal projections. From Fig. 3a, we observe that across all episodes of LAS attacks, one of the action dimensions (the up-down direction of the lunar lander) is consistently perturbed more, i.e., accumulates more attack, than the left-right direction.
Fig. 3b shows a detailed view of the action dimension attacks for one episode (Episode 1). It is evident from the figure that the up-down actions of the lunar lander are more prone to attacks throughout the episode than the left-right actions. Additionally, the left-right action attacks are restricted after a certain time step, and only the up-down actions are attacked further. Fig. 3c further corroborates this observation in the Lunar Lander environment. As the episode progresses in Fig. 3c, the lunar lander initially lands on the ground in frame 3, but lifts up and hovers until the episode ends in frame 5. This observation supports the claim that the proposed attacks perturb the action dimensions in an optimal manner: perturbing the lunar lander in the horizontal direction would not decrease rewards further, whereas making it hover causes the agent to use more fuel, which consequently decreases the total reward. From these studies, we conclude that LAS attacks (together with the projections of actions in time) can clearly isolate the vulnerable action dimension(s) of the RL agent to mount a successful attack.
Ablation Study of Horizon and Budget
Lastly, we performed multiple ablation studies to compare the effectiveness of LAS and MAS attacks. While we have observed that LAS attacks are generally stronger than MAS, we hypothesize that there is an upper limit to LAS's advantage as the allowable budget increases. We take the difference in each attack's reduction in rewards (i.e., attacked minus nominal) and visualize how much more LAS reduces rewards compared to MAS under different settings of $B$ and $H$. For PPO in Lunar Lander, we observe that the reward reduction of LAS relative to MAS becomes less drastic as the budget increases, showing that LAS has diminishing returns as both MAS and LAS saturate at higher budgets. We defer detailed discussions and additional figures of the ablation study to the supplementary materials.
Conclusion & Future Work
In this study, we present two novel attack strategies on an RL agent's action space: a myopic attack (MAS) and a non-myopic attack (LAS). The results show that LAS attacks, which are crafted with explicit use of the agent's dynamics, are more powerful than MAS attacks. Additionally, we observed that applying LAS attacks to RL agents reveals potentially vulnerable actuators of an agent, as seen by the non-uniform distribution of attacks over certain action dimensions. This can be leveraged as a tool to identify vulnerabilities and plan a mitigation strategy under similar attacks. Possible future work includes extending the concept of LAS attacks to state space attacks, where the agent's observations are perturbed instead of the agent's actions, while taking into account the dynamics of the agent. Additionally, while we did not focus on the imperceptibility and deployment aspects of the proposed attacks in this study, defining a proper metric of detectability in action space and optimizing the budget to remain undetected in different environments will be a future research direction.
This work was supported in part by NSF grants CNS-1845969 and CCF-2005804, and AFOSR YIP Grant FA9550-17-1-0220.
- (2016) Undetectable sensor and actuator attacks for observer based controlled cyber-physical systems. In 2016 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1–7.
- (2018) Adversarial examples construction towards white-box Q-table variation in DQN pathfinding training. In 2018 IEEE Third International Conference on Data Science in Cyberspace (DSC), pp. 781–787.
- (2017) Vulnerability of deep reinforcement learning to policy induction attacks. In Machine Learning and Data Mining in Pattern Recognition, Cham, pp. 262–275.
- (2004) Convex optimization. Cambridge University Press.
- (2016) OpenAI Gym. arXiv preprint arXiv:1606.01540.
- (2016) Fast projection onto the simplex and the l1 ball. Mathematical Programming 158(1–2), pp. 575–585.
- (2015) Escaping from saddle points—online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pp. 797–842.
- (2015) Explaining and harnessing adversarial examples. In International Conference on Learning Representations.
- (2016) Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning, pp. 2829–2838.
- (2018) Online robust policy learning in the presence of unknown adversaries. In Advances in Neural Information Processing Systems, pp. 9916–9926.
- (2018) Inference aided reinforcement learning for incentive mechanism design in crowdsourcing. In Advances in Neural Information Processing Systems 31, pp. 5507–5517.
- (2017) Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284.
- (2018) Reliable control policy of cyber-physical systems against a class of frequency-constrained sensor and actuator attacks. IEEE Transactions on Cybernetics 48(12), pp. 3432–3439.
- (2019) Semantic adversarial attacks: parametric transformations that fool deep classifiers. In IEEE International Conference on Computer Vision (ICCV).
- (2016) Zero-stealthy attack for sampled-data control systems: the case of faster actuation than sensing. In 2016 IEEE 55th Conference on Decision and Control (CDC), pp. 5956–5961.
- (2016) Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236.
- (2019) A case study of deep reinforcement learning for engineering design: application to microfluidic devices for flow sculpting. Journal of Mechanical Design 141(11).
- (2017) Tactics of adversarial attack on deep reinforcement learning agents. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 3756–3762.
- (1990) Receding horizon control of nonlinear systems. IEEE Transactions on Automatic Control 35(7), pp. 814–824.
- (2015) Human-level control through deep reinforcement learning. Nature 518(7540), pp. 529–533.
- (2018) Robust deep reinforcement learning with adversarial attacks. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 2040–2042.
- (2018) DeepMimic: example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics (TOG) 37(4), pp. 143.
- (2003) A survey of industrial model predictive control technology. Control Engineering Practice 11(7), pp. 733–764.
- (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- (2018) Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations.
- (2012) Fast projections onto mixed-norm balls with applications. Data Mining and Knowledge Discovery 25(2), pp. 358–377.
- (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063.
- (2019) Deep reinforcement learning for adaptive traffic signal control. arXiv preprint arXiv:1911.06294.
- (2019) Action robust reinforcement learning and applications in continuous control. In International Conference on Machine Learning, pp. 6215–6224.
- (2017) Ensemble adversarial training: attacks and defenses. arXiv preprint arXiv:1705.07204.
- (2018) Sequential attacks on agents for long-term adversarial goals. In 2nd ACM Computer Science in Cars Symposium.
- (2016) Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence.
- (1992) Q-learning. Machine Learning 8(3), pp. 279–292.
- (2018) A PCA-based model to predict adversarial examples on Q-learning of path finding. In 2018 IEEE Third International Conference on Data Science in Cyberspace (DSC), pp. 773–780.
Appendix A Pseudocode of MAS Attack
Appendix B Analysis
Projections onto Mixed-norm Balls
The Look-ahead Action Space (LAS) attack model described above requires projecting onto the mixed-norm $\ell_{p,q}$-ball of radius $B$ in a vector space. Below, we show how to provably compute such projections in a computationally efficient manner. For a more complete treatment, we refer to Sra (2012). Recall the definition of the $\ell_{p,q}$-norm. Let $w \in \mathbb{R}^{nm}$ be partitioned into $m$ sub-vectors $w_1, \ldots, w_m$, each of length $n$. Then,

$$\|w\|_{p,q} = \big\| \left( \|w_1\|_p, \ldots, \|w_m\|_p \right) \big\|_q.$$
Due to the scale invariance of norms, we can assume $B = 1$. We consider the following special cases:
$p = q = 1$: this reduces to the case of the ordinary $\ell_1$-norm in $\mathbb{R}^{nm}$. Projection onto the unit $\ell_1$-ball can be achieved via soft-thresholding every entry of $w$:

$$w_i \leftarrow \mathrm{sign}(w_i)\,\max(|w_i| - \tau,\, 0),$$

where $\tau$ is a KKT parameter that can be discovered by simply sorting the (absolute) values of $w$; see Condat (2016).
$p = q = 2$: this reduces to the case of the isotropic $\ell_2$-norm in $\mathbb{R}^{nm}$. Projection onto the unit $\ell_2$-ball can be achieved by simple normalization:

$$w \leftarrow \frac{w}{\max(1, \|w\|_2)}.$$
$p = 2$, $q = 1$: this is a "hybrid" combination of the above two cases, and corresponds to the procedure we use in mounting our LAS attacks. Projection onto this ball can be achieved by a three-step method. First, we compute the $m$-dimensional vector, $v$, of column-wise $\ell_2$-norms. Then, we project $v$ onto the unit $\ell_1$-ball; essentially, this enables us to "distribute" the (unit) budget across columns. Since $\ell_1$-projection is achieved via soft-thresholding, a number of coordinates of this vector are zeroed out, and others undergo a shrinkage. Call this (sparsified) projected vector $\hat{v}$. Finally, we perform an $\ell_2$ projection, i.e., we scale each column of $w$ by dividing by its $\ell_2$-norm and multiplying by the corresponding entry of $\hat{v}$:

$$w_i \leftarrow \hat{v}_i\, \frac{w_i}{\|w_i\|_2}.$$
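A compact sketch of this three-step projection is given below: a minimal NumPy implementation under the stated $p = 2$, $q = 1$ setting, where the sort-based $\ell_1$ routine follows the standard algorithm referenced above (function names are ours, for illustration):

```python
import numpy as np

def project_l1(v, radius=1.0):
    """Project a nonnegative vector onto the l1-ball via sort-based
    soft-thresholding (cf. Condat 2016)."""
    if v.sum() <= radius:
        return v.copy()
    u = np.sort(v)[::-1]                       # sort magnitudes, descending
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, len(u) + 1) > css - radius)[0][-1]
    tau = (css[k] - radius) / (k + 1.0)        # KKT threshold
    return np.maximum(v - tau, 0.0)

def project_l21(W, radius=1.0):
    """Project a matrix (columns = per-step perturbations) onto the
    l_{2,1}-ball: column norms -> l1 projection -> column rescaling."""
    v = np.linalg.norm(W, axis=0)              # step 1: column-wise l2 norms
    v_hat = project_l1(v, radius)              # step 2: distribute the budget
    scale = np.divide(v_hat, v, out=np.zeros_like(v), where=v > 0)
    return W * scale                           # step 3: rescale each column

# Example: a 2-D action space over a horizon of 4 steps.
W = np.array([[0.5, -1.0,  0.2, 0.0],
              [0.3,  0.8, -0.4, 0.1]])
print(np.linalg.norm(project_l21(W), axis=0).sum())  # column norms now sum to <= 1
```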
Appendix C Additional Experiments
Comparison of Attacks Mounted on PPO Agent in Bipedal-Walker Environment
The results in Fig. 4 depict the comparison between the MAS and LAS attacks applied on a PPO agent in the Bipedal-Walker environment. A similar trend is observed, where LAS attacks are generally more severe than MAS attacks. We acknowledge that in this environment, MAS attacks are sometimes effective in reducing the rewards as well. However, this can be attributed to the Bipedal Walker having a higher-dimensional action space (4 dimensions) compared to the Lunar Lander environment (2 dimensions). In addition, the actions of the Bipedal Walker are highly coupled compared to the actions of the Lunar Lander. Hence, the Bipedal-Walker agent is more sensitive to perturbations, which explains the increased efficacy of MAS attacks.
Comparison of Attacks Mounted on DDQN Agent in Lunar Lander and Bipedal-Walker Environments
In Figures 5 and 6, we present additional results on the efficacy of different attack strategies for a DoubleDQN agent trained in the Lunar Lander and Bipedal Walker environments. An interesting observation is that in both environments, the effects of the attacks are more severe for the DDQN agent than for the PPO agent, as seen in the box plots. We conjecture that this is due to the deterministic nature of the DDQN method, where the optimal action is unique. In contrast, the PPO agent was trained with a certain stochasticity in the process, which might explain its additional robustness to noise or perturbations. Nevertheless, for the DDQN agent, we still observe a similar trend where, given the same amount of budget, more severe attacks can be crafted using the LAS strategy compared to MAS or random attacks.
Comparison of Attacks Mounted on PPO Agent in Mujoco Control Environments
In addition to the attacks mounted on the agents in the two environments above, we also compare the effect of the attacks on a PPO agent trained in three different Mujoco control environments. Figures 7, 8, and 9 illustrate the distribution of rewards obtained by the agent in the Hopper, Half-Cheetah, and Walker environments, respectively. In all three environments, we observe that LAS attacks are generally more effective in reducing the rewards of the agent across different values of budget and horizon, which reinforces the finding that LAS attacks are stronger than MAS attacks. However, it is also interesting to note that agents in different environments have different sensitivities to the attack budget. In Hopper (Figure 7) and Walker (Figure 9), we see that MAS attacks shift the median and increase the variance of the reward distribution with respect to the nominal, which highlights that there are some episodes in which the MAS attack fails to affect the agent. In contrast, in the Half-Cheetah environment (Figure 8), we see that MAS attacks shift the whole distribution of rewards downwards, showing that the agent is more sensitive to MAS in this environment than in the other two. This suggests that the budget and horizon values are hyper-parameters that should be tuned according to the environment.
Comparison of Temporal Projections in LAS
In this section, we present additional visualizations to further understand why $\ell_2$ temporal projections result in more severe attacks than $\ell_1$ temporal projections. Figures 10, 11, 12, and 13 present the budget usage plots across 100 time steps for both PPO and DDQN in the Lunar Lander and Bipedal-Walker environments. The two subplots in each figure correspond to the two spatial projection schemes ($\ell_1$ and $\ell_2$). These plots directly compare the difference in the amount of budget used between $\ell_1$ and $\ell_2$ temporal projections for both spatial attack schemes.
In most cases, with the exception of Figure 12, we see a clear trend that $\ell_1$ temporal projections result in sparser but more concentrated peaks of budget utilization (corresponding to a few instances of strong attacks). In contrast, $\ell_2$ temporal projections result in a more distributed but frequent pattern of budget utilization (corresponding to more frequent but weaker instances of attacks). We note that while $\ell_1$ projections produce stronger individual attacks, there is a diminishing return on allocating more attack budget to a single time point beyond a certain limit. This explains the weaker effect of $\ell_1$ temporal projections: they concentrate the attacks at a few points but ultimately give the agent time to recover. In contrast, $\ell_2$ temporal projections distribute the attacks more frequently, causing the agent to follow a diverging trajectory from which it is hard to recover.
As an anecdotal example in the Lunar Lander environment, we observe that attacks with $\ell_1$ temporal projection tend to turn off the vertical thrusters of the lunar lander. However, due to the sparsity of the attacks, the RL agent can often fire the upward thrusters in time to prevent a free-fall landing. With $\ell_2$ temporal projections, the agent is attacked continuously. Consequently, the agent has no chance to return to a nominal state and quickly diverges towards a terminal state.
For this section, we present an ablation study to investigate the effect of different budget and horizon parameters on the effectiveness of LAS vs. MAS. As mentioned in the main manuscript, we take the difference in each attack's reduction in rewards (i.e., attacked minus nominal) and visualize how much more LAS reduces rewards compared to MAS under different settings of $B$ and $H$. Fig. 14 illustrates the ablation study of a PPO agent in Lunar Lander. The figure is categorized by spatial projection, with the two spatial projection schemes ($\ell_1$ and $\ell_2$) shown in the left and right subplots; both subplots correspond to a fixed temporal projection. Each individual subplot shows three lines with different $H$ values, with each line visualizing the change in mean cumulative reward as the budget increases along the x-axis. As the budget increases, attacks under both spatial projections show a monotonic decrease in cumulative rewards. Attacks under each spatial projection with an $H$ value of 5 show different trends: the $\ell_2$ case decreases roughly linearly with increasing budget, while the $\ell_1$ case becomes stagnant after a budget value of 3. This can be attributed to the fact that $\ell_1$ attacks are more sparsely distributed, causing most of the perturbations to be concentrated in one action dimension. Thus, as the budget increases, we see diminishing returns for LAS attacks, since attacking a single action dimension beyond a certain limit does not decrease the reward any further. The study was also conducted for the PPO Bipedal-Walker agent and both DDQN Lunar Lander and Bipedal-Walker agents, as shown in Figures 16, 15, and 17, respectively. We only consider one temporal projection for both spatial projections. At a glance, we see different trends across the figures due to the different environment dynamics. However, in all cases, the difference in reward reduction is always less than or equal to zero, which implies that LAS attacks are at least as effective as MAS attacks. We also observed that attacks with a horizon value of 5 become ineffective after a certain budget value. This shows that beyond some budget value, MAS attacks are as effective as LAS attacks, because LAS might be operating at maximum attack capacity. It is interesting to note that the Bipedal-Walker PPO agent needed a higher budget than its DDQN counterpart, due to PPO being more robust to attacks.
Effect of Horizon Parameter in LAS
In this section, we further describe the empirically observed effect of the horizon parameter $H$ on the effectiveness of LAS attacks. $H$ defines a fixed time horizon (i.e., a number of steps in the DRL environment) over which to expend a given budget $B$. For a fixed $B$ and a short $H$, LAS favors injecting stronger perturbations at each step. Hence, we would intuitively hypothesize that the severity of LAS attacks will increase as $H$ decreases, as shown by the reduction in rewards between MAS and LAS in Figure 3 of the main paper. In most cases, the reduction is negative, showing that LAS attacks are indeed more severe. However, in some cases, as shown in Figures 8, 9 & 10, a shorter $H$ results in LAS being not as effective as a longer $H$ (though still stronger than MAS, as evident from the negative values on the y-axis). This can be attributed to the nonlinear reward functions of the environments and the consequent failure modes of the agent. For example, attacks on the Lunar Lander PPO agent cause failure by constantly firing the thruster engines to prevent the Lunar Lander from landing, hence accumulating negative rewards over a long time. In contrast, attacks on the DQN agent cause the Lunar Lander to crash immediately, hence terminating the episode before too much negative reward is accumulated. Thus, while the effect of $H$ on LAS attacks does not always show a consistent trend, we note that it is a key parameter that can be tuned to control the failure modes of the RL agent.
Action Space Dimension Decomposition
We provide additional results on using the LAS attack scheme as a tool to understand the RL agent's action space vulnerabilities for a DoubleDQN agent in both the Lunar Lander and Bipedal Walker environments. It is interesting to note in Figure 18 that, even for agents trained with a different philosophy (value-based vs. policy-based, shown in the main manuscript), the attack scheme still concentrates the attack on a similar dimension (the up-down action for Lunar Lander), which highlights the importance of that particular dimension. In Figure 19, we show the outcome of the LAS attack scheme in the Bipedal Walker environment, which has four action space dimensions. The four joints of the bipedal walker, namely the left hip, left knee, right hip, and right knee, are attacked in this case, and from Figure 19, we see that the left hip is attacked more than any other action dimension in most of the episodes. This supports our inference that LAS attacks can reveal vulnerabilities in the action space dimensions (actuators, in the case of CPS RL agents).