Online Data Poisoning Attacks
Abstract
We study data poisoning attacks in the online setting, where training items arrive sequentially and the attacker may perturb the current item to manipulate online learning. Importantly, the attacker has no knowledge of future training items or of the data generating distribution. We formulate the online data poisoning attack as a stochastic optimal control problem, and solve it with model predictive control and deep reinforcement learning. We also upper bound the suboptimality suffered by the attacker for not knowing the data generating distribution. Experiments validate our control approach in generating near-optimal attacks on both supervised and unsupervised learning tasks.
Xuezhou Zhang, Department of Computer Sciences, University of Wisconsin–Madison, zhangxz1123@cs.wisc.edu
Xiaojin Zhu, Department of Computer Sciences, University of Wisconsin–Madison, jerryzhu@cs.wisc.edu
Laurent Lessard, Department of Electrical and Computer Engineering, University of Wisconsin–Madison, laurent.lessard@wisc.edu
Preprint. Under review.
1 Problem Statement
Protecting machine learning from adversarial attacks is of paramount importance [35, 18, 40]. To do so one must first understand various types of adversarial attacks. Data poisoning is a type of attack where an attacker contaminates the training data in order to force a nefarious model on the learner [38, 29, 9, 11, 19, 24]. Prior work on data poisoning focused almost exclusively on the batch setting, where the attacker poisons a batch training set, and then the victim learns from the batch [7, 30, 38, 29, 33, 10]. However, the batch setting misses the threats posed by the attacker on sequential learners. For example, in e-commerce applications user-generated data arrives at the learner sequentially. Such applications are particularly susceptible to poisoning attacks, because it is relatively easy for the attacker to manipulate data items before they arrive at the learner. Furthermore, the attacker may observe the effect of previous poisoning on the learner and adaptively decide how to poison next. This adaptivity makes online data poisoning a potentially more severe threat compared to its batch counterpart.
This paper presents a principled study of online data poisoning attacks. Our key contribution is an optimal control formulation of such attacks. We provide theoretical analysis to show that the attacker can attack near-optimally even without full knowledge of the underlying data generating distribution. We then propose two practical attack algorithms, one based on traditional model-based optimal control and the other based on deep reinforcement learning, and show that they achieve near-optimal attack performance in synthetic and real-data experiments. Taken together, this paper builds a foundation for future studies of defense against online data poisoning.
The online data poisoning problem in this paper is shown in Figure 1. There are three entities: a stochastic environment, a sequential learning victim, and the online attacker. In the absence of attacks, at time $t$ the environment draws a training data point $z_t \in \mathcal{Z}$ i.i.d. from a time-invariant distribution $P$: $z_t \sim P$. For example, $z_t$ can be a feature-label pair $z_t := (x_t, y_t)$ in supervised learning or just the features $z_t := x_t$ in unsupervised learning. The victim maintains a model $\theta_t \in \Theta$. Upon receiving $z_t$, the victim performs one step of the sequential update defined by the function $f$:
$\theta_{t+1} = f(\theta_t, z_t)$ (1)
For example, $f$ can be gradient descent $f(\theta_t, z_t) := \theta_t - \eta \nabla_\theta \ell(\theta_t, z_t)$ under learner loss $\ell$ and step size $\eta$. We now introduce the attacker by defining its knowledge, allowed actions, and goals:

The attacker has knowledge of the victim’s update function $f$, the victim’s initial model $\theta_0$, the data $z_{0:t}$ generated by the environment so far, and optionally $n$ “pre-attack” data points $z_{-n:-1}$. However, at time $t$ the attacker does not have clairvoyant knowledge of the future data points $z_{t+1:\infty}$, nor does it know the environment distribution $P$.

The attacker can perform only one type of action: once the environment draws a data point $z_t$, the attacker perturbs the data point into a potentially different point $a_t \in \mathcal{Z}$. The attacker incurs a perturbation cost $g_{\mathrm{per}}(z_t, a_t)$, which reflects the price to attack. For example, $g_{\mathrm{per}}(z_t, a_t) := \|a_t - z_t\|_p$ if $\mathcal{Z}$ is endowed with an appropriate $p$-norm. The attacker then gives $a_t$ to the victim, who proceeds with model update (1) using $a_t$ instead of $z_t$.

The attacker’s goal, informally, is to force the victim’s learned models $\theta_t$ to satisfy certain nefarious properties at each step while paying a small cumulative perturbation cost. These “nefarious properties” (or rather, the inability to achieve them) are captured by a nefarious cost $g_{\mathrm{nef}}(\theta_t)$. It can encode a variety of attack goals considered in the literature such as: (i) targeted attack $g_{\mathrm{nef}}(\theta_t) := \|\theta_t - \theta^\dagger\|$ to drive the learned model toward an attacker-defined target model $\theta^\dagger$ (the dagger is a mnemonic for attack); (ii) aversion attack $g_{\mathrm{nef}}(\theta_t) := -\|\theta_t - \hat\theta\|$ (note the sign) to push the learned model away from a good model $\hat\theta$, such as the one estimated from pre-attack data; (iii) backdoor attack, in which the goal is to plant a backdoor such that the learned model behaves unexpectedly on special examples [24, 33, 10]. To balance nefarious properties with perturbation cost, the attacker defines a running cost $g$ at time $t$:
$g(\theta_t, z_t, a_t) := \lambda\, g_{\mathrm{nef}}(\theta_t) + g_{\mathrm{per}}(z_t, a_t)$ (2)
where $g_{\mathrm{per}}(z_t, a_t)$ is the perturbation cost of turning $z_t$ into $a_t$ and $\lambda$ is a weight chosen by the attacker to balance the two. The attacker desires small cumulative running costs, which is the topic of Section 3.
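The interaction among environment, attacker, and victim described in this section can be sketched in a few lines. The following is a minimal illustration, assuming a one-dimensional victim that takes gradient steps on a squared loss, a standard Gaussian environment, and absolute-value nefarious and perturbation costs. These choices, as well as the names `victim_update`, `running_cost`, `simulate`, `null_attack`, and `replace_attack`, are hypothetical and are not the setup used in the experiments.

```python
import random

def victim_update(theta, z, eta=0.1):
    """Assumed learner: one gradient step on the squared loss 0.5 * (theta - z)**2."""
    return theta - eta * (theta - z)

def running_cost(theta, z, a, target, lam=1.0):
    """Weighted nefarious cost |theta - target| plus perturbation cost |a - z|."""
    return lam * abs(theta - target) + abs(a - z)

def null_attack(theta, z):
    """No perturbation: the victim sees the clean point."""
    return z

def replace_attack(theta, z, target=1.0):
    """Crude illustrative attacker: always feed the target value to the victim."""
    return target

def simulate(attack, steps=50, theta0=0.0, target=1.0, seed=0):
    """Run the poisoned learning loop; return (final model, summed running cost)."""
    rng = random.Random(seed)
    theta, total = theta0, 0.0
    for _ in range(steps):
        z = rng.gauss(0.0, 1.0)          # environment draws z_t
        a = attack(theta, z)             # attacker perturbs z_t into a_t
        total += running_cost(theta, z, a, target)
        theta = victim_update(theta, a)  # victim learns from a_t instead of z_t
    return theta, total
```

Calling `simulate(null_attack)` and `simulate(replace_attack)` contrasts the two extremes: the first pays no perturbation cost but leaves the model wherever the clean data takes it, while the second drives the model to the target at a large perturbation cost. The attacks studied in this paper sit between these extremes.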
2 Related Work
Data poisoning attacks have been studied against a wide range of learning systems. However, this body of prior work has almost exclusively focused on the offline or batch settings, where the attacker observes and can poison the whole training set or an entire batch of samples at once, respectively. In contrast, our paper focuses on the online setting, where the attacker has to act multiple times and sequentially during training. Examples of offline or batch poisoning attacks against SVM include [7, 9, 38]. Such attacks are generalized into a bilevel optimization framework against general offline learners with a convex objective function in [29]. A variety of attacks against other learners have been developed, including neural networks [21, 30], autoregressive models [2, 11], linear and stochastic bandits [19, 27], collaborative filtering [24], and models for sentiment analysis [31].
There is an intermediate attack setting between offline and online, which we call clairvoyant online attacks, where the attacker performs actions sequentially but has full knowledge of all future input data. Examples include heuristic attacks against SVM learning from data streams [9] and binary classification with an online gradient descent learner [37]. Our paper focuses instead on the perhaps more realistic setting where the attacker has no knowledge of the future data stream. More broadly, our paper advocates for a general optimal control viewpoint that is not restricted to specific learners such as SVM.
The parallel line of work studying online teaching also considers the sequential control problem of machine learners, where the goal is to choose a sequence of training examples that accelerates learning [26, 23]. However, [26] solves the problem using a greedy heuristic that we show in Section 6 performs poorly compared to our optimal control approach. On the other hand, [23] finds optimal teaching sequences but is restricted to an ordinary linear regression learner.
The problem of optimal feedback control in the presence of noise, uncertain disturbances, or uncertain dynamics has been an area of study for the better part of the past century. Major subfields include stochastic control, when disturbances are stochastic in nature [3, 22], adaptive control, when unknown parameters must be learned in an online fashion [4, 32], and robust control, when a single controller is designed to control a family of systems within some uncertainty set [39, 34].
More recently, these classical problems have been revisited in the context of modern statistics, with the goal of obtaining tight sample complexity bounds. Examples include unknown dynamics [13], adversarial dynamics [1], adversarial cost [12] or unknown dynamics and cost [16]. These works typically restrict their attention to linear systems with quadratic or convex losses, which is a common and often reasonable assumption for control systems. However, for our problem of interest described in Section 1, the system dynamics (1) are the learner’s dynamics, which are nonlinear for all cases of practical interest, including simple cases like gradient descent. In the sections that follow, we develop tools, algorithms, and analysis for handling this more general nonlinear setting.
3 An Optimal Control Formulation
We now precisely define the notion of optimal online data poisoning attacks. To do so, we cast the online data poisoning attack setting in Section 1 as a Markov Decision Process (MDP) explained below.

The state $s_t$ at time $t$ is the stacked vector $s_t := [\theta_t, z_t]^\top$ consisting of the victim’s current model $\theta_t$ and the incoming environment data point $z_t$. The state space is $\mathcal{S} := \Theta \times \mathcal{Z}$.

The attacker’s action $a_t$ is the perturbed training point that the victim receives in place of $z_t$. The action space is $\mathcal{A} := \mathcal{Z}$.

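From the attacker’s perspective, the state transition probability $\mathcal{T}: \mathcal{S} \times \mathcal{A} \to \Delta_{\mathcal{S}}$, where $\Delta_{\mathcal{S}}$ is the probability simplex over $\mathcal{S}$, describes the conditional probability of the next state given the current state and attack action. Specifically, $\mathcal{T}(s_{t+1} \mid s_t, a_t) = \Pr\left([f(\theta_t, a_t), z_{t+1}]^\top \mid \theta_t, z_t, a_t\right)$. For concreteness, in this paper, we assume that the victim learning update $f$ is deterministic, and thus the stochasticity is solely in the $z_{t+1}$ component inside $s_{t+1}$, which has marginal distribution $P$, i.e.
$\mathcal{T}\left([f(\theta_t, a_t), z_{t+1}]^\top \mid s_t, a_t\right) = P(z_{t+1})$ (3)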
The quality of control at time $t$ is specified by the running cost $g(\theta_t, z_t, a_t)$ in (2), to be minimized. From now on, we overload the notation and write the running cost equivalently as $g(s_t, a_t)$. Note that this is the opposite of the reward maximization setup commonly seen in reinforcement learning.

We present the online data poisoning attack with an infinite time horizon (the finite horizon case is similar but omitted due to space). We introduce a discounting factor $\gamma \in (0, 1)$ to define a discounted cumulative cost.

The initial probability $\mu_0$ is the probability distribution of the initial state $s_0$. In particular, we assume that the initial model $\theta_0$ is fixed while the first data point $z_0$ is sampled from $P$, i.e. $\mu_0(\theta_0, z_0) = P(z_0)$.
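A policy is a function $\phi: \mathcal{S} \to \mathcal{A}$ that the attacker uses to choose the attack action $a_t := \phi(s_t)$ based on the current victim model $\theta_t$ and the current environment input $z_t$. The value $V^\phi(s)$ of a state $s$ is the expected discounted cumulative cost starting at $s$ and following policy $\phi$:
$V^\phi(s) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t g(s_t, \phi(s_t)) \,\middle|\, s_0 = s\right]$ (4)
where the expectation is over the transition probability $\mathcal{T}$. Overall, the attacker wants to perform optimal control over the MDP, that is, to find an optimal control policy that minimizes the expected value at the initial state. Define the attacker’s objective as $J(\phi) := \mathbb{E}_{s \sim \mu_0} V^\phi(s)$, and the attacker’s optimal attack policy as $\phi^* := \arg\min_{\phi} J(\phi)$.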
Fortunately for the victim, the attacker cannot directly solve this optimal attack problem because it does not know the environment data distribution $P$ and thus cannot evaluate the expectation. However, as we show next, the attacker can use model predictive control to approximately and incrementally solve for the optimal attack policy while it gathers information about $P$ as the attack happens.
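For intuition about the objective, the expected discounted cumulative cost of a fixed policy can be approximated by Monte Carlo rollouts when a sampler for the environment is available (which is exactly what the realistic attacker lacks). The sketch below assumes a one-dimensional victim with a gradient step on a squared loss, quadratic nefarious and perturbation costs, and a truncated horizon; the names `f`, `g`, `discounted_cost`, and `estimate_J` are illustrative only.

```python
import random

def f(theta, a, eta=0.1):
    """Assumed victim update: gradient step on 0.5 * (theta - a)**2."""
    return theta + eta * (a - theta)

def g(theta, z, a, target=1.0, lam=1.0):
    """Assumed running cost: squared distance to target plus squared perturbation."""
    return lam * (theta - target) ** 2 + (a - z) ** 2

def discounted_cost(policy, sample_z, theta0=0.0, gamma=0.9, horizon=200):
    """Discounted cumulative cost of one rollout, truncated after `horizon` steps."""
    theta, z, total = theta0, sample_z(), 0.0
    for t in range(horizon):
        a = policy(theta, z)                 # action from the state (theta, z)
        total += gamma ** t * g(theta, z, a)
        theta, z = f(theta, a), sample_z()   # deterministic model, random next input
    return total

def estimate_J(policy, n_rollouts=200, seed=0, **kw):
    """Average rollout cost under an assumed standard Gaussian environment."""
    rng = random.Random(seed)
    sample_z = lambda: rng.gauss(0.0, 1.0)
    costs = [discounted_cost(policy, sample_z, **kw) for _ in range(n_rollouts)]
    return sum(costs) / n_rollouts
```

For example, `estimate_J(lambda theta, z: z)` estimates the objective of the policy that never perturbs. Truncation is a mild approximation here because the discount weight on step $t$ decays geometrically.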
4 Practical Attack Algorithms via Model Predictive Control
The key obstacle that prevents the attacker from obtaining an optimal attack is the unknown data distribution $P$. However, the attacker can build an increasingly accurate empirical distribution $\hat P_t$ from $z_{0:t}$ and optionally the pre-attack data sampled from $P$. Specifically, at time $t$, with $\hat P_t$ in place of $P$ and with the model $\theta_t$ in place of $\theta_0$, the attacker can construct a surrogate MDP $\hat{\mathcal{M}}_t$, solve for the optimal policy $\hat\phi_t$ on $\hat{\mathcal{M}}_t$, and use $\hat\phi_t$ to perform a one-step attack: $a_t := \hat\phi_t(s_t)$.
As time goes on, the attacker repeats the process of estimating $\hat P_t$ and applying the one-step attack $a_t$. This repeated procedure of (re)planning ahead but only executing one action is called Model Predictive Control (MPC) [8, 28], and is widely used across the automotive, aerospace, and petrochemical industries, to name a few. At each time step $t$, MPC would plan a sequence of attacks using the surrogate model (in our case $\hat P_t$ instead of $P$), apply the first attack $a_t$, update $\hat P_t$, and repeat. This allows the controller to continually adapt without committing to an inaccurate model.
Next, we present two algorithms that practically solve the surrogate MDP, one based on model-based planning and the other based on model-free reinforcement learning.
4.1 Algorithm NLP: Planning with Nonlinear Programming
In the NLP algorithm, the attacker further approximates the surrogate objective in two steps. The first approximation truncates the infinite sum at $h$ steps after $t$, making it a finite-horizon control problem. The second approximation does two things: (i) It replaces the expectation by one sampled trajectory of the future input sequence, drawn from $\hat P_t$. It is of course possible to use the average of multiple trajectories to better approximate the expectation, though empirically we found that one trajectory is sufficient. (ii) Instead of optimizing over a policy $\phi$, it locally searches for the sequence of the next $h$ actions. The attacker now solves the following optimization problem at every time $t$:
(5)  
s.t.  
Let $a^*_t, a^*_{t+1}, \ldots$ be a solution. The NLP algorithm executes only the first action, i.e. it defines $\hat\phi_t(s_t) := a^*_t$, then moves on to $t+1$. The resulting attack problem in general has a nonlinear objective stemming from $g_{\mathrm{nef}}$ and $g_{\mathrm{per}}$ in (2), and nonconvex equality constraints stemming from the victim’s learning rule $f$ in (1). Nonetheless, the attacker can solve modest-sized problems using modern nonlinear programming solvers such as IPOPT [36].
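The plan-then-execute-one-action loop can be sketched as follows. This is a toy stand-in under stated assumptions, not the paper's implementation: the victim is a one-dimensional learner with a gradient step on a squared loss, the costs are quadratic, each action is scored through the model it produces, and the nonlinear program is solved by plain finite-difference gradient descent rather than by IPOPT. The constants and the names `plan_cost`, `plan`, and `mpc_attack` are hypothetical.

```python
import random

ETA, LAM, GAMMA, TARGET = 0.1, 10.0, 0.9, 1.0   # illustrative values only

def f(theta, a):
    """Assumed victim update: one gradient step on 0.5 * (theta - a)**2."""
    return theta + ETA * (a - theta)

def plan_cost(actions, theta, zs):
    """Discounted cost of an action sequence along one sampled input trajectory."""
    total = 0.0
    for k, (a, z) in enumerate(zip(actions, zs)):
        theta = f(theta, a)   # model after the victim consumes action a
        total += GAMMA ** k * (LAM * (theta - TARGET) ** 2 + (a - z) ** 2)
    return total

def plan(theta, zs, iters=200, lr=0.05, eps=1e-5):
    """Local search over the action sequence by finite-difference gradient descent."""
    actions = list(zs)        # initialize at the null attack
    for _ in range(iters):
        grad = []
        for i in range(len(actions)):
            up = actions[:i] + [actions[i] + eps] + actions[i + 1:]
            dn = actions[:i] + [actions[i] - eps] + actions[i + 1:]
            grad.append((plan_cost(up, theta, zs) - plan_cost(dn, theta, zs)) / (2 * eps))
        actions = [a - lr * gi for a, gi in zip(actions, grad)]
    return actions

def mpc_attack(stream, pre_attack, theta0=0.0, h=5, seed=0):
    """Replan at every step on the empirical distribution; execute only the first action."""
    rng = random.Random(seed)
    pool, theta = list(pre_attack), theta0
    for z in stream:
        pool.append(z)                                     # refine the empirical distribution
        future = [rng.choice(pool) for _ in range(h - 1)]  # one sampled future trajectory
        a = plan(theta, [z] + future)[0]                   # keep the first planned action
        theta = f(theta, a)
    return theta
```

In this toy problem the dynamics are linear and the cost is quadratic, so gradient descent with a small step size suffices. For general learners the problem is nonconvex, which is why a dedicated nonlinear programming solver is used in the paper.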
4.2 Algorithm DDPG: Deep Deterministic Policy Gradient
Instead of truncating and sampling to approximate the surrogate attack problem with a nonlinear program, one can directly solve for the optimal parametrized policy using reinforcement learning. In this paper, we utilize deep deterministic policy gradient (DDPG) [25] to handle a continuous action space. DDPG learns a deterministic policy with an actor-critic framework. Roughly speaking, it simultaneously learns an actor network $\mu(s)$ parametrized by $\theta^\mu$ and a critic network $Q(s, a)$ parametrized by $\theta^Q$. The actor network represents the currently learned policy while the critic network estimates the action-value function of the current policy, whose functional gradient guides the actor network to improve its policy. Specifically, the policy gradient can be written as $\nabla_{\theta^\mu} J = \mathbb{E}_{s \sim \rho^\mu}\left[\nabla_a Q(s, a \mid \theta^Q)\big|_{a = \mu(s)} \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)\right]$, in which the expectation is taken over $\rho^\mu$, the discounted state visitation distribution for the current policy $\mu$. The critic network is updated using Temporal-Difference (TD) learning. We refer the reader to the original paper [25] for a more detailed discussion of this algorithm and other deep learning implementation details.
There are two advantages of this policy learning approach over the direct NLP approach. First, it learns an actual policy, which can generalize to more than one step of attack. Second, it is a model-free method and does not require knowledge of the analytical form of the system dynamics $f$, which is necessary for the direct approach. Therefore, DDPG also applies to the black-box attack setting, where the learner’s dynamics $f$ is unknown. To demonstrate the generalizability of the learned policy, in our experiments described later, we only allow the DDPG method to train once at the beginning of the attack, on the surrogate MDP based on $\theta_0$ and the pre-attack data $z_{-n:-1}$. The learned policy is then applied to all later attack rounds without retraining.
5 Theoretical Analysis
The fundamental statistical limit to a realistic attacker is its lack of knowledge of the environment data distribution $P$. An idealized attacker with knowledge of $P$ can find the optimal control policy $\phi^*$ that achieves the optimal attack objective $J(\phi^*)$. In contrast, a realistic attacker only has an estimated $\hat P$, hence an estimated state transition $\hat{\mathcal{T}}$, and ultimately an estimated MDP $\hat{\mathcal{M}}$. The realistic attacker will find an optimal policy $\hat\phi^*$ with respect to its estimated MDP $\hat{\mathcal{M}}$, but $\hat\phi^*$ is in general suboptimal with respect to the true MDP $\mathcal{M}$. We are interested in the optimality gap $J(\hat\phi^*) - J(\phi^*)$. Note both are evaluated on the true MDP.
We present a theoretical analysis relating the optimality gap to the quality of estimated . Our analysis is a natural extension to the Simulation Lemma in tabular reinforcement learning [20] and that of [5]. We assume that both and are compact, and the running cost is continuous and thus bounded on its compact domain. WLOG, we assume . It is easy to see that then the range of value is bounded: for both , any policy, and any state. Note the value function (4) satisfies the Bellman equation: .
Proposition 5.1.
Consider two MDPs that differ only in state transition, induced by and , respectively. Assume that . Let denote the optimal policy on and the optimal policy on . Then, .
Proposition 5.1 implies that the optimality gap is at most linear in the estimation error of $\hat P$. Classic results on kernel density estimation (KDE) suggest that the $\ell_1$ distance between $P$ and the kernel density estimator $\hat P_n$ based on $n$ samples converges to zero asymptotically at a rate of $O(n^{-s})$ for some constant $s > 0$ (e.g. Theorem 9 in [17]).
In the experiment section below, the environment data stream is generated from a uniform distribution on a finite data set, in which case $P$ is a multinomial distribution. Under this special setting, we are able to provide a finite-sample bound of order that matches the best achievable asymptotic rate above, i.e. as .
Theorem 5.2.
Consider an MDP induced by a multinomial distribution with support cardinality , and a surrogate MDP induced by the empirical distribution on i.i.d. samples, i.e. . Denote the optimal policy on and the optimal policy on . Then, with probability at least , we have .
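The quantity driving this bound, the $\ell_1$ distance between a multinomial distribution and its empirical estimate, is easy to examine numerically. The sketch below is an illustration only: it assumes a uniform distribution over five atoms, and the names `empirical_l1_error` and `mean_error` are hypothetical. It estimates the average $\ell_1$ error for several sample sizes; for a multinomial with fixed support this error is known to shrink roughly like $1/\sqrt{n}$.

```python
import random
from collections import Counter

def empirical_l1_error(p, n, rng):
    """L1 distance between a multinomial p and its empirical estimate from n samples."""
    draws = rng.choices(range(len(p)), weights=p, k=n)
    counts = Counter(draws)
    p_hat = [counts[i] / n for i in range(len(p))]   # unseen atoms get mass 0
    return sum(abs(ph - pi) for ph, pi in zip(p_hat, p))

def mean_error(p, n, trials=100, seed=0):
    """Average L1 error over independent trials with a fixed seed."""
    rng = random.Random(seed)
    return sum(empirical_l1_error(p, n, rng) for _ in range(trials)) / trials
```

Evaluating `mean_error([0.2] * 5, n)` for `n` in `(10, 100, 1000)` should show the error dropping by roughly a factor of three for each tenfold increase in `n`, which is the behavior the finite-sample bound formalizes.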
6 Experiments
In this section, we empirically evaluate our attack algorithms NLP and DDPG in Section 4 against several baselines on synthetic and real data. As an empirical measure of attack efficacy, we compare the attack methods by their empirical discounted cumulative cost $\tilde J(t) := \sum_{\tau=0}^{t} \gamma^\tau g(\theta_\tau, z_\tau, a_\tau)$, where the attack actions $a_\tau$ are chosen by each method. Note that $\tilde J(t)$ is computed on the actual instantiation of the environment data stream $z_{0:t}$. Better attack methods tend to have smaller $\tilde J(t)$. We compare our algorithms with the following Baseline Attackers:
Null Attack: This is the baseline without attack, namely $a_t = z_t$ for all $t$. We expect the null attack to form an upper bound on any attack method’s empirical discounted cumulative cost.
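Greedy Attack: The greedy strategy is applied widely as a practical heuristic in solving sequential decision problems [26, 23]. For our problem, at time step $t$ the greedy attacker uses a time-invariant attack policy which minimizes the current step’s running cost, scoring each candidate action through the model it produces. Specifically, the greedy attack policy can be written as $\phi^{G}(\theta_t, z_t) := \arg\min_{a} \lambda\, g_{\mathrm{nef}}(f(\theta_t, a)) + g_{\mathrm{per}}(z_t, a)$. If we instantiate $g_{\mathrm{nef}}(\theta) = \|\theta - \theta^\dagger\|_2^2$ for a target model $\theta^\dagger$ and $g_{\mathrm{per}} = 0$, we exactly recover the algorithm in [26]. Both null attack and greedy attack can be viewed as time-invariant policies that do not utilize the information in the data observed so far.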
Clairvoyant Attack: A clairvoyant attacker is an idealized attacker who knows the time horizon $T$ and the whole data sequence $z_{0:T-1}$ upfront. In most realistic online data poisoning settings an attacker only knows $z_{0:t}$ at time $t$. Therefore, the clairvoyant attacker has strictly more information, and we expect it to form a lower bound on realistic attack methods in terms of the empirical discounted cumulative cost. The clairvoyant attacker solves a finite time-horizon optimal control problem, equivalent to the formulation in [37] but without terminal cost: $\min_{a_{0:T-1}} \sum_{t=0}^{T-1} \gamma^t g(\theta_t, z_t, a_t)$ subject to $\theta_0$ given, $z_{0:T-1}$ given (clairvoyant), and $\theta_{t+1} = f(\theta_t, a_t)$ for $t = 0, \ldots, T-1$.
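To illustrate how the null and greedy baselines differ, consider a toy one-dimensional victim with update $\theta' = \theta + \eta(a - \theta)$ and the one-step cost $\lambda(\theta' - \theta^\dagger)^2 + (a - z)^2$. Under these assumptions (which are ours, chosen so that the greedy minimizer has a closed form), setting the derivative with respect to $a$ to zero gives $a = \left(z + \lambda\eta(\theta^\dagger - (1-\eta)\theta)\right) / (1 + \lambda\eta^2)$. The function names below are hypothetical.

```python
def f(theta, a, eta=0.1):
    """Assumed victim update: gradient step on 0.5 * (theta - a)**2."""
    return theta + eta * (a - theta)

def one_step_cost(theta, z, a, target, lam, eta=0.1):
    """Cost the greedy attacker minimizes: next-model error plus perturbation."""
    return lam * (f(theta, a, eta) - target) ** 2 + (a - z) ** 2

def null_attack(theta, z):
    """Baseline without attack: pass the clean point through."""
    return z

def greedy_attack(theta, z, target=1.0, lam=10.0, eta=0.1):
    """Closed-form minimizer of one_step_cost over a for this toy learner."""
    return (z + lam * eta * (target - (1 - eta) * theta)) / (1 + lam * eta ** 2)
```

Because the greedy action only accounts for the immediate step, it perturbs each point mildly; a planner that looks several steps ahead can justify larger early perturbations, which is the behavior discussed in the experiments.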
6.1 Poisoning Task Specification
To specify a poisoning task is to define the victim learner in (1) and the attacker’s running cost in (2). We evaluate all attacks on two types of victim learners: online logistic regression, a supervised learning algorithm, and online soft k-means clustering, an unsupervised learning algorithm.
Online logistic regression: Online logistic regression performs a binary classification task. The incoming data takes the form of $z_t = (x_t, y_t)$, where $x_t$ is the feature vector and $y_t$ is the binary label. In the experiments, we focus on attacking the feature part of the data, as is done in a number of prior works [29, 37, 21]. The learner’s update rule is one step of gradient descent on the negative log-likelihood with step size $\eta$: The attacker wants to force the victim learner to stay close to a target parameter $\theta^\dagger$, i.e. this is a targeted attack. The attacker’s cost function is a weighted sum of two terms: the nefarious cost is the negative cosine similarity between the victim’s parameter and the target parameter, and the perturbation cost is the distance between the perturbed feature vector and the clean one, i.e. . Recall .
Online soft k-means: Online soft k-means performs a k-means clustering task. The incoming data contains only the feature vector, i.e. $z_t = x_t$. Its only difference from traditional k-means is that instead of updating only the centroid closest to the current data point, it updates all the centroids, with the updates weighted by their squared distances to the current data point using the softmax function [6]. Specifically, the learner’s update rule is one step of soft k-means update with step size $\eta$ on all centroids, i.e. , , where . Recall . Similar to online logistic regression, we consider a targeted attack objective. The attacker wants to force the learned centroids to each stay close to the corresponding target centroid . The attacker’s cost function is a weighted sum of two terms: the nefarious cost function is the sum of the squared distances between each of the victim’s centroids and the corresponding target centroid, and the perturbation cost is the distance between the perturbed feature vector and the clean one, i.e.
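The two victim learners can be sketched as single update steps. The exact update formulas are not reproduced here, so the code below uses common textbook forms as assumptions: labels in $\{-1, +1\}$ with the logistic loss $\log(1 + e^{-y\theta^\top x})$ for logistic regression, and responsibilities given by a softmax over negative squared distances for soft k-means. The function names are hypothetical.

```python
import math

def logistic_step(theta, x, y, eta):
    """One gradient step on log(1 + exp(-y * theta.x)), with y in {-1, +1} (assumed)."""
    margin = y * sum(t * xi for t, xi in zip(theta, x))
    scale = eta * y / (1.0 + math.exp(min(margin, 700.0)))  # clamp to avoid overflow
    return [t + scale * xi for t, xi in zip(theta, x)]

def soft_kmeans_step(centroids, x, eta):
    """Move every centroid toward x, weighted by a softmax of negative squared distance."""
    d2 = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centroids]
    m = min(d2)
    w = [math.exp(-(d - m)) for d in d2]   # shift by the minimum for numerical stability
    total = sum(w)
    return [[ci + eta * (wj / total) * (xi - ci) for xi, ci in zip(x, c)]
            for c, wj in zip(centroids, w)]

def neg_cosine(theta, target):
    """Nefarious cost for the targeted logistic regression attack: negative cosine similarity."""
    dot = sum(a * b for a, b in zip(theta, target))
    norm = math.sqrt(sum(a * a for a in theta)) * math.sqrt(sum(b * b for b in target))
    return -dot / norm
```

In an attack, the attacker would pass the perturbed point (the perturbed features together with the original label, for logistic regression) to these functions in place of the clean one.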
6.2 Synthetic Data Experiments
We first show a synthetic data experiment where the attack policy can be visualized. The environment is a mixture of two 1D Gaussians: with and . The victim learner is online soft k-means with and initial parameter . The attack target is and , namely the opposite of how the victim’s parameters should move. We set the learning rate , cost regularizer , discounting factor , evaluation length and lookahead horizon for MPC . For attack methods that require solving a nonlinear program, including GREEDY, NLP and Clairvoyant, we use the JuMP modeling language [15] and the IPOPT interior-point solver [36]. Following the above specification, we run each attack method on the same data stream and compare their behavior.
Results: Figure 2 shows the empirical discounted cumulative cost as the attacks go on. On this toy example, the null attack baseline achieves at . The greedy attacker is only slightly more effective at . NLP and DDPG (whose curve largely overlaps with, and is hidden under, that of NLP) achieve and , respectively, almost matching Clairvoyant’s . As expected, the null and clairvoyant attacks form upper and lower bounds on .
Figure 2b-f shows the victim’s trajectory as attacks go on. Without attack (null), converges to the true parameter and . The greedy attack only perturbs each data point slightly, failing to force toward attack targets. This failure is due to its greedy nature: the immediate cost at each round is indeed minimized, but not enough to move the model parameters close to the target parameters. In contrast, NLP and DDPG (trajectory similar to NLP, not shown) exhibit a different strategy in the earlier rounds. They inject larger perturbations to the data points and sacrifice larger immediate costs in order to drive the victim’s model parameters quickly towards the target. In later rounds they only need to stabilize the victim’s parameters near the target with smaller per-step cost.
6.3 Real Data Experiments
In the real data experiments, we run each attack method on 10 data sets across two victim learners.
Datasets: We use 5 datasets for online logistic regression: Banknote Authentication (with feature dimension ), Breast Cancer (), Cardiotocography (), Sonar (), and MNIST 1 vs. 7 (), and 5 datasets for online k-means clustering: User Knowledge (), Breast Cancer (), Seeds (), posture (), MNIST 1 vs. 7 (). All datasets except for MNIST can be found in the UCI Machine Learning Repository [14]. Note that two datasets, Breast Cancer and MNIST, are shared across both tasks.
Preprocessing: To reduce the running time, for datasets with dimensionality , we reduce the dimension to via PCA projection. Then, all datasets are normalized so that each feature has mean 0 and variance 1. Each dataset is then turned into a data stream by random sampling. Specifically, each training data point is sampled uniformly from the dataset with replacement.
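The normalization and stream-construction steps can be sketched as below (the PCA projection is omitted). This is a minimal illustration using only the standard library; the names `standardize` and `data_stream` are hypothetical, and the guard for constant features is our own assumption rather than something stated above.

```python
import random
import statistics

def standardize(rows):
    """Rescale every feature (column) to mean 0 and variance 1."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]   # assumed guard: constant feature
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def data_stream(rows, seed=0):
    """Endless stream: each item is drawn uniformly from the dataset with replacement."""
    rng = random.Random(seed)
    while True:
        yield rng.choice(rows)
```

A stream built this way has a multinomial distribution over the dataset's points, which is the special case covered by the finite-sample analysis in Section 5.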
Experiment Setup: In order to demonstrate the general applicability of our methods, we draw both the victim’s initial model and the attacker’s target at random from a standard Gaussian distribution of the appropriate dimension, for both online logistic regression and online k-means in all 10 datasets. Across all datasets, we use the following hyperparameters: . For online logistic regression while for online k-means .
For the DDPG attacker we only perform policy learning at the beginning to obtain ; the learned policy is then fixed and used to perform all the attack actions in later rounds. In order to give it a fair chance, we give it a pre-attack dataset of size . For the sake of fair comparison, we give the same pre-attack dataset to NLP as well. For the NLP attack we set the lookahead horizon such that the total runtime to perform attacks does not exceed the DDPG training time, which is 24 hours on an Intel Core i7-6800K CPU at 3.40 GHz with 12 cores. This results in for online logistic regression on CTG, Sonar and MNIST, and in all other experiments.
Results: The experiment results are shown in Figure 3. Several consistent patterns emerge from the experiments. The clairvoyant attacker consistently achieves the lowest cumulative cost across all 10 datasets. This is not surprising, as the clairvoyant attacker has extra information about the future. The NLP attack achieves clairvoyant-matching performance on all 7 datasets in which it is given a large enough lookahead horizon, i.e. . DDPG follows closely behind the MPC-based NLP attack and Clairvoyant on most of the datasets, indicating that the pretrained policy can achieve reasonable attack performance in most cases. On the 3 datasets where for NLP, DDPG outperforms the short-sighted NLP, indicating that when computational resources are limited, DDPG has an advantage by avoiding the iterative retraining that NLP cannot bypass. GREEDY does not do well on any of the 10 datasets, achieving only a slightly lower cost than the NULL baseline. This matches our observations in the synthetic experiment.
Each of the attack methods also exhibits strategic behavioral patterns similar to what we observe in the synthetic experiment. In particular, the optimal-control-based methods NLP and DDPG sacrifice larger immediate costs in earlier rounds in order to achieve smaller attack costs in later rounds. This is especially obvious in the online logistic regression plots 3b-e, where the cumulative costs rise dramatically in the first 50 rounds, becoming higher than the cost of NULL and GREEDY around that time. This early sacrifice pays off after where the cumulative cost starts to fall much faster. In 3c-e, however, the short-sighted NLP (with ) fails to fully pick up this long-term strategy, and exhibits a behavior close to an interpolation of greedy and optimal. This is not surprising, as NLP with a one-step horizon is indeed equivalent to the GREEDY method. Thus, there is a spectrum of methods between GREEDY and NLP that can achieve various levels of performance with different computational costs.
7 Conclusion
In this paper, we formulated online poisoning attacks as a stochastic optimal control problem. We proposed two attack algorithms, a model-based planning approach and a model-free reinforcement learning approach, and showed that both are able to achieve near-clairvoyant levels of performance. We also provided analysis to characterize the optimality gap between a realistic attacker with no knowledge of $P$ and an idealized attacker that knows $P$ in advance.
Acknowledgments
This work is supported in part by NSF 1750162, 1837132, 1545481, 1704117, 1623605, 1561512, the MADLab AF Center of Excellence FA95501810166, and the University of Wisconsin.
References
 [1] Naman Agarwal, Brian Bullins, Elad Hazan, Sham M Kakade, and Karan Singh. Online control with adversarial disturbances. arXiv preprint arXiv:1902.08721, 2019.
 [2] Scott Alfeld, Xiaojin Zhu, and Paul Barford. Data poisoning attacks against autoregressive models. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
 [3] Karl J Åström. Introduction to stochastic control theory. Courier Corporation, 2012.
 [4] Karl J Åström and Björn Wittenmark. Adaptive control. Courier Corporation, 2013.
 [5] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pages 263–272, 2017.
 [6] James C Bezdek, Robert Ehrlich, and William Full. FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences, 10(2-3):191–203, 1984.
 [7] Battista Biggio, Blaine Nelson, and Pavel Laskov. Poisoning attacks against support vector machines. arXiv preprint arXiv:1206.6389, 2012.
 [8] Francesco Borrelli, Alberto Bemporad, and Manfred Morari. Predictive control for linear and hybrid systems. Cambridge University Press, 2017.
 [9] Cody Burkard and Brent Lagesse. Analysis of causative attacks against SVMs learning from data streams. In Proceedings of the 3rd ACM on International Workshop on Security And Privacy Analytics, pages 31–36. ACM, 2017.
 [10] Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526, 2017.
 [11] Yiding Chen and Xiaojin Zhu. Optimal adversarial attack on autoregressive models. arXiv preprint arXiv:1902.00202, 2019.
 [12] Alon Cohen, Avinatan Hassidim, Tomer Koren, Nevena Lazic, Yishay Mansour, and Kunal Talwar. Online linear quadratic control. arXiv preprint arXiv:1806.07104, 2018.
 [13] Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. On the sample complexity of the linear quadratic regulator. arXiv preprint arXiv:1710.01688, 2017.
 [14] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
 [15] Iain Dunning, Joey Huchette, and Miles Lubin. JuMP: A modeling language for mathematical optimization. SIAM Review, 59(2):295–320, 2017.
 [16] Claude-Nicolas Fiechter. PAC adaptive control of linear systems. In Annual Workshop on Computational Learning Theory: Proceedings of the tenth annual conference on Computational learning theory, volume 6, pages 72–80. Citeseer, 1997.
 [17] Lasse Holmström and Jussi Klemelä. Asymptotic bounds for the expected L1 error of a multivariate kernel density estimator. Journal of Multivariate Analysis, 42(2):245–266, 1992.
 [18] Anthony D Joseph, Blaine Nelson, Benjamin IP Rubinstein, and JD Tygar. Adversarial Machine Learning. Cambridge University Press, 2018.
 [19] Kwang-Sung Jun, Lihong Li, Yuzhe Ma, and Jerry Zhu. Adversarial attacks on stochastic bandits. In Advances in Neural Information Processing Systems, pages 3644–3653, 2018.
 [20] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.
 [21] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1885–1894. JMLR.org, 2017.
 [22] Panganamala Ramana Kumar and Pravin Varaiya. Stochastic systems: Estimation, identification, and adaptive control, volume 75. SIAM, 2015.
 [23] Laurent Lessard, Xuezhou Zhang, and Xiaojin Zhu. An optimal control approach to sequential machine teaching. arXiv preprint arXiv:1810.06175, 2018.
 [24] Bo Li, Yining Wang, Aarti Singh, and Yevgeniy Vorobeychik. Data poisoning attacks on factorization-based collaborative filtering. In Advances in Neural Information Processing Systems, pages 1885–1893, 2016.
 [25] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 [26] Weiyang Liu, Bo Dai, Ahmad Humayun, Charlene Tay, Chen Yu, Linda B Smith, James M Rehg, and Le Song. Iterative machine teaching. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 2149–2158. JMLR.org, 2017.
 [27] Yuzhe Ma, Kwang-Sung Jun, Lihong Li, and Xiaojin Zhu. Data poisoning attacks in contextual bandits. In International Conference on Decision and Game Theory for Security, pages 186–204. Springer, 2018.
 [28] David Q Mayne, James B Rawlings, Christopher V Rao, and Pierre OM Scokaert. Constrained model predictive control: Stability and optimality. Automatica, 36(6):789–814, 2000.
 [29] Shike Mei and Xiaojin Zhu. Using machine teaching to identify optimal training-set attacks on machine learners. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
 [30] Luis Muñoz-González, Battista Biggio, Ambra Demontis, Andrea Paudice, Vasin Wongrassamee, Emil C Lupu, and Fabio Roli. Towards poisoning of deep learning algorithms with back-gradient optimization. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 27–38. ACM, 2017.
 [31] Andrew Newell, Rahul Potharaju, Luojie Xiang, and Cristina Nita-Rotaru. On the practicality of integrity attacks on document-level sentiment analysis. In Proceedings of the 2014 Workshop on Artificial Intelligent and Security Workshop, pages 83–93. ACM, 2014.
 [32] Sosale Shankara Sastry and Alberto Isidori. Adaptive control of linearizable systems. IEEE Transactions on Automatic Control, 34(11):1123–1131, 1989.
 [33] Ayon Sen, Scott Alfeld, Xuezhou Zhang, Ara Vartanian, Yuzhe Ma, and Xiaojin Zhu. Training set camouflage. In International Conference on Decision and Game Theory for Security, pages 59–79. Springer, 2018.
 [34] Sigurd Skogestad and Ian Postlethwaite. Multivariable feedback control: analysis and design, volume 2. Wiley New York, 2007.
 [35] Yevgeniy Vorobeychik and Murat Kantarcioglu. Adversarial machine learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 12(3):1–169, 2018.
 [36] Andreas Wächter and Lorenz T Biegler. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical Programming, 106(1):25–57, 2006.
 [37] Yizhen Wang and Kamalika Chaudhuri. Data poisoning attacks against online learning. arXiv preprint arXiv:1808.08994, 2018.
 [38] Huang Xiao, Battista Biggio, Blaine Nelson, Han Xiao, Claudia Eckert, and Fabio Roli. Support vector machines under adversarial label contamination. Neurocomputing, 160:53–62, 2015.
 [39] Kemin Zhou, John Comstock Doyle, Keith Glover, et al. Robust and optimal control, volume 40. Prentice hall New Jersey, 1996.
 [40] Xiaojin Zhu. An optimal control view of adversarial machine learning. arXiv preprint arXiv:1811.04422, 2018.
Appendix A
These proofs follow the technique in Nan Jiang’s Statistical Reinforcement Learning lecture notes (https://nanjiang.cs.illinois.edu/cs598/).
A.1 Proof of Proposition 5.1
Proof.
For any policy and state , we have
Since this holds for all , we can also take the supremum on the LHS, which yields
(7) 
Now, for any ,
(8)  
(9)  
(10)  
(11) 
This completes the proof.
A.2 Proof of Theorem 5.2
Proof.
We first want to establish an $\ell_1$ concentration bound for the multinomial distribution. Observe that for any vector ,
(12) 
The plan is to prove concentration for each first, and then union bound over all to obtain the error bound. Observe that is the average of i.i.d. random variables with range . Then, by Hoeffding’s Inequality, with probability at least , we have
(13) 
Then, we can apply the union bound across all and get that, with probability at least ,
(14) 
Substituting this quantity into Proposition 5.1 yields the desired result.