# Adversarial Attacks on Linear Contextual Bandits

## Abstract

Contextual bandit algorithms are applied in a wide range of domains, from advertising to recommender systems, from clinical trials to education. In many of these domains, malicious agents may have incentives to attack the bandit algorithm to induce it to perform a desired behavior. For instance, an unscrupulous ad publisher may try to increase their own revenue at the expense of the advertisers; a seller may want to increase the exposure of their products, or thwart a competitor’s advertising campaign. In this paper, we study several attack scenarios and show that a malicious agent can force a linear contextual bandit algorithm to pull any desired arm $T - o(T)$ times over a horizon of $T$ steps, while applying adversarial modifications to either rewards or contexts that only grow logarithmically as $O(\log T)$. We also investigate the case when a malicious agent is interested in affecting the behavior of the bandit algorithm in a single context (e.g., a specific user). We first provide sufficient conditions for the feasibility of the attack and we then propose an efficient algorithm to perform the attack. We validate our theoretical results on experiments performed on both synthetic and real-world datasets.

## 1 Introduction

Recommender systems are at the heart of the business model of many industries like e-commerce or video streaming Davidson et al. (2010); Gomez-Uribe and Hunt (2015). The two most common approaches for this task are based either on matrix factorization Park et al. (2017) or bandit algorithms Li et al. (2010), which both rely on an unaltered feedback loop between the recommender system and the user. In recent years, a fair amount of work has been dedicated to understanding how targeted perturbations in the feedback loop can fool a recommender system into recommending low-quality items.

Following the line of research on adversarial attacks in deep learning Goodfellow et al. (2014) and supervised learning Biggio et al. (2012); Jagielski et al. (2018); Li et al. (2016); Liu et al. (2017), attacks on recommender systems have focused on filtering-based algorithms Christakopoulou and Banerjee (2019); Mehta and Nejdl (2008) and offline contextual bandits Ma et al. (2018). The question of adversarial attacks on online bandit algorithms has only recently started to be studied Jun et al. (2018); Liu and Shroff (2019); Immorlica et al. (2018), and solely in the multi-armed stochastic setting. Although the idea of online adversarial bandit algorithms is not new (see the Exp3 algorithm in Auer et al. (2002)), the focus is different from what we are considering here. Indeed, algorithms like Exp3 or Exp4 Lattimore and Szepesvári (2018) are designed to find the optimal actions in hindsight and to adapt to any stream of rewards without any further assumptions.

The opposition between the adversarial bandit setting and the stochastic setting has sparked interest in studying a middle ground. In Bubeck and Slivkins (2012), the learning algorithm has no knowledge of the type of feedback it receives. In Li et al. (2019); Gupta et al. (2019); Lykouris et al. (2019); Kapoor et al. (2019), the rewards are assumed to be stochastic but can be perturbed by some attacks, and the authors focus on constructing algorithms able to find the optimal actions even in the presence of some non-random perturbations. However, those perturbations are bounded and agnostic to the choices of the learning algorithm. There are also some efforts in the broader Deep Reinforcement Learning (DRL) literature, focusing on modifying the observations of different states to fool a DRL system at inference time Hussenot et al. (2019); Sun et al. (2020).

#### Contribution.

In this work, we first follow the research direction opened by Jun et al. (2018), where the attacker has the objective of fooling a learning algorithm into taking a specific action as much as possible. Consider a news recommendation problem as described in Li et al. (2010): a bandit algorithm has to choose between $K$ articles to recommend to a user, based on some information about them, termed context. We assume that an attacker sits between the user and the website and can choose the reward (i.e., click or not) for the recommended article. Their goal is to fool the bandit algorithm into recommending a particular target article to most users. We extend the work in Jun et al. (2018); Liu and Shroff (2019) to the contextual linear bandit setting, showing how to perturb rewards for both stochastic and adversarial algorithms. For the first time, we consider and analyze the setting in which the attacker can only modify the context associated with the current user (the reward is not altered). The goal of the attacker is still to fool the bandit algorithm into pulling the target arm for most users while minimizing the total norm of their attacks. We show that it is possible to fool the widely known LinUCB algorithm Abbasi-Yadkori et al. (2011); Lattimore and Szepesvári (2018) with this new type of attack on the context. Finally, we present a harder setting for the attacker, where the latter can only modify the context associated with a specific user. For example, this situation may occur when a malicious agent has infected some computers with a Remote Access Trojan (RAT). The attacker can thus modify the navigation history of a specific user and, as a consequence, the information seen by the online recommender system. We show how the attacker can attack the two very common bandit algorithms LinUCB and LinTS Agrawal and Goyal (2013) and, in certain cases, have them pull a target arm most of the time when a specific user visits a website.

## 2 Preliminaries

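We consider the standard contextual linear bandit setting with $K$ arms. At each time $t$, the agent observes a context $x_t \in \mathbb{R}^d$, selects an action $a_t \in \llbracket 1, K \rrbracket$ and observes a reward $r_{t,a_t} = \langle \theta_{a_t}, x_t \rangle + \eta_t$, where for each arm $a$, $\theta_a \in \mathbb{R}^d$ is a feature vector and $\eta_t$ is a conditionally independent zero-mean, $\sigma^2$-subgaussian noise. We also make the following assumptions on the contexts and parameter vectors.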

###### Assumption 1.

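There exist $L > 0$ and $\mathcal{D} \subset \mathbb{R}^d$ such that, for all $t$, $x_t \in \mathcal{D}$ and:

$$\forall x \in \mathcal{D},\ \forall a \in \llbracket 1, K \rrbracket, \quad \|x\|_2 \le L \ \text{ and } \ \langle \theta_a, x \rangle \in (0, 1]$$

In addition, we assume that there exists $S > 0$ such that $\|\theta_a\|_2 \le S$ for all arms $a$.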

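The agent is interested in minimizing the cumulative regret after $T$ steps:

$$R_T = \sum_{t=1}^{T} \left( \langle \theta_{a_t^\star}, x_t \rangle - \langle \theta_{a_t}, x_t \rangle \right)$$

where $a_t^\star := \arg\max_{a} \langle \theta_a, x_t \rangle$. A bandit learning algorithm is said to be no-regret when it satisfies $R_T = o(T)$, i.e., the average expected reward it receives converges to the optimal one.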
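As an illustration of the regret definition, the following minimal sketch computes the cumulative regret of a fixed sequence of actions when the true parameter vectors are known. The function names and the toy numbers are made up for this example and do not come from the paper.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def cumulative_regret(thetas, contexts, actions):
    """Sum over rounds of (best expected reward - expected reward of the played arm)."""
    total = 0.0
    for x, a in zip(contexts, actions):
        means = [dot(theta, x) for theta in thetas]  # expected reward of each arm
        total += max(means) - means[a]
    return total


# Toy problem (hypothetical numbers): 2 arms, 3 rounds.
thetas = [[0.9, 0.1], [0.2, 0.8]]
contexts = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
actions = [0, 0, 1]
regret = cumulative_regret(thetas, contexts, actions)
print(regret)  # only the second round is suboptimal, so the regret is about 0.7
```

A no-regret algorithm is one for which this quantity, divided by the number of rounds, vanishes as the horizon grows.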

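Classical bandit algorithms (e.g., LinUCB (Alg. 3) and LinTS (Alg. 4)) compute an estimate of the unknown parameters using past observations. Formally, for each arm $a$ we define $S_{t,a}$ as the set of times up to $t-1$ (included) where the agent played arm $a$. Then, the estimated parameters are obtained through regularized least-squares regression as $\hat{\theta}_{t,a} = (X_{t,a} X_{t,a}^\top + \lambda I)^{-1} X_{t,a} Y_{t,a}$, where $\lambda > 0$, $X_{t,a} = (x_i)_{i \in S_{t,a}} \in \mathbb{R}^{d \times |S_{t,a}|}$ and $Y_{t,a} = (r_{i,a_i})_{i \in S_{t,a}} \in \mathbb{R}^{|S_{t,a}|}$. Denote by $V_{t,a} = \lambda I + X_{t,a} X_{t,a}^\top$ the design matrix of the regularized least-squares problem and by $\|x\|_V = \sqrt{x^\top V x}$ the weighted norm w.r.t. any positive matrix $V$. We define the confidence set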

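$$\mathcal{C}_{t,a} = \left\{ \theta \in \mathbb{R}^d : \left\| \theta - \hat{\theta}_{t,a} \right\|_{V_{t,a}} \le \beta_{t,a} \right\} \tag{1}$$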

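where the radius $\beta_{t,a}$ is chosen so as to guarantee that $\theta_a \in \mathcal{C}_{t,a}$ for all $t$ with probability at least $1 - \delta$. This uncertainty is used to balance the exploration-exploitation trade-off either through optimism (e.g., LinUCB) or through randomization (e.g., LinTS).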
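The estimate and the optimistic index described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the class name `ArmEstimator` is hypothetical, the confidence radius `beta` is passed in by the caller rather than computed, and the inverse design matrix is maintained with the Sherman-Morrison formula instead of being recomputed at every step.

```python
import math


class ArmEstimator:
    """Regularized least squares for a single arm, tracking V^{-1} incrementally."""

    def __init__(self, d, lam=1.0):
        self.d = d
        # V starts at lam * I, so V^{-1} = I / lam.
        self.v_inv = [[1.0 / lam if i == j else 0.0 for j in range(d)] for i in range(d)]
        self.b = [0.0] * d  # running sum of r_s * x_s over the rounds where this arm was played

    def _v_inv_times(self, x):
        return [sum(self.v_inv[i][j] * x[j] for j in range(self.d)) for i in range(self.d)]

    def update(self, x, r):
        """Add one (context, reward) pair: V <- V + x x^T via Sherman-Morrison."""
        vx = self._v_inv_times(x)
        denom = 1.0 + sum(xi * vi for xi, vi in zip(x, vx))
        for i in range(self.d):
            for j in range(self.d):
                self.v_inv[i][j] -= vx[i] * vx[j] / denom
            self.b[i] += r * x[i]

    def theta_hat(self):
        """Regularized least-squares estimate V^{-1} b."""
        return self._v_inv_times(self.b)

    def width(self, x):
        """Weighted norm ||x||_{V^{-1}}, the exploration bonus before scaling."""
        vx = self._v_inv_times(x)
        return math.sqrt(sum(xi * vi for xi, vi in zip(x, vx)))

    def ucb(self, x, beta):
        """Largest value of <theta, x> over the confidence ellipsoid of radius beta."""
        theta = self.theta_hat()
        return sum(t * xi for t, xi in zip(theta, x)) + beta * self.width(x)


est = ArmEstimator(d=2, lam=1.0)
est.update([1.0, 0.0], 2.0)
print(est.theta_hat(), est.ucb([0.0, 1.0], beta=0.5))
```

An optimistic learner would keep one such estimator per arm and play the arm with the largest `ucb` value for the current context.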

## 3 Online Adversarial Attacks on Rewards

The ultimate goal of a malicious agent is to force the bandit algorithm to perform a desired behavior. An attacker may simply want to induce the bandit algorithm to perform poorly (ruining the users’ experience) or to force the algorithm to suggest a specific arm. The latter case is particularly interesting in advertising, where a seller may want to increase the exposure of its product at the expense of the competitors. Note that the users’ experience is also compromised by the latter attack, since the suggestions they receive will not be tailored to their needs. Similarly to (Liu and Shroff, 2019; Jun et al., 2018), we focus on the latter objective, i.e., to fool the bandit algorithm into pulling a target arm $a^\dagger$ for $T - o(T)$ time steps (independently of the user). A way to obtain this behavior is to dynamically modify the reward in order to make the bandit algorithm believe that $a^\dagger$ is optimal. Clearly, the attacker has to pay a price in order to modify the perceived bandit problem and fool the algorithm. If there is no restriction on when and how the attacker can alter the reward, the attacker can easily fool the algorithm. However, this setting is not interesting since the attacker may pay a cost higher than the loss suffered by the attacked algorithm. An attack strategy is thus considered successful when the total cost of the attack is sublinear in $T$.

#### Setting.

We assume that the attacker has the same knowledge as the bandit algorithm about the problem (i.e., knows $\sigma$ and $L$). The attacker is assumed to be able to observe the context $x_t$ and the arm $a_t$ pulled by the algorithm, and can modify the reward received by the algorithm. When the attacker modifies the reward $r_{t,a_t}$ into $\tilde{r}_{t,a_t}$, the instantaneous cost of the attack is defined as $c_t := |r_{t,a_t} - \tilde{r}_{t,a_t}|$. The goal of the attacker is to fool the algorithm such that arm $a^\dagger$ is pulled $T - o(T)$ times and $\sum_{t=1}^{T} c_t = o(T)$.

#### Attack idea.

We leverage the idea presented in Liu and Shroff (2019) and Jun et al. (2018), where the attacker lowers the reward of the arms $a \neq a^\dagger$ so that the algorithm learns that the target arm is optimal for every context. Since the algorithm is assumed to be no-regret, the attacker only needs to modify the rewards $o(T)$ times to achieve this goal.

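Lowering the rewards has the effect of shifting the vectors $(\theta_a)_{a \neq a^\dagger}$ to new vectors $(\theta'_a)_{a \neq a^\dagger}$ such that for all arms $a \neq a^\dagger$ and all contexts $x \in \mathcal{D}$, $\langle \theta'_a, x \rangle \le \langle \theta_{a^\dagger}, x \rangle$.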

Since rewards are assumed to be bounded (see Asm. 1), this objective can be achieved by simply forcing the reward of non-target arms to the minimum value. Contextual ACE (see Alg. 1) implements a soft version of this idea by leveraging the knowledge of the reward distribution. At each round $t$, Contextual ACE modifies the reward perceived by the algorithm as follows:

$$\tilde{r}^1_{t,a_t} = \begin{cases} \eta'_t & \text{if } a_t \neq a^\dagger \\ r_{t,a^\dagger} & \text{otherwise} \end{cases} \tag{2}$$

where $\eta'_t$ is a $\sigma^2$-subgaussian random variable generated by the attacker independently of all other random variables. By doing this, Contextual ACE transforms the original problem into a stationary bandit problem in which $a^\dagger$ is optimal for all contexts (having mean $\langle \theta_{a^\dagger}, x \rangle$) and all other arms have an expected reward of $0$. Although this attack may seem expensive, Proposition 1 shows that the cumulative cost of the attacks is sublinear.
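The reward modification of Eq. (2) can be sketched as follows. This is a minimal illustration, assuming that the attacker's noise is zero-mean Gaussian with standard deviation `sigma` (one possible subgaussian choice) and that the instantaneous cost is the absolute difference between the true and the modified reward; the function name and default values are hypothetical.

```python
import random


def contextual_ace_reward(reward, arm, target_arm, sigma=0.1, rng=random):
    """Return (reward shown to the learner, instantaneous attack cost)."""
    if arm == target_arm:
        return reward, 0.0  # target arm: the true reward is passed through untouched
    fake = rng.gauss(0.0, sigma)  # zero-mean Gaussian noise, which is sigma^2-subgaussian
    return fake, abs(reward - fake)


rng = random.Random(0)
for arm, true_reward in [(0, 0.7), (1, 0.4), (2, 0.9)]:
    shown, cost = contextual_ace_reward(true_reward, arm, target_arm=2, rng=rng)
    print(arm, shown, cost)
```

Because the attacker only pays when a non-target arm is pulled, the total cost is driven by the number of such pulls, which is what the sublinear bound relies on.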

###### Proposition 1.

For any , when using the Contextual ACE algorithm (Alg. 1) with perturbed rewards $\tilde{r}^1$, with probability at least , the algorithm pulls arm $a^\dagger$ for $T - o(T)$ time steps and the total cost of attacks is $o(T)$.

The proof of this proposition is provided in App. A.1. While Prop. 1 holds for any no-regret algorithm, we can provide a more precise bound on the total cost by inspecting the algorithm. For example, we can show (see App. D) that, with probability at least , the number of times LinUCB (Abbasi-Yadkori et al., 2011) pulls arms different from $a^\dagger$ is at most:

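$$\sum_{j \neq a^\dagger} N_j(T) \le \frac{64 K \sigma^2 \lambda S^2}{\min_{x \in \mathcal{D}} \langle \theta_{a^\dagger}, x \rangle^2} \left( d \log\left( \frac{\lambda + \frac{T L^2}{d}}{\delta^2} \right) \right)^2$$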

This directly translates into a bound on the total cost.

#### Comparison with ACE.

In the stochastic setting, the ACE algorithm (Liu and Shroff, 2019) leverages a bound on the expected reward of each arm in order to modify the reward. However, the perturbed reward process seen by the algorithm is non-stationary, and in general there is no guarantee that an algorithm minimizing the regret in a stationary bandit problem keeps the same performance when the bandit problem is no longer stationary. Nonetheless, transposing the idea of the ACE algorithm to our setting would give an attack of the following form, where at time $t$ the algorithm pulls arm $a_t$ and receives the reward $\tilde{r}^2_{t,a_t}$:

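$$\tilde{r}^2_{t,a_t} = \begin{cases} r_{t,a_t} + \max\left(-1, \min\left(0, C_{t,a_t}\right)\right) & \text{if } a_t \neq a^\dagger \\ r_{t,a^\dagger} & \text{otherwise} \end{cases}$$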

with . Note that is defined as in Eq. 1 using the non-perturbed rewards, i.e., .

#### Constrained Attack.

When the attacker has a constraint on the instantaneous cost of the attack, using the perturbed reward $\tilde{r}^1$ may not be possible, as the cost of the attack at time $t$ does not decrease over time. Using the perturbed reward $\tilde{r}^2$ offers a more flexible type of attack, with more control on the instantaneous cost thanks to the parameter entering the definition of $C_{t,a_t}$. However, even this attack does not work when the maximum cost of an attack is too small.

#### Defense mechanism.

The attack based on the reward $\tilde{r}^1$ is hardly detectable without prior knowledge about the problem. In fact, the reward process associated with $\tilde{r}^1$ is stationary and compatible with the assumption about the true reward (e.g., subgaussian). While having very low rewards is reasonable in advertising, in other problems it makes the attack easily detectable. On the other hand, the fact that $\tilde{r}^2$ is a non-stationary process makes this attack easy to detect. When some data are already available on each arm, the learner can monitor the difference between the average rewards per action computed on new and old data.
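The monitoring idea in the last sentence can be sketched as a simple per-arm comparison of average rewards. This is only an illustrative heuristic: the function names, the dictionary-based data layout and the fixed `threshold` are assumptions made for this example, and a practical detector would need a statistically justified threshold.

```python
def reward_shift(old_rewards, new_rewards):
    """Per-arm difference between the mean reward on new data and on old data."""
    shifts = {}
    for arm, old in old_rewards.items():
        new = new_rewards.get(arm, [])
        if old and new:  # skip arms without data on both sides
            shifts[arm] = sum(new) / len(new) - sum(old) / len(old)
    return shifts


def flag_suspicious(old_rewards, new_rewards, threshold=0.2):
    """Arms whose average reward moved by more than `threshold` (hypothetical cutoff)."""
    shifts = reward_shift(old_rewards, new_rewards)
    return [arm for arm, shift in shifts.items() if abs(shift) > threshold]


old = {0: [0.5, 0.7], 1: [0.6, 0.6]}
new = {0: [0.6, 0.6], 1: [0.0, 0.1]}
print(flag_suspicious(old, new))  # arm 1 dropped from about 0.6 to about 0.05
```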

###### Remark 1.

It is possible to extend this attack to multiple target arms . Similarly to (2), we can set when .

## 4 Online Adversarial Attacks on Contexts

In this section, we consider an attacker who can alter the context perceived by the algorithm rather than the reward. That is, the attacker is now restricted to changing the type of users presented to the learning algorithm, hence changing its perception of the environment.

#### Setting.

As in Sec. 3, we consider the attacker to have the same knowledge about the problem as the algorithm. The main difference with the previous setting is that the attacker acts before the algorithm. This means that the attacker does not know the arm that would have been chosen by the algorithm when presented with the true context $x_t$. Therefore, the attacker needs knowledge about the way the algorithm behaves. We focus on LinUCB and we assume the attacker knows the parameters of the algorithm. This is known as a white-box setting in the adversarial attacks literature (Goodfellow et al., 2014). The goal of the attacker is unchanged: they aim at forcing the algorithm to pull the target arm $a^\dagger$ for $T - o(T)$ time steps while paying a sublinear total cost. We denote by $\tilde{x}_t$ the context after the attack and by $c_t = \|x_t - \tilde{x}_t\|_2$ the instantaneous cost. Finally, we assume that the attacker knows a positive lower bound of the expected reward of the target arm, as follows.

For all $x \in \mathcal{D}$, $0 < \nu \le \langle \theta_{a^\dagger}, x \rangle$.

#### Attack idea.

The idea behind the attack in this setting is similar to the idea behind the attack in Sec 3: the attacker builds a bandit problem where arm is optimal for all contexts by lowering the perceived value of all other arms. Here, the attacker cannot modify the reward, but they can scale the contexts in order to decrease the predicted rewards in the original context. At time , the attacker receives the context and computes the attack as follows:

1. Thanks to the white-box setting, the attacker computes the arm $a_t$ that the algorithm would pull if presented with context $x_t$.

2. If $a_t \neq a^\dagger$, then the attacker changes the context to $\tilde{x}_t = \alpha x_t$ with . In practice, for every arm we set .

In other words, the attacker performs a dilation of the incoming context every time the algorithm is about to pull an arm other than $a^\dagger$. The fact that the decision rule used by LinUCB is invariant under dilation guarantees that the attacker will not inadvertently lower the perceived rewards for arm $a^\dagger$. Because the rewards are assumed to be linear, presenting a scaled-up context while the algorithm receives the reward associated with the original context skews the reward estimates of LinUCB. The attack protocol is summarized in Alg. 2.
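The two steps above can be sketched as a single function applied to each incoming context. This is a minimal sketch under stated assumptions: the arm that the learner would pull (`predicted_arm`) is assumed to be computed elsewhere by simulating the learner, the scaling factor `alpha` is supplied by the caller, and the cost is taken to be the Euclidean norm of the modification. The function name is hypothetical.

```python
import math


def attack_context(x, predicted_arm, target_arm, alpha):
    """Return (context shown to the learner, instantaneous attack cost)."""
    if predicted_arm == target_arm:
        return list(x), 0.0  # the learner already picks the target arm: no attack
    scaled = [alpha * xi for xi in x]  # dilation of the context
    cost = math.sqrt(sum((si - xi) ** 2 for si, xi in zip(scaled, x)))
    return scaled, cost


shown, cost = attack_context([3.0, 4.0], predicted_arm=0, target_arm=1, alpha=2.0)
print(shown, cost)
```

Since the learner updates its estimate for the pulled arm with the scaled context but the reward of the original context, its estimated rewards for non-target arms shrink over time.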

###### Proposition 2.

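Using the attack described in Alg. 2, for any , with probability at least , the number of times LinUCB does not pull arm $a^\dagger$ before time $T$ is at most:

$$\sum_{j \neq a^\dagger} N_j(T) \le 32 K^2 \left( \frac{\lambda}{\alpha^2} + \sigma^2 d \log\left( \frac{\lambda d + T L^2 \alpha^2}{d \lambda \delta} \right) \right)^3$$

with $N_j(T)$ the number of times arm $j$ has been pulled during the first $T$ steps. The total cost for the attacker is bounded by:

$$\sum_{t=1}^{T} c_t \le \frac{64 K^2}{\nu} \left( \frac{\lambda}{\alpha^2} + \sigma^2 d \log\left( \frac{\lambda d + T L^2 \alpha^2}{d \lambda \delta} \right) \right)^3$$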

The proof of Proposition 2 (see App. A.2) assumes that the attacker can attack at any time step, and that they can know in advance which arm will be pulled by the algorithm in a given context. Thus it is not applicable to random exploration algorithms like LinTS Agrawal and Goyal (2013) and $\epsilon$-greedy. We also observed empirically that randomized algorithms are more robust to attacks (see Sec. 7).

###### Remark 2.

If the attacker wants alg. to pull any arm in a set of target arms , the same type of attack can still be used with such that for all . Then, the context is multiplied by when alg. is going to pull an arm not in .

## 5 Attacks on a Single Context

Previous sections focused on the man-in-the-middle (MITM) attack on either rewards or contexts. The MITM attack allows the attacker to arbitrarily change the information observed by the recommender system at each round. This attack may be difficult to carry out in practice, since the exchange channels are generally protected by authentication and cryptographic systems. In this section, we consider the scenario where the attacker has control over a single user. As an example, consider the case where the device of the user is infected by malware (e.g., a Trojan horse), giving full control of the system to the malicious agent. The attacker can thus modify the context of the specific user (e.g., by altering the cookies) that is perceived by the recommender system. We believe that changes to the context (e.g., cookies) are more subtle and less easily detectable than changes to the reward (e.g., click). Moreover, if the reward is a purchase, it cannot be altered easily by taking control of the user’s device.

Clearly, the impact of the attacker on the overall performance of the recommender system depends on the frequency of the specific user, which is out of the attacker’s control. It may thus be impossible to obtain guarantees on the cumulative regret of the algorithm. For this reason, we mainly focus on the study of the feasibility of the attack.

Formally, the attacker targets a specific user (i.e., the infected user) associated with a context $x^\dagger$. Similarly to Sec. 4, the objective of the attacker is to find the minimal change to the context presented to the recommender system such that the target arm $a^\dagger$ is selected by the algorithm. The algorithm observes a modified context $\tilde{x}$ instead of $x^\dagger$. After selecting an arm $a_t$, it observes the true noisy reward $r_{t,a_t}$. As before, we study the white-box setting where the attacker has access to all the parameters of the algorithm.

### 5.1 Optimistic Algorithms

LinUCB chooses the arm to pull by maximizing an upper-confidence bound on the expected reward. For each arm $a$ and context $x$, the UCB value is given by $\max_{\theta \in \mathcal{C}_{t,a}} \langle x, \theta \rangle = \langle x, \hat{\theta}_{t,a} \rangle + \beta_{t,a} \|x\|_{V_{t,a}^{-1}}$ (see Sec. 2).

The objective of the attacker is to force LinUCB to pull arm $a^\dagger$ once presented with context $x^\dagger$. This means finding a perturbation of context $x^\dagger$ that makes $a^\dagger$ the most optimistic arm. Clearly, we would like to keep the perturbation as small as possible to reduce the cost for the attacker and the probability of being detected. Formally, the attacker needs to solve the following non-convex optimization problem:

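$$\min_{y \in \mathbb{R}^d} \|y\|_2 \quad \text{s.t.} \quad \max_{a \neq a^\dagger} \max_{\theta \in \tilde{\mathcal{C}}_{t,a}} \langle x^\dagger + y, \theta \rangle + \xi \le \max_{\theta \in \tilde{\mathcal{C}}_{t,a^\dagger}} \langle x^\dagger + y, \theta \rangle \tag{3}$$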

where $\xi$ is a parameter of the attacker and $\tilde{\mathcal{C}}_{t,a}$ is the confidence set constructed by LinUCB. We use the notation $\tilde{\mathcal{C}}$ to stress the fact that LinUCB observes only the modified context.

In contrast to Sec. 3 and 4, the attacker may not be able to force the algorithm to pull the desired arm $a^\dagger$. In other words, Problem (3) may not be feasible. However, we are able to characterize the feasibility of (3).

###### Theorem 1.

For any $\xi > 0$, Problem (3) is feasible at time $t$ if and only if:

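$$\exists \theta \in \tilde{\mathcal{C}}_{t,a^\dagger}, \quad \theta \notin \operatorname{Conv}\left( \bigcup_{a \neq a^\dagger} \tilde{\mathcal{C}}_{t,a} \right) \tag{4}$$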

In other words, the condition given by Theorem 1 says that the attack described here can be carried out when there exists a vector $x$ for which the arm $a^\dagger$ is assumed to be optimal according to LinUCB. The condition mainly stems from the fact that a linear function maximized over a convex compact set attains its maximum on the boundary of this set. In our case, this set is the convex hull of the confidence ellipsoids of LinUCB.

Although it is possible to use an optimization algorithm for this particular class of non-convex problems (e.g., DC programming Tuy (1995)), such algorithms are still slow compared to convex solvers. Therefore, we present a simple convex relaxation of the previous problem that still performs well empirically compared to Problem (3). The relaxed problem is the following:

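$$\min_{y \in \mathbb{R}^d} \|y\|_2 \quad \text{s.t.} \quad \max_{a \neq a^\dagger} \max_{\theta \in \mathcal{C}_{t,a}} \langle x^\dagger + y, \theta - \hat{\theta}_{t,a^\dagger} \rangle \le -\xi \tag{5}$$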

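Since the RHS of the constraint in Problem (3) can be written as $\max_{\theta \in \tilde{\mathcal{C}}_{t,a^\dagger}} \langle x^\dagger + y, \theta \rangle$ for any $y$, the relaxation here consists in using $\langle x^\dagger + y, \hat{\theta}_{t,a^\dagger} \rangle$ as a lower bound to this maximum for any $y$.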
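For a candidate perturbation $y$, the constraint of the relaxed Problem (5) can be checked in closed form, because the maximum of a linear function over a confidence ellipsoid equals the estimate's prediction plus the radius times the weighted norm of the context. The sketch below only verifies feasibility of a given $y$; it is not a solver for Problem (5). The function name, the argument layout (per-arm estimates, inverse design matrices and radii) and the toy numbers are assumptions made for this example.

```python
import math


def relaxed_constraint_holds(x_target, y, theta_hats, v_invs, betas, target_arm, margin):
    """Check max over non-target arms of <x, theta_a - theta_target> + beta_a * ||x||_{V_a^{-1}} <= -margin."""
    x = [xi + yi for xi, yi in zip(x_target, y)]
    theta_target = theta_hats[target_arm]
    for arm, (theta, v_inv, beta) in enumerate(zip(theta_hats, v_invs, betas)):
        if arm == target_arm:
            continue
        # Gap between the two point estimates on the perturbed context.
        gap = sum(xj * (tj - sj) for xj, tj, sj in zip(x, theta, theta_target))
        # Weighted norm ||x||_{V^{-1}} of the perturbed context for this arm.
        vx = [sum(row[j] * x[j] for j in range(len(x))) for row in v_inv]
        width = math.sqrt(sum(xj * vj for xj, vj in zip(x, vx)))
        if gap + beta * width > -margin:
            return False
    return True


theta_hats = [[1.0, 0.0], [0.0, 1.0]]
v_invs = [[[0.01, 0.0], [0.0, 0.01]], [[0.01, 0.0], [0.0, 0.01]]]
betas = [1.0, 1.0]
print(relaxed_constraint_holds([1.0, 0.0], [0.0, 0.0], theta_hats, v_invs, betas, 1, 0.5))
print(relaxed_constraint_holds([1.0, 0.0], [-1.0, 1.0], theta_hats, v_invs, betas, 1, 0.5))
```

In the toy example, the unperturbed context favors arm 0, so the first check fails, while moving the context toward the direction preferred by arm 1 satisfies the constraint.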

For the relaxed Problem (5), the same type of reasoning as for Problem (3) gives that Problem (5) is feasible if and only if:

$$\hat{\theta}_{a^\dagger}(t) \notin \operatorname{Conv}\left( \bigcup_{a \neq a^\dagger} \mathcal{C}_{t,a} \right)$$

###### Remark 3.

When a set of target arms is available, the feasibility condition is the same except that the attacker now cares about the union of the confidence ellipsoids for each arm in the set of target arms.

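When condition (4) is not met, the arm $a^\dagger$ cannot be pulled by LinUCB. Indeed, the proof of Theorem 1 shows that the upper-confidence bound of the arm $a^\dagger$ is always dominated by that of another arm for any context. Let us assume that $a^\dagger$ is optimal for some contexts. More formally, there exists a sub-space $V \subset \mathcal{D}$ such that, for all $x \in V$ and all $a \neq a^\dagger$, $\langle \theta_{a^\dagger}, x \rangle > \langle \theta_a, x \rangle$.

We also assume that the distribution of the contexts is such that, for all $t$, $\mathbb{P}(x_t \in V) \ge \mu > 0$. Then, the regret is lower-bounded in expectation by: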

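$$\begin{aligned} \mathbb{E}(R_T) &= \mathbb{E}\left( \sum_{t=1}^{T} \mathbb{1}\{x_t \in V\} \, \langle x_t, \theta_{a^\dagger} - \theta_{a_t} \rangle \right) \\ &\ge \mu \, m(T) \min_{x \in V} \max_{a \neq a^\dagger} \langle \theta_{a^\dagger} - \theta_a, x \rangle \end{aligned}$$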

where is the expected number of times such that condition (4) is not met. LinUCB guarantees that for every . Hence, . This means that, in an unattacked problem, condition (4) is met times. On the other hand, when the algorithm is attacked the regret of LinUCB is not sub-linear as the confidence bound for the target arm is not valid anymore. Hence we cannot provide the same type of guarantees for the attacked problem.

### 5.2 Random Exploration algorithms

The previous subsection focused on LinUCB; however, we can obtain similar guarantees for algorithms with random exploration such as LinTS. In this case, it is not possible to guarantee that a specific arm will be pulled for a given context because of the randomness in the arm selection process. The objective is to guarantee that arm $a^\dagger$ is pulled with probability at least $1 - \delta$.

Similarly to the previous subsection, the problem of the attacker can be written as:

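$$\min_{y \in \mathbb{R}^d} \|y\| \quad \text{s.t.} \quad \mathbb{P}\left( \forall a \neq a^\dagger, \ \langle x^\dagger + y, \tilde{\theta}_a - \tilde{\theta}_{a^\dagger} \rangle \le -\xi \right) \ge 1 - \delta \tag{6}$$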

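where the $\tilde{\theta}_a$ for different arms are independently drawn from a normal distribution with mean $\hat{\theta}_a(t)$ and covariance matrix $\nu^2 \bar{V}_a^{-1}(t)$, where $\nu$ is the scale parameter of LinTS. Solving this problem is not easy and in general not possible. For a given $x$ and arm $a$, the random variable $\langle x, \tilde{\theta}_a \rangle$ is normally distributed with mean $\mu_a(x) = \langle x, \hat{\theta}_a(t) \rangle$ and variance $\sigma_a^2(x) = \nu^2 \|x\|^2_{\bar{V}_a^{-1}(t)}$. We can then write $\langle x, \tilde{\theta}_a \rangle = \mu_a(x) + \sigma_a(x) Z_a$ with $Z_a \sim \mathcal{N}(0, 1)$. For the sake of clarity, we drop the variable $x$ when writing $\mu_a(x)$ and $\sigma_a(x)$. Thus the constraint in Problem (6) becomes: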

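$$\mathbb{E}_{Z_{a^\dagger}}\left( \prod_{a \neq a^\dagger} \Phi\left( \frac{\sigma_{a^\dagger} Z_{a^\dagger} + \mu_{a^\dagger} - \mu_a - \xi}{\sigma_a} \right) \right) \ge 1 - \delta$$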

where $\Phi$ is the cumulative distribution function of a standard Gaussian random variable. Unfortunately, computing exactly the expectation of the last line is an open problem. Following the idea of Liu and Shroff (2019), a possible relaxation of the constraint in Problem (6) is, for every arm $a \neq a^\dagger$:

$$1 - \Phi\left( \frac{\mu_{a^\dagger} - \mu_a - \xi}{\sqrt{\sigma_a^2 + \sigma_{a^\dagger}^2}} \right) \le \frac{\delta}{K - 1}$$

Therefore, the relaxed version of the attack on LinTS is:

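$$\min_{y \in \mathbb{R}^d} \|y\| \quad \text{s.t.} \quad \forall a \neq a^\dagger, \ \langle x^\dagger + y, \hat{\theta}_{a^\dagger}(t) - \hat{\theta}_a(t) \rangle - \xi \ge \nu \, \Phi^{-1}\left( 1 - \frac{\delta}{K - 1} \right) \left\| x^\dagger + y \right\|_{\bar{V}_a^{-1}(t) + \bar{V}_{a^\dagger}^{-1}(t)} \tag{7}$$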

Problem (7) is similar to Problem (5), as it is also a second-order cone program but with different parameters (see App. C).
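The per-arm relaxed constraint can be checked directly once the means and standard deviations of the sampled predictions on the perturbed context are known. The sketch below is a feasibility check only, not a solver for Problem (7); it takes the per-arm means and standard deviations as inputs, uses the standard library's `statistics.NormalDist` for the Gaussian quantile, and all names and numbers are chosen for illustration.

```python
import math
from statistics import NormalDist


def lints_relaxed_ok(mu, sigma, target_arm, margin, delta):
    """Check mu_target - mu_a - margin >= z * sqrt(sigma_a^2 + sigma_target^2) for all other arms.

    Here z is the Gaussian quantile of level 1 - delta / (K - 1), so that a union
    bound over the K - 1 non-target arms gives an overall failure probability of at most delta.
    """
    k = len(mu)
    z = NormalDist().inv_cdf(1.0 - delta / (k - 1))
    for arm in range(k):
        if arm == target_arm:
            continue
        spread = math.sqrt(sigma[arm] ** 2 + sigma[target_arm] ** 2)
        if mu[target_arm] - mu[arm] - margin < z * spread:
            return False
    return True


# Hypothetical numbers: two arms, target arm 1.
print(lints_relaxed_ok([0.0, 1.0], [0.1, 0.1], target_arm=1, margin=0.1, delta=0.05))
print(lints_relaxed_ok([0.0, 1.0], [1.0, 1.0], target_arm=1, margin=0.1, delta=0.05))
```

With small posterior spread the gap between the means is enough to satisfy the constraint, whereas a wide posterior requires a larger gap, which is consistent with the higher attack cost expected for randomized algorithms.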

## 6 Attacks on Adversarial Bandits

In the previous sections, we studied algorithms with sublinear regret $R_T$, i.e., mainly bandit algorithms designed for stochastic stationary environments. Adversarial algorithms like Exp4 do not provably enjoy a sublinear regret $R_T$. In addition, because this type of algorithm is, by design, robust to non-stationary environments, one could expect it to induce a linear cost on the attacker. In this section, we show that this is not the case for most contextual adversarial algorithms. Contextual adversarial algorithms are studied through the reduction to the bandit with expert advice problem. This is a bandit problem with $K$ arms where, at every step, $N$ experts each suggest a probability distribution over the arms. The goal of the algorithm is to learn which expert gets the best expected reward in hindsight after $T$ steps. The regret in this type of problem is defined as:

$$R^{\exp}_T = \mathbb{E}\left( \max_{m \in \llbracket 1, N \rrbracket} \sum_{t=1}^{T} \left( \sum_{j=1}^{K} E^{(t)}_{m,j} \, r_{t,j} - r_{t,a_t} \right) \right)$$

where $E^{(t)}_{m,j}$ is the probability of selecting arm $j$ for expert $m$ at time $t$. In the case of contextual adversarial bandits, the experts first observe the context $x_t$ before making their recommendations. Assuming the current setting with linear rewards, we can show that if an algorithm, like Exp4, enjoys a sublinear regret $R^{\exp}_T$, then, using the Contextual ACE attack with either $\tilde{r}^1$ or $\tilde{r}^2$, the attacker can fool the algorithm into pulling arm $a^\dagger$ a linear number of times under some mild assumptions. However, attacking contexts for this type of algorithm is difficult because, even though the rewards are linear, the experts are not assumed to use a specific model for selecting an action.

###### Proposition 3.

Suppose an adversarial algorithm satisfies a regret of order for any bandit problem and that there exists an expert such that . Then attacking alg. with Contextual ACE leads to pulling arm , of times in expectation with a total cost of for the attacker.

The proof is similar to the one of Prop. 1 and is presented in App. A.4. The condition on the expert in Prop. 3 means that there exists an expert which believes $a^\dagger$ is optimal most of the time. The adversarial algorithm will then learn that this expert is optimal.

Algorithm Exp4 has a regret bounded by , thus the total number of pulls of arms different from $a^\dagger$ is bounded by . This result also implies that for adversarial algorithms like Exp3 Auer et al. (2002), the same type of attack could be used to fool the algorithm into pulling arm $a^\dagger$, because the MAB problem can be seen as a reduction of the contextual bandit problem with a unique context and one expert for each arm.

## 7 Experiments

In this section, we conduct experiments on the attacks on contextual bandit problems with simulated data and two real-world datasets: MovieLens25M Harper and Konstan (2015) and Jester Goldberg et al. (2001). The synthetic dataset and the data preprocessing step are presented in Appendix B.1.

### 7.1 Attacks on Rewards

We study the impact of the reward attack for contextual algorithms: LinUCB, LinTS, -greedy and Exp. As parameters, we use L=1 for the maximal norm of the contexts, , , at each time step t and . For Exp, we use experts with experts returning a random action at each time, one expert choosing action every time and one expert returning the optimal arm for every context. With this set of experts the regret of bandits with expert advice is the same as in the contextual case. To test the performance of each algorithm, we generate random contextual bandit problems and run each algorithm for steps on each. We report the average cost and regret for each of the problems.

Figure 2 shows the attacked algorithms using the attacked reward (reported as stationary CACE) and the rewards (reported as CACE). These experiments show that, even though the reward process is non-stationary, usual stochastic algorithms like LinUCB can still adapt to it and pull the optimal arm for this reward process (which is arm ). The true regret of the attacked algorithms is linear as is not optimal for all contexts. In the synthetic case, for the algorithms attacked with the rewards , over 1M iterations and , the target arm is drawn more than of the time on average for every algorithm and more than of the time for the stationary attack (see Table 3 in App. B.2). The dataset-based environments (see Figure 2) exhibit the same behavior: the target arm is pulled more of the time on average for all our attacks on Jester and MovieLens and more than of the time in the worst case (for LinTS attacked with the stationary rewards) (see Table 3).

### 7.2 Attacks on Contexts

We now illustrate the setting of Sec. 4. We test the performance of LinUCB, LinTS and $\epsilon$-greedy with the same parameters as in the previous experiments. Yet since the variance is much smaller in this case, we generate a random problem and run simulations for each algorithm and each attack type. The target arm is chosen to minimize the average expected reward over all contexts and we use the exact lower bound on the reward for this target arm as $\nu$. In addition, we also test the performance of an attack where the contexts are multiplied by compared to the attack in Sec. 4 but where the attacker is only allowed to attack $20\%$ of the time. The rest of the time the attacker does not modify the context. We call this attack CC20.

Table 1 shows the percentage of times the target arm has been selected by the attacked algorithm. We see that, as expected, CC LinUCB reaches a ratio of almost , meaning the target arm is indeed selected a linear number of times. A more surprising result (at least not covered by the theory) is that $\epsilon$-greedy exhibits the same behavior. Similarly to LinTS, $\epsilon$-greedy exhibits some randomness in the action selection process. It can cause to be chosen when the context is attacked and interfere with the principle of the attack. We suspect that this is what happens for LinTS. Fig. 3 shows the total cost of the attacks for the attacked algorithms (except for LinTS, for which the cost is linear). Although the theory only covers the case when the attacker is able to attack at any time step, the CC20 attack reaches almost the same success rate as CC for LinUCB and $\epsilon$-greedy.

### 7.3 Attacks on a Single Context

We now move to the setting described in Sec. 5 and test the same algorithms as in Sec. 7.2. We run 40 simulations for each algorithm and each attack type. The target context is chosen randomly and the target arm as the arm minimizing the expected reward for . The attacker is only able to modify the incoming context for the target context (which corresponds to the context of one user) and the incoming contexts are sampled uniformly from the set of all possible contexts (of size ).

Table 2 shows the percentage of success for each attack. We observe that the non-relaxed attacks on $\epsilon$-greedy and LinUCB work well across all datasets. However, the relaxed attacks on LinUCB and LinTS are not as successful on the synthetic dataset and MovieLens25M. The Jester dataset seems to be particularly suited to this type of attack because the true feature vectors are well separated from the convex hull formed by the feature vectors of the other arms. Only % of Jester’s feature vectors are contained in the convex hull of the others while this number amounts to for MovieLens and on average for the synthetic dataset.

As expected, the cost of the attacks is linear on all the datasets (see Figure 6 in App. B.4). The cost is also lower for the non-relaxed than for the relaxed version of the attack on LinUCB. Unsurprisingly, the cost of the attacks on LinTS is the highest due to the need to guarantee that the arm will be chosen with high probability (95% in our experiments).

## 8 Conclusion

We presented several settings for online attacks on contextual bandits. We showed that an attacker can force any contextual bandit algorithm to almost always pull an arbitrary target arm $a^\dagger$ with only sublinear modifications of the rewards. When the attacker can only modify the contexts, we prove that LinUCB can still be attacked and made to almost always pull $a^\dagger$ by adding sublinear perturbations to the contexts. When the attacker can only attack a single context, we derive a feasibility condition for the attacks and we introduce a method to compute some attacks of small instantaneous cost for LinUCB, $\epsilon$-greedy and LinTS. To the best of our knowledge, this paper is the first to describe effective attacks on the contexts of contextual bandit algorithms. Our numerical experiments, conducted on both synthetic and real-world data, validate our results and show that the attacks on all contexts are actually effective on several algorithms and in more permissive settings.

## Appendix A Proofs

In this appendix, we present the proofs of different theoretical results presented in the paper.

### A.1 Proof of Proposition 1

###### Proposition.

For any , when using Contextual ACE algorithm (Alg. 1) with perturbed rewards , with probability at least , algorithm pulls arm , times and the total cost of attacks is .

###### Proof.

Let us consider the contextual bandit problem , with arms with contexts such that the optimal arm has mean reward and all other arms have mean . Then the regret of algorithm for this bandit problem is upper-bounded with probability at least by a function such that . In addition, the reward process fed to Alg. by the attacker is a stationary reward process with -subgaussian noise. Therefore, the number of times algorithm pulls an arm different from is upper-bounded by .

In addition, the total cost of the attack is upper-bounded by where is the number of times arm has been pulled up to time . Thanks to the previous argument, . ∎

### A.2 Proof of Proposition 2

###### Proposition.

Using the attack described in Alg. 2, for any , with probability at least , the number of times LinUCB does not pull arm $a^\dagger$ is at most:

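$$\sum_{j \neq a^\dagger} N_j(T) \le 32 K^2 \left(\frac{\lambda}{\alpha^2} + \sigma^2 d \log\left(\frac{\lambda d + T L^2 \alpha^2}{d \lambda \delta}\right)\right)^3$$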

with $N_j(T)$ the number of times arm $j$ has been pulled after $T$ steps, for all arms , $\lambda$ the regularization parameter of LinUCB and for all , . The total cost for the attacker is bounded by:

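$$\sum_{t=1}^{T} c_t \le \frac{64 K^2}{\nu} \left(\frac{\lambda}{\alpha^2} + \sigma^2 d \log\left(\frac{\lambda d + T L^2 \alpha^2}{d \lambda \delta}\right)\right)^3$$

###### Proof.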

Let $a_t$ be the arm pulled by LinUCB at time $t$. For each arm $a$, let $\tilde{\theta}_a(t)$ be the result of the linear regression with the attacked contexts and $\hat{\theta}_a(t, \lambda/\alpha^2)$ the one with the unattacked contexts and a regularization of $\lambda/\alpha^2$. At any time step $t$, we can write, for all $a \neq a^\dagger$:

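$$\tilde{\theta}_a(t) = \left(\lambda I_d + \sum_{l=0,\, a_l=a}^{t} \alpha^2 x_l x_l^\intercal\right)^{-1} \sum_{k=0,\, a_k=a}^{t} r_k \alpha x_k = \frac{1}{\alpha}\left(\frac{\lambda}{\alpha^2} I_d + \sum_{k=0,\, a_k=a}^{t} x_k x_k^\intercal\right)^{-1} \sum_{k=0,\, a_k=a}^{t} r_k x_k = \frac{\hat{\theta}_a(t, \lambda/\alpha^2)}{\alpha}$$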
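
The identity above is easy to check numerically. The snippet below is a minimal sketch on synthetic data: the dimensions, the random contexts and rewards, and the helper `ridge` are assumptions made for illustration only. It compares the regularized least-squares estimate built from contexts scaled by $\alpha$ with the estimate built from the original contexts under regularization $\lambda/\alpha^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lam, alpha = 4, 30, 0.5, 3.0

X = rng.normal(size=(n, d))  # unattacked contexts x_k (synthetic)
r = rng.normal(size=n)       # observed rewards r_k (synthetic)


def ridge(contexts, rewards, reg):
    """Regularized least squares: (reg * I + X^T X)^{-1} X^T r."""
    gram = reg * np.eye(contexts.shape[1]) + contexts.T @ contexts
    return np.linalg.solve(gram, contexts.T @ rewards)


# Regression on the attacked contexts alpha * x_k with regularization lambda.
theta_tilde = ridge(alpha * X, r, lam)
# Regression on the unattacked contexts with regularization lambda / alpha^2.
theta_hat = ridge(X, r, lam / alpha**2)

# The attacked estimate equals the unattacked one divided by alpha.
assert np.allclose(theta_tilde, theta_hat / alpha)
```

Scaling the stored contexts by $\alpha$ therefore shrinks the estimated parameter of the attacked arm by the same factor, which is the mechanism the rest of the proof relies on.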

We also note that, since the contexts are not modified for arm $a^\dagger$: . In addition, for any context $x$ and arm $a \neq a^\dagger$, the exploration term used by LinUCB becomes:

$$\|x\|_{\tilde{V}_{a,t}^{-1}} = \frac{1}{\alpha}\|x\|_{\hat{V}_{a,t}^{-1}} \tag{8}$$

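where $\tilde{V}_{a,t} = \lambda I_d + \alpha^2 \sum_{l=0,\, a_l=a}^{t} x_l x_l^\intercal$ and $\hat{V}_{a,t} = \frac{\lambda}{\alpha^2} I_d + \sum_{l=0,\, a_l=a}^{t} x_l x_l^\intercal$. For a time $t$, if presented with context $x_t$ LinUCB pulls arm $a_t \neq a^\dagger$, we have: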

As , we deduce that on the event that the confidence sets (Theorem in Abbasi-Yadkori et al. (2011)) hold for arm :

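$$2 \le \langle \hat{\theta}_{a_t}(t, \lambda/\alpha^2), x_t \rangle + \beta_{a_t}(t)\,\|x_t\|_{\hat{V}_{a_t,t}^{-1}} \le \langle \theta_{a_t}, x_t \rangle + 2\beta_{a_t}(t)\,\|x_t\|_{\hat{V}_{a_t,t}^{-1}}$$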

Thus, . Therefore,

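$$\sum_{t=1}^{T} \mathds{1}\{a_t \neq a^\dagger\} \le \sum_{t=1}^{T} \min\left(2\beta_{a_t}(t)\,\|x_t\|_{\hat{V}_{a_t,t}^{-1}},\, 1\right) \mathds{1}\{a_t \neq a^\dagger\} \le \sum_{j \neq a^\dagger} 2\beta_j(T) \sqrt{\sum_{t=1}^{T} \mathds{1}\{a_t = j\} \sum_{t=1,\, a_t=j}^{T} \min\left(1,\, \|x_t\|^2_{\hat{V}_{j,t}^{-1}}\right)}$$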

But using Lemma from Abbasi-Yadkori et al. (2011) and the bound on the for all arms , we have with Jensen's inequality:

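$$\sum_{t=1}^{T} \mathds{1}\{a_t \neq a^\dagger\} \le 4 \sqrt{K \sum_{t=1}^{T} \mathds{1}\{a_t \neq a^\dagger\}\, d \log\left(1 + \frac{\alpha^2 T L^2}{\lambda d}\right)} \left(\sqrt{\lambda/\alpha^2}\, S + \sigma \sqrt{2\log(1/\delta) + d \log\left(1 + \frac{\alpha^2 T L^2}{\lambda d}\right)}\right)$$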
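
The last inequality is implicit in the number of non-target pulls, and solving it takes one extra step. As a sketch, using shorthand introduced only for this remark ($N$, $A$ and $B$ are assumptions of notation and do not appear elsewhere in the text), let $N = \sum_{t=1}^{T} \mathds{1}\{a_t \neq a^\dagger\}$, $A = d \log\left(1 + \frac{\alpha^2 T L^2}{\lambda d}\right)$ and $B = \sqrt{\lambda/\alpha^2}\, S + \sigma\sqrt{2\log(1/\delta) + A}$. For $N > 0$:

$$N \le 4\sqrt{K N A}\, B \;\Longrightarrow\; \sqrt{N} \le 4\sqrt{K A}\, B \;\Longrightarrow\; N \le 16\, K A B^2$$

Since $A$ and $B^2$ each grow at most logarithmically in $T$, this indicates that the number of non-target pulls is at most polylogarithmic in $T$. Recovering the exact constants stated in the proposition would require further bounding of $A B^2$.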

### A.3 Proof of Theorem 1

###### Theorem.

For any , Problem (3) is feasible if and only if:

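$$\exists\, \theta \in \mathcal{C}_{t,a^\dagger}, \quad \theta \notin \mathrm{Conv}\left(\bigcup_{a \neq a^\dagger} \mathcal{C}_{t,a}\right) \tag{9}$$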

where for every arm , with the least squares estimate for arm built by LinUCB and

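$$\tilde{V}_{a,t} = \lambda I_d + \sum_{l=1,\, x_l \neq x^\dagger}^{t} \mathds{1}\{a_l = a\}\, x_l x_l^\intercal + \sum_{l=1,\, x_l = x^\dagger}^{t} \mathds{1}\{a_l = a\}\, \tilde{x}_l \tilde{x}_l^\intercal$$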

the design matrix of LinUCB at time $t$ for all arms $a$ (where $\tilde{x}_l$ is the modified context).

###### Proof.

The proof of Theorem 1 is decomposed into two parts.

First, let us assume that Equation (9) is satisfied and let . Then, by the theorem of separation of convex sets applied to and , there exists a vector and such that for all :

$$\langle y, v \rangle \le c_1$$

Hence, for we have that for that:

$$\langle y, \tilde{v} \rangle + \xi \le \langle \theta, \tilde{v} \rangle$$

Secondly, let us assume that an attack is feasible. Then there exists a vector $y$ such that:

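$$\max_{\theta \in \mathcal{C}_{t,a^\dagger}} \langle y, \theta \rangle > c_1 := \max_{a \neq a^\dagger} \max_{\theta \in \mathcal{C}_{t,a}} \langle y, \theta \rangle$$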

Let us reason by contradiction. We assume that and consider . There exists , and such that and . Thus

 (10)

The problem is feasible, so . This contradicts Eq. 10. ∎

### A.4 Proof of Proposition 3

###### Proposition.

For an adversarial algorithm satisfying a regret of order for any bandit problem, if there exists an expert such that , then attacking Alg. with Contextual ACE leads to pulling arm , times in expectation with a total cost of order for the attacker.

###### Proof.

Similarly to the proof of Proposition 1, let us define the bandit with expert advice problem, , such that at each time the reward vector is (with ). The regret of this algorithm is: . The regret of the learner is: