# What You See May Not Be What You Get: UCB Bandit Algorithms Robust to $\varepsilon$-Contamination

## Abstract

Motivated by applications of bandit algorithms in education, we consider a stochastic multi-armed bandit problem with $\varepsilon$-contaminated rewards. We allow an adversary to give arbitrary unbounded contaminated rewards with full knowledge of the past and future. We impose the constraint that at each time the proportion of contaminated rewards for any action is less than or equal to $\varepsilon$. We derive concentration inequalities for two robust mean estimators for sub-Gaussian distributions in the $\varepsilon$-contamination context. We define the $\varepsilon$-contaminated stochastic bandit problem and use our robust mean estimators to give two variants of a robust Upper Confidence Bound (UCB) algorithm, crUCB. Using regret derived from only the underlying stochastic rewards, both variants of crUCB achieve . Our simulations are designed to reflect reasonable settings a teacher would experience when implementing a bandit algorithm. We show that in certain adversarial regimes crUCB not only outperforms algorithms designed for stochastic (UCB1) and adversarial bandits (EXP3) but also those that have “best of both worlds” guarantees (EXP3++ and TsallisInf) even when our constraint on the proportion of contaminated rewards is broken.

## 1 Introduction

We first review the problem of stochastic multi-armed bandits (sMAB) with contaminated rewards, or contaminated stochastic bandits (CSB). This scenario assumes that rewards associated with an action are sampled i.i.d. from a fixed distribution and that the learner observes the reward after an adversary has the opportunity to contaminate it. The observed reward can be unrelated to the reward distribution and can be maliciously chosen to fool the learner. An outline for this setup is presented in Section 2.

We are primarily motivated by the use of bandit algorithms in education, where the rewards often come directly from human opinion. Whether responses come from undergraduate students, a community sample, or paid participants on platforms like MTurk, there is always reason to believe some responses are careless or inattentive to the question or could be assisted by bots \citep{neckaMeasuringPrevalenceProblematic2016, curranMethodsDetectionCarelessly2016}.

An example in education is a recent paper testing bandit Thompson sampling to identify high-quality student-generated solution explanations to math problems using MTurk participants \citep{williamsAXISGeneratingExplanations2016}. Using ratings between 1 and 10 from 150 participants, the results showed that Thompson sampling identified participant-generated explanations that, when viewed by other participants, significantly improved their chance of solving future problems compared to no explanation or “bad” explanations identified by the algorithm. While the proportion of contaminated responses will always depend on the population, recent work suggests that, even when checks for fraudulent participants are already in use, between of MTurk participants give low-quality samples \citep{ahlerMicroTaskMarketLemons2018, ryanDataContaminationMTurk2018, neckaMeasuringPrevalenceProblematic2016}. This is consistent with measurements of careless and inattentive responses seen in survey data, which reports with an estimated mode of , with the conclusion that these responses are generally not a random sample \citep{curranMethodsDetectionCarelessly2016}. Accounting for these low-quality responses is especially relevant in educational settings, where the number of iterations an algorithm can run is often significantly smaller than those used by big tech (e.g. advertising).

Recent work in CSB has various assumptions on the adversary, the contamination, and the reward distributions. Many papers require the rewards and contamination to be bounded \citep{kapoorCorruptiontolerantBanditLearning2018, gupta2019better, lykourisStochasticBanditsRobust2018}. Others don’t require boundedness, but do assume that the adversary contaminates uniformly across rewards \citep{altschulerBestArmIdentification2019}. All works make some assumption on the number of rewards for an action an adversary can contaminate. We discuss previous work more thoroughly in section 3.

Our work expands on these papers by allowing for a full-knowledge adaptive adversary that can give unbounded contamination in any manner. However, there is a trade-off when compared to work assuming bounded rewards and contamination: we require an estimate of the upper bound on the reward variance. This can often allow for simpler implementation than some algorithms that require boundedness, as we will discuss in section 4. Our constraint on the adversary is that for some fixed $\varepsilon$, no more than an $\varepsilon$ proportion of rewards for an action are contaminated. We provide an $\varepsilon$-contamination robust UCB algorithm by first proving concentration inequalities for two robust mean estimators in the $\varepsilon$-contamination context. We are able to show that the regret of our algorithm analyzed on the true reward distributions is provided that the contamination proportion is small enough. Through simulations, we show that with a Bernoulli adversary, our algorithm outperforms algorithms designed for stochastic (UCB1) and adversarial bandits (EXP3) as well as those that have “best of both worlds” guarantees (EXP3++ and TsallisInf) even when our constraint on the adversary is broken.

Though we are motivated by applications of bandit algorithms in education and use this context to determine appropriate parameters in the simulations, we point out that opportunities for CSB modeling arise in other contexts as well.

**Human feedback:** There is always a chance that human feedback is careless or inattentive, and therefore is not representative of the underlying truth related to an action. This may appear in online surveys that are used for A/B testing, or as is the case above in the explanation generation example.

**Click fraud:** Internet users who wish to preserve privacy can intentionally click on ads to obfuscate their true interests either manually or through browser apps. Similarly, malware can click on ads from one company to falsely indicate high interest, which can cause higher rankings in searches or more frequent use of the ad than it would otherwise merit \citep{pearceCharacterizingLargeScaleClick2014, crussellMAdFraudInvestigatingAd2014}.

**Measurement errors:** If rewards are gathered through some process that may occasionally fail or be inaccurate, then the rewards may be contaminated. For example, in health apps that use activity monitors, vigorous movement of the arms may be perceived as running in place \citep{feehanAccuracyFitbitDevices2018, baiComparativeEvaluationHeart2018}.

## 2 Problem Setting

Here we specify our notation and present the $\varepsilon$-contaminated stochastic bandit problem. We then argue for a specific notion of regret for CSB. We compare our setting to others in the field in section 3.

**Notation** We use to represent for to represent the number of actions and the indicator function to be 1 if true and 0 otherwise. Let be the number of times action has been chosen at time and to be the vector of all observed rewards for action at time . The suboptimality gap for action is and we define .

### 2.1 $\varepsilon$-Contaminated Stochastic Bandits

A basic parameter in our framework is $\varepsilon$, the fraction of rewards for an action that the adversary is allowed to contaminate. Before play, the environment picks a true reward from fixed distribution for all and . The adversary observes these rewards and then play begins. At time the learner chooses an action . The adversary sees then chooses an observed reward and then the learner observes only .

We present the contaminated stochastic bandits game in algorithm 1.

We allow the adversary to corrupt in any fashion as long as for every time there is no more than an $\varepsilon$-fraction of contaminated rewards for any action. That is, we constrain the adversary such that,

We allow the adversary to give unbounded contamination that can be chosen with full knowledge of the learner’s history as well as current and future rewards. This setting allows the adversary to act differently across actions and places no constraints on the contamination itself, but rather on the rate of contamination.
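The constraint described above can be checked mechanically on a recorded play history. The following is a minimal sketch, assuming a literal reading of the constraint (at every time step, the number of contaminated rewards observed so far for an action is at most $\varepsilon$ times the number of times that action has been chosen). The function name `respects_contamination_budget` and its arguments are hypothetical and do not come from the paper.

```python
from typing import Sequence


def respects_contamination_budget(
    actions: Sequence[int],
    contaminated: Sequence[bool],
    num_actions: int,
    eps: float,
) -> bool:
    """Return True if, after every time step, each action has at most an
    eps-fraction of its observed rewards contaminated (assumed reading)."""
    pulls = [0] * num_actions  # number of times each action was chosen so far
    bad = [0] * num_actions    # number of contaminated rewards per action so far
    for a, is_bad in zip(actions, contaminated):
        pulls[a] += 1
        bad[a] += int(is_bad)
        # Only the chosen action's counts changed, so only it needs checking.
        if bad[a] > eps * pulls[a]:
            return False
    return True
```

Under this literal reading, an adversary with $\varepsilon < 1$ can never contaminate the very first reward of an action, which is one reason an attack concentrated at the start of play falls outside the constraint.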

### 2.2 Notion of Regret

A traditional goal in bandit learning is to minimize the observed cumulative regret gained over the total number of plays . Because the adversary in this model can affect the observed cumulative regret, we argue to instead use a notion of regret that considers only the underlying true rewards. We call this uncontaminated regret and give the definition below for any time and policy in terms of the true rewards ,

(2.1) |

The definition in eq. 2.1 is first mentioned in [kapoorCorruptiontolerantBanditLearning2018] along with another notion of regret that compares the sum of the observed (possibly contaminated) rewards to the sum of optimal, uncontaminated rewards,

(2.2) |

We argue that eq. 2.2 gives little information about the performance of an algorithm. This notion of regret can be negative, and with no bounds on the contamination it can be arbitrarily small and potentially meaningless. We believe that any regret that compares a true component to an observed (possibly contaminated) component is not a useful measure of performance in CSB as it is unclear what regret an optimal strategy should produce.
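To make the contrast between the two notions concrete, here is a small sketch. It assumes the expected (pseudo-regret) form, in which each round costs the gap between the best true mean and the true mean of the chosen action; the displayed definitions are missing from this copy, so this form is an assumption. The function names are hypothetical.

```python
from typing import Sequence


def uncontaminated_regret(actions: Sequence[int], true_means: Sequence[float]) -> float:
    """Regret measured only on the true reward distributions (assumed form)."""
    best = max(true_means)
    return sum(best - true_means[a] for a in actions)


def mixed_regret(observed: Sequence[float], true_means: Sequence[float]) -> float:
    """Optimal true reward minus the sum of observed, possibly contaminated,
    rewards. Unbounded contamination can push this arbitrarily low."""
    return len(observed) * max(true_means) - sum(observed)
```

For instance, a single large contaminated observation makes `mixed_regret` negative even if the learner chose a suboptimal action, while `uncontaminated_regret` depends only on which actions were played.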

## 3 Related Work

We start by briefly addressing why adversarial and “best of both world” algorithms are not optimized for CSB. We then cover relevant work in robust statistics, followed by current work in robust bandits and how our model differs and relates.

### 3.1 Adversarial Bandits

Adversarial bandits with an oblivious environment allow the adversary to first look at the learner’s policy and then choose all rewards before the game begins. If the learner chooses a deterministic policy, the adversary can choose rewards such that the learner cannot achieve sublinear worst-case regret \citep{lattimoreBanditAlgorithms2018}. Algorithms such as EXP3 \citep{auerNonstochasticMultiarmedBandit2002} are thus randomized, but their regret is analysed with respect to the best fixed action, where “best” is defined using the observed rewards. There are no theoretical guarantees with respect to the uncontaminated regret, so it is not immediately clear how they will perform in a CSB problem. We remark that adversarial analysis assumes uniformly bounded observed rewards whereas we allow observed rewards to be unbounded. Additionally, the general adversarial framework does not take advantage of the structure present in CSB, namely that the adversary can only corrupt a small fraction of rewards, so it is likely that performance improvements can be made.

### 3.2 Best of Both Worlds

A developing line of work concerns algorithms that enjoy “best of both worlds” guarantees. That is, they perform well in both stochastic and adversarial environments without knowing a priori which environment they will face. Early work in this area [auerAlgorithmNearlyOptimal2016, bubeckBestBothWorlds2012] started by assuming a stochastic environment and implementing some method to detect a failure of the i.i.d. assumption, at which point the algorithm switches to an algorithm for the adversarial environment for the remainder of iterations. Further work implements algorithms that can handle an environment that is some mixture of stochastic and adversarial, as in EXP3++ and TsallisInf \citep{seldinOnePracticalAlgorithm2014, zimmertAnOptimal2019}.

While these algorithms are well suited to a stochastic environment with some adversarial rewards, they differ from contamination robust algorithms in that all observed rewards are thought to be informative. Their uncontaminated regret has therefore not been analysed, and there are no guarantees in the CSB setting.

### 3.3 Contamination Robust Statistics

The $\varepsilon$-contamination model we consider is closely related to the one introduced by Huber in 1964 \citep{huberRobustEstimationLocation1964}. The goal there was to estimate the mean of a Gaussian mixture model where an $\varepsilon$ fraction of the sample was not sampled from the main Gaussian component. There has been a recent increase of work using this model, especially in extensions to the high-dimensional case ([diakonikolasRobustEstimatorsHigh2019], [kothariOutlierrobustMomentestimationSumofsquares2017], [laiAgnosticEstimationMean2016], [liuHighDimensionalRobust2019]). These works often keep the assumption of a Gaussian mixture component, though there has been expanding work with non-Gaussian models as well.

### 3.4 Contamination Robust Bandits

Some of the first work in CSB started by assuming both rewards and contamination were bounded \citep{lykourisStochasticBanditsRobust2018, gupta2019better}. These works assume an adversary that can contaminate at any time step, but that is constrained in the cumulative contamination. That is, the cumulative absolute difference of the contaminated reward, , to the true reward, , is bounded, . Lykouris et al. provide a layered UCB-type active arm elimination algorithm. Gupta et al. expand on this work to provide an algorithm similar to active arm elimination in spirit, but which never completely eliminates an action, and which has better regret guarantees.

Recent work in implementing a robust UCB replaces the empirical mean with the empirical median, and gives guarantees for the uncontaminated regret with Gaussian rewards \citep{kapoorCorruptiontolerantBanditLearning2018}. They consider an adaptive adversary but require the contamination to be bounded, though the bound need not be known. They cite work that can expand their robust UCB to distributions with bounded fourth moments by using the agnostic mean \citep{laiAgnosticEstimationMean2016}, though they give no uncontaminated regret guarantees. In one dimension, the agnostic mean takes the mean of the smallest interval containing fraction of points. This estimator is also known as the $\alpha$-shorth mean. Our work expands on this model by allowing for unbounded contamination and analysing the uncontaminated regret for sub-Gaussian rewards when implementing a UCB algorithm with the $\alpha$-shorth mean.

CSB has also been analysed in the best arm identification problem \citep{altschulerBestArmIdentification2019}. Using a Bernoulli adversary that contaminates any reward with probability , Altschuler et al. consider three adversaries of increasing power, from the oblivious adversary, which knows neither the player’s history nor the current action or reward, to a malicious adversary, which can contaminate knowing the player’s history and the current action and reward. They analyse the probability of best arm selection and the sample complexity of an active arm elimination algorithm. While their performance measure is different from ours, we generalize their context to allow an adversary to contaminate in any fashion.

There is also work that explores adaptive adversarial contamination on $\varepsilon$-greedy and UCB algorithms \citep{junAdversarialAttacksStochastic2018}. They give a thorough analysis with both theoretical guarantees and simulations of the effects an adversary can have on these two algorithms when the adversary does not know the optimal action but is otherwise fully adaptive. They show these standard algorithms are susceptible to contamination. Similar work looks at contamination in contextual bandits with a non-adaptive adversary \citep{ma2019data}.

## 4 Main Results

We present concentration bounds for both the $\alpha$-shorth and $\alpha$-trimmed mean estimators in the $\varepsilon$-contamination context for sub-Gaussian random variables.

Our contribution to the CSB problem is in providing a contamination robust UCB algorithm that is simple to implement and has theoretical regret guarantees close to those of UCB algorithms in the uncontaminated setting.

### 4.1 Contamination Robust Mean Estimators

The estimators we analyse have been in use for many decades as robust statistics. Our contribution is to analyse them within our $\varepsilon$-contamination model and provide simple finite-sample concentration inequalities for ease of use in UCB-type algorithms.

#### Trimmed Mean

Our first estimator suggested for use in the contaminated model is the $\alpha$-trimmed mean \citep{liuHighDimensionalRobust2019}.

**$\alpha$-trimmed mean** Trim the smallest and largest $\alpha$-fraction of points from the sample and calculate the mean of the remaining points. This estimator uses a $1-2\alpha$ fraction of sample points.
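As an illustration of the definition, the following is a minimal sketch of a trimmed mean in one dimension. Rounding the number of trimmed points down to an integer is an assumption made here for concreteness, and the name `trimmed_mean` is hypothetical.

```python
import math
from typing import Sequence


def trimmed_mean(sample: Sequence[float], alpha: float) -> float:
    """Mean of the sample after dropping the smallest and the largest
    alpha-fraction of points (number of dropped points rounded down)."""
    if not 0 <= alpha < 0.5:
        raise ValueError("alpha must lie in [0, 0.5)")
    if len(sample) == 0:
        raise ValueError("sample must be non-empty")
    xs = sorted(sample)
    n = len(xs)
    k = math.floor(alpha * n)  # points trimmed from each end
    kept = xs[k:n - k]         # non-empty because alpha < 0.5
    return sum(kept) / len(kept)
```

With `alpha = 0` the function reduces to the ordinary empirical mean, and with two extreme outliers in a sample of eight points, `alpha = 0.25` removes both of them.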

The intuition is that if the contamination is large, then it will be removed from the sample. If it is small, it should have little effect on the mean estimate. Next we provide the concentration inequality for the $\alpha$-trimmed mean.

**Theorem (Trimmed mean concentration)** Let be the set of points that are drawn from a $\sigma$-sub-Gaussian distribution with mean . Let be a sample where an $\varepsilon$-fraction of these points are contaminated by an adversary. For , we have,

with probability at least .

###### Proof.

Our proof techniques are adapted from [liuHighDimensionalRobust2019].

Let be the set of points that are drawn from a $\sigma$-sub-Gaussian distribution. Without loss of generality assume . Let be a sample where an $\varepsilon$-fraction of these points are contaminated by an adversary.

Let represent the points which are not contaminated and represent the contaminated points. Then our sample can be represented by the union . Let represent the points that remain after trimming fraction of the largest and smallest points, and be the set of points that were trimmed. Then we have,

with

Combining we get,

with probability at least . Letting and , and assuming , we have,

with probability at least . ∎

A more detailed proof can be found in appendix A.1.

#### Shorth Mean

Lai’s \citeyearpar{laiAgnosticEstimationMean2016} agnostic mean, for which we use the more common term $\alpha$-shorth mean, can be considered a variation of the trimmed mean.

**$\alpha$-shorth mean** Take the mean of the shortest interval that removes the smallest and largest fraction of points such that , where are chosen to minimize the interval length of remaining points. Uses fraction of sample points.

The $\alpha$-shorth mean is less computationally efficient than the trimmed mean, but may be a better mean estimator when the contaminated points are not large outliers and are skewed in one direction. Intuitively this is because the $\alpha$-shorth mean can trim off contamination that would require removing most of the sample with the trimmed mean. Next we provide the concentration inequality for the $\alpha$-shorth mean.
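A one-dimensional shorth-type mean can be sketched as a sliding window over the sorted sample. The sketch below assumes that the fractions removed from the two ends sum to `alpha`, so that roughly a `1 - alpha` fraction of the sample is kept; the exact fractions are missing from this copy of the text, so treat that choice, the rounding, and the name `shorth_mean` as assumptions.

```python
import math
from typing import Sequence


def shorth_mean(sample: Sequence[float], alpha: float) -> float:
    """Mean of the narrowest window of consecutive order statistics that
    keeps all but floor(alpha * n) points (assumed reading of the definition)."""
    if not 0 <= alpha < 1:
        raise ValueError("alpha must lie in [0, 1)")
    if len(sample) == 0:
        raise ValueError("sample must be non-empty")
    xs = sorted(sample)
    n = len(xs)
    m = n - math.floor(alpha * n)  # number of points kept, at least 1
    # Try every window of m consecutive sorted points and keep the shortest.
    start = min(range(n - m + 1), key=lambda i: xs[i + m - 1] - xs[i])
    window = xs[start:start + m]
    return sum(window) / m
```

On a sample such as `[0, 1, 2, 3, 50, 60, 70, 80]`, where all the unusual points lie on one side, removing half of the points this way keeps the tight cluster `[0, 1, 2, 3]`, whereas symmetric trimming of the same total number of points would keep `[2, 3, 50, 60]`.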

**Theorem ($\alpha$-shorth mean concentration)** Let be the set of points that are drawn from a $\sigma$-sub-Gaussian distribution with mean . Let be a sample where an $\varepsilon$-fraction of these points are contaminated by an adversary. For , , we have,

with probability at least .

The proof is contained in appendix A.2 and follows a similar approach to the one shown for the trimmed mean.

Our methods ensured that the first term in each concentration bound is the same, giving them similar regret guarantees when implemented in a UCB algorithm. We emphasize that the $\alpha$-shorth mean uses fraction of a sample while the $\alpha$-trimmed mean uses fraction of a sample. We remark that if there is no contamination and then our inequalities reduce to the standard concentration inequality for the empirical mean of samples drawn from a sub-Gaussian distribution.
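For reference, the standard inequality alluded to here can be written as follows. The notation ($\hat{\mu}_n$ for the empirical mean of $n$ independent samples, $\mu$ for the true mean, $\sigma$ for the sub-Gaussian parameter, and $\delta$ for the failure probability) is chosen for this note and may differ from the paper's own symbols.

$$
\Pr\left( \left| \hat{\mu}_n - \mu \right| \ge t \right) \le 2 \exp\left( -\frac{n t^2}{2 \sigma^2} \right)
\quad\Longrightarrow\quad
\left| \hat{\mu}_n - \mu \right| \le \sigma \sqrt{\frac{2 \log(2/\delta)}{n}}
\ \text{ with probability at least } 1 - \delta .
$$

The second statement follows from the first by setting the right-hand side equal to $\delta$ and solving for $t$.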

### 4.2 Contamination Robust UCB

We present the contamination robust UCB (crUCB) algorithm for $\varepsilon$-CSB with sub-Gaussian rewards.
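The algorithm listing itself is not reproduced in this copy, so the following is only a structural sketch of a UCB loop in which the empirical mean is replaced by a pluggable robust estimator. The exploration bonus used here is a generic sub-Gaussian UCB bonus and is an assumption: the index used by crUCB also depends on the contamination parameters, and those terms cannot be recovered from the text. The names `robust_ucb`, `pull`, `estimator`, and `sigma` are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def robust_ucb(
    pull: Callable[[int], float],
    num_actions: int,
    horizon: int,
    estimator: Callable[[Sequence[float]], float],
    sigma: float,
) -> List[int]:
    """Play `horizon` rounds and return the sequence of chosen actions."""
    rewards: List[List[float]] = [[] for _ in range(num_actions)]
    chosen: List[int] = []
    for t in range(1, horizon + 1):
        if t <= num_actions:
            a = t - 1  # play each action once to initialise the estimates
        else:
            def index(i: int) -> float:
                n = len(rewards[i])
                # Placeholder bonus (assumption): generic sub-Gaussian UCB term.
                bonus = sigma * math.sqrt(4.0 * math.log(t) / n)
                return estimator(rewards[i]) + bonus
            a = max(range(num_actions), key=index)
        rewards[a].append(pull(a))  # the learner sees only the observed reward
        chosen.append(a)
    return chosen
```

Passing a trimmed or shorth mean as `estimator` gives the two variants in spirit; passing the plain empirical mean recovers an ordinary UCB loop.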

We provide uncontaminated regret guarantees for crUCB below for both the $\alpha$-trimmed and the $\alpha$-shorth mean.

**Theorem ($\alpha$-trimmed mean crUCB uncontaminated regret)** Let and . Then with algorithm 4 with the $\alpha$-trimmed mean, $\sigma$-sub-Gaussian reward distributions with , and contamination rate , we have the uncontaminated regret bound,

**Corollary ($\alpha$-trimmed mean crUCB uncontaminated regret, bounded rewards)** If the rewards are bounded by , and have contamination rate , then

**Theorem ($\alpha$-shorth mean crUCB uncontaminated regret)** Let and . Then with algorithm 4 with the $\alpha$-shorth mean, sub-Gaussian reward distributions with , and contamination rate , we have the uncontaminated regret bound,

**Corollary ($\alpha$-shorth mean crUCB uncontaminated regret, bounded rewards)** If the rewards are bounded by , and have contamination rate , then

Proofs of the two regret theorems above and their corollaries follow standard analysis and are provided in appendices A.3 to A.5.

From the two regret theorems above we get that crUCB has the same order of regret in the CSB setting as UCB1 has in the standard sMAB setting. The constraint on the magnitude of $\varepsilon$ is quite strong, but we show in section 5 that it can be broken while still obtaining good empirical performance.

**Remark** Our bounds above do not allow $\varepsilon$ to be too big relative to the minimum suboptimality gap . This is natural: if then no algorithm can get sublinear regret since distinguishing between the top two actions is statistically impossible even with infinite samples. We give a simple example in appendix B.
Furthermore, it is possible to derive a regret bound^{1}

## 5 Simulations

We compare our crUCB algorithms using the trimmed mean (tUCB) and shorth mean (sUCB) against a standard stochastic algorithm (UCB1, [auerFinitetimeAnalysisMultiarmed2002]), a standard adversarial algorithm (EXP3, [auerNonstochasticMultiarmedBandit2002]), two “best of both worlds” algorithms (EXP3++, [seldinImprovedParametrizationAnalysis2017], 0.5-TsallisInf, [zimmertAnOptimal2019]), and another contamination robust algorithm (RUCB-MAB, [kapoorCorruptiontolerantBanditLearning2018]). Each trial has five actions (), is run for 1000 iterations (), for . For sUCB and tUCB, we set and . The plots are average results over 10 trials with error bars showing the standard deviation.

Our choice of comes from our motivation to apply contaminated bandits in education, where the sample sizes are often much smaller than for example in advertising. While would be considered a large university class, it still allows one to visually see regret for smaller iterations and see how performance stabilizes. We similarly chose number of arms and proportion contamination to be in a realistic range for the application we have in mind. All algorithms use recommended parameter settings given within their respective papers.

**Rewards and gaps** We chose the reward distribution to be binomial(n=10) to simulate a Likert scale and because this distribution has bounded rewards and is not symmetric for large . For the optimal action, and for suboptimal actions , thus the suboptimality gap is . All non-optimal actions have the same true distribution.

**Adversaries** We focus on a Bernoulli adversary which gives a contaminated reward at every time step with probability . We also implement a cluster adversary which contaminates at the beginning of play to show the weakness of algorithms to this type of attack.

**Contamination** We use a random contamination scheme which chooses a contaminated reward uniformly from ranges that increase suboptimal action means and decrease the optimal action’s mean.
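The simulation ingredients described above (binomial rewards, a Bernoulli adversary, and uniform contamination that pushes suboptimal actions up and the optimal action down) could be sketched as follows. The specific ranges used for contaminated rewards are placeholders, since the actual ranges and success probabilities are not recoverable from this copy; the function names are hypothetical.

```python
import random
from typing import Tuple


def binomial_reward(rng: random.Random, n: int, p: float) -> int:
    """Draw a binomial(n, p) reward as a sum of n Bernoulli(p) trials."""
    return sum(rng.random() < p for _ in range(n))


def bernoulli_adversary(
    rng: random.Random,
    true_reward: int,
    is_optimal: bool,
    eps: float,
    n: int = 10,
) -> Tuple[int, bool]:
    """With probability eps, replace the true reward by a contaminated one.

    Returns (observed_reward, was_contaminated).
    """
    if rng.random() >= eps:
        return true_reward, False
    # Placeholder ranges (assumption): low values for the optimal action,
    # high values for suboptimal actions.
    low, high = (0, n // 2) if is_optimal else (n // 2, n)
    return rng.randint(low, high), True
```

Because each reward is contaminated independently, the realised fraction of contaminated rewards for an action can exceed `eps` at some time steps, so this adversary does not necessarily respect the constraint of section 2.1.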

**Performance measurement** We plot the average regret over 10 trials for 1000 iterations.

We recommend viewing the plots on a color screen.

In fig. 0(a) we see that the adversarial and best of both worlds algorithms, EXP3, EXP3++, and TsallisInf, perform poorly in the purely stochastic setting compared to the UCB type algorithms. In fig. 1, we see the best of these, TsallisInf, starts to degrade as the proportion of contamination increases while the robust UCB algorithms are only slightly affected. These simulations show a clear performance benefit to using algorithms that specifically account for contaminated rewards.

Figure 3 and fig. 4 show that for both sUCB and tUCB, the choice of is much less sensitive than the choice of . Overestimating or slightly underestimating does not degrade performance significantly. Underestimating can give a significant boost to performance while overestimating can degrade it. This is consistent with the performance of UCB algorithms in practice, which often scale the exploration term to improve empirical performance \citep{liu2014trading}.

To look at the impact of using a contamination robust algorithm when there is no contamination, we plotted various values when , shown in fig. 2. Assuming small amounts of contamination when there is none only has a small impact on performance, suggesting it is permissible to use contamination robust methods when there is uncertainty of contamination. Similarly, small and large can render bounded contamination impotent and would not require algorithms that account for it.

We have included RUCB-MAB in our simulations because it is simple to implement and can perform similarly well to our algorithms. We note it currently has guarantees only for Gaussian rewards \citep{kapoorCorruptiontolerantBanditLearning2018}.

Figure 5 shows the poor performance of all algorithms when the first rewards are contaminated. TsallisInf and EXP3++ show some recovery, but it is clear this type of adversary is harmful. This remains an open problem for scenarios with small .

We also considered including the BARBAR algorithm \citep{gupta2019better}, whose epoch scheme is the only approach we know of that accounts for the front cluster attack. We chose against this as for our setting of the BARBAR algorithm only has one epoch, and thus does not make any updates to the estimated gaps, resulting in pure exploration.

## 6 Discussion

We have presented two variants of an $\varepsilon$-contamination robust UCB algorithm to handle uninformative or malicious rewards in the stochastic bandit setting. As the main contribution, we proved concentration inequalities for the $\alpha$-trimmed and $\alpha$-shorth mean in the $\varepsilon$-contamination setting and guarantees on the uncontaminated regret of the crUCB algorithms. The regret guarantees are similar to those in the uncontaminated sMAB setting.

We have shown through simulation that these algorithms can outperform “best of both worlds” algorithms and those for stochastic or adversarial environments when using a small number of iterations and chosen to be reasonable when implementing bandits in education.

We highlight that our algorithms are simple to implement. In practice, it is often easy to find upper bounds on the parameters which are robust to underestimation. Our algorithms are numerically stable and have clear intuition to their actions.

A weak point of these algorithms is that they require knowledge of beforehand. Choices of may come from domain knowledge, but could also require a separate study.

In this work we assumed a fully adaptive adversarial contamination, constrained only by the total fraction of contamination at any time step. By making more assumptions about the adversary, it is likely possible to improve uncontaminated regret bounds.

**Limitations** The adversary used in the simulation is quite simple and does not take full advantage of the power we allow in our model. We designed it as a first test of our algorithms and associated theory. In the future, we would like to design simulated adversaries that are modeled on real world contamination. It will also be important to deploy contamination robust algorithms in the real world. This will require thought on how to select various tuning parameters ahead of the deployment.

There remain many open questions in this area. In particular, we think this work could be extended in the following directions.

**Randomized algorithms** UCB-type algorithms are often outperformed in applications by the randomized Thompson sampling algorithm. Creating a randomized algorithm that accounts for the contamination model would increase the practicality of this line of work.

**Contamination correlated with true rewards** One possibility is that the contaminated rewards contain information about the true rewards. For example, if contamination takes the form of missing data, we know dropout can be correlated with the treatment condition.

#### Acknowledgements

L.N. acknowledges the support of NSF via grant DMS-1646108 and thanks Joseph Jay Williams for helpful discussions and for inspiring this work. A.T. would like to acknowledge the support of a Sloan Research Fellowship and NSF grant CAREER IIS-1452099.


## Appendix A Proofs

### A.1 Theorem 4.1.1

###### Proof of theorem 4.1.1.

Without loss of generality assume for the underlying true distribution. For -sub-Gaussian, by definition, we have:

and

Let represent the points which are not contaminated and represent the contaminated points. Then our sample can be represented by the union . Let represent the points that remain after trimming fraction of the largest and smallest points, and be the set of points that were trimmed. Then we have that,

with

Combining we get,

with probability at least . Letting and , and assuming , we have,

with probability at least . ∎

### A.2 Theorem 4.1.2

###### Proof of theorem 4.1.2.

Without loss of generality assume for the underlying true distribution. Let -sub-Gaussian.

We want to bound the impact of the contaminated points in our interval. Once we have this bound, the proof follows just as in the trimmed mean.

Assume and . Let be the interval that contains the shortest fraction of , be the interval that contains (i.e. the remaining good points after contamination), and be the interval that contains the points of after trimming the largest and smallest fraction of points. Use to denote the length of interval . It must be that because otherwise the points in would contain fraction of . Let be a point in and be a point in . Recall that trMean is the trimmed mean of the contaminated sample from above. Then we have,

The second step comes from and both being in and because . The third step comes from .

To bound the length of we have,

Finally, since

with probability at least , we get that for ,

Now that we have a bound on the contaminated points in , our analysis follows as before,

where

Combining we get,

with probability at least . Letting and , and assuming , we have,

with probability at least . ∎

### A.3 Theorem 4.2

###### Proof of theorem 4.2.

First we will show that for non-optimal actions. Assume .

Now to find for non-optimal actions.

Finally, we can find the regret following the standard analysis,

∎

### A.4 Corollary 4.2

###### Proof of corollary 4.2.

By replacing the part of the concentration bound for the trimmed mean that is based on the maximum value in the sample with , we get that,

with probability at least .

First we will show that for non-optimal actions. Assume .

Results follow with a similar analysis as above. ∎

### A.5 Theorem 4.2

###### Proof of theorem 4.2.

The proof for the contamination robust UCB using the $\alpha$-shorth mean is similar to that of the trimmed mean.

Using the analysis from the trimmed mean regret, we again get,