Probabilistic Sequential Shrinking: A Best Arm Identification Algorithm for Stochastic Bandits with Corruptions
Abstract
We consider a best arm identification (BAI) problem for stochastic bandits with adversarial corruptions in the fixed-budget setting of T steps. We design a novel randomized algorithm, Probabilistic Sequential Shrinking (PSS(u)), which is agnostic to the amount of corruptions. When the amount of corruptions per step (CPS) is below a threshold, PSS(u) identifies the best arm or item with probability tending to 1 as T → ∞. Otherwise, the optimality gap of the identified item degrades gracefully with the CPS. We argue that such a bifurcation is necessary. In addition, we show that when the CPS is sufficiently large, no algorithm can achieve a BAI probability tending to 1 as T → ∞. In PSS(u), the parameter u serves to balance between the optimality gap and the success probability. En route, the injection of randomization is shown to be essential to mitigate the impact of corruptions. Indeed, we show that PSS(u) has a better performance than its deterministic analogue, the Sequential Halving (SH) algorithm by Karnin et al. (2013). PSS(u)'s performance guarantee matches SH's when there is no corruption. Finally, we identify a term in the exponent of the failure probability of PSS(u) that generalizes the common H_2 term for BAI under the fixed-budget setting.
1 Introduction
Consider a drug company A that wants to design a vaccine for a certain illness, say COVID-19. It has a certain number of options, say K, to design a near-optimal vaccine. Because A has a limited budget, it can only test vaccines for a fixed number of times, say T. Using the limited number of tests, it wants to find the option that will lead to the “best” outcome, e.g., the shortest average recovery time of certain model organisms. However, a competitor drug company arbitrarily corrupts the tests, and hence the observed recovery times may be significantly different from the true ones. The total corruption budget is bounded as a function of the number of tests. How can A find a near-optimal drug design in the presence of the corruptions and the uncertainty in the efficacy of the drugs? We will show that the utilization of a suitably randomized algorithm is key.
To solve A’s problem, in this paper, we study the Best Arm Identification (BAI) problem for stochastic bandits with adversarial corruptions. We note that the effect and mitigation of corruptions were studied for the Regret Minimization problem by Lykouris et al. (2018) and others. While most existing works study the BAI problem for stochastic bandits without corruptions (Auer et al., 2002; Audibert and Bubeck, 2010; Carpentier and Locatelli, 2016), Altschuler et al. (2019) considers a variation of the classical BAI problem and aims to identify an item with a high median reward, while Shen (2019) assumes that the amount of corruptions per time step (CPS) diminishes as time progresses. Therefore, these studies are not directly applicable to our setting, as we are interested in obtaining a near-optimal item in terms of the mean and we assume that the CPS does not diminish with time. Our setting dovetails neatly with company A’s problem: A can utilize our algorithm to sequentially and adaptively select different design options to test the vaccines and to eventually choose a near-optimal design that results in a short recovery time even in the presence of adversarial corruptions.
Main Contributions. In stochastic bandits with adversarial corruptions, there are K items with different reward distributions and means. At each time step, a random reward is generated from each item’s distribution; this reward is observed and arbitrarily corrupted by the adversary. The learning agent selects an item based on the corrupted observations in previous steps and only observes the pulled item’s corrupted reward. Given a budget of T time steps, the agent aims to identify a near-optimal item with high probability. We design Probabilistic Sequential Shrinking, or PSS(u), and upper bound the probability that it fails to output a near-optimal item. The agent does not need to know the amount of corruptions to use PSS(u), and she can adjust the parameter u to trade off between the optimality gap and the BAI probability.
The key challenge in identifying a near-optimal item in bandits with corruptions is to mitigate the impact of the corruptions. For this purpose, upon observing the pulled items’ corrupted rewards in previous time steps, PSS(u) pulls subsequent items probabilistically. By comparing PSS(u) to its deterministic counterparts, e.g., the Sequential Halving (SH) algorithm by Karnin et al. (2013), we argue that this injection of randomness is advantageous in helping the agent to eventually choose a near-optimal item with high probability. PSS(u)’s performance guarantee (at least in the exponent) matches that of SH when there is no corruption. We identify a term in the exponent of the failure probability of PSS(u) that generalizes the common H_2 term for BAI under the fixed-budget setting.
Moreover, we provide complementary lower bounds which imply that when the total corruption budget is sufficiently large, any algorithm will fail to identify a near-optimal item with constant or high probability. For this purpose, we consider the case where each item’s distribution is Bernoulli, and we design two offline corruption strategies given a corruption budget that is larger than a linear function of the given horizon T. This corroborates the tightness of our bound on the near-optimality of the item produced by PSS(u).
Literature review. The BAI problem has been studied extensively for both stochastic bandits (Audibert and Bubeck, 2010) and bandits with adversarial corruptions (Shen, 2019). There are two complementary settings for BAI: (i) given a budget T, the agent aims to maximize the probability of finding a near-optimal item in at most T steps; (ii) given a confidence level δ, the agent aims to find a near-optimal item with probability at least 1 − δ in the smallest number of steps. These settings are respectively known as the fixed-budget and fixed-confidence settings. Another line of studies aims to prevent the agent from achieving the above desiderata, and thus to design strategies to attack the rewards efficiently (Jun et al., 2018; Liu and Lai, 2020). We now review some relevant works.
Firstly, we review the related studies in the classical stochastic bandits, where the agent observes the true random reward of the pulled item at each time step. Both the fixed-budget setting (Audibert and Bubeck, 2010; Karnin et al., 2013; Jun et al., 2016) and the fixed-confidence setting (Audibert and Bubeck, 2010; Chen et al., 2014; Rejwan and Mansour, 2020; Zhong et al., 2020) have been extensively studied. However, as previously motivated, we need to be cognizant that the agent may encounter corrupted rewards and thus must design appropriate strategies to nullify or minimize the effects of these corruptions.
Regret minimization on stochastic bandits with corruptions was first studied by Lykouris et al. (2018), and has attracted extensive interest recently (Zimmert and Seldin, 2019; Li et al., 2019; Gupta et al., 2019; Lykouris et al., 2020; Liu and Lai, 2020; Krishnamurthy et al., 2020; Bogunovic et al., 2020). Pertaining to the BAI problem in the presence of corruptions, Altschuler et al. (2019) studies a variation of the classical fixed-confidence setting and aims to find an item with a high median reward. In contrast, Shen (2019) proposes an algorithm under the fixed-budget setting, whose theoretical guarantee requires a number of stringent conditions. In particular, Shen (2019) assumes that the CPS diminishes as time progresses. However, it may be hard to verify in practice whether these conditions are satisfied. In spite of the many existing works, the classical BAI problem has not been analyzed when the rewards suffer from general corruptions. Our work fills in this gap in the literature by proposing and analyzing the PSS algorithm under the fixed-budget setting. The randomized design of our algorithm is crucial in mitigating the impact of corruptions.
Another concern in the study of stochastic bandits with corruptions is how the adversary can corrupt the rewards effectively to prevent the agent from obtaining sufficient information from the corrupted observations.
Many studies aim at attacking specific algorithms, such as UCB, greedy, or Thompson sampling, using an adaptive strategy (Jun et al., 2018; Zuo, 2020). In addition, Liu and Shroff (2019) design offline strategies to attack a given algorithm, as well as an adaptive strategy against any algorithm. All these attack strategies aim to corrupt the rewards so that the agent can only receive a small cumulative reward in expectation. The design and analysis of attack strategies pertaining to the BAI problem have yet to be carried out. Our analysis fills this gap by proposing two offline strategies for Bernoulli instances and proving that, when the total corruption budget is sufficiently large (i.e., of the order of the horizon T), no algorithm can identify a near-optimal item with high probability.
2 Problem Setup
For brevity, for any positive integer n, we denote the set {1, …, n} as [n]. Let there be K ground items, contained in the ground set [K]. Each item i ∈ [K] is associated with a reward distribution ν_i supported in [0, 1] and a mean μ_i. The quantities {(ν_i, μ_i)}_{i∈[K]} are not known to the agent. Over time, the agent is required to learn these rewards by adaptively pulling items. The agent aims to identify an optimal item, which is an item with the highest mean reward, after a fixed time budget of T time steps, whenever possible in the presence of corruptions. More precisely, at each time step t ∈ [T]:

1. A stochastic reward W_i(t) is drawn for each item i ∈ [K] from ν_i.
2. The adversary observes {W_i(t)}_{i∈[K]} and corrupts each W_i(t) by an additive amount c_i(t), leading to the corrupted reward W̃_i(t) = W_i(t) + c_i(t) for each i ∈ [K].
3. The agent pulls an item i(t) ∈ [K] and observes the corrupted reward W̃_{i(t)}(t).
For each i ∈ [K], the random variables in {W_i(t)}_{t∈[T]} are i.i.d. When determining the corruptions {c_i(t)}_{i∈[K]} at time step t, the adversary cannot observe the item i(t) that is going to be pulled, but he can utilize the current observations, consisting of the items pulled and the corrupted rewards observed before time step t. We assume that the total amount of adversarial corruptions during the horizon is bounded:
Σ_{t∈[T]} max_{i∈[K]} |c_i(t)| ≤ C.
The corruption budget C is not known to the agent.
We focus on instances with a unique item of the highest mean reward, and assume that μ_1 > μ_2 ≥ … ≥ μ_K, so that item 1 is the unique optimal item. To be clear, the items can, in general, be arranged in any order; the ordering μ_1 > μ_2 ≥ … ≥ μ_K is just to ease our discussion. We denote Δ_i := μ_1 − μ_i as the optimality gap of item i. An item i is said to be ε-optimal (for ε ≥ 0) if Δ_i ≤ ε.
The agent uses an online algorithm π to decide the item i(t) to pull at each time step t, and the item i_out to output as the identified item eventually. More formally, an online algorithm π consists of a tuple π = ((π_t)_{t∈[T]}, π_out), where

• the sampling rule π_t determines, based on the observation history, the item i(t) to pull at time step t. That is, the random variable i(t) is F_{t−1}-measurable, where F_t := σ(i(1), W̃_{i(1)}(1), …, i(t), W̃_{i(t)}(t));
• the recommendation rule π_out chooses an item i_out based on the whole observation history; that is, i_out is, by definition, F_T-measurable.
We denote the probability law of the process {(i(t), W̃_{i(t)}(t))}_{t∈[T]} by P. This probability law depends on the agent’s online algorithm π, which in turn influences the adversarial corruptions.
For fixed ε ∈ [0, 1) and δ ∈ (0, 1), an algorithm is said to be (ε, δ)-PAC (probably approximately correct) if
P(Δ_{i_out} ≤ ε) ≥ 1 − δ.
Our overarching goal is to design an (ε, δ)-PAC algorithm such that both ε and δ are small. In particular, when ε < Δ_2, an (ε, δ)-PAC algorithm identifies the optimal item with probability at least 1 − δ. For BAI with no corruption, existing works (Audibert and Bubeck, 2010; Karnin et al., 2013) provide (0, δ)-PAC algorithms. In the presence of corruptions, unfortunately, it is impossible to achieve a (0, δ)-PAC performance guarantee, as we discuss in the forthcoming Section 4.2. We investigate the trade-off between ε and δ, and focus on constructing (ε, δ)-PAC algorithms with ε as small as possible. When there is no ambiguity, we suppress the dependence on π in the notation.
Finally, in anticipation of our main results, we remark that given a failure probability δ, the smallest possible ε is, in general, a function of the corruption per step (CPS) C/T and possibly the total number of items K.
3 Algorithm
Our algorithm, Probabilistic Sequential Shrinking (PSS(u)), is presented in Algorithm 1. The algorithm involves randomization in order to mitigate the impact of adversarial corruptions.
The agent partitions the whole horizon of T steps into phases of equal length. During each phase, PSS(u) classifies an item as active or inactive based on the empirical averages of the corrupted rewards. Initially, all K ground items are active and belong to the active set A_1. Over the phases, the active sets shrink, and an item may be eliminated from the active set and consequently become inactive.
During phase :

• at each time step, the agent chooses an active item uniformly at random from the current active set and pulls it;
• at the end of the phase, the agent computes the corrupted empirical mean of each active item during this phase;
• the agent utilizes the corrupted empirical means of the active items to shrink the active set.
By the end of the last phase, we show that the active set contains exactly one item (see Lemma 5.1 in Section 5.1), and the agent outputs the single active item.
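The phased structure described above can be sketched as follows. This is a minimal illustration rather than the exact pseudocode of Algorithm 1; the function name `pss`, the reward oracle `pull`, and the shrink factor 2^u per phase are assumptions made for this sketch.

```python
import math
import random

def pss(K, T, u, pull):
    """Sketch of PSS(u): phased uniform-random pulls, shrinking the active
    set by a factor of 2**u per phase. `pull(i)` returns the (possibly
    corrupted) reward of item i, assumed to lie in [0, 1]."""
    active = list(range(K))
    num_phases = max(1, math.ceil(math.log2(K) / u))
    phase_len = T // num_phases
    for _ in range(num_phases):
        sums = {i: 0.0 for i in active}
        counts = {i: 0 for i in active}
        for _ in range(phase_len):
            i = random.choice(active)  # uniform random pull of an active item
            sums[i] += pull(i)
            counts[i] += 1
        # Corrupted empirical means of the active items during this phase.
        means = {i: sums[i] / max(counts[i], 1) for i in active}
        # Shrink: keep the top fraction of active items by empirical mean.
        keep = max(1, math.ceil(len(active) / 2 ** u))
        active = sorted(active, key=lambda i: means[i], reverse=True)[:keep]
    return active[0]
```

Because only the current phase's empirical means are used for elimination, a burst of corruptions in one phase cannot contaminate the statistics of later phases.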
The effectiveness of Algorithm 1 is manifested in four different aspects: (i) the agent only utilizes information from the current phase to shrink the active set, which ensures that any corruption has a limited impact on her decision; (ii) the randomization the agent injects when deciding which item to pull prevents the adversary from targeting the rewards of specific items; (iii) the agent can handle the adversarial attacks even though she does not know the total corruption budget C; (iv) the agent can choose any u ∈ {1, …, ⌈log_2 K⌉} to trade off between ε and δ in its (ε, δ)-PAC performance guarantee. A smaller ε comes at the cost of a higher failure probability δ.
When u = ⌈log_2 K⌉, PSS(u) regards the whole horizon as one single phase. Each ground item is pulled with probability 1/K at each time step, and is expected to be pulled T/K times over the T time steps. We can regard PSS(⌈log_2 K⌉) as a randomized version of the naïve Uniform Pull (UP) algorithm, which pulls each item T/K times according to a deterministic schedule.
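For contrast, the deterministic UP baseline admits a one-line-schedule sketch; the name `uniform_pull` and the round-robin order are assumptions standing in for any fixed deterministic schedule:

```python
def uniform_pull(K, T, pull):
    """Naive Uniform Pull (UP): pull items round-robin on a fixed schedule,
    then recommend the item with the highest empirical mean. `pull(i)`
    returns the (possibly corrupted) reward of item i."""
    sums = [0.0] * K
    counts = [0] * K
    for t in range(T):
        i = t % K  # deterministic schedule, predictable by the adversary
        sums[i] += pull(i)
        counts[i] += 1
    means = [s / max(c, 1) for s, c in zip(sums, counts)]
    return max(range(K), key=lambda i: means[i])
```

Since the schedule is fixed in advance, an adversary that knows the algorithm also knows exactly which item is pulled at every step, which is what the randomization in PSS is designed to prevent.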
When u = 1, PSS(1) is a randomized analogue of the Sequential Halving (SH) algorithm proposed in Karnin et al. (2013). Both PSS(1) and SH divide the whole horizon into ⌈log_2 K⌉ phases and halve the active set during each phase, i.e., |A_{p+1}| = ⌈|A_p| / 2⌉. However, the differences between them are as follows:

• during phase p, SH pulls each item in the active set for an equal, predetermined number of times according to a deterministic schedule.
Therefore, though PSS(1) and SH pull each active item for about an equal number of times in expectation, PSS(1) involves more randomness in the pulls.
4 Main Results
4.1 Upper bound
Theorem 4.1. For any u ∈ {1, …, ⌈log_2 K⌉}, the Probabilistic Sequential Shrinking algorithm PSS(u), as presented in Algorithm 1, outputs an item i_out satisfying
(4.1) 
where
(4.2) 
Theorem 4.1 shows that PSS(u) is (ε, δ)-PAC for any u, where
We remark that only ε, but not δ, depends on the CPS C/T. The dependence of ε on the CPS is, in general, unavoidable in view of our lower bounds (see Section 4.2).
The upper bound on the failure probability δ involves a hardness parameter which quantifies the difficulty of identifying the best item in the instance. This parameter generalizes its analogue H_2 proposed by Audibert and Bubeck (2010). We propose to consider the more general version in order to analyze the randomized versions of SH and UP under one framework.
Effect of the parameter u. Theorem 4.1 implies that when u is larger, the upper bound on ε becomes smaller. However, the hardness quantity increases, which leads to a larger upper bound on the failure probability δ. Specifically, we have
Meanwhile, as presented in Algorithm 1, PSS(u) with a larger u separates the whole horizon into fewer phases and shrinks the active set faster. (i) The smaller number of phases leads to a longer duration of each phase, which is beneficial for bounding the impact of corruptions (see Lemma 5.2). (ii) Besides, the faster the active sets shrink, the larger the hardness quantity is. See Section C.4 for details.
Altogether, Theorem 4.1 provides a bound on learning an ε-optimal item and, importantly, implies that PSS(u) allows the agent to trade off between the bound on ε and the failure probability δ by adjusting the parameter u. When the CPS is so low that
(4.3) 
Theorem 4.1 implies that PSS(u) identifies the optimal item with probability at least 1 − δ, where δ is as shown in (4.1). When the CPS is so large that
(4.4) 
Theorem 4.1 is vacuous, since all the items are ε-optimal. In the extreme case when
Theorem 4.1 is vacuous for all u. Indeed, we show in Theorems 4.2 and 4.3 that this bifurcation in learnability holds not only for our algorithm: no algorithm can achieve a vanishing failure probability when the CPS is above a certain threshold. In passing, we note that our characterization of the threshold is tight up to constant factors.
BAI in the stochastic setting without corruptions. In the setting without adversarial corruptions, i.e., C = 0, Theorem 4.1 upper bounds the probability that PSS(u) outputs an item i_out with Δ_{i_out} > 0. We compare Theorem 4.1 with the performance guarantee of SH by Karnin et al. (2013):
Disregarding constants, the bound on δ for PSS(1) is worse than that of SH by a multiplicative factor, which we incur due to the impact of corruptions. Apart from that, our bound involves the generalized hardness quantity while the bound of Karnin et al. (2013) involves H_2, and notice that
As a result, our exponent matches that of Karnin et al. (2013) up to an absolute constant.
Next, we compare Theorem 4.1 with the performance guarantee of UP, which is folklore. We use the following bound from Section 33.3 of Lattimore and Szepesvári (2020):
(4.5) 
where (4.5) is tight when the optimality gaps are equal for all suboptimal items. For u = ⌈log_2 K⌉, the failure probability bound in (4.1) specializes to one which matches (4.5) up to multiplicative factors in the exponent and the Big-O notation.
Table 4.1: (ε, δ)-PAC performance guarantees of PSS(u) and its deterministic counterparts.
Algorithm  Order of ε  Order of δ
PSS(u)
PSS(1)
SH
PSS(⌈log_2 K⌉)
UP
Comparisons in the corrupted setting. Though the SH and the UP algorithms can be directly applied to the setting with corruptions, we propose PSS to inject randomness in order to mitigate the impact of corruptions. Intuitively, for an adversary with the knowledge of the algorithm, the fact that a deterministic algorithm such as SH or UP pulls each active item according to a deterministic schedule fixed at the start of a phase allows the adversary to corrupt rewards of the items to be pulled. However, PSS pulls items probabilistically, which prevents the adversary from identifying the items to be pulled even when the semantics of the algorithm are known to the adversary.
We analyze SH and UP using arguments similar to our proof of Theorem 4.1, and we tabulate the (ε, δ)-PAC performance guarantees in Table 4.1. While SH and UP have performance guarantees on δ similar to those of their randomized counterparts, the upper bounds on ε for SH and UP are larger than those of their randomized counterparts by a multiplicative factor. Consequently, the randomization in PSS(u) allows us to mitigate the adversarial corruptions and leads to a better performance guarantee on ε compared to its deterministic counterparts.
4.2 Lower bounds
In the previous section, we observed that the performance guarantee of PSS(u) on ε deteriorates as the CPS increases. Interestingly, this deterioration is, in fact, fundamental to any online algorithm. Here, we demonstrate that no online algorithm is able to identify the optimal item with vanishing failure probability when the CPS is above a certain threshold. The impossibility result is then generalized to the identification of an ε-optimal item for an arbitrary but fixed ε ≥ 0.
Bernoulli instance. In the following, let Bern(w) denote the Bernoulli distribution with parameter w. We focus on instances where each item i follows ν_i = Bern(μ_i), and μ_1 > μ_2 ≥ … ≥ μ_K. For any ε ≥ 0, we use n(ε) to denote the number of items whose mean rewards are at most ε worse than that of the optimal item.
Corruption strategy against BAI. In this strategy, essentially, the adversary solely corrupts the reward of item 1, so that the corrupted rewards of item 1 appear to be drawn from a distribution different from Bern(μ_1), as long as there is enough corruption budget (Figure 4.1). We describe the corruption strategy in full in Appendix C.5.
For a BAI with adversarial corruptions instance, we say that the instance has optimality gap Δ if μ_1 − μ_2 = Δ. We provide the following lower bound for BAI with corruptions.
Theorem 4.2. Fix an optimality gap Δ and a horizon T. For any online algorithm, there is a BAI instance with adversarial corruptions over T steps, a corruption budget C, and optimality gap Δ, such that
In particular, Theorem 4.2 implies that, if the CPS satisfies
(4.6) 
then it is impossible to identify the best item with probability tending to 1. The upper bound in (4.3) and the lower bound in (4.6) differ by a multiplicative constant. Consequently, the upper bound in (4.3) is within a constant factor of the largest possible CPS under which it is possible to identify the best item with probability at least 1 − δ.
Corruption strategy against identifying an ε-optimal item. We extend the previous strategy in order to impede the identification of an ε-optimal item for any ε ≥ 0. Consider the following two offline strategies:

(I) at each time step, if the random reward of the pulled item is 1, the adversary shifts it to 0 until the corruption budget is depleted (see Figure 4.2);
(II) at each time step, if the random reward of the pulled item is 0, the adversary shifts it to 1 until the corruption budget is depleted.
The design of either strategy aims to make the agent observe the same reward at all time steps. As a result, the agent fails to extract any information from the observations. In this case, the best she can do is to output an item chosen uniformly at random, with probability 1/K each, after the T time steps.
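A minimal sketch of offline strategy (I), under the assumption that the adversary only needs to corrupt the pulled item's observed reward; the names `corrupted_run` and `agent_pull` are hypothetical:

```python
import random

def corrupted_run(mus, T, budget, agent_pull):
    """Offline strategy (I): whenever the pulled item's Bernoulli reward
    is 1, shift it to 0, until the corruption budget is depleted. While the
    budget lasts, the agent observes only zeros and learns nothing."""
    spent = 0.0
    observations = []
    for t in range(T):
        i = agent_pull(t, observations)                 # agent picks an item
        w = 1.0 if random.random() < mus[i] else 0.0    # true Bernoulli reward
        c = -w if spent + w <= budget else 0.0          # shift 1 -> 0 if affordable
        spent += abs(c)
        observations.append((i, w + c))                 # agent sees corrupted reward
    return observations, spent
```

When the budget is of the order of T, the shifts never run out, every observation equals 0, and the agent's recommendation can be no better than a uniformly random guess.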
Theorem 4.3. Fix any ε ≥ 0. If the corruption budget C is sufficiently large, strategy (I) can attack the rewards such that
Under an analogous condition on C, strategy (II) can attack the rewards such that
When ε = 0, so that n(0) = 1, Theorem 4.3 provides lower bounds on the probability of failing to identify the optimal item under corruption strategies (I) and (II), respectively. In this case, the failure probability is asymptotically lower bounded by 1 − 1/K.
Although the adversary can use adaptive strategies to attack the random rewards, i.e., add corruptions according to past observations, Theorem 4.3 shows that when the corruption budget is sufficiently large, even an offline strategy, i.e., one that is fixed before the algorithm runs, prevents the agent from identifying a satisfactory item with high probability.
5 Proof Sketches
We provide proof sketches for Theorems 4.1 and 4.2. The detailed proofs of these theorems and that of Theorem 4.3 are deferred to the supplementary material.
5.1 Proof sketch of Theorem 4.1
Feasibility. We first show that our algorithm is feasible in the sense that the phases proceed within the T steps and the final active set is a singleton.
Lemma 5.1. The phases of Algorithm 1 fit within the horizon T, and the final active set contains exactly one item.
Lemma 5.1 ensures that i_out is well-defined. Moreover, it implies that
Concentration. At the end of phase p, the agent shrinks the active set according to the corrupted empirical means of the active items. Intuitively, we expect that if the corrupted empirical mean and the true mean of each active item are sufficiently close, we can identify an item with a small optimality gap. To this end, we define the amount of corruptions during phase p as
To estimate the gap between the corrupted empirical mean and the true mean of each active item, we define a class of “nice events”, one for each phase p and active item i:
We utilize Theorems B.1 and B.2 to show that all these events hold with high probability. In particular, Theorem B.2 allows us to bound the impact of the corruptions.
Lemma 5.2. Let E^c denote the complement of an event E. For any fixed phase p and active item i,
Note that the ratio of the phase length to the number of active items is the expected number of pulls of each active item during phase p. Lemma 5.2 implies that we are able to bound the gap between the corrupted empirical mean and the true mean of each active item with high probability.
Technique. In light of the importance of randomization for the regret minimization problem (Lykouris et al., 2018; Gupta et al., 2019; Zimmert and Seldin, 2019), we inject randomness into PSS(u) and derive Lemma 5.2. The lemma is similar to Lemma 4 in Gupta et al. (2019). Lemma 5.2 explains the necessity of Line 6 in Algorithm 1 for mitigating the impact of adversarial corruptions. Let us explain this in more detail. While an active item is pulled probabilistically in PSS(u), it is pulled a fixed number of times in SH (Karnin et al., 2013). Though the expected number of pulls of one active item is of the same order for PSS(u) and SH, the absence of randomization in SH does not allow Theorem B.2 to bound the gap between the empirical and true means in the same way as for PSS(u). For SH, we can only show that
Disregarding constants, the difference between these bounds and those for PSS(u) in Lemma 5.2 is that the term involving the corruptions is worse by a multiplicative factor for SH. As a result, the bound on ε for SH turns out to be worse than that for PSS(1) by a multiplicative factor (see Table 4.1). A similar explanation applies to the difference between the bounds for UP and PSS(⌈log_2 K⌉).
Elimination of the optimal item. When the agent fails to output item 1 (the optimal item), i.e., i_out ≠ 1, item 1 is inactive by the end of the last phase of the algorithm. Let
The index defined above labels the phase during which item 1 turns from active to inactive. Next, any item that belongs to the active set at the end of that phase satisfies
Conditioning on the corresponding nice events, we have
To facilitate our analysis, we define auxiliary quantities for all phases, and we let the comparator item be the item in the surviving active set with the least mean reward.
Lemma 5.2 implies that, with high probability, we have
Since the comparator item remains active, this allows us to bound the optimality gap of i_out as follows:
Note that these indices and items are random variables that depend on the dynamics of the algorithm. For any realization of them, we formalize the observation above in Lemma 5.3. The complete proof of Lemma 5.3 is postponed to Section C.3.
Lemma 5.3. Conditioned on the nice events, for each phase p and each active item i, we have
Bounds. When the nice events hold, we can apply Lemma 5.3 to bound the accumulated corruptions with the total corruption budget C, i.e., for any realization of the random indices,
In addition, the definitions of the relevant quantities indicate that
and
These inequalities, along with Lemma 5.3 and the definitions of the relevant quantities, imply that for all phases p and items i,
Altogether,
We complete the proof of Theorem 4.1 with the choices of ε and δ in (4.1) and (4.2). We elaborate on the details in Section C.4.
5.2 Proof sketch of Theorem 4.2
In addition to the Bernoulli instance α with mean rewards μ_1 > μ_2 ≥ … ≥ μ_K, consider another Bernoulli instance β whose mean rewards agree with those of α for all items except item 1, whose mean is lowered to μ_1 − 2Δ. Crucially, instance β has a different optimal item (item 2) from instance α (item 1), and both instances have optimality gap Δ, since μ_2 − (μ_1 − 2Δ) = Δ by our assumption that μ_1 − μ_2 = Δ. Consider an adversary who only corrupts the reward of item 1 in instance β, and makes the corrupted rewards of item 1 mimic its rewards under instance α whenever the corruption budget permits. In addition, instance α is not corrupted. Essentially, the corruption hinders the agent from differentiating between instances α and β, which is necessary for outputting the different optimal items in each of the instances. By a suitable coupling between the Bernoulli distributions induced by instances α and β, we show
Then, elementary calculations yield
6 Summary and Future Work
This paper has taken a step towards understanding the fundamental performance limits of BAI algorithms in the presence of adversarial corruptions that are added to the random rewards. We designed an algorithm, termed PSS(u), that can be regarded as a generalization and strengthening of the SH algorithm by Karnin et al. (2013), making it robust to corruptions. Due to its inherent randomized nature, PSS(u) successfully mitigates the adversarial corruptions. Furthermore, we showed, by constructing several adversarial corruption strategies, that the optimality gap of PSS(u) is competitive vis-à-vis that of any corruption-tolerant algorithm.
Inspired by the works of Liu and Shroff (2019), Jun et al. (2018), and Zuo (2020), it would be fruitful to devise optimal corruption strategies for algorithm-specific and algorithm-independent settings to uncover whether the dependence of the smallest optimality gap ε on the number of items K is fundamental. We conjecture that the smallest ε does not depend on K. More ambitiously, we would like to close the gap between the upper and lower bounds in (4.3) and (4.6).
Appendix A Notations
set for any  
ground set of size  
reward distribution of item  
mean reward of item  
random reward of item at time step  
corruption added on random reward item at time step  
corrupted reward of item at time step  
pulled item at time step  
total corruption budget  
probability law of the process  
gap between mean rewards of item and  
optimality gap of item  
nonanticipatory algorithm  
pulled item of algorithm at time step  
output of algorithm  
final recommendation rule of algorithm  
observation history  
bound on  
failure probability  
parameter in Algorithm 1  
amount of phases in Algorithm 1  
length of one phase in Algorithm 1  
active set in Algorithm 1  
probability to pull an active item during phase in Algorithm 1  
expected number of pulls of an active item during phase in Algorithm 1  
corrupted empirical mean of item during phase in Algorithm 1  
difficulty of the instance for PSS  
intrinsic difficulty of the instance  
Bernoulli distribution with parameter  
number of items with mean reward at most ε worse than that of the optimal item  
equals to  
parameter in the analysis of corruption strategies  
amount of corruptions during phase  
“nice events” in the analysis of Algorithm 1  
index of the phase during which item turns from active to inactive  
item in with the least mean reward  
equals to for all item 
Appendix B Useful theorems
Theorem B.1 (Standard multiplicative variant of the Chernoff–Hoeffding bound; Dubhashi and Panconesi (2009), Theorem 1.1).
Suppose that X_1, …, X_n are independent [0, 1]-valued random variables, and let X := Σ_{i=1}^n X_i. Then for any 0 < η ≤ 1,
P(X > (1 + η) E[X]) ≤ exp(−η² E[X] / 3)  and  P(X < (1 − η) E[X]) ≤ exp(−η² E[X] / 2).
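The multiplicative upper-tail bound can be checked numerically. The following sketch, with a hypothetical helper name `chernoff_check`, compares the empirical tail probability of a sum of Bernoulli variables against exp(−η² E[X] / 3):

```python
import math
import random

def chernoff_check(n=200, p=0.3, eta=0.5, trials=5000, seed=0):
    """Empirically compare P(X > (1 + eta) * E[X]) with the multiplicative
    Chernoff bound exp(-eta^2 * E[X] / 3), for X a sum of n Bernoulli(p)
    random variables."""
    rng = random.Random(seed)
    mean = n * p
    threshold = (1 + eta) * mean
    hits = sum(
        1 for _ in range(trials)
        if sum(1 for _ in range(n) if rng.random() < p) > threshold
    )
    empirical = hits / trials  # Monte Carlo estimate of the tail probability
    bound = math.exp(-eta * eta * mean / 3)
    return empirical, bound
```

With the default parameters the bound is exp(−5) ≈ 0.0067, while the true tail probability is several orders of magnitude smaller, illustrating that the bound is valid but not tight.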