Nonlinear Sequential Accepts and Rejects for Identification
of Top Arms in Stochastic Bandits
Abstract
We address the $m$-best-arm identification problem in multi-armed bandits. A player has a limited budget of arm pulls to explore a finite set of $K$ arms, and once pulled, each arm yields a reward drawn (independently) from a fixed, unknown distribution. The goal is to find the top $m$ arms in the sense of expected reward. We develop an algorithm which proceeds in rounds to deactivate arms iteratively. At each round, the budget is divided by a nonlinear function of the number of remaining arms, and the arms are pulled correspondingly. Based on a decision rule, the deactivated arm at each round may be accepted or rejected. The algorithm outputs the accepted arms, which should ideally be the top $m$ arms. We characterize the decay rate of the misidentification probability and establish that nonlinear budget allocation proves to be useful for different problem environments (described by the number of competitive arms). We provide comprehensive numerical experiments showing that with a suitable choice of nonlinearity, our algorithm outperforms the state-of-the-art.
I. Introduction
Multi-Armed Bandits (MAB) is a sequential decision-making framework for the exploration-exploitation dilemma [1, 2]. In MAB, a player explores a finite set of arms, and pulling each arm reveals a reward to the player. In the stochastic MAB, the rewards of each arm are independent samples from an unknown, fixed distribution. The player aims to exploit the arm with the largest expected reward as often as possible to maximize the gain. This framework has been formulated in terms of the cumulative regret, a measure comparing the player’s performance with that of a clairvoyant who knows the best arm a priori. Early studies on MAB date back several decades, but the problem has attracted renewed interest due to its modern applications, such as web search and advertising, wireless cognitive radios, and multi-channel communication systems (see, e.g., [3, 4, 5, 6, 7] and references therein).
More recently, many researchers have examined MAB in a pure-exploration framework, where the player aims to minimize the simple regret. This task is closely related to (the probability of) finding the best arm in the pool [8]. As a result, the best-arm identification problem has received considerable attention in the machine learning literature [9, 10, 11, 12, 8, 13, 14]. It is well known that algorithms developed to minimize the cumulative regret (exploration-exploitation) perform poorly for simple-regret minimization (pure exploration). Consequently, one must adopt different strategies for optimal best-arm recommendation [12]. To motivate the pure-exploration setting, consider channel allocation for mobile phone communication. Before the outset of communication, a cellphone (player) can explore the set of channels (arms) to find the best one to operate on. Each channel feedback is noisy, and the number of trials (budget) is limited. The problem is hence an instance of best-arm identification, and minimizing the cumulative regret is not the right approach [8].
In this paper, we consider the $m$-best-arm identification problem in the fixed-budget setting [15]. Given a fixed number of arm pulls, the player attempts to maximize the probability of correctly identifying the top $m$ arms (in the sense of expected reward). Note that this setting differs from the fixed-confidence setting, in which the objective is to minimize the number of trials needed to find the top $m$ arms with a certain confidence [16, 17]. Recently, for best-arm identification ($m=1$) in the fixed-budget setting, the authors of [18] proposed an efficient algorithm based on nonlinear sequential elimination. The idea is to discard the suboptimal arms sequentially, dividing the budget at each round by a nonlinear function of the number of remaining arms. With a suitable nonlinearity, the nonlinear budget allocation was proven to improve upon Successive Rejects [8] (its linear counterpart) as well as Sequential Halving [13].
Inspired by the success of nonlinear budget allocation for best-arm identification [18], in this work we extend the Successive Accepts and Rejects (SAR) algorithm of [15] to nonlinear budget allocation for $m$-best-arm identification. Our algorithm, called Nonlinear Sequential Accepts and Rejects (NSAR), proceeds in rounds. At each round, the arms are pulled strategically and their empirical rewards are calculated. Then, one arm is deactivated, and according to a decision rule the arm is either accepted or rejected. Unlike SAR, which divides the budget by a linear function of the number of remaining arms, NSAR does so in a nonlinear fashion. For two general reward regimes, we prove theoretically that our algorithm achieves a lower sample complexity than SAR, which improves the decay rate of the misidentification probability. We also provide various numerical experiments to support our theoretical results, and moreover, we compare NSAR to the fixed-budget version of AT-LUCB [19].
I-A. Related Work
Pure exploration in the PAC-learning setup was examined in [9], where Successive Elimination was developed for finding a near-optimal arm with probability $1-\delta$ (the fixed-confidence setting). The matching lower bounds for the problem were provided in [10, 20]. Many algorithms for pure exploration are inspired by the celebrated UCB1 algorithm for exploration-exploitation [2]. As an example, Audibert et al. [8] proposed UCB-E, which modifies UCB1 for pure exploration. In addition, Jamieson et al. [21] proposed an optimal algorithm for the fixed-confidence setting, inspired by the law of the iterated logarithm. Gabillon et al. [14] presented a unifying approach for the fixed-budget and fixed-confidence settings. For identification of multiple top arms ($m$-best-arm identification), Kalyanakrishnan et al. [16] developed the HALVING algorithm in the fixed-confidence setting, which was later improved by the LUCB algorithm of [17]. More recent progress on the fixed-confidence setting can be found in [22, 23, 24]. In [25], the $m$-best-arm identification problem was posed using a notion of aggregate regret and applied to crowdsourcing. Furthermore, Kaufmann et al. [26] studied the identification of multiple top arms using KL-divergence-based confidence intervals. The authors of [27] investigated both settings to show that the complexity of the fixed-budget setting may be smaller than that of the fixed-confidence setting.
II. Preliminaries
Notation: For an integer $K$, we define $[K] \triangleq \{1, 2, \ldots, K\}$ to represent the set of positive integers smaller than or equal to $K$. We use $|\mathcal{A}|$ to denote the cardinality of a set $\mathcal{A}$, and $\lceil \cdot \rceil$ to denote the ceiling function. We use the notation $f(n) = O(g(n))$ when there exist a positive constant $c$ and a point $n_0$ such that $f(n) \le c\, g(n)$ for $n > n_0$. Throughout, random variables are denoted in bold letters.
Table I: Bounds on the misidentification probability for Successive Rejects [8], Sequential Halving [13], and Nonlinear Sequential Elimination [18] in single-best-arm identification.

Table II: Sample-complexity orders of SAR [15], AT-LUCB [19], and NSAR (our algorithm) for $m$-best-arm identification.
II-A. Problem Statement
In the stochastic Multi-Armed Bandit (MAB) problem, a player explores a finite set of $K$ arms. When the player samples an arm, the corresponding reward of that arm is observed. The rewards of arm $i \in [K]$ are drawn independently from an unknown, fixed distribution with expected value $\mu_i$. The support of each distribution is the unit interval $[0,1]$, and the rewards are generated independently across the arms. For simplicity, we assume the arms are ordered as
$$\mu_1 \ge \mu_2 \ge \cdots \ge \mu_m > \mu_{m+1} \ge \cdots \ge \mu_K, \qquad (1)$$

where the strict inequality guarantees that there is no ambiguity over the top $m$ arms $\{1, \ldots, m\}$. Let $\Delta_i \triangleq \mu_1 - \mu_i$ denote the gap between arm $i$ and arm 1, measuring the suboptimality of arm $i$, and let $\hat{\boldsymbol{\mu}}_{i,s}$ denote the (empirical) average reward obtained by pulling arm $i$ for $s$ times.
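As a running illustration, the setup above can be simulated in a few lines. The instance below (four Bernoulli arms with hypothetical means $0.9, 0.7, 0.5, 0.3$) and the helper `empirical_mean` are our own illustrative choices, not taken from the paper:

```python
import random

rng = random.Random(0)

def empirical_mean(mu, s):
    """Average of s i.i.d. Bernoulli(mu) rewards, i.e. an estimate of mu."""
    return sum(rng.random() < mu for _ in range(s)) / s

# Hypothetical instance with K = 4 arms ordered as in (1).
means = [0.9, 0.7, 0.5, 0.3]
gaps = [means[0] - mu for mu in means]   # Delta_i = mu_1 - mu_i

# With enough pulls, the empirical average concentrates around mu_1 = 0.9.
est = empirical_mean(means[0], 10000)
```

The concentration of `est` around the true mean is exactly what Hoeffding's inequality (Fact 1 in the appendix) quantifies.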
In this work, we address the $m$-best-arm identification setup, a pure-exploration problem in which the player aims to find the top $m$ arms with high probability. The two well-known settings for this problem are fixed-confidence and fixed-budget. In the former, the objective is to minimize the number of arm pulls needed to identify the top $m$ arms with a certain confidence. In the latter, which is the focus of this work, the problem is posed formally as:
Problem 1
Given a total budget of $n$ arm pulls, an $m$-best-arm identification algorithm outputs a set of $m$ arms. Find the decay rate of the misidentification probability, i.e., the decay rate of the probability that the output differs from the top $m$ arms $\{1,\ldots,m\}$.
For the case that $m=1$, known as best-arm identification, it is proven that classical MAB techniques from the exploration-exploitation setting (e.g., UCB1) are not optimal. In particular, Bubeck et al. [12] have shown that upper bounds on the cumulative regret result in lower bounds on the simple regret, i.e., the smaller the cumulative regret, the larger the simple regret. The underlying intuition is that in the exploration-exploitation setting, we aim to find the best arm as quickly as possible to exploit it, and in this case, playing even the second-best arm for a long time yields an unacceptable cumulative regret. On the other hand, in the best-arm identification problem, there is no intermediate cost to minimize, and the player only recommends the best arm at the end. Therefore, exploring the suboptimal arms strategically during the game helps the player make a better final decision. In other words, the performance is measured only by the final output, regardless of the number of pulls of the suboptimal arms.
II-B. Previous Performance Guarantees and Our Result
Though the focus of this work is $m$-best-arm identification, we start by reviewing some of the results for the case $m=1$ (best-arm identification). Any (single) best-arm identification algorithm samples the arms based on some strategy and outputs a single arm as the best. In order to characterize the misidentification probability of these algorithms, we need to define a few quantities. The decay rate of the misidentification probability for two of the state-of-the-art algorithms, Successive Rejects [8] and Sequential Halving [13], relies on the complexity measure $H_2$, defined as

$$H_2 \triangleq \max_{2 \le i \le K} \frac{i}{\Delta_i^2}, \qquad (2)$$
which is equal to $H_1 \triangleq \sum_{i=2}^{K} \Delta_i^{-2}$ up to a logarithmic factor [8]. In Successive Rejects, at round $k$, the remaining arms are played proportionally to the whole budget divided by $K+1-k$ (a linear function of $k$). As the linear function is not necessarily the best sampling rule, the authors of [18] extended Successive Rejects to Nonlinear Sequential Elimination, which divides the budget at round $k$ by the nonlinear function $(K+1-k)^p$, based on an input parameter $p > 0$ ($p=1$ recovers Successive Rejects). The performance of the algorithm depends on the following quantities

$$C(p) \triangleq \frac{1}{2} + \sum_{k=2}^{K} \frac{1}{k^p}, \qquad H(p) \triangleq \max_{2 \le i \le K} \frac{i^p}{\Delta_i^2}. \qquad (3)$$
For each of the three algorithms, the bound on the misidentification probability can be written in the form $c_1 \exp(-n/(c_2 H))$, where the constants $c_1$, $c_2$ and the relevant complexity measure $H$ are provided in Table I. It was shown in [18] that in many regimes for the arm gaps, $p \neq 1$ provides better results (theoretical and practical), and Nonlinear Sequential Elimination outperforms the other two algorithms. The value of $p$ must be tuned, but the tuning is qualitative rather than quantitative, i.e., the algorithm performs reasonably well as long as $p$ is chosen on the correct side of 1, and thus the value of $p$ need not be specific.
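Assuming the standard definitions of $H_1$ and $H_2$ from [8] (with the usual convention $\Delta_1 := \Delta_2$), these complexity measures are straightforward to compute. The instance below is hypothetical, for illustration only:

```python
def complexity_measures(means):
    """H1 and H2 of Audibert & Bubeck [8], assuming means sorted decreasingly
    and the convention Delta_1 := Delta_2 for the best arm."""
    gaps = [means[0] - mu for mu in means]
    gaps[0] = gaps[1]                              # Delta_1 := Delta_2
    H1 = sum(1.0 / g ** 2 for g in gaps)           # sum_i Delta_i^{-2}
    H2 = max((i + 1) / g ** 2                      # max_i i * Delta_i^{-2}
             for i, g in enumerate(gaps))          # (i is 1-based in the math)
    return H1, H2

# Hypothetical 4-arm instance: gaps are 0.2, 0.2, 0.4, 0.6.
H1, H2 = complexity_measures([0.9, 0.7, 0.5, 0.3])
```

Here $H_1 = 25 + 25 + 6.25 + 2.78 \approx 59$ and $H_2 = \max\{25, 50, 18.75, 11.1\} = 50$, consistent with the statement that $H_1$ and $H_2$ agree up to a logarithmic factor.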
In this work, our goal is to extend this idea to $m$-best-arm identification. For convenience, we discuss the performance of these algorithms in terms of the sample complexity, defined as the smallest budget needed to achieve a confidence level $\delta$ for the misidentification probability, i.e., the smallest $n$ for which the misidentification probability is at most $\delta$. For $m$-best-arm identification, we need to define a new set of gaps and complexity measures as
$$\Delta^{(m)}_i \triangleq \begin{cases} \mu_i - \mu_{m+1}, & i \le m \\ \mu_m - \mu_i, & i > m, \end{cases} \qquad H^{(m)}_2 \triangleq \max_{i \in [K]} \frac{i}{\big(\Delta^{(m)}_{(i)}\big)^2}, \qquad H^{(m)}(p) \triangleq \max_{i \in [K]} \frac{i^p}{\big(\Delta^{(m)}_{(i)}\big)^2}, \qquad (4)$$

where for each $i \in [K]$, the index $(i)$ is such that $\Delta^{(m)}_{(1)} \le \Delta^{(m)}_{(2)} \le \cdots \le \Delta^{(m)}_{(K)}$.
Based on the definitions above,
Table II tabulates the sample complexities of three algorithms for $m$-best-arm identification: SAR [15], AT-LUCB [19], and NSAR, proposed in this paper. It follows immediately from (4) that $H^{(m)}(p) \le H^{(m)}_2$ for $p < 1$, and $H^{(m)}(p) \ge H^{(m)}_2$ for $p > 1$. Also, in view of (3), $C(p) = O(1)$ for $p > 1$ and $C(p) = O(K^{1-p})$ for $p < 1$. Therefore, the comparison of the sample complexities of SAR and NSAR is not obvious. As in the case of single best-arm identification, we will show that in many reward regimes, NSAR can outperform SAR.
Note that AT-LUCB [19] is an anytime algorithm, i.e., it does not require a pre-assigned budget. In that sense, AT-LUCB is more powerful than algorithms designed specifically for the fixed-budget setting, but since it can also be used in this framework, we include it in the table as a benchmark and will compare our results with this algorithm in the numerical experiments.
III. Nonlinear Sequential Accepts and Rejects
In this section, we propose the Nonlinear Sequential Accepts and Rejects (NSAR) algorithm for $m$-best-arm identification in the fixed-budget setting. The algorithm follows the steps of SAR [15], except that the budget allocation at each round is a nonlinear function of the number of remaining arms. The details of NSAR are given in Figure 1. The algorithm is given a budget of $n$ arm pulls. At any round $k$, it maintains an active set of arms $\mathcal{A}_k$, initialized by $\mathcal{A}_1 = [K]$. The algorithm proceeds for $K-1$ rounds to deactivate the arms sequentially (one arm at each round) until a single arm is left. Based on an input value $p > 0$, the constant $C(p)$ and the sequence $\{n_k\}$ are calculated for any $k \in [K-1]$. At round $k$, the algorithm samples each active arm until it has been pulled $n_k$ times in total and computes the empirical average of rewards for each arm. Then, it orders the empirical rewards and calculates the empirical version of the gaps, where the true gaps are defined in the first line of (4). The arm with the highest empirical gap is deactivated: if its empirical reward ranks within the number of arms still to be accepted, it is accepted; otherwise, it is rejected. At the end, the algorithm outputs the $m$ accepted arms as the top $m$ arms.
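The round structure described above can be sketched in code. The sketch below is our own reading of the algorithm, not the paper's reference implementation: in particular, the normalizing constant `C` mirrors SAR's $\overline{\log}(K) = \frac{1}{2} + \sum_{k=2}^{K} \frac{1}{k}$ with $1/k$ replaced by $1/k^p$, and the exact constants used by NSAR may differ.

```python
import math
import random

def nsar(means, n, m, p, seed=0):
    """Sketch of NSAR: one arm is accepted or rejected per round under a
    nonlinear budget allocation. `means` (unknown to the algorithm) are the
    Bernoulli parameters, used here only to simulate pulls."""
    rng = random.Random(seed)
    K = len(means)
    C = 0.5 + sum(k ** (-p) for k in range(2, K + 1))   # assumed normalizer
    active, accepted = list(range(K)), []
    total = [0.0] * K                                   # cumulative reward per arm
    n_prev = 0
    for k in range(1, K):                               # K - 1 rounds
        n_k = math.ceil((n - K) / (C * (K + 1 - k) ** p))
        for i in active:                                # top each arm up to n_k pulls
            total[i] += sum(rng.random() < means[i] for _ in range(n_k - n_prev))
        n_prev = n_k
        # Rank active arms by empirical mean (all have n_k pulls, so cumulative
        # rewards give the same ordering as empirical averages).
        ranked = sorted(active, key=lambda i: total[i], reverse=True)
        a = m - len(accepted)                           # arms still to accept
        if a == 0:                                      # only rejections remain
            out = ranked[-1]
        elif a == len(ranked):                          # only acceptances remain
            out = ranked[0]
            accepted.append(out)
        else:
            # Empirical gap of the best arm vs. the (a+1)-th best, and of the
            # worst arm vs. the a-th best; deactivate the larger of the two.
            gap_top = total[ranked[0]] - total[ranked[a]]
            gap_bot = total[ranked[a - 1]] - total[ranked[-1]]
            if gap_top >= gap_bot:
                out = ranked[0]
                accepted.append(out)
            else:
                out = ranked[-1]
        active.remove(out)
    if len(accepted) < m:               # the last surviving arm completes the set
        accepted.extend(active)
    return sorted(accepted)
```

For example, `nsar([0.9, 0.8, 0.3, 0.2, 0.1], n=2000, m=2, p=1.0)` recovers the SAR allocation, while `p=2.0` spends a larger share of the budget in later rounds.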
Note that our algorithm with the choice $p=1$ amounts to SAR. We will show that in many regimes for the arm gaps, $p \neq 1$ provides better theoretical results, and we further exhibit this efficiency in the numerical experiments of Section IV. The following proposition encapsulates the theoretical guarantee of the algorithm (the proof is given in the appendix).
Proposition 2

Consider the NSAR algorithm in Figure 1 run with budget $n$ and parameter $p > 0$. Then, the misidentification probability satisfies

$$\mathbb{P}\big(\widehat{S}_m \neq \{1,\ldots,m\}\big) \le 2K^2 \exp\left(-\frac{n-K}{8\, C(p)\, H^{(m)}(p)}\right),$$

where $\widehat{S}_m$ denotes the set of $m$ arms accepted by the algorithm.
The performance of NSAR relies on the input parameter $p$, but this choice is qualitative rather than quantitative. In particular, larger values for $p$ increase $H^{(m)}(p)$ and decrease $C(p)$, and hence, there is a trade-off in selecting $p$. According to Table II, to compare NSAR with SAR and AT-LUCB, we have to evaluate the corresponding sample complexities. A fair theoretical comparison with AT-LUCB is delicate, since its complexity measure is in essence slightly different from $H^{(m)}_2$ and $H^{(m)}(p)$. However, we will provide comprehensive simulations in Section IV to compare all algorithms. In this section, we consider two instances for the suboptimality of arms to compare NSAR with SAR:

1) A large group of competitive arms: the top $m$ arms are roughly similar, the gap $\Delta \triangleq \mu_m - \mu_{m+1}$ is non-negligible, and the other arms are just as competitive as each other, i.e., $\Delta^{(m)}_i = \Theta(\Delta)$ for $i > m$.

2) A small group of competitive arms: the top $m$ arms are roughly similar, and $\Delta^{(m)}_i = \Theta(\Delta)$ only for a small number of arms (constant with respect to $K$), while the remaining arms are far less competitive, with considerably larger gaps.
The subsequent corollary follows from Proposition 2. Note that the orders are expressed with respect to $K$.
Corollary 3
Consider the Nonlinear Sequential Accepts and Rejects algorithm in Figure 1. Let the constants $p_1$ and $p_2$ be chosen such that $p_1 \in (0,1)$ and $p_2 > 1$. Then, for the two settings given above, the quantity governing the bound on the misidentification probability in Proposition 2 satisfies

Regime 1 (with $p = p_1$): $C(p_1)\, H^{(m)}(p_1) = O(K \Delta^{-2})$.
Regime 2 (with $p = p_2$): $C(p_2)\, H^{(m)}(p_2) = O(\Delta^{-2})$.
Now let us compare NSAR and SAR using the result of Corollary 3. Returning to Table II and calculating the sample complexities for Regimes 1 and 2, we can derive the following table,

Algorithm: SAR | NSAR
Regime 1: $O(K \log(K) \Delta^{-2})$ | $O(K \Delta^{-2})$
Regime 2: $O(\log(K) \Delta^{-2})$ | $O(\Delta^{-2})$
which shows that with a proper tuning of $p$, we can save a $\log K$ factor in the sample complexity. Though we do not have prior information on the gaps to categorize them precisely, the choice of the input parameter $p$ is qualitative rather than quantitative: when the suboptimal arms are almost equally competitive, $p < 1$ performs better than $p = 1$, and when there are only a few truly competitive arms, $p > 1$ outperforms $p = 1$. Next, we will show in the numerical experiments that a wide range of values for $p$ can result in efficient algorithms with small misidentification error.
IV. Numerical Experiments
We now empirically evaluate our proposed algorithm on a few settings studied in [15]. More specifically, we compare NSAR with SAR, AT-LUCB, and uniform allocation (UNI), where the UNI algorithm simply divides the budget uniformly across the arms. We remark again that AT-LUCB [19] is an anytime algorithm, i.e., it does not require a pre-assigned budget; however, since it can also be used in the fixed-budget setting, we include it in our numerical experiments as a benchmark. We consider $K$ arms and assume a Bernoulli distribution on the rewards. For the following setups, we examine two values for the number of top arms $m$:

1) One group of suboptimal arms: the best arms share one expected reward, and all suboptimal arms share a common, smaller expected reward.

2) Two groups of suboptimal arms: the suboptimal arms are divided into two groups, each with a common expected reward.

3) Three groups of suboptimal arms: the suboptimal arms are divided into three groups, each with a common expected reward.

4) Beta(2,2): the expected values of the Bernoulli distributions are generated according to a beta distribution with shape parameters $\alpha = 2$ and $\beta = 2$.

5) Beta(5,5): the expected values of the Bernoulli distributions are generated according to a beta distribution with shape parameters $\alpha = 5$ and $\beta = 5$.

6) One real competitive arm: a single arm is close to the best arms, while the remaining arms are far from optimal.
We run experiments for each setup with a specific value of $m$, and we estimate the misidentification probability by averaging the error over independent experiment runs. We set the budget in each setup as suggested in [15], and we also choose the parameters of AT-LUCB as instructed in [19].
We illustrate the overall performance of the algorithms in Figure 2 for the different setups. The height of each bar shows the misidentification probability, and the index guideline is as follows: (i) indices 1-5: NSAR with five values of the parameter $p$; (ii) index 6: SAR; (iii) index 7: AT-LUCB; (iv) index 8: UNI. The legends are the same for all of the plots, and hence, they are omitted in most of the plots.
The results are consistent with Corollary 3, and the following comments are in order:

- Setup 1 corresponds to Regime 1 in Corollary 3. As expected, NSAR with a choice of $p < 1$ should outperform SAR, and we observe that this happens. However, in this regime, our algorithm is inferior to AT-LUCB.

- Setups 2, 3, and 6 are considered close to Regime 2 in Corollary 3, as we have a small number of arms competitive with the top arms. Thus, we should choose $p > 1$. We observe that in these setups, for at least two of the five choices of $p$, NSAR outperforms SAR and AT-LUCB. One should observe that the improvement in Corollary 3 is a $\log K$ factor, which increases slowly with $K$. Since we only have a moderate number of arms, using very large values for $p$ is not suitable in these setups, because the increase in $H^{(m)}(p)$ worsens the performance overall. Though for larger values of $K$ the improvement would be more visible, we avoid that due to the prohibitive time complexity of Monte Carlo simulations.

- In Setups 4 and 5, we choose the expected values of the Bernoulli rewards randomly, concentrated around $0.5$. Again, for at least two of the five choices of $p$, our algorithm outperforms SAR and AT-LUCB.

- In all setups, the naive UNI algorithm is outperformed by the other methods.
Overall, the performance of the algorithms depends on the problem environment. If we have prior knowledge of the environment, we can select a suitable algorithm accordingly. The notable feature of NSAR is the incorporation of this prior knowledge into the tuning of $p$, without changing the foundation of the algorithm.
V. Conclusion
We considered $m$-best-arm identification in stochastic multi-armed bandits, where the objective is to find the top $m$ arms in the sense of expected reward. We presented an algorithm based on sequential deactivation of arms in rounds. The key is to allocate the budget of arm pulls in a nonlinear fashion at each round. We proved theoretically and empirically that we can gain from nonlinear budget allocation in several problem environments, compared to the state-of-the-art methods. An important future direction is to propose a method that adaptively fine-tunes the nonlinearity according to the problem environment.
VI. Appendix
Fact 1

(Hoeffding’s inequality) Let $\mathbf{x}_1, \ldots, \mathbf{x}_s$ be independent random variables with support on the unit interval with probability one. If $\bar{\mathbf{x}} = \frac{1}{s}\sum_{t=1}^{s} \mathbf{x}_t$, then for all $\epsilon > 0$, it holds that

$$\mathbb{P}\big(\bar{\mathbf{x}} - \mathbb{E}[\bar{\mathbf{x}}] \ge \epsilon\big) \le \exp(-2 s \epsilon^2).$$
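A quick Monte Carlo sanity check of this one-sided bound (with hypothetical values $s = 50$, $\epsilon = 0.15$, and Bernoulli(0.5) samples) confirms that the empirical deviation frequency stays below $\exp(-2s\epsilon^2)$:

```python
import math
import random

rng = random.Random(1)
s, eps, trials = 50, 0.15, 20000

# One-sided Hoeffding bound for the mean of s samples in [0, 1].
bound = math.exp(-2 * s * eps ** 2)

# Empirical frequency of the deviation event {x-bar - E[x-bar] >= eps}
# for Bernoulli(0.5) samples.
deviations = sum(
    sum(rng.random() < 0.5 for _ in range(s)) / s - 0.5 >= eps
    for _ in range(trials)
)
freq = deviations / trials
```

The bound here is about $0.105$, while the true deviation probability for this instance is much smaller, which is consistent with Hoeffding being a worst-case, distribution-free bound.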
Proof of Proposition 2

Recall that $\hat{\boldsymbol{\mu}}_{i,s}$ denotes the average reward of pulling arm $i$ for $s$ times. Now consider the following event

$$\xi \triangleq \left\{ \forall i \in [K],\ \forall k \in [K-1]: \ \big|\hat{\boldsymbol{\mu}}_{i,n_k} - \mu_i\big| < \tfrac{1}{4}\,\Delta^{(m)}_{(K+1-k)} \right\}.$$
Using Hoeffding’s inequality (Fact 1) together with a union bound, we get

$$\mathbb{P}(\xi^c) \le \sum_{k=1}^{K-1}\sum_{i=1}^{K} 2\exp\left(-\frac{n_k \big(\Delta^{(m)}_{(K+1-k)}\big)^2}{8}\right).$$
Noting the fact that $n_k \ge \frac{n-K}{C(p)(K+1-k)^p}$ and that $(K+1-k)^p \big(\Delta^{(m)}_{(K+1-k)}\big)^{-2} \le H^{(m)}(p)$ by (4), we can use the above to conclude that

$$\mathbb{P}(\xi^c) \le 2K^2 \exp\left(-\frac{n-K}{8\, C(p)\, H^{(m)}(p)}\right).$$
The rest of the proof shows that the event $\xi$ warrants that the algorithm does not make an erroneous decision. This part follows precisely by the induction argument given in [15] (see pages 4-5).
Proof of Corollary 3
First, let us analyze the order of $C(p)$, defined in (3) as

$$C(p) = \frac{1}{2} + \sum_{k=2}^{K} \frac{1}{k^p}.$$

The sum $\sum_{k=2}^{K} k^{-p}$ is convergent as $K \to \infty$ when $p > 1$. Thus, for the regime $p > 1$, the sum is bounded by a constant, i.e., $C(p) = O(1)$. On the other hand, consider $p < 1$, and note that the sum is divergent, and for large $K$ we have $C(p) = O(K^{1-p})$. Now, let us analyze $H^{(m)}(p)$.
For Regime 1, $p = p_1 < 1$ and we have

$$H^{(m)}(p_1) = \max_{i \in [K]} \frac{i^{p_1}}{\big(\Delta^{(m)}_{(i)}\big)^2} = O\!\left(\frac{K^{p_1}}{\Delta^2}\right).$$

Combining with $C(p_1) = O(K^{1-p_1})$, the product $C(p_1)\, H^{(m)}(p_1) = O(K \Delta^{-2})$. For Regime 2, $p = p_2 > 1$ and we have

$$H^{(m)}(p_2) = O\!\left(\frac{1}{\Delta^2}\right),$$

since the maximum is attained by the small group of competitive arms, whose number is constant with respect to $K$. Therefore, combining with $C(p_2) = O(1)$, the product $C(p_2)\, H^{(m)}(p_2) = O(\Delta^{-2})$.
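The dichotomy used in this proof, bounded partial sums for $p > 1$ versus polynomial or logarithmic growth for $p \le 1$, is easy to verify numerically (the concrete values of $p$ and $K$ below are illustrative choices):

```python
import math

def partial_sum(p, K):
    """Partial sum sum_{k=1}^{K} k^(-p), the growth term inside C(p)."""
    return sum(k ** (-p) for k in range(1, K + 1))

# p > 1: bounded by a constant (here by zeta(2) = pi^2/6 for p = 2).
bounded = partial_sum(2, 100_000)

# p = 1: the harmonic sum grows like log K.
harmonic = partial_sum(1, 100_000)

# p < 1: the sum grows polynomially, like K^(1-p)/(1-p) (here ~ 2*sqrt(K)).
poly = partial_sum(0.5, 10_000)
```

This matches the orders used above: $C(p) = O(1)$ for $p > 1$ and $C(p) = O(K^{1-p})$ for $p < 1$, with the harmonic ($p=1$) case sitting in between at $O(\log K)$.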
References
 [1] T. L. Lai and H. Robbins, “Asymptotically efficient adaptive allocation rules,” Advances in Applied Mathematics, vol. 6, no. 1, pp. 4–22, 1985.
 [2] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the multi-armed bandit problem,” Machine Learning, vol. 47, no. 2-3, pp. 235–256, 2002.
 [3] A. Mahajan and D. Teneketzis, “Multi-armed bandit problems,” in Foundations and Applications of Sensor Management. Springer, 2008, pp. 121–151.
 [4] K. Liu and Q. Zhao, “Distributed learning in multi-armed bandit with multiple players,” IEEE Transactions on Signal Processing, vol. 58, no. 11, pp. 5667–5681, 2010.
 [5] K. Wang and L. Chen, “On optimality of myopic policy for restless multi-armed bandit problem: An axiomatic approach,” IEEE Transactions on Signal Processing, vol. 60, no. 1, pp. 300–309, 2012.
 [6] S. Vakili, K. Liu, and Q. Zhao, “Deterministic sequencing of exploration and exploitation for multi-armed bandit problems,” IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 5, pp. 759–767, 2013.
 [7] D. Kalathil, N. Nayyar, and R. Jain, “Decentralized learning for multi-player multi-armed bandits,” IEEE Transactions on Information Theory, vol. 60, no. 4, pp. 2331–2345, 2014.
 [8] J.-Y. Audibert and S. Bubeck, “Best arm identification in multi-armed bandits,” in COLT - 23rd Conference on Learning Theory, 2010.
 [9] E. Even-Dar, S. Mannor, and Y. Mansour, “PAC bounds for multi-armed bandit and Markov decision processes,” in Computational Learning Theory. Springer, 2002, pp. 255–270.
 [10] S. Mannor and J. N. Tsitsiklis, “The sample complexity of exploration in the multi-armed bandit problem,” The Journal of Machine Learning Research, vol. 5, pp. 623–648, 2004.
 [11] S. Bubeck, R. Munos, and G. Stoltz, “Pure exploration in multi-armed bandits problems,” in Algorithmic Learning Theory. Springer, 2009, pp. 23–37.
 [12] ——, “Pure exploration in finitely-armed and continuous-armed bandits,” Theoretical Computer Science, vol. 412, no. 19, pp. 1832–1852, 2011.
 [13] Z. Karnin, T. Koren, and O. Somekh, “Almost optimal exploration in multi-armed bandits,” in Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013, pp. 1238–1246.
 [14] V. Gabillon, M. Ghavamzadeh, and A. Lazaric, “Best arm identification: A unified approach to fixed budget and fixed confidence,” in Advances in Neural Information Processing Systems, 2012, pp. 3212–3220.
 [15] S. Bubeck, T. Wang, and N. Viswanathan, “Multiple identifications in multi-armed bandits,” in Proceedings of The 30th International Conference on Machine Learning (ICML), 2013, pp. 258–265.
 [16] S. Kalyanakrishnan and P. Stone, “Efficient selection of multiple bandit arms: Theory and practice,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 511–518.
 [17] S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone, “PAC subset selection in stochastic multi-armed bandits,” in Proceedings of the 29th International Conference on Machine Learning (ICML-12), 2012, pp. 655–662.
 [18] S. Shahrampour, M. Noshad, and V. Tarokh, “On sequential elimination algorithms for best-arm identification in multi-armed bandits,” IEEE Transactions on Signal Processing, vol. 65, no. 16, pp. 4281–4292, Aug 2017.
 [19] K.-S. Jun and R. D. Nowak, “Anytime exploration for multi-armed bandits using confidence information,” in International Conference on Machine Learning (ICML), 2016, pp. 974–982.
 [20] E. Even-Dar, S. Mannor, and Y. Mansour, “Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems,” The Journal of Machine Learning Research, vol. 7, pp. 1079–1105, 2006.
 [21] K. Jamieson, M. Malloy, R. Nowak, and S. Bubeck, “lil’ UCB: An optimal exploration algorithm for multi-armed bandits,” in Proceedings of The 27th Conference on Learning Theory, 2014, pp. 423–439.
 [22] L. Chen, J. Li, and M. Qiao, “Nearly instance optimal sample complexity bounds for top-k arm selection,” arXiv preprint arXiv:1702.03605, 2017.
 [23] H. Jiang, J. Li, and M. Qiao, “Practical algorithms for best-k identification in multi-armed bandits,” arXiv preprint arXiv:1705.06894, 2017.
 [24] J. Chen, X. Chen, Q. Zhang, and Y. Zhou, “Adaptive multiple-arm identification,” arXiv preprint arXiv:1706.01026, 2017.
 [25] Y. Zhou, X. Chen, and J. Li, “Optimal PAC multiple arm identification with applications to crowdsourcing,” in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 217–225.
 [26] E. Kaufmann and S. Kalyanakrishnan, “Information complexity in bandit subset selection,” in Conference on Learning Theory, 2013, pp. 228–251.
 [27] E. Kaufmann, O. Cappé, and A. Garivier, “On the complexity of best-arm identification in multi-armed bandit models,” Journal of Machine Learning Research, vol. 17, no. 1, pp. 1–42, 2016.