We propose the first fully-adaptive algorithm for pure exploration in linear bandits, the task of finding the arm with the largest expected reward, which depends linearly on an unknown parameter. While existing methods partially or entirely fix their sequences of arm selections before observing rewards, our method adaptively changes its arm selection strategy based on past observations at every round. We show that our sample complexity matches the achievable lower bound up to a constant factor in an extreme case. Furthermore, we evaluate the performance of the methods through simulations based on both synthetic settings and real-world data, in which our method shows vast improvement over existing ones.
Fully adaptive algorithm for pure exploration in linear bandits
Liyuan Xu Junya Honda Masashi Sugiyama
The University of Tokyo / RIKEN
1 Introduction
The multi-armed bandit (MAB) problem (Robbins, 1985) is a sequential decision-making problem in which the agent sequentially chooses one arm out of K arms and receives a stochastic reward drawn from a fixed, unknown distribution associated with the chosen arm. While most of the literature on the MAB has focused on the maximization of cumulative rewards, we consider the pure-exploration setting, or the best arm identification problem (Bubeck et al., 2009). Here, the goal of the agent is to identify the arm with the maximum expected reward.
The best arm identification problem has recently gained increasing attention, and a considerable amount of work covers many of its variants. For example, Audibert and Bubeck (2010) considered the fixed-budget setting, where the agent tries to minimize the misspecification probability within a fixed number of trials, and Even-Dar et al. (2006) introduced the fixed-confidence setting, where the agent tries to minimize the number of trials until the probability of misspecification becomes smaller than a fixed threshold.
An important extension of the MAB is the linear bandit (LB) problem (Auer, 2002). In the LB problem, each arm i has its own feature x_i ∈ R^d, and the expected reward of arm i can be written as x_i^T θ*, where θ* ∈ R^d is an unknown parameter and x^T denotes the transpose of x. Although there are a number of studies on the LB (Abbasi-Yadkori et al., 2011, Li et al., 2010), most of them aim at the maximization of the cumulative rewards, and only a few consider the pure-exploration setting.
In spite of the scarce literature, the best arm identification problem on LB has a wide range of applications. For example, Hoffman et al. (2014) applied pure exploration in LB to the optimization of a traffic sensor network and to automatic hyper-parameter tuning in machine learning. Furthermore, even if the goal of the agent is to maximize cumulative rewards, as in the case of news recommendation (Li et al., 2010), considering the pure-exploration setting is sometimes helpful when the system cannot process feedback in real time once launched.
The first work that addressed the LB best arm identification problem was by Hoffman et al. (2014). They studied best arm identification in the fixed-budget setting with correlated reward distributions and devised an algorithm called BayesGap, which is a Bayesian version of a gap-based exploration algorithm (Gabillon et al., 2012).
Although BayesGap outperformed algorithms that ignore the correlation, it has the drawback that it never pulls arms that have turned out to be sub-optimal, which can significantly harm the performance in LB. For example, consider the case where there are three arms with features x_1, x_2, and x_3. If x_1 and x_2 point in almost the same direction, then the expected rewards of arms 1 and 2 are close to each other; hence it is hard to figure out the best arm just by observing samples from them. On the other hand, pulling arm 3, whose feature has a large component along the direction x_1 − x_2, greatly reduces the number of samples required, since it enhances the accuracy of the estimation of θ* along that direction. As illustrated in this example, selecting a sub-optimal arm can give valuable insight for comparing near-optimal arms in LB.
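To make this concrete, here is a small numerical sketch (Python with NumPy; the three feature vectors are hypothetical stand-ins chosen by us, not values from the paper) showing that spending a few pulls on a "sub-optimal" arm aligned with the unexplored direction sharply reduces the uncertainty of the gap between the two near-optimal arms:

```python
import numpy as np

# Hypothetical 2-D features: arms 1 and 2 are nearly parallel (hard to
# separate), while arm 3 probes the direction that distinguishes them.
x1 = np.array([1.0, 0.0])
x2 = np.array([0.9, 0.1])
x3 = np.array([0.0, 1.0])

def gap_uncertainty(pulls):
    """||x1 - x2||_{A^{-1}} after the given pulls (small ridge for stability)."""
    A = 1e-6 * np.eye(2)
    for x in pulls:
        A += np.outer(x, x)
    y = x1 - x2
    return float(np.sqrt(y @ np.linalg.solve(A, y)))

# Ten pulls split over arms 1 and 2 vs. ten pulls that include arm 3:
u_without = gap_uncertainty([x1] * 5 + [x2] * 5)
u_with = gap_uncertainty([x1] * 4 + [x2] * 4 + [x3] * 2)
print(u_without > u_with)  # True: pulling arm 3 shrinks the gap uncertainty
```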
Soare et al. (2014) were the first to take this nature into consideration. They studied the fixed-confidence setting and derived an algorithm based on transductive experimental design (Yu et al., 2006), called XY-static allocation. The algorithm employs a static arm selection strategy, in the sense that it fixes all arm selections before observing any reward. Therefore, it is not able to focus on estimating near-optimal arms, and thus the algorithm can only be worst-case optimal.
In order to develop more efficient algorithms, it is necessary to pull arms adaptively based on past observations, so that most samples are allocated to the comparison of near-optimal arms. The difficulty in constructing an adaptive strategy is that a confidence bound for statically selected arms is not always applicable when arms are adaptively selected. In particular, a confidence bound for an adaptive strategy introduced by Abbasi-Yadkori et al. (2011) is looser than a bound for a static strategy derived from Azuma's inequality (Azuma, 1967) by a factor of √d in some cases, where d is the dimension of the features. Soare et al. (2014) tried to mitigate this problem by introducing a semi-adaptive algorithm called XY-adaptive allocation, which divides rounds into multiple phases and uses different static allocations in different phases. Although this theoretically improves the sample complexity, the algorithm has to discard all samples collected in previous phases to keep the confidence bound for static strategies applicable, which degrades the empirical performance significantly.
To discuss the tightness of the sample complexity of XY-adaptive allocation, Soare et al. (2014) introduced the XY-oracle allocation algorithm, which assumes access to the true parameter θ* for selecting arms to pull. They argued that the sample complexity of this algorithm can be used as a lower bound on the sample complexity for this problem and claimed that the upper bound on the sample complexity of XY-adaptive allocation is close to this lower bound. However, the derived upper bound is not given in an explicit form and contains a complicated term coming from the XY-static allocation used as a subroutine. In fact, the sample complexity of XY-adaptive allocation is much worse than that of XY-oracle allocation, as we will see numerically in Section 7.1.
Our contribution is a novel fully adaptive algorithm, which changes its arm selection strategy based on all of the past observations at every round. Although this prohibits us from using the tighter bound for static strategies, we show that the extra √d factor can be avoided by a careful construction of the confidence bound, and the resulting sample complexity almost matches that of XY-oracle allocation. We conduct experiments to evaluate the performance of the proposed algorithm, showing that it requires ten times fewer samples than existing methods to achieve the same level of accuracy.
2 Problem formulation
We consider the LB problem, where there are K arms with features x_1, …, x_K ∈ R^d. We denote the set of features by X and the largest ℓ2-norm of the features by L. At every round t, the agent selects an arm a_t and observes an immediate reward r(t), which is characterized by
r(t) = x_{a_t}^T θ* + ε_t.
Here, θ* ∈ R^d is an unknown parameter, and ε_t represents a noise variable whose expectation is zero. We assume that the ℓ2-norm of θ* is at most S and that the noise distribution is conditionally σ-sub-Gaussian, which means that the noise variable satisfies
E[exp(λ ε_t) | a_1, ε_1, …, a_{t−1}, ε_{t−1}, a_t] ≤ exp(λ²σ²/2)
for all λ ∈ R. This condition requires the noise distribution to have zero expectation and variance at most σ² (Abbasi-Yadkori et al., 2011). As in prior work (Abbasi-Yadkori et al., 2011, Soare et al., 2014), we assume that the parameters σ and S are known to the agent.
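As a minimal sketch of this reward model (Python with NumPy; the dimensions, seed, and values below are our own illustrative choices), a linear-bandit environment can be simulated as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

d, K = 5, 8
theta_star = rng.normal(size=d)        # unknown parameter (norm bound S assumed)
features = rng.normal(size=(K, d))     # arm features x_1, ..., x_K
sigma = 1.0                            # Gaussian noise is sigma-sub-Gaussian

def pull(arm):
    """Observe a stochastic reward r(t) = x_{a_t}^T theta* + noise."""
    return features[arm] @ theta_star + sigma * rng.normal()

best_arm = int(np.argmax(features @ theta_star))  # the arm to be identified
```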
We focus on the (ε, δ)-best arm identification problem. Let a* = argmax_i x_i^T θ* be the best arm, and let x_{a*} be its feature. The problem is to design an algorithm that finds an arm â satisfying
P((x_{a*} − x_â)^T θ* ≤ ε) ≥ 1 − δ
as fast as possible.
3 Confidence Bounds
In order to solve best arm identification in the LB setting, the agent sequentially estimates θ* from past observations and bounds the estimation error. However, if arms are selected adaptively based on past observations, the estimation becomes much more complicated than in the case where the pulled arms are fixed in advance. In this section, we discuss this difference and show how to construct a tight bound for an algorithm with an adaptive selection strategy.
Given the sequence of arm selections a_1, …, a_t, one of the most standard estimators of θ* is the least-squares estimator given by
θ̂_t = A_t^{-1} b_t,
where A_t and b_t are defined as
A_t = Σ_{s=1}^{t} x_{a_s} x_{a_s}^T,   b_t = Σ_{s=1}^{t} r(s) x_{a_s}.
Proposition 1 (Soare et al., 2014, Proposition 1).
Let the noise variable ε_t be bounded as |ε_t| ≤ σ for all t. Then, for any fixed sequence of arm selections, the statement
holds for all t and all x ∈ R^d with probability at least 1 − δ.
Here, the matrix norm is defined as ||x||_A = √(x^T A x). The assumption that the sequence of arm selections is fixed is essential in Prop. 1. In fact, if a_t is adaptively determined depending on past observations, then the estimator is no longer unbiased, and it becomes essential to consider the regularized least-squares estimator given by
θ̂_t^λ = (A_t^λ)^{-1} b_t, where A_t^λ is defined by
A_t^λ = λI + Σ_{s=1}^{t} x_{a_s} x_{a_s}^T
for λ > 0 and the identity matrix I ∈ R^{d×d}. For this estimator, we can use another confidence bound, which is valid even if an adaptive strategy is used.
Proposition 2 (Abbasi-Yadkori et al., 2011, Theorem 2).
In the LB with conditionally σ-sub-Gaussian noise, if the ℓ2-norm of the parameter θ* is at most S and the arm selection a_t depends only on past observations, then the statement
|x^T(θ̂_t^λ − θ*)| ≤ C_t ||x||_{(A_t^λ)^{-1}}
holds for given δ > 0 and all t ≥ 0 and x ∈ R^d with probability at least 1 − δ, where C_t is defined as
C_t = σ √(2 log( det(A_t^λ)^{1/2} / (λ^{d/2} δ) )) + λ^{1/2} S.
Moreover, if ||x_{a_s}||_2 ≤ L holds for all s, then
det(A_t^λ) ≤ (λ + tL²/d)^d.
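For concreteness, the regularized least-squares estimator and the matrix norm appearing in these bounds can be computed as follows (a Python/NumPy sketch with our own variable names):

```python
import numpy as np

def ridge_estimate(X, r, lam=1.0):
    """Regularized least squares: X is the (t, d) matrix of pulled features,
    r the length-t reward vector, lam the regularization parameter lambda."""
    d = X.shape[1]
    A = lam * np.eye(d) + X.T @ X          # A_t^lambda = lam*I + sum x_s x_s^T
    theta_hat = np.linalg.solve(A, X.T @ r)
    return theta_hat, A

def matrix_norm(x, A):
    """||x||_{A^{-1}} = sqrt(x^T A^{-1} x): uncertainty of x^T theta_hat."""
    return float(np.sqrt(x @ np.linalg.solve(A, x)))
```

With noiseless rewards the estimate shrinks toward zero as lam grows, which is the bias the λ^{1/2}S term in the bound accounts for.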
4 Arm Selection Strategies
In order to minimize the number of samples, the agent has to select arms that reduce the interval of the confidence bound as fast as possible. In this section, we discuss such an arm selection strategy; in particular, we consider strategies that reduce the matrix norm ||x_i − x_j||, which represents the uncertainty in the estimation of the gap of expected rewards between arms i and j.
Soare et al. (2014) introduced the strategy called XY-static allocation, which makes the sequence of selections
(a_1, …, a_n) = argmin max_{y ∈ Y} ||y||_{A_n^{-1}}.   (5)
In (5), Y is the set of directions defined as Y = {x − x′ : x, x′ ∈ X, x ≠ x′}. The problem is to minimize the confidence bound of the direction that is hardest to estimate, which is known as transductive experimental design (Yu et al., 2006). Note that this problem does not depend on the past rewards, which satisfies the prerequisite of Prop. 1.
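A greedy sequential approximation of this static design (our own sketch; the allocation in the original paper is computed differently, e.g., via relaxation and rounding) pulls, at each step, the arm that most reduces the worst-case uncertainty over all directions:

```python
import numpy as np

def static_allocation(features, n_rounds, ridge=1e-6):
    """Greedily minimize max_{y in Y} ||y||_{A^{-1}} over arm pulls."""
    K, d = features.shape
    dirs = [features[i] - features[j]
            for i in range(K) for j in range(K) if i != j]
    A = ridge * np.eye(d)
    pulls = []
    for _ in range(n_rounds):
        def worst_dir(a):
            A_new = A + np.outer(features[a], features[a])
            return max(y @ np.linalg.solve(A_new, y) for y in dirs)
        a = min(range(K), key=worst_dir)      # arm with the best worst case
        A += np.outer(features[a], features[a])
        pulls.append(a)
    return pulls
```

Note that the chosen sequence depends only on the features and never on observed rewards, which is exactly the property that makes Prop. 1 applicable.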
A drawback of this strategy is that it treats all directions equally. Since our goal is to find the best arm a*, we are not interested in estimating the gaps between all pairs of arms, but only the gaps between the best arm and the rest. Therefore, we should focus on the directions of the form x* − x, where x* is the feature of the best arm. Furthermore, even these directions are not equally important, since we need more samples to distinguish arms whose expected rewards are close to that of the best arm.
In order to overcome this weakness while still using Prop. 1, Soare et al. (2014) proposed a semi-adaptive strategy called the XY-adaptive strategy. This strategy partitions rounds into multiple phases; the arms to select are static within a phase but change between phases. At the beginning of phase j, it constructs a set of potentially optimal arms based on the samples collected during the previous phase j − 1. Then, it selects the sequence in phase j as in (6), over the directions induced by this set. As it goes through the phases, the size of the set decreases, so that the algorithm can focus on discriminating a small number of arms.
Although the XY-adaptive strategy can avoid the extra √d factor in (4), the agent has to reset the design matrix at the beginning of each phase in order to make Prop. 1 applicable. As experimentally shown in Section 7, this considerably degrades the empirical performance.
Our approach, on the other hand, is fully adaptive: it selects arms based on all of the past observations at every round. More specifically, at every round t, the algorithm chooses (but does not pull) a pair of arms, i_t and j_t, the gap of which needs to be estimated. Then, it selects an arm so that the sequence of selected arms becomes close to the optimal allocation for estimating the direction x_{i_t} − x_{j_t}. Although Prop. 1 is no longer applicable with our strategy, the strategy can focus on estimating the gap between the best arm and near-optimal arms.
5 LinGapE Algorithm
In this section, we present a novel algorithm for (ε, δ)-best arm identification in LB. We name the algorithm LinGapE (Linear Gap-based Exploration), as it is inspired by UGapE (Gabillon et al., 2012).
The entire algorithm is shown in Algorithm 1. At each round, LinGapE first chooses two arms: the arm i_t with the largest estimated reward and the most ambiguous arm j_t. Then, it pulls the arm that is most informative for estimating the gap of their expected rewards (Line 1 in Algorithm 1).
The procedure for choosing arms i_t and j_t is presented in Algorithm 2, where we denote the estimated gap by Δ̂(i, j) = (x_i − x_j)^T θ̂_t^λ and the confidence interval of the estimation by β(i, j), defined as the width of the bound in (3) evaluated at x = x_i − x_j.
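A sketch of this pair selection (Python/NumPy, our own function names; the confidence width beta is passed in as a callable since its exact form is given by (3)):

```python
import numpy as np

def choose_pair(features, theta_hat, A, beta):
    """Pick the empirically best arm i and the most ambiguous arm j, i.e. the
    competitor whose upper confidence bound on the gap against i is largest."""
    est = features @ theta_hat
    i = int(np.argmax(est))
    def gap_ucb(j):
        y = features[j] - features[i]
        return est[j] - est[i] + beta(float(np.sqrt(y @ np.linalg.solve(A, y))))
    j = max((k for k in range(len(features)) if k != i), key=gap_ucb)
    return i, j, gap_ucb(j)  # sampling can stop once this bound drops below epsilon
```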
5.1 Arm Selection Strategy
After choosing arms i_t and j_t, the algorithm has to select an arm a_t that most decreases the confidence bound β(i_t, j_t), or equivalently, the matrix norm of x_{i_t} − x_{j_t}. As in Soare et al. (2014), we propose two procedures for this.
One is to select arms greedily, that is, to pull the arm whose observation most decreases the matrix norm of x_{i_t} − x_{j_t}, as in (9). We were not able to obtain a theoretical guarantee on the performance of this greedy strategy, though our experiments show that it performs well.
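A minimal sketch of such a greedy rule (our own version; the features in the example are hypothetical): pull the arm whose single additional observation most shrinks the uncertainty of the chosen direction.

```python
import numpy as np

def greedy_pull(features, A, i, j):
    """Return the arm minimizing ||x_i - x_j||_{(A + x_a x_a^T)^{-1}}."""
    y = features[i] - features[j]
    def posterior_uncertainty(a):
        A_new = A + np.outer(features[a], features[a])
        return float(y @ np.linalg.solve(A_new, y))
    return min(range(len(features)), key=posterior_uncertainty)

# With orthogonal near-optimal arms, the informative pull is the third arm,
# whose feature is aligned with the direction x_0 - x_1:
feats = np.array([[1.0, 0.0], [0.0, 1.0], [2**-0.5, -(2**-0.5)]])
print(greedy_pull(feats, np.eye(2), 0, 1))  # -> 2
```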
The other is to pull arms according to a fixed ratio p given in (10), where w* is the solution of the linear program (11). We pull the arm that makes the empirical ratio of arm selections closest to the ratio p. More precisely, a_t is determined by (12), where T_i(t) is the number of times that arm i has been pulled up to the t-th round. This strategy is a little more complicated than the greedy strategy in (9) but enjoys a simple theoretical property, on which we base our analysis.
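One standard way to implement such tracking (a heuristic sketch of our own; the exact rule in (12) may use different tie-breaking) is to pull the arm whose empirical count lags its target share the most:

```python
import numpy as np

def track_ratio(counts, p):
    """counts[a] = pulls of arm a so far; p[a] = target share from (10)."""
    counts = np.asarray(counts, dtype=float)
    target = (counts.sum() + 1.0) * np.asarray(p, dtype=float)
    return int(np.argmax(target - counts))   # most under-sampled arm

print(track_ratio([3, 1], [0.5, 0.5]))  # -> 1 (arm 1 lags its 50% share)
```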
LinGapE is capable of solving (ε, δ)-best arm identification regardless of which strategy is employed, as stated in the following theorem.
The proof can be found in Appendix C.
5.2 Comparison of Confidence Bounds
A distinctive character of LinGapE is that it considers an upper confidence bound on the gap of rewards, while UGapE and other algorithms for LB, such as OFUL (Abbasi-Yadkori et al., 2011), consider an upper confidence bound on the reward of each arm. The latter approach is not suited for pure exploration in LB, where the gap plays an essential role.
The following example illustrates the importance of considering such quantities. Consider three arms whose features are x_1, x_2, and x_3. Assume that x_1^T θ̂ > x_2^T θ̂, so the estimated best arm is arm 1. Now, consider the case where we are already confident that arm 1 is better than arm 2 but still unsure about arm 3. In such a case, algorithms considering an upper confidence bound on the reward of each arm, such as UGapE, choose arm 2 as j_t, since it has a larger estimated expected reward and a longer confidence interval than arm 3. However, this is not efficient, since arm 2 cannot have a larger expected reward than arm 1 when we are already confident about the sign of their gap. On the other hand, LinGapE can avoid this problem, since the confidence interval of the gap between arms 1 and 3 is longer than that between arms 1 and 2.
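The point can also be checked numerically: under a correlated design, the confidence width of a gap can be far smaller than the per-arm widths suggest (the design matrix and features below are our own hypothetical values):

```python
import numpy as np

A = np.array([[10.0, -9.0], [-9.0, 10.0]])   # strongly correlated design
x1, x2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])

def width(x):
    """||x||_{A^{-1}}, proportional to the confidence width along x."""
    return float(np.sqrt(x @ np.linalg.solve(A, x)))

gap_width = width(x1 - x2)             # direct bound on the gap
sum_widths = width(x1) + width(x2)     # naive per-arm bound on the gap
print(gap_width < sum_widths / 4)      # True: the gap bound is much tighter
```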
6 Sample Complexity
In this section, we give an upper bound on the sample complexity of LinGapE and compare it with those of existing methods.
6.1 Sample Complexity
Here, we bound the sample complexity of LinGapE when the arms to pull are selected by (12). Let the problem complexity H_ε be defined as
where Δ_i is defined as
and ρ(y) is the optimal value of problem (11), denoted as
Then, the sample complexity of LinGapE can be bounded as follows.
The proof can be found in Appendix C. The theorem gives two types of sample complexity bounds. The first bound (16) is more practically applicable, since its condition can be checked using known quantities. On the other hand, we cannot ensure that the condition of the second bound is satisfied, since we cannot know the gaps in advance. However, the second bound (17) can be tighter than the first one (16) if there are only a few directions that need to be estimated accurately. In such a case, the additional term is much smaller than the leading term.
6.2 Discussion on Problem Complexity
The problem complexity (13) has an interesting relation with that of the XY-oracle allocation algorithm introduced by Soare et al. (2014). They considered the case where the agent knows the true parameter θ* when selecting an arm to pull and tries to confirm that arm a* is actually the best arm. Then, an efficient strategy is to let the sequence of arm selections be as in (18).
An upper bound on the sample complexity of XY-oracle allocation is proved to be proportional to a problem complexity H_Δ, which is defined as
This is expected to be close to the achievable lower bound of the problem complexity (Soare et al., 2014). Here, we prove a theorem that clarifies the relation between H_Δ and our problem complexity H_ε.
Let H_0 be the problem complexity of LinGapE (13) when ε is set to 0. Then, we have
where M is defined as
The proof of the theorem can be found in Appendix E. Since M is at most K, this result shows that our problem complexity matches the lower bound up to a factor of K, the number of arms. Furthermore, if all but one of the gaps are large, that is, if there is only one near-optimal arm, then M becomes close to a constant, and hence our problem complexity achieves the lower bound up to a constant factor.
Soare et al. (2014) claimed that XY-adaptive allocation achieves this lower bound as well. To be precise, they argued that the sample complexity of XY-adaptive allocation is bounded by the sample complexity of XY-oracle allocation plus an additional term. Nevertheless, they did not give an explicit bound on this additional term, which stems from the static strategy employed in each phase. Our experiments in Section 7 show that it can be as large as the sample complexity of XY-static allocation, whose problem complexity can be arbitrarily larger than H_Δ in some cases (Soare et al., 2014). Therefore, LinGapE is the first algorithm that always achieves the lower bound up to a factor of K.
We point out another interpretation of our problem complexity. If the feature set equals the set of canonical bases {e_1, …, e_d}, then the LB problem reduces to the ordinary MAB problem, and both ρ and Δ_i can be computed in closed form.
Therefore, if the noise variable is bounded in the interval [−σ, σ], which is a sufficient condition for being σ-sub-Gaussian, the problem complexity becomes proportional to
the problem complexity of UGapE (Gabillon et al., 2012). This fact suggests that LinGapE incorporates the linear structure into UGapE from the perspective of the problem complexity.
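The reduction can be sanity-checked numerically: with canonical-basis features, the squared gap uncertainty decomposes into the familiar per-arm MAB terms, ||e_i − e_j||²_{A^{-1}} = 1/T_i + 1/T_j (the pull counts below are arbitrary):

```python
import numpy as np

T = np.array([4.0, 9.0, 25.0])     # hypothetical pull counts of three arms
A = np.diag(T)                     # design matrix when features are e_1..e_3
e = np.eye(3)

lhs = (e[0] - e[1]) @ np.linalg.solve(A, e[0] - e[1])
print(np.isclose(lhs, 1 / T[0] + 1 / T[1]))  # True: equals 1/T_i + 1/T_j
```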
7 Experiments
In this section, we compare the performance of LinGapE with the algorithms proposed by Soare et al. (2014) through experiments in two synthetic settings and simulations based on real-world data.
7.1 Experiment on Synthetic Data
We conduct experiments in two synthetic settings. One is a setting where an adaptive strategy is suitable, and the other is one where pulling all arms uniformly is the optimal strategy. We fix the noise distribution and run LinGapE with the same parameters ε and δ in both cases. We observed that switching the arm selection strategy between the greedy rule (9) and the ratio-based rule (12) has very little impact on the performance, and we plot the results only for the greedy strategy (9). We repeated each experiment ten times and report the average.
7.1.1 Setting where an Adaptive Strategy is Suitable
The first experiment is conducted in a setting where the adaptive strategy is favored, which was introduced by Soare et al. (2014). We set up the LB problem with K = d + 1 arms, where the features consist of the d canonical bases e_1, …, e_d and one additional feature x_{d+1}. The true parameter θ* is set so that the expected reward of arm d + 1 is very close to that of the best arm compared with the other arms. Hence, the performance heavily depends on how much the agent can focus on comparing arms 1 and d + 1.
Figure 1 is a semi-log plot of the average stopping time of LinGapE, in comparison with the XY-static allocation, XY-adaptive allocation, and XY-oracle allocation algorithms, all of which were introduced by Soare et al. (2014). Their arm selection strategies are given in (5), (6), and (18), respectively. The result indicates the superiority of LinGapE over the existing algorithms.
This difference is due to the adaptive nature of LinGapE. While XY-static allocation treats all directions equally, LinGapE is able to identify the most important direction from the past observations and select arms accordingly. Table 1, which shows the number of times that each arm is pulled, supports this idea. From this table, we can see that while XY-static allocation pulls all arms equally, LinGapE and XY-oracle allocation pull one particular arm frequently. This is an efficient strategy for estimating the gap of expected rewards between the two near-optimal arms, since the feature of that arm is almost aligned with the direction of their difference. We can conclude that LinGapE is able to focus on discriminating the two near-optimal arms, which reduces the total number of required samples significantly.
Although XY-adaptive allocation has an adaptive nature as well, it performs much worse than XY-static allocation in this setting. This is due to the limitation that it has to reset the design matrix at every phase. We observe that it actually succeeds in narrowing down the set of candidate arms in the first few phases. However, it then "forgets" the discarded arms, and the candidate set grows back again. This is because the agent pulls only the few arms relevant to the remaining candidates in a phase, and the design matrix constructed in that phase can no longer discard the other arms. Therefore, the algorithm still handles all arms in the last phase, which requires as many samples as XY-static allocation. Hence, this is an example where the sample complexity of XY-adaptive allocation matches that of XY-static allocation. We observed the same phenomenon in the subsequent two experiments, where XY-adaptive allocation performed at least five times worse than XY-static allocation. Therefore, in order to highlight the difference between XY-static allocation and LinGapE in linear-scale plots, we do not plot the results for XY-adaptive allocation in the following.
It is somewhat surprising that LinGapE wins over XY-oracle allocation, given that the latter assumes access to the true parameter θ*. The main reason for this is that our confidence bound is tighter than that used in XY-oracle allocation. This may seem contradictory, since our confidence bound is looser by a factor of √d in the worst case, as discussed in Section 3. Nevertheless, det(A_t^λ) grows almost linearly with t in our setting, since LinGapE mostly pulls the same arm, as presented in Table 1, which significantly reduces the length of the confidence interval. This suggests that the sample complexity given in Theorem 2 is actually loose, since there we bound the determinant by its worst-case value (see Prop. 3 in Appendix D).
7.1.2 Setting where a Static Strategy is Optimal
We conduct another experiment in a synthetic setting where XY-static allocation is almost optimal. We consider the LB in which the feature set equals the set of canonical bases {e_1, …, e_d}. We set the parameter as θ* = (Δ, 0, …, 0)^T for a gap Δ > 0; hence arm 1 has an expected reward larger by Δ than all other arms. As all the sub-optimal arms have the same gap, we need to estimate all arms equally accurately; therefore, the optimal strategy is to pull all arms uniformly, which corresponds to XY-static allocation.
The results for various gaps Δ are shown in Figure 2. We observe not only that LinGapE performs better than XY-static allocation but also that the performance gap increases as Δ decreases, even though XY-static allocation can be regarded as the optimal strategy there. A reason for this is that while XY-static allocation keeps pulling arms uniformly until all arms satisfy the stopping condition, LinGapE quits pulling arms once they turn out to be sub-optimal, which prevents LinGapE from taking unnecessary samples. This enhances the performance especially for small Δ, where the number of samples needed for discriminating each arm is severely influenced by the noise.
7.2 Simulation Based on Real Data
We conduct another experiment based on a real-world dataset. We use the Yahoo! Webscope Dataset R6A (https://webscope.sandbox.yahoo.com/), which consists of 36-dimensional features accompanied with binary outcomes. It was originally used as an unbiased evaluation benchmark for LB algorithms aiming at cumulative reward maximization (Li et al., 2010), and we slightly modify the setting so that it can be adopted for the pure-exploration setting. We construct the 36-dimensional feature set by random sampling from the dataset, and the reward is generated as a Bernoulli random variable with mean x^T θ̃,
where θ̃ is the regularized least-squares estimator fitted to the original dataset. Although x^T θ̃ is not necessarily bounded in [0, 1], we observe that it is for all features in the dataset; therefore, it is always a valid probability in this case. We compare the performance with the XY-static allocation algorithm, where the estimation is given by the regularized least-squares estimator. The detailed procedure can be found in Appendix A.
The average number of samples required over ten simulations is shown in Figure 3, in which LinGapE performs roughly five times better than the XY-static strategy, and the performance gap increases as we consider more arms. This is because the XY-static strategy tries to estimate all arms equally accurately, while LinGapE is able to focus on estimating the best arm even when there are many arms.
8 Conclusion
In this paper, we studied pure exploration in linear bandits. We first reviewed a drawback of the existing work and then introduced a novel fully adaptive algorithm, LinGapE. We proved that the sample complexity of LinGapE matches the lower bound in an extreme case and confirmed its superior performance in experiments. Since LinGapE is the first algorithm that achieves this lower bound, we would like to consider its various extensions and develop computationally efficient algorithms. In particular, pure exploration in the fixed-budget setting is a promising direction of extension, since LinGapE shares many ideas with UGapE, which is known to be applicable in the fixed-budget setting as well (Gabillon et al., 2012). Furthermore, as explained in Section 7.1, the derived sample complexity may be improved, since the evaluation of the determinant in Prop. 3 given in Appendix D is loose. A bound based on a tight evaluation of the determinant remains as future work.
Acknowledgments
LX utilized the facility provided by the Masason Foundation. JH acknowledges support by KAKENHI 16H00881, and MS acknowledges support by KAKENHI 17H00757.
References
- Abbasi-Yadkori et al. (2011) Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems 24, pages 2312–2320. Curran Associates, Inc., 2011.
- Audibert and Bubeck (2010) J.-Y. Audibert and S. Bubeck. Best arm identification in multi-armed bandits. In Proceedings of the 23rd Conference on Learning Theory, 2010.
- Auer (2002) P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
- Azuma (1967) K. Azuma. Weighted sums of certain dependent random variables. Tohoku Math. J. (2), 19(3):357–367, 1967.
- Bubeck et al. (2009) S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in multi-armed bandits problems. In Proceedings of the 20th International Conference on Algorithmic Learning Theory, pages 23–37. Springer, 2009.
- Chu et al. (2009) W. Chu, S.-T. Park, T. Beaupre, N. Motgi, A. Phadke, S. Chakraborty, and J. Zachariah. A case study of behavior-driven conjoint analysis on Yahoo!: front page today module. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1097–1104. ACM, 2009.
- Even-Dar et al. (2006) E. Even-Dar, S. Mannor, and Y. Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(Jun):1079–1105, 2006.
- Gabillon et al. (2012) V. Gabillon, M. Ghavamzadeh, and A. Lazaric. Best arm identification: A unified approach to fixed budget and fixed confidence. In Advances in Neural Information Processing Systems 25, pages 3212–3220. Curran Associates, Inc., 2012.
- Hoffman et al. (2014) M. Hoffman, B. Shahriari, and N. Freitas. On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, pages 365–374, 2014.
- Li et al. (2010) L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670. ACM, 2010.
- Robbins (1985) H. Robbins. Some aspects of the sequential design of experiments. In Herbert Robbins Selected Papers, pages 169–177. Springer, 1985.
- Sherman and Morrison (1950) J. Sherman and W. J. Morrison. Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. The Annals of Mathematical Statistics, 21(1):124–127, 1950.
- Soare et al. (2014) M. Soare, A. Lazaric, and R. Munos. Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems 27, pages 828–836. Curran Associates, Inc., 2014.
- Yu et al. (2006) K. Yu, J. Bi, and V. Tresp. Active learning via transductive experimental design. In Proceedings of the 23rd International Conference on Machine Learning, pages 1081–1088. ACM, 2006.
Appendix A Detailed Procedure of Simulation Based on Real-World Data
In this appendix, we give the detailed procedure of the experiment presented in Section 7.2. We use the Yahoo! Webscope dataset R6A, which consists of more than 45 million user visits to the Yahoo! Today module collected over 10 days in May 2009. The log describes the interaction (view/click) of each user with one article randomly chosen out of 271 articles. It was originally used as an unbiased evaluation benchmark for LB algorithms in the exploration-exploitation setting (Li et al., 2010). The dataset consists of features describing each user and each article, both expressed as 6-dimensional feature vectors, accompanied with a binary outcome (clicked/not clicked). We use the article-user interaction feature, which is expressed by the Kronecker product of the feature vector of the article and that of the user. Chu et al. (2009) present a detailed description of the dataset, the features, and the collection methodology.
In our setting, we use the subset of the dataset collected on a single day (May 1st). We first conduct regularized linear regression on whether the target is clicked (1) or not clicked (0), with the regularization parameter fixed in advance. Let θ̃ be the learned parameter, which we regard as the "true" parameter in the simulation. We consider the LB whose arm features are sampled from the dataset, restricting the sampled arms so that the problem does not become too hard. The reward at the t-th round is given by
where x_{a_t} is the feature of the arm selected at the t-th round. Although this is not always the case, x^T θ̃ happens to be bounded in [0, 1] for all features in the dataset; therefore, it is always valid as a probability. Furthermore, since the reward is binary with mean in [0, 1], the noise variable is bounded in [−1, 1], which makes it 1-sub-Gaussian. We run LinGapE in this setting with fixed parameters ε, δ, and λ, in comparison with XY-static allocation, where the estimation is given by the regularized least-squares estimator.
Appendix B Derivation of Ratio
The original problem of reducing the interval of the confidence bound for a given direction is to obtain
in the limit of infinitely many samples. Since we choose features from the finite set X in the LB, the problem becomes
where T_a(t) represents the number of times that arm a is pulled before the t-th round.
We first apply a continuous relaxation, which turns the optimization problem (19) into
where w_a corresponds to the ratio T_a(t)/t. Although this relaxed problem can be solved by convex optimization, it is not suited for the LB setting, because the solution depends on the sample size. Therefore, we consider the asymptotic case, where the sample size goes to infinity.
It is proved (Yu et al., 2006, Thm 3.2) that the continuous relaxed problem is equivalent to
Since we consider , there always exists such that . Then, such that cannot be the optimal solution for sufficiently small and thus the optimal solution has to satisfy . Therefore, the asymptotic case of (20) corresponds to the problem
the KKT condition of which yields the definition of the ratio in (10).
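As an illustration of the resulting ratio (our own sketch; in general the weights w* require solving the linear program (11), but when the K = d features are linearly independent, the constraint Σ_a w_a x_a = y pins them down uniquely):

```python
import numpy as np

def pull_ratio(features, y):
    """Solve sum_a w_a x_a = y in the square case, then normalize to a ratio."""
    w = np.linalg.solve(features.T, y)       # features: (K, d) with K = d
    p = np.abs(w) / np.abs(w).sum()          # p_a = |w_a| / ||w||_1
    return w, p

# Canonical bases with target direction e_1 - e_2: pull arms 1 and 2 equally.
w, p = pull_ratio(np.eye(3), np.array([1.0, -1.0, 0.0]))
```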
Lemma 1 is proved via the following auxiliary lemma.
Let A be a positive definite matrix in R^{d×d}, and let x and y be vectors in R^d. Then, for any constant c,
By the Sherman–Morrison formula (Sherman and Morrison, 1950), we have
The last inequality follows from the fact that A is positive definite. ∎
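A quick numerical check of the Sherman–Morrison identity used here, (A + xx^T)^{-1} = A^{-1} − A^{-1}xx^T A^{-1} / (1 + x^T A^{-1} x), with arbitrary values of our own:

```python
import numpy as np

A = np.diag([2.0, 3.0, 5.0])
x = np.array([1.0, -2.0, 0.5])

Ainv = np.linalg.inv(A)
v = Ainv @ x
update = Ainv - np.outer(v, v) / (1.0 + x @ v)   # Sherman-Morrison update
direct = np.linalg.inv(A + np.outer(x, x))        # direct inversion
print(np.allclose(update, direct))  # True
```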
Proof of Lemma 1.
By the definition of , we have
from Lemma 2 and the fact
which can be inferred from the definition of . Therefore, proving
completes the proof of the lemma.
For convenience, we write as . The KKT condition of (21) implies that and satisfy
where corresponds to the Lagrange multiplier. Therefore, the optimal value can be written as
Now, let be denoted as
Then, since , we have
The inequality follows from the fact that both matrices are positive semi-definite. ∎
Appendix C Proofs of Theorems
occurs, where Δ(i, j) is the gap of expected rewards between arms i and j. The following lemma states that this assumption holds with high probability.
The event above holds with probability at least 1 − δ.
Combining Prop. 2 and the union bound proves this lemma.
C.1 Proof of Theorem 1
Let τ be the stopping round of LinGapE. If the returned arm is worse than the best arm by more than ε, then we have
The second inequality follows from the stopping condition, and the last follows from the definition of j_t (Line 2 in Algorithm 2). From this inequality, we can see that returning such an arm implies that the event defined above does not occur. Thus, the probability that LinGapE returns such an arm is at most δ.