Algorithms for Linear Bandits on Polyhedral Sets
Abstract
We study the stochastic linear optimization problem with bandit feedback. The arms take values in an $N$-dimensional space and belong to a bounded polyhedron described by finitely many linear inequalities. We provide a lower bound for the expected regret that scales as $\Omega(N \log T)$. We then provide a nearly optimal algorithm that alternates between exploration and exploitation intervals and show that its expected regret scales as $O(N \log^{1+\epsilon} T)$ for an arbitrarily small $\epsilon > 0$. We also present an algorithm that achieves the optimal regret when the sub-Gaussian parameter of the noise is known. Our key insight is that for a polyhedron the optimal arm is robust to small perturbations in the reward function. Consequently, a greedily selected arm is guaranteed to be optimal when the estimation error falls below a suitable threshold. Our solution resolves a question posed by [1] that left open the possibility of efficient algorithms with asymptotic logarithmic regret bounds. We also show that the regret upper bounds hold with probability 1. Our numerical investigations show that, while the theoretical results are asymptotic, the performance of our algorithms compares favorably to state-of-the-art algorithms in finite time as well.
Manjesh K Hanawal, Department of ECE, Boston University, Boston, MA 02215, mhanawal@bu.edu
Amir Leshem, Department of EE, Bar-Ilan University, Ramat-Gan, Israel 52900, leshema@eng.biu.ac.il
Venkatesh Saligrama, Department of ECE, Boston University, Boston, MA 02215, srv@bu.edu
1 Introduction
Stochastic bandits are sequential decision making problems where a learner plays an action in each round and observes the corresponding reward. The goal of the learner is to collect as much reward as possible or, alternatively, to minimize regret over a period of rounds. Stochastic linear bandits are a class of structured bandit problems where the rewards from different actions are correlated. In particular, the expected reward of each action or arm is expressed as an inner product of a feature vector associated with the action and an unknown parameter which is identical for all the arms. With this structure, one can infer the reward of arms that are not yet played from the observed rewards of other arms. This allows for considering cases where the number of arms is unbounded and playing each arm is infeasible.
Stochastic linear bandits have found rich applications in many fields including web advertisements [2], recommendation systems [3], packet routing, revenue management, etc. In many applications the set of actions is defined by a finite set of constraints. For example, in packet routing, the amount of traffic to be routed on a link is constrained by its capacity. In web advertisement problems, the budget constraints determine the set of available advertisements. It follows that each arm in these applications belongs to a polyhedron.
Bandit algorithms are evaluated by comparing their cumulative reward against the optimal achievable cumulative reward and the difference is referred to as regret. The focus of this paper is on characterizing asymptotic bounds for regret for fixed but unknown reward distributions, which are commonly referred to as problem dependent bounds [4].
We consider linear bandits where the arms take values in an $N$-dimensional space and belong to a bounded polyhedron described by finitely many linear inequalities. We derive an asymptotic lower bound of $\Omega(N \log T)$ for this problem and present an algorithm that is (almost) asymptotically optimal. Our solution resolves a question posed by [1] that left open the possibility of efficient algorithms with asymptotic logarithmic regret bounds. Our algorithm alternates between exploration and exploitation phases, where a set of arms on the boundary of the polyhedron is played in the exploration phases and a greedily selected arm is played super-exponentially many times in the exploitation phase. Due to the simple nature of the strategy we are able to provide upper bounds which hold almost surely. We show that our regret concentrates around its expected value with probability one for all . In contrast, regret for upper confidence bound based algorithms concentrates only at a polynomial rate [5]. Thus, our algorithms are more suitable for risk-averse decision making. A summary of our results and a comparison of regret bounds is given in Table 1. Numerical experiments show that the regret performance of our algorithm compares well against state-of-the-art linear bandit algorithms even for a reasonably small number of rounds, while being significantly better asymptotically.
                     | $K$-armed bandits                       | Linear bandits
                     | Problem dependent | Problem independent | Problem dependent | Problem independent
Lower bounds         |                   |                     |                   |
Upper bounds         |                   |                     |                   |
Efficient algorithm  | UCB1 [6]          | MOSS [7]            | SEE (this paper)  | [4]
Related Work: Our regret bounds are related to those described in [4], who describe an algorithm () with regret bounds that scale as , where is the reward gap defined over extremal points. These algorithms belong to the class of so-called OFU (optimism in the face of uncertainty) algorithms. Since OFU algorithms play only extremal points (arms), one may think that regret bounds can be attained for linear bandits by treating them as $K$-armed bandits, where $K$ denotes the number of extremal points of the set of actions. This possibility arises from the classical results on the $K$-armed bandit problem due to Lai and Robbins [8], who provided a complete characterization of expected regret by establishing a problem dependent lower bound of and then providing an asymptotically optimal algorithm with a matching upper bound. But, as noted in [1, Sec. 4.1, Example 4.5], the number of extremal points can be exponential in $N$, and this renders such an adaptation of multi-armed bandit algorithms inefficient. In the same paper, the authors pose it as an open problem to develop efficient algorithms with logarithmic regret for linear bandits over a polyhedral set of arms. They also remark that since the convex hull of a polyhedron is not strongly convex, the regret guarantees of their PEGE (Phased Exploration Greedy Exploitation) algorithm do not hold.
Our work is close to the FEL (Forced Exploration for Linear bandits) algorithm developed in [17]. FEL separates the exploration and exploitation phases by comparing the current round number against a predetermined sequence. FEL plays randomly selected arms in the exploration intervals and greedily selected arms in the exploitation intervals. However, our policy differs from FEL as follows: 1) we always play a fixed (deterministic) set of arms in the exploration phases; 2) the noise is assumed to be bounded in [17], whereas we consider the more general sub-Gaussian noise model; 3) unlike FEL, our policy does not require computationally costly matrix inversions. FEL provides an expected regret guarantee of only , whereas our policy PolyLin has an optimal regret guarantee. Moreover, the authors in [17] remark that the leading constant in their regret bound can be set proportional to (see the discussion following Theorem 2.4 in [17]), but this seems incorrect in light of the lower bound of we establish in this paper.
In contrast to the asymptotic setting considered here, much of the machine learning literature deals with problem independent bounds. These bounds on regret apply in finite time and for the minimax case, namely, for the worst case over all reward (probability) distributions. [9] established a problem independent lower bound of for multi-armed bandits, which was shown to be achievable in [7]. For linear bandits, problem independent bounds are well studied and are stated in terms of the dimension of the set of arms rather than its size. In [10], for the case of a finite number of arms, a lower bound of with matching upper bounds is established, where denotes the dimension of the set of arms. For the case when the number of arms is infinite or the arms form a bounded subset of a -dimensional space, a lower bound of is established in [4, 1] with matching achievable bounds.
Several variants and special cases of stochastic linear bandits are available depending on what forms the set of arms. The classical stochastic multi-armed bandit introduced by Robbins [11] and later studied by Lai and Robbins [8] is a special case of linear bandits where the set of actions available in each round is the standard orthonormal basis. Auer [12] first studied stochastic linear bandits as an extension of "associative reinforcement learning" introduced in [13]. Since then several variants of the problem have been studied, motivated by various applications. In [2, 14], the linear bandit setting is adopted to study content-based recommendation systems where the set of actions can change at each round (contextual), but their number is fixed. Another variant of linear bandits with a finite action set is spectral bandits [15, 16], where the graph structure defines the set of actions and its size. Several authors [4, 1, 17] have considered linear bandits with arms constituting a (bounded) subset of a finite-dimensional vector space that remains fixed over the learning period. [18] considers cases where the set of arms can change between the rounds but must belong to a bounded subset of a fixed finite-dimensional vector space.
The paper is organized as follows. In Section 2, we describe the problem and set up notation. In Section 3, we derive a lower bound on expected regret and describe our main algorithm SEE and its variant SEE2. In Section 4, we present PolyLin, which is optimal when the sub-Gaussian parameter of the noise is known. In Section 5, we analyze the performance of SEE, and its adaptation to a general polyhedron is discussed in Section 6. In Section 7 we provide probability bounds on the regret of SEE. Finally, we numerically compare the performance of our algorithms against the state of the art in Section 8.
2 Problem formulation
We consider a stochastic linear optimization problem with bandit feedback over a set of arms defined by a polyhedron. Let denote a bounded polyhedral set of arms given by
(1) 
where . At each round , selecting an arm results in reward . We investigate the case where the expected reward for each arm is a linear function regardless of the history. I.e., for any history , there is a parameter , fixed but unknown, such that
Under this setting the noise sequence , where , forms a martingale difference sequence. Let denote the $\sigma$-algebra generated by the noise events and arm selections up to time . Then is measurable and we assume that it satisfies
(2) 
i.e., the noise is conditionally sub-Gaussian, which automatically implies and . We can think of as the conditional variance of the noise. An example of sub-Gaussian noise is , or any bounded distribution over an interval of length and zero mean. In our work, is fixed but unknown.
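As a small numerical illustration of the bounded-noise example, the sketch below assumes that condition (2) is the usual moment-generating-function bound $E[e^{\lambda \eta}] \le e^{\lambda^2 R^2/2}$; the symbol `R` and both function names are chosen for this sketch only. It checks the bound for zero-mean noise taking the values $\pm 1$ with equal probability (an interval of length 2, so `R = 1`).

```python
import math

def rademacher_mgf(lam):
    """E[exp(lam * eta)] for eta uniform on {-1, +1} (zero mean, bounded)."""
    return math.cosh(lam)

def subgaussian_bound(lam, R=1.0):
    """Assumed right-hand side exp(lam^2 R^2 / 2) of the sub-Gaussian condition."""
    return math.exp(lam * lam * R * R / 2.0)

# Check the inequality on a grid of lambda values in [-5, 5].
lams = [k / 10.0 for k in range(-50, 51)]
holds = all(rademacher_mgf(l) <= subgaussian_bound(l) for l in lams)
print(holds)
```

The check relies on the elementary inequality $\cosh(\lambda) \le e^{\lambda^2/2}$, which holds for every real $\lambda$.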
A policy is a sequence of functions such that an arm is selected in round based on the history . Define expected (pseudo) regret of policy over rounds as:
(3) 
where denotes the optimal arm in , which exists and is an extremal point of the polyhedron [19]. (An extremal point of a set is a point that is not a proper convex combination of points in the set.) The expectation is over the random realization of the arm selections induced by the noise process. The goal is to learn a policy that keeps the regret as small as possible. We will also be interested in the regret of the policy, defined as
(4) 
For the above setting, we can use [4] or [1] and achieve optimal regret of order . For linear bandits over a set with finite number of extremal points, one can also achieve regret that scales more gracefully, growing logarithmically in time , using algorithms for the standard multiarmed bandits. Indeed, from fundamentals of linear programming
where denotes the set of extremal points of . Since the set of extremal points is finite for a polyhedron, we can use the standard algorithm of Lai and Robbins [8] or UCB1 in [6], treating each extremal point as an arm, and obtain a (problem dependent) regret bound of order , where denotes the gap between the best and the next best extremal point. However, the leading term in these bounds can be exponential in , rendering these algorithms ineffective. For example, the number of extremal points of can be of the order . Nevertheless, in analogy with the problem independent regret bounds in linear bandits, one wishes to derive problem dependent logarithmic regret where the dependence on the set of arms is only linear in its dimension. Hence we seek an algorithm with regret of order .
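To make the blow-up in the number of extremal points concrete, the following sketch takes the hypercube $[-1, 1]^n$ as the polyhedron (the set also used in the experiments section; the helper names are hypothetical). It enumerates the vertices and finds the best one by brute force, which is exactly the per-arm work that a multi-armed bandit treatment would require.

```python
from itertools import product

def hypercube_vertices(n):
    """All extremal points of the hypercube [-1, 1]^n (there are 2^n of them)."""
    return list(product((-1.0, 1.0), repeat=n))

def best_vertex(theta):
    """Brute-force maximizer of the linear reward theta . x over the vertices."""
    return max(hypercube_vertices(len(theta)),
               key=lambda x: sum(t * xi for t, xi in zip(theta, x)))

# The vertex count doubles with every added dimension.
for n in (2, 5, 10):
    print(n, len(hypercube_vertices(n)))
```

For this particular set the maximizer is simply the sign pattern of the parameter vector, so a method that estimates the parameter directly avoids the enumeration altogether.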
In the following, we first derive a lower bound on the expected regret and develop an algorithm that is (almost) asymptotically optimal.
3 Main results
In this section we provide a lower bound on the expected regret and present our proposed policy and prove the main results regarding its complexity.
3.1 Lower Bound
We establish through a simple example that the regret of any asymptotically optimal linear bandit algorithm is lower bounded as . Recall the fundamental property of linear optimization that an optimal point is always an extremal point. Then any linear bandit algorithm on a polyhedral set of arms always plays the extremal points. We exploit this fact and, by mapping the problem to a standard multi-armed bandit, obtain the lower bound.
We need the following notation to prove the result. Let denote a set of distributions parametrized by and such that each is absolutely continuous with respect to a positive measure on . Let denote the probability density function associated with distribution , and let denote the Kullback-Leibler (KL) divergence between distributions and defined as . Consider a set of arms. We say that arm is parametrized by if its reward is distributed according to .
We are now ready to state the asymptotic lower bound for the linear bandit problem over any bounded polyhedron with positive measure. Without loss of generality, we restrict our attention to uniformly good policies as defined in [8]. We say that a policy is uniformly good if for all , for all .
Theorem 1
Consider any uniformly good policy on a bounded polyhedron with positive measure. For any , let for all . Then,
(5) 
Proof sketch: First, note that the number of extremal points of any bounded polyhedron with positive measure is at least . We can then restrict to a bounded polyhedron with extremal points. Let . The extremal points of are . In the linear bandit problem with unknown parameter , playing the extremal point gives mean reward . Also, by the property of linear optimization, any OFU policy will only play extremal points in every round. Then, the linear bandit over polyhedron is the same as an -armed bandit where the reward of the th arm is distributed as with mean , and the reward of the th arm is distributed as with mean .
The result follows from the Lai-Robbins lower bound for stochastic multi-armed bandits proved in [8], after verifying that the mean values of the parametrized distribution satisfy the required conditions.
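For intuition about the constant in a Lai-Robbins-type bound, the sketch below evaluates $\sum_i \Delta_i / \mathrm{KL}_i$ over the suboptimal arms, under the assumption (made here only for illustration) that rewards are Gaussian with a common known variance, in which case the KL divergence between an arm with gap $\Delta$ and the best arm is $\Delta^2 / (2\sigma^2)$. The function name and the example means are hypothetical.

```python
def lai_robbins_constant(means, sigma=1.0):
    """Sum over suboptimal arms of gap / KL(arm, best arm), Gaussian rewards.

    With common variance sigma^2, KL = gap^2 / (2 sigma^2), so each
    suboptimal arm contributes 2 sigma^2 / gap to the log T coefficient.
    """
    best = max(means)
    total = 0.0
    for mu in means:
        gap = best - mu
        if gap > 0:
            kl = gap * gap / (2.0 * sigma * sigma)
            total += gap / kl
    return total

# Three arms: one optimal, two with gap 0.5.
print(lai_robbins_constant([1.0, 0.5, 0.5]))
```

Each suboptimal arm adds a term to the coefficient of $\log T$, which is why the number of arms that must be distinguished drives the lower bound.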
3.2 Algorithms
The basic idea underlying our proposed technique is based on the following observations for linear optimization over a polyhedron. 1) The set of extremal points of a polyhedron is finite and hence . 2) When is sufficiently close to , then over the set both and give the same value. We exploit these observations and propose a two-stage technique, where we first estimate based on a block of samples and then exploit it for a much longer block. This is repeated with increasing block lengths so that at each point the regret is logarithmic. For ease of exposition, we first consider a polyhedron that contains the origin and postpone the general case to Section 6.
Assume that the polyhedron contains the origin as an interior point. Let denote the th standard unit vector of dimension . For all , let . The subset of arms are the vertices of the largest simplex bounded in . Since we can estimate by repeatedly playing the arm . One can also estimate by playing an interior point for some . But, as we will see later, selecting the maximum possible improves the probability of estimation error.
Algorithm SEE
In our policy, which we refer to as Sequential Estimation Exploitation (SEE), we split the time horizon into cycles, and each cycle consists of an exploration interval followed by an exploitation interval. We index the cycles by and denote the exploration and exploitation intervals in cycle as and , respectively. In the exploration interval , we play each arm in repeatedly for times. At the end of , using the rewards observed for each arm in in the past cycles, we compute the ordinary least squares (OLS) estimate of each component separately and obtain the estimate . Using as a proxy for , we compute a greedy arm by solving a linear program and play it repeatedly for times in the exploitation interval , where is an input parameter. We repeat the process for each cycle. A formal description of SEE is given in the adjacent figure. The estimation in line is computed for all as follows:
(6) 
Note that in the exploration intervals, SEE plays a fixed set of arms and no adaptation happens, adding positive regret in each cycle. The regret incurred in the exploitation intervals starts reducing as the estimation error gets small, and when it falls below the greedy step (line 16) selects the optimal arm and no regret is incurred in the exploitation intervals (Lemma 2). As we will show later, the probability of estimation error decays super-exponentially across the cycles, and hence the probability of playing a suboptimal arm in the exploitation interval also decays super-exponentially.
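The cycle structure of SEE can be sketched in a few lines for the hypercube $[-1, 1]^n$, where the exploration arms are the unit vectors and the greedy linear program reduces to taking signs. The interval lengths used below (`2 * c + 1` exploration plays per arm and `2 ** (c ** (2 / (1 + eps)))` exploitation plays in cycle `c`) are illustrative assumptions of this sketch, not formulas taken from the text; the function and parameter names are likewise hypothetical.

```python
import random

def see_hypercube(theta, cycles=6, eps=0.5, noise_std=1.0, seed=0):
    """Sketch of explore-then-exploit cycles on the hypercube [-1, 1]^n.

    Returns the last greedy arm and the accumulated pseudo-regret.
    """
    rng = random.Random(seed)
    n = len(theta)
    sums = [0.0] * n            # running reward sums for arm e_i
    plays = 0                   # exploration plays per arm so far
    best = sum(abs(t) for t in theta)   # max of theta . x over the cube
    regret = 0.0
    greedy = None
    for c in range(1, cycles + 1):
        # Exploration: play each unit vector a fixed number of times.
        m = 2 * c + 1           # assumed exploration length per arm
        for i in range(n):
            for _ in range(m):
                sums[i] += theta[i] + rng.gauss(0.0, noise_std)
                regret += best - theta[i]
        plays += m
        # Estimate each component separately by the sample average.
        theta_hat = [s / plays for s in sums]
        # Greedy arm: on the hypercube the LP solution is the sign pattern.
        greedy = [1.0 if t >= 0 else -1.0 for t in theta_hat]
        # Exploitation: play the greedy arm for a much longer block.
        length = int(2 ** (c ** (2.0 / (1.0 + eps))))  # assumed length
        gap = best - sum(t * g for t, g in zip(theta, greedy))
        regret += length * gap
    return greedy, regret

arm, total_regret = see_hypercube([0.5, -0.3, 0.8], noise_std=0.1)
print(arm, round(total_regret, 2))
```

Once the estimate has the correct sign pattern, the exploitation blocks add no regret and only the fixed exploration plays contribute, which is the behaviour described above.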
Theorem 2
Let the noise be sub-Gaussian and without loss of generality assume . (For general , we replace it by and the same method works; only is scaled by a constant factor.) Then, the expected regret of SEE with parameter is bounded as follows:
(7) 
where denotes the maximum reward. is a constant that depends on noise parameter and the suboptimality gap .
The parameter determines the length of the exploitation intervals: a larger value means that SEE spends less time in exploitation and more time in exploration, which improves the estimates and reduces the probability of playing a suboptimal arm in the exploitation intervals. Hence the parameter determines how fast the regret concentrates, and the larger its value, the more 'risk-averse' the algorithm is. This motivates us to consider a variant of SEE that is more risk-averse, at the cost of increased expected regret.
3.3 Risk Averse Variant
Our second algorithm, which we refer to as SEE2, is essentially the same as SEE, except for the length of the exploitation intervals, which is exponential instead of super-exponential and does not depend on . Specifically, we play the greedy arm times in cycle . Compared to SEE, SEE2 spends significantly more time in the exploration intervals, and hence the probability that it makes an error in the exploitation intervals is also significantly smaller, and thus its regret concentrates around the expected regret faster.
Theorem 3
Let the noise be sub-Gaussian and . Then, the expected regret of SEE2 is bounded as follows:
(8) 
where is a constant that depends on noise parameter and the suboptimality gap .
4 Optimal Algorithm
We next obtain an optimal algorithm that achieves the lower bound in (5) within a constant factor when the sub-Gaussian parameter is known.
Algorithm PolyLin
In our next policy, which we refer to as Polyhedral-Linear-bandits (PolyLin), we again split the time horizon into cycles consisting of an exploration interval followed by an exploitation interval, as in SEE. As earlier, we index the cycles by and denote the exploration and exploitation intervals in cycle as and , respectively. In the exploration interval , we play each arm in once. After cycles, using the rewards observed for each arm in in the past exploration intervals, we compute the ordinary least squares (OLS) estimate of each component separately, and obtain the estimate as follows.
(9) 
Using as a proxy for we compute a greedy arm and the suboptimality gap as follows.
In the exploitation interval , we play repeatedly for times where is set to , where . We repeat the process for each cycle. A formal description of PolyLin is given in the adjacent figure.
Note that the exploration intervals of PolyLin have fixed length, whereas in SEE they increase as time progresses. Also, the exploitation intervals in PolyLin are adaptive, whereas they are non-adaptive in SEE.
Theorem 4
Let the noise be sub-Gaussian and without loss of generality assume . Then, the expected regret of PolyLin is bounded as follows:
(10) 
where denotes the maximum reward. and are constants that depend on the noise parameter and the suboptimality gap .
5 Regret Analysis
In this section we prove Theorem 2; the proof of Theorem 3 follows similarly and is omitted. We first derive the probability of error in estimating each component of in each cycle. Note that in the exploration stage of each cycle we sample each arm , 2 times more than in the exploration stage of the previous cycle. Thus, we have plays of each arm at the end of cycle . The estimation error of component after cycles is given as follows:
Lemma 1
Let the noise be sub-Gaussian and . In any cycle of both SEE and SEE2, for all we have
(11) 
Note that the larger the value of , the smaller the probability of estimation error. The next lemma gives the probability that we play a suboptimal arm in the exploitation intervals of a cycle.
Lemma 2
For every cycle , we have

(a) Let . The estimation error is bounded as
(12) 
(b) Let . The error in reward estimation is bounded as
(13) 
(c) The probability that we play a suboptimal arm is bounded as
(14)
The proofs of Lemmas 1 and 2 are given in the appendix. Recall that the number of extremal points is finite for the polyhedron and . We use this fact to argue that whenever , the greedy stage of the algorithm selects the optimal arm. This is an important observation and follows from the continuity property of the optimal point in linear optimization theory [19]. Further, the probability of this event decays super-exponentially fast in our policy, implying that the probability that we incur a positive regret in the exploitation intervals gets negligibly small over the cycles. We compute the expected regret incurred in the exploration and exploitation intervals separately.
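The robustness of the optimal arm to small estimation errors can be checked numerically. The sketch below uses the hypercube $[-1, 1]^n$ and a sufficient condition chosen for this illustration (it is not the threshold of Lemma 2): if every component of the estimate is within $\Delta / (2n)$ of the true parameter, where $\Delta$ is the gap between the best and second-best vertex, then the reward of every vertex moves by less than $\Delta / 2$ and the greedy vertex cannot change. All names are hypothetical.

```python
from itertools import product
import random

def value(theta, x):
    """Linear reward theta . x."""
    return sum(t * xi for t, xi in zip(theta, x))

def argmax_vertex(theta, vertices):
    """Greedy arm: the vertex with the largest linear reward."""
    return max(vertices, key=lambda x: value(theta, x))

def reward_gap(theta, vertices):
    """Gap between the best and the second-best vertex reward."""
    values = sorted((value(theta, x) for x in vertices), reverse=True)
    return values[0] - values[1]

theta = [0.5, -0.3, 0.8]
vertices = list(product((-1.0, 1.0), repeat=len(theta)))
gap = reward_gap(theta, vertices)
# Perturbations strictly inside the assumed safe radius gap / (2 n).
radius = 0.99 * gap / (2 * len(theta))
rng = random.Random(1)
x_star = argmax_vertex(theta, vertices)
stable = all(
    argmax_vertex([t + rng.uniform(-radius, radius) for t in theta],
                  vertices) == x_star
    for _ in range(1000)
)
print(gap, stable)
```

Because the set of vertices is finite, the gap is strictly positive whenever the optimal vertex is unique, so such a safe radius always exists.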
5.1 Regret of SEE.
We analyze the regret in the Exploration and Exploitation phases separately as follows.
Exploration regret: At the end of cycle , each arm in is played times. The total expected regret from the exploration intervals after cycles is at most .
Exploitation regret: The total expected regret from the exploitation intervals after cycle is
(15) 
where is a convergent series. After cycles, the total number of plays is and we get . Finally, the expected regret from rounds is bounded as
5.2 Regret of PolyLin.
We analyze the regret in the Exploration and Exploitation phases separately as follows.
Exploration regret: After cycles, each arm in is played times. The total expected regret from the exploration intervals after cycles is at most .
Exploitation regret: The total expected regret from the exploitation intervals after cycles is
(16) 
Now consider the series .

From Lemma 2(a), as almost surely, we get almost surely and which in turn implies almost surely.

Then, for , the difference for all but finitely many . Hence, is finite.
After cycles the total number of plays is , and we get . Finally, the expected regret from rounds, as , is bounded as
Note that for all but finitely many . Then for sufficiently large we get Substituting in the last inequality we get
6 General Polyhedron
In this section we extend the analysis of the previous section to the case where the origin is not an interior point of .
Analogous to the set , we first define a set of arms that lie on the boundary of the polyhedron; these points are computed with respect to an interior point of that we use as a proxy for the origin. We use OPT1 to find an interior point whose smallest distance to the boundaries along all the directions is the largest. The motivation to maximize the minimal distance to the boundaries comes from Lemma 2, where a larger value of implies a smaller probability of estimation error.
OPT1:
subject to:
OPT2:
subject to:
OPT1 can be translated into an equivalent linear program given in OPT2, and hence the point can be efficiently computed. We note that the points need not all lie on the boundary. To see this, suppose the point returned by OPT1 is such that it is closest to the boundary along the th direction. Then the vector with all its components equal to is a solution of OPT1. To overcome this, we further stretch each point along the direction such that it hits the boundary. Let . Finally, we fix the set of arms we use for exploration as .
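The stretching step can be sketched as a ratio test, assuming the polyhedron is written as $\{x : Ax \le b\}$ (a representation assumed here for illustration) and that an interior point is already available. The function name, the example box, and the interior point are hypothetical.

```python
def stretch_to_boundary(A, b, x0, direction):
    """Largest t >= 0 such that A (x0 + t * direction) <= b.

    Only constraints whose left-hand side grows along the direction
    can become tight, so the answer is the smallest slack/slope ratio.
    """
    t_max = float("inf")
    for row, b_j in zip(A, b):
        slope = sum(a * d for a, d in zip(row, direction))
        if slope > 1e-12:
            slack = b_j - sum(a * x for a, x in zip(row, x0))
            t_max = min(t_max, slack / slope)
    return t_max

# Box 0 <= x1 <= 2, 0 <= x2 <= 1 written as A x <= b.
A = [[1, 0], [-1, 0], [0, 1], [0, -1]]
b = [2, 0, 1, 0]
x0 = [0.5, 0.5]                      # an interior point (assumed given)
steps = [stretch_to_boundary(A, b, x0, e) for e in ([1, 0], [0, 1])]
print(steps)
```

Here the point is 1.5 away from the boundary along the first coordinate and 0.5 along the second, so the two exploration arms would be `x0` shifted by those amounts along the respective unit vectors.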
We are now ready to present an algorithm for linear bandits over any polyhedron. For a general polyhedron, we use SEE with the exploration strategy modified as follows. In cycle , we first play the arm for and then play each arm in times as earlier. To estimate the component , we average the differences in rewards observed from arms and so far. By a straightforward modification of the regret analysis of SEE, we can show that the expected regret of the modified algorithm is upper bounded as for all .
The new algorithm requires that we play the arm along with the arms in in the exploration intervals to obtain an estimate of , which increases the length of the exploration intervals. However, it is possible to obtain estimates by playing only the arms in , provided we suitably modify the estimation method. More details are given in the appendix.
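The difference-based estimate can be illustrated as follows. The sketch assumes the exploration arms have the form of an interior point shifted by a known step along each unit vector, so that the difference of average rewards divided by the step recovers one component of the parameter. The names `x0`, `betas`, and `estimate_theta` are introduced for this sketch only.

```python
import random

def estimate_theta(theta, x0, betas, plays, noise_std=1.0, seed=0):
    """Estimate each component from reward differences against a base arm."""
    rng = random.Random(seed)

    def reward(x):
        # Noisy linear reward observed when arm x is played.
        return sum(t * xi for t, xi in zip(theta, x)) + rng.gauss(0.0, noise_std)

    # Average reward of the interior (base) arm.
    base = sum(reward(x0) for _ in range(plays)) / plays
    estimates = []
    for i in range(len(theta)):
        arm = list(x0)
        arm[i] += betas[i]              # shifted arm x0 + beta_i * e_i
        avg = sum(reward(arm) for _ in range(plays)) / plays
        estimates.append((avg - base) / betas[i])
    return estimates

print(estimate_theta([0.5, -0.3], [1.0, 2.0], [0.5, 0.25], plays=200, noise_std=0.1))
```

Dividing by a larger step shrinks the effect of the noise on the estimate, which matches the motivation for pushing the exploration arms out to the boundary.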
7 Probability 1 Regret Bounds
Recall the definitions of expected regret and regret in (3) and (4). In this section we show that, with probability 1, the regret of our algorithms is within a constant factor of their expected regret.
Theorem 5
With probability 1, is and is .
Proof: Let denote the event that we select a suboptimal arm in the th cycle. From Lemma 2, the probability of this event is bounded as . Hence . Now, from an application of the Borel-Cantelli lemma, we get , which implies that almost surely SEE and SEE2 play the optimal arm in all but finitely many cycles. Hence the exploitation intervals contribute only a bounded regret. Since the regret due to the exploration intervals is deterministic, the regret of SEE and SEE2 is within a constant factor of their expected regret with probability 1, i.e., and . This completes the proof.
We note that the regret bounds proved in [4] hold with high confidence, whereas ours hold with probability 1 and hence provide a stronger performance guarantee.
8 Experiments
In this section we investigate the numerical performance of our algorithms against known algorithms. We run the algorithms on a hypercube with dimension . We generated randomly, and the noise is a zero-mean Gaussian random variable with variance in each round. The experiments are averaged over runs. In Fig. 1 we compare SEE and SEE2 against UCB-Normal [20], where we treated each extremal point as an arm of an -armed bandit problem. As expected, our algorithms perform much better. UCB-Normal needs to sample each of the arms at least once before it can start learning the right arm, whereas our algorithm starts playing the right arm after a few cycles of exploration intervals. In Fig. 2, we compare our algorithms against the linear bandit algorithm LinUCB and the self-normalization based algorithm in [18], which is labeled SelfNormalized in the figure. For these we set the confidence parameter to . We see that SEE beats LinUCB by a large margin, and its performance comes close to that of the SelfNormalized algorithm. Note that the SelfNormalized algorithm requires knowledge of the sub-Gaussianity parameter of the noise, whereas our algorithms are agnostic to this parameter. Though SEE2 seems to play the right arm in the exploitation intervals, its regret performance is poor. This is due to the increased number of exploration intervals, where no adaptation happens and a positive regret is always incurred.
The numerical performance of SEE2 can be improved by adaptively choosing the arms played in the exploration intervals as follows, at the cost of increased computational complexity. In each cycle , we find a new set computed by setting to , the greedy arm selected in the previous cycle, and play the new set of arms as in the exploration intervals of the algorithm given for the general polyhedron. However, since is an extremal point, some of the 's are zero. To overcome this, we slightly shift the point into the interior of the polyhedron along the direction and find a new set with respect to the new interior point. The regret of the algorithm based on this adaptive exploration strategy is shown in Fig. 2 with the label 'ImprovedSEE2'. As shown, the modification improves the performance of SEE2 significantly. In all the numerical plots, we initialized the algorithm to run from cycle number .
9 Conclusion
We studied stochastic linear optimization over a polyhedral set of arms with bandit feedback. We provided an asymptotic lower bound for any policy and developed algorithms that are nearly asymptotically optimal. The regret of the algorithms grows (near) logarithmically in $T$ and its growth rate is linear in the dimension of the polyhedron. We showed that the regret upper bounds hold almost surely. The regret growth rate of our algorithms is $O(N \log^{1+\epsilon} T)$ for some $\epsilon > 0$. It would be interesting to develop strategies that work for $\epsilon = 0$ while still maintaining a linear growth rate in $N$.
References
 [1] P. Rusmevichientong and J. N. Tsitsiklis, “Linearly parameterized bandits,” Mathematics of Operations Research, vol. 35, no. 2, pp. 395–411, 2010.
 [2] L. Li, W. Chu, J. Langford, and R. E. Schapire, “A contextual-bandit approach to personalized news article recommendation,” in Proceeding of International World Wide Web conference, WWW, NC, USA, April 2010.
 [3] D. Jannach, M. Zanker, A. Felfernig, and G. Friedrich, Recommender Systems: An Introduction. Cambridge University Press, 2010.
 [4] V. Dani, T. P. Hayes, and S. M. Kakade, “Stochastic linear optimization under bandit feedback,” in Proceeding of Conference on Learning Theory, COLT, Helsinki, Finland, July 2008.
 [5] J.-Y. Audibert, R. Munos, and C. Szepesvári, “Exploration-exploitation tradeoff using variance estimates in multi-armed bandits,” Theoretical Computer Science, vol. 410, pp. 1876–1902, 2009.
 [6] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of multi-armed bandit problem tradeoffs,” Journal of Machine Learning, vol. 3, pp. 235–256, 2002.
 [7] J.-Y. Audibert and S. Bubeck, “Regret bounds and minimax policies under partial monitoring,” Journal of Machine Learning Research, vol. 11, pp. 2635–2686, 2010.
 [8] T. L. Lai and H. Robbins, “Asymptotically efficient adaptive allocation rules,” Advances in Applied Mathematics, vol. 6, no. 1, pp. 4–22, 1985.
 [9] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games. Cambridge University Press, New York, 2006.
 [10] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, “The nonstochastic multiarmed bandit problem,” SIAM Journal on Computing, vol. 32, 2003.
 [11] H. Robbins, “Some aspects of the sequential design of experiments,” Bulletin of the American Mathematical Society, vol. 58, pp. 527–535, 1952.
 [12] P. Auer, “Using confidence bounds for exploitation-exploration trade-offs,” Journal of Machine Learning Research, vol. 3, pp. 397–422, 2002.
 [13] N. Abe and P. M. Long, “Associative reinforcement learning using linear probabilistic concepts,” in Proceeding of International Conference on Machine Learning (ICML), 1999.
 [14] W. Chu, L. Li, L. Reyzin, and R. E. Schapire, “Contextual bandits with linear payoff functions,” in Proceeding of International Conference on Artificial Intelligence and Statistics (AISTATS), 2011, pp. 208–214.
 [15] M. Valko, R. Munos, B. Kveton, and T. Kocák, “Spectral bandits for smooth graph functions,” in Proceeding of International Conference on Machine Learning (ICML), 2014.
 [16] M. Hanawal, V. Saligrama, M. Valko, and R. Munos, “Cheap bandits,” in Proceeding of International Conference on Machine Learning (ICML)(to appear), 2015.
 [17] Y. Abbasi-Yadkori, A. Antos, and C. Szepesvári, “Forced-exploration based algorithms for playing in stochastic linear bandits,” in Proceeding COLT workshop on Online Learning with Limited Feedback, 2009.
 [18] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári, “Improved algorithms for linear stochastic bandits,” in Proceeding of Advances in Neural Information Processing Systems (NIPS), 2011, pp. 2312–2320.
 [19] D. Bertsimas and J. N. Tsitsiklis, Introduction to Linear Optimization. Athena Scientific, Belmont, Massachusetts, 2008.
 [20] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the multiarmed bandit problem,” Machine Learning, vol. 47, no. 2, pp. 235–256, 2002.
Proof of Lemma 1
Let denote the noise in reward from playing in phase for the th time. We bound the estimation error as follows:
(17)  
(18)  
(19)  
(20)  
(21)  
(22)  
(23)  
(24)  
(25) 
where (18) follows from the estimation step given in (6). In (19) and (20) we exponentiated both sides within the probability functions after multiplying them by . (21) follows by applying the union bound and using the symmetry of the noise terms. In (22) we applied the Markov inequality. In (23) we applied the conditional independence property of the noise. (24) follows by applying the definition of the sub-Gaussian property.
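The chain of steps above is the standard Chernoff argument, which for an average of `n` independent noise terms that are sub-Gaussian with parameter `R` yields a tail bound of the form $2\exp(-n\delta^2 / (2R^2))$. The sketch below compares this generic bound (not necessarily the exact constant of Lemma 1) with a Monte Carlo estimate for standard Gaussian noise; the function names and the chosen values of `n` and `delta` are illustrative.

```python
import math
import random

def empirical_tail(n, delta, trials=2000, seed=0):
    """Monte Carlo estimate of P(|average of n N(0,1) samples| > delta)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        if abs(mean) > delta:
            hits += 1
    return hits / trials

def chernoff_bound(n, delta, R=1.0):
    """Generic sub-Gaussian tail bound 2 exp(-n delta^2 / (2 R^2))."""
    return 2.0 * math.exp(-n * delta * delta / (2.0 * R * R))

n, delta = 20, 0.5
print(empirical_tail(n, delta), chernoff_bound(n, delta))
```

The bound decays exponentially in the number of samples, so growing the per-arm sample count across cycles drives the estimation error probability down quickly.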
Proof of Lemma 2
Part a:
Part b:
For all , we have
(30) 
Define events and . The last inequality implies . The claim follows from part (a) of the lemma.
Part c:
Suppose , where is the optimal arm, such that . Then, since , we must have that either or ; otherwise we cannot close the gap. Hence, if the greedy selection in cycle is not , it implies that there exists a such that . From part (b) this probability is bounded as , where . This completes the proof.
Estimation in the case of a general polyhedron
Let . Let denote the average of the rewards obtained from arm until the end of phase . At the end of phase , we estimate as follows:
where denotes the diagonal matrix with diagonal elements as and is the vector with th component as . By applying the matrix inversion lemma we get
After simplification, for each we have
Substituting the reward from arm , i.e.,
and further simplifying we get
where and is the noise average from playing arm .