In this paper, we investigate the impact of context diversity on stochastic linear contextual bandits. As opposed to the previous view that contexts lead to more difficult bandit learning, we show that when the contexts are sufficiently diverse, the learner is able to utilize the information obtained during exploitation to shorten the exploration process, thus achieving reduced regret. We design the LinUCB-d algorithm, and propose a novel approach to analyze its regret performance. The main theoretical result is that under the diverse context assumption, the cumulative expected regret of LinUCB-d is bounded by a constant. As a by-product, our results improve the previous understanding of LinUCB and strengthen its performance guarantee.
In many applications, such as resource allocation in cloud computing platforms, or treatment selection for patients in clinical trials, diverse user preferences and characteristics create an urgent need for personalized decision-making. In order to make optimal decisions for different individuals, the decision maker must learn a model to predict the reward when a decision is taken under different contexts. This problem is often formulated as a contextual bandit problem (Auer:2003:UCB; Langford:2008), which generalizes the classical multi-armed bandit (MAB) framework (lai1985asymptotically; auer2002finite; bubeck2012regret; agrawal2012analysis; agrawal2013further).
The inclusion of contextual information in the decision-making process introduces more uncertainty into the MAB framework and creates significant challenges for the learning problem. Part of the difficulty in contextual MAB comes from the increased problem dimension, as context is added as part of the unknown environment. Existing literature mostly focuses on developing bandit algorithms and providing theoretical analysis by treating the context as structureless side information. The resulting models and algorithms are generic, but represent a “worst case” scenario since very little, if any, structure of the context is exploited.
In many real-world applications, however, context often exhibits a sufficient level of diversity, which has been largely overlooked in the existing studies. For example, the user profile is treated as the context in recommendation systems (Li:2010:LinUCB). When the system serves a large number of users, the group of user profiles is likely to be very diverse. As another example, contextual MAB has been adopted in service placement of mobile edge computing, which utilizes the time of day and mobile user types as the context (Chen2019).
It is not difficult to see that in these applications, the context arrival exhibits sufficient diversity that may be beneficial to the bandit algorithm design. Intuitively, the diverse contexts create opportunities for the learner to reduce the learning regret: when an arm is pulled frequently as the optimal arm for certain contexts, its parameters can be estimated accurately with the rewards obtained during exploitation. Therefore, the learner does not have to spend much time exploring it when the instantaneous context is not in favor of it, thus shortening the exploration stage and speeding up the convergence.
In this paper, we demonstrate this optimistic view of context diversity by investigating the impact of diverse contextual information on the learning regret under the stochastic linear contextual bandits framework (Bastani2017ExploitingTN). We show that, instead of considering the context as part of the “uncontrollable” environment and passively “reacting” to the incoming context, proactively interacting between context and arm exploration allows learning to transfer between different contexts and leads to much better overall performance. Specifically, we consider a set of arms, where the parameter of each arm is represented by a -dimensional vector unknown to the learner. When an arm is pulled under a context , the obtained reward is the inner product of the parameter of the arm and a feature vector determined by arm and context , corrupted by noise. The objective of the learner is to use the information contained in the observed rewards to decide an arm to pull in response to the instantaneous context. Assuming independent and identically distributed (i.i.d.) contexts, we aim to show that when the contexts are sufficiently diverse, the cumulative learning regret in expectation can be bounded by a constant.
The main contributions of this paper are three-fold. First, we formally introduce the concept of context diversity into the stochastic linear contextual bandit framework and present a novel geometric interpretation. Such geometric interpretation provides an intuitive viewpoint to understand and analyze the impact of context diversity on the learning performance of stochastic linear contextual bandits.
Second, we propose an Upper Confidence Bound (UCB) type algorithm, termed LinUCB-d, for the contextual bandit model. The new formulation of LinUCB-d enables a unique approach to characterize the impact of context diversity and achieve finite cumulative regret. The results also extend the existing understanding of LinUCB and strengthen its performance guarantee.
Third, we design a novel approach to analyze the performance of LinUCB-d. There are two distinct features of our approach: First, we relate the uncertainty in the estimated rewards with the solution to a constrained optimization problem, and leverage the optimality of the estimator to bound the corresponding frequency of bad decisions during the learning process. Second, we propose a frame-based approach to isolate the error events on a frame basis and make the regret tractable. These techniques are novel and may prove useful in other related settings.
2 Problem Formulation
We consider a set of items (arms) denoted as . Assume for each there is a fixed but unknown parameter vector . At each time , the learner observes a random context , which is generated according to an unknown distribution. Next, the learner decides to pull an arm based on the information available. The incurred reward is given by , where is a random noise, and is a linear function of and the feature vector , i.e.,
Here we use to denote the transpose of vector .
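To make the reward model concrete, the following minimal Python sketch generates one noisy reward as the inner product of an arm parameter and a feature vector, plus zero-mean noise. The names `theta_a` and `x_ac` and all numeric values are hypothetical, and standard Gaussian noise is used only as one convenient example of a noise distribution.

```python
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def expected_reward(theta_a, x_ac):
    """Expected reward: linear in the arm parameter and the feature vector."""
    return dot(theta_a, x_ac)


def observe_reward(theta_a, x_ac, rng):
    """Noisy reward: expected reward plus standard Gaussian noise (an assumption)."""
    return expected_reward(theta_a, x_ac) + rng.gauss(0.0, 1.0)


# Hypothetical 2-dimensional example.
theta = [1.0, 2.0]   # unknown to the learner in the actual problem
x = [0.5, 0.25]      # feature vector for one (arm, context) pair
rng = random.Random(0)
mean = expected_reward(theta, x)
sample = observe_reward(theta, x, rng)
```

The learner only ever sees quantities like `sample`; `mean` is shown here purely to illustrate what is being estimated.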
Let be the set of all contexts. For any , define
i.e., is the subset of contexts under which arm is the best arm rendering the maximum expected reward, and is the set of feature vectors when arm is pulled under contexts in . Let be the -field summarizing the information available just before is observed. We make the following assumptions throughout the paper.
Bounded parameters: For any , , we have , .
Minimum reward gap: For any , , , .
Conditionally 1-subgaussian noise: Given , is conditionally 1-subgaussian with , for any .
Stochastic context arrivals: In each time , is drawn from the context set in an i.i.d. fashion according to a distribution .
Diverse contexts: For any arm , , where is the smallest eigenvalue of .
Assumption 1.1 ensures that the maximum regret at any step is bounded. Assumption 1.2 indicates that under a given context , the optimal arm is strictly better than any sub-optimal arm. Such a reward gap affects the convergence rate of the proposed algorithm (similar to the stochastic MAB setting). Assumption 1.3 allows us to utilize the induced super-martingale to derive an exponentially decaying tail bound on the estimation error. We note that all three assumptions on the bandit model are standard in the bandit literature or in the study of linear bandits.
Assumption 1.4 is a non-critical assumption made for ease of exposition. Essentially, what is required for the main results to hold is the ergodicity of the context arrival process, i.e., contexts lying in certain favorable subsets recur frequently in time.
Assumption 1.5 is however critical for our main results to hold. It is equivalent to the condition that
where is a random matrix whose columns are feature vectors associated with arm and i.i.d. contexts drawn according to the conditional distribution of given . It implies two conditions: First, all arms in could be optimal under certain contexts, i.e., . Second, for all contexts in favor of the same arm (i.e., contexts in ), they are sufficiently diverse so that the corresponding feature vectors span . If the first condition does not hold, the arms that are strictly sub-optimal have to be explored sufficiently frequently in order to be distinguished from the optimal arms, thus an regret is unavoidable in this situation. For the second condition, although it seems strict at first sight, it is actually quite reasonable in practice. This is because if a feature vector falls in , we would expect that feature vectors drawn from a small neighborhood of it fall in as well. Since small perturbations of a vector can form a full-rank matrix, it is thus reasonable to assume .
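The full-rank reading of the diversity assumption can be illustrated numerically. The sketch below is an empirical analogue only: it averages outer products of a handful of hypothetical 2-dimensional feature vectors (standing in for the feature vectors of one arm under the contexts that favor it) and reports the smallest eigenvalue of the resulting matrix. The function names and the example vectors are assumptions made for illustration.

```python
import math


def second_moment(vectors):
    """Entries (m00, m01, m11) of the average outer product of 2-d vectors."""
    n = len(vectors)
    m00 = sum(v[0] * v[0] for v in vectors) / n
    m01 = sum(v[0] * v[1] for v in vectors) / n
    m11 = sum(v[1] * v[1] for v in vectors) / n
    return m00, m01, m11


def min_eig_sym2(m00, m01, m11):
    """Smallest eigenvalue of the symmetric matrix [[m00, m01], [m01, m11]]."""
    mean = (m00 + m11) / 2.0
    radius = math.hypot((m00 - m11) / 2.0, m01)
    return mean - radius


# Feature vectors pointing in different directions: strictly positive eigenvalue.
lam_diverse = min_eig_sym2(*second_moment([[1.0, 0.0], [0.0, 1.0]]))

# Collinear feature vectors: the matrix is rank deficient, eigenvalue is zero.
lam_collinear = min_eig_sym2(*second_moment([[1.0, 1.0], [2.0, 2.0]]))
```

In this toy example the collinear set fails the full-rank condition, while the two orthogonal vectors satisfy it.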
Assume is given and is unknown a priori. The cumulative regret of an online learning algorithm is defined as
where . While sublinear learning regret has been established for such linear contextual bandits (Abbasi:2011:IAL; Chu:SuperLin), our objective is to investigate the fundamental impact of context diversity on the expected regret .
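As a small illustration of how cumulative regret accumulates, the sketch below computes a running pseudo-regret from a table of expected rewards. The table `mean_reward`, the context sequence, and the pulled arms are all hypothetical; in the actual problem the expected rewards are unknown to the learner and are only available to an analyst or a simulator.

```python
def cumulative_pseudo_regret(mean_reward, contexts, pulls):
    """Running sum of (best expected reward - expected reward of pulled arm).

    mean_reward[c][a] is the expected reward of arm a under context c.
    """
    total = 0.0
    curve = []
    for c, a in zip(contexts, pulls):
        best = max(mean_reward[c])
        total += best - mean_reward[c][a]
        curve.append(total)
    return curve


# Two contexts, two arms: arm 0 is best under context 0, arm 1 under context 1.
mean_reward = [[1.0, 0.5],
               [0.25, 0.75]]
contexts = [0, 1, 0, 1]
pulls = [0, 0, 1, 1]      # the second and third pulls are sub-optimal
curve = cumulative_pseudo_regret(mean_reward, contexts, pulls)
```

A bounded-regret algorithm corresponds to a curve that eventually stops growing.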
The existing linear contextual bandit algorithms such as the celebrated LinUCB (Li:2010:LinUCB) can be directly applied to the considered bandit problem. However, such approaches ignore the diversity in context arrivals and offer little insight to the understanding of diversity. In this section, we propose a Linear Upper Confidence Bound algorithm to manifest the impact of the diversity of context on the scaling of the learning regret. To distinguish it from LinUCB, we term it LinUCB-d.
We label all contexts that have appeared in the order of their first appearances. We assume there are different contexts that have appeared before time . With a slight abuse of notation, we denote the subset of those contexts as . Besides, we add dummy contexts, and denote the subset as . In the following, we use to index the contexts, while the first contexts are in and the last are the added dummy ones. For the added dummy contexts, we assume the corresponding feature vector , , where is the unit vector whose th entry is 1, and is the upper bound on .
Let be an indicator function that takes value one when is true and zero otherwise. Define , i.e., the number of times that arm is pulled under context up to time . Denote as the cumulative reward of pulling arm under context right before time , i.e. for any . We point out that for the dummy contexts, i.e., , at any time since the dummy contexts never appear.
To simplify the notation, we let be the row vector with ’s, and be the row vector with ’s. We also introduce the following matrix(vector)-form notations:
Besides, we use to denote the pseudo-inverse of obtained by flipping its non-zero entries, i.e.,
The proposed LinUCB-d algorithm is presented in Algorithm 1, where we set in the expression of . It adopts the Optimism in the Face of Uncertainty (OFU) principle where the learner always chooses the arm with the highest potential reward after padding a UCB term.
in Algorithm 1 is the solution to the following optimization problem:
Remark: The rationale behind Algorithm 1 can be intuitively explained as follows: For each incoming , the learner needs to estimate the expected reward for each of the arms before it decides which one to pull. Due to the linear reward structure in (1), if we are able to express as a linear combination of the feature vectors in in the form of , then, the expected reward can be expressed as , where . Since can be estimated based on observed rewards generated by pulling arm in the past, we can then estimate directly without trying to estimate first.
Thus, the problem boils down to obtaining a valid representation of in the form of . The existence of such a representation can be guaranteed by including the unit vectors associated with the dummy contexts in . On the other hand, such a representation may not be unique when arm is pulled and more feature vectors are added to . That is when Proposition 1 comes into play: by minimizing the objective function in (6) subject to the linear constraint, we pick the representation that minimizes the uncertainty in the estimated .
We point out that inclusion of the dummy contexts introduces bias to the estimation. However, as increases and gets expanded by including more feature vectors, the bias caused by the dummy contexts will vanish gradually. This is because under Assumptions 1.4 and 1.5, the optimal solution to (6) will put more and more weights on feature vectors associated with the observed contexts instead of the dummy ones.
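The idea of picking the least-uncertain linear representation can be sketched on a simplified problem. The example below is not the paper's optimization problem: it omits the dummy contexts, assumes the observed feature vectors already span the space, and assumes the uncertainty of a weighted combination is the sum of squared weights divided by pull counts (as it would be for unit-variance noise). Under these assumptions the minimizer has a closed form, which the code evaluates for hypothetical 2-dimensional data.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def min_uncertainty_weights(features, counts, x):
    """Minimize sum_i w_i^2 / n_i subject to sum_i w_i * f_i = x (2-d case).

    Closed form (simplified setting): w_i = n_i * f_i^T G^{-1} x,
    where G = sum_i n_i * f_i f_i^T; the minimum value is x^T G^{-1} x.
    """
    g = [[0.0, 0.0], [0.0, 0.0]]
    for f, n in zip(features, counts):
        for i in range(2):
            for j in range(2):
                g[i][j] += n * f[i] * f[j]
    g_inv = inv2(g)
    z = [g_inv[i][0] * x[0] + g_inv[i][1] * x[1] for i in range(2)]
    weights = [n * (f[0] * z[0] + f[1] * z[1]) for f, n in zip(features, counts)]
    uncertainty = x[0] * z[0] + x[1] * z[1]
    return weights, uncertainty


# Three observed feature vectors with hypothetical pull counts.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
counts = [4, 4, 2]
x_new = [1.0, 1.0]
weights, uncertainty = min_uncertainty_weights(features, counts, x_new)
```

In this example the representation is not unique: using only the third vector, or only the first two, both give an uncertainty of 0.5, while the minimizing weights spread across all three vectors and achieve 0.25.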
Proposition 1 provides a brand new angle to view the linear contextual bandit problem. Leveraging this new viewpoint and the additional diversity assumption on the contexts, we will show that a constant regret can be achieved under LinUCB-d.
We note that LinUCB-d turns out to have deep connections with LinUCB. In order to avoid diversion from the main focus of this work, which is to elucidate the fundamental impact of context diversity on learning regret, we leave the comparison with LinUCB to Appendix B.
4 Analysis: Finite Contexts
In order to obtain some insights on how the diversity of context could help reduce the learning regret, in this section, we focus on a scenario where the context is drawn in an i.i.d. fashion from a finite set according to a uniform distribution. With insight obtained for this scenario, we will extend the result and analysis to a general context distribution setting in Section 5.
According to Assumption 1.5, there must exist at least one subset of distinct contexts in , such that the corresponding feature vectors span . Denote
and as the contexts associated with the feature vectors in . Then, under Assumption 1.5, . Intuitively, can be used as a metric for the diversity of context under this setting. We present our main theoretical result for the finite contexts setting as follows.
Theorem 1 indicates that the expected regret is bounded by a constant, which is in stark contrast to the state-of-the-art results on linear contextual bandits. It indicates that diverse contexts can indeed help to accelerate the learning process and make it converge to the optimal solution within finite steps on average. Besides, the constant bound monotonically decreases as increases, which is consistent with our intuition that larger diversity of context is more advantageous in learning.
We point out that the dependence on the number of contexts in the upper bound can be further reduced to a constant that does not scale in the total number of contexts, as we will show in the general context distribution setting in Section 5.
4.1 Sketch of the Proof of Theorem 1
The complete proof of Theorem 1 can be found in Appendix C. In this section, we provide a sketch of the proof to highlight the key ideas and shed light on the profound impact of context diversity on the learning performance.
The bounded regret in Theorem 1 can be intuitively explained in this way: thanks to context diversity under Assumption 1.5, arms that are suboptimal for a given context are optimal for some other contexts. Since contexts are drawn in an i.i.d. fashion, then, with high probability, each arm will be played as an optimal arm for a linear fraction of time. Context diversity then ensures that for any arm , the feature vector for any incoming context can be expressed as a linear combination (denote the coefficient vector as ) of the columns of . We note that can be estimated accurately based on the rewards collected when is pulled as an optimal arm. Hence, if were given a priori, the error of using the linear combination of to predict would decrease in the order of . To overcome the difficulty that is unknown beforehand, LinUCB-d greedily selects the linear combination (with coefficient vector ) to minimize the estimation uncertainty. Then, according to Proposition 1, the corresponding estimation uncertainty must be lower than that if were used, leading to a faster decay of the prediction error.
As explained above, the key to the result in Theorem 1 is to show that each arm will be played as an optimal arm for a linear fraction of time. In order to show this, we propose a novel frame-based approach.
Specifically, we divide the time axis into frames with lengths , , starting at . Denote as the time slots lying in the -th frame, i.e., Denote as the number of times that context appears up to time , and as the number of times context appears in , i.e., . Similarly, we define as the number of times arm is pulled under context in . We consider the following error events:
Irregular context arrivals. For each arm , we focus on the contexts in . Within a frame, if the total number of arrivals of any context is smaller than half of its expected number of arrivals in that frame, we term it irregular context arrivals. If irregular context arrivals happen in frame , we will put all time indices in the th frame in , i.e.,
Intuitively, due to the i.i.d. context arrival assumption, the probability of having irregular context arrivals in the th frame decays exponentially in the length of frame . Thus, the corresponding regret over can be bounded by a constant. The detailed analysis can be found in Appendix C.1.
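The irregular-arrival event can be checked mechanically on a simulated context sequence. The sketch below flags every frame in which some tracked context appears fewer than half as often as expected. The frame boundaries, the tracked contexts, and the arrival probabilities are hypothetical inputs; the frame lengths used in the actual analysis are not reproduced here.

```python
def irregular_frames(contexts, frame_bounds, tracked, prob):
    """Indices of frames with irregular context arrivals.

    contexts     : list of context labels, one per time slot
    frame_bounds : list of (start, end) slot indices, end exclusive
    tracked      : context labels whose arrivals are monitored
    prob         : dict mapping each tracked label to its arrival probability
    """
    flagged = []
    for k, (start, end) in enumerate(frame_bounds):
        frame = contexts[start:end]
        length = end - start
        for c in tracked:
            # Irregular: fewer than half of the expected number of arrivals.
            if frame.count(c) < 0.5 * prob[c] * length:
                flagged.append(k)
                break
    return flagged


# Two frames of lengths 4 and 8 with two equally likely contexts.
contexts = [0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1]
frame_bounds = [(0, 4), (4, 12)]
flagged = irregular_frames(contexts, frame_bounds, tracked=[0, 1],
                           prob={0: 0.5, 1: 0.5})
```

Here context 1 appears only once in the second frame, against an expected four arrivals, so that frame is flagged.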
Bad estimates. At time , if the estimated reward deviates from its expected value by more than , we term it a bad estimate. We group the time slots with bad estimates over in , i.e., The regret over can be bounded by a constant by adapting the Laplace method (LS19bandit-book) to our setting. The detailed analysis is deferred to Appendix C.2.
Bad presence of good estimates. Within a frame, if the total number of time slots with bad estimates exceeds of the frame length, we term the event bad presence of good estimates. If such an event happens in frame , we put all time indices in the th frame in , i.e., where is the number of bad estimates in frame . As shown in Appendix C.3, can be upper bounded by a linear function of . The regret over can thus be bounded as a linear function of the regret over .
Pulling sub-optimal arms in good time slots. For any time slot not included in , or , we call it a good time slot. The learner may still pull a sub-optimal arm in a good time slot, due to the overlap of the confidence intervals of . We group the time slots when such event happens in . Specifically,
While the regrets over , or can be bounded in a relatively straightforward way, characterizing the regret over relies on the context diversity, and is the most critical step towards the constant regret in Theorem 1. The detailed analysis is provided in Appendix C.4. It involves the following major steps:
After assembling the regrets over , , and together, the result in Theorem 1 can be obtained.
Remark: We point out that the operation of LinUCB-d itself does not depend on frames. We introduce them for the purpose of analysis only. Besides, LinUCB-d does not require the knowledge of , or the distribution of . It can operate under general context arrival processes, even if Assumption 1 does not hold.
5 Analysis: General Context Arrivals
In this section, we extend the analysis for the finite uniform context distribution setting to the general context distribution setting. Compared with the finite contexts case, the major difference for the general setting is that the context set could be infinite and even uncountable. Although LinUCB-d still works in the same way, the corresponding performance analysis becomes much more challenging. For the finite contexts case, we choose a set of feature vectors (columns in ) as the basis for , and show that a linear combination of the corresponding empirical average rewards leads to a fast decaying estimation error, as the number of times is pulled under contexts in scales linearly in time. However, for general context arrivals, the recurrence of any finite subset of contexts may have probability zero. Thus, the previous analysis cannot be extended straightforwardly to handle such a case.
To overcome such challenges, we make the following modifications: First, we extend the definition of from distinct contexts to non-overlapping meta-contexts, where each meta-context consists of a subset of contexts with a non-zero probability mass. Thus, the meta-contexts recur frequently, similar to the finite contexts setting. One subsequent challenge associated with the meta-contexts is that feature vectors associated with the contexts in a meta-context are different and occur randomly. Thus, we cannot fix a basis (such as the columns in as in the finite contexts case) beforehand for , as the corresponding contexts may not appear frequently in time. Rather, it needs to be adaptively selected based on up-to-date observations. How to ensure the existence of such a valid basis at each time is thus challenging.
We construct the meta-contexts and a basis for each arm as follows. First, we select a matrix with , and denote its columns as . Assumption 1.5 ensures the existence of such for each according to (4). Let
Then, we have with the selected s.
We then divide into disjoint groups based on their closeness to , and break the tie arbitrarily, e.g.,
Let , and be an ball centered at with radius . Let . Then, as shown in Lemma 7 in Appendix D, a valid basis for can be formed if an arbitrary vector is picked from each of the subsets . We then take the sample average of the previously observed feature vectors in (denoted as ) as the corresponding basis vector.
The definition of induces the definition of meta-contexts as follows:
Then, Assumption 1.5 ensures that there exists such that is bounded away from zero. For ease of exposition, in the following, we simply use to denote without causing ambiguity.
Denote as the total number of times that the contexts in meta-context appear up to time . We then keep the definitions of and the same as in the finite context set setting and modify the definition of and as follows:
Intuitively, the regret over remains unchanged, while the regrets over and can be obtained through a straightforward extension of the previous results in the finite contexts case. The challenge of the analysis thus lies in the analysis of the regret over , whose major steps are listed as follows.
We then show that the total number of times that arm is pulled as the optimal arm under contexts in scales linearly in (Lemma 10).
Putting everything together, we have the following bounded regret for the general case. The detailed proof is provided in Appendix D.
Theorem 2 indicates that even for the general context distribution setting where the contexts are drawn from a continuous set, we are still able to obtain a constant regret bound. Compared with the result in Theorem 1, the scaling in terms of and is larger, due to the inclusion of multiple contexts in the meta-contexts.
Remark: Similar to the finite contexts case, , , , , and are introduced for the purpose of analysis only, and are not required for LinUCB-d.
6 Experimental Evaluation
6.1 Uniform Context Arrivals
First, we consider a simplified scenario with 2 arms and 4 contexts for a proof of concept. We assume the arm parameters are , . The arm-context feature vectors are as follows: , , , , , , , . The expected rewards for pulling the arms under the four different contexts can be calculated accordingly. Therefore, arm is the optimal arm under contexts and and arm is the optimal arm under contexts and . We can verify that and both span , thus they are valid bases for and , respectively.
With the selected parameters, we compare LinUCB-d with the following baseline algorithms through simulation: 1) UCB with for individual contexts. We treat the arms under each context as a standard MAB and perform UCB for each context. 2) LinUCB with the same choice of as in LinUCB-d. 3) A greedy LinUCB with . This is the pure exploitation algorithm considered in Bastani2017ExploitingTN essentially.
For each algorithm, we randomly pick one out of those four contexts with probability each time, and add i.i.d. noise according to a standard Gaussian distribution to generate the reward. We run the simulation 100 times for each algorithm over 500,000 time slots. The sample average pseudo regrets are plotted in Fig. 1(a), where the pseudo regret is obtained by replacing in the definition of regret by , and the shaded area corresponds to twice the standard deviation. As we expect, LinUCB-d with the same choice of behaves exactly the same as LinUCB, and shows bounded regret. However, the greedy algorithm and UCB do not achieve constant regret. This indicates the following: First, the pure exploitation strategy does not work well in this case. This is because the selected parameters do not satisfy the covariate diversity defined in Bastani2017ExploitingTN. The covariate diversity in Bastani2017ExploitingTN requires that the correlation matrix of the feature vectors lying in any half space is positive definite. It requires that there are feature vectors at least in any half space. Since the feature vectors in our example only lie in the first orthant, the covariate diversity condition is not satisfied and hence the greedy approach does not work well. Second, treating each context individually does not utilize the information obtained under other contexts about the same arm, and thus cannot leverage the diversity of context to reduce the regret.
Next, we evaluate how affects the regrets. We modify the feature vectors associated with contexts 2 and 3 while keeping the remaining parameters the same. Specifically, we let , , , . Compared with the previous setting, increases from to , while the reward gap stays approximately the same. Intuitively, the basis vectors for each arm now point in more nearly perpendicular directions and are more diverse in this sense. As indicated in Fig. 1(b), the increased diversity leads to much faster convergence and lower regret.
6.2 General Context Arrivals
In this part, we investigate the performance of LinUCB-d with a more general context distribution. We first randomly generate parameter vectors in for 5 arms under the constraint that . Thus, the arms are randomly located on a sphere in with radius 10, which ensures that each of them can be optimal under certain contexts. For the feature vectors, we randomly draw for at each time and make sure the reward gap condition in Assumption 1.2 is satisfied. We set throughout the simulation. The contexts are drawn from a continuous set which includes infinitely many contexts.
We only compare LinUCB-d with greedy LinUCB under this setup. This is because UCB for individual contexts cannot be run without recurring contexts, and LinUCB with the same behaves the same as LinUCB-d. The sample average pseudo regrets are plotted in Fig. 1(c). As we observe, LinUCB-d still achieves constant regret, while the greedy algorithm does not converge.
7 Related Work
The model considered in this paper falls in the contextual bandits framework. In the contextual MAB setting, the learner repeatedly takes one of actions in response to the observed context (Auer:2003:UCB). Efficient exploration based on instantaneous context is of critical importance for contextual bandit algorithms to achieve small learning regret. The strongest known results (Auer:2003:UCB; Langford:2008; McMahan2009; beygelzimer11a; Dudk:2011; Agarwal2014) achieve an optimal regret after rounds of with high probability.
More specifically, our reward model is similar to that of linear contextual bandits in the literature. This setting is first introduced in Auer:2003:UCB through the LinRel algorithm and is subsequently improved through the OFUL algorithm in DaniHK08 and the LinUCB algorithm in Li:2010:LinUCB. Tsitsiklis:2010:LinearBandits extend the work of DaniHK08 by considering both optimistic and explore-then-commit strategies. It is shown in Abbasi:2011:IAL that the regret can be upper bounded by , where is the dimension of the context. A modified version of LinUCB, named SupLinUCB, is considered in Chu:SuperLin, and shown to achieve regret. Later, Valko:2013:Kernel mix LinUCB and SupLinUCB with kernel functions and propose an algorithm to further reduce the regret to , where is the effective dimension of the kernel feature space. This line of literature typically allows for arbitrary (adversarial) context sequences, and the regret persists.
Recently, a few works have started to take the diversity in contexts into consideration. goldenshluger2013 introduce a notion of diversity similar to Assumption 1.5 to a two-armed linear bandits setting. They show that the regret scales in when a margin condition is satisfied, where the contribution from the “large-margin” covariates scales in . Bastani:2015 generalize the notion to a so-called “compatibility condition” in a contextual linear bandits model with high-dimensional covariates, and investigate a LASSO based approach. They show that the regret can be bounded by a polynomial of under the margin condition. The regret persists for error events associated with large-margin covariate vectors. In contrast, we show that a bounded regret can be achieved, by leveraging the geometric interpretation of the diversity condition and the reward gap condition.
Bastani2017ExploitingTN propose a concept called covariate diversity, which requires that the correlation matrix of the covariate vectors lying in any half space is positive definite. Under this condition, they show that the exploration-free greedy algorithm is near-optimal for a two-armed bandit under the stochastic setting and achieves regret in . A perturbed adversarial setting with a similar notion of diversity is studied in kannan2018smoothed, which shows that greedy algorithms can achieve regrets in . We note that such a condition is stronger than Assumption 1.5. As illustrated through simulations in Section 6, a greedy strategy may not work well under our setting, due to the difference between the diversity definitions.
The main purpose of this paper was to study the impact of context diversity on the learning performance in stochastic linear contextual bandits. We have shown that, by adding an assumption that the context arrivals satisfy some diversity conditions, it is possible to significantly reduce the learning regret of contextual bandits. We proposed an algorithm called LinUCB-d and showed that when the diversity assumption is satisfied, the expected regret can in fact be upper bounded by a constant. This study illustrates the power of incorporating structure in the contexts to the bandit problem. It is of interest to evaluate whether other structures of the context can be similarly considered, and what their impacts would be. Another interesting problem is to study the impact of context diversity in other settings, such as the perturbed adversarial setting (kannan2018smoothed).
JY acknowledges the support from U.S. National Science Foundation under Grant ECCS-1650299.
Supplementary Material: Stochastic Linear Contextual Bandits with Diverse Contexts
Weiqiang Wu (London Stock Exchange), Jing Yang (The Pennsylvania State University), Cong Shen (University of Virginia)
Appendix A Proof of Proposition 1
For the constrained convex optimization problem in (6), the corresponding Lagrangian can be formulated as
where is the Lagrangian multiplier vector.
Taking derivative with respect to , we have
Then, to satisfy the first constraint in (6), we have
which implies that
Note that the definitions of and ensure that is positive definite and invertible for every .
Appendix B On the Relationship between LinUCB-d and LinUCB
Major difference. One major difference between LinUCB-d and LinUCB (Li:2010:LinUCB) is as follows: Under LinUCB, at each time , the learner first estimates the true parameter of arm (i.e., ) by solving a ridge regression and then uses it to derive the UCB for the expected reward. The criterion to select the estimate is to minimize the penalized mean squared error in fitting the past observations. On the other hand, under LinUCB-d, the learner directly estimates the expected reward through a linear combination of the rewards obtained when arm was pulled under all contexts. The criterion of selecting the estimate is to minimize the uncertainty (or “variance”) of the estimation. It avoids the intermediate step of trying to estimate first in LinUCB.
Essential equivalence. Although LinUCB-d and LinUCB view the problem from different angles, they actually produce the same estimate of the expected reward and the same confidence bound at every time under the same realizations of context arrivals and rewards, as shown below.
where . We can verify that this is exactly the estimate of obtained by applying the ridge regression with penalty factor to the historical data .
Besides, for the in Algorithm 1, we have
where we follow the convention to denote as .
Thus, if we let , both and share the same form as the corresponding quantities in LinUCB. As a reformulation of LinUCB, LinUCB-d automatically inherits all properties of LinUCB.
Computation and analytical issues. Computationally LinUCB-d is the same as LinUCB if we first compute the Lagrangian multiplier in (14) through , which can be equivalently computed by summing over the time slots when is pulled. The advantage of LinUCB-d as an alternative form of LinUCB is on the analytical side. The prediction uncertainty minimization nature shown in Proposition 1 gives us a unique angle to elucidate the impact of context diversity on the corresponding learning regret, as elaborated in Lemma 3, Lemma 4, Lemma 9 and Lemma 10.
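The per-arm computation in the ridge-regression form can be sketched as follows. This is a minimal illustration of a LinUCB-style estimate and confidence width, not a transcription of Algorithm 1: the class name `RidgeArm`, the penalty `lam`, and the exploration weight `alpha` are hypothetical, and the Sherman-Morrison rank-one update is a standard technique substituted here to avoid an explicit matrix inversion at every step.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(m, v):
    return [dot(row, v) for row in m]


class RidgeArm:
    """Ridge-regression statistics for one arm (illustrative sketch)."""

    def __init__(self, dim, lam=1.0):
        # Inverse of the regularized Gram matrix, initially (lam * I)^{-1}.
        self.a_inv = [[(1.0 / lam if i == j else 0.0) for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim  # sum of reward-weighted feature vectors

    def update(self, x, reward):
        """Rank-one Sherman-Morrison update after pulling this arm."""
        ax = matvec(self.a_inv, x)
        denom = 1.0 + dot(x, ax)
        n = len(x)
        for i in range(n):
            for j in range(n):
                self.a_inv[i][j] -= ax[i] * ax[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]

    def ucb(self, x, alpha):
        """Estimated reward plus alpha times the confidence width."""
        theta_hat = matvec(self.a_inv, self.b)
        width = math.sqrt(dot(x, matvec(self.a_inv, x)))
        return dot(theta_hat, x) + alpha * width


arm = RidgeArm(dim=2, lam=1.0)
arm.update([1.0, 0.0], reward=2.0)
seen_direction = arm.ucb([1.0, 0.0], alpha=1.0)
unseen_direction = arm.ucb([0.0, 1.0], alpha=1.0)
```

After one observation along the first coordinate, the confidence width shrinks in that direction only; the unseen direction keeps its full initial width. An optimistic learner would pull the arm with the largest `ucb` value for the incoming feature vector.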
Appendix C Proof of Theorem 1
In the following, we will derive regret bounds for those error events individually, and then assemble them together to obtain the regret bound in Theorem 1.
C.1 Bound the Regret over
C.2 Bound the Regret over
First, we define and as follows:
Intuitively, corresponds to the accumulated noise in the observations when arm is pulled under different contexts up to time , and corresponds to the bias contributed by the feature vectors associated with the dummy contexts, which were added to ensure the existence of the unique solution in (6) for every .
Then, the reward estimation error can be expressed as
where (21) is due to the fact that according to Proposition 1, and the in (22) is the Lagrangian multiplier involved in the proof of Proposition 1 in Appendix A and satisfies (11). In the following, we will bound the contribution from and in the estimation error, respectively.
where (24) follows from the Cauchy-Schwarz inequality.
Before we proceed to bound , we first introduce the following notations. Recall that . Let be its square root, i.e., . Let