From Bandits to Experts: On the Value of Side-Observations
We consider an adversarial online learning setting where a decision maker can choose an action in every stage of the game. In addition to observing the reward of the chosen action, the decision maker gets side observations on the reward he would have obtained had he chosen some of the other actions. The observation structure is encoded as a graph, where node $i$ is linked to node $j$ if sampling $i$ provides information on the reward of $j$. This setting naturally interpolates between the well-known “experts” setting, where the decision maker can view all rewards, and the multi-armed bandits setting, where the decision maker can only view the reward of the chosen action. We develop practical algorithms with provable regret guarantees, which depend on non-trivial graph-theoretic properties of the information feedback structure. We also provide partially-matching lower bounds.
Shie Mannor Department of Electrical Engineering Technion, Israel email@example.com Ohad Shamir Microsoft Research New England USA firstname.lastname@example.org
One of the most basic learning settings studied in the online learning framework is learning from experts. In its simplest form, we assume that each round $t$, the learning algorithm must choose one of $k$ possible actions, which can be interpreted as following the advice of one of $k$ “experts”. (The more general setup, which is beyond the scope of this paper, considers $k$ experts providing advice for choosing among $n$ actions, where in general $n \neq k$.) At the end of the round, the performance of all actions, measured here in terms of some reward, is revealed. This process is iterated for $T$ rounds, and our goal is to minimize the regret, namely the difference between the total reward of the single best action in hindsight, and our own accumulated reward. We follow the standard online learning framework, in which nothing whatsoever can be assumed on the process generating the rewards, and they might even be chosen by an adversary who has full knowledge of our learning algorithm.
A crucial assumption in this setting is that we get to see the rewards of all actions at the end of each round. However, in many real-world scenarios, this assumption is unrealistic. A canonical example is web advertising, where at any timepoint one may choose only a single ad (or small number of ads) to display, and observe whether it was clicked, but not whether other ads would have been clicked or not if presented to the user. This partial information constraint has led to a flourishing literature on multi-armed bandits problems, which model the setting where we can only observe the reward of the action we chose. While this setting has been long studied under stochastic assumptions, the landmark paper [4] showed that this setting can also be dealt with under adversarial conditions, making the setting comparable to the experts setting discussed above. The price in terms of the provable regret is usually an extra multiplicative $\sqrt{k}$ factor in the bound. The intuition for this factor has long been that in the bandit setting, we only get “$1/k$ of the information” obtained in the expert setting (as we observe just a single reward rather than $k$). While the bandits setting received much theoretical interest, it has also been criticized for not capturing additional side-information we often have on the rewards of the different actions. This has led to studying richer settings, which make various assumptions on the relationship between the rewards; see below for more details.
In this paper, we formalize and initiate a study on a range of settings that interpolates between the bandits setting and the experts setting. Intuitively, we assume that after choosing some action $i$, and obtaining the action’s reward, we observe not just action $i$’s reward (as in the bandit setting), and not the rewards of all actions (as in the experts setting), but rather some (possibly noisy) information on a subset of the other actions. This subset may depend on action $i$ in an arbitrary way, and may change from round to round. This information feedback structure can be modeled as a sequence of directed graphs $G_1, \ldots, G_T$ (one per round $t$), so that an edge from action $i$ to action $j$ implies that by choosing action $i$, “sufficiently good” information is revealed on the reward of action $j$ as well. The case of $G_t$ being the complete graph corresponds to the experts setting. The case of $G_t$ being the empty graph corresponds to the bandit setting. The broad scenario of arbitrary graphs in between the two is the focus of our study.
As a motivating example, consider the problem of web advertising mentioned earlier. In the standard multi-armed bandits setting, we assume that we have no information whatsoever on whether undisplayed ads would have been clicked on. However, in many cases, we do have some side-information. For instance, if two ads $i, j$ are for similar vacation packages in Hawaii, and ad $i$ was displayed and clicked on by some user, it is likely that the other ad $j$ would have been clicked on as well. In contrast, if ad $i$ is for running shoes, and ad $j$ is for wheelchair accessories, then a user who clicked on one ad is unlikely to click on the other. This sort of side-information can be better captured in our setting.
As another motivating example, consider a sensor network where each sensor collects data from a certain geographic location. Each sensor covers an area that may overlap the area covered by other sensors. At every stage a centralized controller activates one of the sensors and receives input from it. The value of this input is modeled as the integral of some “information” in the covered area. Since the area covered by each of the sensors overlaps the area covered by other sensors, the reward obtained when choosing sensor $i$ provides an indication of the reward that would have been obtained when sampling sensor $j$. A related example comes from ultra wideband communication networks, where every agent can select which channel to use for transmission. When using a channel, the agent senses if the transmission was successful, and also receives some indication of the noise level in other channels that are in adjacent frequency bands [2].
Our results portray an interesting picture, with the attainable regret depending on non-trivial properties of these graphs. We provide two practical algorithms with regret guarantees: the ExpBan algorithm that is based on a combination of existing methods, and the more fundamentally novel ELP algorithm that has superior guarantees. We also study lower bounds for our setting. In the case of undirected graphs, we show that the information-theoretically attainable regret is characterized, up to logarithmic factors, by the average independence number (or stability number) of the graph, namely the size of its largest independent set. For the case of directed graphs, we obtain a weaker regret bound which depends on the average clique-partition number of the graphs. More specifically, our contributions are as follows:
- We formally define and initiate a study of the setting that interpolates between learning with expert advice (with $O(\sqrt{\log(k) T})$ regret), which assumes that all rewards are revealed, and the multi-armed bandits setting (with $\tilde{O}(\sqrt{kT})$ regret), which assumes that only the reward of the action selected is revealed. We provide an answer to a range of models in between.
- The framework we consider assumes that by choosing each action, other than just obtaining that action’s reward, we can also observe some side-information about the rewards of other actions. We formalize this as a graph $G_t$ over the actions, where an edge between two actions means that by choosing one action, we can also get a “sufficiently good” estimate of the reward of the other action. We consider both the case where $G_t$ changes at each round $t$, as well as the case that $G_t = G$ is fixed throughout all rounds.
- We establish upper and lower bounds on the achievable regret, which depend on two combinatorial properties of $G_t$: its independence number $\alpha(G_t)$ (namely, the largest number of nodes without edges between them), and its clique-partition number $\bar{\chi}(G_t)$ (namely, the smallest number of cliques into which the nodes can be partitioned).
- We present two practical algorithms to deal with this setting. The first algorithm, called ExpBan, combines existing algorithms in a natural way, and applies only when $G_t = G$ is fixed at all rounds. Ignoring computational constraints, the algorithm achieves a regret bound of $O(\sqrt{\bar{\chi}(G) \log(k) T})$. With computational constraints, its regret bound is $O(\sqrt{C \log(k) T})$, where $C$ is the size of the minimal clique partition one can efficiently find for $G$. However, note that for general graphs, it is NP-hard to approximate the clique-partition number to within a factor of $k^{1-\epsilon}$ for any $\epsilon > 0$.
- The second algorithm, called ELP, is an improved algorithm, which can handle graphs which change between rounds. For undirected graphs, where sampling $i$ gives an observation on $j$ and vice versa, it achieves a regret bound of $O(\sqrt{\log(k) \sum_{t=1}^{T} \alpha(G_t)})$. For directed graphs (where the observation structure is not symmetric), our regret bound is at most $O(\sqrt{\log(k) \sum_{t=1}^{T} \bar{\chi}(G_t)})$. Moreover, the algorithm is computationally efficient. This is in contrast to the ExpBan algorithm, which in the worst case, cannot efficiently achieve regret significantly better than $\tilde{O}(\sqrt{kT})$.
- For the case of a fixed graph $G_t = G$, we present an information-theoretic lower bound on the regret, which holds regardless of computational efficiency.
- We present some simple synthetic experiments, which demonstrate that the potential advantage of the ELP algorithm over other approaches is real, and not just an artifact of our analysis.
1.1 Related Work
The standard multi-armed bandits problem assumes no relationship between the actions. Quite a few papers studied alternative models, where the actions are endowed with a richer structure. However, in the large majority of such papers, the feedback structure is the same as in the standard multi-armed bandits. Examples include [11], where the actions’ rewards are assumed to be drawn from a statistical distribution, with correlations between the actions; and [1, 8], where the actions’ rewards are assumed to satisfy some Lipschitz continuity property with respect to a distance measure between the actions.
In terms of other approaches, the combinatorial bandits framework [7] considers a setting slightly similar to ours, in that one chooses and observes the rewards of some subset of actions. However, it is crucially assumed that the reward obtained is the sum of the rewards of all actions in the subset. In other words, there is no separation between earning a reward and obtaining information on its value. Another relevant approach is partial monitoring, which is a very general framework for online learning under partial feedback. However, this generality comes at the price of tractability for all but specific cases, which do not include our model.
Our work is also somewhat related to the contextual bandit problem (e.g., [9, 10]), where the standard multi-armed bandits setting is augmented with some side-information provided in each round, which can be used to determine which action to pick. While we also consider additional side-information, it is in a more specific sense. Moreover, our goal is still to compete against the best single action, rather than some set of policies which use this side-information.
2 Problem Setting
Let $[k] = \{1, \ldots, k\}$ and $[T] = \{1, \ldots, T\}$. We consider a set of actions $1, 2, \ldots, k$. Choosing an action $i$ at round $t$ results in receiving a reward $g_i(t)$, which we shall assume without loss of generality to be bounded in $[0,1]$. Following the standard adversarial framework, we make no assumptions whatsoever about how the rewards are selected, and they might even be chosen by an adversary. We denote our choice of action at round $t$ as $i_t$. Our goal is to minimize regret with respect to the best single action in hindsight, namely
$$\max_{i \in [k]} \sum_{t=1}^{T} g_i(t) - \sum_{t=1}^{T} g_{i_t}(t).$$
For simplicity, we will focus on a finite-horizon setting (where the number of rounds is known in advance), on regret bounds which hold in expectation, and on oblivious adversaries, namely that the reward sequence is unknown but fixed in advance (see Sec. 8 for more on this issue).
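To make the regret criterion concrete, the following minimal sketch computes the regret of a played action sequence against the best single action in hindsight. The function name `regret` and the reward-matrix layout are illustrative assumptions, not notation from the paper.

```python
def regret(rewards, chosen):
    """Regret against the best single action in hindsight.

    rewards[t][i] is the reward of action i at round t (assumed in [0, 1]);
    chosen[t] is the index of the action played at round t.
    """
    num_actions = len(rewards[0])
    # Total reward each fixed action would have collected over all rounds.
    totals = [sum(row[i] for row in rewards) for i in range(num_actions)]
    # Reward actually collected by the learner.
    earned = sum(row[a] for row, a in zip(rewards, chosen))
    return max(totals) - earned


# Hypothetical 3-round, 3-action reward table.
rewards = [[0.0, 1.0, 0.5],
           [1.0, 1.0, 0.0],
           [0.0, 1.0, 0.5]]
print(regret(rewards, [0, 1, 2]))  # 1.5: action 1 totals 3.0, the learner earns 1.5
```

The experts and bandit settings share this objective; they differ only in how much of each row of `rewards` the learner gets to see.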
Each round $t$, the learning algorithm chooses a single action $i_t$. In the standard multi-armed bandits setting, this results in $g_{i_t}(t)$ being revealed to the algorithm, while $g_j(t)$ remains unknown for any $j \neq i_t$. In our setting, we assume that by choosing an action $i_t$, other than getting $g_{i_t}(t)$, we also get some side-observations about the rewards of the other actions. Formally, we assume that one receives $g_{i_t}(t)$, and for some fixed parameter $b$ is able to construct unbiased estimates $\hat{g}_j(t)$ for all actions $j$ in some subset of $[k]$, such that $\mathbb{E}[\hat{g}_j(t) \mid \text{action } i_t \text{ chosen}] = g_j(t)$ and $\Pr(|\hat{g}_j(t)| \le b \mid \text{action } i_t \text{ chosen}) = 1$. For any action $j$, we let $N_j(t)$ be the set of actions, for which we can get such an estimate $\hat{g}_j(t)$ on the reward of action $j$. This is essentially the “neighborhood” of action $j$, which receives sufficiently good information (as parameterized by $b$) on the reward of action $j$. We note that $j$ is always a member of $N_j(t)$, and moreover, $N_j(t)$ may be larger or smaller depending on the value of $b$ we choose. We assume that $N_j(t)$ for all $j, t$ are known to the learner in advance.
Intuitively, one can think of this setting as a sequence of graphs, one graph per round $t$, which captures the information feedback structure between the actions. Formally, we define $G_t$ to be a graph on the nodes $1, \ldots, k$, with an edge from node $i$ to node $j$ if and only if $i \in N_j(t)$. In the case that $i \in N_j(t)$ if and only if $j \in N_i(t)$, for all $i, j$, we say that $G_t$ is undirected. We will use this graph viewpoint extensively in the remainder of the paper.
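The graph viewpoint can be sketched in a few lines. The dictionary `neighborhoods` below is a hypothetical encoding in which entry `j` holds the set of actions whose selection yields an estimate of action `j`'s reward; the helper names are illustrative assumptions.

```python
def build_edges(neighborhoods):
    """Directed edge set: an edge (i, j) exists iff i belongs to the neighborhood of j."""
    return {(i, j) for j, n_j in neighborhoods.items() for i in n_j}


def is_undirected(neighborhoods):
    """True when every edge (i, j) is matched by the reverse edge (j, i)."""
    edges = build_edges(neighborhoods)
    return all((j, i) in edges for (i, j) in edges)


# Hypothetical examples with 3 actions; each action always observes itself.
experts = {j: {0, 1, 2} for j in range(3)}   # complete graph: every reward is seen
bandits = {j: {j} for j in range(3)}         # self-loops only: own reward is seen
one_way = {0: {0}, 1: {0, 1}, 2: {2}}        # choosing 0 also reveals action 1

print(is_undirected(experts), is_undirected(bandits), is_undirected(one_way))
# True True False
```

The two extreme encodings recover the experts and bandit settings, while `one_way` is a directed structure of the kind treated separately in the analysis.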
3 The ExpBan Algorithm
We begin by presenting the ExpBan algorithm (see Algorithm 1 above), which builds on existing algorithms to deal with our setting, in the special case where the graph structure remains fixed throughout the rounds, namely $G_t = G$ for all $t$. The idea of the algorithm is to split the actions into $C$ cliques, such that choosing an action in a clique reveals unbiased estimates of the rewards of all the other actions in the clique. By running a standard experts algorithm (such as the exponentially weighted forecaster; see [6, Chapter 2]), we can get low regret with respect to any action in that clique. We then treat each such expert algorithm as a meta-action, and run a standard bandits algorithm (such as EXP3 [4]) over these $C$ meta-actions. We denote this algorithm as ExpBan, since it combines an experts algorithm with a bandit algorithm.
The following result provides a bound on the expected regret of the algorithm. The proof appears in the appendix.
Theorem 1. Suppose $G_t = G$ is fixed for all $T$ rounds. If we run ExpBan using the exponentially weighted forecaster and the EXP3 algorithm, then the expected regret with respect to any fixed action is $O(b\sqrt{C \log(k) T})$, where $C$ is the number of cliques in the partition used. (Using more sophisticated methods, it is now known that the logarithmic factor can be removed, e.g., [3]. However, we will stick with this slightly less tight analysis for simplicity.)
For the optimal clique partition, we have $C = \bar{\chi}(G)$, the clique-partition number of $G$.
It is easily seen that $\bar{\chi}(G)$ is a number between $1$ and $k$. The case $\bar{\chi}(G) = 1$ corresponds to $G$ being a clique, namely, that choosing any action allows us to estimate the rewards of all other actions. This corresponds to the standard experts setting, in which case the algorithm attains the optimal $O(\sqrt{\log(k) T})$ regret. At the other extreme, $\bar{\chi}(G) = k$ corresponds to $G$ being the empty graph, namely, that choosing any action only reveals the reward of that action. This corresponds to the standard bandit setting, in which case the algorithm attains the standard $O(\sqrt{k \log(k) T})$ regret. For general graphs, our algorithm interpolates between these regimes, in a way which depends on $\bar{\chi}(G)$.
While being simple and using off-the-shelf components, the ExpBan algorithm has some disadvantages. First of all, for a general graph $G$, it is NP-hard to approximate $\bar{\chi}(G)$ to within a factor of $k^{1-\epsilon}$ for any $\epsilon > 0$. (This follows from [12] and the fact that the clique-partition number of $G$ equals the chromatic number of its complement.) Thus, with computational constraints, one cannot hope to obtain a bound better than $\tilde{O}(\sqrt{kT})$. That being said, we note that this is only a worst-case result, and in practice or for specific classes of graphs, computing a good clique partition might be relatively easy. A second disadvantage of the algorithm is that it is not applicable for an observation structure that changes with time.
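Since an optimal clique partition is hard to find in general, a practical implementation of ExpBan would fall back on a heuristic. The sketch below is one such heuristic (a first-fit greedy pass, chosen here purely for illustration and not taken from the paper). It assumes an undirected graph given as a hypothetical adjacency dictionary `adj`, and it returns a valid partition whose size may be far above the clique-partition number.

```python
def greedy_clique_partition(adj):
    """First-fit greedy clique partition; adj[i] is the set of neighbors of node i.

    Assumes adj is symmetric (an undirected graph without self-loops listed).
    """
    cliques = []
    for node in sorted(adj):
        for clique in cliques:
            # A node may join a clique only if it is adjacent to every member.
            if all(member in adj[node] for member in clique):
                clique.append(node)
                break
        else:
            # No existing clique can take the node, so it starts a new one.
            cliques.append([node])
    return cliques


# Hypothetical graph: a triangle {0, 1, 2} plus a single edge {3, 4}.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4}, 4: {3}}
print(greedy_clique_partition(adj))  # [[0, 1, 2], [3, 4]]
```

Each returned clique would then host one experts algorithm, with the number of cliques playing the role of the number of meta-actions in the regret bound.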
4 The ELP Algorithm
We now turn to present the ELP algorithm (which stands for “Exponentially-weighted algorithm with Linear Programming”). Like all multi-armed bandits algorithms, it is based on a tradeoff between exploration and exploitation. However, unlike standard algorithms, the exploration component is not uniform over the actions, but is chosen carefully to reflect the graph structure at each round. In fact, the optimal choice of the exploration requires us to solve a simple linear program, hence the name of the algorithm. Below, we present the pseudo-code as well as a couple of theorems that bound the expected regret of the algorithm under appropriate parameter choices. The proofs of the theorems appear in the appendix. The first theorem concerns the symmetric observation case, where if choosing action $i$ gives information on action $j$, then choosing action $j$ must also give information on $i$. The second theorem concerns the general case. We note that in both cases the graph $G_t$ may change arbitrarily in time.
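As a rough illustration of the exploration and exploitation mix described above, the sketch below implements one round of an exponentially weighted update with graph-aware exploration. It is written under assumptions: the importance-weighting rule follows the usual EXP3 style, the exploration distribution `explore` is taken as given rather than computed by a linear program, and all function and parameter names are hypothetical.

```python
import math


def elp_probabilities(weights, explore, gamma):
    """Mix exponential-weights exploitation with an exploration distribution."""
    total = sum(weights)
    return [(1 - gamma) * w / total + gamma * s for w, s in zip(weights, explore)]


def elp_update(weights, probs, neighborhoods, chosen, estimates, eta):
    """One multiplicative weight update from the side-observations of a round.

    neighborhoods[j] is the set of actions that reveal an estimate of j's reward;
    estimates[j] is that estimate, needed only when chosen is in neighborhoods[j].
    """
    new_weights = []
    for j, w in enumerate(weights):
        if chosen in neighborhoods[j]:
            # Dividing by the probability of observing j keeps the estimate unbiased.
            observe_prob = sum(probs[l] for l in neighborhoods[j])
            new_weights.append(w * math.exp(eta * estimates[j] / observe_prob))
        else:
            # Nothing was learned about action j this round.
            new_weights.append(w)
    return new_weights


# Hypothetical round: actions 0 and 1 observe each other, action 2 only itself.
neighborhoods = [{0, 1}, {0, 1}, {2}]
probs = elp_probabilities([1.0, 1.0, 2.0], [0.5, 0.5, 0.0], gamma=0.2)
weights = elp_update([1.0, 1.0, 2.0], probs, neighborhoods,
                     chosen=0, estimates={0: 1.0, 1: 0.0}, eta=0.1)
```

In this example the weight of action 0 grows, action 1 is observed with a zero reward and keeps its weight, and action 2 is left untouched because choosing action 0 reveals nothing about it.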
4.1 Undirected Graphs
The following theorem provides a regret bound for the algorithm, as well as appropriate parameter choices, in the case of undirected graphs. Later on, we will discuss the case of directed graphs. In a nutshell, the theorem shows that the regret bound depends on the average independence number $\alpha(G_t)$ of each graph $G_t$, namely the size of its largest independent set.
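Theorem 2. Suppose that for all $t$, $G_t$ is an undirected graph. Suppose we run Algorithm 2 using some $\eta \in (0, 1/(3kb))$, and choosing
$$\{s_i(t)\}_{i \in [k]} = \arg\max_{(s_1, \ldots, s_k) \in \Delta} \; \min_{i \in [k]} \sum_{l \in N_i(t)} s_l$$
(which can be easily done via linear programming, with $\Delta$ denoting the $k$-simplex) and $\gamma(t) = \eta b / \min_{i \in [k]} \sum_{l \in N_i(t)} s_l(t)$. Then it holds for any fixed action $j$ that
$$\sum_{t=1}^{T} g_j(t) - \mathbb{E}\left[\sum_{t=1}^{T} g_{i_t}(t)\right] \le 3 \eta b^2 \sum_{t=1}^{T} \alpha(G_t) + \frac{\log(k)}{\eta}.$$
If we choose $\eta = \sqrt{\log(k) / (3 b^2 \sum_{t=1}^{T} \alpha(G_t))}$, then the bound equals
$$b \sqrt{12 \log(k) \sum_{t=1}^{T} \alpha(G_t)}. \qquad (3)$$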
Comparing Thm. 2 with Thm. 1, we note that for any graph $G_t$, its independence number $\alpha(G_t)$ lower bounds its clique-partition number $\bar{\chi}(G_t)$. In fact, the gap between them can be very large (see Sec. 6). Thus, the attainable regret using the ELP algorithm is better than the one attained by the ExpBan algorithm. Moreover, the ELP algorithm is able to deal with time-changing graphs, unlike the ExpBan algorithm.
If we take worst-case computational efficiency into account, things are slightly more involved. For the ELP algorithm, the optimal value of $\eta$, needed to obtain Eq. (3), requires knowledge of $\sum_{t=1}^{T} \alpha(G_t)$, but computing or approximating $\alpha(G_t)$ is NP-hard in the worst case. However, there is a simple fix: we create $O(\log k)$ copies of the ELP algorithm, where copy $i$ assumes that the average independence number $\frac{1}{T}\sum_{t=1}^{T} \alpha(G_t)$ equals $2^{i}$. Note that one of these values must be wrong by a factor of at most $2$, so the regret of the algorithm using that value would be larger by at most a constant factor. Of course, the problem is that we don’t know in advance which of those copies is the best one. But this can be easily solved by treating each such copy as a “meta-action”, and running a standard multi-armed bandits algorithm (such as EXP3) over these actions. Note that the same idea was used in the construction of the ExpBan algorithm. Since there are only $O(\log k)$ meta-actions, the additional regret incurred is of lower order than the main term, up to logarithmic factors. So up to logarithmic factors in $k$, we get the same regret as if we could actually compute the optimal value of $\eta$.
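The grid of guesses used by this fix can be sketched as follows. The sketch assumes the unknown quantity lies between 1 and the number of actions `k` and that guesses are powers of two; the function names are hypothetical.

```python
import math


def alpha_guesses(k):
    """Powers of two covering the range [1, k] of possible independence numbers."""
    return [2 ** j for j in range(math.ceil(math.log2(k)) + 1)]


def closest_guess(k, alpha):
    """Smallest guess that is at least alpha; it overshoots by a factor below 2."""
    return min(g for g in alpha_guesses(k) if g >= alpha)


print(alpha_guesses(10))     # [1, 2, 4, 8, 16]
print(closest_guess(10, 5))  # 8
```

Only a logarithmic number of copies is needed, which is why running a bandit algorithm over them costs little compared with the main regret term.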
4.2 Directed Graphs
So far, we assumed that the graphs we are dealing with are all undirected. However, a natural extension of this setting is to assume a directed graph, where choosing an action $i$ may give us information on the reward of action $j$, but not vice-versa. It is readily seen that the ExpBan algorithm would still work in this setting, with the same guarantee. For the ELP algorithm, we can provide the following guarantee:
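Theorem 3. Under the conditions of Thm. 2 (with the relaxation that the graphs $G_t$ may be directed), it holds for any fixed action $j$ that
$$\sum_{t=1}^{T} g_j(t) - \mathbb{E}\left[\sum_{t=1}^{T} g_{i_t}(t)\right] \le 3 \eta b^2 \sum_{t=1}^{T} \bar{\chi}(G_t) + \frac{\log(k)}{\eta},$$
where $\bar{\chi}(G_t)$ is the clique-partition number of $G_t$. If we choose $\eta = \sqrt{\log(k) / (3 b^2 \sum_{t=1}^{T} \bar{\chi}(G_t))}$, then the bound equals
$$b \sqrt{12 \log(k) \sum_{t=1}^{T} \bar{\chi}(G_t)}.$$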
Note that this bound is weaker than the one of Thm. 2, since $\alpha(G_t) \le \bar{\chi}(G_t)$ as discussed earlier. We do not know whether this bound (relying on the clique-partition number) is tight, but we conjecture that the independence number, which appears to be the key quantity in undirected graphs, is not the correct combinatorial measure for the case of directed graphs. (It is possible to construct examples where the analysis of the ELP algorithm necessarily leads to an $O(\sqrt{kT})$ bound, even when the independence number is $1$.) In any case, we note that even with the weaker bound above, the ELP algorithm still seems superior to the ExpBan algorithm, in the sense that it allows us to deal with time-changing graphs, and that an explicit clique decomposition of the graph is not required. Also, we again have the issue of $\eta$, which is determined by a quantity that is NP-hard to compute, i.e. $\bar{\chi}(G_t)$. However, this can be circumvented using the same trick discussed in the context of undirected graphs.
5 Lower Bound
The following theorem provides a lower bound on the regret in terms of the independence number $\alpha(G)$, for a constant graph $G_t = G$.
Theorem 4. Suppose $G_t = G$ for all $t$, and that actions which are not linked in $G$ get no side-observations whatsoever between them. Then there exists a (randomized) adversary strategy, such that for every sufficiently large $T$ and any learning strategy, the expected regret is $\Omega(\sqrt{\alpha(G) T})$.
A proof is provided in the appendix. The intuition of the proof is that if the graph $G$ has $\alpha(G)$ independent vertices, then an adversary can make this problem as hard as a standard multi-armed bandits problem, played on $\alpha(G)$ actions. Using a known lower bound of $\Omega(\sqrt{kT})$ for multi-armed bandits on $k$ actions, our result follows. (We note that if the maximal degree of every node is bounded, it is possible to get the lower bound under a weaker requirement on $T$; see the proof for details.)
For constant undirected graphs, this lower bound matches the regret upper bound for the ELP algorithm (Thm. 2) up to logarithmic factors. For directed graphs, the difference between them boils down to the difference between $\alpha(G)$ and $\bar{\chi}(G)$. For many well-behaved graphs, this gap is rather small. However, for general graphs, the difference can be huge; see the next section for details.
6 Examples
Here, we briefly discuss some concrete examples of graphs $G$, and show how the regret performance of our algorithms depends on their structure. An interesting issue to notice is the potential gap between the performance of our algorithms, through the graph’s independence number $\alpha(G)$ and clique-partition number $\bar{\chi}(G)$.
First, consider the case where there exists a single action, such that choosing it reveals the rewards of all the other actions. In contrast, choosing any of the other actions only reveals its own reward. At first blush, it may seem that having such a “super-action”, which reveals everything that happens in the current round, should help us improve our regret. However, the independence number $\alpha(G)$ of such a graph is easily seen to be $k-1$. Based on our lower bound, we see that this “super-action” is actually not helpful at all (up to negligible factors).
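The independence number in this example can be checked by brute force on a small instance. The sketch below ignores edge directions and models the “super-action” as the hub of an undirected star, which is a modeling assumption made only for counting the independent set; both function names are hypothetical.

```python
from itertools import combinations


def independence_number(adj):
    """Brute-force size of the largest independent set (exponential in the node count)."""
    nodes = sorted(adj)
    for size in range(len(nodes), 0, -1):
        for subset in combinations(nodes, size):
            # The subset is independent when no pair of its nodes is linked.
            if all(v not in adj[u] for u, v in combinations(subset, 2)):
                return size
    return 0


def star_graph(k):
    """Star on k nodes: node 0 (the super-action) is linked to every other node."""
    adj = {i: {0} for i in range(1, k)}
    adj[0] = set(range(1, k))
    return adj


print(independence_number(star_graph(6)))  # 5
```

The largest independent set consists of all actions except the hub, so a single highly informative action leaves the independence number almost unchanged.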
Second, consider the case where the actions are endowed with some metric distance function, and edge $(i,j)$ is in $G$ if and only if the distance between $i,j$ is at most some fixed constant $r$. We can think of each action $i$ as being in the center of a sphere of radius $r$, such that the reward of action $i$ is propagated to every other action in that sphere. In this case, $\alpha(G)$ is essentially the number of non-overlapping spheres we can pack in the action space. In contrast, $\bar{\chi}(G)$ is essentially the number of spheres we need to cover the action space. Both numbers shrink rapidly as $r$ increases, improving the regret of our algorithms. However, the sphere covering size can be much larger than the sphere packing size. For example, if the actions are placed as the elements in , we use the metric, and , it is easily seen that the sphere packing number is just . In contrast, the sphere covering number is at least , since we need a separate sphere to cover every element in .
Third, consider the random Erdős–Rényi graph $G(k,p)$, which is formed by linking every action $i$ to every action $j$ with probability $p$ independently. It is well known that when $p$ is a constant, the independence number $\alpha(G)$ of this graph is only $O(\log k)$, whereas the clique-partition number $\bar{\chi}(G)$ is at least $\Omega(k/\log k)$. This translates to a regret bound of $\tilde{O}(\sqrt{kT})$ for the ExpBan algorithm, and only $\tilde{O}(\sqrt{T})$ for the ELP algorithm. Such a gap would also hold for a directed random graph.
7 Empirical Performance Gap between ExpBan and ELP
In this section, we show that the gap between the performance of the ExpBan algorithm and the ELP algorithm can be real, and is not just an artifact of our analysis.
To show this, we performed the following simple experiment: we created a random Erdős–Rényi graph over nodes, where each pair of nodes was linked independently with probability $p$. Choosing any action results in observing the rewards of neighboring actions in the graph. The reward of each action at each round was chosen randomly and independently to be with probability and with probability , except for a single node, whose reward equals with a higher probability of . We then implemented the ExpBan and ELP algorithms in this setting, for . For comparison, we also implemented the standard EXP3 multi-armed bandits algorithm [4], which doesn’t use any side-observations. All the parameters were set to their theoretically optimal values. The experiment was repeated for varying $p$ and over independent runs.
The results are displayed in Figure 1. The $X$-axis is the iteration number, and the $Y$-axis is the mean payoff obtained so far, averaged over the runs (the variance in the numbers was minuscule, and therefore we do not report confidence intervals). For small $p$, the graph is rather empty, and the advantage of using side observations is not large. As a result, all 3 algorithms perform roughly the same for this choice of $p$. As $p$ increases, the value of side-observations increases, and the performance of our two algorithms, which utilize side-observations, improves over the standard multi-armed bandits algorithm. Moreover, for intermediate values of $p$, there is a noticeable gap between the performance of ExpBan and ELP. This is exactly the regime where the gap between the clique-partition number (governing the regret bound of ExpBan) and the independence number (governing the regret bound for the ELP algorithm) tends to be larger as well. (Intuitively, this can be seen by considering the extreme cases: for a complete graph over $k$ nodes, both numbers equal $1$, and for an empty graph over $k$ nodes, both numbers equal $k$. For constant $p \in (0,1)$, there is a real gap between the two, as discussed in Sec. 6.) Finally, for large $p$, the graph is almost complete, and the advantage of ELP over ExpBan becomes small again (since most actions give information on most other actions).
8 Discussion
In this paper, we initiated a study of a large family of online learning problems with side observations. In particular, we studied the broad regime which interpolates between the experts setting and the bandits setting of online learning. We provided algorithms, as well as upper and lower bounds on the attainable regret, with a non-trivial dependence on the information feedback structure.
There are many open questions that warrant further study. First, the upper and lower bounds essentially match only in particular settings (i.e., in undirected graphs, where no side-observations whatsoever, other than those dictated by the graph, are allowed). Can this gap be narrowed or closed? Second, our lower bounds depend on a reduction which essentially assumes that the graph is constant over time. We do not have a lower bound for changing graphs. Third, it remains to be seen whether other online learning results can be generalized to our setting, such as learning with respect to policies (as in EXP4 [4]) and obtaining bounds which hold with high probability. Fourth, the model we have studied assumed that the observation structure is known. In many practical cases, the observation structure may be known just partially or approximately. Is it possible to devise algorithms for such cases?
Acknowledgements. This research was supported in part by the Google Inter-university center for Electronic Markets and Auctions.
References
[1] R. Agrawal. The continuum-armed bandit problem. SIAM J. Control and Optimization, 33:1926–1951, 1995.
[2] H. Arslan, Z. N. Chen, and M. G. Di Benedetto. Ultra Wideband Wireless Communication. Wiley-Interscience, 2006.
[3] J.-Y. Audibert and S. Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT, 2009.
[4] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48–77, 2002.
[5] V. Baston. Some cyclic inequalities. Proceedings of the Edinburgh Mathematical Society (Series 2), 19:115–118, 1974.
[6] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
[7] N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. In COLT, 2009.
[8] R. Kleinberg, A. Slivkins, and E. Upfal. Multi-armed bandits in metric spaces. In STOC, pages 681–690, 2008.
[9] J. Langford and T. Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In NIPS, 2007.
[10] L. Li, W. Chu, J. Langford, and R. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670. ACM, 2010.
[11] P. Rusmevichientong and J. Tsitsiklis. Linearly parameterized bandits. Math. Oper. Res., 35(2):395–411, 2010.
[12] D. Zuckerman. Linear degree extractors and the inapproximability of max clique and chromatic number. Theory of Computing, 3(1):103–128, 2007.
Appendix A Proofs
A.1 Proof of Thm. 1
Suppose we split the actions into $C$ cliques. First, let us consider the expected regret of the exponentially weighted forecaster run over any such clique. Denoting the actions of the clique by $1, \ldots, n$, the forecaster works as follows: first, it initializes the weights $w_1(1), \ldots, w_n(1)$ uniformly. At each round $t$, it picks an action $j$ with probability $w_j(t) / \sum_{l=1}^{n} w_l(t)$, receives the reward $g_j(t)$, and observes the noisy reward value $\hat{g}_l(t)$ for each of the other actions $l$. It then updates $w_l(t+1) = w_l(t) \exp(\eta \hat{g}_l(t))$ (for some parameter $\eta$) for all $l$.
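The forecaster described above can be sketched as a small class. This is a minimal illustration under assumptions: the class and method names are hypothetical, the initial weights are taken to be uniform, and the estimates passed to `update` are assumed to be unbiased and bounded as in the problem setting.

```python
import math
import random


class CliqueForecaster:
    """Exponentially weighted forecaster over the actions of a single clique."""

    def __init__(self, actions, eta):
        self.actions = list(actions)
        self.eta = eta
        # Uniform initial weights (an assumption of this sketch).
        self.weights = {a: 1.0 / len(self.actions) for a in self.actions}

    def probabilities(self):
        """Each action is picked with probability proportional to its weight."""
        total = sum(self.weights.values())
        return {a: w / total for a, w in self.weights.items()}

    def choose(self, rng=random):
        probs = self.probabilities()
        return rng.choices(self.actions, weights=[probs[a] for a in self.actions])[0]

    def update(self, estimates):
        """Multiply every weight by exp(eta * estimated reward of that action)."""
        for a in self.actions:
            self.weights[a] *= math.exp(self.eta * estimates[a])


forecaster = CliqueForecaster(actions=[3, 5], eta=0.5)
forecaster.update({3: 1.0, 5: 0.0})
print(forecaster.probabilities())  # action 3 now has the larger probability
```

Because every action in the clique is observed each round, all weights are updated, which is what distinguishes this experts-style update from a bandit update.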
The analysis of this algorithm is rather standard, with the main twist being that we only observe unbiased estimates of the rewards, rather than the actual reward. For completeness, we provide this analysis in the following lemma.
The expected regret of the forecaster described above, with respect to the actions in clique and under the optimal choice of the parameter is at most .
We define the potential function , and get that
For notational convenience, let . Since , and , we have . Thus, we can use the inequality (which holds for any ), and get the upper bound
Taking logarithms and using the fact that , we get
Summing over all , and canceling the resulting telescopic series, we get
Also, for any fixed action , we have
Taking expectations on both sides, and using the facts that for all , and with probability , we get
Thus, by picking , we get that the expected regret is at most . ∎
Now, we define each such forecaster (one per clique ) as a meta-action, and run the EXP3 algorithm on the meta-actions. By the standard guarantee for this algorithm (see Corollary 3.2 in [4]), the expected regret incurred by that algorithm with respect to any fixed meta-action is at most . Combining this with Lemma 1, we get that the total expected regret of the ExpBan algorithm with respect to any single action is at most
which is at most since .
A.2 Proof of Thm. 2
To prove the theorem, we will need three lemmas. The first one is straightforward and follows from the definition of . The second is a key combinatorial inequality. We were unable to find an occurrence of this inequality in any previous literature, although we are aware of very special cases proven in the context of cyclic sums (see for instance [5]). The third lemma allows us to derive a more explicit bound by examining a particular choice of .
For all fixed , we have
as well as
It holds that
As to the second part, we have
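Lemma 3. Let $G$ be a graph over $k$ nodes, and let $\alpha(G)$ denote the independence number of $G$ (i.e., the size of its largest independent set). For any $i \in [k]$, define $N_i$ to be the nodes adjacent to node $i$ (including node $i$). Let $p_1, \ldots, p_k$ be arbitrary positive weights assigned to the nodes. Then it holds that
$$\sum_{i=1}^{k} \frac{p_i}{\sum_{j \in N_i} p_j} \le \alpha(G).$$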
We will actually prove the claim for any nonnegative weights (i.e., they are allowed to take values), under the convention that if and as well, then .
Suppose on the contrary that there exist some values for such that . Now, if are non-zero only on an independent set , then
Since , it follows that there exist some adjacent nodes such that . However, we will show that in that case, we can only increase the value of by shifting the entire weight to either node or node , and putting weight at the other node. By repeating this process, we are guaranteed to eventually arrive at a configuration where the weights are non-zero on an independent set. But we’ve shown above that in that case, , so this means the value of this expression with respect to the original configuration was at most as well.
To show this, let us fix (so that ) and consider how the value of the expression changes as we vary . The sum in the expression can be split into six parts: when , when , when is a node adjacent to but not to , when is adjacent to but not to , when is adjacent to both, and when is adjacent to neither of them. Decomposing the sum in this way, so that appears everywhere explicitly, we get
It is readily seen that each of the elements in the sum above is convex in . This implies that the maximum of this expression is attained at the extremes, namely either (hence ) or (hence ). This proves that indeed shifting weights between adjacent nodes can only increase the value of , and as discussed earlier, implies the result stated in the lemma. ∎
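The combinatorial inequality and the weight-shifting step can be checked numerically on a small graph. The sketch below assumes the lemma reads as "the sum over nodes of each node's weight divided by the total weight of its closed neighborhood is at most the independence number", with the convention 0/0 = 0. The function names, the 5-cycle example, and the random sampling are illustrative choices that do not come from the paper.

```python
import itertools
import random


def independence_number(n, edges):
    """Brute-force size of a largest independent set (only for tiny graphs)."""
    edge_set = {frozenset(e) for e in edges}
    for size in range(n, 0, -1):
        for subset in itertools.combinations(range(n), size):
            pairs = itertools.combinations(subset, 2)
            if all(frozenset(p) not in edge_set for p in pairs):
                return size
    return 0


def closed_neighborhoods(n, edges):
    """N_i: the neighbors of node i together with i itself."""
    nbrs = [{i} for i in range(n)]
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return nbrs


def ratio_sum(weights, nbrs):
    """Sum of w_i / sum_{j in N_i} w_j, with the convention 0/0 = 0."""
    total = 0.0
    for i, w in enumerate(weights):
        if w > 0:
            total += w / sum(weights[j] for j in nbrs[i])
    return total


def shift_weight(weights, r, s, nbrs):
    """Move the joint weight of adjacent nodes r, s onto one of them,
    returning whichever of the two extremes gives the larger ratio sum."""
    c = weights[r] + weights[s]
    options = []
    for keep, drop in ((r, s), (s, r)):
        w = list(weights)
        w[keep], w[drop] = c, 0.0
        options.append(w)
    return max(options, key=lambda w: ratio_sum(w, nbrs))


# Hypothetical example graph: a 5-cycle, whose independence number is 2.
n = 5
edges = [(i, (i + 1) % n) for i in range(n)]
alpha = independence_number(n, edges)
nbrs = closed_neighborhoods(n, edges)

rng = random.Random(0)
samples = [[rng.random() for _ in range(n)] for _ in range(2000)]
worst = max(ratio_sum(w, nbrs) for w in samples)
print(alpha, worst)
```

Random sampling is of course no substitute for the proof; it only illustrates that the bound is attained when the weights sit on a largest independent set, and that shifting weight between two adjacent nodes does not decrease the sum.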
Consider a graph over nodes , and let be its independence number. For any , define to be the nodes adjacent to node (including node ). Then there exist values of on the -simplex, such that
Let be a largest independent set of , so that . Consider the following specific choice for the values of : For any such that , let , and otherwise. Suppose there was some node such that . By the way we chose values for , this implies that node is not adjacent to any node in , so would also be an independent set, contradicting the assumption that is a largest independent set. But since each value of is either or , it follows that . This is true for any node , from which Eq. (8) follows. ∎
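The construction used in this proof can be sketched in a few lines: put uniform mass on a largest independent set and verify that every closed neighborhood receives mass at least one over the independence number. Since the displayed bound of Eq. (8) is not reproduced in this copy, the sketch checks only that intermediate fact. The name `s` for the weight vector, the helper functions, and the path graph are assumptions made for illustration.

```python
import itertools


def largest_independent_set(n, edges):
    """Brute-force search for a largest independent set (tiny graphs only)."""
    edge_set = {frozenset(e) for e in edges}
    for size in range(n, 0, -1):
        for subset in itertools.combinations(range(n), size):
            pairs = itertools.combinations(subset, 2)
            if all(frozenset(p) not in edge_set for p in pairs):
                return set(subset)
    return set()


def uniform_on_independent_set(n, edges):
    """Weights on the simplex: 1/alpha on a largest independent set, else 0."""
    indep = largest_independent_set(n, edges)
    alpha = len(indep)
    s = [1.0 / alpha if i in indep else 0.0 for i in range(n)]
    return s, alpha


def neighborhood_mass(s, edges):
    """Total weight in each node's closed neighborhood (node plus neighbors)."""
    mass = list(s)  # each node counts its own weight
    for u, v in edges:
        mass[u] += s[v]
        mass[v] += s[u]
    return mass


# Hypothetical example graph: a path on 4 nodes, 0 - 1 - 2 - 3.
edges = [(0, 1), (1, 2), (2, 3)]
s, alpha = uniform_on_independent_set(4, edges)
mass = neighborhood_mass(s, edges)
print(s, alpha, mass)
```

If some closed neighborhood had zero mass, the corresponding node could be added to the independent set, which is exactly the contradiction the proof uses.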
We now turn to the proof of the theorem itself.
Proof of Thm. 2.
With the key lemmas at hand, most of the remaining proof is rather similar to the standard analysis for multi-armed bandits (e.g., ). We define the potential function , and get that
We have that , since by definition of and ,
Using the definition of and the inequality for any , we can upper bound Eq. (9) by
Taking logarithms and using the fact that , we get
Summing over all , and canceling the resulting telescopic series, we get
Also, for any fixed action , we have
Taking expectations on both sides, and using Lemma 2, we get
After some slight manipulations, and using the fact that for all , we get
We note that can be upper bounded by , since by definition of ,
Plugging this in as well as our choice of in the term, and slightly simplifying, we get the upper bound
A.3 Proof of Thm. 3
The proof is very similar to the one of Thm. 2, so we’ll only point out the differences.
Referring to the proof of Thm. 2 in Subsection A.2, the analysis is identical up to Eq. (12). To upper bound the terms there, we can still invoke Lemma 4. However, Lemma 3, which was used to upper bound , no longer applies (in fact, one can show specific counter-examples). Thus, in lieu of Lemma 3, we will opt for the following weaker bound: Let be a smallest possible clique partition of . Then we have
A.4 Proof of Theorem 4
Suppose that we are given a graph with an independence number . Let denote an independent set of nodes (i.e., no two nodes are connected). Suppose we have an algorithm with a low expected regret for every sequence of rewards. We will use this algorithm to form an algorithm for the standard multi-armed bandits problem (with no side-observations). We will then resort to the known lower bound for this problem, to get a lower bound for our setting as well.
Consider first a standard multi-armed bandits game on actions (with no side-observations), with the following randomized strategy for the adversary: the adversary picks one of the actions uniformly at random, and at each round, assigns it a random Bernoulli reward with parameter (where will be specified later). The other actions are assigned a random Bernoulli reward with parameter . Roughly speaking, Theorem 6.11 of  shows that with this strategy and for , the expected regret of any learning algorithm is at least .
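The randomized adversary described above can be sketched as follows. The exact Bernoulli parameters are not legible in this copy, so `base` (the parameter of the ordinary actions) and `eps` (the advantage of the distinguished action) are hypothetical placeholders, and the values used in the example call are illustrative only.

```python
import random


def make_adversary(n_arms, base, eps, seed=None):
    """Pick one arm uniformly at random and give it a slightly better
    Bernoulli reward than the others.

    `base` and `eps` are placeholder parameters (assumptions, not values
    taken from the paper). Returns the index of the distinguished arm and
    a function that draws a fresh reward for a given arm.
    """
    rng = random.Random(seed)
    best = rng.randrange(n_arms)

    def pull(arm):
        p = base + eps if arm == best else base
        return 1.0 if rng.random() < p else 0.0

    return best, pull


# Illustrative values only.
best, pull = make_adversary(n_arms=4, base=0.5, eps=0.1, seed=0)
rewards = [pull(arm) for arm in range(4)]
print(best, rewards)
```

Because each call to `pull` draws an independent sample, the order in which arms are queried does not affect the reward distribution.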
Now, suppose that for the setting with side-observations, played over the graph , there exists a learning strategy that achieves expected cumulative regret of at most , for the graph over rounds, with respect to any adversary strategy. We will now show how to use for the standard multi-armed bandits game described above. To that end, arbitrarily assign the actions to the independent nodes in . We will then implement the following strategy : whenever chooses one of the actions in , we choose the corresponding action in the multi-armed bandits problem and feed the reward back to (the reward of all neighboring nodes is 0, which we feed back to as well). Whenever chooses a node not in , we use the next rounds (where is the neighborhood set of ) to do “pure exploration:” we go over all the neighbors of node that belong to in some fixed order, and choose each of them once (since rewards are assumed stochastic the order does not matter). Nodes in are known to yield a reward of . The rewards of node and all its neighbors are then fed to , as if they were side observations obtained in a single round by choosing a node not in . Since the rewards are chosen i.i.d., the distribution of these rewards is identical to the case where was really implemented with side-observations. We denote as the expected regret of this strategy , after rounds.
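The reduction described in this paragraph can be sketched as a wrapper that drives a side-observation learner using only ordinary bandit pulls. The learner interface (`choose` and `update`), the function name `run_reduction`, and the toy `RoundRobin` learner are all hypothetical and serve only to make the control flow concrete.

```python
def run_reduction(alg, nbrs, indep, pull_arm, calls):
    """Drive a side-observation learner `alg` with plain bandit pulls.

    alg      -- hypothetical learner with choose() -> node and
                update(dict mapping node -> observed reward)
    nbrs     -- nbrs[v] is the set of nodes adjacent to v (excluding v)
    indep    -- list of independent nodes; indep[a] stands for bandit arm a
    pull_arm -- function mapping an arm index to a reward
    calls    -- number of times the learner is invoked
    Returns the number of bandit rounds consumed.
    """
    arm_of = {node: a for a, node in enumerate(indep)}
    rounds = 0
    for _ in range(calls):
        v = alg.choose()
        if v in arm_of:
            # One bandit round: play the matching arm. Neighbors of an
            # independent node lie outside the set, so their reward is 0.
            obs = {u: 0.0 for u in nbrs[v]}
            obs[v] = pull_arm(arm_of[v])
            rounds += 1
        else:
            # Pure exploration: pull each neighbor inside the set once,
            # in a fixed order; all other nodes are known to yield 0.
            obs = {v: 0.0}
            for u in sorted(nbrs[v]):
                if u in arm_of:
                    obs[u] = pull_arm(arm_of[u])
                    rounds += 1
                else:
                    obs[u] = 0.0
        alg.update(obs)
    return rounds


class RoundRobin:
    """Toy stand-in for the learner: cycles through nodes, logs feedback."""

    def __init__(self, n):
        self.n, self.t, self.log = n, 0, []

    def choose(self):
        v = self.t % self.n
        self.t += 1
        return v

    def update(self, obs):
        self.log.append(obs)


# Hypothetical star graph: center 0 joined to the independent nodes 1, 2, 3.
nbrs = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
indep = [1, 2, 3]
alg = RoundRobin(4)
used = run_reduction(alg, nbrs, indep, lambda a: 1.0 if a == 0 else 0.0, calls=4)
print(used, alg.log)
```

In this toy run, the single call that selects the center consumes three bandit rounds, which is the source of the gap between the number of learner calls and the number of bandit rounds that the rest of the proof has to account for.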
We make the following observation: suppose achieves an expected regret satisfying
(we can assume this since our goal is to provide a lower bound which will only be smaller). Then the number of times chose actions outside must be smaller than . This is because whenever chooses an action not in it receives a reward of 0 while the highest expected reward is bigger than , so the expected per-round regret would increase by at least .
We apply at each round, till is called times. Let be the (possibly random) number of rounds which elapsed. It holds that , since we have the pure exploration rounds where is not called. In these exploration rounds, we pull arms in , so our expected regret in those rounds is at most