From Bandits to Experts: On the Value of Side-Observations
Abstract
We consider an adversarial online learning setting where a decision maker can choose an action in every stage of the game. In addition to observing the reward of the chosen action, the decision maker gets side observations on the reward he would have obtained had he chosen some of the other actions. The observation structure is encoded as a graph, where node $i$ is linked to node $j$ if sampling $i$ provides information on the reward of $j$. This setting naturally interpolates between the well-known “experts” setting, where the decision maker can view all rewards, and the multi-armed bandits setting, where the decision maker can only view the reward of the chosen action. We develop practical algorithms with provable regret guarantees, which depend on non-trivial graph-theoretic properties of the information feedback structure. We also provide partially-matching lower bounds.
Shie Mannor Department of Electrical Engineering Technion, Israel shie@ee.technion.ac.il Ohad Shamir Microsoft Research New England USA ohadsh@microsoft.com
1 Introduction
One of the most basic learning settings studied in the online learning framework is learning from experts. In its simplest form, we assume that each round $t$, the learning algorithm must choose one of $k$ possible actions, which can be interpreted as following the advice of one of $k$ “experts”. (Footnote 1: The more general setup, which is beyond the scope of this paper, considers experts providing advice for choosing among $n$ actions, where in general $n \neq k$ [4].) At the end of the round, the performance of all actions, measured here in terms of some reward, is revealed. This process is iterated for $T$ rounds, and our goal is to minimize the regret, namely the difference between the total reward of the single best action in hindsight, and our own accumulated reward. We follow the standard online learning framework, in which nothing whatsoever can be assumed on the process generating the rewards, and they might even be chosen by an adversary who has full knowledge of our learning algorithm.
A crucial assumption in this setting is that we get to see the rewards of all actions at the end of each round. However, in many real-world scenarios, this assumption is unrealistic. A canonical example is web advertising, where at any timepoint one may choose only a single ad (or small number of ads) to display, and observe whether it was clicked, but not whether other ads would have been clicked or not if presented to the user. This partial information constraint has led to a flourishing literature on multi-armed bandits problems, which model the setting where we can only observe the reward of the action we chose. While this setting has been long studied under stochastic assumptions, the landmark paper [4] showed that this setting can also be dealt with under adversarial conditions, making the setting comparable to the experts setting discussed above. The price in terms of the provable regret is usually an extra $\sqrt{k}$ multiplicative factor in the bound. The intuition for this factor has long been that in the bandit setting, we only get “$1/k$ of the information” obtained in the expert setting (as we observe just a single reward rather than $k$). While the bandits setting received much theoretical interest, it has also been criticized for not capturing additional side-information we often have on the rewards of the different actions. This has led to studying richer settings, which make various assumptions on the relationship between the rewards; see below for more details.
In this paper, we formalize and initiate a study on a range of settings that interpolates between the bandits setting and the experts setting. Intuitively, we assume that after choosing some action $i$, and obtaining the action’s reward, we observe not just action $i$’s reward (as in the bandit setting), and not the rewards of all actions (as in the experts setting), but rather some (possibly noisy) information on a subset of the other actions. This subset may depend on action $i$ in an arbitrary way, and may change from round to round. This information feedback structure can be modeled as a sequence of directed graphs $G_1, \ldots, G_T$ (one per round $t$), so that an edge from action $i$ to action $j$ implies that by choosing action $i$, “sufficiently good” information is revealed on the reward of action $j$ as well. The case of $G_t$ being the complete graph corresponds to the experts setting. The case of $G_t$ being the empty graph corresponds to the bandit setting. The broad scenario of arbitrary graphs in between the two is the focus of our study.
As a motivating example, consider the problem of web advertising mentioned earlier. In the standard multi-armed bandits setting, we assume that we have no information whatsoever on whether undisplayed ads would have been clicked on. However, in many cases, we do have some side-information. For instance, if two ads $i, j$ are for similar vacation packages in Hawaii, and ad $i$ was displayed and clicked on by some user, it is likely that the other ad $j$ would have been clicked on as well. In contrast, if ad $i$ is for running shoes, and ad $j$ is for wheelchair accessories, then a user who clicked on one ad is unlikely to click on the other. This sort of side-information can be better captured in our setting.
As another motivating example, consider a sensor network where each sensor collects data from a certain geographic location. Each sensor covers an area that may overlap the area covered by other sensors. At every stage a centralized controller activates one of the sensors and receives input from it. The value of this input is modeled as the integral of some “information” in the covered area. Since the area covered by each of the sensors overlaps the area covered by other sensors, the reward obtained when choosing sensor $i$ provides an indication of the reward that would have been obtained when sampling sensor $j$. A related example comes from ultra wideband communication networks, where every agent can select which channel to use for transmission. When using a channel, the agent senses if the transmission was successful, and also receives some indication of the noise level in other channels that are in adjacent frequency bands [2].
Our results portray an interesting picture, with the attainable regret depending on non-trivial properties of these graphs. We provide two practical algorithms with regret guarantees: the ExpBan algorithm that is based on a combination of existing methods, and the more fundamentally novel ELP algorithm that has superior guarantees. We also study lower bounds for our setting. In the case of undirected graphs, we show that the information-theoretically attainable regret is precisely characterized by the average independence number (or stability number) of the graph, namely the size of its largest independent set. For the case of directed graphs, we obtain a weaker regret bound, which depends on the average clique-partition number of the graphs. More specifically, our contributions are as follows:

We formally define and initiate a study of the setting that interpolates between learning with expert advice (with $O(\sqrt{\log(k) T})$ regret), which assumes that all rewards are revealed, and the multi-armed bandits setting (with $\tilde{O}(\sqrt{kT})$ regret), which assumes that only the reward of the action selected is revealed. We provide an answer to a range of models in between.

The framework we consider assumes that by choosing each action, other than just obtaining that action’s reward, we can also observe some side-information about the rewards of other actions. We formalize this as a graph $G$ over the actions, where an edge between two actions means that by choosing one action, we can also get a “sufficiently good” estimate of the reward of the other action. We consider both the case where $G = G_t$ changes at each round $t$, as well as the case that $G_t = G$ is fixed throughout all rounds.

We establish upper and lower bounds on the achievable regret, which depends on two combinatorial properties of $G$: its independence number $\alpha(G)$ (namely, the largest number of nodes without edges between them), and its clique-partition number $\bar{\chi}(G)$ (namely, the smallest number of cliques into which the nodes can be partitioned).

We present two practical algorithms to deal with this setting. The first algorithm, called ExpBan, combines existing algorithms in a natural way, and applies only when $G_t = G$ is fixed at all rounds. Ignoring computational constraints, the algorithm achieves a regret bound of $O(\sqrt{\bar{\chi}(G) \log(k) T})$. With computational constraints, its regret bound is $O(\sqrt{c \log(k) T})$, where $c$ is the size of the minimal clique partition one can efficiently find for $G$. However, note that for general graphs, it is NP-hard to find a clique partition whose size is within a factor of $k^{1-\epsilon}$ of $\bar{\chi}(G)$, for any $\epsilon > 0$.

The second algorithm, called ELP, is an improved algorithm, which can handle graphs which change between rounds. For undirected graphs, where sampling $i$ gives an observation on $j$ and vice versa, it achieves a regret bound of $O(\sqrt{\log(k) \sum_{t=1}^{T} \alpha(G_t)})$. For directed graphs (where the observation structure is not symmetric), our regret bound is at most $O(\sqrt{\log(k) \sum_{t=1}^{T} \bar{\chi}(G_t)})$. Moreover, the algorithm is computationally efficient. This is in contrast to the ExpBan algorithm, which in the worst case, cannot efficiently achieve regret significantly better than $O(\sqrt{kT})$.

For the case of a fixed graph $G_t = G$, we present an information-theoretic lower bound on the regret, which holds regardless of computational efficiency.

We present some simple synthetic experiments, which demonstrate that the potential advantage of the ELP algorithm over other approaches is real, and not just an artifact of our analysis.
1.1 Related Work
The standard multi-armed bandits problem assumes no relationship between the actions. Quite a few papers studied alternative models, where the actions are endowed with a richer structure. However, in the large majority of such papers, the feedback structure is the same as in the standard multi-armed bandits. Examples include [11], where the actions’ rewards are assumed to be drawn from a statistical distribution, with correlations between the actions; and [1, 8], where the actions’ rewards are assumed to satisfy some Lipschitz continuity property with respect to a distance measure between the actions.
In terms of other approaches, the combinatorial bandits framework [7] considers a setting slightly similar to ours, in that one chooses and observes the rewards of some subset of actions. However, it is crucially assumed that the reward obtained is the sum of the rewards of all actions in the subset. In other words, there is no separation between earning a reward and obtaining information on its value. Another relevant approach is partial monitoring, which is a very general framework for online learning under partial feedback. However, this generality comes at the price of tractability for all but specific cases, which do not include our model.
Our work is also somewhat related to the contextual bandit problem (e.g., [9, 10]), where the standard multi-armed bandits setting is augmented with some side-information provided in each round, which can be used to determine which action to pick. While we also consider additional side-information, it is in a more specific sense. Moreover, our goal is still to compete against the best single action, rather than some set of policies which use this side-information.
2 Problem Setting
Let $[k] = \{1, \ldots, k\}$ and $[T] = \{1, \ldots, T\}$. We consider a set of actions $1, 2, \ldots, k$. Choosing an action $i$ at round $t$ results in receiving a reward $g_i(t)$, which we shall assume without loss of generality to be bounded in $[0,1]$. Following the standard adversarial framework, we make no assumptions whatsoever about how the rewards are selected, and they might even be chosen by an adversary. We denote our choice of action at round $t$ as $i_t$. Our goal is to minimize regret with respect to the best single action in hindsight, namely
$$\max_{i \in [k]} \sum_{t=1}^{T} g_i(t) - \sum_{t=1}^{T} g_{i_t}(t).$$
For simplicity, we will focus on a finite-horizon setting (where the number of rounds $T$ is known in advance), on regret bounds which hold in expectation, and on oblivious adversaries, namely that the reward sequence is unknown but fixed in advance (see Sec. 8 for more on this issue).
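As a small illustration of the regret notion used in this section, the following sketch computes the regret of a played action sequence against the best single action in hindsight. The function name `regret` and the toy reward table are hypothetical and chosen only for illustration; rewards are assumed to be stored as one list per round.

```python
def regret(rewards, chosen):
    """Regret against the best single action in hindsight.

    rewards[t][i] is the reward of action i at round t (assumed in [0, 1]);
    chosen[t] is the action actually played at round t.
    """
    k = len(rewards[0])
    # Total reward of the best fixed action over all rounds.
    best_fixed = max(sum(row[i] for row in rewards) for i in range(k))
    # Total reward actually collected by the learner.
    earned = sum(row[a] for row, a in zip(rewards, chosen))
    return best_fixed - earned


# Toy instance (illustrative numbers only): 2 actions, 3 rounds.
rewards = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
chosen = [1, 0, 0]
print(regret(rewards, chosen))  # best fixed action earns 2.0, the learner earns 1.0
```

Here the adversary is oblivious in the sense above: the reward table is fixed before the learner makes any choice.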
Each round $t$, the learning algorithm chooses a single action $i_t$. In the standard multi-armed bandits setting, this results in $g_{i_t}(t)$ being revealed to the algorithm, while $g_j(t)$ remains unknown for any $j \neq i_t$. In our setting, we assume that by choosing an action $i_t$, other than getting $g_{i_t}(t)$, we also get some side-observations about the rewards of the other actions. Formally, we assume that one receives $g_{i_t}(t)$, and for some fixed parameter $b$ is able to construct unbiased estimates $\hat{g}_j(t)$ for all actions $j$ in some subset of $[k]$, such that $\mathbb{E}[\hat{g}_j(t) \mid \text{action } i_t \text{ chosen}] = g_j(t)$ and $\Pr(|\hat{g}_j(t)| \leq b \mid \text{action } i_t \text{ chosen}) = 1$. For any action $i$, we let $N_i(t)$ be the set of actions $j$ for which we can get such an estimate $\hat{g}_j(t)$ on the reward of action $j$ by choosing action $i$. This is essentially the “neighborhood” of action $i$, on which choosing $i$ yields sufficiently good information (as parameterized by $b$). We note that $i$ is always a member of $N_i(t)$, and moreover, $N_i(t)$ may be larger or smaller depending on the value of $b$ we choose. We assume that $N_i(t)$ for all $i, t$ are known to the learner in advance.
Intuitively, one can think of this setting as a sequence of graphs, one graph per round $t$, which captures the information feedback structure between the actions. Formally, we define $G_t$ to be a graph on the nodes $1, \ldots, k$, with an edge from node $i$ to node $j$ if and only if $j \in N_i(t)$. In the case that $j \in N_i(t)$ if and only if $i \in N_j(t)$, for all $i, j$, we say that $G_t$ is undirected. We will use this graph viewpoint extensively in the remainder of the paper.
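The graph viewpoint can be made concrete with a short sketch. It assumes the convention that `neighborhoods[i]` holds the actions whose rewards can be estimated after playing action `i` (always including `i` itself); the helper names `feedback_edges` and `is_undirected` are hypothetical. The experts setting and the bandit setting then appear as the two extreme observation structures.

```python
def feedback_edges(neighborhoods):
    """Directed edges (i, j): playing action i yields an estimate of j's reward."""
    return {(i, j) for i, observed in neighborhoods.items() for j in observed}


def is_undirected(neighborhoods):
    """True if every observation relation is symmetric."""
    edges = feedback_edges(neighborhoods)
    return all((j, i) in edges for (i, j) in edges)


k = 4
# Experts setting: every action reveals every reward.
experts = {i: set(range(k)) for i in range(k)}
# Bandit setting: every action reveals only its own reward.
bandit = {i: {i} for i in range(k)}
# A directed structure: action 0 also reveals action 1, but not vice versa.
one_way = {0: {0, 1}, 1: {1}, 2: {2}, 3: {3}}

print(is_undirected(experts), is_undirected(bandit), is_undirected(one_way))
```

Self-loops are kept in the edge set here only because each action always observes its own reward; whether one draws them in the graph is a matter of convention.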
3 The ExpBan Algorithm
We begin by presenting the ExpBan algorithm (see Algorithm 1 above), which builds on existing algorithms to deal with our setting, in the special case where the graph structure remains fixed throughout the rounds, namely, $G_t = G$ for all $t$. The idea of the algorithm is to split the actions into $c$ cliques, such that choosing an action in a clique reveals unbiased estimates of the rewards of all the other actions in the clique. By running a standard experts algorithm (such as the exponentially weighted forecaster; see [6, Chapter 2]) within a clique, we can get low regret with respect to any action in that clique. We then treat each such expert algorithm as a meta-action, and run a standard bandits algorithm (such as EXP3 [4]) over these $c$ meta-actions. We denote this algorithm as ExpBan, since it combines an experts algorithm with a bandit algorithm.
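The two-level structure described above can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the paper's Algorithm 1 listing: side-observations inside a clique are taken to be exact (noise-free) rewards in $[0,1]$, and the names `ExpWeights`, `exp_ban`, `eta`, `gamma` and `reward_fn`, as well as the default parameter values, are hypothetical.

```python
import math
import random


class ExpWeights:
    """Exponentially weighted forecaster over the actions of one clique."""

    def __init__(self, actions, eta):
        self.actions = list(actions)
        self.eta = eta
        self.log_w = [0.0] * len(self.actions)  # log-weights for stability

    def probs(self):
        m = max(self.log_w)
        w = [math.exp(x - m) for x in self.log_w]
        s = sum(w)
        return [x / s for x in w]

    def choose(self, rng):
        return rng.choices(self.actions, weights=self.probs())[0]

    def update(self, estimates):
        # estimates[a] is a reward estimate for every action a in the clique.
        for idx, a in enumerate(self.actions):
            self.log_w[idx] += self.eta * estimates[a]


def exp_ban(cliques, reward_fn, T, eta=0.1, gamma=0.1, seed=0):
    """Run an EXP3-style bandit over cliques, with one forecaster per clique."""
    rng = random.Random(seed)
    experts = [ExpWeights(c, eta) for c in cliques]
    n = len(cliques)
    meta_log_w = [0.0] * n
    total = 0.0
    for t in range(T):
        # EXP3 distribution over meta-actions, mixed with uniform exploration.
        m = max(meta_log_w)
        w = [math.exp(x - m) for x in meta_log_w]
        s = sum(w)
        p = [(1 - gamma) * x / s + gamma / n for x in w]
        c = rng.choices(range(n), weights=p)[0]
        action = experts[c].choose(rng)
        rewards = reward_fn(t)  # only the entries of clique c are used below
        total += rewards[action]
        # Within the chosen clique, all rewards are observed (assumed exact).
        experts[c].update({a: rewards[a] for a in cliques[c]})
        # Importance-weighted update of the chosen meta-action only.
        meta_log_w[c] += gamma * (rewards[action] / p[c]) / n
    return total
```

For instance, `exp_ban([[0, 1], [2, 3]], lambda t: [0.1, 0.9, 0.2, 0.3], 2000)` runs the sketch on two cliques with fixed rewards; the tuning of `eta` and `gamma` that yields the stated regret bound is left to the analysis in the appendix.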
The following result provides a bound on the expected regret of the algorithm. The proof appears in the appendix.
Theorem 1.
Suppose $G_t = G$ is fixed for all $T$ rounds. If we run ExpBan using the exponentially weighted forecaster and the EXP3 algorithm, then the expected regret is bounded as in Eq. (1). (Footnote 2: Using more sophisticated methods, it is now known that the $\log(k)$ factor can be removed (e.g., [3]). However, we will stick with this slightly less tight analysis for simplicity.)
(1) 
For the optimal clique partition, we have $c = \bar{\chi}(G)$, the clique-partition number of $G$.
It is easily seen that $\bar{\chi}(G)$ is a number between $1$ and $k$. The case $\bar{\chi}(G) = 1$ corresponds to $G$ being a clique, namely, that choosing any action allows us to estimate the rewards of all other actions. This corresponds to the standard experts setting, in which case the algorithm attains the optimal $O(\sqrt{\log(k) T})$ regret. At the other extreme, $\bar{\chi}(G) = k$ corresponds to $G$ being the empty graph, namely, that choosing any action only reveals the reward of that action. This corresponds to the standard bandit setting, in which case the algorithm attains the standard $O(\sqrt{\log(k) k T})$ regret. For general graphs, our algorithm interpolates between these regimes, in a way which depends on $\bar{\chi}(G)$.
While being simple and using off-the-shelf components, the ExpBan algorithm has some disadvantages. First of all, for a general graph $G$, it is NP-hard to find a clique partition whose size is within a factor of $k^{1-\epsilon}$ of $\bar{\chi}(G)$, for any $\epsilon > 0$. (This follows from [12] and the fact that the clique-partition number of $G$ equals the chromatic number of its complement.) Thus, with computational constraints, one cannot hope to obtain a bound significantly better than $O(\sqrt{kT})$ in the worst case. That being said, we note that this is only a worst-case result, and in practice or for specific classes of graphs, computing a good clique partition might be relatively easy. A second disadvantage of the algorithm is that it is not applicable for an observation structure that changes with time.
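Since an optimal clique partition is hard to find in general, a practical implementation would typically fall back on a heuristic. The sketch below shows one simple possibility, a first-fit greedy partition (equivalent to greedy coloring of the complement graph). It is an illustrative assumption rather than a procedure from the paper, it carries no approximation guarantee, and the name `greedy_clique_partition` is hypothetical.

```python
def greedy_clique_partition(nodes, adj):
    """First-fit heuristic clique partition of an undirected graph.

    adj[v] is the set of nodes adjacent to v (v itself excluded).
    Each node joins the first existing clique all of whose members
    are adjacent to it, or starts a new clique otherwise.
    """
    cliques = []
    for v in nodes:
        for clique in cliques:
            if all(u in adj[v] for u in clique):
                clique.append(v)
                break
        else:
            cliques.append([v])
    return cliques


# Path graph 0 - 1 - 2 - 3: two cliques suffice.
path_adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(greedy_clique_partition(range(4), path_adj))  # [[0, 1], [2, 3]]
```

On a complete graph the heuristic returns a single clique (the experts regime), and on an empty graph it returns one singleton per action (the bandit regime), matching the two extremes discussed above.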
4 The ELP Algorithm
We now turn to present the ELP algorithm (which stands for “Exponentially-weighted algorithm with Linear Programming”). Like all multi-armed bandits algorithms, it is based on a tradeoff between exploration and exploitation. However, unlike standard algorithms, the exploration component is not uniform over the actions, but is chosen carefully to reflect the graph structure at each round. In fact, the optimal choice of the exploration requires us to solve a simple linear program, hence the name of the algorithm. Below, we present the pseudo-code as well as a couple of theorems that bound the expected regret of the algorithm under appropriate parameter choices. The proofs of the theorems appear in the appendix. The first theorem concerns the symmetric observation case, where if choosing action $i$ gives information on action $j$, then choosing action $j$ must also give information on $i$. The second theorem concerns the general case. We note that in both cases the graph $G_t$ may change arbitrarily in time.
4.1 Undirected Graphs
The following theorem provides a regret bound for the algorithm, as well as appropriate parameter choices, in the case of undirected graphs. Later on, we will discuss the case of directed graphs. In a nutshell, the theorem shows that the regret bound depends on the average independence number $\alpha(G_t)$ of each graph $G_t$, namely, the size of its largest independent set.
Theorem 2.
Suppose that for all , is an undirected graph. Suppose we run Algorithm 2 using some , and choosing
(which can be easily done via linear programming) and . Then it holds for any fixed action that
(2) 
If we choose , then the bound equals
(3) 
Comparing Thm. 2 with Thm. 1, we note that for any graph $G_t$, its independence number $\alpha(G_t)$ lower bounds its clique-partition number $\bar{\chi}(G_t)$. In fact, the gap between them can be very large (see Sec. 6). Thus, the attainable regret using the ELP algorithm is better than the one attained by the ExpBan algorithm. Moreover, the ELP algorithm is able to deal with time-changing graphs, unlike the ExpBan algorithm.
If we take worst-case computational efficiency into account, things are slightly more involved. For the ELP algorithm, the optimal value of $\beta$, needed to obtain Eq. (3), requires knowledge of $\sum_{t=1}^{T} \alpha(G_t)$, but computing or approximating $\alpha(G_t)$ is NP-hard in the worst case. However, there is a simple fix: we create copies of the ELP algorithm, where copy assumes that equals . Note that one of these values must be wrong by a factor of at most , so the regret of the algorithm using that value would be larger by a factor of at most . Of course, the problem is that we don’t know in advance which of those copies is the best one. But this can be easily solved by treating each such copy as a “meta-action”, and running a standard multi-armed bandits algorithm (such as EXP3) over these actions. Note that the same idea was used in the construction of the ExpBan algorithm. Since there are meta-actions, the additional regret incurred is . So up to logarithmic factors in $k$, we get the same regret as if we could actually compute the optimal value of $\beta$.
4.2 Directed Graphs
So far, we assumed that the graphs we are dealing with are all undirected. However, a natural extension of this setting is to assume a directed graph, where choosing an action $i$ may give us information on the reward of action $j$, but not vice-versa. It is readily seen that the ExpBan algorithm would still work in this setting, with the same guarantee. For the ELP algorithm, we can provide the following guarantee:
Theorem 3.
Under the conditions of Thm. 2 (with the relaxation that the graphs may be directed), it holds for any fixed action that
(4) 
where is the cliquepartition number of . If we choose , then the bound equals
(5) 
Note that this bound is weaker than the one of Thm. 2, since $\alpha(G_t) \leq \bar{\chi}(G_t)$ as discussed earlier. We do not know whether this bound (relying on the clique-partition number) is tight, but we conjecture that the independence number, which appears to be the key quantity in undirected graphs, is not the correct combinatorial measure for the case of directed graphs. (Footnote 3: It is possible to construct examples where the analysis of the ELP algorithm necessarily leads to an $O(\sqrt{kT})$ bound, even when the independence number is $1$.) In any case, we note that even with the weaker bound above, the ELP algorithm still seems superior to the ExpBan algorithm, in the sense that it allows us to deal with time-changing graphs, and that an explicit clique decomposition of the graph is not required. Also, we again have the issue of $\beta$, which is determined by a quantity which is NP-hard to compute, i.e. $\bar{\chi}(G_t)$. However, this can be circumvented using the same trick discussed in the context of undirected graphs.
5 Lower Bound
The following theorem provides a lower bound on the regret in terms of the independence number $\alpha(G)$, for a constant graph $G_t = G$.
Theorem 4.
Suppose $G_t = G$ for all $t$, and that actions which are not linked in $G$ get no side-observations whatsoever between them. Then there exists a (randomized) adversary strategy, such that for every and any learning strategy, the expected regret is at least .
A proof is provided in the appendix. The intuition of the proof is that if the graph $G$ has $\alpha(G)$ independent vertices, then an adversary can make this problem as hard as a standard multi-armed bandits problem, played on $\alpha(G)$ actions. Using a known lower bound of $\Omega(\sqrt{nT})$ for multi-armed bandits on $n$ actions, our result follows. (Footnote 4: We note that if the maximal degree of every node is bounded by , it is possible to get the lower bound for (as opposed to ); see the proof for details.)
For constant undirected graphs, this lower bound matches the regret upper bound for the ELP algorithm (Thm. 2) up to logarithmic factors. For directed graphs, the difference between them boils down to the difference between $\alpha(G)$ and $\bar{\chi}(G)$. For many well-behaved graphs, this gap is rather small. However, for general graphs, the difference can be huge; see the next section for details.
6 Examples
Here, we briefly discuss some concrete examples of graphs $G$, and show how the regret performance of our algorithms depends on their structure. An interesting issue to notice is the potential gap between the performance of our algorithms, through the graph’s independence number $\alpha(G)$ and clique-partition number $\bar{\chi}(G)$.
First, consider the case where there exists a single action, such that choosing it reveals the rewards of all the other actions. In contrast, choosing any of the other actions only reveals its own reward. At first blush, it may seem that having such a “super-action”, which reveals everything that happens in the current round, should help us improve our regret. However, the independence number of such a graph is easily seen to be $k - 1$. Based on our lower bound, we see that this “super-action” is actually not helpful at all (up to negligible factors).
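The super-action example can be checked on a small instance. The sketch below computes the independence number by brute force (exponential time, so suitable only for tiny graphs) and applies it to a graph where node 0 plays the role of the super-action. Edges are treated as undirected for the purpose of finding independent sets, and the function name `independence_number` is hypothetical.

```python
from itertools import combinations


def independence_number(nodes, adj):
    """Size of the largest set of pairwise non-adjacent nodes (brute force).

    adj[v] is the set of nodes adjacent to v (v itself excluded).
    """
    nodes = list(nodes)
    for size in range(len(nodes), 0, -1):
        for subset in combinations(nodes, size):
            if all(v not in adj[u] for u, v in combinations(subset, 2)):
                return size
    return 0


k = 6
# Node 0 is linked to every other node; the other nodes are not linked to each other.
star_adj = {0: set(range(1, k))}
star_adj.update({i: {0} for i in range(1, k)})
print(independence_number(range(k), star_adj))  # the k - 1 ordinary actions
```

The $k - 1$ ordinary actions form an independent set, which is why the lower bound treats this instance as essentially as hard as a bandit problem over those actions.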
Second, consider the case where the actions are endowed with some metric distance function, and edge $(i, j)$ is in $G$ if and only if the distance between $i, j$ is at most some fixed constant $r$. We can think of each action $i$ as being in the center of a sphere of radius $r$, such that the reward of action $i$ is propagated to every other action in that sphere. In this case, $\alpha(G)$ is essentially the number of non-overlapping spheres we can pack in $G$. In contrast, $\bar{\chi}(G)$ is essentially the number of spheres we need to cover $G$. Both numbers shrink rapidly as $r$ increases, improving the regret of our algorithms. However, the sphere covering size can be much larger than the sphere packing size. For example, if the actions are placed as the elements in , we use the metric, and , it is easily seen that the sphere packing number is just . In contrast, the sphere covering number is at least , since we need a separate sphere to cover every element in .
Third, consider the random Erdős-Rényi graph $G = G(k, p)$, which is formed by linking every action $i$ to every action $j$ with probability $p$ independently. It is well known that when $p$ is a constant, the independence number $\alpha(G)$ of this graph is only $O(\log(k))$, whereas the clique-partition number $\bar{\chi}(G)$ is at least $\Omega(k / \log(k))$. This translates to a regret bound of $O(\sqrt{kT})$ for the ExpBan algorithm, and only $O(\sqrt{\log^2(k) T})$ for the ELP algorithm. Such a gap would also hold for a directed random graph.
7 Empirical Performance Gap between ExpBan and ELP
In this section, we show that the gap between the performance of the ExpBan algorithm and the ELP algorithm can be real, and is not just an artifact of our analysis.
To show this, we performed the following simple experiment: we created a random Erdős-Rényi graph over nodes, where each pair of nodes was linked independently with probability $p$. Choosing any action results in observing the rewards of neighboring actions in the graph. The reward of each action at each round was chosen randomly and independently to be with probability and with probability , except for a single node, whose reward equals with a higher probability of . We then implemented the ExpBan and ELP algorithms in this setting, for . For comparison, we also implemented the standard EXP3 multi-armed bandits algorithm [4], which doesn’t use any side-observations. All the parameters were set to their theoretically optimal values. The experiment was repeated for varying $p$ and over independent runs.
The results are displayed in Figure 1. The X-axis is the iteration number, and the Y-axis is the mean payoff obtained so far, averaged over the runs (the variance in the numbers was minuscule, and therefore we do not report confidence intervals). For the smallest value of $p$, the graph is rather empty, and the advantage of using side observations is not large. As a result, all 3 algorithms perform roughly the same for this choice of $p$. As $p$ increases, the value of side-observations increases, and the performance of our two algorithms, which utilize side-observations, improves over the standard multi-armed bandits algorithm. Moreover, for intermediate values of $p$, there is a noticeable gap between the performance of ExpBan and ELP. This is exactly the regime where the gap between the clique-partition number (governing the regret bound of ExpBan) and the independence number (governing the regret bound for the ELP algorithm) tends to be larger as well. (Footnote 5: Intuitively, this can be seen by considering the extreme cases: for a complete graph over $k$ nodes, both numbers equal $1$, and for an empty graph over $k$ nodes, both numbers equal $k$. For constant $p \in (0, 1)$, there is a real gap between the two, as discussed in Sec. 6.) Finally, for large $p$, the graph is almost complete, and the advantage of ELP over ExpBan becomes small again (since most actions give information on most other actions).
8 Discussion
In this paper, we initiated a study of a large family of online learning problems with side observations. In particular, we studied the broad regime which interpolates between the experts setting and the bandits setting of online learning. We provided algorithms, as well as upper and lower bounds on the attainable regret, with a nontrivial dependence on the information feedback structure.
There are many open questions that warrant further study. First, the upper and lower bounds essentially match only in particular settings (i.e., in undirected graphs, where no side-observations whatsoever, other than those dictated by the graph, are allowed). Can this gap be narrowed or closed? Second, our lower bounds depend on a reduction which essentially assumes that the graph is constant over time. We do not have a lower bound for changing graphs. Third, it remains to be seen whether other online learning results can be generalized to our setting, such as learning with respect to policies (as in EXP4 [4]) and obtaining bounds which hold with high probability. Fourth, the model we have studied assumed that the observation structure is known. In many practical cases, the observation structure may be known just partially or approximately. Is it possible to devise algorithms for such cases?
Acknowledgements. This research was supported in part by the Google Inter-university center for Electronic Markets and Auctions.
References
 [1] R. Agrawal. The continuum-armed bandit problem. SIAM J. Control and Optimization, 33:1926–1951, 1995.
 [2] H. Arslan, Z. N. Chen, and M. G. Di Benedetto. Ultra Wideband Wireless Communication. Wiley-Interscience, 2006.
 [3] J.-Y. Audibert and S. Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT, 2009.
 [4] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48–77, 2002.
 [5] V. Baston. Some cyclic inequalities. Proceedings of the Edinburgh Mathematical Society (Series 2), 19:115–118, 1974.
 [6] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
 [7] N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. In COLT, 2009.
 [8] R. Kleinberg, A. Slivkins, and E. Upfal. Multi-armed bandits in metric spaces. In STOC, pages 681–690, 2008.
 [9] J. Langford and T. Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In NIPS, 2007.
 [10] L. Li, W. Chu, J. Langford, and R. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670. ACM, 2010.
 [11] P. Rusmevichientong and J. Tsitsiklis. Linearly parameterized bandits. Math. Oper. Res., 35(2):395–411, 2010.
 [12] D. Zuckerman. Linear degree extractors and the inapproximability of max clique and chromatic number. Theory of Computing, 3(1):103–128, 2007.
Appendix A Proofs
A.1 Proof of Thm. 1
Suppose we split the actions into cliques . First, let us consider the expected regret of the exponentially weighted forecaster run over any such clique. Denoting the actions of the clique by , the forecaster works as follows: first, it initializes weights to be . At each round, it picks an action with probability , receives the reward , and observes the noisy reward value for each of the other actions. It then updates (for some parameter ) for all .
The analysis of this algorithm is rather standard, with the main twist being that we only observe unbiased estimates of the rewards, rather than the actual reward. For completeness, we provide this analysis in the following lemma.
Lemma 1.
The expected regret of the forecaster described above, with respect to the actions in clique and under the optimal choice of the parameter is at most .
Proof.
We define the potential function , and get that
For notational convenience, let . Since , and , we have . Thus, we can use the inequality (which holds for any ), and get the upper bound
Taking logarithms and using the fact that , we get
Summing over all , and canceling the resulting telescopic series, we get
(6) 
Also, for any fixed action , we have
(7) 
Combining Eq. (6) with Eq. (7) and rearranging, we get
Taking expectations on both sides, and using the facts that for all , and with probability , we get
Thus, by picking , we get that the expected regret is at most . ∎
Now, we define each such forecaster (one per clique ) as a meta-action, and run the EXP3 algorithm on the meta-actions. By the standard guarantee for this algorithm (see Corollary 3.2 in [4]), the expected regret incurred by that algorithm with respect to any fixed meta-action is at most . Combining this with Lemma 1, we get that the total expected regret of the ExpBan algorithm with respect to any single action is at most
which is at most since .
A.2 Proof of Thm. 2
To prove the theorem, we will need three lemmas. The first one is straightforward and follows from the definition of . The second is a key combinatorial inequality. We were unable to find an occurrence of this inequality in any previous literature, although we are aware of very special cases proven in the context of cyclic sums (see for instance [5]). The third lemma allows us to derive a more explicit bound by examining a particular choice of .
Lemma 2.
For all fixed , we have
as well as
Proof.
It holds that
As to the second part, we have
∎
Lemma 3.
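Let $G$ be a graph over $k$ nodes, and let $\alpha(G)$ denote the independence number of $G$ (i.e., the size of its largest independent set). For any $i\in\{1,\ldots,k\}$, define $N_i$ to be the nodes adjacent to node $i$ (including node $i$). Let $w_1,\ldots,w_k$ be arbitrary positive weights assigned to the nodes. Then it holds that
$$\sum_{i=1}^{k}\frac{w_i}{\sum_{j\in N_i}w_j}\le\alpha(G).$$
Proof.
We will actually prove the claim for any nonnegative weights (i.e., they are allowed to take $0$ values), under the convention that if $w_i=0$ and $\sum_{j\in N_i}w_j=0$ as well, then $w_i/\sum_{j\in N_i}w_j=0$.
Suppose on the contrary that there exist some values for $w_1,\ldots,w_k$ such that $\sum_{i=1}^{k}\frac{w_i}{\sum_{j\in N_i}w_j}>\alpha(G)$. Now, if $w_1,\ldots,w_k$ are nonzero only on an independent set $I$, then
$$\sum_{i=1}^{k}\frac{w_i}{\sum_{j\in N_i}w_j}=\sum_{i\in I:\,w_i>0}\frac{w_i}{w_i}\le|I|\le\alpha(G).$$
Since $\sum_{i=1}^{k}\frac{w_i}{\sum_{j\in N_i}w_j}>\alpha(G)$, it follows that there exist some adjacent nodes $r,s$ such that $w_r,w_s>0$. However, we will show that in that case, we can only increase the value of $\sum_{i=1}^{k}\frac{w_i}{\sum_{j\in N_i}w_j}$ by shifting the entire weight $w_r+w_s$ to either node $r$ or node $s$, and putting weight $0$ at the other node. Each such shift leaves one fewer node with nonzero weight, so by repeating this process, we are guaranteed to eventually arrive at a configuration where the weights are nonzero only on an independent set. But we’ve shown above that in that case, $\sum_{i=1}^{k}\frac{w_i}{\sum_{j\in N_i}w_j}\le\alpha(G)$, so this means the value of this expression with respect to the original configuration was at most $\alpha(G)$ as well, a contradiction.
To show this, let us fix $w_r+w_s=c$ (so that $w_s=c-w_r$) and consider how the value of the expression changes as we vary $w_r$. The sum in the expression can be split into 6 parts: when $i=r$, when $i=s$, when $i$ is a node adjacent to $r$ but not to $s$, when $i$ is adjacent to $s$ but not to $r$, when $i$ is adjacent to both, and when $i$ is adjacent to neither of them. Writing $W_i=\sum_{j\in N_i\setminus\{r,s\}}w_j$ for the weight in the neighborhood of $i$ outside $\{r,s\}$, and decomposing the sum in this way, so that $w_r$ appears everywhere explicitly, we get
$$\frac{w_r}{c+W_r}+\frac{c-w_r}{c+W_s}+\sum_{i\in N_r\setminus N_s}\frac{w_i}{w_r+W_i}+\sum_{i\in N_s\setminus N_r}\frac{w_i}{c-w_r+W_i}+\sum_{i\in(N_r\cap N_s)\setminus\{r,s\}}\frac{w_i}{c+W_i}+\sum_{i\notin N_r\cup N_s}\frac{w_i}{W_i}.$$
It is readily seen that each of the elements in the sum above is convex in $w_r$. This implies that the maximum of this expression over $w_r\in[0,c]$ is attained at the extremes, namely either $w_r=0$ (hence $w_s=c$) or $w_r=c$ (hence $w_s=0$). This proves that indeed shifting weights between adjacent nodes can only increase the value of $\sum_{i=1}^{k}\frac{w_i}{\sum_{j\in N_i}w_j}$, and as discussed earlier, implies the result stated in the lemma. ∎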
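As a sanity check on Lemma 3, the following sketch evaluates the sum over nodes of each node's weight divided by the total weight of its closed neighborhood, and compares it with the independence number on random positive weights. The 5-cycle test graph, the brute-force `independence_number` helper, and all function names are assumptions chosen for illustration; a numerical check of this kind is of course not a proof.

```python
import itertools
import random


def closed_neighborhoods(k, edges):
    """Return, for each node, the set of its neighbors plus the node itself."""
    nbrs = [{i} for i in range(k)]
    for a, b in edges:
        nbrs[a].add(b)
        nbrs[b].add(a)
    return nbrs


def independence_number(k, edges):
    """Brute-force size of a largest independent set (small graphs only)."""
    edge_set = {frozenset(e) for e in edges}
    for size in range(k, 0, -1):
        for subset in itertools.combinations(range(k), size):
            pairs = itertools.combinations(subset, 2)
            if all(frozenset(p) not in edge_set for p in pairs):
                return size
    return 0


def neighborhood_ratio_sum(weights, nbrs):
    """Sum over nodes of w_i divided by the weight of i's closed neighborhood."""
    return sum(w / sum(weights[j] for j in nbrs[i]) for i, w in enumerate(weights))


# Hypothetical test graph: the 5-cycle, whose independence number is 2.
edges = [(i, (i + 1) % 5) for i in range(5)]
nbrs = closed_neighborhoods(5, edges)
alpha = independence_number(5, edges)

# Largest value of the sum seen over 1000 random positive weight vectors.
rng = random.Random(0)
worst = max(
    neighborhood_ratio_sum([rng.uniform(0.01, 1.0) for _ in range(5)], nbrs)
    for _ in range(1000)
)
print(alpha, worst <= alpha)
```

With uniform weights on the 5-cycle every term equals 1/3, so the sum is 5/3, comfortably below the independence number; the bound becomes tight only as the weight concentrates on an independent set.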
Lemma 4.
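Consider a graph $G$ over nodes $1,\ldots,k$, and let $\alpha(G)$ be its independence number. For any $i\in\{1,\ldots,k\}$, define $N_i$ to be the nodes adjacent to node $i$ (including node $i$). Then there exist values of $s_1,\ldots,s_k$ on the $k$-simplex, such that
$$\frac{1}{\min_{i\in\{1,\ldots,k\}}\sum_{j\in N_i}s_j}\le\alpha(G). \tag{8}$$
Proof.
Let $I$ be a largest independent set of $G$, so that $|I|=\alpha(G)$. Consider the following specific choice for the values of $s_1,\ldots,s_k$: For any $j$ such that $j\in I$, let $s_j=1/\alpha(G)$, and $s_j=0$ otherwise. Suppose there was some node $i$ such that $\sum_{j\in N_i}s_j=0$. By the way we chose values for $s_1,\ldots,s_k$, this implies that node $i$ is not in $I$ and is not adjacent to any node in $I$, so $I\cup\{i\}$ would also be an independent set, contradicting the assumption that $I$ is a largest independent set. Hence $\sum_{j\in N_i}s_j>0$, and since each value of $s_j$ is either $0$ or $1/\alpha(G)$, it follows that $\sum_{j\in N_i}s_j\ge 1/\alpha(G)$. This is true for any node $i$, from which Eq. (8) follows. ∎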
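The construction in this proof, placing uniform mass on a largest independent set, can be sketched as follows. The brute-force search, the 5-cycle example, and the helper names are assumptions for illustration only; the check is that every closed neighborhood receives mass at least one over the independence number.

```python
import itertools


def max_independent_set(k, edges):
    """Brute-force a largest independent set (small graphs only)."""
    edge_set = {frozenset(e) for e in edges}
    for size in range(k, 0, -1):
        for subset in itertools.combinations(range(k), size):
            pairs = itertools.combinations(subset, 2)
            if all(frozenset(p) not in edge_set for p in pairs):
                return set(subset)
    return set()


def uniform_on_independent_set(k, edges):
    """Put mass 1/alpha on each node of a largest independent set, 0 elsewhere."""
    ind = max_independent_set(k, edges)
    s = [1.0 / len(ind) if i in ind else 0.0 for i in range(k)]
    return s, len(ind)


def min_neighborhood_mass(s, k, edges):
    """Smallest total mass found in any closed neighborhood."""
    nbrs = [{i} for i in range(k)]
    for a, b in edges:
        nbrs[a].add(b)
        nbrs[b].add(a)
    return min(sum(s[j] for j in nbrs[i]) for i in range(k))


# Hypothetical example: the 5-cycle.
edges = [(i, (i + 1) % 5) for i in range(5)]
s, alpha = uniform_on_independent_set(5, edges)
mass = min_neighborhood_mass(s, 5, edges)
print(alpha, mass >= 1.0 / alpha)
```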
We now turn to the proof of the theorem itself.
Proof of Thm. 2.
With the key lemmas at hand, most of the remaining proof is rather similar to the standard analysis for multi-armed bandits (e.g., [4]). We define the potential function , and get that
(9) 
We have that , since by definition of and ,
Using the definition of and the inequality for any , we can upper bound Eq. (9) by
Taking logarithms and using the fact that , we get
Summing over all , and canceling the resulting telescopic series, we get
(10) 
Also, for any fixed action , we have
(11) 
Combining Eq. (10) with Eq. (11) and rearranging, we get
Taking expectations on both sides, and using Lemma 2, we get
After some slight manipulations, and using the fact that for all , we get
We note that can be upper bounded by , since by definition of ,
Plugging this in as well as our choice of in the term, and slightly simplifying, we get the upper bound
(12) 
Now, we recall that the terms were chosen so as to minimize the bound above. Thus, we can upper bound it by any fixed choice of . Invoking Lemma 4, as well as Lemma 3, the theorem follows. ∎
A.3 Proof of Thm. 3
The proof is very similar to the one of Thm. 2, so we’ll only point out the differences.
Referring to the proof of Thm. 2 in Subsection A.2, the analysis is identical up to Eq. (12). To upper bound the terms there, we can still invoke Lemma 4. However, Lemma 3, which was used to upper bound , no longer applies (in fact, one can show specific counterexamples). Thus, in lieu of Lemma 3, we will opt for the following weaker bound: Let be a smallest possible clique partition of . Then we have
Plugging this upper bound as well as Lemma 4 into Eq. (12), and using the fact that for any graph , the result follows.
A.4 Proof of Theorem 4
Suppose that we are given a graph with an independence number . Let denote an independent set of nodes (i.e., no two nodes are connected). Suppose we have an algorithm with a low expected regret for every sequence of rewards. We will use this algorithm to form an algorithm for the standard multi-armed bandits problem (with no side-observations). We will then resort to the known lower bound for this problem, to get a lower bound for our setting as well.
Consider first a standard multi-armed bandits game on actions (with no side-observations), with the following randomized strategy for the adversary: the adversary picks one of the actions uniformly at random, and at each round, assigns it a random Bernoulli reward with parameter (where will be specified later). The other actions are assigned a random Bernoulli reward with parameter . Roughly speaking, Theorem 6.11 of [6] shows that with this strategy and for , the expected regret of any learning algorithm is at least .
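The adversary's randomized strategy can be sketched as follows. The Bernoulli parameters are passed in as the hypothetical arguments `base` (for the ordinary actions) and `base + eps` (for the one distinguished action); this form, along with the function and variable names, is an assumption made for the example and not a statement of the values used in the proof.

```python
import random


def sample_bandit_rewards(num_actions, num_rounds, base, eps, rng=None):
    """Draw rewards for a stochastic adversary with one slightly better action.

    One action is picked uniformly at random and receives Bernoulli(base + eps)
    rewards at every round; all other actions receive Bernoulli(base) rewards.
    Both parameters are illustrative assumptions.
    """
    rng = rng or random.Random(0)
    best = rng.randrange(num_actions)  # the distinguished action
    rewards = []
    for _ in range(num_rounds):
        row = []
        for j in range(num_actions):
            p = base + eps if j == best else base
            row.append(1 if rng.random() < p else 0)
        rewards.append(row)
    return best, rewards
```

Since the rewards are drawn i.i.d. across rounds, the order in which a learner samples the actions does not change the distribution of what it sees, which is the property the reduction below relies on.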
Now, suppose that for the setting with side-observations, played over the graph , there exists a learning strategy that achieves expected cumulative regret of at most , for the graph over rounds, with respect to any adversary strategy. We will now show how to use for the standard multi-armed bandits game described above. To that end, arbitrarily assign the actions to the independent nodes in . We will then implement the following strategy : whenever chooses one of the actions in , we choose the corresponding action in the multi-armed bandits problem and feed the reward back to (the reward of all neighboring nodes is 0, which we feed back to as well). Whenever chooses a node not in , we use the next rounds (where is the neighborhood set of ) to do “pure exploration”: we go over all the neighbors of node that belong to in some fixed order, and choose each of them once (since rewards are assumed stochastic, the order does not matter). Nodes in are known to yield a reward of . The rewards of node and all its neighbors are then fed to , as if they were side-observations obtained in a single round by choosing a node not in . Since the rewards are chosen i.i.d., the distribution of these rewards is identical to the case where was really implemented with side-observations. We denote as the expected regret of this strategy , after rounds.
We make the following observation: suppose achieves an expected regret satisfying
(we can assume this since our goal is to provide a lower bound which will only be smaller). Then the number of times chose actions outside must be smaller than . This is because whenever chooses an action not in it receives a reward of 0 while the highest expected reward is bigger than , so the expected per-round regret would increase by at least .
We apply at each round, till is called times. Let be the (possibly random) number of rounds which elapsed. It holds that , since we have the pure exploration rounds where is not called. In these exploration rounds, we pull arms in , so our expected regret in those rounds is at most . Moreover, by the observation above, the number of such rounds is at most