Online learning with feedback graphs and switching costs

Anshuka Rangi, University of California, San Diego
Massimo Franceschetti, University of California, San Diego
Abstract

We study online learning when partial feedback information is provided following every action of the learning process, and the learner incurs switching costs for changing his actions. In this setting, the feedback information system can be represented by a graph, and previous works studied the expected regret of the learner in the case of a clique (Expert setup) or of disconnected single loops (Multi-Armed Bandits (MAB)). This work provides a lower bound on the expected regret in the Partial Information (PI) setting, namely for general feedback graphs (excluding the clique). Additionally, it shows that all algorithms that are optimal without switching costs are necessarily sub-optimal in the presence of switching costs, which motivates the need to design new algorithms. We propose two new algorithms: Threshold Based EXP3 and EXP3.SC. For the two special cases of the symmetric PI setting and MAB, the expected regret of both of these algorithms is order optimal in the duration of the learning process. Additionally, Threshold Based EXP3 is order optimal in the switching cost, whereas EXP3.SC is not. Finally, empirical evaluations show that Threshold Based EXP3 outperforms the previously proposed order-optimal algorithms: EXP3 SET in the PI setting with switching costs, and Batch EXP3 in the MAB setting with switching costs.

 


1 Introduction

Online learning has a wide variety of applications, such as classification, estimation, and ranking, and it has been investigated in different areas, including learning theory, control theory, operations research, and statistics. The problem can be viewed as a one-player game against an adversary. The game runs for $T$ rounds, and at each round the player chooses an action from a given set of $K$ actions. Every action performed at a given round carries a loss, a real number in the interval $[0,1]$. The losses of all action-round pairs are assigned by the adversary before the game starts. The player also incurs a fixed and known Switching Cost (SC), an arbitrary positive real number, every time he changes his action. The expected regret is the expectation of the sum of the losses associated to the actions performed by the player, plus the SCs, minus the losses incurred by the best fixed action in hindsight. The goal of the player is to minimize the expected regret over the duration of the game.

Based on the feedback information received after each action, online learning can be divided into three categories: Multi-Armed Bandit (MAB), Partial Information (PI), and Expert setting. In the MAB setting, at any given round the player only incurs, and hence only observes, the loss corresponding to the selected action. In the PI setting, the player incurs the loss of the selected action and additionally observes the losses that he would have incurred in that round by taking the actions in a subset of the action set. This feedback system can be viewed as a time-varying directed graph on $K$ nodes, where a directed edge from action $i$ to action $j$ at round $t$ indicates that performing action $i$ at round $t$ also reveals the loss that the player would have incurred if action $j$ had been taken at round $t$. In the Expert setting, taking an action reveals the losses that the player would have incurred by taking any of the other actions in that round. In this extremal case, the feedback system corresponds to a time-invariant, undirected clique.
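To make the three feedback models concrete, the following minimal Python sketch (not from the paper; representing a graph as a set of directed edge pairs is an illustrative choice) shows which losses a player gets to observe after playing one action under each feedback system.

```python
# Illustrative sketch: a feedback graph as a set of directed edges (i, j),
# meaning "playing i reveals the loss of j"; self-loops are left implicit.

def observed_actions(edges, played):
    """Actions whose losses are revealed when `played` is selected."""
    out = {played}                      # the player always observes its own loss
    out |= {j for (i, j) in edges if i == played}
    return out

K = 4
mab_graph    = set()                                                    # MAB: self-loops only
expert_graph = {(i, j) for i in range(K) for j in range(K) if i != j}   # Expert: clique
pi_graph     = {(0, 1), (1, 0), (2, 3)}                                 # PI: a generic graph

for name, g in [("MAB", mab_graph), ("PI", pi_graph), ("Expert", expert_graph)]:
    print(name, observed_actions(g, played=0))
```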

Online learning with PI has been used to design a variety of systems Gentile and Orabona (2014); Katariya et al. (2016); Zong et al. (2016); Rangi et al. (2018c). In these applications, the feedback captures the idea of side information provided to the player during the learning process. For example, the performance of an employee can provide information about the performance of other employees with similar skills, and the rating of a web page can provide information on the ratings of web pages with similar content. In most of these applications, switching between actions is not free. For example, a company incurs a cost associated with the learning phase when shifting an employee among different tasks, and switching the content of a web page too frequently can exasperate users and drive them away. Similarly, re-configuring the production line in a factory is a costly process, and changing the stock allocation in an investment portfolio is subject to certain fees. Despite the many applications where both SC and PI are an integral part of the learning process, the study of online learning with SC has been limited to the MAB and Expert settings. In the MAB setting, it has been shown that the expected regret of any player is at least $\tilde{\Omega}(K^{1/3}T^{2/3})$ Dekel et al. (2014), and that Batch EXP3 is an order-optimal algorithm Arora et al. (2012). In the Expert setting, it has been shown that the expected regret is at least $\Omega(\sqrt{T \log K})$ Cesa-Bianchi and Lugosi (2006), and order-optimal algorithms have been proposed in Geulen et al. (2010); Gyorgy and Neu (2014). The PI setup has been investigated only in the absence of SC, and for any fixed feedback system with independence number $\alpha$, it has been shown that the expected regret is at least $\Omega(\sqrt{\alpha T})$ Mannor and Shamir (2011).

Table 1: Comparison of the regret bounds of Threshold based EXP3 and EXP3.SC with the corresponding lower bounds, for a general sequence of feedback graphs and for the symmetric PI, MAB, and equi-informational settings.

1.1 Contributions

We provide a lower bound on the expected regret for any sequence of feedback graphs in the PI setting with SC. We show that for any sequence of feedback graphs with independence sequence number , the expected regret of any player is at least . We then show that for with for all , the expected regret of any player is at least , where is the set of unique feedback graphs in the sequence , and is the number of rounds for which the feedback graph is seen in rounds. These results introduce a new figure of merit in the PI setting, which can also be used to generalize the lower bound given in the PI setting without SC  Mannor and Shamir (2011). A consequence of these results is that the presence of SC changes the asymptotic regret by at least a factor . Additionally, these results also recover the lower bound on the expected regret in the MAB setting Dekel et al. (2014).

We also show that in the PI setting for any algorithm that is order optimal without SC, there exists an assignment of losses from the adversary that forces the algorithm to make at least switches, thus increasing its asymptotic regret by at least a factor . This shows that any algorithm that is order optimal in the PI setting without SC, is necessarily sub-optimal in the presence of SC, and motivates the development of new algorithms in the PI setting and in the presence of SC.

We propose two new algorithms for the PI setting with SC: Threshold-Based EXP3 and EXP3.SC. Threshold-Based EXP3 requires knowledge of the number of rounds in advance, whereas EXP3.SC does not. The performance of these algorithms is summarized for different scenarios in Table 1. The algorithms are order optimal in the number of rounds for two special cases of the feedback information system: the symmetric PI setting, i.e., when the feedback graph is fixed and undirected, and the MAB setting. In these two cases, the independence sequence number equals the independence number and the number of actions, respectively. The state-of-the-art algorithm EXP3 SET for the PI setting without SC is known to be order optimal only in these cases as well Alon et al. (2017). Threshold Based EXP3 is order optimal in the SC as well, while EXP3.SC carries an additional factor of the SC in its expected regret. In the time-varying case, the expected regret depends on the worst instances of a ratio involving the sizes of the maximal acyclic subgraphs of the feedback graphs, arranged in non-increasing order. Finally, Table 1 also provides the performance in the equi-informational setting, namely when the feedback graph is undirected and all the maximal acyclic subgraphs in the sequence have the same size. The proofs of all these results are available online Rangi and Franceschetti (2018b).

Numerical comparison shows that Threshold Based EXP3 outperforms EXP3 SET in the presence of SCs. Threshold Based EXP3 also outperforms Batch EXP3, which is another order optimal algorithm for the MAB setting with SC Arora et al. (2012).

1.2 Related Work

In the absence of SC, the lower bound on the expected regret is known for all three categories of online learning problems. In the MAB setting, the expected regret is at least $\Omega(\sqrt{KT})$ Auer et al. (2002); Cesa-Bianchi and Lugosi (2006); Rangi et al. (2018d). In the PI setting with a fixed feedback graph with independence number $\alpha$, the expected regret is at least $\Omega(\sqrt{\alpha T})$ Mannor and Shamir (2011). In the Expert setting, the expected regret is at least $\Omega(\sqrt{T \log K})$ Cesa-Bianchi and Lugosi (2006). All three cases present an asymptotic regret factor $\sqrt{T}$. In contrast, in the presence of SC the expected regrets for the MAB and Expert settings present different factors, namely $T^{2/3}$ and $\sqrt{T}$ respectively: the expected regret is at least $\tilde{\Omega}(K^{1/3}T^{2/3})$ in the MAB setting and $\Omega(\sqrt{T \log K})$ in the Expert setting Dekel et al. (2014). This work provides the lower bound on the expected regret for the PI setting in the presence of SC. For the case without SC, this work establishes a lower bound for the PI setting in terms of the independence sequence number of the sequence of feedback graphs.

The PI setting was first considered in Alon et al. (2013); Mannor and Shamir (2011), and many of its variations have been studied without SC Alon et al. (2015, 2013); Caron et al. (2012); Rangi et al. (2018b); Langford and Zhang (2008); Kocák et al. (2016); Rangi et al. (2018a); Wu et al. (2015); Rangi and Franceschetti (2018a). In the adversarial setting we described, all of these algorithms are order optimal in the MAB and symmetric PI settings, but they also require the player to have knowledge of the graph before performing an action. The algorithm EXP3 SET does not require such knowledge Alon et al. (2017). We show that all of these algorithms are sub-optimal in the PI setting with SC, and propose new algorithms that are order optimal in the MAB and symmetric PI settings.

In the Expert setting with SC, there are two order-optimal algorithms with expected regret $O(\sqrt{T \log K})$ Geulen et al. (2010); Gyorgy and Neu (2014). In the MAB setting with SC, Batch EXP3 is an order-optimal algorithm with expected regret $\tilde{O}(K^{1/3}T^{2/3})$ Arora et al. (2012). This algorithm has also been used to solve a variant of the MAB setting Feldman et al. (2016). In the MAB setting, our algorithm has the same order of expected regret as Batch EXP3, but it numerically outperforms Batch EXP3.

There is a large literature on a continuous variation of the MAB setting, where the number of actions depends on the number of rounds . In this setting, the case without the SC was investigated in Auer et al. (2007); Bubeck et al. (2011); Kleinberg (2005); Yu and Mannor (2011). Recently, the case including SC has also been studied in Koren et al. (2017a, b). In Koren et al. (2017a), the algorithm Slowly Moving Bandits (SMB) has been proposed and in Koren et al. (2017b), it has been extended to different settings. These algorithms incur an expected regret linear in when applied in our discrete setting.

2 Problem Formulation

Before the game starts, the adversary fixes a loss sequence, assigning a loss $\ell_t(i) \in [0,1]$ to every action $i \in [K] = \{1, \dots, K\}$ and every round $t \in \{1, \dots, T\}$. At round $t$, the player performs an action $i_t \in [K]$ and incurs the loss $\ell_t(i_t)$ assigned by the adversary. If $i_t \neq i_{t-1}$, then the player also incurs a cost $s > 0$ in addition to the loss $\ell_t(i_t)$.

In the PI setting, the feedback system can be viewed as a time-varying directed graph $G_t$ with $K$ nodes, where a directed edge $(i, j)$ indicates that choosing action $i$ at round $t$ also reveals the loss that the player would have incurred if action $j$ were taken at round $t$. Let $N_t(i)$ denote the out-neighborhood of action $i$ in $G_t$. Following the action $i_t$, the player observes the losses he would have incurred in round $t$ by performing the actions in the subset $N_t(i_t)$. Since the player always observes his own loss, $i_t \in N_t(i_t)$. In a MAB setup, the feedback graph has only self-loops, i.e., for all $t$ and $i$, $N_t(i) = \{i\}$. In an Expert setup, $G_t$ is an undirected clique, i.e., for all $t$ and $i$, $N_t(i) = [K]$. The expected regret of a player's strategy is defined as

$$ R_T = \mathbb{E}\left[ \sum_{t=1}^{T} \ell_t(i_t) + s \sum_{t=2}^{T} \mathbb{1}\{ i_t \neq i_{t-1} \} \right] - \min_{i \in [K]} \sum_{t=1}^{T} \ell_t(i). \qquad (1)$$

In words, the expected regret is the expectation of the sum of the losses associated to the actions performed by the player, plus the SCs, minus the losses incurred by the best fixed action in hindsight; the objective of the player is to minimize the expected regret.
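The following sketch illustrates the interaction protocol defined above; it is not the paper's code, and the policy interface (`act`/`update`) and parameter names are assumptions made for illustration. It computes the realized regret of a single run; the expected regret in (1) would be obtained by averaging over the randomness of the player.

```python
import random

# Minimal sketch of one run of the PI game with switching costs. `losses[t][i]`
# and the feedback graphs (sets of directed edges) are fixed by the adversary
# before the game starts.

def play_game(losses, graphs, policy, switch_cost):
    T, K = len(losses), len(losses[0])
    total_loss, switches, prev = 0.0, 0, None
    for t in range(T):
        i = policy.act()                                     # player picks an action
        total_loss += losses[t][i]
        if prev is not None and i != prev:
            switches += 1                                    # a switching cost is incurred
        prev = i
        observed = {i} | {j for (u, j) in graphs[t] if u == i}
        policy.update({j: losses[t][j] for j in observed})   # PI feedback
    best_fixed = min(sum(losses[t][i] for t in range(T)) for i in range(K))
    return total_loss + switch_cost * switches - best_fixed, switches

class UniformPolicy:
    """Placeholder policy: plays uniformly at random and ignores the feedback."""
    def __init__(self, K): self.K = K
    def act(self): return random.randrange(self.K)
    def update(self, observed_losses): pass

T, K = 1000, 5
losses = [[random.random() for _ in range(K)] for _ in range(T)]
graphs = [set() for _ in range(T)]                           # MAB: self-loops only
print(play_game(losses, graphs, UniformPolicy(K), switch_cost=1.0))
```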

3 Lower Bound in PI setting with SC

We start by defining the independence sequence number for a sequence of graphs .

Definition 3.1.

Given the sequence of feedback graphs $G_1, \dots, G_T$, let $\mathcal{I}(G_t)$ be the set of all the possible independent sets of the graph $G_t$. The independence sequence number is the largest cardinality among all intersections of independent sets $I_t \in \mathcal{I}(G_t)$, $t = 1, \dots, T$. Namely, it is

$$ \max_{I_t \in \mathcal{I}(G_t),\ t = 1, \dots, T} \left| \bigcap_{t=1}^{T} I_t \right|. \qquad (2)$$
Definition 3.2.

The independence sequence set is the set attaining the maximum in (2).
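Equivalently, the independence sequence number is the size of the largest set of actions that is independent in every graph of the sequence, which suggests the following brute-force illustration (exponential in the number of actions, so only for small examples; the edge representation is the same illustrative one used earlier).

```python
from itertools import combinations

# Brute-force sketch: the independence sequence set is a largest set of actions
# with no edge (in either direction) between any two of its members in any
# feedback graph of the sequence; its cardinality is the independence sequence
# number.

def independence_sequence_set(graph_seq, K):
    all_edges = {frozenset(e) for g in graph_seq for e in g if e[0] != e[1]}
    for size in range(K, 0, -1):                      # try the largest sets first
        for subset in combinations(range(K), size):
            if all(frozenset(pair) not in all_edges for pair in combinations(subset, 2)):
                return set(subset)                    # independent in every graph
    return set()

# Example: three feedback graphs on K = 4 actions.
seq = [{(0, 1), (1, 0)}, {(2, 3)}, {(0, 2)}]
S = independence_sequence_set(seq, K=4)
print(S, len(S))   # the set and the independence sequence number
```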

We use the notion of the independence sequence number to provide a lower bound on the expected regret in the PI setting with SC.

Theorem 1.

For any with , there exists a constant and an adversary’s strategy (Algorithm 1) such that for all , and for any player’s strategy , the expected regret of is at least .

The proof of Theorem 1 relies on Yao’s minimax principle Yao (1977). A randomized adversary strategy is constructed such that the expected regret of a player, whose action at any round is a deterministic function of his past observations, is at least . This adversary strategy is described in Algorithm 1, and is a generalization of the one proposed to establish similar bounds in the MAB setup Dekel et al. (2014). The generalization is different than the one proposed for the PI setting without SC Mannor and Shamir (2011).

Input: the sequence of feedback graphs, with independence sequence number larger than one.
Set the parameters of the construction.
Choose the optimal arm uniformly at random from the independence sequence set.
Draw the random variables driving the loss construction.
For every round and every action, assign the loss as described in the text.
Output: the loss sequence.
Algorithm 1 Adversary's strategy

Since is known to the adversary, it computes the independence sequence set , and the cardinality of this set is . For all and , there exists no edge in the graph between the actions and . Thus, the selection of any action in provides no information about the losses of the other actions in . The adversary selects the optimal action uniformly at random from , and assigns an expected loss of . The remaining actions in are assigned an expected loss of . On the other hand, since provides information about the losses of actions in , action is assigned an expected loss of to compensate for this additional information. In practice, even a small bias compensates for the extra information provided by an action in .
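The sketch below illustrates only the bias structure described above and is not the construction of Algorithm 1; in particular, following Dekel et al. (2014), the actual construction couples the losses across rounds (a random-walk component is used here as a stand-in), and the bias, step size, and clipping values are arbitrary illustrative choices.

```python
import random

# Hedged sketch of the adversary's loss structure: a common random-walk
# component shared by all arms, plus a per-arm bias. The optimal arm, drawn
# uniformly from the independence sequence set, is biased slightly downward;
# arms outside the set are biased upward to offset the extra feedback they
# provide; the remaining arms in the set are unbiased.

def adversary_losses(K, T, ind_seq_set, epsilon=0.05, sigma=0.01):
    best = random.choice(sorted(ind_seq_set))         # optimal arm, chosen uniformly
    bias = [0.0] * K
    for i in range(K):
        if i == best:
            bias[i] = -epsilon                        # slightly smaller expected loss
        elif i not in ind_seq_set:
            bias[i] = +epsilon                        # compensate for extra information
    losses, walk = [], 0.0
    for _ in range(T):
        walk += random.gauss(0.0, sigma)              # shared random-walk component
        losses.append([min(1.0, max(0.0, 0.5 + walk + bias[i])) for i in range(K)])
    return losses, best

losses, best = adversary_losses(K=4, T=10, ind_seq_set={0, 3})
print("optimal arm:", best)
```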

In the PI setup without SC, for a fixed feedback graph with independence number $\alpha$, the expected regret is at least $\Omega(\sqrt{\alpha T})$ Alon et al. (2017). This lower bound is provided only for a fixed feedback system, and the lower bound for a general time-varying feedback system is left as an open question Alon et al. (2017). This also motivates the investigation of different graph-theoretic measures to study the PI setting Alon et al. (2017). Theorem 1 provides a lower bound for a general time-varying feedback system in the PI setting in the presence of SC. The lower bound depends on the independence sequence number of the sequence of feedback graphs. Thus, the ideas introduced in Theorem 1 can be extended to close this gap in the literature on the PI setting without SC.

Lemma 2.

In the PI setting without SC, for any with , there exists a constant and an adversary’s strategy such that for any player’s strategy , the expected regret of is at least .

Using Theorem 1 and Lemma 2, it can be concluded that the presence of SC changes the asymptotic regret by at least a factor that is polynomial in the number of rounds. In the MAB setup, the independence sequence number equals the number of actions, and Theorem 1 recovers the bound provided in Dekel et al. (2014).

We now focus on the assumption in Theorem 1, namely that the independence sequence number is larger than one. This is satisfied in many networks of practical interest, for example in networks modeled as Erdős–Rényi random graphs, where $p$ is the probability of having an edge between two nodes. The expected independence number of these graphs is characterized in Coja-Oghlan and Efthymiou (2015). Since each node is equally likely to belong to an independent set, the expected value of the independence sequence number can be expressed in terms of the expected independence number and the expected node degree; the expected node degree is usually a constant, since $p$ is inversely proportional to the number of nodes. The resulting expectation is greater than one for a large number of nodes and small values of $p$.
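As a small illustrative check of this assumption (not an experiment from the paper), one can sample a short sequence of Erdős–Rényi graphs and compute the independence sequence number by brute force, along the lines of the earlier sketch; the number of nodes, the edge probability, and the sequence length below are arbitrary choices.

```python
import random
from itertools import combinations

# Sample a sequence of Erdos-Renyi feedback graphs and check that the
# independence sequence number exceeds one (brute force, small instances only).

def erdos_renyi_edges(n, p):
    return {(i, j) for i, j in combinations(range(n), 2) if random.random() < p}

def independence_sequence_number(graph_seq, K):
    all_edges = {frozenset(e) for g in graph_seq for e in g}
    for size in range(K, 0, -1):
        for subset in combinations(range(K), size):
            if all(frozenset(pair) not in all_edges for pair in combinations(subset, 2)):
                return size
    return 0

n, p, T = 12, 0.2, 5
seq = [erdos_renyi_edges(n, p) for _ in range(T)]
print(independence_sequence_number(seq, n))   # typically larger than one here
```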

Algorithm 1 depends on the independence sequence set, whose cardinality is non-increasing in the length of the sequence. In such cases, the adversary can split the sequence of feedback graphs into multiple sub-sequences, i.e., pairwise disjoint sub-sequences whose union is the entire sequence. For each sub-sequence, the adversary computes the corresponding independence sequence set and assigns the losses independently of the other sub-sequences according to Algorithm 1. This adversary's strategy, which we call Algorithm 1.1, gives the following bound on the expected regret.

Theorem 3.

For any split of into disjoint sub-sequences with and , there exists a constant and an adversary’s strategy (Algorithm 1.1) such that for any player’s strategy , the expected regret of is at least , where is the length of sub-sequence .

With the insight provided by Theorem 3, the regret can be made large with an appropriate split of the sequence of feedback graphs into sub-sequences. This can be formulated as a sub-modular optimization problem whose objective is:

(3)
(4)

This can be solved using greedy algorithms developed in the context of sub-modular maximization Nemhauser and Wolsey (1978); a sketch of such a greedy split is given below.
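As a hedged illustration of such a greedy approach (the exact objective in (3)-(4) is not reproduced here; the per-sub-sequence score below is a placeholder for the regret lower bound of a sub-sequence, and the number of splits is an arbitrary choice), cut points can be added one at a time, each time picking the cut that most increases the objective.

```python
# Greedy sketch: split the graph sequence into contiguous sub-sequences by
# repeatedly adding the single cut point that most increases the total score.
# `score` stands in for the per-sub-sequence regret lower bound used in (3).

def greedy_split(graph_seq, score, max_splits):
    cuts = [0, len(graph_seq)]                    # boundaries of the sub-sequences
    def total(c):
        return sum(score(graph_seq[a:b]) for a, b in zip(c, c[1:]))
    for _ in range(max_splits):
        best_cut, best_val = None, total(cuts)
        for t in range(1, len(graph_seq)):        # try every unused cut point
            if t in cuts:
                continue
            val = total(sorted(cuts + [t]))
            if val > best_val:
                best_cut, best_val = t, val
        if best_cut is None:                      # no cut improves the objective
            break
        cuts = sorted(cuts + [best_cut])
    return [graph_seq[a:b] for a, b in zip(cuts, cuts[1:])]

# Toy usage: score a sub-sequence by its length to the power 2/3 (illustrative only).
parts = greedy_split(list(range(10)), score=lambda s: len(s) ** (2 / 3), max_splits=3)
print([len(part) for part in parts])
```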

Until now, we have focused on designing an adversary's strategy that maximizes the regret for a given sequence of feedback graphs. We now briefly discuss the case in which the feedback graphs can also be chosen by the adversary. If the adversary is not constrained in the choice of the feedback graphs, then the feedback graph that maximizes the expected regret is the one with only self-loops, as it reveals the least amount of information. If the adversary is constrained in the choice of the independence number, i.e., the independence number of every feedback graph in the sequence is fixed, then the optimal value of (3) is achieved by a sequence of identical feedback graphs, which implies that the independence sequence number coincides with the independence number.

We now discuss the trade-off between the loss incurred and the number of switches performed by the player.

Lemma 4.

If the expected regret computed ignoring the SC of any algorithm is , then there exists a loss sequence such that makes at least switches.

Along the same lines as Lemma 4, it can also be shown that if the expected number of switches of an algorithm is bounded, then its expected regret without SC is lower bounded accordingly. This provides a lower bound on the expected regret when the number of switches is constrained by a fixed budget. Using Lemma 4, if an algorithm is order optimal without SC, then there exists a loss sequence that forces it to make a large number of switches, which implies that its regret including the SC is of a higher order. Thus, any algorithm that is order optimal without SC is necessarily sub-optimal in the presence of SC, which motivates the design of new algorithms in our setting.

4 Algorithms in PI setting with SC

In this section, we introduce the two algorithms Threshold Based EXP3 and EXP3.SC for an uninformed setting, in which the feedback graph is revealed only after the action has been performed. This is common in a variety of applications. For instance, a user's selection of a product allows one to infer that the user might be interested in similar products; however, the absence of any action on the recommended products may mean that the user is not interested in them, does not need them, or simply did not look at them. Thus, the feedback is revealed only after the action has been performed.

In Threshold Based EXP3 (Algorithm 2), each action is assigned a weight at every round. When the loss of an action is observed at a round, its weight is updated by penalizing it exponentially by the empirical loss. At every round, the normalized weights define the sampling distribution. An action is selected according to this distribution only if the threshold event is true, where

(5)

and . The event contains two threshold conditions, one on the variable and the other on the empirical losses.

Initialization: ; For all , , and ; ;
for   do
     if  or or (see (5)) then
         if  then
              
              
         end if
         Update .
         Choose with probability .
         Set and for all , set
     else
         For all , ,
         and ; ;
     end if
     For all , observe the pair .
     For all , , where
end for
Algorithm 2 Threshold based EXP3
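The sketch below is a simplified rendering of the main ideas just described, not Algorithm 2 itself: exponential weights are fed with importance-weighted observed losses, and the current action is kept until a threshold event fires. The thresholds used here are crude placeholders (the actual event in (5) also involves the empirical losses of the other actions and the change in the sampling distribution, as discussed below), and dividing by the sampling probability is a simplification of the importance weighting used by EXP3 SET-style updates.

```python
import math, random

# Simplified sketch of Threshold Based EXP3: exponential weights over
# importance-weighted observed losses, with a threshold event gating when the
# current action may be re-sampled (and hence when a switching cost may occur).

class ThresholdEXP3Sketch:
    def __init__(self, K, eta=0.05, max_hold=50, loss_margin=5.0):
        self.K, self.eta = K, eta
        self.max_hold, self.loss_margin = max_hold, loss_margin
        self.weights = [1.0] * K
        self.current, self.hold, self.held_loss = None, 0, 0.0

    def _probs(self):
        z = sum(self.weights)
        return [w / z for w in self.weights]

    def act(self):
        threshold_event = (
            self.current is None                   # first selection: no SC incurred
            or self.hold >= self.max_hold          # too many rounds since the last draw
            or self.held_loss >= self.loss_margin  # current action has lost too much
        )
        if threshold_event:
            self.current = random.choices(range(self.K), weights=self._probs())[0]
            self.hold, self.held_loss = 0, 0.0
        self.hold += 1
        return self.current

    def update(self, observed_losses):
        p = self._probs()
        for i, loss in observed_losses.items():
            est = loss / max(p[i], 1e-12)          # importance-weighted loss estimate
            self.weights[i] *= math.exp(-self.eta * est)
            if i == self.current:
                self.held_loss += loss             # loss accumulated since the last switch
```

The class follows the same `act`/`update` interface as the game-loop sketch of Section 2, so the two can be combined directly.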

The threshold event is critical in balancing the trade-off between the number of switches and the loss incurred by the player. corresponds to the first selection of action, and incurs no SC. In , the variable tracks the number of rounds (or time instances) since the event occurred last time. If the choice of a new action has not been considered for past rounds, then forces the player to choose an action according to the updated sampling distribution at round . The threshold condition in ensures that the regret incurred due to the selection of a sub-optimal action does not grow continuously while trying to save on the SC between the actions. The event is independent of the observed losses, and will occur at most times. Unlike event , the event is dependent on the losses and , for all . Each loss tracks the total empirical loss of action observed until round , i.e.

where is the latest round at which is true. On the other hand, each loss represents the total empirical loss of action observed between rounds and , i.e.

This loss tracks the total empirical loss observed after the selection of an action at time instance . The event balances exploration and exploitation while taking into account the SC. In , the first condition ensures that the player has sufficient amount of information about the losses of all other actions before exploitation is considered. Given sufficient exploration has been performed, the second condition triggers the exploitation. The selection of a new action is considered when the empirical loss incurred by the current action , following its selection at , becomes significant in comparison to the total empirical loss incurred by the other actions . Since the total empirical loss of an action increases with , it is desirable that the threshold increases with as well. Since the increment in is bounded above by at round , for all , implies that

(6)

Thus, ensures that the player reconsiders the action selection if the loss incurred due to the current selection becomes significant in comparison to the total empirical loss of other actions. The event also ensures that the loss incurred due to the current selection is sufficiently smaller than the total empirical loss of other actions (see (6)). The event ensures that the sampling distribution has changed significantly from the previous sampling distribution before selecting the action again. Thus, balances exploration and exploitation based on the observed losses.

Batch EXP3, the order-optimal algorithm for MAB with SC, is EXP3 performed in batches of a fixed size. A similar strategy for designing an algorithm in the PI setting with SC would fail because, unlike in the MAB setting, the feedback graph can change at every round, which requires an update of the empirical losses based on the current feedback graph at every round. In our algorithm, the computation of the empirical loss depends on the feedback graph at every round. Additionally, Batch EXP3 does not utilize the information about the observed losses, which is captured by our threshold event.
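For contrast, here is a minimal sketch of the batching idea behind Batch EXP3 (one arm per batch, one exponential-weight update per batch); the batch size and learning rate are illustrative values, not the tuned choices of Arora et al. (2012).

```python
import math, random

# Batching sketch: the same arm is played for an entire batch and the weights
# are updated once per batch, so at most one switch can occur per batch.

def batch_exp3_sketch(losses, batch_size=10, eta=0.05):
    T, K = len(losses), len(losses[0])
    weights = [1.0] * K
    total_loss, switches, prev = 0.0, 0, None
    for start in range(0, T, batch_size):
        z = sum(weights)
        p = [w / z for w in weights]
        arm = random.choices(range(K), weights=p)[0]
        if prev is not None and arm != prev:
            switches += 1
        prev = arm
        batch = losses[start:start + batch_size]
        total_loss += sum(row[arm] for row in batch)
        avg_loss = sum(row[arm] for row in batch) / len(batch)
        weights[arm] *= math.exp(-eta * avg_loss / p[arm])   # importance weighting
    return total_loss, switches

random.seed(0)
losses = [[random.random() for _ in range(5)] for _ in range(1000)]
print(batch_exp3_sketch(losses))
```

The following theorem presents the performance guarantees of our algorithm.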

Theorem 5.

The following statements hold for Threshold Based EXP3:
The expected regret without accounting for SC is

(7)

where .
The expected number of switches is

(8)

Letting , the expected regret (1) is at most

(9)

In a symmetric PI setting, i.e., when the feedback graph is undirected and fixed for all rounds, the expected regret (1) is at most

(10)
Initialization: For all , ; , ,
for   do
     For all , update:
       
     Choose with probability ,
     else, with probability .
     For all , observe the pair .
     For all , update ,
     where and
     .
end for
Algorithm 3 EXP3.SC

In the PI setting, the size of the maximal acyclic subgraph captures the information provided by the feedback graph: as it increases, the information provided by the graph about the losses of the actions decreases. The regret of the algorithm depends on the worst instances of this quantity (see Theorem 5). This is because the algorithm selects a new action a bounded number of times in expectation (see Theorem 5), and the feedback graph is not available in advance to influence the selection of the action. Also, the corresponding ratio is bounded above by the number of actions and has no effect on the order of the bound. The bounds of the algorithm on the expected regret are tight in two special cases. In the symmetric PI setting, the expected regret of Threshold Based EXP3 matches the corresponding lower bound in order (see Theorem 5); hence, the algorithm is order optimal. In the MAB setting, the expected regret of Threshold Based EXP3 likewise matches the lower bound in order; hence, the algorithm is order optimal. The state-of-the-art algorithm for the case without SCs is known to be order optimal only for these cases as well, and the key challenges for closing this gap are highlighted in the literature Alon et al. (2017).

EXP3.SC (Algorithm 3) is our second algorithm for the PI setting with SC. The key differences between Threshold based EXP3 and EXP3.SC are as follows. Unlike Threshold based EXP3, EXP3.SC does not require knowledge of the number of rounds. Threshold based EXP3 favors the selection of a new action at regular intervals based on the threshold event. On the contrary, EXP3.SC chooses a new action with a probability that is decreasing in the round index. Thus, the algorithm favors exploration in the initial rounds, and favors exploitation as the game proceeds. In Threshold based EXP3, the scaling exponent is a constant that depends on the number of rounds. On the contrary, in EXP3.SC, the scaling exponent is time-varying and decreasing in the round index.
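A hedged sketch of this behavior is given below; it is not Algorithm 3, and the particular decaying schedules (a re-sampling probability and a learning rate both proportional to $t^{-1/3}$) are illustrative assumptions rather than the paper's tuning. Note that, consistently with the description above, no knowledge of the number of rounds is needed.

```python
import math, random

# Sketch of the EXP3.SC behavior: keep the previous action except with a
# re-sampling probability that decays with the round index, and use a
# time-varying learning rate that also decays, favoring exploration early on
# and exploitation (and fewer switches) later.

class EXP3SCSketch:
    def __init__(self, K):
        self.K, self.t = K, 0
        self.weights = [1.0] * K
        self.current = None

    def _probs(self):
        z = sum(self.weights)
        return [w / z for w in self.weights]

    def act(self):
        self.t += 1
        resample_prob = min(1.0, self.t ** (-1 / 3))       # decays with t: fewer switches
        if self.current is None or random.random() < resample_prob:
            self.current = random.choices(range(self.K), weights=self._probs())[0]
        return self.current

    def update(self, observed_losses):
        eta_t = self.t ** (-1 / 3)                         # time-varying learning rate
        p = self._probs()
        for i, loss in observed_losses.items():
            est = loss / max(p[i], 1e-12)                  # importance-weighted estimate
            self.weights[i] *= math.exp(-eta_t * est)
```

The following theorem provides the performance guarantees of EXP3.SC.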

Theorem 6.

The expected regret (1) of EXP3.SC is at most

where .

In the symmetric PI and MAB settings, the expected regret of EXP3.SC is order optimal in the corresponding graph parameter and in the number of rounds, but carries an additional SC factor in its performance guarantees. In EXP3.SC, the requirement of knowing the number of rounds in advance is removed at the expense of this additional factor in its performance.

In an alternative setting where the number of switches is constrained by a given budget, it can be shown using Lemma 4 that the expected regret without SC admits a corresponding lower bound. Algorithms for this setting are simple variations of our two algorithms, Threshold based EXP3 and EXP3.SC: Threshold based EXP3 can be adapted by modifying its thresholds, and EXP3.SC by modifying its parameters accordingly. These adapted algorithms are order optimal in the MAB and symmetric PI settings as well.

5 Performance Evaluation

In this section, we numerically compare the performance of Threshold based EXP3 with EXP3 SET and Batch EXP3, in the PI and MAB setups with SC respectively. We do not compare the performance of our algorithm with the ones proposed for the Expert setting with SC because in the MAB and PI setups the player needs to balance the exploration-exploitation trade-off, while in the Expert setting the player is only concerned with exploitation. Hence, there is a fundamental discontinuity in the design of algorithms as we move from the Expert to the PI setting. This gap is also evident from the discontinuity in the lower bounds in these settings: for the Expert setting the expected regret is at least $\Omega(\sqrt{T \log K})$, while for the PI setting the expected regret is at least the bound of Theorem 1, which holds whenever the independence sequence number is larger than one and thus excludes the clique feedback graph.

We evaluate these algorithms by simulations because in real data sets the adversary's strategy is not necessarily unfavorable for the player; hence, the trends in the performance can vary widely across different data sets. For this reason, in the literature, algorithms in stochastic setups rather than adversarial setups are typically evaluated on real data sets Katariya et al. (2016); Zong et al. (2016). In our simulations, the adversary uses Algorithm 1.
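The kind of simulation loop used for such comparisons can be sketched as follows; the helper is written against the `play_game`-style interface of the earlier sketches, and all values in the commented usage are placeholders, not the settings of the reported experiments.

```python
# Hedged sketch of the evaluation loop: run a policy repeatedly against freshly
# generated adversarial losses and average the realized regret and the number of
# switches. `game` is expected to return a (regret, switches) pair, like the
# play_game sketch given earlier.

def evaluate(game, policy_factory, losses_factory, runs=20):
    regrets, switch_counts = [], []
    for _ in range(runs):
        regret, switches = game(losses_factory(), policy_factory())
        regrets.append(regret)
        switch_counts.append(switches)
    return sum(regrets) / runs, sum(switch_counts) / runs

# Example wiring with the earlier sketches (commented out since it depends on them):
# K, T = 25, 10_000
# graphs = [set() for _ in range(T)]                                  # MAB feedback
# losses_factory = lambda: adversary_losses(K, T, set(range(K)))[0]
# game = lambda losses, policy: play_game(losses, graphs, policy, switch_cost=1.0)
# print(evaluate(game, lambda: ThresholdEXP3Sketch(K), losses_factory))
```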

Figure 1 shows that Threshold based EXP3 outperforms EXP3 SET in the presence of SC. Additionally, the expected regret and the number of switches of EXP3 SET grow linearly with the number of rounds. These observations are in line with our theoretical results presented in Lemma 4. The results presented here are for the parameter values indicated in Figure 1; similar trends were observed for other parameter values.

Figure 2 shows that Threshold based EXP3 outperforms Batch EXP3 in the MAB setup with SC. The gap in the performance of these algorithms increases with the number of rounds (Figure 2(a)). Additionally, the number of switches performed by Threshold based EXP3 is larger than the number of switches performed by Batch EXP3 (Figure 2(b) and (d)). The former algorithm utilizes the information about the observed losses, via the threshold event, to balance the trade-off between the regret and the number of switches. On the contrary, Batch EXP3 does not utilize any information from the observed losses, and switches actions only at the end of a batch. Note that the MAB setup reveals the least information about the losses, and the performance gap due to the utilization of this information is significant (Figure 2). This gap in performance grows as decreases.

In summary, Threshold Based EXP3 outperforms both EXP3 SET and Batch EXP3 in PI and MAB settings with SC respectively. Threshold Based EXP3 fills a gap in the literature by providing a solution for the PI setting with SC, and improves upon the existing literature in the MAB setup.

Figure 1: Performance evaluation of EXP3 SET and Threshold based EXP3 for K = 25 (panels (a) and (b)).
Figure 2: Performance evaluation of Batch EXP3 and Threshold based EXP3 in the MAB setting (panels (a)-(d)).

6 Conclusion

This work focuses on online learning in the PI setting with SC, in the presence of an adversary. A lower bound on the expected regret in the PI setup is presented in terms of the independence sequence number. New algorithms need to be designed in this setting because any algorithm that is order optimal without SC is necessarily sub-optimal in the presence of SC. Two algorithms, Threshold Based EXP3 and EXP3.SC, are proposed and their performance is evaluated in terms of expected regret. These algorithms are order optimal in the number of rounds in two cases: the symmetric PI and MAB setups. Numerical comparisons show that Threshold Based EXP3 outperforms EXP3 SET and Batch EXP3 in the PI and MAB settings with SC, respectively.

As future work, algorithms can be designed for a partially informed setting and a fully informed setting. In the partially informed setting, the feedback graph of a round is revealed following the action of the previous round; thus, the feedback graphs are revealed one at a time, in advance, at the beginning of each round. In the fully informed setting, the entire sequence of feedback graphs is revealed before the game starts. Since the adversary is aware of the sequence of feedback graphs, these settings are important to study from the player's end as well. Note that without SC, an algorithm in both the partially informed and the fully informed settings can exploit the feedback graph at every round in a greedy manner, and perform an action accordingly; hence, an algorithm for the partially informed setting is also optimal in the fully informed setting in the absence of SC. On the contrary, in the presence of SC, a greedy exploitation of the feedback structure is not possible at every round. Hence, in the fully informed setting with SC, the player chooses an action based on the entire sequence of feedback graphs, so that the selected action balances the trade-off between the regret and the SC. Thus, the partially informed and fully informed settings of PI are of particular interest in the presence of SC, and constitute an interesting area for further study.

References

  • Alon et al. (2015) Alon, N., Cesa-Bianchi, N., Dekel, O., and Koren, T. (2015). Online learning with feedback graphs: Beyond bandits. In JMLR WORKSHOP AND CONFERENCE PROCEEDINGS, volume 40. Microtome Publishing.
  • Alon et al. (2017) Alon, N., Cesa-Bianchi, N., Gentile, C., Mannor, S., Mansour, Y., and Shamir, O. (2017). Nonstochastic multi-armed bandits with graph-structured feedback. SIAM Journal on Computing, 46(6):1785–1826.
  • Alon et al. (2013) Alon, N., Cesa-Bianchi, N., Gentile, C., and Mansour, Y. (2013). From bandits to experts: A tale of domination and independence. In Advances in Neural Information Processing Systems, pages 1610–1618.
  • Arora et al. (2012) Arora, R., Dekel, O., and Tewari, A. (2012). Online bandit learning against an adaptive adversary: from regret to policy regret. arXiv preprint arXiv:1206.6400.
  • Auer et al. (2002) Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002). The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77.
  • Auer et al. (2007) Auer, P., Ortner, R., and Szepesvári, C. (2007). Improved rates for the stochastic continuum-armed bandit problem. In International Conference on Computational Learning Theory, pages 454–468. Springer.
  • Bubeck et al. (2011) Bubeck, S., Munos, R., Stoltz, G., and Szepesvári, C. (2011). X-armed bandits. Journal of Machine Learning Research, 12(May):1655–1695.
  • Caron et al. (2012) Caron, S., Kveton, B., Lelarge, M., and Bhagat, S. (2012). Leveraging side observations in stochastic bandits. arXiv preprint arXiv:1210.4839.
  • Cesa-Bianchi and Lugosi (2006) Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, learning, and games. Cambridge university press.
  • Coja-Oghlan and Efthymiou (2015) Coja-Oghlan, A. and Efthymiou, C. (2015). On independent sets in random graphs. Random Structures & Algorithms, 47(3):436–486.
  • Dekel et al. (2014) Dekel, O., Ding, J., Koren, T., and Peres, Y. (2014). Bandits with switching costs: $T^{2/3}$ regret. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 459–467. ACM.
  • Feldman et al. (2016) Feldman, M., Koren, T., Livni, R., Mansour, Y., and Zohar, A. (2016). Online pricing with strategic and patient buyers. In Advances in Neural Information Processing Systems, pages 3864–3872.
  • Gentile and Orabona (2014) Gentile, C. and Orabona, F. (2014). On multilabel classification and ranking with bandit feedback. Journal of Machine Learning Research, 15(1):2451–2487.
  • Geulen et al. (2010) Geulen, S., Vöcking, B., and Winkler, M. (2010). Regret minimization for online buffering problems using the weighted majority algorithm. In COLT, pages 132–143.
  • Gyorgy and Neu (2014) Gyorgy, A. and Neu, G. (2014). Near-optimal rates for limited-delay universal lossy source coding. IEEE Transactions on Information Theory, 60(5):2823–2834.
  • Katariya et al. (2016) Katariya, S., Kveton, B., Szepesvari, C., and Wen, Z. (2016). Dcm bandits: Learning to rank with multiple clicks. In International Conference on Machine Learning, pages 1215–1224.
  • Kleinberg (2005) Kleinberg, R. D. (2005). Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems, pages 697–704.
  • Kocák et al. (2016) Kocák, T., Neu, G., and Valko, M. (2016). Online learning with Erdős–Rényi side-observation graphs. In Uncertainty in Artificial Intelligence.
  • Koren et al. (2017a) Koren, T., Livni, R., and Mansour, Y. (2017a). Bandits with movement costs and adaptive pricing. arXiv preprint arXiv:1702.07444.
  • Koren et al. (2017b) Koren, T., Livni, R., and Mansour, Y. (2017b). Multi-armed bandits with metric movement costs. In Advances in Neural Information Processing Systems, pages 4122–4131.
  • Langford and Zhang (2008) Langford, J. and Zhang, T. (2008). The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in neural information processing systems, pages 817–824.
  • Mannor and Shamir (2011) Mannor, S. and Shamir, O. (2011). From bandits to experts: On the value of side-observations. In Advances in Neural Information Processing Systems, pages 684–692.
  • Nemhauser and Wolsey (1978) Nemhauser, G. L. and Wolsey, L. A. (1978). Best algorithms for approximating the maximum of a submodular set function. Mathematics of operations research, 3(3):177–188.
  • Rangi and Franceschetti (2018a) Rangi, A. and Franceschetti, M. (2018a). Multi-armed bandit algorithms for crowdsourcing systems with online estimation of workers’ ability. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 1345–1352. International Foundation for Autonomous Agents and Multiagent Systems.
  • Rangi and Franceschetti (2018b) Rangi, A. and Franceschetti, M. (2018b). Online learning with feedback graphs and switching costs. arXiv preprint arXiv:1810.09666.
  • Rangi et al. (2018a) Rangi, A., Franceschetti, M., and Marano, S. (2018a). Consensus-based chernoff test in sensor networks. In 2018 IEEE Conference on Decision and Control (CDC), pages 6773–6778. IEEE.
  • Rangi et al. (2018b) Rangi, A., Franceschetti, M., and Marano, S. (2018b). Decentralized chernoff test in sensor networks. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 501–505. IEEE.
  • Rangi et al. (2018c) Rangi, A., Franceschetti, M., and Marano, S. (2018c). Distributed chernoff test: Optimal decision systems over networks. arXiv preprint arXiv:1809.04587.
  • Rangi et al. (2018d) Rangi, A., Franceschetti, M., and Tran-Thanh, L. (2018d). Unifying the stochastic and the adversarial bandits with knapsack. arXiv preprint arXiv:1811.12253.
  • Wu et al. (2015) Wu, Y., György, A., and Szepesvári, C. (2015). Online learning with gaussian payoffs and side observations. In Advances in Neural Information Processing Systems, pages 1360–1368.
  • Yao (1977) Yao, A. C.-C. (1977). Probabilistic computations: Toward a unified measure of complexity. In Foundations of Computer Science, 1977., 18th Annual Symposium on, pages 222–227. IEEE.
  • Yu and Mannor (2011) Yu, J. Y. and Mannor, S. (2011). Unimodal bandits. In ICML, pages 41–48. Citeseer.
  • Zong et al. (2016) Zong, S., Ni, H., Sung, K., Ke, N. R., Wen, Z., and Kveton, B. (2016). Cascading bandits for large-scale recommendation problems. arXiv preprint arXiv:1603.05359.

Appendix A Proof of Theorem 1

Proof.

Without loss of generality, let the independence sequence set be formed of the actions (or “arms”) from to . Given the sequence of feedback graphs , let be the number of times the action is selected by the player in rounds. Let be the total number of times the actions are selected from the set . Let denote the expectation conditioned on , and the probability conditioned on . Additionally, we define as the probability conditioned on the event . Therefore, under , all the actions in the independence sequence set, i.e. , incur an expected regret of , whereas the expected regret of the actions is . Let be the corresponding conditional expectation. For all and , and denote the unclipped and clipped loss of the action , respectively. Assuming the unclipped losses are observed by the player, is the sigma field generated by the unclipped losses, and is the set of actions whose losses are observed at time , following the selection of , according to the feedback graph . The observed sequence of unclipped losses will be referred to as . Additionally, is the sigma field generated by the clipped losses, for all , where , and the observed sequence of clipped losses will be referred to as . By definition, .

Let be the sequence of actions selected by a player over the time horizon . Then, the regret of the player corresponding to the clipped losses is

(11)

where is the number of switches in the action selection sequence , and is the cost of each switch in action. We now define the regret corresponding to the unclipped loss function in Algorithm 1 as follows

(12)

Using (Dekel et al., 2014, Lemma 4), we have