# Cooperative Online Learning: Keeping your Neighbors Updated

Nicolò Cesa-Bianchi, Department of Computer Science, Università degli Studi di Milano, Italy

Tommaso R. Cesari, Department of Computer Science, Università degli Studi di Milano, Italy

Claire Monteleoni, Department of Computer Science, University of Colorado Boulder, Colorado

###### Abstract

We study an asynchronous online learning setting with a network of agents. At each time step, some of the agents are activated, requested to make a prediction, and pay the corresponding loss. The loss function is then revealed to these agents and also to their neighbors in the network. Our results characterize how much knowing the network structure affects the regret as a function of the model of agent activations. When activations are stochastic, the optimal regret (up to constant factors) is shown to be of order $\sqrt{\alpha T}$, where $T$ is the horizon and $\alpha$ is the independence number of the network. We prove that the upper bound is achieved even when agents have no information about the network structure. When activations are adversarial the situation changes dramatically: if agents ignore the network structure, an $\Omega(T)$ lower bound on the regret can be proven, showing that learning is impossible. However, when agents can choose to ignore some of their neighbors based on the knowledge of the network structure, we prove an $O(\sqrt{\overline{\chi} T})$ sublinear regret bound, where $\overline{\chi} \ge \alpha$ is the clique-covering number of the network.

Submitted and currently under revision.

## 1 Introduction

Distributed asynchronous online learning settings with communication constraints arise naturally in several applications. For example, large-scale learning systems are often geographically distributed, and in domains such as finance or online advertising, each agent must serve high volumes of prediction requests. If agents keep updating their local models in an online fashion, then bandwidth and computational constraints may preclude a central processor from having access to all the observations from all sessions, and synchronizing all local models at the same time. An example in a different domain is mobile sensor networks cooperating towards a common goal, such as environmental monitoring. Sensor readings provide instantaneous, full-information feedback and energy-saving constraints favor short-range communication over long-range. Online learning algorithms distributed over spatial locations have already been proposed for problems in the field of climate informatics [11, 12], and have shown empirical performance advantages compared to their global (i.e., non-spatially distributed) online learning counterparts.

Motivated by these real-life applications, we introduce and analyze an online learning setting in which a network of agents solves a common online convex optimization problem by sharing feedback with their network neighbors. Agents do not have to be synchronized. At each time step, only some of the agents are requested to make a prediction and pay the corresponding loss: we call these agents “active”. Because the feedback (i.e., the current loss function) received by the active agents is communicated to their neighbors, both active agents and their neighbors can use the feedback to update their local models. The lack of global synchronization implies that agents who are not requested to make a prediction get “free feedback” whenever someone is active in their neighborhood. Since in online convex optimization the sequence of loss functions is fully arbitrary, it is not clear whether this free feedback can improve the system’s performance. In this paper, we characterize under which conditions and to what extent such improvements are possible.

Our goal is to control the network regret, which we define by summing the average instantaneous regret of the active agents at each time step. In order to build some intuition on this problem, consider the following two extreme cases where, for the sake of simplicity, we assume exactly one agent is active at each time step. If no communication is possible among the agents, then each agent learns in isolation over the subset of time steps when they are active. Assuming each agent runs a standard online learning algorithm with regret bounded by $\sqrt{T}$, such as Online Mirror Descent (OMD), the network regret is at most of order $\sqrt{NT}$, where $T$ is the horizon and $N$ is the number of agents. Next, consider a fully connected graph, where agents share their feedback with the rest of the network. Each local instance of OMD now sees the same loss sequence as the other instances, so the sequence of predictions is the same, no matter which agents are chosen to be active. The network regret is then bounded by $\sqrt{T}$, as in the single-instance case. Our goal is to understand the regret when the communication network corresponds to an arbitrary graph $G$.

We consider two natural activation mechanisms for the agents: stochastic and adversarial. In the stochastic setting, at each time step each agent $v$ is independently active with probability $q_v$, where $q_v$ is a fixed and unknown number in $[0,1]$. Under this assumption, we show that when each agent runs OMD, the network regret is $O(\sqrt{\alpha T})$, where $\alpha$ is the independence number of the communication graph. Note that this bound smoothly interpolates the two extreme cases of no communication ($\alpha = N$) and full communication ($\alpha = 1$). From this viewpoint, $\alpha$ can be viewed as the number of “effective instances” that are implicitly maintained by the system. It is not hard to prove that this upper bound cannot be improved upon: fix a network $G$ and an independent set in $G$ of size $\alpha$. Define $q_v = 1/\alpha$ if $v$ belongs to the independent set and $q_v = 0$ otherwise. Then no two nodes that can ever become active are adjacent in $G$, and we have reduced the problem to that of learning with $\alpha$ non-communicating agents, each over roughly $T/\alpha$ time steps. Since there are instances of the standard online convex optimization problem on which any agent strategy has regret $\Omega(\sqrt{T/\alpha})$, we obtain that the network regret must be at least of order $\sqrt{\alpha T}$. Note that this lower bound also applies to algorithms that have complete preliminary knowledge of the graph structure, and can choose to ignore or process any feedback coming from their neighbors. In contrast, the OMD instances used to prove the upper bound are fully oblivious both to the graph structure and to the source of their feedback (i.e., whether their agent is active as opposed to being the neighbor of an active agent).

In the adversarial activation setting, nodes are activated according to some unknown deterministic schedule. Surprisingly, under the same assumption of obliviousness about the feedback source that we used to prove the upper bound for stochastic activations, we show that on certain network topologies a deterministic schedule of activations can force a linear regret on any algorithm, thus making learning impossible. On the other hand, if agents are free to use feedback only from a subset of their neighbors chosen with knowledge of the graph structure, then the network regret of OMD is $O(\sqrt{\overline{\chi} T})$, where $\overline{\chi} \ge \alpha$ is the clique-covering number of the communication graph. Hence, unlike the stochastic case, where the knowledge of the graph is not required to achieve optimality, in the adversarial case the ability to choose the feedback source based on the graph structure is both a necessary and sufficient condition for sublinear regret.

The extension of the OMD analysis to a multiagent setting with communication (Theorem 2), and the lower bound for the adversarial activation setting (Theorem 5) are the main technical novelties of the paper.

## 2 Related Works

The study of cooperative nonstochastic online learning on networks was pioneered by Awerbuch and Kleinberg [2], who investigated a bandit setting in which the communication graph is a clique, users are clustered so that the loss function at time $t$ may differ across clusters, and some users may be non-cooperative. More recently, a similar line of work was pursued by Cesa-Bianchi et al. [3], who derive graph-dependent regret bounds for nonstochastic bandits on arbitrary networks when the loss function is the same for all nodes and the feedback is broadcast to the network with a delay corresponding to the shortest path distance on the graph. Although their regret bounds, like ours, are expressed in terms of the network independence number, this happens for reasons that are very different from ours, and by means of a different analysis. In their setting all agents are simultaneously active at each time step, and sharing the feedback serves the purpose of reducing the variance of the importance-weighted loss estimates. A node with many neighbors observes the current loss function evaluated at all the points corresponding to actions played by the neighbors. Hence, in that context cooperation serves to bring the bandit feedback closer to a full information setting.

In contrast, we study a full information setting in which agents get free and meaningful feedback only when they are not requested to predict. (Two adjacent agents that are simultaneously active exchange their feedback, but this does not bring any new information to either agent because we are in a full information setting and the loss function is the same for all nodes.) Therefore, in our setting cooperation corresponds to faster learning (through the free feedback that is provided over time) within the full information model, as opposed to [3] where cooperation increases feedback within a single time step. An even more recent work considering bandit networks studies a stochastic bandit model with simultaneous activation and constraints on the amount of communication between neighbors [10]. Their regret bounds scale with the spectral gap of the communication network. The work of Sahu and Kar [14] investigates a different partial information model of prediction with expert advice where each agent is paired with an expert, and agents see only the loss of their own expert. The communication model includes delays, and the regret bound depends on a quantity related to the mixing time of a certain random walk on the network. Zhao et al. [18] study a decentralized online learning setting in which losses are characterized by two components, one adversarial and another stochastic. They show upper bounds on the regret in terms of a constant representing the magnitude of the adversarial component and another constant measuring the randomness of the stochastic part.

The idea of varying the amount of feedback available to learning agents has also appeared in single-agent settings. In the sleeping experts model [5], different subsets of actions are available to the learner at different time steps. In our multi-agent setting, instead, actions are always available while the agents are occasionally sleeping. An algorithmic reduction between the two settings seems unlikely to exist because actions and agents play completely different roles in the learning process. In the learning with feedback graphs model [1, 9], each selection of an action reveals to the learner the loss of the actions that are adjacent to it in a given graph. In our model, each time an active agent plays an action, the loss vector is revealed to the agents that are adjacent to the active learner. There is again a similarity between actions and agents in the two settings, but to the best of our knowledge there is no algorithmic reduction from multi-agent problems to single-agent problems. Yet, it should not come as a surprise that some general graph-theoretic tools —like Lemma 2— are used in the analysis of both single-agent and multi-agent models.

A very active area of research involves distributed extensions of online convex optimization, in which the global loss function is defined as a sum of local convex functions, each associated with an agent. Agents are run over the local optimization problem corresponding to their local functions and communicate with their neighborhood to find a point in the decision set approximating the loss of the best global action. This problem has been studied in various settings: distributed convex optimization (see, e.g., [4, 15] and references therein), distributed online convex optimization [8], and a dynamic regret extension of distributed online convex optimization [16]. Unlike our work, these papers consider distributed extensions of OMD (and Nesterov dual averaging) based on generalizations of the consensus problem. The resulting performance bounds scale inversely in the spectral gap of the communication network.

## 3 Preliminaries and definitions

Let $G = (V, E)$ be a communication network, i.e., an undirected graph over a set $V$ of agents. Without loss of generality, assume $V = \{1, \dots, N\}$. For any agent $v \in V$, we denote by $\mathcal{N}_v$ the set of nodes containing the agent $v$ and the neighborhood $\{w \in V : (v, w) \in E\}$. The independence number $\alpha_G$ is the cardinality of the largest independent set of $G$, i.e., the cardinality of the largest subset of agents, no two of which are neighbors.

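We study the following cooperative online convex optimization protocol: initially, hidden from the agents, the environment picks a sequence of subsets $S_1, S_2, \dots \subseteq V$ of active agents and a sequence of differentiable convex real loss functions $\ell_1, \ell_2, \dots$ defined on a convex decision set $\mathcal{X} \subseteq \mathbb{R}^d$. Then, for each time step $t = 1, 2, \dots$,

1. each agent $v \in S_t$ predicts with $x_t(v) \in \mathcal{X}$,

2. each agent $v \in \bigcup_{s \in S_t} \mathcal{N}_s$ receives $\ell_t$ as feedback,

3. the system incurs the loss $\frac{1}{|S_t|} \sum_{v \in S_t} \ell_t\big(x_t(v)\big)$ (defined as $0$ when $S_t = \emptyset$).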

We assume each agent $v$ runs an instance of the same online algorithm. Each instance learns a local model generating predictions $x_t(v)$. This local model is updated whenever a feedback $\ell_t$ is received. We call paid feedback the feedback received by $v$ when $v \in S_t$ (i.e., the agent is active) and free feedback the feedback received by $v$ when $v \notin S_t$ (i.e., the agent is not active but in the neighborhood of some active agent). The goal is to minimize the network regret as a function of the unknown number $T$ of time steps,

$$R_T = \sum_{t=1}^{T} \frac{1}{|S_t|} \sum_{v \in S_t} \ell_t\big(x_t(v)\big) - \inf_{x \in \mathcal{X}} \sum_{t=1}^{T} \ell_t(x) \tag{1}$$

Note that only the losses of active agents contribute to the network regret.
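The protocol and the regret (1) can be made concrete with a short sketch. The Python below is illustrative only: the helper names (`closed_neighborhood`, `feedback_receivers`, `network_regret`) are hypothetical, the graph is assumed to be given as an adjacency dictionary, and the infimum over the decision set is replaced by a minimum over a finite list of candidate points supplied by the caller.

```python
from typing import Callable, Dict, List, Set


def closed_neighborhood(adj: Dict[int, Set[int]], v: int) -> Set[int]:
    """Agent v together with its neighbors in the graph."""
    return {v} | adj[v]


def feedback_receivers(adj: Dict[int, Set[int]], active: Set[int]) -> Set[int]:
    """Agents that see the current loss: active agents and their neighbors."""
    receivers: Set[int] = set()
    for s in active:
        receivers |= closed_neighborhood(adj, s)
    return receivers


def network_regret(
    losses: List[Callable],
    active_sets: List[Set[int]],
    predictions: List[Dict[int, object]],
    comparators: List[object],
) -> float:
    """Evaluate (1), with the infimum taken over a finite comparator list
    (an assumption of this sketch, not part of the definition)."""
    paid = 0.0
    for loss, active, preds in zip(losses, active_sets, predictions):
        if active:  # the round contributes 0 when nobody is active
            paid += sum(loss(preds[v]) for v in active) / len(active)
    best = min(sum(loss(x) for loss in losses) for x in comparators)
    return paid - best
```

On a path with three agents, activating an endpoint sends feedback to two agents, while activating the middle agent sends it to all three. Only the predictions of active agents enter the regret.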

## 4 Online Mirror Descent

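We now review the standard Online Mirror Descent algorithm (OMD), see Algorithm 1, and its analysis. Let $g : \mathcal{X} \to \mathbb{R}$ be a convex function. We say that $g^\star : \mathbb{R}^d \to \mathbb{R}$ is the convex conjugate of $g$ if $g^\star(\theta) = \sup_{x \in \mathcal{X}} \big( \langle \theta, x \rangle - g(x) \big)$. We say that $g$ is $\sigma$-strongly convex on $\mathcal{X}$ with respect to a norm $\|\cdot\|$ if there exists $\sigma > 0$ such that, for all $u, w \in \mathcal{X}$, we have $g(u) \ge g(w) + \langle \nabla g(w), u - w \rangle + \frac{\sigma}{2} \|u - w\|^2$. The following well-known result can be found in [17, Lemma 2.19 and subsequent paragraph].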

###### Lemma 1.

Let $g$ be a strongly convex function on $\mathcal{X}$. Then the convex conjugate $g^\star$ is everywhere differentiable on $\mathbb{R}^d$.

The following result —see, e.g., [13, bound (6) in Corollary 1 with set to zero]— shows an upper bound on the regret of OMD.

###### Theorem 1.

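Let $g : \mathcal{X} \to \mathbb{R}$ be a differentiable function $\sigma$-strongly convex with respect to $\|\cdot\|$. Then the regret of OMD run with $\eta_t = \eta/\sqrt{t}$, for $\eta > 0$, satisfies

$$\sum_{t=1}^{T} \ell_t(x_t) - \inf_{x \in \mathcal{X}} \sum_{t=1}^{T} \ell_t(x) \le \frac{D}{\eta} \sqrt{T} + \frac{\eta}{2\sigma} \sum_{t=1}^{T} \frac{1}{\sqrt{t}} \big\| \nabla \ell_t(x_t) \big\|_*^2$$

where $D = \sup_{x \in \mathcal{X}} g(x)$ and $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$. If $\|\nabla \ell_t(x_t)\|_* \le L$ for all $t$, then choosing $\eta = \sqrt{\sigma D}/L$ gives a regret of order $L\sqrt{DT/\sigma}$.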

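A popular instance of OMD is the standard online gradient descent algorithm, corresponding to choosing $\mathcal{X}$ equal to a closed Euclidean ball centered at the origin, and setting $g(x) = \frac{1}{2}\|x\|_2^2$ for all $x \in \mathcal{X}$, where $\|\cdot\|_2$ is the Euclidean norm. Another instance is the Hedge algorithm for prediction with expert advice, corresponding to choosing $\mathcal{X}$ equal to the probability simplex, and setting $g$ equal to the negative entropy $g(x) = \sum_i x_i \ln x_i$ (up to an additive constant).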
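The two instances above can be sketched in a few lines of Python. Since Algorithm 1 is not reproduced in this text, the sketch assumes one common "lazy" formulation of OMD: the learner keeps the negated sum of observed gradients $\theta_t$ and predicts with the maximizer of $\langle \theta_t, x \rangle - g(x)/\eta_t$ over the decision set. The class names (`LazyOMD`, `GradientDescentOnBall`, `Hedge`) are hypothetical, and the learning rate $\eta/\sqrt{t}$ is indexed by the number of updates the instance has received.

```python
import math
from typing import List


class LazyOMD:
    """Keeps theta = -(sum of gradients seen); subclasses supply the mirror map."""

    def __init__(self, dim: int, eta: float) -> None:
        self.theta = [0.0] * dim
        self.eta = eta
        self.t = 1  # index of the next prediction made by this instance

    def rate(self) -> float:
        return self.eta / math.sqrt(self.t)

    def predict(self) -> List[float]:
        raise NotImplementedError

    def update(self, grad: List[float]) -> None:
        self.theta = [th - g for th, g in zip(self.theta, grad)]
        self.t += 1


class GradientDescentOnBall(LazyOMD):
    """g(x) = ||x||^2 / 2 on a Euclidean ball: scale theta, then project."""

    def __init__(self, dim: int, eta: float, radius: float = 1.0) -> None:
        super().__init__(dim, eta)
        self.radius = radius

    def predict(self) -> List[float]:
        point = [self.rate() * th for th in self.theta]
        norm = math.sqrt(sum(p * p for p in point))
        if norm > self.radius:
            point = [p * self.radius / norm for p in point]
        return point


class Hedge(LazyOMD):
    """Negative entropy on the simplex: exponential weights."""

    def predict(self) -> List[float]:
        scores = [self.rate() * th for th in self.theta]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        return [w / total for w in weights]
```

Both classes expose the same `predict`/`update` interface, which is all an instance needs in the cooperative protocol: it predicts when its agent is active and updates whenever a loss reaches it.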

## 5 Stochastic Activations

In this section we analyze the performance of OMD when the sets of active agents are chosen stochastically. As discussed in the introduction, in this setting we do not require any ad-hoc interface between each OMD instance and the rest of the network. In particular, we make the following assumption.

###### Assumption 1 (Oblivious network interface).

An online algorithm $A$ is run with an oblivious network interface if for each agent $v$ it holds that:

1. $v$ runs an instance $A_v$ of $A$,

2. $A_v$ uses the same initialization and learning rate as the other instances,

3. $A_v$ makes predictions and updates while being oblivious to whether $v \in S_t$ or $v \notin S_t$.

This assumption implies that each instance is oblivious to both the network topology and the location of the agent in the network. Moreover, instances make an update whenever they have the opportunity to do so (i.e., whenever they or some of their neighbors are active). The purpose of this assumption is to show that communication might help OMD even without any network-specific tuning. In concrete applications, one might use ad-hoc OMD variants that rely on the knowledge of the task at hand, and decrease the regret even further. However, the lower bound proven in Section 6 shows that the regret cannot be decreased significantly even when agents have full knowledge of the graph.

We start by considering a slightly simplified stochastic activation setting, where only a single agent is activated at each time step (i.e., $|S_t| = 1$ for all $t$). The more general stochastic case is analyzed at the end of this section.

We assume that the active agents $v_1, v_2, \dots$ are drawn i.i.d. from an unknown fixed distribution $q$ on $V$. The main result of this section is an upper bound on the regret of the network when all agents run the basic OMD (Algorithm 1) with an oblivious network interface. We show that in this case the network achieves the same regret guarantee as the single-agent OMD (Theorem 1) multiplied by the square root of the independence number of the communication network.

Before proving the main result, we state a combinatorial lemma that allows us to upper bound the sum of a ratio of probabilities over the vertices of an undirected graph with the independence number of the graph [6, 9]. The proof is included for completeness.

###### Lemma 2.

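Let $G = (V, E)$ be an undirected graph with independence number $\alpha_G$ and $q$ any probability distribution on $V$ such that $Q_v = \sum_{w \in \mathcal{N}_v} q_w > 0$ for all $v \in V$. Then

$$\sum_{v \in V} \frac{q_v}{Q_v} \le \alpha_G$$

###### Proof.

Initialize $V_1 = V$, fix $w_1 \in \arg\min_{w \in V_1} Q_w$, and denote $V_2 = V_1 \setminus \mathcal{N}_{w_1}$. For $k \ge 2$ fix $w_k \in \arg\min_{w \in V_k} Q_w$ and shrink $V_{k+1} = V_k \setminus \mathcal{N}_{w_k}$ until $V_{k+1} = \emptyset$. Since $G$ is undirected, we have $w_k \notin \bigcup_{s < k} \mathcal{N}_{w_s}$, so the picked nodes form an independent set and the number $m$ of times that a node can be picked this way is upper bounded by $\alpha_G$. Denoting $\mathcal{N}'_{w_k} = V_k \cap \mathcal{N}_{w_k}$, this implies

$$\sum_{v \in V} \frac{q_v}{Q_v} = \sum_{k=1}^{m} \sum_{v \in \mathcal{N}'_{w_k}} \frac{q_v}{Q_v} \le \sum_{k=1}^{m} \sum_{v \in \mathcal{N}'_{w_k}} \frac{q_v}{Q_{w_k}} \le \sum_{k=1}^{m} \sum_{v \in \mathcal{N}_{w_k}} \frac{q_v}{Q_{w_k}} = m \le \alpha_G$$

∎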
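Lemma 2 is easy to check numerically on small graphs. The sketch below is illustrative: it computes the independence number by brute force (only feasible for a handful of nodes), evaluates the sum of ratios, and mimics the greedy selection used in the proof under the reading that each step picks a remaining node with the smallest $Q$. All function names are hypothetical, and the distribution is assumed to put positive mass on every node.

```python
from itertools import combinations
from typing import Dict, List, Set


def closed_nbhd(adj: Dict[int, Set[int]], v: int) -> Set[int]:
    return {v} | adj[v]


def independence_number(adj: Dict[int, Set[int]]) -> int:
    """Brute-force search over subsets, largest first."""
    nodes = list(adj)
    for size in range(len(nodes), 0, -1):
        for subset in combinations(nodes, size):
            if all(b not in adj[a] for a, b in combinations(subset, 2)):
                return size
    return 0


def ratio_sum(adj: Dict[int, Set[int]], q: Dict[int, float]) -> float:
    """Sum over v of q_v / Q_v, with Q_v the mass of the closed neighborhood."""
    return sum(q[v] / sum(q[w] for w in closed_nbhd(adj, v)) for v in adj)


def greedy_independent_set(adj: Dict[int, Set[int]], q: Dict[int, float]) -> List[int]:
    """Repeatedly pick a remaining node with minimal Q, then drop its neighborhood."""
    big_q = {v: sum(q[w] for w in closed_nbhd(adj, v)) for v in adj}
    remaining = set(adj)
    picked: List[int] = []
    while remaining:
        w = min(remaining, key=lambda v: big_q[v])
        picked.append(w)
        remaining -= closed_nbhd(adj, w)
    return picked
```

For the 5-cycle with the uniform distribution, each ratio is $1/3$, so the sum is $5/3$, below the independence number $2$.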

The following holds for any differentiable function $g$, $\sigma$-strongly convex with respect to some norm $\|\cdot\|$.

###### Theorem 2.

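Consider a network $G = (V, E)$ of $N$ agents and assume $S_t = \{v_t\}$ for each $t$, where $v_t$ is drawn i.i.d. from some fixed and unknown distribution $q$ on $V$. If all agents run OMD with an oblivious network interface and using $\eta_t = \eta/\sqrt{t}$, for $\eta > 0$, then the network regret satisfies

$$\mathbb{E}[R_T] \le \left( \frac{D}{\eta} + \frac{\eta L^2}{2\sigma} \right) \sqrt{\alpha_G T}$$

where $D = \sup_{x \in \mathcal{X}} g(x)$, $L = \sup_{t, x} \|\nabla \ell_t(x)\|_*$, and $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$. In particular, choosing $\eta = \sqrt{\sigma D}/L$ gives $\mathbb{E}[R_T] = O\big( L \sqrt{D \alpha_G T / \sigma} \big)$.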

###### Proof.

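Fix $x \in \mathcal{X}$, any sequence of realizations $v_1, \dots, v_T$, and any $v$ in the support of the activation distribution $q$. Note that the OMD instance run by $v$ makes an update at time $t$ only when $v \in \mathcal{N}_{v_t}$. Hence, letting $r_t(v) = \ell_t\big(x_t(v)\big) - \ell_t(x)$ and applying Theorem 1,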

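$$\sum_{t=1}^{T} r_t(v)\, \mathbb{I}\{v \in \mathcal{N}_{v_t}\} \le \frac{D}{\eta} \sqrt{T_v} + \frac{\eta L^2}{2\sigma} \sum_{t=1}^{T} \frac{\mathbb{I}\{v \in \mathcal{N}_{v_t}\}}{\sqrt{\sum_{s=1}^{t} \mathbb{I}\{v \in \mathcal{N}_{v_s}\}}} \le \left( \frac{D}{\eta} + \frac{\eta L^2}{2\sigma} \right) \sqrt{T_v} \tag{2}$$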

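where $T_v = \sum_{t=1}^{T} \mathbb{I}\{v \in \mathcal{N}_{v_t}\}$, the addends after the first inequality are intended to be null when the denominator is zero, and we used $\|\nabla \ell_t\|_* \le L$. Note that $r_t(v)$ is independent of $v_t$, as it only depends on the subset of $v_s$, $s < t$, such that $v \in \mathcal{N}_{v_s}$. Denote by $Q_v$ the probability $\mathbb{P}(v \in \mathcal{N}_{v_t}) = \sum_{w \in \mathcal{N}_v} q_w$. Let $\mathcal{F}_{t-1}$ be the $\sigma$-algebra generated by $v_1, \dots, v_{t-1}$. Since $v_t$ is independent of $\mathcal{F}_{t-1}$, $\mathbb{E}\big[ \mathbb{I}\{v \in \mathcal{N}_{v_t}\} \mid \mathcal{F}_{t-1} \big] = Q_v$. Therefore, taking expectation with respect to $v_1, \dots, v_T$ on both sides of (2), using $\mathbb{E}[T_v] = Q_v T$, and applying Jensen’s inequality, yields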

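$$\mathbb{E}\left[ \sum_{t=1}^{T} r_t(v)\, Q_v \right] \le \left( \frac{D}{\eta} + \frac{\eta L^2}{2\sigma} \right) \sqrt{Q_v T} \tag{3}$$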

Now, letting $V'$ be the support of $q$ and $R_T(x) = \sum_{t=1}^{T} r_t(v_t)$, we have that $\mathbb{E}[R_T(x)]$ is equal to

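$$\mathbb{E}\left[ \sum_{v \in V'} \sum_{t=1}^{T} r_t(v)\, \mathbb{I}\{v_t = v\} \right] = \mathbb{E}\left[ \sum_{v \in V'} \sum_{t=1}^{T} r_t(v)\, \mathbb{E}\big[ \mathbb{I}\{v_t = v\} \mid \mathcal{F}_{t-1} \big] \right] = \sum_{v \in V'} q_v\, \mathbb{E}\left[ \sum_{t=1}^{T} r_t(v) \right]$$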

Dividing both sides of (3) by $Q_v$, we can write

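$$\mathbb{E}[R_T(x)] \le \left( \frac{D}{\eta} + \frac{\eta L^2}{2\sigma} \right) \sum_{v \in V'} q_v \sqrt{\frac{T}{Q_v}} \le \left( \frac{D}{\eta} + \frac{\eta L^2}{2\sigma} \right) \sqrt{T \sum_{v \in V'} \frac{q_v}{Q_v}} \le \left( \frac{D}{\eta} + \frac{\eta L^2}{2\sigma} \right) \sqrt{\alpha T}$$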

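where in the last two inequalities we applied Jensen’s inequality and Lemma 2 to the subgraph induced by the support of $q$, whose independence number we denote by $\alpha$. Observing that $\alpha \le \alpha_G$ and recalling that $x$ was chosen arbitrarily in $\mathcal{X}$ concludes the proof. ∎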

Note that the proof of the previous result gives a tighter upper bound on the network regret in terms of the independence number of the subgraph induced by the support of .
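The single-activation setting with an oblivious network interface can be simulated directly. The following sketch is an illustration under stated assumptions rather than the authors' experimental code: every agent runs Hedge (one instance of OMD) on linear losses over the simplex, each instance uses the rate $\eta/\sqrt{k+1}$ after receiving $k$ updates, and the function names (`hedge_prediction`, `simulate`) are hypothetical. The simulation returns the network regret against the best fixed action and the number of updates $T_v$ received by each agent.

```python
import math
import random
from typing import Dict, List, Sequence, Set, Tuple


def hedge_prediction(theta: List[float], eta: float, updates: int) -> List[float]:
    """Exponential weights with rate eta / sqrt(updates + 1)."""
    rate = eta / math.sqrt(updates + 1)
    scores = [rate * th for th in theta]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def simulate(
    adj: Dict[int, Set[int]],
    q: Dict[int, float],
    loss_vectors: Sequence[Sequence[float]],
    eta: float = 1.0,
    seed: int = 0,
) -> Tuple[float, Dict[int, int]]:
    rng = random.Random(seed)
    agents = sorted(adj)
    dim = len(loss_vectors[0])
    theta = {v: [0.0] * dim for v in agents}
    updates = {v: 0 for v in agents}
    paid = 0.0
    for loss in loss_vectors:
        # One agent is drawn i.i.d. from q and pays the loss of its prediction.
        active = rng.choices(agents, weights=[q[v] for v in agents])[0]
        x = hedge_prediction(theta[active], eta, updates[active])
        paid += sum(xi * li for xi, li in zip(x, loss))
        # The loss reaches the active agent and its neighbors; every instance
        # updates in the same way, without knowing why it got the feedback.
        for v in {active} | adj[active]:
            theta[v] = [th - li for th, li in zip(theta[v], loss)]
            updates[v] += 1
    # For linear losses the best point of the simplex is a single action.
    best = min(sum(loss[i] for loss in loss_vectors) for i in range(dim))
    return paid - best, updates
```

On a clique every instance is updated in every round, so all instances coincide; on an edgeless graph the update counts add up to the horizon, matching the two extreme cases discussed in the introduction.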

Next, we consider the setting in which we allow the activation of more than one agent per time step. At the beginning of the process, the environment draws an i.i.d. sequence of Bernoulli random variables $X_1(v), X_2(v), \dots$ with some unknown fixed parameter $q_v$ for each agent $v$. The active set at time $t$ is then defined as $S_t = \{v \in V : X_t(v) = 1\}$. Note that, unlike the previous setting, now $\sum_{v \in V} q_v \ne 1$ in general.

We state an upper bound on the regret that the network incurs if all agents run OMD with an oblivious network interface (for a proof, see Appendix A). Our upper bound is expressed in terms of a constant $Q$ depending on the probabilities of activating each agent and such that $Q$ is at most a small constant times $\alpha_G$. The result holds for any differentiable function $g$, $\sigma$-strongly convex with respect to some norm $\|\cdot\|$.

###### Theorem 3.

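Consider a network $G = (V, E)$ of $N$ agents. Assume that, at each time step $t$, each agent $v$ is independently activated with probability $q_v$. If all agents run OMD with an oblivious network interface and using $\eta_t = \eta/\sqrt{t}$, for $\eta > 0$, the network regret satisfies

$$\mathbb{E}[R_T] \le \left( \frac{D}{\eta} + \frac{\eta L^2}{2\sigma} \right) \sqrt{Q T}$$

for some $Q$ depending only on $G$ and the activation probabilities, $D = \sup_{x \in \mathcal{X}} g(x)$, and $L = \sup_{t, x} \|\nabla \ell_t(x)\|_*$, where $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$. In particular, choosing $\eta = \sqrt{\sigma D}/L$ gives $\mathbb{E}[R_T] = O\big( L \sqrt{D Q T / \sigma} \big)$.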

In order to compare the previous upper bound to Theorem 2, consider the case $q_v = q$ for all $v \in V$. Without loss of generality, assume $q > 0$ (the regret is zero when $q$ vanishes). Then

$$Q = Q(q) = \frac{1}{N} \sum_{v \in V} \frac{1 - (1 - q)^{N}}{1 - (1 - q)^{|\mathcal{N}_v|}}$$

(for a proof, see Theorem 7 in Appendix A and proceed as in the proof of Lemma 3). A direct computation of the first derivative of the addends shows that these functions are decreasing in $q$, hence $1 = Q(1) \le Q(q) \le \lim_{q \to 0^+} Q(q) = \sum_{v \in V} \frac{1}{|\mathcal{N}_v|} \le \alpha_G$, where the last inequality follows by Lemma 2 applied to the uniform distribution on $V$. Note that the lower bound is attained if the probabilities of picking agents at each time step are all $1$. In this case all agents are activated at each time step, the graph structure over the set of agents becomes irrelevant and the model reduces to a single-agent problem. The inequality $Q \le \alpha_G$ is not a coincidence due to the constant activation probability $q$. Indeed, one can prove that this is always the case, up to a small constant factor (for a proof, see Lemma 4 in Appendix A).
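The displayed formula for $Q(q)$ can be evaluated directly to see how it moves between its two extremes. This is a small illustrative sketch with hypothetical function names; `small_q_limit` computes $\sum_v 1/|\mathcal{N}_v|$, the value obtained by letting $q \to 0^+$ in each addend, where $|\mathcal{N}_v|$ counts the agent itself plus its neighbors.

```python
from typing import Dict, Set


def q_constant(adj: Dict[int, Set[int]], q: float) -> float:
    """Q(q) for equal activation probabilities 0 < q <= 1."""
    n = len(adj)
    numerator = 1.0 - (1.0 - q) ** n
    total = 0.0
    for v in adj:
        closed_size = len(adj[v]) + 1  # |N_v| includes v itself
        total += numerator / (1.0 - (1.0 - q) ** closed_size)
    return total / n


def small_q_limit(adj: Dict[int, Set[int]]) -> float:
    """Limit of Q(q) as q -> 0+, i.e. the sum of 1 / |N_v|."""
    return sum(1.0 / (len(adj[v]) + 1) for v in adj)
```

On a star with four leaves, `q_constant` equals $1$ at $q = 1$ and increases towards `small_q_limit`, which is $2.2$, as $q$ shrinks; the independence number of that star is $4$.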

The previous result shows that paying the average price of multiple activations is never worse (up to a small constant factor) than drawing a single agent per time step, and it can be significantly better. A similar argument shows a tighter bound when the activation probabilities satisfy $\sum_{v \in V} q_v = 1$, which allows us to recover the upper bound on the network regret proven in Theorem 2. This is consistent with the intuition that, in expectation, picking a single agent at random according to a distribution $q$ is the same as picking each $v$ independently with probability $q_v$. Similarly to the single-activation case, the previous result gives a tighter upper bound on the network regret in terms of the independence number of the subgraph induced by the subset of $V$ containing all agents $v$ with $q_v > 0$. Note that the setting discussed in this section smoothly interpolates between the single-agent setting ($q_v = 1$ for all $v$), cooperative learning with one agent stochastically activated at each time step ($\sum_{v} q_v = 1$), and beyond ($\sum_{v} q_v < 1$), where a non-trivial fraction of the total number of rounds is skipped.

## 6 Lower Bound for Stochastic Activations

In this section we show that, for any communication network $G$ with stochastic agent activations, the best possible regret rate is of order $\sqrt{\alpha_G T}$. This holds even when agents are not restricted to use an oblivious network interface. The idea is that if the distribution from which active agents are drawn is supported on an independent set of cardinality $\alpha_G$, then the problem reduces to that of an edgeless graph with $\alpha_G$ agents. We sketch the proof for the case when $|S_t| = 1$ for all $t$.

###### Theorem 4.

There exists a convex decision set in $\mathbb{R}^d$ such that, for each communication network $G$ and for arbitrary (and possibly different) online learning algorithms run by the agents, $\mathbb{E}[R_T] = \Omega\big(\sqrt{\alpha_G T}\big)$ for some sequence $\ell_1, \ell_2, \dots$ of losses, where $S_t = \{v_t\}$, $v_t$ is drawn i.i.d. from some fixed distribution $q$ on $V$, and the expectation is taken with respect to the random draw of the $v_t$.

###### Proof sketch.

Let $\mathcal{X}$ be the probability simplex in $\mathbb{R}^d$. Let $G$ be any communication graph and $\alpha_G$ its independence number. We consider linear losses defined on $\mathcal{X}$. Let $q$ be the uniform distribution over an independent set $I \subseteq V$ of cardinality $\alpha_G$. Fix now any cooperative online linear optimization algorithm for this setting. Since each active agent $v_t$ belongs to $I$ for all $t$ with probability $1$, it suffices to analyze the updates of the algorithm for these agents. Indeed, no other agent incurs any loss at any time step. Since $I$ is an independent set, each agent $v \in I$ makes an update at round $t$ if and only if $v_t = v$. This happens with probability $1/\alpha_G$, independently of $t$. Each agent is therefore running an independent single-agent online linear optimization problem for an average of $T/\alpha_G$ rounds. It is well-known [7, Theorem 3.2] that any algorithm for online linear optimization on the simplex with losses bounded in $[0,1]$ incurs $\Omega(\sqrt{T})$ regret over $T$ rounds in the worst case. Consequently, the regret of the network satisfies

$$\mathbb{E}[R_T] = \Omega\big( \alpha_G \sqrt{T/\alpha_G} \big) = \Omega\big( \sqrt{\alpha_G T} \big)$$

An analogous lower bound can be proven for the case of multiple agent activations per time step. Indeed, define $q_v = 1/\alpha_G$ for each agent $v$ belonging to some fixed independent set of cardinality $\alpha_G$ and $q_v = 0$ otherwise. This again leads to $\alpha_G$ independent single-agent online linear optimization problems for an average of $T/\alpha_G$ rounds each, and an argument similar to the one in the proof of Theorem 4 gives the result.

## 7 Nonstochastic Activations

In this section we drop the stochasticity assumption on the agents’ activations and focus on the case where active agents are picked from $V$ by an adversary. The goal is to control the regret (1) for any individual sequence of pairs $(\ell_1, S_1), (\ell_2, S_2), \dots$ where $\ell_t$ is a convex loss and $S_t \subseteq V$, without any stochastic assumptions on the mechanism generating these pairs. For the rest of this section, we focus on the special case where $|S_t| = 1$ for all $t$ and denote by $v_t$ the active node at time $t$.

We start by proving that learning with adversarial activations is impossible if we use an oblivious network interface. We prove this result in the setting of prediction with expert advice with two actions and binary losses, a special case of online convex optimization. The idea of the lower bound is that if the communication network is a star graph, the environment is able to make both actions look equally good to all peripheral agents, even if one of the two actions is actually better than the other. This is done by drawing the good action at random, then activating an agent depending on the outcome of the draw. For a small fraction of the times the good action has loss one, the central agent is activated. Since the central agent shares feedback with all $N - 1$ peripheral agents, we can amplify this loss by a factor of $N - 1$, and thus make the good action look to all peripheral agents as bad as the bad action.

###### Theorem 5.

For each $N \ge 5$ there exists a convex decision set in $\mathbb{R}^2$ and a graph $G$ with $N$ vertices such that, whenever agents are run on $G$ using instances of any online learning algorithm with an oblivious network interface, then $R_T \ge T/8$ for some sequence of convex losses and active agents.

###### Proof.

Fix $N \ge 5$ and let $\mathcal{X}$ be the probability simplex in $\mathbb{R}^2$. Let $G$ be the star graph with central agent $a_0$, and peripheral agents $a_1, \dots, a_{N-1}$. Because our losses are linear on $\mathcal{X}$, the online convex optimization problem is equivalent to prediction with expert advice with two experts (or actions), and we may denote losses using loss vectors $\big(\ell_t(1), \ell_t(2)\big) \in \{0,1\}^2$ where $1$ and $2$ index the actions. A good action $i^\star \in \{1, 2\}$ is drawn uniformly at random. Denote the other one (i.e., the bad one) by $j^\star$. To keep notation tidy, we define loss vectors by $\ell_t = \big(\ell_t(i^\star), \ell_t(j^\star)\big)$. Fix any $\varepsilon \in (0, 1/2]$. The loss vectors $\ell_t$ are drawn i.i.d. at random, according to the following joint distribution:

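$$\mathbb{P}\big(\ell_t = (0,1)\big) = \frac{1}{2} \qquad \mathbb{P}\big(\ell_t = (1,0)\big) = \frac{1}{2} - \varepsilon + \frac{\varepsilon}{N-1} \qquad \mathbb{P}\big(\ell_t = (0,0)\big) = \varepsilon - \frac{\varepsilon}{N-1}$$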

Recall that only a single agent is active at any time. At each time step $t$, the adversary decides whether to activate the central agent or a peripheral agent, depending on the realization of $\ell_t$. If $\ell_t \ne (1,0)$, then a random peripheral agent is activated. Otherwise, we set

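$$\mathbb{P}\big(\ell_t = (1,0),\, v_t = a_0\big) = \frac{\varepsilon}{N-1} \qquad \text{and} \qquad \mathbb{P}\big(\ell_t = (1,0),\, v_t = a_i\big) = \frac{1/2 - \varepsilon}{N-1} \quad \text{for each } a_i \in \{a_1, \dots, a_{N-1}\}$$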

Note that when $v_t = a_0$, then all peripheral agents receive feedback $\ell_t$. Similarly, when a peripheral agent $a_i$ is active at time $t$, then $a_0$ receives feedback $\ell_t$. For $(b_1, b_2) \in \{(0,1), (1,0), (0,0)\}$, let $E(a_i, b_1, b_2)$ be the event: agent $a_i$ receives the loss vector $(b_1, b_2)$ as feedback. The following statements then hold for each peripheral agent $a_i$,

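$$\mathbb{P}\big(E(a_i, 0, 1)\big) = \frac{1/2}{N-1} \qquad \mathbb{P}\big(E(a_i, 0, 0)\big) = \frac{\varepsilon}{N-1} - \frac{\varepsilon}{(N-1)^2}$$

$$\mathbb{P}\big(E(a_i, 1, 0)\big) = \frac{1/2 - \varepsilon}{N-1} + \frac{\varepsilon}{N-1} = \frac{1/2}{N-1}$$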

Hence, each instance managed by a peripheral agent observes loss vectors $(0,1)$ and $(1,0)$ with the same probability proportional to $1/2$, and loss vector $(0,0)$ with probability proportional to $\varepsilon - \varepsilon/(N-1)$. Since the network interface is oblivious, the instance cannot distinguish between paid and free feedback (which would reveal the good action), and incurs an expected loss of $1/2$ each time $\ell_t \ne (0,0)$. Using the fact that a peripheral agent is active when $\ell_t \ne (0,0)$ with probability $1 - \varepsilon$, the system’s expected total loss is at least $(1 - \varepsilon) T / 2$ (we lower bound the loss of the central agent by zero). Since the expected loss of the good action is $\big( \frac{1}{2} - \varepsilon + \frac{\varepsilon}{N-1} \big) T$, the expected regret of the system satisfies

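$$\mathbb{E}[R_T] \ge \left( \frac{1 - \varepsilon}{2} - \frac{1}{2} + \varepsilon - \frac{\varepsilon}{N-1} \right) T \ge \frac{T}{8}$$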

where we picked $\varepsilon = 1/2$ and used $N \ge 5$ in the last inequality. Therefore, there exists some sequence $(\ell_1, v_1), (\ell_2, v_2), \dots$ such that $R_T \ge T/8$, concluding the proof. ∎
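The bookkeeping in this construction can be verified with exact rational arithmetic. The sketch below is an illustration: it encodes the joint law of the loss vector and the active agent on a star (agent `0` is the center), derives the law of the feedback seen by one peripheral agent, and evaluates the per-round regret rate from the last display. The function names are hypothetical, and the values $N = 5$, $\varepsilon = 1/2$ used in the check are picked here for illustration.

```python
from fractions import Fraction
from typing import Dict, Tuple

Loss = Tuple[int, int]  # (loss of the good action, loss of the bad action)


def star_adversary(n: int, eps: Fraction) -> Dict[Tuple[Loss, int], Fraction]:
    """Joint law of (loss vector, active agent) on a star with n agents."""
    leaves = n - 1
    joint: Dict[Tuple[Loss, int], Fraction] = {}
    for i in range(1, n):
        # For (0,1) and (0,0) a uniformly random peripheral agent is activated.
        joint[((0, 1), i)] = Fraction(1, 2) / leaves
        joint[((0, 0), i)] = (eps - eps / leaves) / leaves
        joint[((1, 0), i)] = (Fraction(1, 2) - eps) / leaves
    joint[((1, 0), 0)] = eps / leaves  # the center is active only on (1,0)
    return joint


def feedback_law(joint: Dict[Tuple[Loss, int], Fraction], leaf: int) -> Dict[Loss, Fraction]:
    """Probability that a peripheral agent sees each loss vector in a round."""
    seen: Dict[Loss, Fraction] = {}
    for (loss, active), prob in joint.items():
        if active in (0, leaf):  # paid feedback, or free feedback from the center
            seen[loss] = seen.get(loss, Fraction(0)) + prob
    return seen


def regret_rate(n: int, eps: Fraction) -> Fraction:
    """Per-round lower bound on the expected regret from the last display."""
    return (1 - eps) / 2 - (Fraction(1, 2) - eps + eps / (n - 1))
```

With five agents and $\varepsilon = 1/2$, a peripheral agent sees $(0,1)$ and $(1,0)$ equally often, and the regret rate evaluates to exactly $1/8$.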

We complement the above negative result by showing that when algorithms are run without the oblivious network interface, and agents are free to use feedback only from a subset of their neighbors chosen with knowledge of the graph structure, then the network regret of OMD is $O\big(\sqrt{\overline{\chi}_G T}\big)$. The quantity $\overline{\chi}_G$ is the clique-covering number of the communication graph $G$, which corresponds to the smallest cardinality of a clique cover of $G$ (a clique cover is a partition of the vertices such that the nodes in every element of the partition form a clique in the graph). The intuition behind this result is simple: fix a clique cover and let the agents in the same clique of the cover know each other. Now, if each agent ignores all feedback coming from agents in other cliques, then the agents in the same clique make exactly the same sequence of predictions and updates. Therefore, the effective number of OMD instances that are being run is equal to $\overline{\chi}_G$.

The following result holds for any differentiable function $g$, $\sigma$-strongly convex with respect to some norm $\|\cdot\|$.

###### Theorem 6.

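Consider a network $G = (V, E)$ of $N$ agents, a clique cover $\{K_1, \dots, K_{\overline{\chi}_G}\}$ of $G$, where $\overline{\chi}_G$ is the clique-covering number of $G$, and let $K(v)$ be the unique element of the cover which each $v \in V$ belongs to. For any sequence $v_1, v_2, \dots$ of active agents, assume each agent $v$ runs OMD using $\eta_t = \eta/\sqrt{t}$ (with $\eta > 0$) while making updates only at those time steps $t$ such that $v_t \in K(v)$. Then the network regret satisfies

$$\mathbb{E}[R_T] \le \left( \frac{D}{\eta} + \frac{\eta L^2}{2\sigma} \right) \sqrt{\overline{\chi}_G T}$$

where $D = \sup_{x \in \mathcal{X}} g(x)$, $L = \sup_{t, x} \|\nabla \ell_t(x)\|_*$, and $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$. In particular, choosing $\eta = \sqrt{\sigma D}/L$ gives $\mathbb{E}[R_T] = O\big( L \sqrt{D \overline{\chi}_G T / \sigma} \big)$.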

###### Proof.

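Fix any clique $K_c$ of the cover and any $x \in \mathcal{X}$. Let $\mathcal{T}_c$ be the set of time steps $t$ such that $v_t \in K_c$, and let $T_c = |\mathcal{T}_c|$. Since each agent ignores the feedback coming from other cliques, the nodes in $K_c$ perform exactly the same updates, and therefore make exactly the same predictions. This means that, for any $t$, the predictions in the set $\{x_t(v) : v \in K_c\}$ are all equal to the same common value denoted by $x_t(K_c)$. For any $t$, let $r_t(K_c) = \ell_t\big(x_t(K_c)\big) - \ell_t(x)$. By Theorem 1 we have that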

$$\sum_{t \in \mathcal{T}_c} r_t(K_c) \le \left( \frac{D}{\eta} + \frac{\eta L^2}{2\sigma} \right) \sqrt{T_c}$$

Therefore, recalling that $\sum_{c=1}^{\overline{\chi}_G} T_c = T$ and using Jensen’s inequality,

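$$\sum_{t=1}^{T} r_t(v_t) = \sum_{c=1}^{\overline{\chi}_G} \sum_{t \in \mathcal{T}_c} r_t(K_c) \le \sum_{c=1}^{\overline{\chi}_G} \left( \frac{D}{\eta} + \frac{\eta L^2}{2\sigma} \right) \sqrt{T_c} \le \left( \frac{D}{\eta} + \frac{\eta L^2}{2\sigma} \right) \sqrt{\overline{\chi}_G T}$$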

concluding the proof. ∎
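The non-oblivious interface used in this proof amounts to a simple filter on incoming feedback. The sketch below is illustrative and uses hypothetical helper names: it checks that a proposed partition is a clique cover, implements the rule "update only if the active agent is in my clique", and counts the rounds assigned to each clique so that the Jensen step $\sum_c \sqrt{T_c} \le \sqrt{\overline{\chi}_G T}$ can be checked on a concrete schedule.

```python
import math
from typing import Dict, List, Sequence, Set


def is_clique_cover(adj: Dict[int, Set[int]], cover: Sequence[Set[int]]) -> bool:
    """True if the sets partition the vertices and each set is a clique."""
    listed = [v for clique in cover for v in clique]
    if sorted(listed) != sorted(adj):
        return False
    return all(b in adj[a] for clique in cover for a in clique for b in clique if a != b)


def clique_of(cover: Sequence[Set[int]]) -> Dict[int, int]:
    """Index of the unique clique of the cover containing each agent."""
    return {v: c for c, clique in enumerate(cover) for v in clique}


def should_update(clique_index: Dict[int, int], v: int, active: int) -> bool:
    """Agent v uses the feedback only if the active agent shares its clique."""
    return clique_index[v] == clique_index[active]


def rounds_per_clique(clique_index: Dict[int, int], schedule: Sequence[int], n_cliques: int) -> List[int]:
    """Number of rounds in which the active agent falls in each clique."""
    counts = [0] * n_cliques
    for active in schedule:
        counts[clique_index[active]] += 1
    return counts
```

On a path with four agents and the cover `[{0, 1}, {2, 3}]`, agent `1` ignores feedback from its neighbor `2`, because the two belong to different cliques of the cover.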

Theorems 5 and 6 show that with adversarial activations the knowledge of the graph is crucial for learning (i.e., for achieving sublinear regret). Since $\alpha_G \le \overline{\chi}_G$, it is not clear whether the better rate $\sqrt{\alpha_G T}$ can be proven in the adversarial activation setting when agents do not use the oblivious network interface.

## 8 Conclusions

We introduced a cooperative learning setting in which agents, sitting on the nodes of a communication network, run instances of an online learning algorithm with the common goal of minimizing their regret. In order to investigate how the knowledge of the graph topology affects regret in cooperative online learning under different activation mechanisms, we introduced the notion of oblivious network interface. This prevents agents from doing any network-specific tuning or even accessing their neighborhood structure. When activations are stochastic, we showed that sharing losses among neighbors is enough to guarantee optimal regret rates even with the oblivious network interface. Surprisingly, when activations are adversarial the situation changes completely. There exist problem instances in which any algorithm that runs with the oblivious network interface suffers linear regret. In this case knowing graph structure is not only necessary to perform optimally, but even to have sublinear regret.

Other interesting variants of this setting could be studied in the future. For example, at the beginning of each round, active agents could be allowed to ask for the predictions of some of their neighbors and base their own predictions upon them. In this case, we conjecture that the optimal regret rate would scale with the domination number of the graph, which is always smaller than or equal to the independence number.

## Acknowledgments

Nicolò Cesa-Bianchi and Tommaso Cesari gratefully acknowledge partial support by the COST Action CA16228 “European Network for Game Theory” (GAMENET) and by the Google Focused Award “Algorithms and Learning for AI” (ALL4AI).

## References

• Alon et al. [2015] Noga Alon, Nicolo Cesa-Bianchi, Ofer Dekel, and Tomer Koren. Online learning with feedback graphs: Beyond bandits. In Annual Conference on Learning Theory, volume 40. Microtome Publishing, 2015.
• Awerbuch and Kleinberg [2008] Baruch Awerbuch and Robert Kleinberg. Competitive collaborative learning. Journal of Computer and System Sciences, 74(8):1271–1288, 2008.
• Cesa-Bianchi et al. [2016] Nicolo Cesa-Bianchi, Claudio Gentile, Yishay Mansour, and Alberto Minora. Delay and cooperation in nonstochastic bandits. JMLR Workshop and Conference Proceedings (COLT 2016), 49:605–622, 2016.
• Duchi et al. [2012] John C Duchi, Alekh Agarwal, and Martin J Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.
• Freund et al. [1997] Yoav Freund, Robert E Schapire, Yoram Singer, and Manfred K Warmuth. Using and combining predictors that specialize. In Proceedings of the Twenty-Ninth Annual ACM Symposium on the Theory of Computing. Citeseer, 1997.
• Griggs [1983] Jerrold R Griggs. Lower bounds on the independence number in terms of the degrees. Journal of Combinatorial Theory, Series B, 34(1):22–39, 1983.
• Hazan [2016] Elad Hazan. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
• Hosseini et al. [2013] Saghar Hosseini, Airlie Chapman, and Mehran Mesbahi. Online distributed optimization via dual averaging. In 52nd Annual IEEE Conference on Decision and Control (CDC), pages 1484–1489. IEEE, 2013.
• Mannor and Shamir [2011] Shie Mannor and Ohad Shamir. From bandits to experts: On the value of side-observations. In Advances in Neural Information Processing Systems, pages 684–692, 2011.
• Martínez-Rubio et al. [2018] David Martínez-Rubio, Varun Kanade, and Patrick Rebeschini. Decentralized cooperative stochastic multi-armed bandits. arXiv preprint arXiv:1810.04468, 2018.
• McQuade and Monteleoni [2012] Scott McQuade and Claire Monteleoni. Global climate model tracking using geospatial neighborhoods. In Proc. Twenty-Sixth AAAI Conference on Artificial Intelligence, Special Track on Computational Sustainability and AI, pages 335–341, 2012.
• McQuade and Monteleoni [2017] Scott McQuade and Claire Monteleoni. Spatiotemporal global climate model tracking. Large-Scale Machine Learning in the Earth Sciences; Data Mining and Knowledge Discovery Series. Srivastava, A., Nemani R., Steinhaeuser, K. (Eds.), CRC Press, Taylor & Francis Group, 2017.
• Orabona et al. [2015] Francesco Orabona, Koby Crammer, and Nicolo Cesa-Bianchi. A generalized online mirror descent with applications to classification and regression. Machine Learning, 99(3):411–435, 2015.
• Sahu and Kar [2017] Anit Kumar Sahu and Soummya Kar. Dist-Hedge: A partial information setting based distributed non-stochastic sequence prediction algorithm. In IEEE Global Conference on Signal and Information Processing (GlobalSIP), pages 528–532. IEEE, 2017.
• Scaman et al. [2018] Kevin Scaman, Francis Bach, Sébastien Bubeck, Laurent Massoulié, and Yin Tat Lee. Optimal algorithms for non-smooth distributed optimization in networks. In Advances in Neural Information Processing Systems, pages 2745–2754, 2018.
• Shahrampour and Jadbabaie [2018] Shahin Shahrampour and Ali Jadbabaie. Distributed online optimization in dynamic environments using mirror descent. IEEE Transactions on Automatic Control, 63(3):714–725, 2018.
• Shalev-Shwartz [2012] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends® in Machine Learning, 4(2):107–194, 2012.
• Zhao et al. [2019] Yawei Zhao, Chen Yu, Peilin Zhao, and Ji Liu. Decentralized online learning: Take benefits from others’ data without sharing your own to track global trend. arXiv preprint arXiv:1901.10593, 2019.

## Appendix A Stochastic Activations: Multiple Agents

In this section we present all missing results related to the stochastic activation model with multiple activations per time step. Recall that, at the beginning of the process, the environment draws an i.i.d. sequence of Bernoulli random variables with some unknown fixed parameter for each agent . The active set at time is then defined as . Note that, unlike when only one agent is active at each time step, now in general. Before the main result, we give some definitions and prove a technical combinatorial lemma that is leveraged in the analysis.

Denote by $V'$ the set of all agents $v \in V$ such that $q_v > 0$. For each $v \in V'$, let

 $$c_v = \sum_{S \subset \{1,\dots,N\} \setminus \{v\}} \frac{\lambda_{S,v}}{1+|S|} \tag{4}$$

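where the convex coefficients $\lambda_{S,v}$ are defined by

 $$\lambda_{S,v} = \left(\prod_{w \in S} q_w\right)\left(\prod_{u \in \{1,\dots,N\} \setminus (\{v\} \cup S)} (1-q_u)\right)$$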

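Let also $Q_v$ be the probability

 $$\mathbb{P}\left(v \in \bigcup_{w \in S_t} \mathcal{N}_w\right) = 1 - \prod_{w \in \mathcal{N}_v} (1-q_w) > 0 \tag{5}$$

that agent $v$ is updated at time $t$ (note that $Q_v$ is independent of $t$).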
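The update probability in equation (5) is easy to evaluate numerically. The sketch below does so for a three-node path; it assumes that the neighborhood of an agent includes the agent itself (so that an active agent always sees its own loss), and the names `update_probability`, `neighbors`, and `q` are illustrative rather than taken from the paper.

```python
from math import prod


def update_probability(v, neighbors, q):
    """Probability that agent v receives feedback in a round, following (5).

    Assumption: the neighborhood of v is closed, i.e. it contains v itself.
    """
    closed = set(neighbors[v]) | {v}
    # v is NOT updated only if every agent in its closed neighborhood is inactive.
    return 1.0 - prod(1.0 - q[w] for w in closed)


# Path graph 0 - 1 - 2 with hypothetical activation probabilities.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
q = {0: 0.5, 1: 0.2, 2: 0.1}

Q = {v: update_probability(v, neighbors, q) for v in neighbors}
```

In this example the middle agent is updated most often, since it hears from both endpoints, even though its own activation probability is not the largest.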

###### Lemma 3.

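Let $X(1),\dots,X(m)$ be independent Bernoulli random variables with strictly positive parameters $q_1,\dots,q_m$ respectively. Then, for all $v \in \{1,\dots,m\}$,

 $$\mathbb{E}\left[\frac{X(v)}{\sum_{w=1}^m X(w)}\right] = q_v c_v$$

where we define $X(v)/\sum_{w=1}^m X(w) = 0$ when $\sum_{w=1}^m X(w) = 0$.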

###### Proof.

Fix any . Let be the set and let be the -algebra generated by . Then

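 $$\mathbb{E}\left[\frac{X(v)}{\sum_{w=1}^m X(w)}\right] = \mathbb{E}\left[\mathbb{E}\left[\frac{X(v)}{\sum_{w=1}^m X(w)} \,\middle|\, \mathcal{F}_v\right]\right] = q_v\, \mathbb{E}\left[\frac{1}{1+\sum_{w \in S_v} X(w)}\right]$$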

Denote the last expectation by . Since for all , , Fubini’s theorem yields

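 $$
 \begin{aligned}
 c_v &= \int_0^\infty \mathbb{E}\left[e^{-t\left(1+\sum_{w \in S_v} X(w)\right)}\right] dt \\
 &= \int_0^\infty e^{-t} \prod_{w \in S_v} \mathbb{E}\left[e^{-t X(w)}\right] dt \\
 &= \int_0^\infty e^{-t} \prod_{w \in S_v} \left(q_w e^{-t} + 1 - q_w\right) dt \\
 &= \int_0^1 \prod_{w \in S_v} \left(q_w x + 1 - q_w\right) dx \\
 &= \int_0^1 \sum_{S \subset S_v} x^{|S|} \left(\prod_{w \in S} q_w\right)\left(\prod_{u \in S_v \setminus S} (1-q_u)\right) dx
 \end{aligned}
 $$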

Now set and note that . Substituting in the last identity gives

 $$c_v = \sum_{S \subset S_v} \lambda_{S,v} \int_0^1 x^{|S|}\, dx = \sum_{S \subset S_v} \frac{\lambda_{S,v}}{1+|S|}$$
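The identity of Lemma 3 can be checked numerically on a small instance. The sketch below compares an exact enumeration of the expectation (with the convention that the ratio is zero when no variable equals one) against the subset-sum formula (4). It takes each coefficient $\lambda_{S,v}$ to be the probability that, among the variables other than $X(v)$, exactly those indexed by $S$ equal one, which is the product form appearing in the proof. The function names and the parameter vector are illustrative.

```python
from itertools import combinations, product
from math import prod


def c_subset_formula(v, q):
    """c_v computed from the subset-sum formula (4)."""
    others = [w for w in range(len(q)) if w != v]
    total = 0.0
    for k in range(len(others) + 1):
        for S in combinations(others, k):
            # Probability that exactly the variables in S (among the others) are 1.
            lam = prod(q[w] for w in S) * prod(1 - q[u] for u in others if u not in S)
            total += lam / (1 + k)
    return total


def expected_ratio(v, q):
    """E[X(v) / sum_w X(w)] by enumerating all 2^m outcomes, with 0/0 taken as 0."""
    m = len(q)
    total = 0.0
    for x in product((0, 1), repeat=m):
        p = prod(q[w] if x[w] else 1 - q[w] for w in range(m))
        if x[v]:  # outcomes with X(v) = 0 contribute nothing
            total += p / sum(x)
    return total


q = [0.5, 0.2, 0.9]  # hypothetical Bernoulli parameters
lhs = [expected_ratio(v, q) for v in range(len(q))]
rhs = [q[v] * c_subset_formula(v, q) for v in range(len(q))]
```

Both lists agree up to floating-point error. Their entries also sum to the probability that at least one variable equals one, because the ratios sum to one whenever that event occurs.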

We now give an upper bound on the regret that the network incurs if all agents run OMD with an oblivious network interface. Our upper bound is expressed in terms of a constant depending on the probabilities of activating each agent and such that . The result holds for any differentiable function , -strongly convex with respect to some norm .

###### Theorem 7.

Consider a network of agents. Assume that, at each time step each agent is independently activated with probability . If all agents run OMD with an oblivious network interface and using , for , the network regret satisfies

 $$\mathbb{E}[R_T] \le \left(\frac{D}{\eta} + \frac{\eta L^2}{2\sigma}\right)\sqrt{QT}$$

where , , , and is the dual norm of . In particular, choosing gives .

###### Proof.

Fixing an arbitrary , setting , and proceeding as in Theorem 2 yields, for each ,

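 $$\mathbb{E}\left[\sum_{t=1}^{T} r_t(v)\right] \le \left(\frac{D}{\eta} + \frac{\eta L^2}{2\sigma}\right)\sqrt{\frac{T}{Q_v}} \tag{6}$$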

Now we write , where

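 $$
 \begin{aligned}
 \mathbb{E}\left[R_T(x)\right] &= \mathbb{E}\left[\sum_{t=1}^{T} \frac{1}{\sum_{w \in V} X_t(w)} \sum_{v \in V'} r_t(v)\, X_t(v)\right] \\
 &= \sum_{t=1}^{T} \sum_{v \in V'} \mathbb{E}\left[\frac{X_t(v)}{\sum_{w \in V} X_t(w)}\right] \mathbb{E}\left[r_t(v)\right] \\
 &= \sum_{v \in V'} q_v c_v \sum_{t=1}^{T} \mathbb{E}\left[r_t(v)\right]
 \end{aligned}
 \tag{7}
 $$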

and the last identity follows by Lemma 3. Putting identity (7) and inequality (6) together gives