# An Online Optimization Approach for Multi-Agent Tracking of Dynamic Parameters in the Presence of Adversarial Noise

Shahin Shahrampour and Ali Jadbabaie

This work was supported by ONR BRC Program on Decentralized, Online Optimization. Shahin Shahrampour is with the Department of Electrical Engineering at Harvard University, Cambridge, MA 02138 USA (e-mail: shahin@seas.harvard.edu). Ali Jadbabaie is with the Institute for Data, Systems, and Society at Massachusetts Institute of Technology, Cambridge, MA 02139 USA (e-mail: jadbabai@mit.edu).

###### Abstract

This paper addresses tracking of a moving target in a multi-agent network. The target follows linear dynamics corrupted by an adversarial noise, i.e., the noise is not generated from a statistical distribution. The location of the target at each time induces a global time-varying loss function, and the global loss is a sum of local losses, each of which is associated with one agent. Agents' noisy observations can be nonlinear. We formulate this problem as a distributed online optimization where agents communicate with each other to track the minimizer of the global loss. We then propose a decentralized version of the Mirror Descent algorithm and provide a non-asymptotic analysis of the problem. Using the notion of dynamic regret, we measure the performance of our algorithm versus its offline counterpart in the centralized setting. We prove that the bound on dynamic regret scales inversely with the network spectral gap and captures the deviation from the linear dynamics caused by the adversarial noise. Our result subsumes a number of results in the distributed optimization literature. Finally, in a numerical experiment, we verify that our algorithm can be easily implemented for multi-agent tracking with nonlinear observations.

## I Introduction

Distributed estimation, detection, and tracking are ubiquitous in engineering applications ranging from sensor and robotic networks to social networks, and they have received a lot of attention for many years [1, 2, 3, 4, 5]. In these scenarios, the task is to estimate the value of a parameter which may or may not be dynamic. A group of agents aims to accomplish this task as a team. Each individual agent only partially observes the parameter, but the global spread of observations in the network allows agents to estimate the parameter collaboratively. This requires agents to aggregate local information, and many methods use consensus protocols as a critical component [6]. It is well known that when agents' observations are linear with respect to the parameter, the tracking problem is equivalent to minimizing a global quadratic loss, written as a sum of local quadratic losses (see e.g. [7]). However, in general, the global loss can be more complicated, resulting in nonlinear observations.

In real-world applications, the parameter of interest is often time-varying. Therefore, regardless of the structure of the loss, the dynamic nature of the problem brings forward two issues: (i) The local losses are observed in an online or sequential fashion, i.e., the local losses are disclosed to agents only after they form their estimates at each round, and they are not aware of future loss functions. Therefore, the problem must be solved in an online setting. (ii) The online algorithm should mimic the performance of its offline counterpart in which the losses are known a priori. The gap between the two is often called regret. Tracking the minimizer of the global loss over time introduces the notion of dynamic regret [8]. This framework has been studied in centralized online optimization [8, 9, 10, 11, 12], where the hardness of the problem is captured via the variation in the minimizer sequence.

To address these issues in this paper, we adopt an online optimization approach to formulate distributed tracking. We consider tracking of a dynamic parameter or a moving target in a network of agents. The dynamics of the target is linear and known to agents, but the target deviates from this dynamics due to an unstructured or adversarial disturbance or noise. In other words, the noise is not necessarily generated from a statistical distribution, or it can be highly correlated to its past values over time. At each time instance, the target induces a global convex loss whose minimizer coincides with the target location. The global loss is a sum of local losses, where each local loss is associated to a specific agent. Agents exchange noisy local gradients according to a communication protocol to track the moving target.

Our problem setup is reminiscent of a distributed Kalman filter [13]. However, we differentiate the two as follows: (i) We do not assume that the target is driven by a Gaussian noise. Nor do we assume that this noise has a statistical distribution. Instead, we consider an adversarial-noise model with unknown structure. (ii) Agents' observations are not necessarily linear; in fact, the observations are noisy local gradients that are nonlinear when the loss is not quadratic. Furthermore, our focus is on the finite-time analysis rather than asymptotic results.

We propose a decentralized version of the Mirror Descent algorithm, developed by Nemirovski and Yudin [14]. Using the notion of Bregman divergence in lieu of Euclidean distance for projection, Mirror Descent has been shown to be a powerful tool in large-scale optimization. Our algorithm consists of three interleaved updates: (i) each agent follows the noisy local gradient while staying close to previous estimates in the local neighborhood; (ii) agents take into account the dynamics of the moving target; (iii) agents average their estimates in their local neighborhood in a consensus step.

We then use a dynamic notion of regret to measure the difference between our online decentralized algorithm and its offline centralized version. We establish a regret bound that scales inversely with the spectral gap of the network and captures the deviation from the linear dynamics caused by the adversarial noise. We further show that, from an optimization perspective, our result subsumes two important classes of decentralized optimization in the literature: (i) decentralized optimization of time-invariant losses, and (ii) decentralized optimization of time-variant losses for fixed targets. This generalization is achieved by allowing the loss function and the target value to vary simultaneously. We also provide a numerical experiment to show that our algorithm can be easily implemented to work with nonlinear observations in multi-agent tracking.

Related Literature on Decentralized Optimization: In [15], decentralized mirror descent has been developed for time-invariant functions in the case that agents receive the gradients with a delay. Moreover, Rabbat in [16] proposes a decentralized mirror descent for stochastic composite optimization problems and provides guarantees for strongly convex regularizers. Duchi et al. [17] study dual averaging for distributed optimization, and the extension of dual averaging to online distributed optimization is considered in [18]. Mateos-Núnez and Cortés [19] consider online optimization using subgradient descent of local functions, where the graph structure is time-varying. In [20], a decentralized variant of Nesterov's primal-dual algorithm is proposed for online optimization. In [21], distributed online optimization is studied for strongly convex objective functions over time-varying networks. Our setup follows the work of [22] on decentralized online mirror descent, but we extend the results to high probability bounds on the dynamic regret.

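Notation:

• $[n]$: The set $\{1, 2, \ldots, n\}$ for any integer $n$
• $x^\top$: Transpose of the vector $x$
• $x(k)$: The $k$-th element of vector $x$
• $I_n$: Identity matrix of size $n$
• $\Delta_d$: The $d$-dimensional probability simplex
• $\langle \cdot, \cdot \rangle$: Standard inner product operator
• $\|\cdot\|_p$: $p$-norm operator
• $\|\cdot\|_*$: The dual norm of $\|\cdot\|$
• $\sigma_i(W)$: The $i$-th largest eigenvalue of $W$ in magnitude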

## II Problem Formulation and Algorithm

### II-A Dynamical Model and Optimization Perspective

Consider a $d$-dimensional moving target following the linear dynamics for a finite time as

$$x^\star_{t+1} = A x^\star_t + v_t, \qquad t \in [T], \tag{1}$$

where $A \in \mathbb{R}^{d \times d}$ is known, and $v_t \in \mathbb{R}^d$ is an adversarial noise, i.e., the sequence $\{v_t\}_{t=1}^T$ is neither generated according to a statistical distribution, nor is it independent over time. Our goal is to track $x^\star_t$, and regardless of the observation model, a distribution-dependent mechanism, such as a Kalman or particle filter, cannot solve the problem since the noise does not follow a statistical distribution.

In the centralized version of the tracking problem above, the observations of $x^\star_t$ are realized through a time-varying, global loss function. That is, consider the tracking problem above as an optimization, where $x^\star_t$ is the minimizer of the global loss at time $t$. Let $\mathcal{X}$ be a convex, compact set, and represent the global loss by $f_t : \mathcal{X} \to \mathbb{R}$ at time $t$. As the global loss varies over time, the goal is to track the minimizer of $f_t$, which is $x^\star_t$. The offline and centralized version of our problem can be viewed as follows

$$\begin{aligned} \underset{x_1, \ldots, x_T}{\text{minimize}} \quad & \sum_{t=1}^{T} f_t(x_t) \\ \text{subject to} \quad & x_t \in \mathcal{X}, \quad t \in [T]. \end{aligned} \tag{2}$$

We are interested in solving the problem above in an online and decentralized fashion. In particular, the global function at time $t$ is a sum of local functions as

$$f_t(x) := \frac{1}{n} \sum_{i=1}^{n} f_{i,t}(x), \tag{3}$$

where $f_{i,t} : \mathcal{X} \to \mathbb{R}$ is a local convex function on $\mathcal{X}$ for all $i \in [n]$. We consider a network of $n$ agents facing two challenges when solving problem (2): (i) agent $j \in [n]$ receives information only about $f_{j,t}$ and does not observe the global loss function $f_t$, which is common to decentralized schemes; (ii) the functions are revealed to agents sequentially along the time horizon, i.e., at any time instance $s$, agent $j$ has observed $f_{j,t}$ for $t < s$, whereas the agent does not know $f_{j,t}$ for $s \le t \le T$, which is common to online settings.

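The agents interact with one another, and their relationship is captured via an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} = [n]$ denotes the set of nodes, and $\mathcal{E}$ is the set of edges. Each agent $i$ assigns a positive weight $[W]_{ij}$ for the information received from agent $j \neq i$, and the set of neighbors of agent $i$ is defined as $\mathcal{N}_i := \{ j : [W]_{ij} > 0 \}$.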

While the problem framework is reminiscent of a distributed Kalman filter [13], there are fundamental distinctions in our setup: (i) The adversarial noise $v_t$ is neither Gaussian nor of known statistical distribution. It can be thought of as a noise with unknown structure, which represents the deviation from the dynamics¹. (ii) Agents' observations are not necessarily linear; in fact, the observations are local gradients of $f_{i,t}$ and are nonlinear when the objective is not quadratic. The other implicit distinction in this work is our focus on finite-time analysis rather than asymptotic results.

¹ In online optimization, the focus is not on the distribution of data. Instead, data is thought to be generated arbitrarily, and its effect is observed through the loss functions [23].

From an optimization perspective, our framework subsumes two important classes of decentralized optimization in the literature:

• Existing methods often consider time-invariant objectives (see e.g. [24, 17, 15]). This is simply the special case where $f_t(x) = f(x)$ and $x_t = x$ in (2).

• Online algorithms deal with time-varying functions, but often the network's objective is to minimize the temporal average of $\{f_t(x)\}_{t=1}^T$ over a fixed variable $x$ (see e.g. [18, 19]). This can be captured by our setup when $x_t = x$ in (2).

However, in the tracking problem, functions and comparator variables evolve simultaneously, i.e., the variables $\{x_t\}_{t=1}^T$ are not constrained to be fixed in (2). Recall that $x^\star_t \in \mathcal{X}$ is the minimizer of the global loss function $f_t(x)$ at time $t$. Then, the solution to problem (2) is simply $x_t = x^\star_t$. Denote by $x_{i,t}$ the estimate of agent $i$ for $x^\star_t$ at time $t$. To exhibit the online nature of problem (2), we reformulate it using the notion of dynamic regret as follows

$$\mathbf{Reg}^d_T = \frac{1}{n} \sum_{i=1}^{n} \sum_{t=1}^{T} f_t(x_{i,t}) - \sum_{t=1}^{T} f_t(x^\star_t). \tag{4}$$

Then, the objective is to minimize the dynamic regret above, which measures the gap between the online algorithm and its offline version. Our performance bound shall exhibit the impact of system noise, i.e., we want to prove a regret bound in terms of

$$\|v_t\| = \left\| x^\star_{t+1} - A x^\star_t \right\|, \tag{5}$$

which represents the deviation of the moving target with respect to the dynamics $A$. Note that generalizing the results to linear time-variant dynamics is straightforward, i.e., when $A$ is replaced by $A_t$ in (1).
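As a concrete reading of (4), the sketch below evaluates the dynamic regret for given arrays of estimates and targets. The function name `dynamic_regret`, the array layout, and the quadratic toy loss are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def dynamic_regret(global_loss, estimates, targets):
    """Evaluate the dynamic regret in (4).

    global_loss(t, x) returns f_t(x); estimates has shape (T, n, d) and holds
    x_{i,t}; targets has shape (T, d) and holds the minimizers x*_t.
    """
    T, n, _ = estimates.shape
    regret = 0.0
    for t in range(T):
        # Network-average loss of the agents' estimates at round t.
        online = np.mean([global_loss(t, estimates[t, i]) for i in range(n)])
        # Loss of the offline comparator x*_t.
        offline = global_loss(t, targets[t])
        regret += online - offline
    return regret

# Toy check (assumed loss): f_t(x) = 0.5 * ||x - x*_t||^2, minimized at x*_t.
targets = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
loss = lambda t, x: 0.5 * np.sum((x - targets[t]) ** 2)
perfect = np.repeat(targets[:, None, :], 4, axis=1)  # every agent is exact
shifted = perfect + 1.0                              # every agent is off by (1, 1)
```

With exact estimates the regret is zero, and any systematic offset accumulates over the horizon, which is the quantity the analysis bounds.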

### II-B Technical Assumptions

To solve the multi-agent online optimization (4), we propose to decentralize the Mirror Descent algorithm [14]. Mirror Descent has been shown to be a powerful method in large-scale optimization by using Bregman divergence in lieu of Euclidean distance in the projection step. Before defining Bregman divergence and elaborating on the algorithm, we start by stating a couple of standard assumptions on loss functions and agents' communication.

###### Assumption 1

For any $i \in [n]$, the function $f_{i,t}(x)$ is Lipschitz continuous on $\mathcal{X}$ with a uniform constant $L$. That is,

$$|f_{i,t}(x) - f_{i,t}(y)| \le L \|x - y\|,$$

for any $x, y \in \mathcal{X}$.

###### Assumption 2

The network is connected², i.e., there exists a path from any agent $i \in [n]$ to any agent $j \in [n]$. Also, the matrix $W$ is symmetric and doubly stochastic with positive diagonal. That is,

$$\sum_{i=1}^{n} [W]_{ij} = \sum_{j=1}^{n} [W]_{ij} = 1.$$

² The setup is generalizable to the case where network connectivity changes over time and the communication matrix is time-varying.

The connectivity constraint in Assumption 2 guarantees the information flow in the network.
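One way to obtain a matrix satisfying Assumption 2 is sketched below. The Metropolis weighting rule, the function names, and the path-graph example are assumptions made for illustration; the paper does not prescribe how $W$ is constructed.

```python
import numpy as np

def metropolis_weights(adjacency):
    """Build a symmetric, doubly stochastic W from an undirected 0/1 adjacency matrix."""
    adjacency = np.asarray(adjacency)
    n = adjacency.shape[0]
    degree = adjacency.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adjacency[i, j]:
                W[i, j] = 1.0 / (1.0 + max(degree[i], degree[j]))
        # The leftover mass goes on the diagonal, which stays positive.
        W[i, i] = 1.0 - W[i].sum()
    return W

def second_largest_eigenvalue_magnitude(W):
    """Return sigma_2(W): the second largest eigenvalue of W in magnitude."""
    magnitudes = np.sort(np.abs(np.linalg.eigvalsh(W)))[::-1]
    return magnitudes[1]

# Path graph on four agents: 0 - 1 - 2 - 3.
A_path = np.array([[0, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [0, 0, 1, 0]])
W = metropolis_weights(A_path)
sigma_2 = second_largest_eigenvalue_magnitude(W)
```

For a connected graph, `sigma_2` is strictly below one, and `1 - sigma_2` is the spectral gap that appears in the regret bound.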

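We now outline the notion of Bregman divergence, which is critical in the development of Mirror Descent. Consider a compact, convex set $\mathcal{X}$, and let $\mathcal{R} : \mathcal{X} \to \mathbb{R}$ denote a 1-strongly convex function on $\mathcal{X}$ with respect to a norm $\|\cdot\|$. That is,

$$\mathcal{R}(x) \ge \mathcal{R}(y) + \langle \nabla \mathcal{R}(y), x - y \rangle + \frac{1}{2} \|x - y\|^2,$$

for any $x, y \in \mathcal{X}$. Then, the Bregman divergence $D_{\mathcal{R}}(\cdot, \cdot)$ with respect to the function $\mathcal{R}$ is defined as follows:

$$D_{\mathcal{R}}(x, y) := \mathcal{R}(x) - \mathcal{R}(y) - \langle x - y, \nabla \mathcal{R}(y) \rangle.$$

The definition of the Bregman divergence and the strong convexity of $\mathcal{R}$ imply that

$$D_{\mathcal{R}}(x, y) \ge \frac{1}{2} \|x - y\|^2, \tag{6}$$

for any $x, y \in \mathcal{X}$. Two famous examples of Bregman divergence are the Euclidean distance and the Kullback-Leibler (KL) divergence generated from $\mathcal{R}(x) = \frac{1}{2}\|x\|_2^2$ and $\mathcal{R}(x) = \sum_{i=1}^{d} x(i) \log x(i) - x(i)$, respectively.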
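The definition can be checked numerically. The sketch below evaluates the Bregman divergence for the squared Euclidean norm $\frac{1}{2}\|x\|_2^2$ and for the entropic function $\sum_i x(i)\log x(i) - x(i)$, the standard generators of the two examples; the helper name `bregman` and the test vectors are illustrative assumptions.

```python
import numpy as np

def bregman(R, grad_R, x, y):
    """D_R(x, y) = R(x) - R(y) - <x - y, grad R(y)>."""
    return R(x) - R(y) - np.dot(x - y, grad_R(y))

# Euclidean generator: R(x) = 0.5 * ||x||_2^2.
sq_norm = lambda x: 0.5 * np.dot(x, x)
sq_norm_grad = lambda x: x

# Entropic generator: R(x) = sum_i x(i) log x(i) - x(i), for positive x.
neg_entropy = lambda x: np.sum(x * np.log(x) - x)
neg_entropy_grad = lambda x: np.log(x)

# Two points on the probability simplex (illustrative values).
x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])

d_euclid = bregman(sq_norm, sq_norm_grad, x, y)
d_kl = bregman(neg_entropy, neg_entropy_grad, x, y)
kl_direct = np.sum(x * np.log(x / y))
```

On the simplex the entropic divergence coincides with the KL divergence, and inequality (6) then holds with the $\ell_1$ norm (Pinsker's inequality).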

###### Assumption 3

Let $x$ and $\{y_i\}_{i=1}^n$ be vectors in $\mathbb{R}^d$. We assume that the Bregman divergence satisfies the separate convexity in the following sense

$$D_{\mathcal{R}}\Big(x, \sum_{i=1}^{n} \alpha(i) y_i\Big) \le \sum_{i=1}^{n} \alpha(i) D_{\mathcal{R}}(x, y_i),$$

where $\alpha \in \Delta_n$ is on the $n$-dimensional simplex.

The assumption is satisfied for commonly used cases of Bregman divergence. For instance, the Euclidean distance evidently respects the condition. The KL divergence also satisfies the constraint, and we refer the reader to Theorem 6.4 in [25] for the proof.

###### Assumption 4

The Bregman divergence satisfies a Lipschitz condition of the form

$$|D_{\mathcal{R}}(x, z) - D_{\mathcal{R}}(y, z)| \le K \|x - y\|,$$

for all $x, y, z \in \mathcal{X}$.

When the function $\mathcal{R}$ is Lipschitz on $\mathcal{X}$, the Lipschitz condition on the Bregman divergence is automatically satisfied. Again, for the Euclidean distance the assumption evidently holds. In the particular case of KL divergence, the condition can be achieved by mixing in a uniform distribution to avoid the boundary (see e.g. [11] for more comments on the assumption).

###### Assumption 5

The dynamics $A$ is assumed to be non-expansive. That is, the condition

$$D_{\mathcal{R}}(Ax, Ay) \le D_{\mathcal{R}}(x, y),$$

holds for all $x, y \in \mathcal{X}$, and $\|A\| \le 1$.

The assumption postulates a natural constraint on the dynamics $A$: it does not allow the effect of a poor estimation (at one step) to be amplified as the algorithm moves forward.
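For the Euclidean divergence $\frac{1}{2}\|x-y\|_2^2$, the non-expansiveness condition reduces to a bound on the spectral norm of the dynamics matrix, which is easy to test numerically. The sketch below is a minimal check under that Euclidean assumption; the helper names and the example matrices are hypothetical.

```python
import numpy as np

def euclidean_divergence(x, y):
    """D_R(x, y) for R(x) = 0.5 * ||x||_2^2."""
    return 0.5 * np.sum((x - y) ** 2)

def is_nonexpansive(A, tol=1e-12):
    """Sufficient check for Assumption 5 in the Euclidean case: ||A||_2 <= 1."""
    return np.linalg.norm(A, 2) <= 1.0 + tol

theta = np.pi / 6
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
contraction = 0.9 * np.eye(2)
expansion = 1.1 * np.eye(2)

# Two arbitrary points used to compare divergences before and after the map.
x = np.array([1.0, -2.0])
y = np.array([0.5, 3.0])
```

A rotation preserves the divergence, a contraction shrinks it, and a matrix with spectral norm above one can amplify an estimation error, which is what the assumption rules out.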

### II-C Decentralized Tracking via Online Mirror Descent

We now propose our algorithm to solve the problem formulated in terms of dynamic regret in (4). In our setting, agents' observations are gradients of the local losses. However, as is common in distributed state estimation and tracking, these observations are noisy. Hence, denoting the local gradient of agent $i$ at time $t$ by $\nabla f_{i,t}(x_{i,t})$, the agent only receives $\nabla_{i,t}$ representing the stochastic gradient. The stochastic oracle that provides noisy gradients satisfies the following constraints³

$$\mathbb{E}\big[ \nabla_{i,t} \,\big|\, \mathcal{F}_{t-1} \big] = \nabla f_{i,t}(x_{i,t}), \qquad \| \nabla_{i,t} \|_* \le L, \tag{7}$$

where $\mathcal{F}_t$ is the $\sigma$-field containing all information prior to the outset of round $t+1$. A commonly used model to generate stochastic gradients satisfying (7) is an additive, bounded, zero-mean noise. Agents then track the moving target using a decentralized variant of Mirror Descent as follows⁴

$$\hat{x}_{i,t+1} = \underset{x \in \mathcal{X}}{\operatorname{argmin}} \big\{ \eta_t \langle x, \nabla_{i,t} \rangle + D_{\mathcal{R}}(x, y_{i,t}) \big\}, \tag{8a}$$

$$x_{i,t} = A \hat{x}_{i,t}, \quad \text{and} \quad y_{i,t} = \sum_{j=1}^{n} [W]_{ij}\, x_{j,t}, \tag{8b}$$

where $\{\eta_t\}_{t=1}^T$ is the step-size sequence, and $A \in \mathbb{R}^{d \times d}$ is the given dynamics in (1), which is common knowledge. In these updates, $x_{i,t} \in \mathbb{R}^d$ represents the estimate of agent $i$ of the moving target $x^\star_t$ at time $t$. The step-size sequence should be tuned for different cases, but it is generally non-increasing and positive.

³ For simplicity, we use one constant $L$ to bound gradients as well as the stochastic gradients.

⁴ We initialize the estimates to the vector of all zeros. In general, any initialization could work for the algorithm.

The update (8a) allows an agent to follow the noisy local gradient while keeping the estimate close to those of the local neighborhood. This closeness occurs by minimizing the Bregman divergence. On the other hand, the first update in (8b) takes into account the dynamics of the moving target, and the second update in (8b) is the consensus term averaging the estimates in the local neighborhood.
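The three updates can be sketched in a few lines for the Euclidean divergence, where the minimization in (8a) becomes a projected gradient step. The sketch assumes the constraint set is a Euclidean ball of a given radius; the function names `project_ball` and `dmd_round`, the array layout, and the toy network are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def project_ball(x, radius):
    """Euclidean projection onto {x : ||x||_2 <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def dmd_round(y, grads, A, W, eta, radius):
    """One round of (8a)-(8b) with D_R(x, y) = 0.5 * ||x - y||_2^2.

    y:     (n, d) array, row i is y_{i,t}
    grads: (n, d) array, row i is the noisy gradient received by agent i
    Returns the next estimates x_{i,t+1} and the next averages y_{i,t+1}.
    """
    # (8a): for the Euclidean divergence the argmin is a projected gradient step.
    x_hat = np.array([project_ball(y_i - eta * g_i, radius)
                      for y_i, g_i in zip(y, grads)])
    # (8b), first part: propagate through the known dynamics A.
    x_next = x_hat @ A.T
    # (8b), second part: consensus averaging with the weights W.
    y_next = W @ x_next
    return x_next, y_next

n, d = 3, 2
W = np.full((n, n), 1.0 / n)   # complete graph with uniform weights
A = np.eye(d)                  # static target, for illustration only
y = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
x_next, y_next = dmd_round(y, np.zeros((n, d)), A, W, eta=0.1, radius=10.0)
```

With zero gradients and identity dynamics, the round reduces to pure consensus: every row of `y_next` equals the network average of `y`.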

## III Theoretical Results

In this section, we state our theoretical result on the non-asymptotic performance of the decentralized online mirror descent for tracking dynamic parameters. Theorem 1 proves a bound on the dynamic regret, which captures the deviation of the moving target from the dynamics (tracking error), the decentralization cost (network error), and the impact of stochastic gradients (stochastic error). We show that this theorem recovers previous rates on decentralized optimization once the tracking error is removed. Also, it recovers previous rates on centralized online optimization in dynamic setting when the network error is eliminated. The proof is given in Appendix (Section A).

###### Theorem 1

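Consider a moving target with the dynamical model of (1). Further consider the distributed, online tracking problem formulated in (4), where $x_{i,t}$ denotes the local estimate of agent $i$ of the moving target at time $t$. Let the local estimates $\{x_{i,t}\}_{t=1}^T$ be generated by updates (8a)-(8b), where the stochastic gradients satisfy the condition (7). Given Assumptions 1-5, the dynamic regret can be bounded as

$$\mathbf{Reg}^d_T \le E_{\text{Track}} + E_{\text{Net}} + E_{\text{Stoch}},$$

with probability at least $1 - \delta$, where

$$\begin{aligned}
E_{\text{Track}} &:= \frac{2R^2}{\eta_{T+1}} + \sum_{t=1}^{T} \frac{K}{\eta_{t+1}} \left\| x^\star_{t+1} - A x^\star_t \right\| + L^2 \sum_{t=1}^{T} \frac{\eta_t}{2}, \\
E_{\text{Net}} &:= 4 L^2 \sqrt{n} \sum_{t=1}^{T} \sum_{\tau=0}^{t-1} \eta_\tau\, \sigma_2^{t-\tau-1}(W), \\
E_{\text{Stoch}} &:= 8 L R \sqrt{-T \log \delta},
\end{aligned}$$

and $R^2 := \sup_{x, y \in \mathcal{X}} D_{\mathcal{R}}(x, y)$.

In view of (1), the dynamical model of the target is described with the noise $v_t = x^\star_{t+1} - A x^\star_t$. The term $E_{\text{Track}}$ shows the dependence of the performance bound on the noise by aggregating the errors over time. Also, $E_{\text{Net}}$ and $E_{\text{Stoch}}$ are the errors related to the network and the stochastic gradients, respectively.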

In Section II, we discussed that our setup generalizes some of the previous results. It is now important to see that this generalization is valid in the sense that our result can recover those special cases:

• When the global loss is time-invariant, the target $x^\star_t$ is fixed, i.e., the dynamics $A = I_d$ and $v_t = 0$ in (1). In this case in Theorem 1, the term involving $K$ in $E_{\text{Track}}$ is equal to zero, and we can use a step-size sequence $\eta_t$ of order $1/\sqrt{t}$ to recover the result of comparable algorithms, such as Theorem 4 in [17] on distributed dual averaging.

• The same argument holds when the global loss is time-variant, but the target is fixed. This setup is studied, for instance, in [18] via distributed online dual averaging with exact gradients. Disregarding $E_{\text{Stoch}}$ in our bound due to stochastic gradients, since $v_t = 0$ again, we recover Corollary 3 in [18].

• When the graph is complete, $\sigma_2(W) = 0$ and hence the network error $E_{\text{Net}}$ vanishes. We then recover the results of [9] on centralized online learning (for linear dynamics) with exact gradients once we remove $E_{\text{Stoch}}$ due to stochastic gradients.
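
To see where the inverse dependence on the spectral gap comes from, consider a constant step size $\eta_t = \eta$ (an illustrative assumption, not a choice made in the theorem) and assume $\sigma_2(W) < 1$. The double sum in the network error is then a sum of geometric series:

$$E_{\text{Net}} = 4 L^2 \sqrt{n}\, \eta \sum_{t=1}^{T} \sum_{\tau=0}^{t-1} \sigma_2^{t-\tau-1}(W) = 4 L^2 \sqrt{n}\, \eta \sum_{t=1}^{T} \frac{1 - \sigma_2^{t}(W)}{1 - \sigma_2(W)} \le \frac{4 L^2 \sqrt{n}\, \eta\, T}{1 - \sigma_2(W)}.$$

A poorly connected network, with $\sigma_2(W)$ close to one, therefore inflates the network error through the factor $1/(1 - \sigma_2(W))$.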

## IV Numerical Experiment: Tracking Maneuvering Targets

In the Mirror Descent algorithm, one has freedom over the selection of the Bregman divergence. A particularly well-known type of Bregman divergence is the Euclidean distance, commonly used in state estimation and tracking of dynamic parameters. We focus on this scenario in this section to provide the numerical experiments for our method.

We consider a slowly maneuvering target in the plane and assume that each position component of the target evolves independently according to a near constant velocity model [26]. The state of the target at each time consists of four components: horizontal position, vertical position, horizontal velocity, and vertical velocity. We represent the state at time $t$ by $x^\star_t \in \mathbb{R}^4$, and therefore, the state space model takes the form

$$x^\star_{t+1} = A x^\star_t + v_t,$$

where $v_t \in \mathbb{R}^4$ is the system noise, and using $\otimes$ for the Kronecker product, $A$ can be written as

$$A = I_2 \otimes \begin{bmatrix} 1 & \epsilon \\ 0 & 1 \end{bmatrix},$$

with $\epsilon$ being the sampling interval⁵. The goal is to cooperatively track $x^\star_t$ in a network of agents. This problem has been studied in the context of distributed Kalman filtering [13, 27], state estimation [28, 29, 30], and particle filtering [31, 32, 33]. However, in contrast to Kalman filtering, we do not assume that the system noise $v_t$ is Gaussian. Also, as opposed to particle filtering, we do not receive a large number of samples (particles) per iteration since our setup is online, i.e., agents only observe one sample per time. Furthermore, we do not assume a statistical distribution on $v_t$ in our analysis, which differentiates our framework from state estimation. We adopt a model-free approach where the noise can be adversarial (deterministic), stochastic with dependence over time, or of some complex structure. We generate the noise as follows. At each time $t$ we draw a sample $\nu_t$ from a zero-mean Gaussian distribution with covariance matrix

$$\Sigma = \sigma_\nu^2\, I_2 \otimes \begin{bmatrix} \epsilon^3/3 & \epsilon^2/2 \\ \epsilon^2/2 & \epsilon \end{bmatrix},$$

where $\epsilon$ is the sampling interval in seconds, which amounts to a sampling frequency of $1/\epsilon$ Hz. The system noise $v_t$ is then constructed from the sample $\nu_t$. Though $\nu_t$ is generated from a Gaussian distribution, the mismatch noise $v_t$ is non-Gaussian and can have a complicated distribution. The constant $\sigma_\nu$ takes different values in each experiment, and we describe this choice later.

⁵ The sampling interval of $\epsilon$ (seconds) is equivalent to the sampling rate of $1/\epsilon$ (Hz).
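The Kronecker structure of the dynamics and covariance matrices translates directly into code. In the sketch below, the sampling interval `eps`, the noise scale `sigma_nu`, and the random seed are assumed values chosen only for illustration; they are not the settings used in the experiment.

```python
import numpy as np

eps = 0.1        # sampling interval in seconds (assumed value)
sigma_nu = 0.5   # noise scale (assumed value)

# Near-constant-velocity dynamics: A = I_2 (x) [[1, eps], [0, 1]].
A = np.kron(np.eye(2), np.array([[1.0, eps],
                                 [0.0, 1.0]]))

# Covariance: Sigma = sigma_nu^2 * I_2 (x) [[eps^3/3, eps^2/2], [eps^2/2, eps]].
Sigma = sigma_nu**2 * np.kron(np.eye(2), np.array([[eps**3 / 3, eps**2 / 2],
                                                   [eps**2 / 2, eps]]))

rng = np.random.default_rng(0)
nu = rng.multivariate_normal(np.zeros(4), Sigma)  # one zero-mean Gaussian draw
```

Both matrices are block diagonal with two identical 2-by-2 blocks, one per spatial direction, so each direction evolves independently.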

We consider a sensor network of $n$ agents located on a grid. Agents aim to track the moving target $x^\star_t$ collaboratively. Agents observe a noisy version of the target through a local loss function, and these observations are nonlinear. In particular, let the quantity $z_{i,t}$ be a noisy version of one coordinate of $x^\star_t$ as follows

$$z_{i,t} = e_{k_i}^\top x^\star_t + w_{i,t},$$

where $w_{i,t}$ denotes a random noise, and $e_{k_i}$ is the $k_i$-th unit vector in the standard basis of $\mathbb{R}^4$ for $k_i \in \{1, 2, 3, 4\}$. We partition the agents into four groups, and for each group we select one specific $k_i$ from the set $\{1, 2, 3, 4\}$. The random noise $w_{i,t}$ satisfies the standard assumption of being zero-mean and finite-variance. Again, to show that our results are not dependent on Gaussian noise, we generate $w_{i,t}$ independently from a uniform distribution.

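Then, at time $t$ the local loss for agent $i$ takes the form

$$f_{i,t}(x) := \frac{1}{4}\, \mathbb{E}\Big[ \big( z_{i,t} - e_{k_i}^\top x \big)^4 \,\Big|\, \mathcal{F}_{t-1}, x^\star_t \Big],$$

resulting in the global loss

$$f_t(x) := \frac{1}{4n} \sum_{i=1}^{n} \mathbb{E}\Big[ \big( z_{i,t} - e_{k_i}^\top x \big)^4 \,\Big|\, \mathcal{F}_{t-1}, x^\star_t \Big],$$

where $\mathcal{F}_{t-1}$ is the $\sigma$-field containing all information prior to the outset of round $t$. It is straightforward to see that $x^\star_t$ is the minimizer of the global loss. The observation of agent $i$ at time $t$ is the stochastic gradient of the local loss, obtained by differentiating the term inside the expectation with respect to $x$:

$$\nabla f_{i,t}(x) = -\big( z_{i,t} - e_{k_i}^\top x \big)^3 e_{k_i}.$$

We derive an explicit update to form an estimate of $x^\star_t$. We use Euclidean distance as the Bregman divergence in updates (8a)-(8b) to get⁶

$$x_{i,t} = \sum_{j=1}^{n} [W]_{ij} A x_{j,t-1} + \eta_t A e_{k_i} \big( z_{i,t-1} - e_{k_i}^\top x_{i,t-1} \big)^3,$$

where the step size $\eta_t$ is tuned for the experiment. The update is akin to consensus+innovation updates in the literature (see e.g. [2, 7]), though we recall that the observation is nonlinear, and the system noise is arbitrary.

⁶ We assume that the state of the target remains in a convex, compact set, and the updates can keep the estimate in the set without the projection step. This assumption can be satisfied in the finite-time domain.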
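A vectorized sketch of the explicit update follows: each agent averages its neighbors' estimates, adds the cubic residual along its observed coordinate, and the result is propagated through the dynamics. The function name `tracking_step`, the 0-based coordinate indexing, and the two-agent example are assumptions made for illustration.

```python
import numpy as np

def tracking_step(X_prev, z_prev, A, W, k, eta):
    """One step of the consensus+innovation-style tracking update.

    X_prev: (n, d) array, row i is x_{i,t-1}
    z_prev: (n,) array, entry i is the scalar observation z_{i,t-1}
    k:      (n,) integer array, k[i] is the (0-based) coordinate seen by agent i
    """
    n, d = X_prev.shape
    E = np.eye(d)[k]                                # row i is e_{k_i}
    residual = z_prev - np.sum(E * X_prev, axis=1)  # z_{i,t-1} - e_{k_i}^T x_{i,t-1}
    innovation = (residual ** 3)[:, None] * E       # cubic residual times e_{k_i}
    # Consensus on the neighbors' estimates, then both terms go through A.
    return (W @ X_prev + eta * innovation) @ A.T

# Two agents, two coordinates, identity dynamics (toy values).
W = np.array([[0.5, 0.5], [0.5, 0.5]])
A = np.eye(2)
k = np.array([0, 1])
X_new = tracking_step(np.zeros((2, 2)), np.array([1.0, 2.0]), A, W, k, eta=0.1)
```

Because the residual enters cubed, large observation errors produce large corrections, so the step size has to be chosen small enough for the iterates to remain bounded.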

It is proved in [7] that in decentralized tracking, the dynamic regret can be presented in terms of the tracking error when the local losses are quadratic. More specifically, the expected dynamic regret averages the tracking error over space and time (when normalized by $T$). While here we deal with a polynomial loss of power four, the connection between tracking error and dynamic regret still holds true. Therefore, using the result of Theorem 1, we can expect that once the parameter does not deviate too much from the dynamics, i.e., when the deviation $\|v_t\| = \|x^\star_{t+1} - A x^\star_t\|$ is small, the bound on the dynamic regret as well as the collective tracking error is small.

We show this intuitive idea by setting $\sigma_\nu$ to different values. Larger values for $\sigma_\nu$ are expected to cause more deviations from the dynamics and larger dynamic regret (worse performance). In Fig. 1, we plot the normalized dynamic regret for several values of $\sigma_\nu$. Note that for each value of $\sigma_\nu$, we run the experiment only once to investigate the high probability bound in Theorem 1. As expected, the performance improves once $\sigma_\nu$ tends to smaller values.

We next restrict our attention to a single value of $\sigma_\nu$. For one run of this case, we provide a snapshot of the target trajectory (in red) in Fig. 2 and plot the estimator trajectory (in blue) for selected agents. Fig. 2 suggests that agents' estimators closely follow the trajectory of the moving target with high probability.

## V Conclusion

In this paper, we addressed tracking of a moving target in a network of agents. The target follows a linear dynamics which is common knowledge to agents, but it deviates from this dynamics due to an additive noise of an unknown structure. We formulated the problem as an online optimization of a global time-varying loss in a distributed fashion. The global loss at each time is a sum of a finite number of local losses, and each agent in the network holds a private copy of one local loss. Agents are unaware of the future loss functions as the local losses only become available to them sequentially. They exchange noisy local gradients with each other to track the value of the target.

Our proposed algorithm for this setup can be cast as a decentralized version of Mirror Descent. We, however, incorporated two more steps to include agents' interactions and the dynamics of the target. We used a notion of network dynamic regret to measure the performance of our algorithm versus its offline counterpart. We established that the regret bound scales inversely with the spectral gap of the network and captures the deviation of the target with respect to the dynamics. Our results generalize a number of results in online and offline distributed optimization. Also, numerical experiments verified the applicability of our algorithm to multi-agent tracking with nonlinear observations. Future directions include studying the algorithm in the case that several observations are available per round, i.e., when agents can receive multiple noisy gradients per time. The method can be useful in sensor networks where each sensor can have multiple measurements from different sources.

## References

• [1] F. Bullo, J. Cortes, and S. Martinez, Distributed control of robotic networks: a mathematical approach to motion coordination algorithms.   Princeton University Press, 2009.
• [2] S. Kar, J. M. Moura, and K. Ramanan, “Distributed parameter estimation in sensor networks: Nonlinear observation models and imperfect communication,” IEEE Transactions on Information Theory, vol. 58, no. 6, pp. 3575–3605, 2012.
• [3] S. Shahrampour, S. Rakhlin, and A. Jadbabaie, “Online learning of dynamic parameters in social networks,” in Advances in Neural Information Processing Systems, 2013.
• [4] S. Shahrampour, A. Rakhlin, and A. Jadbabaie, “Distributed detection: Finite-time analysis and impact of network topology,” IEEE Transactions on Automatic Control, vol. 61, no. 11, pp. 3256–3268, 2016.
• [5] A. Nedić, A. Olshevsky, and C. A. Uribe, “Network independent rates in distributed learning,” in American Control Conference (ACC), 2016.   IEEE, 2016, pp. 1072–1077.
• [6] A. Jadbabaie, J. Lin, and A. S. Morse, “Coordination of groups of mobile autonomous agents using nearest neighbor rules,” IEEE Transactions on Automatic Control, vol. 48, no. 6, pp. 988–1001, 2003.
• [7] S. Shahrampour, A. Rakhlin, and A. Jadbabaie, “Distributed estimation of dynamic parameters: Regret analysis,” in American Control Conference (ACC), July 2016, pp. 1066–1071.
• [8] M. Zinkevich, “Online convex programming and generalized infinitesimal gradient ascent,” International Conference on Machine Learning (ICML), 2003.
• [9] E. C. Hall and R. M. Willett, “Online convex optimization in dynamic environments,” IEEE Journal of Selected Topics in Signal Processing, vol. 9, no. 4, pp. 647–662, 2015.
• [10] O. Besbes, Y. Gur, and A. Zeevi, “Non-stationary stochastic optimization,” Operations Research, vol. 63, no. 5, pp. 1227–1244, 2015.
• [11] A. Jadbabaie, A. Rakhlin, S. Shahrampour, and K. Sridharan, “Online optimization: Competing with dynamic comparators,” in Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, 2015, pp. 398–406.
• [12] A. Mokhtari, S. Shahrampour, A. Jadbabaie, and A. Ribeiro, “Online optimization in dynamic environments: Improved regret rates for strongly convex problems,” in IEEE Conference on Decision and Control (CDC).   IEEE, 2016, pp. 7195–7201.
• [13] R. Olfati-Saber, “Distributed Kalman filtering for sensor networks,” in IEEE Conference on Decision and Control, 2007, pp. 5492–5498.
• [14] D. Yudin and A. Nemirovskii, “Problem complexity and method efficiency in optimization,” 1983.
• [15] J. Li, G. Chen, Z. Dong, and Z. Wu, “Distributed mirror descent method for multi-agent optimization with delay,” Neurocomputing, vol. 177, pp. 643–650, 2016.
• [16] M. Rabbat, “Multi-agent mirror descent for decentralized stochastic optimization,” in Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), 2015 IEEE 6th International Workshop on.   IEEE, 2015, pp. 517–520.
• [17] J. C. Duchi, A. Agarwal, and M. J. Wainwright, “Dual averaging for distributed optimization: convergence analysis and network scaling,” IEEE Transactions on Automatic Control, vol. 57, no. 3, pp. 592–606, 2012.
• [18] S. Hosseini, A. Chapman, and M. Mesbahi, “Online distributed optimization via dual averaging,” in IEEE Conference on Decision and Control (CDC), 2013, pp. 1484–1489.
• [19] D. Mateos-Núnez and J. Cortés, “Distributed online convex optimization over jointly connected digraphs,” IEEE Transactions on Network Science and Engineering, vol. 1, no. 1, pp. 23–37, 2014.
• [20] A. Nedić, S. Lee, and M. Raginsky, “Decentralized online optimization with global objectives and local communication,” in IEEE American Control Conference (ACC), 2015, pp. 4497–4503.
• [21] M. Akbari, B. Gharesifard, and T. Linder, “Distributed online convex optimization on time-varying directed graphs,” IEEE Transactions on Control of Network Systems, 2015.
• [22] S. Shahrampour and A. Jadbabaie, “Distributed online optimization in dynamic environments using mirror descent,” arXiv preprint arXiv:1609.02845, 2016.
• [23] S. Shalev-Shwartz, “Online learning and online convex optimization,” Foundations and Trends in Machine Learning, vol. 4, no. 2, pp. 107–194, 2011.
• [24] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
• [25] H. H. Bauschke and J. M. Borwein, “Joint and separate convexity of the Bregman distance,” Studies in Computational Mathematics, vol. 8, pp. 23–36, 2001.
• [26] Y. Bar-Shalom, Tracking and data association.   Academic Press Professional, Inc., 1987.
• [27] F. S. Cattivelli and A. H. Sayed, “Diffusion strategies for distributed Kalman filtering and smoothing,” IEEE Transactions on Automatic Control, vol. 55, no. 9, pp. 2069–2084, 2010.
• [28] U. Khan, S. Kar, A. Jadbabaie, J. M. Moura et al., “On connectivity, observability, and stability in distributed estimation,” in IEEE Conference on Decision and Control (CDC), 2010, pp. 6639–6644.
• [29] S. Das and J. M. Moura, “Distributed state estimation in multi-agent networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 4246–4250.
• [30] D. Han, Y. Mo, J. Wu, S. Weerakkody, B. Sinopoli, and L. Shi, “Stochastic event-triggered sensor schedule for remote state estimation,” IEEE Transactions on Automatic Control, vol. 60, no. 10, pp. 2661–2675, 2015.
• [31] D. Gu, “Distributed particle filter for target tracking,” in IEEE International Conference on Robotics and Automation (ICRA), 2007, pp. 3856–3861.
• [32] O. Hlinka, O. Sluciak, F. Hlawatsch, P. M. Djuric, and M. Rupp, “Likelihood consensus and its application to distributed particle filtering,” IEEE Transactions on Signal Processing, vol. 60, no. 8, pp. 4334–4349, 2012.
• [33] J. Li and A. Nehorai, “Distributed particle filtering via optimal fusion of Gaussian mixtures,” in 18th International Conference on Information Fusion (Fusion), 2015, pp. 1182–1189.
• [34] A. Beck and M. Teboulle, “Mirror descent and nonlinear projected subgradient methods for convex optimization,” Operations Research Letters, vol. 31, no. 3, pp. 167–175, 2003.

## Appendix A Appendix

We make use of two technical lemmas (Lemmas 2 and 3) proved in the Appendix of [22]. We state them here and use them in the proof of Theorem 1.

###### Lemma 2

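Let $\mathcal{X}$ be a convex set in a Banach space $\mathcal{B}$, let $\mathcal{R}:\mathcal{B}\to\mathbb{R}$ denote a 1-strongly convex function on $\mathcal{X}$ with respect to a norm $\|\cdot\|$, and let $\mathcal{D}_{\mathcal{R}}(\cdot,\cdot)$ represent the Bregman divergence with respect to $\mathcal{R}$. Furthermore, assume that the local functions are Lipschitz continuous (Assumption 1), the matrix $W$ is doubly stochastic (Assumption 2), and the mapping $A$ is non-expansive (Assumption 5). Then, the local estimates $x_{i,t}$ generated by the updates (8a)-(8b) satisfy

$$
\|x_{i,t+1}-\bar{x}_{t+1}\| \le L\sqrt{n}\sum_{\tau=0}^{t}\eta_\tau\,\sigma_2^{t-\tau}(W),
$$

for any $i\in[n]$, where $\bar{x}_t := \frac{1}{n}\sum_{i=1}^{n}x_{i,t}$.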
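To get a feel for how the right-hand side of Lemma 2 depends on the network, the following minimal sketch evaluates it numerically. It assumes that $\sigma_2(W)$ is the second-largest singular value of $W$; the function names, the 4-cycle mixing matrix, and the step-size schedule are hypothetical choices made for illustration only.

```python
import numpy as np

def sigma2(W):
    """Second-largest singular value of the mixing matrix W (assumed meaning of sigma_2)."""
    s = np.linalg.svd(W, compute_uv=False)  # singular values, sorted in descending order
    return s[1]

def network_error_bound(W, etas, L):
    """Right-hand side of Lemma 2 at time t + 1, where t = len(etas) - 1.

    etas holds the step sizes eta_0, ..., eta_t.
    """
    n = W.shape[0]
    t = len(etas) - 1
    s2 = sigma2(W)
    powers = s2 ** (t - np.arange(t + 1))  # sigma_2^(t - tau) for tau = 0, ..., t
    return L * np.sqrt(n) * float(np.dot(etas, powers))

# Hypothetical example: lazy random walk on a 4-cycle, which is doubly stochastic.
W_ring = np.array([
    [0.50, 0.25, 0.00, 0.25],
    [0.25, 0.50, 0.25, 0.00],
    [0.00, 0.25, 0.50, 0.25],
    [0.25, 0.00, 0.25, 0.50],
])
etas = 0.1 / np.sqrt(np.arange(1, 51))  # assumed schedule eta_tau = 0.1 / sqrt(tau + 1)
bound_ring = network_error_bound(W_ring, etas, L=1.0)
```

A better connected network has a smaller $\sigma_2(W)$, so older step sizes are discounted faster and the bound shrinks toward $L\sqrt{n}\,\eta_t$.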

###### Lemma 3

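Let $\mathcal{X}$ be a convex set in a Banach space $\mathcal{B}$, let $\mathcal{R}:\mathcal{B}\to\mathbb{R}$ denote a 1-strongly convex function on $\mathcal{X}$ with respect to a norm $\|\cdot\|$, and let $\mathcal{D}_{\mathcal{R}}(\cdot,\cdot)$ represent the Bregman divergence with respect to $\mathcal{R}$. Furthermore, assume that the matrix $W$ is doubly stochastic (Assumption 2), the Bregman divergence satisfies the Lipschitz condition and the separate convexity (Assumptions 3-4), and the mapping $A$ is non-expansive (Assumption 5). Then, it holds that

$$
\frac{1}{n}\sum_{i=1}^{n}\sum_{t=1}^{T}\left(\frac{1}{\eta_t}\mathcal{D}_{\mathcal{R}}(x_t^\star,y_{i,t})-\frac{1}{\eta_t}\mathcal{D}_{\mathcal{R}}(x_t^\star,\hat{x}_{i,t+1})\right) \le \frac{2R^2}{\eta_{T+1}}+\sum_{t=1}^{T}\frac{K}{\eta_{t+1}}\left\|x_{t+1}^\star-Ax_t^\star\right\|,
$$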

where $R^2 := \sup_{x,y\in\mathcal{X}}\mathcal{D}_{\mathcal{R}}(x,y)$.
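The last term in Lemma 3 measures how far the minimizer sequence departs from the linear dynamics $A$. As a rough illustration, the sketch below computes this weighted deviation for a given trajectory, assuming the Euclidean norm; the helper name `weighted_deviation`, the rotation matrix, and the constant step size are hypothetical.

```python
import numpy as np

def weighted_deviation(x_star, A, etas, K):
    """Sum over t of (K / eta_{t+1}) * ||x*_{t+1} - A x*_t|| in the Euclidean norm.

    x_star: array of shape (T + 1, d) with rows x*_1, ..., x*_{T+1}.
    etas:   array of shape (T,) with entries eta_2, ..., eta_{T+1}.
    """
    predicted = x_star[:-1] @ A.T       # A x*_t for t = 1, ..., T
    residual = x_star[1:] - predicted   # x*_{t+1} - A x*_t
    norms = np.linalg.norm(residual, axis=1)
    return K * float(np.sum(norms / etas))

# Hypothetical target: a planar rotation, which is non-expansive.
theta = 0.1
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
T = 20
x = np.zeros((T + 1, 2))
x[0] = [1.0, 0.0]
for t in range(T):
    x[t + 1] = A @ x[t]                 # noiseless dynamics
etas = np.full(T, 0.1)

clean = weighted_deviation(x, A, etas, K=1.0)

x_noisy = x.copy()
x_noisy[5] += [0.3, 0.4]                # a single adversarial jump of norm 0.5
jumped = weighted_deviation(x_noisy, A, etas, K=1.0)
```

When the target follows the dynamics exactly the term vanishes, while a single jump is counted twice: once when it occurs and once when the trajectory returns to the nominal path.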

In what follows, we provide the proof of Theorem 1.

### A-a Proof of Theorem 1

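Recall the definition of dynamic regret in (4). Using the Lipschitz continuity of $f_t$ (Assumption 1) as well as the fact that the global loss is the sum of local losses (Eq. (3)), we get

$$
\begin{aligned}
f_t(x_{i,t})-f_t(x_t^\star) &= f_t(x_{i,t})-f_t(\bar{x}_t)+f_t(\bar{x}_t)-f_t(x_t^\star) \\
&\le L\|x_{i,t}-\bar{x}_t\|+\frac{1}{n}\sum_{i=1}^{n}f_{i,t}(\bar{x}_t)-\frac{1}{n}\sum_{i=1}^{n}f_{i,t}(x_t^\star).
\end{aligned}
$$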

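Using the Lipschitz continuity of $f_{i,t}$ for $i\in[n]$, we simplify the above as follows:

$$
\begin{aligned}
f_t(x_{i,t})-f_t(x_t^\star) \le{}& \frac{1}{n}\sum_{i=1}^{n}f_{i,t}(x_{i,t})-\frac{1}{n}\sum_{i=1}^{n}f_{i,t}(x_t^\star) \\
&+L\|x_{i,t}-\bar{x}_t\|+\frac{L}{n}\sum_{i=1}^{n}\|x_{i,t}-\bar{x}_t\|.
\end{aligned}
\tag{9}
$$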

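The second line can be controlled via Lemma 2, so we focus on the first line of the above bound. By convexity of $f_{i,t}$, with $\nabla_{i,t}$ denoting the gradient of $f_{i,t}$ at $x_{i,t}$ and $\widetilde{\nabla}_{i,t}$ its stochastic counterpart, we have

$$
\begin{aligned}
f_{i,t}(x_{i,t})-f_{i,t}(x_t^\star) &\le \langle\nabla_{i,t},x_{i,t}-x_t^\star\rangle \\
&= \langle\widetilde{\nabla}_{i,t},x_{i,t}-x_t^\star\rangle+\langle\nabla_{i,t}-\widetilde{\nabla}_{i,t},x_{i,t}-x_t^\star\rangle \\
&= \langle\widetilde{\nabla}_{i,t},\hat{x}_{i,t+1}-x_t^\star\rangle+\langle\widetilde{\nabla}_{i,t},x_{i,t}-y_{i,t}\rangle \\
&\quad+\langle\widetilde{\nabla}_{i,t},y_{i,t}-\hat{x}_{i,t+1}\rangle+\langle\nabla_{i,t}-\widetilde{\nabla}_{i,t},x_{i,t}-x_t^\star\rangle.
\end{aligned}
\tag{10}
$$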

We now need to bound each of the terms on the right hand side of (10). The stochastic gradients are bounded in view of (7). Therefore, using Hölder’s inequality for any primal-dual norm pair, we get

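$$
\begin{aligned}
\langle\widetilde{\nabla}_{i,t},y_{i,t}-\hat{x}_{i,t+1}\rangle &\le L\left\|y_{i,t}-\hat{x}_{i,t+1}\right\| \\
&\le \frac{1}{2\eta_t}\left\|y_{i,t}-\hat{x}_{i,t+1}\right\|^2+\frac{\eta_t}{2}L^2,
\end{aligned}
\tag{11}
$$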

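where the last line is due to the AM-GM inequality. We now recall update (8b) and use Assumptions 1 and 2 to derive

$$
\begin{aligned}
\langle\widetilde{\nabla}_{i,t},x_{i,t}-y_{i,t}\rangle &= \langle\widetilde{\nabla}_{i,t},x_{i,t}-\bar{x}_t+\bar{x}_t-y_{i,t}\rangle \\
&= \langle\widetilde{\nabla}_{i,t},x_{i,t}-\bar{x}_t\rangle+\sum_{j=1}^{n}[W]_{ij}\langle\widetilde{\nabla}_{i,t},\bar{x}_t-x_{j,t}\rangle \\
&\le 2L^2\sqrt{n}\sum_{\tau=0}^{t-1}\eta_\tau\,\sigma_2^{t-\tau-1}(W),
\end{aligned}
\tag{12}
$$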

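where in the last line we appealed to Lemma 2. Finally, the optimality of $\hat{x}_{i,t+1}$ in (8a) implies (see e.g. Lemma 4.1 in [34]) that

$$
\begin{aligned}
\langle\widetilde{\nabla}_{i,t},\hat{x}_{i,t+1}-x_t^\star\rangle &\le \frac{1}{\eta_t}\mathcal{D}_{\mathcal{R}}(x_t^\star,y_{i,t})-\frac{1}{\eta_t}\mathcal{D}_{\mathcal{R}}(x_t^\star,\hat{x}_{i,t+1})-\frac{1}{\eta_t}\mathcal{D}_{\mathcal{R}}(\hat{x}_{i,t+1},y_{i,t}) \\
&\le \frac{1}{\eta_t}\mathcal{D}_{\mathcal{R}}(x_t^\star,y_{i,t})-\frac{1}{\eta_t}\mathcal{D}_{\mathcal{R}}(x_t^\star,\hat{x}_{i,t+1})-\frac{1}{2\eta_t}\left\|\hat{x}_{i,t+1}-y_{i,t}\right\|^2,
\end{aligned}
\tag{13}
$$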

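since the Bregman divergence satisfies $\mathcal{D}_{\mathcal{R}}(x,y)\ge\frac{1}{2}\|x-y\|^2$ for any $x,y\in\mathcal{X}$ in view of (6). Substituting (11), (12), and (13) into the bound (10), we derive

$$
\begin{aligned}
f_{i,t}(x_{i,t})-f_{i,t}(x_t^\star) \le{}& \frac{\eta_t}{2}L^2+2L^2\sqrt{n}\sum_{\tau=0}^{t-1}\eta_\tau\,\sigma_2^{t-\tau-1}(W) \\
&+\frac{1}{\eta_t}\mathcal{D}_{\mathcal{R}}(x_t^\star,y_{i,t})-\frac{1}{\eta_t}\mathcal{D}_{\mathcal{R}}(x_t^\star,\hat{x}_{i,t+1}) \\
&+\langle\nabla_{i,t}-\widetilde{\nabla}_{i,t},x_{i,t}-x_t^\star\rangle.
\end{aligned}
\tag{14}
$$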
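The origin of the first term in (14) can be made explicit with a short worked step. This is only a reading aid: $\widetilde{\nabla}_{i,t}$ denotes the stochastic gradient, and the shorthand $v := y_{i,t}-\hat{x}_{i,t+1}$ is introduced here for convenience. The AM-GM step in (11) is $ab\le\frac{a^2}{2}+\frac{b^2}{2}$ applied with $a=\|v\|/\sqrt{\eta_t}$ and $b=\sqrt{\eta_t}\,L$, and the quadratic terms of (11) and (13) cancel when the two bounds are added:

$$
\begin{aligned}
\langle\widetilde{\nabla}_{i,t},v\rangle+\langle\widetilde{\nabla}_{i,t},\hat{x}_{i,t+1}-x_t^\star\rangle
&\le \frac{1}{2\eta_t}\|v\|^2+\frac{\eta_t}{2}L^2
+\frac{1}{\eta_t}\mathcal{D}_{\mathcal{R}}(x_t^\star,y_{i,t})-\frac{1}{\eta_t}\mathcal{D}_{\mathcal{R}}(x_t^\star,\hat{x}_{i,t+1})-\frac{1}{2\eta_t}\|v\|^2 \\
&= \frac{\eta_t}{2}L^2+\frac{1}{\eta_t}\mathcal{D}_{\mathcal{R}}(x_t^\star,y_{i,t})-\frac{1}{\eta_t}\mathcal{D}_{\mathcal{R}}(x_t^\star,\hat{x}_{i,t+1}).
\end{aligned}
$$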

To bound the last term, we note that

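$$
\mathbb{E}\left[\langle\nabla_{i,t}-\widetilde{\nabla}_{i,t},x_{i,t}-x_t^\star\rangle\,\middle|\,\mathcal{F}_{t-1}\right]
=\left\langle\mathbb{E}\left[\nabla_{i,t}-\widetilde{\nabla}_{i,t}\,\middle|\,\mathcal{F}_{t-1}\right],x_{i,t}-x_t^\star\right\rangle=0.
$$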

Also, due to (6), the distance $\|x_{i,t}-x_t^\star\|$ is bounded by a constant multiple of $R$, which entails

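$$
\langle\nabla_{i,t}-\widetilde{\nabla}_{i,t},x_{i,t}-x_t^\star\rangle \le \left\|x_{i,t}-x_t^\star\right\|\,\|\nabla_{i,t}-\widetilde{\nabla}_{i,t}\|_* \le 4LR.
$$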

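Therefore, summing over $i\in[n]$ yields a bounded martingale difference sequence in $t$, with each increment at most $4nLR$ in magnitude, and we can use Azuma’s inequality to get

$$
\mathbb{P}\left(\sum_{i=1}^{n}\sum_{t=1}^{T}\langle\nabla_{i,t}-\widetilde{\nabla}_{i,t},x_{i,t}-x_t^\star\rangle\ge\varepsilon\right) \le \exp\left(-\frac{\varepsilon^2}{32\,T\,n^2L^2R^2}\right).
$$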

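Setting the probability above to $\delta$ and solving for $\varepsilon$ implies

$$
\frac{1}{n}\sum_{i=1}^{n}\sum_{t=1}^{T}\langle\nabla_{i,t}-\widetilde{\nabla}_{i,t},x_{i,t}-x_t^\star\rangle \le 8LR\sqrt{-T\log\delta},
$$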

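with probability at least $1-\delta$. Summing (14) over $i\in[n]$ and $t\in[T]$, dividing by $n$, and incorporating the bound above into the last term, we get

$$
\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n}\sum_{t=1}^{T}\big(f_{i,t}(x_{i,t})-f_{i,t}(x_t^\star)\big) \le{}& L^2\sum_{t=1}^{T}\frac{\eta_t}{2}+2L^2\sqrt{n}\sum_{t=1}^{T}\sum_{\tau=0}^{t-1}\eta_\tau\,\sigma_2^{t-\tau-1}(W) \\
&+\frac{1}{n}\sum_{i=1}^{n}\sum_{t=1}^{T}\left(\frac{1}{\eta_t}\mathcal{D}_{\mathcal{R}}(x_t^\star,y_{i,t})-\frac{1}{\eta_t}\mathcal{D}_{\mathcal{R}}(x_t^\star,\hat{x}_{i,t+1})\right) \\
&+8LR\sqrt{-T\log\delta},
\end{aligned}
$$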

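with probability at least $1-\delta$. Applying Lemma 3 to the above, we can simplify it as

$$
\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n}\sum_{t=1}^{T}\big(f_{i,t}(x_{i,t})-f_{i,t}(x_t^\star)\big) \le{}& L^2\sum_{t=1}^{T}\frac{\eta_t}{2}+2L^2\sqrt{n}\sum_{t=1}^{T}\sum_{\tau=0}^{t-1}\eta_\tau\,\sigma_2^{t-\tau-1}(W) \\
&+\frac{2R^2}{\eta_{T+1}}+\sum_{t=1}^{T}\frac{K}{\eta_{t+1}}\left\|x_{t+1}^\star-Ax_t^\star\right\| \\
&+8LR\sqrt{-T\log\delta}.
\end{aligned}
$$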
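To see how the individual terms of this bound trade off, the sketch below evaluates each of them for given problem constants. It is only an illustration under stated assumptions: the step size is taken as $\eta_t = c/\sqrt{t}$ with $\eta_0 = \eta_1$, and the function name `regret_bound_terms` and all numerical inputs are hypothetical.

```python
import numpy as np

def regret_bound_terms(T, n, L, R, K, sigma2, deviations, delta, c=1.0):
    """Evaluate each term on the right-hand side of the bound above.

    Assumes eta_t = c / sqrt(t) for t >= 1 and eta_0 = eta_1 (illustrative choices).
    deviations[t - 1] = ||x*_{t+1} - A x*_t|| for t = 1, ..., T.
    """
    eta = c / np.sqrt(np.arange(1, T + 2))       # eta_1, ..., eta_{T+1}
    eta_full = np.concatenate(([eta[0]], eta))   # eta_full[t] = eta_t for t = 0, ..., T+1

    step = 0.5 * L**2 * np.sum(eta_full[1:T + 1])

    network_sum = 0.0
    for t in range(1, T + 1):
        taus = np.arange(t)                      # tau = 0, ..., t - 1
        network_sum += np.dot(eta_full[:t], sigma2 ** (t - taus - 1))
    network = 2.0 * L**2 * np.sqrt(n) * network_sum

    bregman = 2.0 * R**2 / eta_full[T + 1]
    tracking = K * np.sum(np.asarray(deviations) / eta_full[2:T + 2])
    martingale = 8.0 * L * R * np.sqrt(-T * np.log(delta))

    return {"step": step, "network": network, "bregman": bregman,
            "tracking": tracking, "martingale": martingale}

# Hypothetical constants: 10 agents, sigma_2(W) = 0.5, constant deviation of 0.01 per round.
terms = regret_bound_terms(T=1000, n=10, L=1.0, R=1.0, K=1.0, sigma2=0.5,
                           deviations=np.full(1000, 0.01), delta=0.05)
total = sum(terms.values())
```

With this schedule the step, network, Bregman, and martingale terms all grow on the order of $\sqrt{T}$, while the tracking term is driven by the accumulated deviation from the dynamics.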

We now return to (9) and sum it over $t\in[T]$. We apply the bound above to the first line of (9) and the bound in Lemma 2 to the second line to finish the proof.
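For readers who want to experiment with the quantities appearing in the proof, one round of the analyzed scheme can be sketched as follows. This is a minimal sketch under assumptions that are not taken from the text above: the regularizer is $\mathcal{R}(x)=\frac{1}{2}\|x\|_2^2$ (so the mirror step becomes a projected gradient step), the feasible set is a Euclidean ball, and the updates (8a)-(8b) are taken to have the consensus, mirror step, and dynamics structure suggested by the proof. The names `project_ball` and `dmd_step` are hypothetical.

```python
import numpy as np

def project_ball(X, radius):
    """Row-wise Euclidean projection onto the ball of the given radius."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    scale = np.minimum(1.0, radius / np.maximum(norms, 1e-12))
    return X * scale

def dmd_step(X, grads, W, A, eta, radius):
    """One round of the assumed Euclidean instance of the decentralized updates.

    X, grads: arrays of shape (n, d); row i holds agent i's estimate and
              its (stochastic) gradient evaluated at that estimate.
    """
    Y = W @ X                                       # consensus: y_i = sum_j W_ij x_j
    X_hat = project_ball(Y - eta * grads, radius)   # mirror step for R(x) = ||x||^2 / 2
    return X_hat @ A.T                              # x_i <- A x_hat_i (target dynamics)

# Hypothetical usage: three agents averaging their estimates with no gradient signal.
W = np.full((3, 3), 1.0 / 3.0)
X0 = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
X1 = dmd_step(X0, np.zeros_like(X0), W, np.eye(2), eta=0.1, radius=10.0)
```

With a general Bregman divergence the projected gradient step would be replaced by the corresponding mirror map, which this sketch does not attempt to cover.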
