# DJAM: distributed Jacobi asynchronous method for learning personal models

Inês Almeida and João Xavier, Member, IEEE

This work was partially supported by Fundação para a Ciência e Tecnologia (FCT), Portugal, under Project UID/EEA/50009/2013, and by grant PD/BD/135012/2017 from FCT. The authors would also like to express their gratitude to Dušan Jakovetić for his valuable input. Emails: almeida.inesb@gmail.com, jxavier@isr.ist.utl.pt

###### Abstract

Processing data collected by a network of agents often boils down to solving an optimization problem. The distributed nature of these problems calls for methods that are, themselves, distributed. While most collaborative learning problems require agents to reach a common (or consensus) model, there are situations in which the consensus solution may not be optimal. For instance, agents may want to reach a compromise between agreeing with their neighbors and minimizing a personal loss function. We present DJAM, a Jacobi-like distributed algorithm for learning personalized models. This method is implementation-friendly: it has no hyperparameters that need tuning, it is asynchronous, and its updates only require single-neighbor interactions. We prove that DJAM converges with probability one to the solution, provided that the personal loss functions are strongly convex and have Lipschitz gradient. We then give evidence that DJAM is on par with state-of-the-art methods: our method reaches a solution with error similar to the error of a carefully tuned ADMM in about the same number of single-neighbor interactions.

## I Learning personal models

Consider $n$ agents, each with a personal loss function $f_i$, for agent $i = 1, \dots, n$. For example, $f_i(\theta_i)$ could be the loss of a model parameterized by $\theta_i$ on agent $i$’s personal dataset. The agents are the nodes of an undirected, connected network.

Each agent aims to find a model that minimizes both the mismatch with its neighbors’ models and its personal loss. More specifically, agents aim to solve

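$$\min_{\theta_1,\dots,\theta_n} \ \frac{1}{2}\sum_{i<j}^{n} W_{ij}\|\theta_i-\theta_j\|^2 + \sum_{i=1}^{n} f_i(\theta_i), \tag{1}$$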

where $W$ is a symmetric matrix that mirrors the topology of the network: $W_{ij} > 0$ if agents $i$ and $j$ are connected in the network; $W_{ij} = 0$ otherwise. The weight $W_{ij}$ controls the degree of agreement we want between agents $i$ and $j$: a larger $W_{ij}$ enforces more similarity between the corresponding agents’ models.

Closest related works. Optimization problem (1) has been addressed in [1]. For convex loss functions that are quadratic, the authors suggest a distributed algorithm, which we refer to as Model Propagation Algorithm (MPA). In each round of MPA, an agent wakes up at random, interacts with one of its neighbors, and both go back to sleep; the pattern repeats for the following rounds. MPA is easy to implement because it is asynchronous (each agent has its own clock to wake up), has no parameter to tune, and involves only single-neighbor interactions (the agent that wakes up does not need to coordinate message-passing with several neighbors). The authors in [1] prove that MPA converges to the solution of (1) in expectation (mean value), for quadratic loss functions; for these functions, the iterations of the method we propose coincide with those of MPA. For more general loss functions, those authors suggest a different algorithm, based on ADMM, which needs parameter tuning to reach optimal performance. This ADMM-based algorithm for collaborative learning (CL-ADMM) is compared with our algorithm in Section IV.

Problem (1), with the same kind of asynchronous single-neighbor interactions, can also be tackled by the algorithm proposed in [2]. In the language of [2], this corresponds to having agents deviate from the “rational” decision at each round (the rational decision would require each agent to interact with all its neighbors). For such “irrational” decisions, the authors show that, with probability one, the iterations of their algorithm will visit infinitely often a neighborhood of the solution of (1), although the iterations may continually escape that neighborhood.

Contributions. We show that a simple Jacobi-like distributed algorithm, which we call DJAM, can solve (1) with the same kind of asynchronous single-neighbor interactions. DJAM, which can also be seen as a randomized block-coordinate method, has no parameters that need tuning. For continuously differentiable personal loss functions that are strongly convex and have Lipschitz gradient, that is, such that

$$(\nabla f_i(x)-\nabla f_i(y))^T(x-y) \geq m_i\|x-y\|^2 \tag{2}$$

and

$$\|\nabla f_i(x)-\nabla f_i(y)\| \leq M_i\|x-y\| \tag{3}$$

for some $m_i, M_i > 0$ and all $x, y$, we show that DJAM converges to the solution of (1) with probability one. The values of $m_i$ and $M_i$ are used only for proving convergence; they need not be known when implementing DJAM.

DJAM improves on MPA not only because it applies to a larger class of functions than quadratics, but also because it converges in a stronger sense: as the proof of Theorem 1 ahead shows, the DJAM iterations are uniformly bounded; thus, the convergence in expectation in [1] follows by the dominated convergence theorem from our convergence with probability one. Our result only applies to a (somewhat) more restricted class of functions than the one of [2], but our convergence mode is stronger than the one of [2].

Other related work. Although [1, 2] are the closest works that we are aware of, many other distributed algorithms solve variations of problem (1). We now mention some representative work.

A number of distributed algorithms allow agents to solve an underlying optimization problem by reaching consensus on the solution. They use techniques ranging from distributed (sub)gradient descent [3], [4] to more elaborate techniques such as EXTRA [5], distributed ADMM [6, 7], dual averaging [8], and distributed Augmented Lagrangian (AL) [9]. Some algorithms aim at more specific optimization tasks such as distributed lasso regression [10], distributed SVMs [11], and distributed RVFL networks [12]. All of these methods aim at reaching consensus solutions: all agents converge to the same value. In contrast, in problem (1), agents want to find different (personalized) values.

The related problem of network lasso is dealt with in [13]; however, the cost in [13] puts a strong emphasis on neighbouring models being exactly equal, whereas in our case we want them to be similar, but not necessarily equal. The methods proposed in [14] and [15] can tackle more general problems, but both require that agents communicate with all their neighbors before updating, while our method needs only communications between two agents at a time.

## II DJAM

A naive Jacobi-like approach to solve (1) would work as follows: at each round $t$, one agent $i$, picked at random, would update its model according to

$$\theta_i(t+1) = \arg\min_{\theta_i} \ \frac{1}{2}\sum_{k\in\mathcal{N}_i} W_{ik}\|\theta_i-\theta_k(t)\|^2 + f_i(\theta_i),$$

where $\mathcal{N}_i$ is the set of neighbors of agent $i$. This naive approach, however, has a major drawback: it requires that agent $i$ communicate with all its neighbors, to receive their up-to-date models $\theta_k(t)$, before updating its own model. Coordinating such message-passing, at each round, is cumbersome. A lighter scheme, involving only a single pair of agents at a time, is simpler to implement in practice, and requires fewer communications.

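The key idea, which we borrow from [1], is to have each agent $i$ keep its own model $\Theta_i^i$ as well as (often outdated) versions of its neighbors’ models, $\Theta_k^i$ for $k \in \mathcal{N}_i$. Here $\Theta_k^l$ denotes the copy of agent $k$’s model that is stored at agent $l$. The versions of each pair of neighbors are updated whenever they communicate with each other. More specifically, at each round $t$, agent $i$ wakes up and chooses a neighbor $j$ to communicate with. They begin by exchanging information on their models, meaning that $\Theta_j^i(t+1) = \Theta_j^j(t)$ and $\Theta_i^j(t+1) = \Theta_i^i(t)$. All other variables remain unchanged. Afterwards, both agents update their own model via

$$\Theta_l^l(t+1) = \arg\min_{\theta_l} \ \frac{1}{2}\sum_{k\in\mathcal{N}_l} W_{lk}\|\theta_l-\Theta_k^l(t+1)\|^2 + f_l(\theta_l) \tag{4}$$

for $l \in \{i, j\}$.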

For the purpose of analyzing DJAM, we merge these two steps into a single one. Since the personal model $\Theta_l^l$ can be created at any time at agent $l$ via (4), it need not be stored. This means that, at round $t$ of DJAM, two neighboring agents $i$ and $j$ will compute and share their own models with each other:

$$\Theta_j^i(t+1) = \arg\min_{\theta_j} \ \frac{1}{2}\sum_{k\in\mathcal{N}_j} W_{jk}\|\theta_j-\Theta_k^j(t)\|^2 + f_j(\theta_j), \tag{5}$$

and similarly for $\Theta_i^j(t+1)$. Mind that the right-hand side of (5) is computed by agent $j$ and sent to agent $i$, who stores the result in the variable on the left-hand side of (5).
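To make the update concrete, the following is a minimal sketch of DJAM rounds for one special case only: scalar models and quadratic personal losses $f_i(\theta) = \frac{c_i}{2}(\theta - y_i)^2$, for which the argmin in (5) has a closed form. The function names (`djam_quadratic`, `gradient_of_objective`), the zero initialization of the stored copies, and the uniform edge sampling are assumptions made for illustration; they are not taken from the paper.

```python
import random


def djam_quadratic(W, c, y, rounds, seed=0):
    """Sketch of DJAM for scalar models and f_i(x) = c[i]/2 * (x - y[i])**2.

    W is a symmetric list of lists; W[i][k] > 0 marks an edge between i and k.
    Returns (models, copies), where copies[i][k] is agent i's stored copy of
    neighbor k's model and models[i] is agent i's model rebuilt from its copies.
    """
    n = len(W)
    rng = random.Random(seed)
    neighbors = [[k for k in range(n) if k != i and W[i][k] > 0] for i in range(n)]
    edges = [(i, j) for i in range(n) for j in neighbors[i] if i < j]
    # Assumed initialization: every stored copy starts at zero.
    copies = [{k: 0.0 for k in neighbors[i]} for i in range(n)]

    def local_model(j):
        # Closed-form argmin of 1/2 * sum_k W[j][k]*(x - copies[j][k])**2 + f_j(x).
        num = c[j] * y[j] + sum(W[j][k] * copies[j][k] for k in neighbors[j])
        den = c[j] + sum(W[j][k] for k in neighbors[j])
        return num / den

    for _ in range(rounds):
        i, j = rng.choice(edges)  # one edge per round, chosen uniformly (assumption)
        # Both agents compute their models from the copies held at round t ...
        model_i, model_j = local_model(i), local_model(j)
        # ... and only then overwrite each other's stored copy.
        copies[i][j] = model_j
        copies[j][i] = model_i

    models = [local_model(i) for i in range(n)]
    return models, copies


def gradient_of_objective(W, c, y, theta):
    """Gradient of the objective in (1) for the quadratic losses above."""
    n = len(W)
    return [
        sum(W[i][k] * (theta[i] - theta[k]) for k in range(n))
        + c[i] * (theta[i] - y[i])
        for i in range(n)
    ]


# Example: a path graph with four agents and unit weights.
W = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
c = [1.0, 1.0, 1.0, 1.0]
y = [1.0, 2.0, 3.0, 4.0]
models, copies = djam_quadratic(W, c, y, rounds=2000)
```

Since the objective in (1) is strongly convex in this setting, a vanishing gradient at `models` indicates that the agents have reached the solution; `gradient_of_objective` is a convenient check for that.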

## III Proof of convergence for DJAM

We now prove that DJAM, the algorithm with updates given by (5), converges with probability one to the solution of (1). We omit some laborious (but otherwise painless) technical steps that would make the notation and proofs too lengthy.

Let $\mathcal{E}$ be the set of edges of the network that links the agents. The network need not be fully connected: each agent is connected only to a subset of the remaining agents. We assume that at each round (A1) one edge of $\mathcal{E}$ is chosen at random, independently of previous choices; and (A2) each edge in $\mathcal{E}$ has a fixed, positive probability of being chosen.

We stack some supporting lemmas (with abridged proofs) before reaching the main result, Theorem 1. We begin with a simple property of the edge selection process.

###### Lemma 1.

Under assumptions (A1) and (A2), each edge in $\mathcal{E}$ is chosen infinitely often.

###### Proof.

This follows from the divergent part of the Borel-Cantelli Lemma [16, Theorem 4.2.4]. ∎

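The number of times a given edge $(i,j)$ is chosen between rounds $s$ and $t$ (with $s \leq t$) is a random variable defined as $S_{(i,j)}(s,t) = \sum_{r=s}^{t} \mathbb{1}_{(i,j)}(r)$, where $\mathbb{1}_{(i,j)}(r) = 1$ if edge $(i,j)$ is chosen at round $r$, and zero otherwise. We now define a useful family of stopping times $T_m$. We let $T_0 = 0$ and, for $m \geq 0$,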

$$T_{m+1} := \min\{\, t \mid S_{(i,j)}(T_m+1,t) \geq 1, \ \forall (i,j)\in\mathcal{E} \,\}.$$

In words, $T_{m+1}$ is the first round after $T_m$ by which all edges have been chosen at least once. Note that Lemma 1 tells us that any $T_m$ is finite with probability one.
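As an illustration of these stopping times, the sketch below extracts them from a given sequence of chosen edges. It assumes that rounds are numbered from 1, that $T_0 = 0$, and that edges are labeled by integer indices; the function name `cover_times` is hypothetical.

```python
def cover_times(picks, num_edges):
    """Return [T_1, T_2, ...] for a finite sequence of chosen edge indices.

    picks[t - 1] is the index of the edge chosen at round t (rounds start at 1).
    Each returned time is the first round, after the previous returned time,
    by which every one of the num_edges edges has been chosen at least once.
    """
    times = []
    seen = set()
    for t, edge in enumerate(picks, start=1):
        seen.add(edge)
        if len(seen) == num_edges:
            times.append(t)  # all edges covered since the previous stopping time
            seen = set()     # start counting afresh from round t + 1
    return times


# Three edges (0, 1, 2) and eight rounds of edge choices.
example_times = cover_times([0, 0, 1, 2, 1, 0, 2, 2], num_edges=3)
```

In this example all three edges are first covered at round 4, and covered again (counting from round 5) at round 7, so `example_times` is `[4, 7]`.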

Our next lemma states an intuitive property of $T_m$: the sequence grows unbounded.

###### Lemma 2.

Under assumptions (A1) and (A2), $T_m \to \infty$ as $m \to \infty$, with probability one.

###### Proof.

for any . ∎

Finally, we state an important consequence of assumptions (2) and (3) on each personal loss function $f_j$.

###### Lemma 3.

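Let $w_j = \sum_{k\in\mathcal{N}_j} W_{jk}$, and take the function $F_j(\theta) = \frac{w_j}{2}\|\theta\|^2 + f_j(\theta)$. Then, for any $a$ and $b$,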

$$\|(\nabla F_j)^{-1}(a)-(\nabla F_j)^{-1}(b)\| \leq (m_j+w_j)^{-1}\|a-b\|.$$

###### Proof.

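Note that $\nabla F_j$ is a bijective map because, from standard convex theory, the gradient of a strongly convex, continuously differentiable function is. Now, let $x = (\nabla F_j)^{-1}(a)$ and $y = (\nabla F_j)^{-1}(b)$. Clearly, $a - b = w_j(x-y) + \nabla f_j(x) - \nabla f_j(y)$. Multiplying both sides of this equality by $(x-y)^T$ yields $(x-y)^T(a-b) = w_j\|x-y\|^2 + (x-y)^T(\nabla f_j(x)-\nabla f_j(y)) \geq (w_j+m_j)\|x-y\|^2$, where the inequality is due to the strong convexity of $f_j$. By the Cauchy-Schwarz inequality, $(x-y)^T(a-b) \leq \|x-y\|\,\|a-b\|$, and, thus, $(m_j+w_j)\|x-y\|^2 \leq \|x-y\|\,\|a-b\|$. It follows that $\|x-y\| \leq (m_j+w_j)^{-1}\|a-b\|$, the desired result. ∎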

We now give our main convergence result.

###### Theorem 1 (DJAM converges with probability one).

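Let $\Theta^\star = (\Theta_1^\star, \dots, \Theta_n^\star)$ be the solution of (1). Let $\Theta_j^i(t)$ be updated via (5) whenever edge $(i,j)$ is chosen at round $t$, and similarly for $\Theta_i^j(t)$. Then, for any pair of agents $(i,j) \in \mathcal{E}$, $\Theta_j^i(t) \to \Theta_j^\star$ as $t \to \infty$, with probability one.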

###### Proof.

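Let $w_j$ and $F_j$ be defined as in Lemma 3. Suppose edge $(i,j)$ is chosen at time $t$; the update rule (5) can be rewritten in terms of $F_j$ as $\Theta_j^i(t+1) = (\nabla F_j)^{-1}\big(\sum_{k\in\mathcal{N}_j} W_{jk}\Theta_k^j(t)\big)$. Similarly, we have $\Theta_j^\star = (\nabla F_j)^{-1}\big(\sum_{k\in\mathcal{N}_j} W_{jk}\Theta_k^\star\big)$ for each component of the solution.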

Lemma 3 allows us to find that

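$$\begin{aligned}
\|\Theta_j^i(t+1)-\Theta_j^\star\| &\leq (m_j+w_j)^{-1}\Big\|\sum_{k\in\mathcal{N}_j} W_{jk}\big(\Theta_k^j(t)-\Theta_k^\star\big)\Big\| \\
&\leq (m_j+w_j)^{-1}\sum_{k\in\mathcal{N}_j} W_{jk}\|\Theta_k^j(t)-\Theta_k^\star\| \\
&\leq (m_j+w_j)^{-1}w_j\max_k\|\Theta_k^j(t)-\Theta_k^\star\| \\
&\leq \beta\max_k\|\Theta_k^j(t)-\Theta_k^\star\|,
\end{aligned}$$

where $\beta = \max_j\,(m_j+w_j)^{-1}w_j < 1$.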

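Let us define the maximum error at round $t$ between the agents’ estimates and the solution,

$$V(t) := \max_{l,k}\|\Theta_k^l(t)-\Theta_k^\star\|. \tag{6}$$

If edge $(i,j)$ is chosen at round $t$, we have, by the derivation above, that

$$\|\Theta_j^i(t+1)-\Theta_j^\star\| \leq V(t). \tag{7}$$

If that edge is not chosen, then $\Theta_j^i(t+1) = \Theta_j^i(t)$ and, by definition of $V(t)$, $\|\Theta_j^i(t+1)-\Theta_j^\star\| \leq V(t)$. We conclude that (7) holds for any pair $(i,j)$ and, so, $V(t+1) \leq V(t)$. Since $V(t) \geq 0$, the limit $\lim_{t\to\infty} V(t)$ (which is a random variable) is thus always well defined. The goal of the proof is to show that this limit is zero with probability one.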

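Consider now a sequence of randomly-chosen edges in which all edges of the graph appear infinitely often. Note that, by Lemma 1, such sequences occur with probability one; this means that the following reasoning holds with probability one. Since, by definition of $T_{m+1}$, all edges were selected at least once between $T_m$ and $T_{m+1}$, we know that

$$\|\Theta_j^i(T_{m+1})-\Theta_j^\star\| \leq (m_j+w_j)^{-1}w_j\,V(T_m)$$

for all edges $(i,j) \in \mathcal{E}$. It follows, by the definitions of $\beta$ and $V$, that

$$V(T_{m+1}) \leq \beta V(T_m). \tag{8}$$

Let us take the limit $m \to \infty$ in (8). Lemma 2 says that $T_m \to \infty$, and it follows from (6) that both $V(T_m)$ and $V(T_{m+1})$ converge to $\lim_{t\to\infty} V(t)$. Thus, owing to $\beta < 1$, we must have $\lim_{t\to\infty} V(t) = 0$, which implies $\Theta_j^i(t) \to \Theta_j^\star$ as $t \to \infty$, for any $(i,j)$. ∎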

Convergence in expectation. Note that (8), together with the monotonicity of $V$, implies $V(t) \leq V(0)$ for all $t$. This means that the iterations are uniformly bounded. Since, with probability one, $\Theta_j^i(t) \to \Theta_j^\star$ as $t \to \infty$, we conclude (by the dominated convergence theorem) that the convergence also takes place in expectation. We thus obtain the convergence mode of [1] as a particular case.

## IV Field estimation example

Setup. Following [2], we consider a field estimation setup that leads to a problem of the form (1). The agents are spread in a region and wish to profile a certain quantity, say, temperature, over the region: agent  cares only about the value of the quantity at its location, . Assume that the true values of the temperatures, , are drawn from a prior distribution: a normal distribution with known mean and covariance ; as in [2], we assume that the off-diagonal elements of match the sparsity of the network, that is, if and only if . Agent  measures , where models identically distributed sensor noise (for simplicity), which is independent across agents.

MAP estimation. A maximum a posteriori (MAP) approach seeks the that maximizes ; or, equivalently, the that minimizes

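$$\frac{1}{2}\Big(\sum_{i\sim j}\sigma_{ij}(\theta_i-\theta_j)^2 + \sum_{i=1}^{n}\sigma_{ii}\theta_i^2\Big) + \sum_{i=1}^{n}\phi(y_i-\theta_i), \tag{9}$$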

where and depends on the distribution of the noise . We let be a Huber penalty function to handle outliers [17]. Finally, defining the personal loss functions as puts (9) in the form (1). Also, assumptions (2) and (3) hold.
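For scalar models, the local minimization in (5) with a Huber penalty can be solved in closed form. The sketch below assumes a personal loss of the form $f_i(\theta) = \frac{s}{2}\theta^2 + \phi(y-\theta)$, read off from the per-agent terms of (9), and a Huber function with unit curvature and threshold `delta`. The names `huber_grad`, `local_update_huber`, `s`, and `delta` are illustrative and do not come from the paper.

```python
def huber_grad(r, delta):
    """Derivative of the Huber penalty: r clipped to [-delta, delta]."""
    return max(-delta, min(delta, r))


def local_update_huber(weights, copies, s, y, delta=1.0):
    """Minimize 1/2 * sum_k w_k*(x - c_k)**2 + s/2 * x**2 + huber(y - x) over scalar x.

    weights: positive edge weights W_jk of the agent's neighbors.
    copies:  the agent's stored copies of those neighbors' models.
    """
    a = sum(weights) + s
    b = sum(w * c for w, c in zip(weights, copies))
    # Try the quadratic region of the Huber penalty first.
    x = (b + y) / (a + 1.0)
    r = y - x
    if r > delta:
        # Residual too large and positive: the penalty is linear with slope delta.
        x = (b + delta) / a
    elif r < -delta:
        # Residual too large and negative: the penalty is linear with slope -delta.
        x = (b - delta) / a
    return x


# Example: two neighbors and an outlying measurement y = 10.
x_star = local_update_huber([1.0, 2.0], [0.5, 1.0], s=0.5, y=10.0, delta=1.0)
```

The result can be checked against the first-order optimality condition `a * x - b == huber_grad(y - x, delta)`, which has a unique root because its left-hand side minus its right-hand side is strictly increasing in `x`.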

Results: comparing DJAM with CL-ADMM. Since the algorithm MPA from [1] applies only to quadratic functions, we use the ADMM-based algorithm CL-ADMM from [1] to compare with DJAM. Note that both CL-ADMM and DJAM converge to the solution with probability one. The algorithm CL-ADMM, however, being based on ADMM, has a parameter to tune—the parameter in the quadratic penalization part of the augmented Lagrangian function. This parameter, which we refer to as , is known to affect noticeably the convergence speed of ADMM.

The results for a field estimation instance are shown in Figure 1. It shows, across rounds , the relative error between an agent’s private model and the solution component : . The relative error was averaged over agents and over 100 Monte Carlo trials where, in each Monte Carlo run, we choose a different set of edges along time.

Figure 1 confirms that the speed of convergence of CL-ADMM varies with the parameter noticeably. In fact, we verified in other simulations (omitted due to lack of space) that the optimal varied significantly with the number of agents, with the range of values for , and with the noise distribution—we found the optimal  for those simulations by careful hand-tuning. In contrast, DJAM is on par with the best in Figure 1, and needs no parameter tuning.

## References

• [1] P. Vanhaesebrouck, A. Bellet, and M. Tommasi, “Decentralized collaborative learning of personalized models over networks,” in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (A. Singh and J. Zhu, eds.), vol. 54 of Proceedings of Machine Learning Research, (Fort Lauderdale, FL, USA), pp. 509–517, PMLR, 20–22 Apr 2017.
• [2] C. Eksin and A. Ribeiro, “Distributed network optimization with heuristic rational agents,” IEEE Transactions on Signal Processing, vol. 60, pp. 5396–5411, Oct 2012.
• [3] K. Kvaternik and L. Pavel, “Lyapunov analysis of a distributed optimization scheme,” in International Conference on NETwork Games, Control and Optimization (NetGCooP 2011), pp. 1–5, Oct 2011.
• [4] A. Nedić and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
• [5] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-order algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
• [6] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends® in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
• [7] E. Wei and A. Ozdaglar, “On the O(1/k) convergence of asynchronous distributed alternating direction method of multipliers,” IEEE Global Conference on Signal and Information Processing (GlobalSIP), 2013.
• [8] I. Colin, A. Bellet, J. Salmon, and S. Clémençon, “Gossip dual averaging for decentralized optimization of pairwise functions,” in Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML’16, pp. 1388–1396, JMLR.org, 2016.
• [9] D. Jakovetić, J. M. F. Moura, and J. M. F. Xavier, “Linear convergence rate of a class of distributed augmented lagrangian algorithms,” IEEE Transactions on Automatic Control, vol. 60, no. 4, pp. 922–936, 2015.
• [10] G. Mateos, J. A. Bazerque, and G. B. Giannakis, “Distributed sparse linear regression,” IEEE Trans. Signal Processing, vol. 58, no. 10, pp. 5262–5276, 2010.
• [11] P. A. Forero, A. Cano, and G. B. Giannakis, “Consensus-based distributed support vector machines,” J. Mach. Learn. Res., vol. 11, pp. 1663–1707, Aug. 2010.
• [12] S. Scardapane, D. Wang, M. Panella, and A. Uncini, “Distributed learning for random vector functional-link networks,” Information Sciences, vol. 301, pp. 271–284, 2015.
• [13] D. Hallac, J. Leskovec, and S. Boyd, “Network lasso: Clustering and optimization in large graphs,” in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 387–396, 2015.
• [14] F. Facchinei, G. Scutari, and S. Sagratella, “Parallel selective algorithms for nonconvex big data optimization,” IEEE Transactions on Signal Processing, vol. 63, pp. 1874–1889, April 2015.
• [15] I. Necoara and D. Clipici, “Parallel random coordinate descent method for composite minimization: Convergence analysis and error bounds,” SIAM Journal on Optimization, vol. 26, no. 1, pp. 197–226, 2016.
• [16] K. Chung, A Course in Probability Theory. Academic Press, 2001.
• [17] P. J. Huber, Robust Statistics. New York: Wiley, 1981.