FROST – Fast row-stochastic optimization
with uncoordinated stepsizes
Abstract
In this paper, we discuss distributed optimization over directed graphs, where doubly-stochastic weights cannot be constructed. Most of the existing algorithms overcome this issue by applying push-sum consensus, which utilizes column-stochastic weights. The formulation of column-stochastic weights requires each agent to know (at least) its out-degree, which may be impractical in, e.g., broadcast-based communication protocols. In contrast, we describe FROST (Fast Row-stochastic Optimization with uncoordinated STepsizes), an optimization algorithm applicable to directed graphs that does not require the knowledge of out-degrees; its implementation is straightforward as each agent locally assigns weights to the incoming information and locally chooses a suitable stepsize. We show that FROST converges linearly to the optimal solution for smooth and strongly-convex functions given that the largest stepsize is positive and sufficiently small.
I Introduction
In this paper, we study distributed optimization, where $n$ agents are tasked to solve the following problem:
where each objective, $f_i:\mathbb{R}^p\rightarrow\mathbb{R}$, is private and known only to agent $i$. The goal of the agents is to find the global minimizer of the aggregate cost, $F(\mathbf{x})$, via local communication with their neighbors and without revealing their private objective functions. This formulation has recently received great attention due to its extensive applications in, e.g., machine learning [1, 2, 3, 4, 5], control [6], cognitive networks [7, 8], and source localization [9, 10].
Early work on this topic includes Distributed Gradient Descent (DGD) [11, 12], which is computationally simple but is slow due to a diminishing stepsize. The convergence rates are $O(\frac{\ln k}{\sqrt{k}})$ for general convex functions and $O(\frac{\ln k}{k})$ for strongly-convex functions, where $k$ is the number of iterations. With a constant stepsize, DGD converges faster albeit to an inexact solution [13, 14]. Related work also includes methods based on the Lagrangian dual [15, 16, 17, 18] to achieve faster convergence, albeit at the expense of more computation. To achieve both fast convergence and computational simplicity, some fast distributed first-order methods have been proposed. A Nesterov-type approach [19] achieves $O(\frac{\ln k}{k^2})$ for smooth convex functions with bounded gradient assumption. EXTRA [20] exploits the difference of two consecutive DGD iterates to achieve linear convergence to the optimal solution. Exact Diffusion [21, 22] applies an Adapt-then-Combine structure [23] to EXTRA and generalizes the symmetric doubly-stochastic weights required in EXTRA to locally-balanced row-stochastic weights over undirected graphs. Of significant relevance to this paper is a distributed gradient tracking technique built on dynamic consensus [24], which enables each agent to asymptotically learn the gradient of the global objective function. This technique was first proposed simultaneously in [25, 26]. Refs. [25, 27] combine it with the DGD structure to achieve improved convergence for smooth and convex problems. Refs. [26, 28], on the other hand, propose the NEXT framework for a more general class of non-convex problems.
All of the aforementioned methods assume that the multi-agent network is undirected. In practice, it may not be possible to achieve undirected communication. It is of interest, thus, to develop algorithms that are fast and are applicable to arbitrary directed graphs. The challenge here lies in the fact that doubly-stochastic weights, standard in many distributed optimization algorithms, cannot be constructed over arbitrary directed graphs. In particular, the weight matrices in directed graphs can only be either row-stochastic or column-stochastic, but not both.
We now discuss related work on directed graphs. Early work based on DGD includes subgradient-push [29, 30] and Directed-Distributed Gradient Descent (D-DGD) [31, 32], with a sublinear convergence rate of $O(\frac{\ln k}{\sqrt{k}})$. Some recent work extends these methods to asynchronous networks [33, 34, 35]. To accelerate the convergence, DEXTRA [36] combines push-sum [37] and EXTRA [20] to achieve linear convergence given that the stepsize lies in some non-trivial interval. This restriction on the stepsize is later relaxed in ADD-OPT/Push-DIGing [38, 39], which linearly converge for a sufficiently small stepsize. Of relevance is also [40], where distributed non-convex problems are considered with column-stochastic weights. More recent work [41, 42] proposes the $\mathcal{AB}$ and $\mathcal{AB}m$ algorithms, which employ both row- and column-stochastic weights to achieve (accelerated) linear convergence over arbitrary strongly-connected graphs. Note that although the construction of doubly-stochastic weights is avoided, all of the aforementioned methods require each agent to know its out-degree to formulate doubly- or column-stochastic weights. This requirement may be impractical in situations where the agents use a broadcast-based communication protocol. In contrast, Refs. [43, 44] provide algorithms that only use row-stochastic weights. Row-stochastic weight design is simple and is further applicable to broadcast-based methods.
In this paper, we focus on optimization with row-stochastic weights following the recent work in [43, 44]. We propose a fast optimization algorithm, termed FROST (Fast Row-stochastic Optimization with uncoordinated STepsizes), which is applicable to both directed and undirected graphs with uncoordinated stepsizes among the agents. Distributed optimization (based on gradient tracking) with uncoordinated stepsizes has been previously studied in [25, 45, 46], over undirected graphs with doubly-stochastic weights, and in [47], over directed graphs with column-stochastic weights. These works introduce a notion of heterogeneity among the stepsizes, defined respectively as the relative deviation of the stepsizes from their average in [25, 45], and as the ratio of the largest to the smallest stepsize in [46, 47]. It is then shown that when the heterogeneity is small enough, i.e., the stepsizes are very close to each other, and when the largest stepsize follows a bound as a function of the heterogeneity, the proposed algorithms linearly converge to the optimal solution. A challenge in this formulation is that choosing a sufficiently small, local stepsize does not ensure small heterogeneity, while no stepsize can be chosen to be zero. In contrast, a major contribution of this paper is that we establish linear convergence with uncoordinated stepsizes when the upper bound on the stepsizes is independent of any notion of heterogeneity. The implementation of FROST is therefore completely local, since each agent locally chooses a sufficiently small stepsize, independent of other stepsizes, and locally assigns row-stochastic weights to the incoming information. In addition, our analysis shows that all stepsizes except one can be zero for the algorithm to work, which is a novel result in distributed optimization. We show that FROST converges linearly to the optimal solution for smooth and strongly-convex functions.
Notation: We use lowercase bold letters to denote vectors and uppercase italic letters to denote matrices. The matrix, $I_n$, represents the $n\times n$ identity, whereas $\mathbf{1}_n$ ($\mathbf{0}_n$) is the $n$-dimensional column vector of all $1$'s ($0$'s). We further use $\mathbf{e}_i$ to denote an $n$-dimensional vector of all $0$'s except $1$ at the $i$th location. For an arbitrary vector, $\mathbf{x}$, we denote its $i$th element by $[\mathbf{x}]_i$ and $\mathrm{diag}\{\mathbf{x}\}$ is a diagonal matrix with $\mathbf{x}$ on its main diagonal. We denote by $X\otimes Y$, the Kronecker product of two matrices, $X$ and $Y$. For a primitive, row-stochastic matrix, $\underline{A}$, we denote its left and right Perron eigenvectors by $\boldsymbol{\pi}_r$ and $\mathbf{1}_n$, respectively, such that $\boldsymbol{\pi}_r^\top\mathbf{1}_n=1$; similarly, for a primitive, column-stochastic matrix, $\underline{B}$, we denote its left and right Perron eigenvectors by $\mathbf{1}_n$ and $\boldsymbol{\pi}_c$, respectively, such that $\mathbf{1}_n^\top\boldsymbol{\pi}_c=1$ [49]. For a matrix, $X$, we denote $\rho(X)$ as its spectral radius and $\mathrm{diag}(X)$ as a diagonal matrix consisting of the corresponding diagonal elements of $X$. The notation $\|\cdot\|_2$ denotes the Euclidean norm of vectors and matrices, while $\|\cdot\|_F$ denotes the Frobenius norm of matrices. Depending on the argument, we denote $\|\cdot\|$ either as a particular matrix norm, the choice of which will be clear in Lemma 1, or a vector norm that is compatible with this matrix norm, i.e., $\|X\mathbf{x}\|\leq\|X\|\|\mathbf{x}\|$ for all matrices, $X$, and all vectors, $\mathbf{x}$ [49].
We now describe the rest of the paper. Section II states the problem and assumptions. Section III reviews related algorithms that use doubly-stochastic or column-stochastic weights and shows the intuition behind the analysis of these types of algorithms. In Section IV, we provide the main algorithm, FROST, proposed in this paper. In Section V, we develop the convergence properties of FROST. Simulation results are provided in Section VI and Section VII concludes the paper.
II Problem Formulation
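Consider $n$ agents communicating over a strongly-connected network, $\mathcal{G}=(\mathcal{V},\mathcal{E})$, where $\mathcal{V}=\{1,\cdots,n\}$ is the set of agents and $\mathcal{E}$ is the set of edges, $(i,j)$, $i,j\in\mathcal{V}$, such that agent $j$ can send information to agent $i$, i.e., $j\rightarrow i$. Define $\mathcal{N}_i^{\mathrm{in}}$ as the collection of in-neighbors, i.e., the set of agents that can send information to agent $i$. Similarly, define $\mathcal{N}_i^{\mathrm{out}}$ as the set of out-neighbors of agent $i$. Note that both $\mathcal{N}_i^{\mathrm{in}}$ and $\mathcal{N}_i^{\mathrm{out}}$ include agent $i$. The agents are tasked to solve the following problem:
where $f_i:\mathbb{R}^p\rightarrow\mathbb{R}$ is a private cost function only known to agent $i$. We denote the optimal solution of P1 as $\mathbf{x}^*$. We will discuss different distributed algorithms related to this problem under the applicable set of assumptions, described below.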
Assumption A1.
The graph, $\mathcal{G}$, is undirected and connected.
Assumption A2.
The graph, $\mathcal{G}$, is directed and strongly-connected.
Assumption A3.
Each local objective, $f_i$, is convex with bounded subgradient.
Assumption A4.
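Each local objective, $f_i$, is smooth and strongly-convex, i.e., $\forall i\in\mathcal{V}$ and $\forall\mathbf{x},\mathbf{y}\in\mathbb{R}^p$,
(i) there exists a positive constant $l$ such that
$$\|\nabla f_i(\mathbf{x})-\nabla f_i(\mathbf{y})\|_2\leq l\|\mathbf{x}-\mathbf{y}\|_2;$$
(ii) there exists a positive constant $\mu$ such that
$$f_i(\mathbf{y})\geq f_i(\mathbf{x})+\nabla f_i(\mathbf{x})^\top(\mathbf{y}-\mathbf{x})+\frac{\mu}{2}\|\mathbf{x}-\mathbf{y}\|_2^2.$$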
Clearly, the Lipschitz-continuity and strong-convexity constants for the global objective function, $F$, are and , respectively.
Assumption A5.
Each agent in the network has and knows its unique identifier, e.g., $1,\cdots,n$.
Assumption A6.
Each agent knows its out-degree in the network, i.e., the number of its out-neighbors.
III Related work
In this section, we discuss related distributed first-order methods and provide an intuitive explanation for each one of them.
III-A Algorithms using doubly-stochastic weights
A well-known solution to distributed optimization over undirected graphs is Distributed Gradient Descent (DGD) [11, 12], which combines distributed averaging with a local gradient step. Each agent $i$ maintains a local estimate, $\mathbf{x}_k^i$, of the optimal solution, $\mathbf{x}^*$, and implements the following iteration:
$$\mathbf{x}_{k+1}^i=\sum_{j=1}^n w_{ij}\mathbf{x}_k^j-\alpha_k\nabla f_i(\mathbf{x}_k^i),\tag{1}$$
where $W=\{w_{ij}\}$ is doubly-stochastic and respects the graph topology. The stepsize $\alpha_k$ is diminishing such that $\sum_{k=0}^\infty\alpha_k=\infty$ and $\sum_{k=0}^\infty\alpha_k^2<\infty$. Under the Assumptions A1, A3, and A6, DGD converges to $\mathbf{x}^*$ at the rate of $O(\frac{\ln k}{\sqrt{k}})$. The convergence rate is slow because of the diminishing stepsize. If a constant stepsize is used in DGD, i.e., $\alpha_k=\alpha$, it converges faster to an error ball, proportional to $\alpha$, around $\mathbf{x}^*$ [13, 14]. This is because $\mathbf{x}^*$ is not a fixed-point of the above iteration when the stepsize is a constant.
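To make the DGD iteration in Eq. (1) concrete, the following is a minimal sketch under stated assumptions: scalar estimates, hypothetical local objectives $f_i(x)=\frac{1}{2}(x-b_i)^2$ (so the minimizer is the mean of the $b_i$), Metropolis weights on a 3-node undirected path, and the stepsize $\alpha_k=1/(k+1)$. None of these choices come from the paper; they only illustrate the averaging-plus-gradient structure and the slow decay of the error.

```python
# Hypothetical toy problem: f_i(x) = 0.5 * (x - b_i)^2, so x* = mean(b).
b = [1.0, 2.0, 6.0]
x_star = sum(b) / len(b)

# Doubly-stochastic (Metropolis) weights for the undirected path 0 - 1 - 2.
W = [[2 / 3, 1 / 3, 0.0],
     [1 / 3, 1 / 3, 1 / 3],
     [0.0, 1 / 3, 2 / 3]]


def grad(i, x):
    """Gradient of the assumed local objective f_i."""
    return x - b[i]


def dgd(num_iters):
    """Run Eq. (1) with the diminishing stepsize alpha_k = 1 / (k + 1)."""
    n = len(b)
    x = [0.0] * n
    for k in range(num_iters):
        alpha_k = 1.0 / (k + 1)
        # Mix neighbors' estimates, then take a local gradient step.
        x = [sum(W[i][j] * x[j] for j in range(n)) - alpha_k * grad(i, x[i])
             for i in range(n)]
    return x


x_dgd = dgd(5000)
dgd_error = max(abs(xi - x_star) for xi in x_dgd)
```

In this sketch the residual error shrinks only in proportion to the current stepsize, which is the behavior the diminishing-stepsize discussion above describes.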
To accelerate the convergence, Refs. [25, 27] recently propose a distributed first-order method based on gradient tracking, which uses a constant stepsize and replaces the local gradient, at each agent in DGD, with an asymptotic estimator of the global gradient. (EXTRA [20] is another related algorithm, which uses the difference between two consecutive DGD iterates to achieve linear convergence to the optimal solution.) The algorithm is updated as follows [25, 27]:
$$\mathbf{x}_{k+1}^i=\sum_{j=1}^n w_{ij}\mathbf{x}_k^j-\alpha\mathbf{y}_k^i,\tag{2a}$$
$$\mathbf{y}_{k+1}^i=\sum_{j=1}^n w_{ij}\mathbf{y}_k^j+\nabla f_i(\mathbf{x}_{k+1}^i)-\nabla f_i(\mathbf{x}_k^i),\tag{2b}$$
initialized with $\mathbf{y}_0^i=\nabla f_i(\mathbf{x}_0^i)$ and an arbitrary $\mathbf{x}_0^i$ at each agent. The first equation is essentially a descent method, after mixing with neighboring information, where the descent direction is $\mathbf{y}_k^i$, instead of $\nabla f_i(\mathbf{x}_k^i)$ as was in Eq. (1). The second equation is a global gradient estimator when viewed as dynamic consensus [52], i.e., $\mathbf{y}_k^i$ asymptotically tracks the average of local gradients: $\frac{1}{n}\sum_{i=1}^n\nabla f_i(\mathbf{x}_k^i)$. It is shown in Refs. [27, 45, 39] that $\mathbf{x}_k^i$ converges linearly to $\mathbf{x}^*$ under Assumptions A1, A4, A6, with a sufficiently small stepsize, $\alpha$. Note that these methods, Eq. (1) and Eqs. (2a)-(2b), are not applicable to directed graphs as they require doubly-stochastic weights.
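The gradient tracking updates in Eqs. (2a)-(2b) can be sketched in the same toy setting. The objectives $f_i(x)=\frac{1}{2}(x-b_i)^2$, the 3-node path with Metropolis weights, and the stepsize $\alpha=0.1$ are illustrative assumptions, not values from the paper.

```python
# Hypothetical toy problem: f_i(x) = 0.5 * (x - b_i)^2, so x* = mean(b).
b = [1.0, 2.0, 6.0]
x_star = sum(b) / len(b)

# Doubly-stochastic (Metropolis) weights for the undirected path 0 - 1 - 2.
W = [[2 / 3, 1 / 3, 0.0],
     [1 / 3, 1 / 3, 1 / 3],
     [0.0, 1 / 3, 2 / 3]]


def grad(i, x):
    """Gradient of the assumed local objective f_i."""
    return x - b[i]


def gradient_tracking(alpha, num_iters):
    """Run Eqs. (2a)-(2b) with a constant stepsize alpha."""
    n = len(b)
    x = [0.0] * n
    y = [grad(i, x[i]) for i in range(n)]  # y_0^i = grad f_i(x_0^i)
    for _ in range(num_iters):
        # (2a): mix estimates and descend along the tracked gradient.
        x_new = [sum(W[i][j] * x[j] for j in range(n)) - alpha * y[i]
                 for i in range(n)]
        # (2b): mix trackers and add the change in the local gradient.
        y = [sum(W[i][j] * y[j] for j in range(n))
             + grad(i, x_new[i]) - grad(i, x[i])
             for i in range(n)]
        x = x_new
    return x


x_gt = gradient_tracking(alpha=0.1, num_iters=500)
gt_error = max(abs(xi - x_star) for xi in x_gt)
```

With a constant stepsize, the error in this sketch decays geometrically rather than in proportion to a vanishing stepsize, in contrast to DGD.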
III-B Algorithms using column-stochastic weights
We first consider the case when DGD in Eq. (1) is applied to a directed graph and the weight matrix is column-stochastic but not row-stochastic. It can be obtained that [31]:
$$\overline{\mathbf{x}}_{k+1}=\overline{\mathbf{x}}_k-\frac{\alpha_k}{n}\sum_{i=1}^n\nabla f_i(\mathbf{x}_k^i),\tag{3}$$
where $\overline{\mathbf{x}}_k=\frac{1}{n}\sum_{i=1}^n\mathbf{x}_k^i$. From Eq. (3), it is clear that the average of the estimates, $\overline{\mathbf{x}}_k$, converges to $\mathbf{x}^*$, as Eq. (3) can be viewed as a centralized gradient method if each local estimate $\mathbf{x}_k^i$ converges to $\overline{\mathbf{x}}_k$. However, since the weight matrix is not row-stochastic, the estimates of agents will not reach an agreement [31]. This discussion motivates combining DGD with an algorithm, called push-sum, briefly discussed next, that enables agreement over directed graphs with column-stochastic weights.
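III-B1 Push-sum consensus
Push-sum [53, 37] is a technique to achieve average-consensus over arbitrary digraphs. At time $k$, each agent maintains two state vectors, $\mathbf{x}_k^i$, $\mathbf{z}_k^i\in\mathbb{R}^p$, and an auxiliary scalar variable, $v_k^i$, initialized with $v_0^i=1$. Push-sum performs the following iterations:
$$v_{k+1}^i=\sum_{j=1}^n b_{ij}v_k^j,\tag{4a}$$
$$\mathbf{x}_{k+1}^i=\sum_{j=1}^n b_{ij}\mathbf{x}_k^j,\tag{4b}$$
$$\mathbf{z}_{k+1}^i=\frac{\mathbf{x}_{k+1}^i}{v_{k+1}^i},\tag{4c}$$
where $\underline{B}=\{b_{ij}\}$ is column-stochastic. Eq. (4a) can be viewed as an independent algorithm to asymptotically learn the right Perron eigenvector of $\underline{B}$; recall that the right Perron eigenvector of $\underline{B}$ is not $\mathbf{1}_n$ because $\underline{B}$ is not row-stochastic and we denote it by $\boldsymbol{\pi}_c$. In fact, it can be verified that $\lim_{k\rightarrow\infty}v_k^i=n[\boldsymbol{\pi}_c]_i$ and that $\lim_{k\rightarrow\infty}\mathbf{x}_k^i=[\boldsymbol{\pi}_c]_i\sum_{j=1}^n\mathbf{x}_0^j$. Therefore, the limit of $\mathbf{z}_k^i$, as the ratio of $\mathbf{x}_k^i$ over $v_k^i$, is the average of the initial values:
$$\lim_{k\rightarrow\infty}\mathbf{z}_k^i=\lim_{k\rightarrow\infty}\frac{\mathbf{x}_k^i}{v_k^i}=\frac{[\boldsymbol{\pi}_c]_i\sum_{j=1}^n\mathbf{x}_0^j}{n[\boldsymbol{\pi}_c]_i}=\frac{1}{n}\sum_{j=1}^n\mathbf{x}_0^j.$$
In the next subsection, we present subgradient-push that applies push-sum to DGD; see [31, 32] for an alternate approach that does not require eigenvector estimation of Eq. (4a).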
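The push-sum iterations in Eqs. (4a)-(4c) can be sketched as follows. The 3-node digraph (edges 0→1, 0→2, 1→2, 2→0, plus self-loops) and the weights $b_{ij}=1/(\text{out-degree of } j)$ are illustrative assumptions; the state is scalar for brevity.

```python
# Column-stochastic weights: column j splits agent j's mass equally among
# its out-neighbors (self-loop included). Columns sum to 1; rows do not.
B = [[1 / 3, 0.0, 1 / 2],
     [1 / 3, 1 / 2, 0.0],
     [1 / 3, 1 / 2, 1 / 2]]


def push_sum(x0, num_iters):
    """Run Eqs. (4a)-(4c) on scalar states; return the ratios z and weights v."""
    n = len(x0)
    v = [1.0] * n          # v_0^i = 1
    x = list(x0)
    for _ in range(num_iters):
        v = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]  # (4a)
        x = [sum(B[i][j] * x[j] for j in range(n)) for i in range(n)]  # (4b)
    z = [x[i] / v[i] for i in range(n)]                                # (4c)
    return z, v


z_ps, v_ps = push_sum([1.0, 2.0, 6.0], num_iters=100)
```

In this sketch neither `x` nor `v` reaches agreement on its own, since both converge to multiples of the right Perron eigenvector of `B`, yet their ratio `z_ps` agrees on the average of the initial values.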
III-B2 Subgradient-Push
To solve Problem P1 over arbitrary directed graphs, Refs. [29, 30] develop subgradient-push with the following iterations:
$$v_{k+1}^i=\sum_{j=1}^n b_{ij}v_k^j,\tag{5a}$$
$$\mathbf{x}_{k+1}^i=\sum_{j=1}^n b_{ij}\mathbf{x}_k^j-\alpha_k\nabla f_i(\mathbf{z}_k^i),\tag{5b}$$
$$\mathbf{z}_{k+1}^i=\frac{\mathbf{x}_{k+1}^i}{v_{k+1}^i},\tag{5c}$$
initialized with $v_0^i=1$ and an arbitrary $\mathbf{x}_0^i$ at each agent. The stepsize, $\alpha_k$, satisfies the same conditions as in DGD. To understand these iterations, note that Eqs. (5a)-(5c) are nearly the same as Eqs. (4a)-(4c), except that there is an additional gradient term in Eq. (5b), which drives the limit of $\mathbf{z}_k^i$ to $\mathbf{x}^*$. Under the Assumptions A2, A3 and A6, subgradient-push converges to $\mathbf{x}^*$ at the rate of $O(\frac{\ln k}{\sqrt{k}})$. For extensions of subgradient-push to asynchronous networks, see recent work [33, 34, 35]. We next describe an algorithm that significantly improves this convergence rate.
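III-B3 ADD-OPT/Push-DIGing
ADD-OPT [38], extended to time-varying graphs in Push-DIGing [39], is a fast algorithm over directed graphs, which converges at a linear rate to $\mathbf{x}^*$ under the Assumptions A2, A4, and A6, in contrast to the sublinear convergence of subgradient-push. The three vectors, $\mathbf{x}_k^i$, $\mathbf{z}_k^i$, $\mathbf{y}_k^i$, and a scalar $v_k^i$ maintained at each agent $i$, are updated as follows:
$$v_{k+1}^i=\sum_{j=1}^n b_{ij}v_k^j,\tag{6a}$$
$$\mathbf{x}_{k+1}^i=\sum_{j=1}^n b_{ij}\mathbf{x}_k^j-\alpha\mathbf{y}_k^i,\tag{6b}$$
$$\mathbf{z}_{k+1}^i=\frac{\mathbf{x}_{k+1}^i}{v_{k+1}^i},\tag{6c}$$
$$\mathbf{y}_{k+1}^i=\sum_{j=1}^n b_{ij}\mathbf{y}_k^j+\nabla f_i(\mathbf{z}_{k+1}^i)-\nabla f_i(\mathbf{z}_k^i),\tag{6d}$$
where each agent is initialized with $v_0^i=1$, $\mathbf{y}_0^i=\nabla f_i(\mathbf{x}_0^i)$, and an arbitrary $\mathbf{x}_0^i$. We note here that ADD-OPT/Push-DIGing essentially applies push-sum to the algorithm in Eqs. (2a)-(2b), when the doubly-stochastic weights therein are replaced by column-stochastic weights.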
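The combination of push-sum with gradient tracking described above can be sketched as follows. The update order (weights, estimates, ratio, tracker), the hypothetical objectives $f_i(x)=\frac{1}{2}(x-b_i)^2$, the 3-node digraph, and the stepsize $\alpha=0.05$ are assumptions made for illustration only.

```python
# Hypothetical toy problem: f_i(x) = 0.5 * (x - b_i)^2, so x* = mean(b).
b = [1.0, 2.0, 6.0]
x_star = sum(b) / len(b)

# Column-stochastic weights on an assumed 3-node digraph (needs out-degrees).
B = [[1 / 3, 0.0, 1 / 2],
     [1 / 3, 1 / 2, 0.0],
     [1 / 3, 1 / 2, 1 / 2]]


def grad(i, x):
    """Gradient of the assumed local objective f_i."""
    return x - b[i]


def mix(M, u):
    """Weighted combination of the agents' scalars, i.e. the product M u."""
    return [sum(M[i][j] * u[j] for j in range(len(u))) for i in range(len(u))]


def add_opt(alpha, num_iters):
    """Push-sum plus gradient tracking with a constant stepsize."""
    n = len(b)
    v = [1.0] * n
    x = [0.0] * n
    z = [x[i] / v[i] for i in range(n)]
    y = [grad(i, z[i]) for i in range(n)]
    for _ in range(num_iters):
        v_new = mix(B, v)                                         # weights
        Bx = mix(B, x)
        x_new = [Bx[i] - alpha * y[i] for i in range(n)]          # estimates
        z_new = [x_new[i] / v_new[i] for i in range(n)]           # ratio
        By = mix(B, y)
        y = [By[i] + grad(i, z_new[i]) - grad(i, z[i]) for i in range(n)]
        v, x, z = v_new, x_new, z_new
    return z


z_addopt = add_opt(alpha=0.05, num_iters=1000)
addopt_error = max(abs(zi - x_star) for zi in z_addopt)
```

The division by `v_new` is the nonlinear eigenvector-estimation step that the later algorithms in this section avoid or modify.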
III-B4 The $\mathcal{AB}$ algorithm
As we can see, subgradient-push and ADD-OPT/Push-DIGing, described before, have a nonlinear term that comes from the division by the eigenvector estimation. In contrast, the $\mathcal{AB}$ algorithm, introduced in [41] and extended to $\mathcal{AB}m$ with the addition of a heavy-ball momentum term in [42] and to time-varying graphs in [54], removes this nonlinearity and remains applicable to directed graphs by a simultaneous application of row- and column-stochastic weights. (See [31, 32] for related work with sublinear rate based on surplus consensus [55].) Each agent $i$ maintains two variables: $\mathbf{x}_k^i$, $\mathbf{y}_k^i\in\mathbb{R}^p$, where, as before, $\mathbf{x}_k^i$ is the estimate of $\mathbf{x}^*$, and $\mathbf{y}_k^i$ tracks the average gradient, $\frac{1}{n}\sum_{i=1}^n\nabla f_i(\mathbf{x}_k^i)$. The $\mathcal{AB}$ algorithm, initialized with $\mathbf{y}_0^i=\nabla f_i(\mathbf{x}_0^i)$ and arbitrary $\mathbf{x}_0^i$ at each agent, performs the following iterations:
$$\mathbf{x}_{k+1}^i=\sum_{j=1}^n a_{ij}\mathbf{x}_k^j-\alpha\mathbf{y}_k^i,\tag{7a}$$
$$\mathbf{y}_{k+1}^i=\sum_{j=1}^n b_{ij}\mathbf{y}_k^j+\nabla f_i(\mathbf{x}_{k+1}^i)-\nabla f_i(\mathbf{x}_k^i),\tag{7b}$$
where $\underline{A}=\{a_{ij}\}$ is row-stochastic and $\underline{B}=\{b_{ij}\}$ is column-stochastic. It is shown that $\mathcal{AB}$ converges linearly to $\mathbf{x}^*$ for sufficiently small stepsizes under the Assumptions A2, A4 and A6 [41]. Therefore, $\mathcal{AB}$ can be viewed as a generalization of the algorithm in Eqs. (2a)-(2b) as the doubly-stochastic weights therein are replaced by row- and column-stochastic weights. Furthermore, it is shown in [42] that ADD-OPT/Push-DIGing in Eqs. (6a)-(6d) in fact can be derived from an equivalent form of $\mathcal{AB}$ after a state transformation on the $\mathbf{x}_k$-update; see [42] for details. For applications of the $\mathcal{AB}$ algorithm to distributed least squares, see, for instance, [56].
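A minimal sketch of the simultaneous use of row- and column-stochastic weights follows. The update form (row-stochastic mixing of the estimates, column-stochastic mixing of the gradient trackers), the hypothetical objectives $f_i(x)=\frac{1}{2}(x-b_i)^2$, the 3-node digraph, and the stepsize $\alpha=0.05$ are illustrative assumptions.

```python
# Hypothetical toy problem: f_i(x) = 0.5 * (x - b_i)^2, so x* = mean(b).
b = [1.0, 2.0, 6.0]
x_star = sum(b) / len(b)

# Same assumed digraph for both matrices (entry [i][j] > 0 iff j sends to i).
A = [[1 / 2, 0.0, 1 / 2],       # row-stochastic: rows sum to 1
     [1 / 2, 1 / 2, 0.0],
     [1 / 3, 1 / 3, 1 / 3]]
B = [[1 / 3, 0.0, 1 / 2],       # column-stochastic: columns sum to 1
     [1 / 3, 1 / 2, 0.0],
     [1 / 3, 1 / 2, 1 / 2]]


def grad(i, x):
    """Gradient of the assumed local objective f_i."""
    return x - b[i]


def ab(alpha, num_iters):
    """Row-stochastic mixing for x, column-stochastic mixing for y."""
    n = len(b)
    x = [0.0] * n
    y = [grad(i, x[i]) for i in range(n)]  # y_0^i = grad f_i(x_0^i)
    for _ in range(num_iters):
        x_new = [sum(A[i][j] * x[j] for j in range(n)) - alpha * y[i]
                 for i in range(n)]
        y = [sum(B[i][j] * y[j] for j in range(n))
             + grad(i, x_new[i]) - grad(i, x[i])
             for i in range(n)]
        x = x_new
    return x


x_ab = ab(alpha=0.05, num_iters=1000)
ab_error = max(abs(xi - x_star) for xi in x_ab)
```

Note that no division appears in the loop; the price is that building `B` still requires each agent to know its out-degree.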
IV Algorithms using Row-stochastic Weights
All of the aforementioned methods require each agent to know at least its out-degree in the network in order to construct doubly- or column-stochastic weights. This requirement may be infeasible, e.g., when agents use broadcast-based communication protocols. Row-stochastic weights, on the other hand, are easier to implement in a distributed manner as every agent locally assigns an appropriate weight to each incoming variable from its in-neighbors. In this section, we describe the main contribution of this paper, i.e., a fast optimization algorithm that uses only row-stochastic weights and uncoordinated stepsizes.
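As a sketch of why row-stochastic weights are easy to build locally, the hypothetical helper below assigns uniform weights over each agent's in-neighbors; each row uses only information available at the receiving agent. The uniform rule and the 3-node in-neighbor lists are assumptions chosen for illustration, and any positive weights summing to one per row would also qualify.

```python
def row_stochastic_weights(in_neighbors):
    """Build A with a_ij = 1 / |N_i^in| for j in N_i^in (agent i included)."""
    n = len(in_neighbors)
    A = [[0.0] * n for _ in range(n)]
    for i, senders in enumerate(in_neighbors):
        for j in senders:
            A[i][j] = 1.0 / len(senders)   # decided by receiver i alone
    return A


# Assumed digraph: agent 0 hears from {0, 2}, agent 1 from {0, 1},
# agent 2 from {0, 1, 2}.
in_neighbors = [{0, 2}, {0, 1}, {0, 1, 2}]
A = row_stochastic_weights(in_neighbors)
row_sums = [sum(row) for row in A]
col_sums = [sum(A[i][j] for i in range(3)) for j in range(3)]
```

Here every entry of `row_sums` equals one, while `col_sums` does not, so the resulting matrix is row-stochastic but not column-stochastic.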
To motivate the proposed algorithm, we first consider DGD in Eq. (1) over directed graphs when the weight matrix in DGD is chosen to be row-stochastic, but not column-stochastic. From consensus arguments and the fact that the stepsize $\alpha_k$ goes to $0$, it can be verified that the agents achieve agreement. However, this agreement is not on the optimal solution. This can be shown [31] by defining an accumulation state, $\widehat{\mathbf{x}}_k=\sum_{i=1}^n[\boldsymbol{\pi}_r]_i\mathbf{x}_k^i$, where $\boldsymbol{\pi}_r$ is the left Perron eigenvector of the row-stochastic weight matrix, to obtain
$$\widehat{\mathbf{x}}_{k+1}=\widehat{\mathbf{x}}_k-\alpha_k\sum_{i=1}^n[\boldsymbol{\pi}_r]_i\nabla f_i(\mathbf{x}_k^i).\tag{8}$$
It can be verified that the agents agree to the limit of the above iteration, which is suboptimal since this iteration minimizes a weighted sum of the objective functions and not the sum. This argument leads to a modification of Eq. (8) that cancels the imbalance in the gradient term caused by the fact that $\boldsymbol{\pi}_r$ is not a vector of all $1$'s, a consequence of losing the column-stochasticity in the weight matrix. The modification, introduced in [43], is implemented as follows:
$$\mathbf{y}_{k+1}^i=\sum_{j=1}^n a_{ij}\mathbf{y}_k^j,\tag{9a}$$
$$\mathbf{x}_{k+1}^i=\sum_{j=1}^n a_{ij}\mathbf{x}_k^j-\alpha_k\frac{\nabla f_i(\mathbf{x}_k^i)}{[\mathbf{y}_k^i]_i},\tag{9b}$$
where $\underline{A}=\{a_{ij}\}$ is row-stochastic and the algorithm is initialized with $\mathbf{y}_0^i=\mathbf{e}_i$ and an arbitrary $\mathbf{x}_0^i$ at each agent. Eq. (9a) asymptotically learns the left Perron eigenvector of the row-stochastic weight matrix $\underline{A}$, i.e., $\lim_{k\rightarrow\infty}\mathbf{y}_k^i=\boldsymbol{\pi}_r,\forall i$. The above algorithm achieves a sublinear convergence rate of $O(\frac{\ln k}{\sqrt{k}})$ under the Assumptions A2, A3, and A5; see [43] for details.
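The eigenvector-learning step in Eq. (9a) can be sketched on its own. The assumption here is that each agent $i$ starts from the unit vector $\mathbf{e}_i$ (which is why unique identifiers are needed) and repeatedly averages its in-neighbors' vectors; the 3-node row-stochastic matrix is illustrative, and its left Perron eigenvector works out by hand to $[4/9, 2/9, 1/3]$.

```python
# Assumed row-stochastic weights on a 3-node digraph.
A = [[1 / 2, 0.0, 1 / 2],
     [1 / 2, 1 / 2, 0.0],
     [1 / 3, 1 / 3, 1 / 3]]


def learn_left_perron(A, num_iters):
    """Iterate y^i <- sum_j a_ij y^j from y^i = e_i; row i of Y is agent i's y^i."""
    n = len(A)
    Y = [[1.0 if i == m else 0.0 for m in range(n)] for i in range(n)]
    for _ in range(num_iters):
        Y = [[sum(A[i][j] * Y[j][m] for j in range(n)) for m in range(n)]
             for i in range(n)]
    return Y


Y = learn_left_perron(A, num_iters=100)
pi_r = [4 / 9, 2 / 9, 1 / 3]
# Agent i only needs its own i-th entry to rescale its gradient.
own_entries = [Y[i][i] for i in range(3)]
```

In this sketch every row of `Y` approaches `pi_r`, and `own_entries` holds the scalars by which the gradients are divided to cancel the imbalance.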
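IV-A FROST (Fast Row-stochastic Optimization with uncoordinated STepsizes)
Based on the insights that gradient tracking and constant stepsizes provide exact and fast linear convergence, we now describe FROST that adds gradient tracking to the algorithm in Eqs. (9a)-(9b) while keeping constant but uncoordinated stepsizes at the agents. Each agent $i$ at the $k$th iteration maintains three variables, $\mathbf{x}_k^i$, $\mathbf{z}_k^i\in\mathbb{R}^p$, and $\mathbf{y}_k^i\in\mathbb{R}^n$. At the $(k+1)$th iteration, agent $i$ performs the following update:
$$\mathbf{y}_{k+1}^i=\sum_{j=1}^n a_{ij}\mathbf{y}_k^j,\tag{10a}$$
$$\mathbf{x}_{k+1}^i=\sum_{j=1}^n a_{ij}\mathbf{x}_k^j-\alpha_i\mathbf{z}_k^i,\tag{10b}$$
$$\mathbf{z}_{k+1}^i=\sum_{j=1}^n a_{ij}\mathbf{z}_k^j+\frac{\nabla f_i(\mathbf{x}_{k+1}^i)}{[\mathbf{y}_{k+1}^i]_i}-\frac{\nabla f_i(\mathbf{x}_k^i)}{[\mathbf{y}_k^i]_i},\tag{10c}$$
where $\alpha_i$'s are the uncoordinated stepsizes locally chosen at each agent and the row-stochastic weights, $\underline{A}=\{a_{ij}\}$, respect the graph topology such that:
$$a_{ij}\begin{cases}>0,& j\in\mathcal{N}_i^{\mathrm{in}},\\=0,&\text{otherwise},\end{cases}\qquad\sum_{j=1}^n a_{ij}=1,\ \forall i.$$
The algorithm is initialized with an arbitrary $\mathbf{x}_0^i$, $\mathbf{y}_0^i=\mathbf{e}_i$, and $\mathbf{z}_0^i=\nabla f_i(\mathbf{x}_0^i)$. We point out that the initial condition for Eq. (10a) and the divisions in Eq. (10c) require each agent to have a unique identifier. Clearly, Assumption A5 is applicable here. Note that Eq. (10c) is a modified gradient tracking update, first applied to optimization with row-stochastic weights in [44], where the divisions are used to eliminate the imbalance caused by the left Perron eigenvector of the (row-stochastic) weight matrix $\underline{A}$. We note that the algorithm in [44] requires identical stepsizes at the agents and thus is a special case of Eqs. (10a)-(10c).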
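A minimal sketch of the FROST updates in Eqs. (10a)-(10c) follows, read here as: learn the left Perron eigenvector, mix estimates and step along the tracker with a local stepsize, and track gradients rescaled by each agent's own eigenvector entry. The hypothetical objectives $f_i(x)=\frac{1}{2}(x-b_i)^2$, the 3-node digraph with uniform in-neighbor weights, and the stepsizes are illustrative assumptions, not values from the paper.

```python
# Hypothetical toy problem: f_i(x) = 0.5 * (x - b_i)^2, so x* = mean(b).
b = [1.0, 2.0, 6.0]
x_star = sum(b) / len(b)

# Row-stochastic weights built from in-neighbors only (no out-degrees).
A = [[1 / 2, 0.0, 1 / 2],
     [1 / 2, 1 / 2, 0.0],
     [1 / 3, 1 / 3, 1 / 3]]

# Uncoordinated stepsizes: each agent picks its own small value.
alphas = [0.02, 0.015, 0.01]


def grad(i, x):
    """Gradient of the assumed local objective f_i."""
    return x - b[i]


def frost(num_iters):
    """Run the three FROST updates on scalar estimates."""
    n = len(b)
    x = [0.0] * n
    Y = [[1.0 if i == m else 0.0 for m in range(n)] for i in range(n)]  # y_0^i = e_i
    z = [grad(i, x[i]) for i in range(n)]                               # z_0^i
    for _ in range(num_iters):
        # Eigenvector learning: row i of Y is agent i's vector y^i.
        Y_new = [[sum(A[i][j] * Y[j][m] for j in range(n)) for m in range(n)]
                 for i in range(n)]
        # Mixing plus a step along the tracker with the local stepsize.
        x_new = [sum(A[i][j] * x[j] for j in range(n)) - alphas[i] * z[i]
                 for i in range(n)]
        # Gradient tracking with gradients divided by [y^i]_i.
        z = [sum(A[i][j] * z[j] for j in range(n))
             + grad(i, x_new[i]) / Y_new[i][i] - grad(i, x[i]) / Y[i][i]
             for i in range(n)]
        x, Y = x_new, Y_new
    return x


x_frost = frost(num_iters=3000)
frost_error = max(abs(xi - x_star) for xi in x_frost)
```

The diagonal entries `Y[i][i]` stay positive because the weights have positive self-loops, so the divisions are well defined at every iteration.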
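For analysis purposes, we write Eqs. (10a)-(10c) in a compact vector-matrix form. To this aim, we introduce some notation as follows: let $\mathbf{x}_k$, $\mathbf{z}_k$, and $\nabla\mathbf{f}(\mathbf{x}_k)$ collect the local variables $\mathbf{x}_k^i$, $\mathbf{z}_k^i$, and $\nabla f_i(\mathbf{x}_k^i)$ in a vector in $\mathbb{R}^{np}$, respectively, and define
$$\underline{Y}_k=[\mathbf{y}_k^1,\cdots,\mathbf{y}_k^n]^\top,\quad Y_k=\underline{Y}_k\otimes I_p,\quad\widetilde{Y}_k=\mathrm{diag}(Y_k),\quad A=\underline{A}\otimes I_p,\quad\boldsymbol{\alpha}=[\alpha_1,\cdots,\alpha_n]^\top,\quad D=\mathrm{diag}\{\boldsymbol{\alpha}\}\otimes I_p.$$
Since the weight matrix $\underline{A}$ is primitive with positive diagonals, it is straightforward to verify that $\widetilde{Y}_k$ is invertible for any $k$. Based on the notation above, Eqs. (10a)-(10c) can be written compactly as follows:
$$Y_{k+1}=AY_k,\tag{11a}$$
$$\mathbf{x}_{k+1}=A\mathbf{x}_k-D\mathbf{z}_k,\tag{11b}$$
$$\mathbf{z}_{k+1}=A\mathbf{z}_k+\widetilde{Y}_{k+1}^{-1}\nabla\mathbf{f}(\mathbf{x}_{k+1})-\widetilde{Y}_k^{-1}\nabla\mathbf{f}(\mathbf{x}_k),\tag{11c}$$
where $Y_0=I_{np}$, $\mathbf{z}_0=\nabla\mathbf{f}(\mathbf{x}_0)$, and $\mathbf{x}_0$ is arbitrary. We emphasize that the implementation of FROST needs no knowledge of the agents' out-degrees anywhere in the network, in contrast to the earlier related work in [30, 29, 31, 32, 36, 38, 39, 41, 42]. Note that Refs. [21, 22] also use row-stochastic weights but require an additional locally-balanced assumption and are only applicable to undirected graphs.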
V Convergence Analysis
In this section, we present the convergence analysis of FROST described in Eqs. (11a)(11c). We first define a few additional variables as follows:
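Since $\underline{A}$ is primitive and row-stochastic, from the Perron-Frobenius theorem [49], we note that $Y_\infty=(\mathbf{1}_n\boldsymbol{\pi}_r^\top)\otimes I_p$, where $\boldsymbol{\pi}_r$ is the left Perron eigenvector of $\underline{A}$.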
V-A Auxiliary relations
We now start the convergence analysis with a key lemma regarding the contraction of the augmented weight matrix under an arbitrary norm.
Lemma 1.
Let Assumption A2 hold and consider the augmented weight matrix . There exists a vector norm, , such that ,
where is some constant.
Proof.
As shown above, the existence of a norm in which the consensus process with row-stochastic matrix is a contraction does not follow the standard norm argument for doubly-stochastic matrices [27, 39]. The ensuing arguments built on this notion of contraction under arbitrary norms were first introduced in [38] for column-stochastic weights and in [44] for row-stochastic weights; these arguments are harmonized later to hold simultaneously for both row- and column-stochastic weights in [41, 42]. The next lemma, a direct consequence of the contraction introduced in Lemma 1, is a standard result from consensus and Markov chain theory [57].
Lemma 2.
Consider , generated from the weight matrix . We have:
where is some positive constant and is the contraction factor defined in Lemma 1.
Proof.
As a consequence of Lemma 2, we next establish the linear convergence of the sequences and .
Lemma 3.
The following inequalities hold : (a) ; (b) .
Proof.
The proof of (a) is as follows:
where the last inequality uses Lemma 2 and the fact that . The result in (b) is straightforward by applying (a), i.e.,
which completes the proof. ∎
The next lemma presents the dynamics that govern the evolution of the weighted sum of ; recall that , in Eq. (11c), asymptotically tracks the average of local gradients, .
Lemma 4.
The following equation holds for all :
(12) 
Proof.
Recall that . We obtain from Eq. (11c) that
Doing this iteratively, we have that
With the initial conditions that and , we complete the proof. ∎
The next lemma, a standard result in convex optimization theory from [58], states that the distance to the optimal solution contracts in each step in the centralized gradient method.
Lemma 5.
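Let $\mu$ and $l$ be the strong-convexity and Lipschitz-continuity constants for the global objective function, $F(\mathbf{x})$, respectively. Then $\forall\mathbf{x}\in\mathbb{R}^p$ and $0<\alpha<\frac{2}{l}$, we have
$$\|\mathbf{x}-\alpha\nabla F(\mathbf{x})-\mathbf{x}^*\|_2\leq\sigma_F\|\mathbf{x}-\mathbf{x}^*\|_2,$$
where $\sigma_F=\max(|1-\alpha\mu|,|1-\alpha l|)$.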
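As a numerical sanity check of the contraction in Lemma 5, the sketch below assumes its standard form: a gradient step with $0<\alpha<2/l$ shrinks the distance to the minimizer by at least the factor $\max(|1-\alpha\mu|,|1-\alpha l|)$. The two-dimensional quadratic $F(\mathbf{x})=\frac{1}{2}(\mu x_1^2+l x_2^2)$ with $\mu=1$, $l=4$, the stepsize $0.3$, and the test points are hypothetical choices.

```python
import math

MU, L = 1.0, 4.0   # assumed strong-convexity and Lipschitz constants


def grad_F(x):
    """Gradient of F(x) = 0.5 * (MU * x1^2 + L * x2^2); the minimizer is 0."""
    return (MU * x[0], L * x[1])


def contraction_holds(x, alpha, tol=1e-12):
    """Check ||x - alpha * grad F(x) - x*|| <= sigma * ||x - x*|| with x* = 0."""
    sigma = max(abs(1 - alpha * MU), abs(1 - alpha * L))
    g = grad_F(x)
    lhs = math.hypot(x[0] - alpha * g[0], x[1] - alpha * g[1])
    rhs = sigma * math.hypot(x[0], x[1])
    return lhs <= rhs + tol


points = [(1.0, 0.0), (0.0, 1.0), (3.0, -2.0), (-0.5, 7.0)]
lemma5_ok = all(contraction_holds(p, alpha=0.3) for p in points)
sigma_example = max(abs(1 - 0.3 * MU), abs(1 - 0.3 * L))
```

For this quadratic the bound is tight along the first coordinate axis, where the step scales the point by exactly $1-\alpha\mu$.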
With the help of the previous lemmas, we are ready to derive a crucial contraction relationship in the proposed algorithm.
V-B Contraction relationship
Our strategy to show convergence is to bound , , and as a linear function of their values in the last iteration and ; this approach extends the work in [27] on doubly-stochastic weights to row-stochastic weights. We will present this relationship in the next lemmas. Before we proceed, we note that since all vector norms are equivalent in , there exist positive constants such that:
First, we derive a bound for , the consensus error of the agents.
Lemma 6.
The following inequality holds, :
(13) 
where is the equivalence-norm constant such that and is the largest stepsize among the agents.
Next, we derive a bound for , i.e., the optimality gap between the accumulation state of the network, , and the optimal solution, .
Lemma 7.
If , the following inequality holds, :
(14) 
where and is the equivalence-norm constant such that .
Proof.
Recalling that and , we have the following:
(15) 
Since the last term in the inequality above matches the second last term in Eq. (14), we only need to handle the first term. We further note that:
Now, we derive an upper bound for the first term in Eq. (VB),
(16) 
If , according to Lemma 5,
(17) 
where . Next we derive a bound for .
(18) 
where it is straightforward to bound as
(19) 
Since and from Lemma 4, we have:
(20) 
where we use Lemma 3. Combining Eqs. (VB)(20), we finish the proof. ∎
Next, we bound , the error in gradient estimation.
Lemma 8.
The following inequality holds, :