
# Distributed Regularized Dual Gradient Algorithm for Constrained Convex Optimization over Time-Varying Directed Graphs

Chuanye Gu, Zhiyou Wu, Jueyou Li
School of Mathematical Sciences, Chongqing Normal University, Chongqing, 400047, China
###### Abstract

We investigate a distributed optimization problem over a cooperative multi-agent time-varying network, where each agent has its own decision variables that should be set so as to minimize its individual objective subject to local constraints and global coupling constraints. Based on the push-sum protocol and dual decomposition, we design a distributed regularized dual gradient algorithm to solve this problem. The algorithm operates over time-varying directed graphs, requiring only column stochasticity of the communication matrices. By augmenting the corresponding Lagrangian function with a quadratic regularization term, we first obtain a bound on the Lagrange multipliers which does not require constructing a compact set containing the dual optimal set, in contrast with most primal-dual based methods. Then, we show that the convergence rate of the proposed method achieves the order of $O(\ln T/T)$ for strongly convex objective functions, where $T$ is the number of iterations. Moreover, an explicit bound on the constraint violations is also given. Finally, numerical results on the network utility maximization problem are used to demonstrate the efficiency of the proposed algorithm.

###### keywords:
Convex optimization, Distributed algorithm, Dual decomposition, Regularization, Multi-agent network.

## 1 Introduction

In recent years, there has been unprecedented growth in research on solving optimization problems over multi-agent networks nedic2009distributed (); nedic2010constaints (); jakovetic2014fast (); nedic2015distributed (). Distributed optimization arises in many application domains, such as the distributed finite-time optimal rendezvous problem johansson2008subgradient (), wireless and social networks baingana2014proximalgradient (); mateos2012distributed (), power systems bolognani2015distributed (); zhangdistributed2016 (), robotics martinea2007on (), and so on. There is indeed a long history of this problem in the optimization community; see tsitsiklis1986distributed ().

Based on consensus schemes, there are mainly three categories of algorithms designed for distributed optimization in the literature: primal consensus distributed algorithms, dual consensus distributed algorithms, and primal-dual consensus distributed algorithms; see nedic2009distributed (); ram2010distributed (); c2012dual (); zhuon2012 (); li2015gradient (); lorenzo2016next (). In most previous works, the communication graphs are required to be balanced, i.e., the communication weight matrices are doubly stochastic. The paper gharesifard2012distributed () considered a fixed directed graph, still with the requirement of a balanced graph. The work in iconsensus2012 () proposed distributed subgradient based algorithms for directed and fixed topologies, in which the messages among agents are propagated by the "push-sum" protocol. However, the communication protocol is required to know the number of agents or the graph. In general, the push-sum protocol is attractive for implementations since it can easily operate over directed communication topologies, and thus avoids incidents of deadlock that may occur in practice when using undirected communication topologies nedic2015distributed (). Nedić et al. in nedic2015distributed () designed a subgradient-push distributed method for a class of unconstrained optimization problems, in which the requirement of a balanced graph was removed, although the resulting convergence rate is relatively slow. Later, Nedić et al. in nedic2015stochastic () improved the convergence rate under the condition of strong convexity. However, they only considered unconstrained optimization problems.

The methods for solving distributed optimization problems subject to equality and/or inequality constraints have received considerable attention bertsekas2003convex (); necoara2008application (); li2016a (). The authors in zhuon2012 () first proposed a distributed Lagrangian primal-dual subgradient method by characterizing the primal-dual optimal solutions as the saddle points of the Lagrangian function of the problem under consideration. The work yuan2011distributed () developed a variant of the distributed primal-dual subgradient method by introducing a multistep consensus mechanism. For the more general distributed optimization problem with inequality constraints that couple all the agents' decision variables, Chang et al. chang2014distributed () designed a novel distributed primal-dual perturbed subgradient method and analyzed its convergence. The implementation of the aforementioned algorithms usually involves projections onto primal and dual constraint sets, respectively. In particular, they require constructing a compact set that contains the dual optimal set, and projecting the dual variable onto this set to guarantee the boundedness of the dual iterates, which is important for establishing the convergence of the algorithms. However, the construction of this compact set is impractical since it requires each agent to solve a general constrained convex problem Yuan2016Regularized (); Khuzani2016Distributed (). To ensure the boundedness of the norm of the dual variables, Yuan et al. in Yuan2016Regularized () proposed a regularized primal-dual distributed algorithm; however, the optimization problem there includes only one constraint. Later, Khuzani et al. in Khuzani2016Distributed () investigated distributed optimization with several inequality constraints, and established the convergence of their proposed distributed deterministic and stochastic primal-dual algorithms, respectively. Very recently, Falsone et al. falsone2016dual () designed a dual decomposition based distributed method for solving a separable convex optimization problem with coupled inequality constraints and provided a convergence analysis, but no explicit convergence rate of their algorithm was given. Most of the aforementioned works operate over undirected networks, where the use of doubly stochastic matrices is possible. However, for directed graphs, relying on doubly stochastic matrices may be undesirable for a variety of reasons; see nedic2015distributed (); nedic2015stochastic ().

In this paper, we propose a distributed regularized dual gradient method for solving a convex optimization problem subject to local and coupling constraints over time-varying directed networks. The proposed method is based on the push-sum protocol. Each agent is only required to know its out-degree at each time, without requiring knowledge of either the number of agents or the graph sequence. By augmenting the corresponding Lagrangian function with a quadratic regularization term, the norm of the multipliers is bounded, without constructing a compact set containing the dual optimal set as most existing primal-dual methods do. The convergence rate of the method, of order $O(\ln T/T)$ for strongly convex objective functions, is obtained. Moreover, an explicit bound on the constraint violations is also provided.

The main contributions of this paper are twofold. Firstly, we establish an upper bound on the norm of the dual variables by resorting to the regularized Lagrangian function. Secondly, we obtain explicit convergence rates of the proposed method over directed unbalanced networks. The work in this paper is related to the recent works nedic2015stochastic () and falsone2016dual (). The reference nedic2015stochastic () addresses unconstrained distributed optimization over time-varying directed networks, while our paper investigates distributed optimization with coupling equality constraints. Our method can be viewed as an extension of the push-sum based algorithms of nedic2015stochastic () to a constrained setting. Compared with the method in falsone2016dual (), our distributed algorithm is inspired by the push-sum strategy over time-varying directed networks and does not require balanced network graphs, whereas the method in falsone2016dual () requires that the graphs be balanced and the communication matrices be doubly stochastic. In falsone2016dual (), the authors only establish the convergence of their approach; in this paper, we obtain explicit convergence rates of the proposed method in the time-varying directed network topology. More importantly, we further give an explicit convergence estimate on the constraint violations. The regularized primal-dual distributed methods proposed in Yuan2016Regularized (); Khuzani2016Distributed () require that the networks be undirected and the communication weight matrices be doubly stochastic, whereas our method can deal with distributed optimization problems over time-varying directed graphs, needing only column stochastic matrices.

The remainder of this paper is organized as follows. In Section 2, we state the related problem, useful assumptions and preparatory work. In Section 3, we propose the distributed regularized dual gradient algorithm and state the main results. In Section 4, we provide some auxiliary lemmas and the proofs of the main results. Numerical simulations are given in Section 5. Finally, Section 6 draws some conclusions.

Notation: We use boldface to distinguish vectors in $\mathbb{R}^p$ from scalars. For a matrix $W$, we use $(W)_{ij}$ to denote its $(i,j)$th entry. We use $\|\cdot\|$ to denote the Euclidean norm of a vector, and $\mathbf{1}$ for the vector of all ones. A convex function $f$ is $\tilde{\gamma}$-strongly convex with $\tilde{\gamma}>0$ if the following relation holds for all $x,y$:

$$f(x)-f(y)\ \geq\ g(y)^{\top}(x-y)+\frac{\tilde{\gamma}}{2}\|x-y\|^{2},$$

where $g(y)$ is any subgradient of $f$ at $y$.
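As a concrete instance of this definition (an illustration, not taken from the paper), the function $f(x)=\|x\|^{2}$ is $2$-strongly convex with gradient $g(y)=2y$, and the inequality above in fact holds with equality for this quadratic. The sketch below checks it numerically at random points.

```python
import numpy as np

# Numeric sanity check of the strong-convexity inequality for f(x) = ||x||^2,
# which is 2-strongly convex with gradient g(y) = 2y (hypothetical example).
rng = np.random.default_rng(1)
f = lambda v: np.dot(v, v)
gamma_tilde = 2.0
for _ in range(100):
    x, y = rng.normal(size=3), rng.normal(size=3)
    lhs = f(x) - f(y)
    rhs = 2 * y @ (x - y) + 0.5 * gamma_tilde * np.dot(x - y, x - y)
    assert lhs >= rhs - 1e-9   # holds with equality for this quadratic
```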

## 2 Distributed optimization problem with equality constraints

### 2.1 Constrained Multi-agent Optimization

Consider the following constrained optimization problem

$$\min_{\{x_i\in X_i\}_{i=1}^{m}}\ F(x):=\sum_{i=1}^{m}f_i(x_i)\quad \text{s.t.}\quad \sum_{i=1}^{m}\big(A_i x_i-b_i\big)=0,\qquad (1)$$

where there are $m$ agents associated with a time-varying network. Each agent $i$ only knows its own objective function $f_i:\mathbb{R}^{n_i}\to\mathbb{R}$ and its own constraint set $X_i\subseteq\mathbb{R}^{n_i}$, and all agents are subject to the coupling equality constraint $\sum_{i=1}^{m}(A_i x_i-b_i)=0$, with $A_i\in\mathbb{R}^{p\times n_i}$ and $b_i\in\mathbb{R}^{p}$. The stacked vector $x=(x_1^{\top},\ldots,x_m^{\top})^{\top}$ belongs to $X:=X_1\times\cdots\times X_m$.

Problem (1) is quite general and arises in diverse applications, for example, distributed model predictive control necoara2008application (), network utility maximization Low1999optimization (); beck2014an (), and economic dispatch problems for smart grids bolognani2015distributed (); zhangdistributed2016 ().

To decouple the coupling equality constraints, we introduce a regularized Lagrangian function of problem (1), given by

$$L(x,\lambda):=\sum_{i=1}^{m}\Big[f_i(x_i)+\lambda^{\top}(A_i x_i-b_i)-\frac{\gamma_i}{2}\lambda^{\top}\lambda\Big]=\sum_{i=1}^{m}L_i(x_i,\lambda),\qquad (2)$$

where $L_i(x_i,\lambda):=f_i(x_i)+\lambda^{\top}(A_i x_i-b_i)-\frac{\gamma_i}{2}\lambda^{\top}\lambda$ is the regularized Lagrangian function associated with the $i$th agent, and $\gamma_i>0$ is a regularization parameter, for $i=1,\ldots,m$.

Define a regularized dual function of problem (1) as follows

$$\phi(\lambda):=\min_{x\in X}L(x,\lambda).$$

Note that the regularized Lagrangian function defined in (2) is separable with respect to the $x_i$. Thus, the regularized dual function can be rewritten as

$$\phi(\lambda)=\sum_{i=1}^{m}\phi_i(\lambda)=\sum_{i=1}^{m}\min_{x_i\in X_i}L_i(x_i,\lambda),\qquad (3)$$

where $\phi_i(\lambda):=\min_{x_i\in X_i}L_i(x_i,\lambda)$ can be regarded as the regularized dual function of agent $i$.

Then, the regularized dual problem of problem (1) can be written as $\max_{\lambda}\phi(\lambda)$, or, equivalently,

$$\max_{\lambda}\ \sum_{i=1}^{m}\phi_i(\lambda).\qquad (4)$$
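To make the dual objects concrete, the following sketch uses hypothetical scalar data ($f_i(x)=x^{2}$, $A_i=2$, $b_i=1$, $\gamma_i=0.1$, with an unconstrained inner minimization standing in for a sufficiently large $X_i$). It evaluates $\phi_i$ through the inner minimizer $x_i^{*}(\lambda)$ and checks by finite differences that the gradient of $\phi_i$ at $\lambda$ equals $A_i x_i^{*}(\lambda)-b_i-\gamma_i\lambda$, which is a consequence of Danskin's theorem.

```python
import numpy as np

# Hypothetical instance: f_i(x) = x^2, A_i = 2, b_i = 1, gamma_i = 0.1.
# The inner minimizer of L_i(x, lam) is x*(lam) = -A_i*lam/2, and the
# gradient of phi_i is A_i*x*(lam) - b_i - gamma_i*lam (Danskin's theorem).
a, b, g = 2.0, 1.0, 0.1
x_star = lambda lam: -a * lam / 2.0
phi = lambda lam: x_star(lam)**2 + lam * (a * x_star(lam) - b) - 0.5 * g * lam**2
grad = lambda lam: a * x_star(lam) - b - g * lam

lam0, h = 0.7, 1e-6
fd = (phi(lam0 + h) - phi(lam0 - h)) / (2 * h)   # central finite difference
assert abs(fd - grad(lam0)) < 1e-5
```

This gradient expression is exactly the quantity that the dual update of the algorithm will use locally at each agent.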

The coupling among agents is now represented by the fact that $\lambda$ is a common decision vector, and all the agents should agree on its value.

### 2.2 Related assumptions

The following assumptions on problem (1) and on the time-varying communication network are needed to establish convergence properties of the proposed method.

###### Assumption 1

For each $i\in\{1,\ldots,m\}$, the function $f_i:\mathbb{R}^{n_i}\to\mathbb{R}$ is strongly convex, and the set $X_i$ is non-empty, convex and compact.

Note that, under Assumption 1, we have:

(i) the dual function $\phi$ defined in (3) is strongly concave and differentiable, and its gradient is Lipschitz continuous (see beck2014an (); li2016a () for more details);

(ii) for any $x_i\in X_i$, there is a constant $G_i>0$ such that $\|A_i x_i-b_i\|\leq G_i$, due to the compactness of $X_i$, for $i=1,\ldots,m$.

We assume that each agent can communicate with other agents over a time-varying network. The communication topology at time $t$ is modeled by a directed graph $G[t]=(V,E[t])$ over the vertex set $V=\{1,\ldots,m\}$ with the edge set $E[t]$. Let $N_i^{\mathrm{in}}[t]$ and $N_i^{\mathrm{out}}[t]$ represent the collections of in-neighbors and out-neighbors of agent $i$ at time $t$, respectively. That is,

$$N_i^{\mathrm{in}}[t]:=\{j\,|\,(j,i)\in E[t]\}\cup\{i\},\qquad N_i^{\mathrm{out}}[t]:=\{j\,|\,(i,j)\in E[t]\}\cup\{i\},$$

where $(i,j)\in E[t]$ means that agent $i$ may send its information to agent $j$. Let $d_i[t]$ be the out-degree of agent $i$, i.e.,

$$d_i[t]=|N_i^{\mathrm{out}}[t]|.$$

We introduce a time-varying communication weight matrix $W[t]$ with elements $(W[t])_{ij}$, defined by

$$(W[t])_{ij}=\begin{cases}\dfrac{1}{d_j[t]}, & \text{when } j\in N_i^{\mathrm{in}}[t],\ i,j=1,2,\ldots,m,\\[4pt] 0, & \text{otherwise}.\end{cases}\qquad (5)$$
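A small illustration of (5) on a hypothetical 4-agent directed graph (out-neighbor lists include the self-loop): each agent divides its mass by its own out-degree, so the resulting matrix is column-stochastic but, in general, not row-stochastic.

```python
import numpy as np

# One time step of the weight matrix (5) for a hypothetical 4-agent digraph.
# Column j holds 1/d_j at the rows of j's out-neighbors (self-loops included).
m = 4
out_neighbors = {0: [0, 1], 1: [1, 2, 3], 2: [2, 0], 3: [3, 0]}  # includes self
W = np.zeros((m, m))
for j, outs in out_neighbors.items():
    for i in outs:
        W[i, j] = 1.0 / len(outs)

assert np.allclose(W.sum(axis=0), 1.0)       # column-stochastic by construction
assert not np.allclose(W.sum(axis=1), 1.0)   # rows need not sum to 1 (unbalanced)
```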

We need the following assumption on the weight matrix $W[t]$, which can be found in nedic2015distributed (); nedic2010constaints ().

###### Assumption 2

i) Every agent $i$ knows its out-degree $d_i[t]$ at every time $t$;  ii) The graph sequence $\{G[t]\}$ is $B$-strongly connected, namely, there exists an integer $B>0$ such that, for every $l\geq 0$, the graph with edge set $\bigcup_{k=lB}^{(l+1)B-1}E[k]$ is strongly connected.

Note that the communication weight matrix $W[t]$ defined in (5) is column-stochastic. In this paper, we do not require double stochasticity of $W[t]$.

## 3 Algorithm and main results

### 3.1 Distributed regularized dual gradient algorithm

In general, problem (1) could be solved in a centralized manner. However, if the number of agents is large, this may turn out to be computationally challenging. Additionally, each agent would be required to share its own information, such as the objective $f_i$ and the constraints $X_i$, $A_i$ and $b_i$, either with the other agents or with a central coordinator collecting all the information, which is possibly undesirable in many cases due to privacy concerns.

To overcome both the computational and privacy issues stated above, we propose a Distributed Regularized Dual Gradient Algorithm (DRDGA, for short) that solves the regularized dual problem (4). The proposed DRDGA is motivated by the gradient push-sum method nedic2015distributed () and dual decomposition falsone2016dual (); li2016a (), and is described in Algorithm 1.

In Algorithm 1, each agent $i$ broadcasts (or pushes) its local quantities, scaled by its out-degree $d_i[t]$, to all of the agents in its out-neighborhood $N_i^{\mathrm{out}}[t]$. Then, each agent simply sums all the received messages to obtain $u_i[t+1]$ in step 4 and $\rho_i[t+1]$ in step 5, respectively. The update rules in steps 6-8 can be implemented locally. In particular, the update of the local primal vector in step 7 is performed by minimizing the local regularized Lagrangian with respect to $x_i$ evaluated at the current dual estimate, while the update of the dual vector in step 8 involves a gradient step for the maximization of the local regularized Lagrangian with respect to $\lambda$. Note that the term $A_i x_i[t+1]-b_i-\gamma_i\lambda_i[t+1]$ in step 8 is the gradient of $L_i(x_i[t+1],\cdot)$ at $\lambda_i[t+1]$.
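Since the listing of Algorithm 1 is not reproduced in this excerpt, the following Python sketch reconstructs the steps described above on a hypothetical toy instance: 3 agents with scalar decisions, $f_i(x)=(x-c_i)^2$, an interior minimizer standing in for the constrained step 7, a fixed strongly connected directed graph standing in for the time-varying sequence, and stepsize $\beta[t]=1/t$. The variable names mirror the text, but all details here are illustrative assumptions, not the authors' exact listing.

```python
import numpy as np

# Hedged sketch of DRDGA on a toy instance (assumptions, not the paper's listing):
# f_i(x) = (x - c_i)^2, coupling constraint sum_i (a_i x_i - b_i) = 0.
m = 3
c = np.array([1.0, 2.0, 3.0])
a = np.ones(m)
b = np.array([0.5, 0.5, 0.5])
gamma = 1e-3 * np.ones(m)                   # regularization parameters gamma_i
out = {0: [0, 1, 2], 1: [1, 2], 2: [2, 0]}  # out-neighbors incl. self; strongly connected
d = {j: float(len(out[j])) for j in out}    # out-degrees d_j

theta = np.zeros(m)                         # dual push-sum numerators theta_i
rho = np.ones(m)                            # push-sum weights rho_i, rho[0] = 1
for t in range(1, 3001):
    beta = 1.0 / t                          # stepsize beta[t] = q/t with q = 1
    u = np.zeros(m)
    rho_new = np.zeros(m)
    for j in out:                           # each j pushes theta_j/d_j and rho_j/d_j
        for i in out[j]:
            u[i] += theta[j] / d[j]         # step 4: sum received dual messages
            rho_new[i] += rho[j] / d[j]     # step 5: sum received weights
    rho = rho_new
    lam = u / rho                           # de-biased local dual estimate lambda_i
    x = c - 0.5 * a * lam                   # step 7: argmin_x (x - c_i)^2 + lam_i a_i x
    theta = u + beta * (a * x - b - gamma * lam)  # step 8: regularized dual gradient step

residual = np.sum(a * x - b)                # coupling-constraint violation
```

On this instance the unregularized optimal multiplier works out to $\lambda^{*}=3$; after a few thousand iterations the local dual estimates essentially agree and approach this value, and the coupling residual becomes small, illustrating the behavior the theorems below quantify.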

### 3.2 Statement of main results

In this section, we present the main convergence results for the proposed Algorithm 1.

It is shown in nedic2010constaints () that the local primal vector $x_i[t]$ does not, in general, converge to the optimal solution of problem (1). Compared to $x_i[t]$, however, the following recursive auxiliary primal iterates

$$\hat{x}_i[T]=\frac{\sum_{t=1}^{T}(t-1)\,x_i[t]}{T(T-1)/2},\quad \text{for all } T\geq 2,$$

can exhibit better convergence properties; see zhuon2012 (); chang2014distributed (); beck2014an (). Define the averaged iterate as $\hat{x}[T]=(\hat{x}_1[T]^{\top},\ldots,\hat{x}_m[T]^{\top})^{\top}$.
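The weighted average above can also be maintained recursively, without storing the whole history, via $\hat{x}_i[T]=\frac{T-2}{T}\hat{x}_i[T-1]+\frac{2}{T}x_i[T]$. A quick numerical check of this equivalence on hypothetical random iterates:

```python
import numpy as np

# Check the weighted average hat_x[T] = sum_{t=1}^T (t-1) x[t] / (T(T-1)/2)
# against its running-average form, which avoids storing the history.
rng = np.random.default_rng(2)
xs = rng.normal(size=(10, 4))              # x[1], ..., x[10] (hypothetical iterates)

T = len(xs)
direct = sum((t - 1) * xs[t - 1] for t in range(1, T + 1)) / (T * (T - 1) / 2)

running = xs[1].copy()                     # hat_x[2] = x[2]
for t in range(3, T + 1):                  # hat_x[t] = (t-2)/t * hat_x[t-1] + 2/t * x[t]
    running = (t - 2) / t * running + 2 / t * xs[t - 1]

assert np.allclose(direct, running)
```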

The following Theorem 1 first gives an upper bound on the norm of the dual variables. By controlling the norm of the dual variables, we in turn control the norm of the subgradients of the regularized Lagrangian function, which is instrumental in proving Theorems 2 and 3 below.

###### Theorem 1

Suppose that Assumptions 1 and 2 hold and that the stepsize sequence $\{\beta[t]\}$ is non-increasing with $\lim_{t\to\infty}\beta[t]=0$. Then, there is a positive constant $D$ such that, for all $i$,

$$\sup_{t}\|\lambda_i[t]\|\leq D.$$

In what follows, Theorem 2 gives the convergence rate of the primal function values under Assumptions 1 and 2.

###### Theorem 2

(Convergence rate) Suppose Assumptions 1 and 2 are satisfied and the stepsize is taken as $\beta[t]=q/t$, where the constant $q>0$ is suitably chosen. Then, for all $T\geq 2$ and $i=1,\ldots,m$, we have

$$F(\hat{x}_i[T])-F(x^{*})\ \leq\ \frac{3}{2T\delta}\sum_{i=1}^{m}(G_i+\gamma_i D)\Big(\frac{\eta}{1-\eta}\sum_{i=1}^{m}\|\theta_i[0]\|_1+\frac{qmB}{1-\eta}(1+\ln T)\Big)+\frac{q}{T}\sum_{i=1}^{m}(G_i+\gamma_i D)^{2},$$

where $D$ is the bound on the dual variables from Theorem 1, $\delta$ is the constant defined in (6) below, $B$ is the connectivity constant in Assumption 2, and $\eta\in(0,1)$ is a constant measuring the speed of information diffusion of the network (see nedic2015distributed ()).

Theorem 2 shows that the sequence of primal objective values converges to the optimal value at a rate of $O(\ln T/T)$, i.e.,

$$F(\hat{x}[T])-F(x^{*})=O\Big(\frac{\ln T}{T}\Big),$$

where the hidden constant depends on the regularization parameters $\gamma_i$, the bounds $D$ and $G_i$ on the dual variables and coupling constraints, the initial values at the agents, and on both the speed of the network information diffusion and the imbalance of influences among the agents.

In the next theorem, we give the upper bound on the constraint violation.

###### Theorem 3

(Constraint violation bound) Suppose Assumptions 1 and 2 are satisfied and the stepsize is taken as $\beta[t]=q/t$, where the constant $q>0$ is suitably chosen. Then, for all $T\geq 2$, we have

$$\Big\|\sum_{j=1}^{m}A_j\hat{x}_j[T]-b_j\Big\|^{2}\ \leq\ \frac{\gamma}{T\delta}\sum_{j=1}^{m}(G_j+\gamma_j D)\Big(\frac{8\eta}{1-\eta}\sum_{j=1}^{m}\|\mu_j[0]\|_1+\frac{8qmB}{1-\eta}(1+\ln T)\Big)+\frac{q\gamma}{4T}\sum_{j=1}^{m}(G_j+\gamma_j D)^{2},$$

where $D$, $\delta$, $\eta$, $q$ and $B$ are as in Theorem 2, and $\gamma>0$ is a constant from the analysis.

Theorem 3 shows that the constraint violation, measured by $\|\sum_{j=1}^{m}A_j\hat{x}_j[T]-b_j\|^{2}$, is of order $O(\ln T/T)$.

## 4 Proof of main results

Before proving the main results, we establish some useful auxiliary lemmas. The following Lemma 1 exploits the structure of strongly concave functions with Lipschitz continuous gradients; its proof is motivated by Lemma 3 in nedic2015distributed () and is omitted here.

###### Lemma 1

Let $h$ be a strongly concave function with Lipschitz continuous gradients, and let $\varphi$ be a mapping such that

$$\|\varphi(z)\|\leq c,\quad \forall z\in\mathbb{R}^{p}.$$

Let $y$ be defined by

$$y=z+\beta\big(\nabla h(z)+\varphi(z)\big),$$

where $\beta>0$. Then, there exist a compact set $V$ (which depends on $c$ and on the function $h$, but not on $z$) and a constant $R$ such that

$$\|y\|\leq\begin{cases}\|z\|, & \forall z\notin V,\\ R, & \forall z\in V.\end{cases}$$

Based on Lemma 1, we are ready to prove our Theorem 1.
Proof of Theorem 1. By step 5 of Algorithm 1, we have

$$\rho[t+1]=W[t]\rho[t],$$

where $\rho[t]$ is the vector with entries $\rho_i[t]$. Further, the above relation can be written recursively as follows:

$$\rho[t]=W[t-1]W[t-2]\cdots W[0]\mathbf{1},\quad \text{for all } t\geq 1,$$

where we use the fact that $\rho[0]=\mathbf{1}$. Under Assumption 2, by Corollary 2(b) in nedic2015distributed (), we have

$$\delta=\inf_{t=0,1,\ldots}\Big(\min_{1\leq i\leq m}\big(W[t]W[t-1]\cdots W[0]\mathbf{1}\big)_i\Big)>0.$$

Therefore, we can obtain

$$\rho_i[t]\geq\delta,\quad \text{for all } i \text{ and } t.\qquad (6)$$
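The positivity of $\delta$ in (6) can be observed numerically. The sketch below uses a fixed, hypothetical strongly connected graph with self-loops (standing in for a $B$-strongly-connected time-varying sequence) and tracks the smallest entry of $W[t-1]\cdots W[0]\mathbf{1}$ over many steps.

```python
import numpy as np

# Empirical look at delta in (6): the entries of W[t-1]...W[0] 1 stay
# bounded away from zero for a strongly connected graph with self-loops.
m = 3
out_neighbors = {0: [0, 1, 2], 1: [1, 2], 2: [2, 0]}   # strongly connected, incl. self
W = np.zeros((m, m))
for j, outs in out_neighbors.items():
    for i in outs:
        W[i, j] = 1.0 / len(outs)

rho = np.ones(m)
delta_hat = rho.min()
for t in range(200):
    rho = W @ rho                      # rho[t] = W[t-1]...W[0] 1
    delta_hat = min(delta_hat, rho.min())

assert delta_hat > 0
assert np.isclose(rho.sum(), m)        # column stochasticity preserves total mass
```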

Using step 8 of Algorithm 1, we get

$$\theta_i[t]=u_i[t]+\beta[t]\big(A_i x_i[t]-b_i-\gamma_i\lambda_i[t]\big)=\rho_i[t]\Big(\lambda_i[t]+\frac{\beta[t]}{\rho_i[t]}\big(A_i x_i[t]-b_i-\gamma_i\lambda_i[t]\big)\Big).$$

Furthermore, the above equality gives rise to

$$\frac{\theta_i[t]}{\rho_i[t]}=\Big(1-\frac{\gamma_i\beta[t]}{\rho_i[t]}\Big)\lambda_i[t]+\frac{\beta[t]}{\rho_i[t]}\big(A_i x_i[t]-b_i\big).\qquad (7)$$

Since the transition matrix $W[t]$ is column stochastic and $\rho[0]=\mathbf{1}$, we have that $\sum_{i=1}^{m}\rho_i[t]=m$, and hence $\rho_i[t]\leq m$ for all $i$ and $t$. Together with (6) and $\lim_{t\to\infty}\beta[t]=0$, it yields

$$\lim_{t\to\infty}\frac{\beta[t]}{\rho_i[t]}=0.\qquad (8)$$

Thus, for each $i$, there exists a time $\tilde{T}$ such that $\gamma_i\beta[t]/\rho_i[t]\leq 1$, for all $t\geq\tilde{T}$.

Since the dual function $\phi$ defined in (3) is strongly concave and its gradient is Lipschitz continuous, by Lemma 1, there exist a finite $R_i$ and a compact set $V_i$ such that, for all $t\geq\tilde{T}$,

$$\Big\|\frac{\theta_i[t]}{\rho_i[t]}\Big\|\leq\max\big\{R_i,\ \|\lambda_i[t]\|\big\}.\qquad (9)$$

Let $\tilde{T}$ be as above. We now split the analysis into two parts ($t\geq\tilde{T}$ and $t<\tilde{T}$) to prove the boundedness of $\lambda_i[t]$ via (7).

(i) By mathematical induction, we prove that, for all $t\geq\tilde{T}$,

$$\max_{1\leq i\leq m}\|\lambda_i[t]\|\leq\tilde{R},\qquad (10)$$

where $\tilde{R}:=\max\big\{\max_{1\leq i\leq m}R_i,\ \max_{1\leq i\leq m}\|\lambda_i[\tilde{T}]\|\big\}$. Clearly, if $t=\tilde{T}$, relation (10) is true. Suppose it is true at some time $t\geq\tilde{T}$. Then, by (9), we have

$$\Big\|\frac{\theta_i[t]}{\rho_i[t]}\Big\|\leq\max\Big\{R_i,\ \max_{j}\|\lambda_j[t]\|\Big\}\leq\tilde{R},\quad \text{for all } i,\qquad (11)$$

due to the induction hypothesis.

Next, in Lemma 4 of nedic2015stochastic (), we let the quantities there be taken as the vectors of the $k$th coordinates of $\theta_i[t]$ and $\rho_i[t]$, where the coordinate index $k$ is arbitrary. By Lemma 4 of nedic2015stochastic (), each vector $\lambda_i[t+1]$ is a convex combination of the vectors $\theta_j[t]/\rho_j[t]$, i.e.,

$$\lambda_i[t+1]=\sum_{j=1}^{m}Q_{ij}[t]\frac{\theta_j[t]}{\rho_j[t]},\quad \text{for all } i \text{ and } t\geq 0,\qquad (12)$$

where $Q[t]$ is a row stochastic matrix with entries $Q_{ij}[t]$. Due to the convexity of the Euclidean norm $\|\cdot\|$, we further obtain

$$\|\lambda_i[t+1]\|\leq\sum_{j=1}^{m}Q_{ij}[t]\Big\|\frac{\theta_j[t]}{\rho_j[t]}\Big\|\leq\max_{1\leq j\leq m}\Big\|\frac{\theta_j[t]}{\rho_j[t]}\Big\|,\quad \text{for all } i \text{ and } t\geq 0.\qquad (13)$$

By (11) and (13), we have $\|\lambda_i[t+1]\|\leq\tilde{R}$ for all $i$, thus implying that, at time $t+1$,

$$\max_{1\leq i\leq m}\|\lambda_i[t+1]\|\leq\tilde{R}.$$

Hence, relation (10) holds for all $t\geq\tilde{T}$.

(ii) We now prove that $\lambda_i[t]$ is bounded above in norm when $t<\tilde{T}$. There is a constant $G_j>0$ such that $\|A_j x_j[t]-b_j\|\leq G_j$, for all $t$. Thus, together with (7) and (12), we can obtain that, for all $t<\tilde{T}$,

$$\begin{aligned}\max_{1\leq i\leq m}\|\lambda_i[t+1]\|\ &\leq\ \max_{1\leq j\leq m}\Big\{\Big(1-\frac{\gamma_j\beta[t]}{\rho_j[t]}\Big)\|\lambda_j[t]\|+\frac{\beta[t]}{\rho_j[t]}\|A_j x_j[t]-b_j\|\Big\}\\ &\leq\ \max_{1\leq j\leq m}\Big\{C\|\lambda_j[t]\|+\frac{\beta[t]}{\rho_j[t]}\|A_j x_j[t]-b_j\|\Big\}\\ &\leq\ \max_{1\leq j\leq m}C\|\lambda_j[t]\|+\max_{1\leq j\leq m}\frac{\bar{\beta}}{\delta}G_j,\end{aligned}$$

where $C$ is a constant bounding $1-\gamma_j\beta[t]/\rho_j[t]$ and $\bar{\beta}:=\sup_t\beta[t]$. Thus, exploiting the preceding relation recursively for $t=0,\ldots,\tilde{T}-1$, and the fact that the initial point $\lambda_i[0]$ is given in Algorithm 1, we conclude that there is a uniform deterministic bound on $\|\lambda_i[t]\|$ for all $i$ and $t<\tilde{T}$. Combining (i) and (ii) concludes the proof. ∎

In order to prove Theorems 2 and 3, we need the following result, which is a generalization of Lemma 8 in nedic2015distributed ().

###### Lemma 2

Under the conditions of Theorem 1, for any $\lambda$ and $t\geq 0$, we have

$$\begin{aligned}\|\bar{\theta}[t+1]-\lambda\|^{2}\ \leq\ &\|\bar{\theta}[t]-\lambda\|^{2}+\frac{4\beta[t+1]}{m}\sum_{j=1}^{m}(G_j+\gamma_j D)\|\lambda_j[t+1]-\bar{\theta}[t]\|\\ &-\frac{\beta[t+1]}{m}\sum_{j=1}^{m}\gamma_j\|\lambda_j[t+1]-\lambda\|^{2}+\frac{\beta^{2}[t+1]}{m}\sum_{j=1}^{m}(G_j+\gamma_j D)^{2}\\ &-\frac{2\beta[t+1]}{m}\big(L(x[t+1],\lambda)-L(x,\bar{\theta}[t])\big).\end{aligned}$$

Proof: We first prove that $\|\bar{\theta}[t]\|$ is bounded for any $t$. Since $W[t]$ is a column stochastic matrix, we have $\mathbf{1}^{\top}W[t]v=\mathbf{1}^{\top}v$ for any vector $v$. Combining this with step 4 of Algorithm 1 and the definition of $\bar{\theta}[t]$ in step 6, it gives rise to

$$\bar{\theta}[t]=\frac{1}{m}\sum_{i=1}^{m}u_i[t+1]=\frac{1}{m}\sum_{i=1}^{m}\rho_i[t+1]\lambda_i[t+1].$$

Note that $\sum_{i=1}^{m}\rho_i[t+1]=m$ and $\rho_i[t+1]>0$ for all $i$ and $t$, so $\bar{\theta}[t]$ is a convex combination of the vectors $\lambda_i[t+1]$. Thus, by the result of Theorem 1, we have, for all $t$,

$$\|\bar{\theta}[t]\|\leq\max_{i}\|\lambda_i[t+1]\|\leq D.$$

We are now ready to prove the result of Lemma 2. From step 8 of Algorithm 1, we have

$$\bar{\theta}[t+1]=\bar{\theta}[t]+\frac{\beta[t+1]}{m}\sum_{j=1}^{m}\big(A_j x_j[t+1]-b_j-\gamma_j\lambda_j[t+1]\big).\qquad (14)$$

For any $\lambda$, relation (14) gives rise to

$$\begin{aligned}\|\bar{\theta}[t+1]-\lambda\|^{2}\ =\ &\Big\|\bar{\theta}[t]-\lambda+\frac{\beta[t+1]}{m}\sum_{j=1}^{m}\big(A_j x_j[t+1]-b_j-\gamma_j\lambda_j[t+1]\big)\Big\|^{2}\\ \leq\ &\|\bar{\theta}[t]-\lambda\|^{2}+\frac{\beta^{2}[t+1]}{m^{2}}\Big\|\sum_{j=1}^{m}\big(A_j x_j[t+1]-b_j-\gamma_j\lambda_j[t+1]\big)\Big\|^{2}\\ &+\frac{2\beta[t+1]}{m}\sum_{j=1}^{m}\big(A_j x_j[t+1]-b_j-\gamma_j\lambda_j[t+1]\big)^{\top}(\bar{\theta}[t]-\lambda).\end{aligned}$$

By using the inequality $\|\sum_{j=1}^{m}a_j\|^{2}\leq m\sum_{j=1}^{m}\|a_j\|^{2}$, we can obtain

$$\Big\|\sum_{j=1}^{m}\big(A_j x_j[t+1]-b_j-\gamma_j\lambda_j[t+1]\big)\Big\|^{2}\leq m\sum_{j=1}^{m}\|A_j x_j[t+1]-b_j-\gamma_j\lambda_j[t+1]\|^{2}\leq m\sum_{j=1}^{m}(G_j+\gamma_j D)^{2}.$$
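The inequality invoked here follows from the convexity of $\|\cdot\|^{2}$ (equivalently, from Cauchy-Schwarz); a quick random check on hypothetical vectors:

```python
import numpy as np

# Numeric check of ||sum_j a_j||^2 <= m * sum_j ||a_j||^2, used above to
# bound the norm of the summed dual-gradient term.
rng = np.random.default_rng(3)
for _ in range(100):
    A = rng.normal(size=(5, 7))            # m = 5 vectors a_j in R^7 (rows)
    s = A.sum(axis=0)
    lhs = np.dot(s, s)
    rhs = A.shape[0] * (A ** 2).sum()
    assert lhs <= rhs + 1e-9
```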

Thus, we have, for all $\lambda$,

$$\|\bar{\theta}[t+1]-\lambda\|^{2}\leq\|\bar{\theta}[t]-\lambda\|^{2}+\frac{\beta^{2}[t+1]}{m}\sum_{j=1}^{m}(G_j+\gamma_j D)^{2}+\frac{2\beta[t+1]}{m}\sum_{j=1}^{m}\big(A_j x_j[t+1]-b_j-\gamma_j\lambda_j[t+1]\big)^{\top}(\bar{\theta}[t]-\lambda).\qquad (15)$$

We now consider the last term on the right-hand side of (15), which can be rewritten as

$$\begin{aligned}&\big(A_j x_j[t+1]-b_j-\gamma_j\lambda_j[t+1]\big)^{\top}(\bar{\theta}[t]-\lambda)\\ &=\big(A_j x_j[t+1]-b_j-\gamma_j\lambda_j[t+1]\big)^{\top}(\bar{\theta}[t]-\lambda_j[t+1]+\lambda_j[t+1]-\lambda)\\ &=\big(A_j x_j[t+1]-b_j-\gamma_j\lambda_j[t+1]\big)^{\top}(\bar{\theta}[t]-\lambda_j[t+1])+\big(A_j x_j[t+1]-b_j-\gamma_j\lambda_j[t+1]\big)^{\top}(\lambda_j[t+1]-\lambda).\qquad (16)\end{aligned}$$

By the Cauchy-Schwarz inequality, we have

$$-(G_j+\gamma_j D)\|\bar{\theta}[t]-\lambda_j[t+1]\|\ \leq\ \big(A_j x_j[t+1]-b_j-\gamma_j\lambda_j[t+1]\big)^{\top}(\bar{\theta}[t]-\lambda_j[t+1]).\qquad (17)$$

Since $L_j(x_j,\cdot)$ is $\gamma_j$-strongly concave, we have, for any $\lambda$,

$$\big(A_j x_j[t+1]-b_j-\gamma_j\lambda_j[t+1]\big)^{\top}(\lambda_j[t+1]-\lambda)\ \leq\ L_j(x_j[t+1],\lambda_j[t+1])-L_j(x_j[t+1],\lambda)-\frac{\gamma_j}{2}\|\lambda_j[t+1]-\lambda\|^{2}.\qquad (18)$$

By step 7 of Algorithm 1, for any $x_j\in X_j$, we can get

$$f_j(x_j[t+1])+\lambda_j[t+1]^{\top}\big(A_j x_j[t+1]-b_j\big)-\frac{\gamma_j}{2}\lambda_j[t+1]^{\top}\lambda_j[t+1]\ \leq\ f_j(x_j)+\lambda_j[t+1]^{\top}\big(A_j x_j-b_j\big)-\frac{\gamma_j}{2}\lambda_j[t+1]^{\top}\lambda_j[t+1].$$

Rearranging the above relation, we obtain