
# Distributed Optimization Using the Primal-Dual Method of Multipliers

## Abstract

In this paper, we propose the primal-dual method of multipliers (PDMM) for distributed optimization over a graph. In particular, we optimize a sum of convex functions defined over a graph, where every edge in the graph carries a linear equality constraint. In designing the new algorithm, an augmented primal-dual Lagrangian function is constructed which smoothly captures the graph topology. It is shown that a saddle point of the constructed function provides an optimal solution of the original problem. Further, under both the synchronous and asynchronous updating schemes, PDMM has a convergence rate of $O(1/K)$ (where $K$ denotes the iteration index) for general closed, proper and convex functions. Other properties of PDMM, such as convergence speed versus different parameter settings and resilience to transmission failure, are also investigated through experiments on distributed averaging.

Index Terms: Distributed optimization, ADMM, PDMM, sublinear convergence.

## I Introduction

In recent years, distributed optimization has drawn increasing attention due to the demand for big-data processing and easy access to ubiquitous computing units (e.g., a computer, a mobile phone or a sensor equipped with a CPU). The basic idea is to have a set of computing units collaborate with each other in a distributed way to complete a complex task. Popular applications include telecommunication [3, 4], wireless sensor networks [5], cloud computing and machine learning [6]. The research challenge lies in the design of efficient and robust distributed optimization algorithms for those applications.

To the best of our knowledge, almost all the optimization problems in those applications can be formulated as optimization over a graphical model $G=(V,E)$:

  $\min_{\{x_i\}}\ \sum_{i\in V}f_i(x_i)+\sum_{(i,j)\in E}f_{ij}(x_i,x_j),$  (1)

where $f_i$ and $f_{ij}$ are referred to as node- and edge-functions, respectively. For instance, for the application of distributed quadratic optimization, all the node- and edge-functions are in the form of scalar quadratic functions (see [7, 8, 9]).
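To make formulation (1) concrete, the following sketch evaluates the objective of (1) on a toy 3-node path graph with scalar quadratic node- and edge-functions; the particular graph, data values and function choices are illustrative only.

```python
import numpy as np

# Toy instance of formulation (1): a 3-node path graph with scalar
# quadratic node- and edge-functions (all values here are illustrative).
V = [0, 1, 2]
E = [(0, 1), (1, 2)]
a = {0: 1.0, 1: 2.0, 2: 4.0}          # private node data

def f_node(i, x):
    return 0.5 * (x - a[i]) ** 2       # f_i(x_i)

def f_edge(i, j, xi, xj):
    return 0.5 * (xi - xj) ** 2        # f_ij(x_i, x_j)

def objective(x):
    return (sum(f_node(i, x[i]) for i in V)
            + sum(f_edge(i, j, x[i], x[j]) for (i, j) in E))

x = {0: 1.0, 1: 2.0, 2: 4.0}
print(objective(x))  # node terms vanish; edge terms give 0.5*(1 + 4) = 2.5
```

Setting $x_i = a_i$ zeroes every node term, so the printed value is contributed by the edge coupling alone.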

In the literature, a large number of applications (see [10]) require that every edge-function $f_{ij}$, $(i,j)\in E$, is essentially a linear equality constraint in terms of $x_i$ and $x_j$. Mathematically, we use $A_{ij}x_i+A_{ji}x_j=c_{ij}$ to formulate the equality constraint for each edge $(i,j)\in E$, as demonstrated in Fig. 1. In this situation, (1) can be described as

  $\min_{\{x_i\}}\ \sum_{i\in V}f_i(x_i)+\sum_{(i,j)\in E}I_{A_{ij}x_i+A_{ji}x_j=c_{ij}}(x_i,x_j),$  (2)

where $I_C$ denotes the indicator or characteristic function, defined as $I_C(x_i,x_j)=0$ if $A_{ij}x_i+A_{ji}x_j=c_{ij}$ and $I_C(x_i,x_j)=+\infty$ otherwise. In this paper, we focus on convex optimization of form (2), where every node-function $f_i$ is closed, proper and convex.

The majority of recent research has focused on a specialized form of the convex problem (2), where every edge-function reduces to the consensus constraint $x_i=x_j$. The above problem is commonly known as the consensus problem in the literature. Classic methods include the dual-averaging algorithm [11], the subgradient algorithm [12] and the diffusion adaptation algorithm [13]. For the special case that the node-functions $f_i$ are scalar quadratic functions (referred to as the distributed averaging problem), the most popular methods are the randomized gossip algorithm [5] and the broadcast algorithm [14]. See [15] for an overview of the literature on solving the distributed averaging problem.
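As a concrete illustration of the distributed averaging problem, the following sketch implements a randomized pairwise gossip scheme in the spirit of [5]; the graph, iteration count and uniform edge-selection rule are illustrative assumptions.

```python
import numpy as np

# Randomized gossip for distributed averaging: at each tick a random edge
# (i, j) is activated and both endpoints replace their values by the
# pairwise average. The ring graph and values below are illustrative.
rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]   # 5-node ring
x = np.array([1.0, 3.0, 5.0, 7.0, 9.0])            # initial node values

for _ in range(5000):
    i, j = edges[rng.integers(len(edges))]
    x[i] = x[j] = 0.5 * (x[i] + x[j])

# Each pairwise average preserves the network sum, so x converges to the
# global mean (5.0 here) with probability one.
print(x)
```

The sum-preservation property is what pins the limit to the true average rather than an arbitrary consensus value.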

The alternating-direction method of multipliers (ADMM) can be applied to solve the general convex optimization (2). The key step is to decompose each equality constraint into two constraints, such as $A_{ij}x_i+z_{ij}=c_{ij}$ and $A_{ji}x_j-z_{ij}=0$, with the help of an auxiliary variable $z_{ij}$. As a result, (2) can be reformulated as

  $\min_{x,z}\ f(x)+g(z)\quad\text{subject to}\quad Ax+Bz=c,$  (3)

where $f(x)=\sum_{i\in V}f_i(x_i)$ and $z$ is a vector obtained by stacking up the auxiliary variables $z_{ij}$, $(i,j)\in E$, one after another. See [16] for using ADMM to solve the consensus problem of (2) (with edge-function $x_i=x_j$). The graph structure is implicitly embedded in the two matrices $(A,B)$ and the vector $c$. The reformulation essentially converts the problem on a general graph with many nodes (2) to a graph with only two nodes (3), allowing the application of ADMM. Based on (3), ADMM then constructs and optimizes an augmented Lagrangian function iteratively with respect to $(x,z)$ and a set of Lagrange multipliers. We refer to the above procedure as synchronous ADMM as it updates all the variables at each iteration. Recently, the work of [17] proposed asynchronous ADMM, which optimizes the same function over a subset of the variables at each iteration.
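The stacked-constraint form (3) can be illustrated on a small graph. The edge-splitting below ($A_{ij}x_i+z_{ij}=c_{ij}$ and $A_{ji}x_j-z_{ij}=0$ per edge) is one possible choice, not necessarily the one used in [16]; it merely shows how the matrices $A$, $B$ and the vector $c$ implicitly encode the graph.

```python
import numpy as np

# One possible edge-splitting (an illustrative choice): each scalar
# constraint a_ij*x_i + a_ji*x_j = c_ij becomes a_ij*x_i + z_e = c_ij and
# a_ji*x_j - z_e = 0, so the stacked system takes the two-block ADMM form
# A x + B z = c of (3).
nodes = [0, 1, 2]
edges = {(0, 1): (1.0, -1.0, 0.0),   # (a_ij, a_ji, c_ij): x_0 - x_1 = 0
         (1, 2): (1.0, -1.0, 0.0)}   # x_1 - x_2 = 0

m = 2 * len(edges)                   # two stacked rows per edge
A = np.zeros((m, len(nodes)))
B = np.zeros((m, len(edges)))
c = np.zeros(m)
for e, ((i, j), (aij, aji, cij)) in enumerate(edges.items()):
    A[2 * e, i], B[2 * e, e], c[2 * e] = aij, 1.0, cij   # a_ij x_i + z_e = c_ij
    A[2 * e + 1, j], B[2 * e + 1, e] = aji, -1.0         # a_ji x_j - z_e = 0

# A consensus point x_0 = x_1 = x_2 = 4 with z_e = a_ji * x_j is feasible:
x = np.array([4.0, 4.0, 4.0])
z = np.array([-4.0, -4.0])
print(np.allclose(A @ x + B @ z, c))  # True
```

Summing the two rows of each edge recovers the original constraint $A_{ij}x_i+A_{ji}x_j=c_{ij}$, which is why the split is equivalent.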

We note that besides solving (2), ADMM has found many successful applications in the fields of signal processing and machine learning (see [10] for an overview). For instance, in [18] and [19], variants of ADMM have been proposed to solve a (possibly nonconvex) optimization problem defined over a graph with a star topology, which is motivated by big-data applications. The work of [20] considers solving the consensus problem of (2) (with edge-function $x_i=x_j$) over a general graph, where each node-function is further expressed as a sum of two component functions. The authors of [20] propose a new algorithm which includes ADMM as a special case when one component function is zero. In general, ADMM and its variants are quite simple and often provide satisfactory results after a reasonable number of iterations, making it a popular algorithm in recent years.

In this paper, we tackle the convex problem (2) directly instead of relying on the reformulation (3). Specifically, we construct an augmented primal-dual Lagrangian function for (2) without introducing the auxiliary variable $z$ as is required by ADMM. We show that solving (2) is equivalent to searching for a saddle point of the augmented primal-dual Lagrangian. We then propose the primal-dual method of multipliers (PDMM) to iteratively approach one saddle point of the constructed function. It is shown that for both the synchronous and asynchronous updating schemes, PDMM converges with a rate of $O(1/K)$ for general closed, proper and convex functions.

Further, we evaluate PDMM through experiments on distributed averaging. Firstly, it is found that the parameters of PDMM should be selected by a rule (see VI-C1) for fast convergence. Secondly, when there are transmission failures in the graph, the losses only slow down the convergence speed of PDMM. Finally, experimental comparison suggests that PDMM outperforms ADMM and the two gossip algorithms in [5] and [14].

This work is mainly devoted to the theoretical analysis of PDMM. In the literature, PDMM has already been successfully applied to a few other problems. The work of [21] investigates the efficiency of ADMM and PDMM for distributed dictionary learning. In [22], we have used both ADMM and PDMM for training a support vector machine (SVM). In the above examples it is found that PDMM outperforms ADMM in terms of convergence rate. In [23], the authors describe an application of the linearly constrained minimum variance (LCMV) beamformer for use in acoustic wireless sensor networks. The proposed algorithm computes the optimal beamformer output at each node in the network without the need for sharing raw data within the network; PDMM has been successfully applied to perform this distributed beamforming. This suggests that PDMM is not only theoretically interesting but may also be powerful in real applications.

## II Problem Setting

In this section, we first introduce basic notations needed in the rest of the paper. We then make a proper assumption about the existence of optimal solutions of the problem. Finally, we derive the dual problem to (2) and its Lagrangian function, which will be used for constructing the augmented primal-dual Lagrangian function in Section III.

### II-A Notations and functional properties

We first introduce notations for a graphical model. We denote a graph as $G=(V,E)$, where $V$ represents the set of nodes and $E$ represents the set of edges in the graph, respectively. We use $\vec{E}$ to denote the set of all directed edges; therefore, $|\vec{E}|=2|E|$. The directed edge $[i,j]\in\vec{E}$ starts at node $i$ and ends at node $j$. We use $N_i$ to denote the set of all neighboring nodes of node $i$, i.e., $N_i=\{j\,|\,(i,j)\in E\}$. Given a graph $G=(V,E)$, only neighboring nodes are allowed to communicate with each other directly.

Next we introduce notations for the mathematical description in the remainder of the paper. We use bold small letters to denote vectors and bold capital letters to denote matrices. The notation $X\succeq 0$ (or $X\succ 0$) represents a symmetric positive semi-definite matrix (or a symmetric positive definite matrix). The superscript $(\cdot)^T$ represents the transpose operator. Given a vector $y$, we use $\|y\|$ to denote its $\ell_2$ norm.

Finally, we introduce the conjugate function. Suppose $h$ is a closed, proper and convex function. Then the conjugate of $h$ is defined as [24, Definition 2.1.20]

  $h^*(\delta)\stackrel{\Delta}{=}\max_y\ \delta^Ty-h(y),$  (4)

where the conjugate function $h^*$ is again a closed, proper and convex function. Let $y'$ be the optimal solution for a particular $\delta'$ in (4). We then have

  $\delta'\in\partial_y h(y'),$  (5)

where $\partial_y h(y')$ represents the set of all subgradients of $h$ at $y'$ (see [24, Definition 2.1.23]). As a consequence, since $h(y')+h^*(\delta')=y'^T\delta'$, we have

  $h(y')=y'^T\delta'-h^*(\delta')=\max_\delta\ y'^T\delta-h^*(\delta),$  (6)

and we conclude that $y'\in\partial_\delta h^*(\delta')$ as well.
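The relations (4)-(6) can be checked numerically. The sketch below uses the standard self-conjugate example $h(y)=\tfrac{1}{2}y^2$ (so $h^*(\delta)=\tfrac{1}{2}\delta^2$) and evaluates the conjugate by brute force on a grid; the grid and test values are arbitrary.

```python
import numpy as np

# Numerical check of (4)-(6) for the quadratic h(y) = 0.5*y^2, whose
# conjugate is h*(delta) = 0.5*delta^2 (a standard self-conjugate example).
ys = np.linspace(-5, 5, 2001)

def h(y):
    return 0.5 * y ** 2

def h_star(delta):
    # h*(delta) = max_y delta*y - h(y), evaluated by brute force on the grid
    return np.max(delta * ys - h(ys))

for delta in (-2.0, 0.5, 1.5):
    assert abs(h_star(delta) - 0.5 * delta ** 2) < 1e-5
    # The maximizer y' satisfies delta = h'(y') = y', i.e. relation (5);
    # by (6), h(y') is then recovered from the conjugate.
    y_prime = ys[np.argmax(delta * ys - h(ys))]
    assert abs(y_prime - delta) < 1e-2

print("conjugate identities verified on a grid")
```

Because the grid spacing is 0.005, the brute-force maximum matches the closed form up to a quadratic-in-spacing error, well below the tolerances used.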

### II-B Problem assumption

With the notation for a graph, we first reformulate the convex problem (2) as

  $\min_{x}\ \sum_{i\in V}f_i(x_i)\quad\text{s. t.}\quad A_{ij}x_i+A_{ji}x_j=c_{ij}\ \ \forall (i,j)\in E,$  (7)

where each function $f_i$ is assumed to be closed, proper and convex, and $x$ is obtained by stacking all the $x_i$, $i\in V$, on top of one another. For every edge $(i,j)\in E$, we let the constraint vector be $c_{ij}\in\mathbb{R}^{n_{ij}}$. In general, $A_{ij}$ and $A_{ji}$ are two different matrices, where $A_{ij}$ operates on $x_i$ in the linear constraint of edge $(i,j)$. The notation s. t. in (7) stands for “subject to”. We take the reformulation (7) as the primal problem in terms of $x$.

The primal Lagrangian for (7) can be constructed as

  $L_p(x,\delta)=\sum_{i\in V}f_i(x_i)+\sum_{(i,j)\in E}\delta_{ij}^T\bigl(c_{ij}-A_{ij}x_i-A_{ji}x_j\bigr),$  (8)

where $\delta_{ij}$ is the Lagrange multiplier (or the dual variable) for the corresponding edge constraint in (7), and the vector $\delta$ is obtained by stacking all the dual variables $\delta_{ij}$, $(i,j)\in E$, on top of one another. Therefore, $\delta$ is of dimension $\sum_{(i,j)\in E}n_{ij}$. The Lagrangian function $L_p$ is convex in $x$ for fixed $\delta$, and concave in $\delta$ for fixed $x$. Throughout the rest of the paper, we will make the following (common) assumption:

###### Assumption 1.

There exists a saddle point $(x^\star,\delta^\star)$ of the Lagrangian function $L_p(x,\delta)$ such that for all $x$ and $\delta$ we have

  $L_p(x^\star,\delta)\le L_p(x^\star,\delta^\star)\le L_p(x,\delta^\star).$

Or equivalently, the following optimality (KKT) conditions hold for $(x^\star,\delta^\star)$:

  $\sum_{j\in N_i}A_{ij}^T\delta_{ij}^\star\in\partial f_i(x_i^\star)\quad\forall i\in V$  (9)
  $A_{ji}x_j^\star+A_{ij}x_i^\star=c_{ij}\quad\forall (i,j)\in E.$  (10)

### II-C Dual problem and its Lagrangian function

We first derive the dual problem to (7). Minimizing $L_p$ over $x$ and then maximizing over $\delta$ yields

  $\max_\delta\min_x L_p(x,\delta)=\max_\delta\ \sum_{i\in V}-f_i^*\Bigl(\sum_{j\in N_i}A_{ij}^T\delta_{ij}\Bigr)+\sum_{(i,j)\in E}\delta_{ij}^Tc_{ij},$  (11)

where $f_i^*$ is the conjugate function of $f_i$ as defined in (4), satisfying Fenchel’s inequality

  $f_i(x_i)+f_i^*(u)\ge u^Tx_i\quad\forall x_i,u.$  (12)

Under Assumption 1, the dual problem (11) is equivalent to the primal problem (7). That is, suppose $(x^\star,\delta^\star)$ is a saddle point of $L_p$. Then $x^\star$ solves the primal problem (7) and $\delta^\star$ solves the dual problem (11).

At this point, we need to introduce auxiliary variables to decouple the node dependencies in (11). Indeed, every $\delta_{ij}$, associated with edge $(i,j)$, is used by the two conjugate functions $f_i^*$ and $f_j^*$. As a consequence, all conjugate functions in (11) are dependent on each other. To decouple the conjugate functions, we introduce for each edge $(i,j)\in E$ two auxiliary node variables $\lambda_{i|j}$ and $\lambda_{j|i}$, one for each node $i$ and $j$, respectively. The node variable $\lambda_{i|j}$ is owned by and updated at node $i$ and is related to neighboring node $j$. Hence, at every node $i$ we introduce $|N_i|$ new node variables. With this, we can reformulate the original dual problem as

  $\max_{\delta,\{\lambda_i\}}\ -\sum_{i\in V}f_i^*(A_i^T\lambda_i)+\sum_{(i,j)\in E}\delta_{ij}^Tc_{ij}\quad\text{s. t.}\quad \lambda_{i|j}=\lambda_{j|i}=\delta_{ij}\ \ \forall (i,j)\in E,$  (13)

where $\lambda_i$ is obtained by vertically concatenating all $\lambda_{i|j}$, $j\in N_i$, and $A_i$ is obtained by horizontally concatenating all $A_{ij}$, $j\in N_i$. To clarify, the product $A_i^T\lambda_i$ in (13) equals

  $A_i^T\lambda_i=\sum_{j\in N_i}A_{ij}^T\lambda_{i|j}.$  (14)

Consequently, we let $\lambda$ be the vector obtained by stacking all $\lambda_i$, $i\in V$, on top of one another. In the above reformulation (13), each conjugate function $f_i^*$ only involves the node variable $\lambda_i$, facilitating distributed optimization.

Next we tackle the equality constraints in (13). To do so, we construct a (dual) Lagrangian function for the dual problem (13), which is given by

  $L'_d(\delta,\lambda,y)=-\sum_{i\in V}f_i^*(A_i^T\lambda_i)+\sum_{(i,j)\in E}\delta_{ij}^Tc_{ij}+\sum_{[i,j]\in\vec{E}}y_{i|j}^T(\delta_{ij}-\lambda_{j|i}),$  (15)

where $y$ is obtained by concatenating all the Lagrange multipliers $y_{i|j}$, $[i,j]\in\vec{E}$, one after another.

We now argue that each Lagrange multiplier $y_{i|j}$, $[i,j]\in\vec{E}$, in (15) can be replaced by an affine function of $x$. Suppose $(x^\star,\delta^\star)$ is a saddle point of $L_p$. By letting $\lambda^\star_{i|j}=\delta^\star_{ij}$ for every $(i,j)\in E$, Fenchel’s inequality (12) must hold with equality at $(x_i^\star,A_i^T\lambda_i^\star)$, from which we derive that

  $0\in\partial_{\lambda_{i|j}}\bigl[f_i^*(A_i^T\lambda_i^\star)\bigr]-A_{ij}x_i^\star=\partial_{\lambda_{i|j}}\bigl[f_i^*(A_i^T\lambda_i^\star)\bigr]+A_{ji}x_j^\star-c_{ij}\quad\forall [i,j]\in\vec{E}.$

One can then show that $(\delta^\star,\lambda^\star,y^\star)$, where $y^\star_{i|j}=A_{ij}x_i^\star-c_{ij}$ for every $[i,j]\in\vec{E}$, is a saddle point of $L'_d$. We therefore restrict the Lagrange multiplier $y_{i|j}$ to be of the form $y_{i|j}=A_{ij}x_i-c_{ij}$, so that the dual Lagrangian becomes

  $L_d(\delta,\lambda,x)=-\sum_{i\in V}f_i^*(A_i^T\lambda_i)-\sum_{[i,j]\in\vec{E}}\lambda_{j|i}^T(A_{ij}x_i-c_{ij})-\sum_{(i,j)\in E}\delta_{ij}^T(c_{ij}-A_{ij}x_i-A_{ji}x_j).$  (16)

We summarize the result in a lemma below:

###### Lemma 1.

If $(x^\star,\delta^\star)$ is a saddle point of $L_p(x,\delta)$, then $(\delta^\star,\lambda^\star,x^\star)$ is a saddle point of $L_d(\delta,\lambda,x)$, where $\lambda^\star_{i|j}=\lambda^\star_{j|i}=\delta^\star_{ij}$ for every $(i,j)\in E$.

We note that the converse of Lemma 1 might not hold. By inspection of the optimality conditions of (16), not every saddle point of $L_d$ might lead to a saddle point of $L_p$, due to the generality of the matrices $A_{ij}$. In the next section we will introduce quadratic penalty functions w.r.t. $x$ and $\lambda$ to implicitly enforce the equality constraints of (13).

To briefly summarize, one can alternatively solve the dual problem (13) instead of the primal problem. Further, by replacing $y$ with an affine function of $x$ in (15), the dual Lagrangian $L_d$ shares the two variables $x$ and $\delta$ with the primal Lagrangian $L_p$. We will show in the next section that the special form of $L_d$ in (16) plays a crucial role in constructing the augmented primal-dual Lagrangian.

## III Augmented Primal-Dual Lagrangian

In this section, we first build and investigate a primal-dual Lagrangian from $L_p$ and $L_d$. We show that a saddle point of the primal-dual Lagrangian does not always lead to an optimal solution of the primal or the dual problem.

To address the above issue, we then construct an augmented primal-dual Lagrangian by introducing two additional penalty functions. We show that any saddle point of the augmented primal-dual Lagrangian leads to an optimal solution of the primal and the dual problem, respectively.

### III-A Primal-dual Lagrangian

By inspection of (8) and (16), we see that in both $L_p$ and $L_d$, the edge variables $\delta_{ij}$ are related to the terms $c_{ij}-A_{ij}x_i-A_{ji}x_j$, with opposite signs. As a consequence, if we add the primal and dual Lagrangian functions, the edge variables will cancel out and the resulting function contains the node variables $x$ and $\lambda$ only.

We hereby define the new function as the primal-dual Lagrangian below:

###### Definition 1.

The primal-dual Lagrangian is defined as

  $L_{pd}(x,\lambda)=L_p(x,\delta)+L_d(\delta,\lambda,x)=\sum_{i\in V}\Bigl[f_i(x_i)-\sum_{j\in N_i}\lambda_{j|i}^T(A_{ij}x_i-c_{ij})-f_i^*(A_i^T\lambda_i)\Bigr].$  (17)

$L_{pd}(x,\lambda)$ is convex in $x$ for fixed $\lambda$ and concave in $\lambda$ for fixed $x$, suggesting that it essentially defines a saddle-point problem (see [25], [26] for solving different saddle-point problems). For each edge $(i,j)\in E$, the node variables $\lambda_{i|j}$ and $\lambda_{j|i}$ substitute the role of the edge variable $\delta_{ij}$. The removal of $\delta$ enables the design of a distributed algorithm that only involves node-oriented optimization (see the next section for PDMM).

Next we study the properties of saddle points of :

###### Lemma 2.

If $x^\star$ solves the primal problem (7), then there exists a $\lambda^\star$ such that $(x^\star,\lambda^\star)$ is a saddle point of $L_{pd}(x,\lambda)$.

###### Proof.

If $x^\star$ solves the primal problem (7), then there exists a $\delta^\star$ such that $(x^\star,\delta^\star)$ is a saddle point of $L_p$, and by Lemma 1, there exist $\lambda^\star_{i|j}=\lambda^\star_{j|i}=\delta^\star_{ij}$ for every $(i,j)\in E$ so that $(\delta^\star,\lambda^\star,x^\star)$ is a saddle point of $L_d$. Hence

  $L_{pd}(x^\star,\lambda)=L_p(x^\star,\delta)+L_d(\delta,\lambda,x^\star)\le L_p(x^\star,\delta^\star)+L_d(\delta^\star,\lambda^\star,x^\star)=L_{pd}(x^\star,\lambda^\star)\le L_p(x,\delta^\star)+L_d(\delta^\star,\lambda^\star,x)=L_{pd}(x,\lambda^\star).$ ∎

The fact that $(x^\star,\lambda^\star)$ is a saddle point of $L_{pd}$, however, is not sufficient for showing that $x^\star$ is optimal for solving the primal problem (7) (or that $\lambda^\star$ is optimal for solving the dual problem (13)).

###### Example 1 ($x^\star$ not optimal).

Consider the following problem

  $\min_{x_1,x_2}\ f_1(x_1)+f_2(x_2)\quad\text{s.t.}\quad x_1-x_2=0,$  (18)

With this, the primal Lagrangian is given by $L_p(x,\delta)=f_1(x_1)+f_2(x_2)+\delta_{12}(x_2-x_1)$, so that the dual function is given by $-f_1^*(\delta_{12})-f_2^*(-\delta_{12})$, where

  $f_1^*(\delta_{12})=f_2^*(-\delta_{12})=\begin{cases}\delta_{12}&0\le\delta_{12}\le 1\\ +\infty&\text{otherwise.}\end{cases}$

Hence, the optimal solution for the primal and dual problem is and , respectively. The primal-dual Lagrangian in this case is given by

  $L_{pd}(x,\lambda)=f_1(x_1)+f_2(x_2)-f_1^*(\lambda_{1|2})-f_2^*(-\lambda_{2|1})-x_1\lambda_{2|1}+x_2\lambda_{1|2}.$  (19)

One can show that $L_{pd}$ has a continuum of saddle points, not all of which lead to an optimal solution $x^\star$ of (18).

It is clear from Example 1 that finding a saddle point of $L_{pd}$ does not necessarily solve the primal problem (7). Similarly, one can also build another example illustrating that a saddle point of $L_{pd}$ does not necessarily solve the dual problem (13).

### III-B Augmented primal-dual Lagrangian

The problem that not every saddle point of $L_{pd}$ leads to an optimal point of the primal or dual problem can be solved by adding two quadratic penalty terms to $L_{pd}$ as

  $L_P(x,\lambda)=L_{pd}(x,\lambda)+h_{P_p}(x)-h_{P_d}(\lambda),$  (20)

where $h_{P_p}(x)$ and $h_{P_d}(\lambda)$ are defined as

  $h_{P_p}(x)=\sum_{(i,j)\in E}\tfrac{1}{2}\bigl\|A_{ij}x_i+A_{ji}x_j-c_{ij}\bigr\|^2_{P_{p,ij}}$  (21)
  $h_{P_d}(\lambda)=\sum_{(i,j)\in E}\tfrac{1}{2}\bigl\|\lambda_{i|j}-\lambda_{j|i}\bigr\|^2_{P_{d,ij}},$  (22)

where $\|v\|^2_P=v^TPv$ and

  $P_p=\{P_{p,ij}^T=P_{p,ij}\succ 0\,|\,(i,j)\in E\}$
  $P_d=\{P_{d,ij}^T=P_{d,ij}\succ 0\,|\,(i,j)\in E\}.$

The sets of positive definite matrices $P_p$ and $P_d$ remain to be specified.

Let $\mathcal{X}$ and $\Lambda$ denote the primal and dual feasible set, i.e., the sets of $x$ and $\lambda$ satisfying the edge constraints of (7) and (13), respectively. It is clear that $h_{P_p}(x)\ge 0$ (or $h_{P_d}(\lambda)\ge 0$) with equality if and only if $x\in\mathcal{X}$ (or $\lambda\in\Lambda$). The introduction of the two penalty functions essentially prevents non-feasible $x$ and/or $\lambda$ from corresponding to saddle points of $L_P$. As a consequence, we have a saddle-point theorem for $L_P$, which states that $x'$ solves the primal problem (7) if and only if there exists a $\lambda'$ such that $(x',\lambda')$ is a saddle point of $L_P$. To prove this result, we need the following lemma.

###### Lemma 3.

Let $(x^\star,\lambda^\star)$ and $(x',\lambda')$ be two saddle points of $L_P(x,\lambda)$. Then

  $L_P(x^\star,\lambda^\star)=L_P(x',\lambda')=L_P(x',\lambda^\star)=L_P(x^\star,\lambda').$  (23)

Further, $(x',\lambda^\star)$ and $(x^\star,\lambda')$ are two saddle points of $L_P(x,\lambda)$ as well.

###### Proof.

Since $(x^\star,\lambda^\star)$ and $(x',\lambda')$ are two saddle points of $L_P$, we have

  $L_P(x',\lambda^\star)\le L_P(x',\lambda')\le L_P(x^\star,\lambda')$
  $L_P(x^\star,\lambda')\le L_P(x^\star,\lambda^\star)\le L_P(x',\lambda^\star).$

Combining the above two inequality chains produces (23). In order to show that $(x',\lambda^\star)$ is a saddle point, note that by (23), $L_P(x',\lambda)\le L_P(x',\lambda')=L_P(x',\lambda^\star)=L_P(x^\star,\lambda^\star)\le L_P(x,\lambda^\star)$ for all $x$ and $\lambda$. The proof for $(x^\star,\lambda')$ is similar. ∎

###### Theorem 1.

If $x'$ solves the primal problem (7), there exists a $\lambda'$ such that $(x',\lambda')$ is a saddle point of $L_P(x,\lambda)$. Conversely, if $(x',\lambda')$ is a saddle point of $L_P$, then $x'$ and $\lambda'$ solve the primal and the dual problem, respectively. Or equivalently, the following optimality conditions hold:

  $\sum_{j\in N_i}A_{ij}^T\lambda'_{j|i}\in\partial_{x_i}f_i(x'_i)\quad\forall i\in V$  (24)
  $A_{ij}x'_i+A_{ji}x'_j-c_{ij}=0\quad\forall (i,j)\in E$  (25)
  $\lambda'_{i|j}-\lambda'_{j|i}=0\quad\forall (i,j)\in E.$  (26)
###### Proof.

If $x'$ solves the primal problem, then there exists a $\lambda'$ such that $(x',\lambda')$ is a saddle point of $L_{pd}$ by Lemma 2. Since $x'$ and $\lambda'$ are both feasible (by construction in Lemmas 1 and 2), we have $h_{P_p}(x')=0$ and $h_{P_d}(\lambda')=0$, so that, for all $x$ and $\lambda$, $L_P(x',\lambda)\le L_{pd}(x',\lambda)\le L_{pd}(x',\lambda')=L_P(x',\lambda')\le L_{pd}(x,\lambda')\le L_P(x,\lambda')$, from which we conclude that $(x',\lambda')$ is a saddle point of $L_P$ as well.

Conversely, let $(x',\lambda')$ be a saddle point of $L_P$. We first show that $x'$ solves the primal problem. We have from Lemma 3 that $L_P(x',\lambda^\star)=L_P(x^\star,\lambda^\star)$, which can be simplified as

  $L_p(x',\delta^\star)+L_d(\delta^\star,\lambda^\star,x')+h_{P_p}(x')=L_p(x^\star,\delta^\star)+L_d(\delta^\star,\lambda^\star,x^\star),$

from which we conclude that $h_{P_p}(x')=0$ and thus $A_{ij}x'_i+A_{ji}x'_j=c_{ij}$ for every $(i,j)\in E$, so that $x'$ is primal feasible. In addition, since $(x',\lambda^\star)$ is a saddle point of $L_P$ by Lemma 3, we have

  $\sum_{j\in N_i}A_{ij}^T\delta^\star_{ij}=\sum_{j\in N_i}A_{ij}^T\lambda^\star_{j|i}\in\partial_{x_i}f_i(x'_i)\quad\forall i\in V,$

and we conclude that $x'$ solves the primal problem as required. Similarly, one can show that $\lambda'$ solves the dual problem.

Based on the above analysis, we conclude that the optimality conditions for $(x',\lambda')$ being a saddle point of $L_P$ are given by (24)-(26). The remaining set of optimality conditions is redundant and can be derived from (24)-(26) (see (4)-(6) for the argument). ∎

Theorem 1 states that instead of solving the primal problem (7) or the dual problem (13), one can alternatively search for a saddle point of $L_P(x,\lambda)$. To briefly summarize, we consider solving the following min-max problem in the rest of the paper

  $(x^\star,\lambda^\star)=\arg\min_x\max_\lambda L_P(x,\lambda).$  (27)

We will explain in the next section how to iteratively approach such a saddle point in a distributed manner.
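To give a feeling for the min-max problem (27), the sketch below runs plain gradient descent-ascent on a toy strongly-convex-concave function $L(x,\lambda)=\tfrac{1}{2}x^2+\lambda(x-1)$ with saddle point $(1,-1)$. This is purely an illustration of saddle-point seeking; PDMM itself uses the node-based updates derived in the next section, not this scheme.

```python
# Toy min-max problem in the spirit of (27): L(x, lam) = 0.5*x^2 + lam*(x - 1),
# whose unique saddle point is (x*, lam*) = (1, -1). Gradient descent in x
# combined with gradient ascent in lam converges for this simple function.
x, lam, eta = 0.0, 0.0, 0.1
for _ in range(600):
    gx = x + lam          # dL/dx
    glam = x - 1.0        # dL/dlam
    x -= eta * gx         # descend in x
    lam += eta * glam     # ascend in lam

print(round(x, 6), round(lam, 6))  # ≈ 1.0 and -1.0
```

For this problem the descent-ascent iteration spirals into the saddle point geometrically; for general convex-concave functions plain descent-ascent can fail, which is one motivation for structured schemes such as PDMM.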

## IV Primal-Dual Method of Multipliers

In this section, we present a new algorithm, named the primal-dual method of multipliers (PDMM), to iteratively approach a saddle point of $L_P$. We propose both a synchronous and an asynchronous variant of PDMM for solving the problem.

### IV-A Synchronous updating scheme

The synchronous updating scheme refers to the operation that at each iteration, all the variables over the graph update their estimates by using the most recent estimates from their neighbors from the last iteration. Suppose $(\hat{x}^k,\hat{\lambda}^k)$ is the estimate obtained from the $k$th iteration, where $k\ge 0$. We compute the new estimate at iteration $k+1$ as

  $(\hat{x}^{k+1}_i,\hat{\lambda}^{k+1}_i)=\arg\min_{x_i}\max_{\lambda_i}L_P\bigl([\ldots,\hat{x}^{k,T}_{i-1},x_i^T,\hat{x}^{k,T}_{i+1},\ldots]^T,\ [\ldots,\hat{\lambda}^{k,T}_{i-1},\lambda_i^T,\hat{\lambda}^{k,T}_{i+1},\ldots]^T\bigr)\quad i\in V.$  (28)

By inserting the expression (20) for $L_P$ into (28), the updating expression can be further simplified as

  $\hat{x}^{k+1}_i=\arg\min_{x_i}\Bigl[f_i(x_i)-x_i^T\sum_{j\in N_i}A_{ij}^T\hat{\lambda}^k_{j|i}+\sum_{j\in N_i}\tfrac{1}{2}\bigl\|A_{ij}x_i+A_{ji}\hat{x}^k_j-c_{ij}\bigr\|^2_{P_{p,ij}}\Bigr]$  (29)
  $\hat{\lambda}^{k+1}_i=\arg\max_{\lambda_i}\Bigl[-f_i^*(A_i^T\lambda_i)-\sum_{j\in N_i}\lambda_{i|j}^T(A_{ji}\hat{x}^k_j-c_{ij})-\sum_{j\in N_i}\tfrac{1}{2}\bigl\|\lambda_{i|j}-\hat{\lambda}^k_{j|i}\bigr\|^2_{P_{d,ij}}\Bigr]$  (30)

Eq. (29)-(30) suggest that at iteration $k+1$, every node $i$ performs parameter-updating independently once the estimates of its neighboring variables are available. In addition, the computation of $\hat{x}^{k+1}_i$ and $\hat{\lambda}^{k+1}_i$ can be carried out in parallel since $x_i$ and $\lambda_i$ are not directly related in $L_P$. We refer to (29)-(30) as node-oriented computation.

In order to run PDMM over the graph, each iteration should consist of two steps. Firstly, every node $i$ computes $(\hat{x}^{k+1}_i,\hat{\lambda}^{k+1}_i)$ by following (29)-(30), accounting for information-fusion. Secondly, every node $i$ sends $\hat{x}^{k+1}_i$ and $\hat{\lambda}^{k+1}_{i|j}$ to each neighboring node $j\in N_i$, accounting for information-spread. We take $\hat{x}^{k+1}_i$ as the common message to all neighbors of node $i$ and $\hat{\lambda}^{k+1}_{i|j}$ as a node-specific message only to neighbor $j$. In some applications, it may be preferable to exploit broadcast transmission rather than point-to-point transmission in order to save energy. We will explain in Subsection IV-C that the transmission of $\hat{\lambda}^{k+1}_{i|j}$, $j\in N_i$, can be replaced by broadcast transmission of an intermediate quantity.

Finally, we consider terminating the iterates (29)-(30). One can check if the estimate $(\hat{x}^k,\hat{\lambda}^k)$ becomes stable over consecutive iterations (see Corollary 1 for theoretical support).
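The synchronous updates can be made concrete for scalar distributed averaging, i.e., $f_i(x_i)=\tfrac{1}{2}(x_i-a_i)^2$ with edge constraints $x_i-x_j=0$ (encoded as $A_{ij}=1$ for $i<j$, $A_{ji}=-1$, $c_{ij}=0$). The closed-form updates below follow by working out (29) for this quadratic case, together with a $\lambda$-mapping of the form (36); the penalty choices $P_{p,ij}=\rho$ and $P_{d,ij}=1/\rho$ and all data are illustrative assumptions, not a verbatim transcription of the paper's algorithm.

```python
import numpy as np

# Synchronous PDMM specialized to scalar distributed averaging (a sketch
# under the assumptions stated above; graph and data are illustrative).
rho = 1.0
a = np.array([0.0, 3.0, 6.0, 1.0, 5.0])          # private node data
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]         # path graph
nbrs = {i: [] for i in range(len(a))}
for (i, j) in edges:
    nbrs[i].append(j)
    nbrs[j].append(i)
A = {(i, j): 1.0 for (i, j) in edges}
A.update({(j, i): -1.0 for (i, j) in edges})

x = np.zeros(len(a))
lam = {(i, j): 0.0 for i in nbrs for j in nbrs[i]}   # lam[(i, j)] = lambda_{i|j}

for _ in range(200):
    x_old, lam_old = x.copy(), dict(lam)
    for i in range(len(a)):
        # x-update: closed-form minimizer of the local quadratic in (29)
        s = sum(A[(i, j)] * lam_old[(j, i)] + rho * x_old[j] for j in nbrs[i])
        x[i] = (a[i] + s) / (1.0 + rho * len(nbrs[i]))
        # lambda-update: mapping of the form (36), with w_i = x_i^{k+1}
        for j in nbrs[i]:
            lam[(i, j)] = lam_old[(j, i)] + rho * (
                -A[(i, j)] * x[i] - A[(j, i)] * x_old[j])

print(x)  # every entry approaches mean(a) = 3.0
```

All nodes read only old neighbor estimates within a round, so the inner loop over `i` could run fully in parallel, matching the node-oriented computation described above.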

### Iv-B Asynchronous updating scheme

The asynchronous updating scheme refers to the operation that at each iteration, only the variables associated with one node in the graph update their estimates while all other variables keep their estimates fixed. Suppose node $i$ is selected at iteration $k+1$. We then compute $(\hat{x}^{k+1}_i,\hat{\lambda}^{k+1}_i)$ by optimizing $L_P$ based on the most recent estimates from its neighboring nodes. At the same time, the estimates $(\hat{x}^k_j,\hat{\lambda}^k_j)$, $j\ne i$, remain the same. By following the above computational instruction, $(\hat{x}^{k+1},\hat{\lambda}^{k+1})$ can be obtained as

  $(\hat{x}^{k+1}_i,\hat{\lambda}^{k+1}_i)=\arg\min_{x_i}\max_{\lambda_i}L_P\bigl([\ldots,\hat{x}^{k,T}_{i-1},x_i^T,\hat{x}^{k,T}_{i+1},\ldots]^T,\ [\ldots,\hat{\lambda}^{k,T}_{i-1},\lambda_i^T,\hat{\lambda}^{k,T}_{i+1},\ldots]^T\bigr)$  (31)
  $(\hat{x}^{k+1}_j,\hat{\lambda}^{k+1}_j)=(\hat{x}^k_j,\hat{\lambda}^k_j)\quad\forall j\in V,\ j\ne i.$  (32)

Similarly to (29)-(30), $\hat{x}^{k+1}_i$ and $\hat{\lambda}^{k+1}_i$ can also be computed separately in (31). Once the update at node $i$ is complete, the node sends the common message $\hat{x}^{k+1}_i$ and the node-specific messages $\hat{\lambda}^{k+1}_{i|j}$ to its neighbors $j\in N_i$. We will explain in the next subsection how to exploit broadcast transmission to replace point-to-point transmission.

In practice, the nodes in the graph can either be randomly activated or follow a predefined order for asynchronous parameter-updating. One scheme for realizing random node-activation is that after a node finishes parameter-updating, it randomly activates one of its neighbors for the next iteration. Another scheme is to introduce a clock at each node which ticks at the times of a (random) Poisson process (see [5] for detailed information). Each node is activated only when its clock ticks. As for node-activation in a predefined order, the cyclic updating scheme is probably the most straightforward. Once node $i$ finishes parameter-updating, it informs node $i+1$ for the next iteration. For the case that node $i$ and node $i+1$ are not neighbors, the path from node $i$ to node $i+1$ can be pre-stored at node $i$ to facilitate the process. In Subsection V-D, we provide convergence analysis only for the cyclic updating scheme. We leave the analysis of other asynchronous schemes for future investigation.
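For the same scalar averaging model ($f_i(x_i)=\tfrac{1}{2}(x_i-a_i)^2$, constraints $x_i=x_j$, illustrative penalties $P_{p,ij}=\rho$, $P_{d,ij}=1/\rho$), the cyclic asynchronous scheme is a small change: each iteration activates a single node, which updates using its neighbors' most recent estimates while all other variables stay fixed, as in (32). The closed-form node update below is an assumption-level sketch, not the paper's pseudocode.

```python
import numpy as np

# Cyclic asynchronous PDMM for scalar averaging on a 3-node path graph
# (illustrative data and penalty choices; only one node updates per tick).
rho = 1.0
a = np.array([0.0, 3.0, 6.0])
nbrs = {0: [1], 1: [0, 2], 2: [1]}
A = {(0, 1): 1.0, (1, 0): -1.0, (1, 2): 1.0, (2, 1): -1.0}

x = np.zeros(3)
lam = {(i, j): 0.0 for i in nbrs for j in nbrs[i]}   # lam[(i, j)] = lambda_{i|j}

for k in range(120):
    i = k % 3                        # cyclic node activation
    # local x-update with the neighbors' most recent estimates
    s = sum(A[(i, j)] * lam[(j, i)] + rho * x[j] for j in nbrs[i])
    xi_new = (a[i] + s) / (1.0 + rho * len(nbrs[i]))
    # local lambda-update (mapping of the form (36))
    for j in nbrs[i]:
        lam[(i, j)] = lam[(j, i)] + rho * (-A[(i, j)] * xi_new
                                           - A[(j, i)] * x[j])
    x[i] = xi_new                    # all other nodes stay fixed, cf. (32)

print(x)  # all entries approach mean(a) = 3.0
```

Compared with the synchronous scheme, no round-level barrier is needed: a node only has to know that its own turn has come, at the price of slower information-spread per wall-clock tick.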

###### Remark 1.

To briefly summarize, the synchronous PDMM scheme allows faster information-spread over the graph through parallel parameter-updating, while the asynchronous scheme requires less node-coordination in the graph. In practice, the scheme-selection should depend on the graph (or network) properties, such as the feasibility of parallel computation, the complexity of node-coordination and the lifetime of nodes.

### Iv-C Simplifying node-based computations and transmissions

It is clear that for both the synchronous and asynchronous schemes, each activated node has to perform two minimizations: one for $\hat{x}^{k+1}_i$ and the other for $\hat{\lambda}^{k+1}_i$. In this subsection, we show that the computations for the two minimizations can be simplified. We will also study how the point-to-point transmission can be replaced with broadcast transmission. To do so, we will consider two scenarios:

#### Avoiding conjugate functions

In the first scenario, we consider using $f_i$ instead of its conjugate $f_i^*$ to update $\hat{\lambda}^{k+1}_i$. Our goal is to simplify computations by avoiding the derivation of the conjugate function $f_i^*$.

By using the definition of the conjugate function in (4), the computation (30) for $\hat{\lambda}^{k+1}_i$ (which also holds for asynchronous PDMM) can be rewritten as

  $\hat{\lambda}^{k+1}_i=\arg\max_{\lambda_i}\Bigl[-\max_{w_i}\bigl(\lambda_i^TA_iw_i-f_i(w_i)\bigr)-\sum_{j\in N_i}\lambda_{i|j}^T(A_{ji}\hat{x}^k_j-c_{ij})-\sum_{j\in N_i}\tfrac{1}{2}\bigl\|\lambda_{i|j}-\hat{\lambda}^k_{j|i}\bigr\|^2_{P_{d,ij}}\Bigr].$  (33)

We denote the optimal solution for $w_i$ in (33) as $w^{k+1}_i$. The optimality conditions for $w^{k+1}_i$ and $\hat{\lambda}^{k+1}_{i|j}$, $j\in N_i$, can then be derived from (33) as

  $0\in A_i^T\hat{\lambda}^{k+1}_i-\partial_{w_i}f_i(w^{k+1}_i)$  (34)
  $0=A_{ij}w^{k+1}_i+A_{ji}\hat{x}^k_j-c_{ij}+P_{d,ij}\bigl(\hat{\lambda}^{k+1}_{i|j}-\hat{\lambda}^k_{j|i}\bigr)\quad\forall j\in N_i,$  (35)

where (14) is used in deriving (35). Since $P_{d,ij}$ is a nonsingular matrix, (35) defines a mapping from $w^{k+1}_i$ to $\hat{\lambda}^{k+1}_{i|j}$:

  $\hat{\lambda}^{k+1}_{i|j}=\hat{\lambda}^k_{j|i}+P_{d,ij}^{-1}\bigl(c_{ij}-A_{ij}w^{k+1}_i-A_{ji}\hat{x}^k_j\bigr).$  (36)

With this mapping, (34) can then be reformulated as

  $\sum_{j\in N_i}A_{ij}^T\Bigl[\hat{\lambda}^k_{j|i}+P_{d,ij}^{-1}\bigl(c_{ij}-A_{ij}w^{k+1}_i-A_{ji}\hat{x}^k_j\bigr)\Bigr]\in\partial_{w_i}f_i(w^{k+1}_i).$  (37)

By inspection of (37), it can be shown that (37) is in fact an optimality condition for the following optimization problem

  $w^{k+1}_i=\arg\min_{w_i}\Bigl[f_i(w_i)+\sum_{j\in N_i}\tfrac{1}{2}\bigl\|c_{ij}-A_{ji}\hat{x}^k_j-A_{ij}w_i\bigr\|^2_{P^{-1}_{d,ij}}-w_i^T\sum_{j\in N_i}A_{ij}^T\hat{\lambda}^k_{j|i}\Bigr].$  (38)

The above analysis suggests that $\hat{\lambda}^{k+1}_i$ can alternatively be computed through the intermediate quantity $w^{k+1}_i$. We summarize the result in a proposition below.

###### Proposition 1.

Considering a node $i$ at iteration $k+1$, the new estimate $\hat{\lambda}^{k+1}_{i|j}$ for each $j\in N_i$ can be obtained by following (36), where $w^{k+1}_i$ is computed by (38).

Proposition 1 suggests that the estimate $\hat{\lambda}^{k+1}_i$ can be easily computed from $w^{k+1}_i$. We argue in the following that the point-to-point transmission of the messages $\hat{\lambda}^{k+1}_{i|j}$, $j\in N_i$, can be replaced with broadcast transmission of $w^{k+1}_i$.

We see from (36) that the computation of the node-specific message $\hat{\lambda}^{k+1}_{i|j}$ (from node $i$ to node $j$) only involves the quantities $w^{k+1}_i$, $\hat{x}^k_j$ and $\hat{\lambda}^k_{j|i}$. Since $\hat{x}^k_j$ and $\hat{\lambda}^k_{j|i}$ are available at node $j$, the message $\hat{\lambda}^{k+1}_{i|j}$ can therefore be computed at node $j$ once the common message $w^{k+1}_i$ is received. In other words, it is sufficient for node $i$ to broadcast both $\hat{x}^{k+1}_i$ and $w^{k+1}_i$ to all its neighbors. Every node-specific message $\hat{\lambda}^{k+1}_{i|j}$, $j\in N_i$, can then be computed at node $j$ alone.

Finally, in order for the broadcast transmission to work, we assume there is no transmission failure between neighboring nodes. The assumption ensures that there is no estimate inconsistency between neighboring nodes, making the broadcast transmission reliable.

#### Reducing two minimizations to one

In the second scenario, we study under what conditions the two minimizations (29)-(30) (which also hold for asynchronous PDMM) reduce to one minimization.

###### Proposition 2.

Considering a node $i$ at iteration $k+1$, if the matrix $P_{d,ij}$ for every neighbor $j\in N_i$ is chosen to be