# Decentralized Approximate Newton Methods for In-Network Optimization

Hejie Wei, Zhihai Qu, Xuyang Wu, Hao Wang, and Jie Lu

This work has been supported by the National Natural Science Foundation of China under grant 61603254 and the Natural Science Foundation of Shanghai under grant 16ZR1422500. Hejie Wei, Zhihai Qu, Xuyang Wu, Hao Wang, and Jie Lu are with the School of Information Science and Technology, ShanghaiTech University, 201210 Shanghai, China (e-mail: {weihj, quzhh1, wuxy, wanghao1, lujie}@shanghaitech.edu.cn).

###### Abstract

This paper proposes a set of Decentralized Approximate Newton (DEAN) methods for addressing in-network convex optimization, where nodes in a network seek a consensus that minimizes the sum of their individual objective functions through local interactions only. The proposed DEAN algorithms allow each node to repeatedly take a local approximate Newton step, so that the nodes not only jointly emulate the (centralized) Newton method but also drive each other closer. Under weaker assumptions than those of most existing distributed Newton-type methods, the DEAN algorithms enable all the nodes to asymptotically reach a consensus that can be arbitrarily close to the optimum. Also, for a particular DEAN algorithm, the consensus error among the nodes vanishes at a linear rate and the iteration complexity to achieve any given accuracy in optimality is provided. Furthermore, when the optimization problem reduces to a quadratic program, the DEAN algorithms are guaranteed to converge linearly to the exact optimal solution.

###### Index Terms

Distributed optimization, decentralized algorithm, Newton method.

## I Introduction

In many engineering applications such as learning by computer networks [1], coordination of multi-agent systems [2], estimation by sensor networks [3], and resource allocation in communication networks [4], nodes in a networked system often need to cooperate with each other in order to minimize the sum of their individual objective functions.

There have been a large number of decentralized/distributed algorithms for such in-network optimization problems, which allow nodes in the network to address the problem by interacting with their neighbors only. Most of these algorithms are first-order methods, where the nodes utilize subgradients/gradients of their local objectives to update (e.g., [3, 5, 6, 7, 8, 9, 10, 4, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]). However, first-order algorithms may suffer from slow convergence, especially when the problem is ill-conditioned. This motivates the development of decentralized second-order methods, where the Hessian matrices of the local objectives, if available, are involved in computing the iterates. Such second-order methods can be roughly classified into the following two categories:

The first category is the methods based on second-order approximations of certain dual-related objectives. For instance, the decentralized Exact Second-Order Method (ESOM) [24] considers a second-order approximation of an augmented Lagrangian function, and the Decentralized Quadratically Approximated ADMM (DQM) [25] introduces a quadratic approximation to a decentralized version of the Alternating Direction Method of Multipliers (ADMM).

The second category is the Newton-type methods, such as the distributed Broyden-Fletcher-Goldfarb-Shanno (D-BFGS) method [26], the Network-Newton (NN) method [27], the Distributed Quasi-Newton (DQN) method [28], and the Newton-Raphson Consensus (NRC) method [29]. Among these methods, D-BFGS, NN, and DQN relax the consensus constraint by adding a penalty to the objective function and approximate the Newton direction of the penalized objective in a decentralized manner. As a result, these methods are only guaranteed to converge to a suboptimal solution. NRC utilizes an average consensus scheme to approximate the Newton-Raphson direction in a distributed fashion. Although NRC may converge to the exact optimal solution, no explicit parameter condition guaranteeing convergence is provided, which makes it difficult to implement in practice.

In this paper, we propose a family of Decentralized Approximate Newton methods, referred to as DEAN, for solving in-network convex optimization. The DEAN algorithms are developed by letting every node execute a local Newton-like step at each iteration. In this step, the inverse of the Hessian of the node's own objective function is involved, and the gradient term in the conventional Newton step is replaced by the sum of the differences between the node and each of its neighbors, measured by the gradients of a class of locally strongly convex functions associated with the corresponding links. This is intended to approximate the traditional (centralized) Newton method and, at the same time, drive all the nodes together. The DEAN algorithms are endowed with the following results and advantages:

1. The DEAN algorithms asymptotically drive all the nodes to a consensus that lies in an arbitrarily small neighborhood of the optimum. In addition, if the local objectives are positive definite quadratic functions, the nodes converge to the exact optimum at a linear rate.

2. With a particular choice of the functions associated with the links in DEAN, the disagreement among the nodes is shown to drop to zero at a linear rate. Further, for any given accuracy $\epsilon>0$, we provide the iteration complexity (i.e., a bound on the number of iterations needed) to achieve $\epsilon$-suboptimality.

3. The above convergence results are established under the assumption that the local objectives of the nodes are locally strongly convex, which is less restrictive than the global strong convexity assumed by most existing second-order methods [24, 25, 27, 28, 29].

4. Compared to other Newton-type methods, DEAN has the lowest communication cost per iteration.

5. Simulation results illustrate the competitive performance of DEAN in comparison with several existing Newton-type methods.

The outline of this paper is as follows: Section II formulates the problem. Section III describes the proposed DEAN algorithms and Section IV is dedicated to the convergence analysis. Section V presents the simulation results. Concluding remarks are provided in Section VI. All the proofs are in the appendix.

### I-A Notation and Preliminaries

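Throughout this paper, we use $\|\cdot\|$ to denote the Euclidean norm, $\{\cdot,\cdot\}$ the unordered pair, $|\cdot|$ the absolute value of a real number or the cardinality of a set, and $\mathcal{R}(\cdot)$ the range of a matrix. In addition, $\mathbf{1}_n$ is the $n$-dimensional all-one vector and $I_n$ is the $n\times n$ identity matrix. For any $a_1,\ldots,a_n\in\mathbb{R}$, $\operatorname{diag}(a_1,\ldots,a_n)$ is the diagonal matrix whose diagonal entries are $a_1,\ldots,a_n$. For any $x_1,\ldots,x_N\in\mathbb{R}^n$, $(x_1^T,\ldots,x_N^T)^T$ is the vector obtained by stacking $x_1,\ldots,x_N$. Given any $x\in\mathbb{R}^n$ and $r>0$, $B(x;r)$ represents the closed ball with center $x$ and radius $r$. Also, for any set $X\subseteq\mathbb{R}^n$, $\operatorname{conv}\{X\}$ is the convex hull of $X$ and $P_X(x)$ is the projection of $x\in\mathbb{R}^n$ onto $X$. For any differentiable function $f:\mathbb{R}^n\to\mathbb{R}$, $\nabla f(x)$ denotes the gradient of $f$ at $x$ and, if $f$ is twice differentiable, $\nabla^2 f(x)$ represents the Hessian matrix of $f$ at $x$. For any symmetric matrices $A,B\in\mathbb{R}^{n\times n}$, $A\succeq B$ means $A-B$ is positive semidefinite and $A\succ B$ means $A-B$ is positive definite. For any symmetric positive semidefinite matrix $A\in\mathbb{R}^{n\times n}$, we use $\lambda_i(A)$ to denote the $i$-th smallest eigenvalue of $A$, $\lambda_{\max}(A)$ the largest eigenvalue of $A$, and $A^\dagger$ the pseudoinverse of $A$.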

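A differentiable function $f:\mathbb{R}^n\to\mathbb{R}$ is said to be locally strongly convex if for any convex and compact set $X\subset\mathbb{R}^n$, there exists $\theta_X>0$ such that $(\nabla f(x)-\nabla f(y))^T(x-y)\ge\theta_X\|x-y\|^2$ for any $x,y\in X$, where $\theta_X$ is called the convexity parameter of $f$ on $X$. It is said to be (globally) strongly convex if there exists $\theta>0$ such that $(\nabla f(x)-\nabla f(y))^T(x-y)\ge\theta\|x-y\|^2$ for any $x,y\in\mathbb{R}^n$. A vector-valued or matrix-valued function $h$ is said to be locally Lipschitz continuous if for any compact set $X$ contained in the domain of $h$, there exists $L_X\ge0$ such that the Lipschitz condition $\|h(x)-h(y)\|\le L_X\|x-y\|$ holds for all $x,y\in X$. Also, $L_X$ is said to be the Lipschitz constant of $h$ on $X$.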

## II Problem Formulation

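Consider an undirected, connected graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$, where $\mathcal{V}=\{1,\ldots,N\}$ is the node set and $\mathcal{E}$ is the link set. For each node $i\in\mathcal{V}$, the set of its neighbors is denoted by $\mathcal{N}_i=\{j\in\mathcal{V}:\{i,j\}\in\mathcal{E}\}$. The nodes are required to solve the following problem:

$$\min_{x\in\mathbb{R}^n}\ \sum_{i\in\mathcal{V}}f_i(x),\tag{1}$$

where each $f_i:\mathbb{R}^n\to\mathbb{R}$ is the local objective function of node $i$ and satisfies the following assumption: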

###### Assumption 1

For each $i\in\mathcal{V}$, $f_i$ is twice continuously differentiable and locally strongly convex, and has a minimizer. In addition, $\nabla^2 f_i$ is locally Lipschitz continuous.

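Assumption 1 guarantees that each $f_i$ has a unique minimizer $x_i^*$, so that there is a unique optimal solution $x^*$ to (1). In addition, given any convex and compact set $X\subset\mathbb{R}^n$, there exist $\Theta_X\ge\theta_X>0$ such that $\theta_XI_n\preceq\nabla^2 f_i(x)\preceq\Theta_XI_n$ for all $x\in X$. Another implication of Assumption 1 is that $\nabla f_i$ is locally Lipschitz continuous.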

Most of the existing distributed second-order methods [24, 25, 26, 27, 28, 29] assume the $f_i$'s to be (globally) strongly convex [24, 25, 27, 28, 29], which is more restrictive than the local strong convexity in Assumption 1. One example of functions that are locally strongly convex but not (globally) strongly convex is the objective of logistic regression [1], which often arises in machine learning. The D-BFGS method [26] allows each $f_i$ to be a general convex function, yet it requires, like other second-order algorithms in [24, 27, 28, 29], each $\nabla f_i$ to be globally Lipschitz continuous, which is unnecessary for problem (1) under Assumption 1. Moreover, the local Lipschitz continuity of $\nabla^2 f_i$ in Assumption 1 is weaker than the three times continuous differentiability of $f_i$ in [29] and the global Lipschitz continuity of $\nabla^2 f_i$ in [24, 25, 27].
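To make the gap between local and global strong convexity concrete, the following minimal sketch uses a hypothetical scalar function, $f(x)=\ln(1+e^{x})+\ln(1+e^{-x})$, chosen here purely for illustration and not taken from the paper. Its second derivative is positive everywhere, so it is bounded away from zero on every compact interval, yet it decays to zero as $|x|$ grows, so no single convexity parameter works on the whole real line. The function names are assumptions of this sketch.

```python
import math

def curvature(x):
    """Second derivative of f(x) = ln(1 + e^x) + ln(1 + e^-x), i.e. 2*s*(1-s) with s = sigmoid(x)."""
    t = math.exp(-abs(x))          # numerically stable for large |x|
    return 2.0 * t / (1.0 + t) ** 2

def local_convexity_parameter(radius):
    """Smallest curvature on [-radius, radius]; attained at the endpoints since curvature decays in |x|."""
    return curvature(radius)

# The parameter stays positive on every bounded interval but shrinks as the interval grows.
for r in (1.0, 5.0, 20.0):
    print(r, local_convexity_parameter(r))
```

On any fixed interval the printed value can serve as a convexity parameter, which is all that local strong convexity asks for.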

## III Decentralized Approximate Newton Methods

In this section, we develop a class of decentralized Newton-type algorithms to address problem (1).

To do so, we first reformulate problem (1) as follows:

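$$\min_{x\in\mathbb{R}^{nN}}\ F(x)=\sum_{i\in\mathcal{V}}f_i(x_i)\quad\text{s.t.}\quad x_i=x_j,\ \forall i,j\in\mathcal{V},$$

where $x=(x_1^T,\ldots,x_N^T)^T$ and $x_i\in\mathbb{R}^n$ $\forall i\in\mathcal{V}$. If we directly apply the classic Newton method to solve the above problem, i.e.,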

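$$x^{k+1}=x^k-\alpha\big(\nabla^2F(x^k)\big)^{-1}\nabla F(x^k),\tag{2}$$

where $\alpha>0$ is the Newton step-size, then generally the consensus constraint cannot be satisfied. To overcome this, we replace the gradient term $\alpha\nabla F(x^k)$ in (2) by $(H_{\mathcal{G}}\otimes I_n)x^k$, where $\otimes$ denotes the Kronecker product and $H_{\mathcal{G}}\in\mathbb{R}^{N\times N}$ is given by

$$[H_{\mathcal{G}}]_{ij}=\begin{cases}\sum_{s\in\mathcal{N}_i}\alpha_{\{i,s\}}, & \text{if } i=j,\\ -\alpha_{\{i,j\}}, & \text{if } \{i,j\}\in\mathcal{E},\\ 0, & \text{otherwise,}\end{cases}\tag{3}$$

with $\alpha_{\{i,j\}}>0$ $\forall\{i,j\}\in\mathcal{E}$. In this way, if the $i$th $n$-dimensional block $x_i^k$ of $x^k$ is assigned to node $i$, we obtain

$$x_i^{k+1}=x_i^k+\big(\nabla^2f_i(x_i^k)\big)^{-1}\sum_{j\in\mathcal{N}_i}\alpha_{\{i,j\}}(x_j^k-x_i^k).\tag{4}$$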

This replacement can potentially drive all the nodes to a consensus. To see this, consider the special case where each $f_i(x)=\frac{1}{2}\|x-b_i\|^2$, $b_i\in\mathbb{R}^n$. In this case, (1) becomes an average consensus problem and (4) reduces to

$$x_i^{k+1}=x_i^k+\sum_{j\in\mathcal{N}_i}\alpha_{\{i,j\}}(x_j^k-x_i^k).$$

This is indeed a classic linear consensus scheme [30], with which all the $x_i^k$'s asymptotically reach a consensus at $\frac{1}{N}\sum_{i\in\mathcal{V}}x_i^0$ if the $\alpha_{\{i,j\}}$'s are appropriately selected.
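The linear consensus scheme above can be sketched in a few lines. The example below is only an illustration under assumed data: a hypothetical 4-node path graph, scalar states, and a common weight of 0.3 on every link (small enough for this graph that the iteration is stable). None of these values come from the paper.

```python
# Hypothetical 4-node path graph; each unordered link carries one weight alpha_{i,j}.
edges = {(0, 1): 0.3, (1, 2): 0.3, (2, 3): 0.3}

def neighbors(i):
    """Yield (j, weight) for every link touching node i."""
    for (a, b), w in edges.items():
        if a == i:
            yield b, w
        elif b == i:
            yield a, w

def consensus_step(x):
    """One synchronous update x_i <- x_i + sum_j alpha_{i,j} * (x_j - x_i)."""
    return [xi + sum(w * (x[j] - xi) for j, w in neighbors(i))
            for i, xi in enumerate(x)]

x = [1.0, 5.0, -2.0, 4.0]          # assumed initial states
average = sum(x) / len(x)          # symmetric weights preserve this average
for _ in range(200):
    x = consensus_step(x)
print(x, average)
```

Because each weight is shared by both endpoints of a link, the sum of the states is unchanged by every step, which is why the common limit is the initial average.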

To better capture the behavior of the Newton method, we further extend the linear consensus term in (4) to a nonlinear one, i.e.,

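$$x_i^{k+1}=x_i^k+\big(\nabla^2f_i(x_i^k)\big)^{-1}\sum_{j\in\mathcal{N}_i}\alpha_{\{i,j\}}\big(\nabla g_{\{i,j\}}(x_j^k)-\nabla g_{\{i,j\}}(x_i^k)\big),\tag{5}$$

where associated with each link $\{i,j\}\in\mathcal{E}$ are a constant weight $\alpha_{\{i,j\}}>0$ and a convex function $g_{\{i,j\}}:\mathbb{R}^n\to\mathbb{R}$. We assume $g_{\{i,j\}}$ to satisfy an assumption similar to, but less restrictive than, that on $f_i$: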

###### Assumption 2

For each $\{i,j\}\in\mathcal{E}$, $g_{\{i,j\}}$ is continuously differentiable and locally strongly convex. In addition, $\nabla g_{\{i,j\}}$ is locally Lipschitz continuous.

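There are numerous choices of $g_{\{i,j\}}$ satisfying Assumption 2. For example, we may let

$$g_{\{i,j\}}(x)=f_i(x)+f_j(x),\quad\forall\{i,j\}\in\mathcal{E},\tag{6}$$

or

$$g_{\{i,j\}}(x)=\frac{1}{2}x^TA_{\{i,j\}}x,\quad\forall\{i,j\}\in\mathcal{E},\tag{7}$$

where $A_{\{i,j\}}\in\mathbb{R}^{n\times n}$ can be any symmetric positive definite matrix known to both node $i$ and node $j$. Observe that when each $g_{\{i,j\}}(x)=\frac{1}{2}x^Tx$, (5) reduces exactly to (4).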

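Although (5) may enable the nodes to eventually attain a consensus, it is still unclear how far the $x_i^k$'s are from the optimum $x^*$. Notice that for each $k\ge0$,

$$\begin{aligned}
&\Big\|\sum_{i\in\mathcal{V}}\nabla f_i(x_i^{k+1})-\sum_{i\in\mathcal{V}}\nabla f_i(x_i^k)\Big\|\\
&=\Big\|\sum_{i\in\mathcal{V}}\int_0^1\nabla^2f_i\big(x_i^k+s(x_i^{k+1}-x_i^k)\big)(x_i^{k+1}-x_i^k)\,ds\Big\|\\
&=\Big\|\sum_{i\in\mathcal{V}}\int_0^1\Big[\nabla^2f_i\big(x_i^k+s(x_i^{k+1}-x_i^k)\big)-\nabla^2f_i(x_i^k)\Big](x_i^{k+1}-x_i^k)\,ds\Big\|\\
&\le\sum_{i\in\mathcal{V}}\int_0^1\Big\|\nabla^2f_i\big(x_i^k+s(x_i^{k+1}-x_i^k)\big)-\nabla^2f_i(x_i^k)\Big\|\,ds\cdot\|x_i^{k+1}-x_i^k\|,
\end{aligned}\tag{8}$$

in which the second equality follows from (5) that $\sum_{i\in\mathcal{V}}\nabla^2f_i(x_i^k)(x_i^{k+1}-x_i^k)=0$. This, together with (5) and the local Lipschitz continuity of each $\nabla^2f_i$, implies that if the $x_i^k$'s remain bounded for all $k\ge0$, then

$$\Big\|\sum_{i\in\mathcal{V}}\nabla f_i(x_i^{k+1})-\sum_{i\in\mathcal{V}}\nabla f_i(x_i^k)\Big\|\le C\cdot\max_{\{i,j\}\in\mathcal{E}}\alpha_{\{i,j\}}\sum_{\{i,j\}\in\mathcal{E}}\|x_i^k-x_j^k\|$$

for some constant $C>0$. This indicates that once all the $x_i^k$'s become identical, the gradient sum $\sum_{i\in\mathcal{V}}\nabla f_i(x_i^k)$ would remain constant. Further, if the initial gradient sum $\sum_{i\in\mathcal{V}}\nabla f_i(x_i^0)$ is zero, by properly selecting the weights $\alpha_{\{i,j\}}$, we may be able to keep $\|\sum_{i\in\mathcal{V}}\nabla f_i(x_i^k)\|$ very small, so that the $x_i^k$'s, once agreeing with each other, would be sufficiently close to the optimum $x^*$.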
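The cancellation behind the second equality in (8) can be spelled out as follows. Multiplying the update (5) by $\nabla^2f_i(x_i^k)$ and summing over the nodes groups the terms link by link, and each link contributes two opposite terms because $\alpha_{\{i,j\}}$ and $g_{\{i,j\}}$ are shared by both endpoints:

$$\begin{aligned}
\sum_{i\in\mathcal{V}}\nabla^2f_i(x_i^k)(x_i^{k+1}-x_i^k)
&=\sum_{i\in\mathcal{V}}\sum_{j\in\mathcal{N}_i}\alpha_{\{i,j\}}\big(\nabla g_{\{i,j\}}(x_j^k)-\nabla g_{\{i,j\}}(x_i^k)\big)\\
&=\sum_{\{i,j\}\in\mathcal{E}}\alpha_{\{i,j\}}\Big[\big(\nabla g_{\{i,j\}}(x_j^k)-\nabla g_{\{i,j\}}(x_i^k)\big)+\big(\nabla g_{\{i,j\}}(x_i^k)-\nabla g_{\{i,j\}}(x_j^k)\big)\Big]=0.
\end{aligned}$$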

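Based on the above observations, to make each $x_i^k$ approach the optimum $x^*$, we set each $x_i^0$ to the unique minimizer of $f_i$, i.e.,

$$x_i^0=x_i^*:=\arg\min_{x\in\mathbb{R}^n}f_i(x).\tag{9}$$

Hence, $\sum_{i\in\mathcal{V}}\nabla f_i(x_i^0)=0$. Note from this and (8) that if each $f_i$ is a positive definite quadratic function, i.e.,

$$f_i(x)=\frac{(x-b_i)^TB_i(x-b_i)}{2},\quad B_i=B_i^T\succ0,\ b_i\in\mathbb{R}^n,\tag{10}$$

then $\sum_{i\in\mathcal{V}}\nabla f_i(x_i^k)=0$ $\forall k\ge0$. Thus, if all the $x_i^k$'s reach a consensus, the consensus is the optimum $x^*$.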

The initialization (9) and the update (5) together yield a class of Decentralized Approximate Newton methods, referred to as DEAN algorithms. As shown in Algorithm 1, the implementation of the DEAN algorithms is fully decentralized. The initialization (9) can be completed by each node on its own. The update (5) requires each node to evaluate the inverse of the Hessian of its local objective at its current estimate and to exchange information with its neighbors.
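As a rough illustration of how the initialization (9) and the update (5) fit together, the sketch below runs DEAN on assumed data: a hypothetical 4-node path graph, scalar quadratic objectives of the form (10), the link function $g(x)=x^2/2$ (option (7) with $A=1$), and a common weight of 0.1 that is small enough for this instance. It is a toy simulation with all nodes held in one process, not the authors' Algorithm 1.

```python
# Hypothetical 4-node path graph with one weight alpha_{i,j} per link.
edges = {(0, 1): 0.1, (1, 2): 0.1, (2, 3): 0.1}
B = [1.0, 2.0, 4.0, 0.5]     # curvatures of f_i(x) = B_i * (x - b_i)**2 / 2
b = [1.0, -1.0, 3.0, 0.0]    # local minimizers x_i^*

def grad_g(x):
    """Gradient of the assumed link function g(x) = x**2 / 2."""
    return x

def dean_step(x):
    """One synchronous update (5) for scalar states."""
    new_x = []
    for i, xi in enumerate(x):
        pull = 0.0
        for (u, v), alpha in edges.items():
            if i in (u, v):
                j = v if i == u else u
                pull += alpha * (grad_g(x[j]) - grad_g(xi))
        new_x.append(xi + pull / B[i])   # inverse Hessian of f_i is 1 / B_i
    return new_x

x = list(b)                                              # initialization (9)
x_star = sum(Bi * bi for Bi, bi in zip(B, b)) / sum(B)   # minimizer of the sum
for _ in range(3000):
    x = dean_step(x)
gradient_sum = sum(Bi * (xi - bi) for Bi, xi, bi in zip(B, x, b))
print(x, x_star, gradient_sum)
```

In this quadratic case the gradient sum stays at zero along the iterations, so the consensus value coincides with the minimizer of the sum.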

Prior to implementing DEAN, each pair of neighboring nodes needs to agree on the selection of the function $g_{\{i,j\}}$ satisfying Assumption 2. For the option of $g_{\{i,j\}}$ given by (6), the update (5) can be executed if each node shares its local objective with all its neighbors. However, this could be prohibitively costly in some cases. Instead, the nodes may adopt the following scheme to avoid exchanging the $f_i$'s: For every $k\ge0$, each node $i$ first sends $x_i^k$ and $\nabla f_i(x_i^k)$ to all its neighbors. Upon receiving $x_j^k$ and $\nabla f_j(x_j^k)$, each node $i$ computes $\nabla f_i(x_j^k)$ for every $j\in\mathcal{N}_i$ and sends it to neighbor $j$. Through such local interactions, each node is able to update via (5) and (6) without exchanging its local objective with its neighbors. For another example of $g_{\{i,j\}}$ in (7), each pair of neighbors only needs to jointly determine $A_{\{i,j\}}$, which can be done at negligible communication cost (e.g., we may simply set $A_{\{i,j\}}=I_n$).
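The two-round exchange for option (6) can be sketched as follows. The `Node` class, its method names, and the two scalar objectives are hypothetical choices made for this illustration; the point is only that each node reveals gradient values, never its objective, while still obtaining the link term $\nabla g_{\{i,j\}}(x_j^k)-\nabla g_{\{i,j\}}(x_i^k)$ needed in (5).

```python
import math

class Node:
    """Holds a private scalar objective; only gradient values ever leave the node."""
    def __init__(self, grad, x):
        self._grad = grad      # private derivative f_i'
        self.x = x             # current estimate x_i^k

    def first_message(self):
        # Round 1: send x_i^k and f_i'(x_i^k) to a neighbor.
        return self.x, self._grad(self.x)

    def second_message(self, x_neighbor):
        # Round 2: evaluate the private gradient at the neighbor's point and send it back.
        return self._grad(x_neighbor)

def link_difference(me, other):
    """g'(x_j) - g'(x_i) for g = f_i + f_j, assembled by node `me` (= i) from messages."""
    x_j, grad_j_at_xj = other.first_message()
    grad_j_at_xi = other.second_message(me.x)
    _, grad_i_at_xi = me.first_message()
    grad_i_at_xj = me.second_message(x_j)
    return (grad_i_at_xj + grad_j_at_xj) - (grad_i_at_xi + grad_j_at_xi)

node_i = Node(lambda x: 2.0 * (x - 1.0), x=1.0)      # assumed f_i(x) = (x - 1)^2
node_j = Node(lambda x: math.exp(x) - 1.0, x=0.0)    # assumed f_j(x) = e^x - x
diff = link_difference(node_i, node_j)
print(diff)
```

The same quantity with the opposite sign is what node `j` assembles, which is the symmetry that keeps the weighted link terms cancelling across the network.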

###### Remark 1

If all the weights $\alpha_{\{i,j\}}$ are identical, the DEAN algorithms (5) and (9) can be viewed as a finite-difference discretization of the continuous-time ZGS algorithms in the earlier work [31], for which the ZGS manifold is guaranteed to be positively invariant. Nevertheless, here we allow each $\alpha_{\{i,j\}}$ to be distinct and determined only by the neighboring nodes $i$ and $j$. Moreover, unlike the ZGS algorithms that require global strong convexity of the $f_i$'s to establish convergence, the DEAN algorithms relax this condition to local strong convexity. Furthermore, the discrete-time nature of DEAN requires significantly different tools for convergence analysis.

Finally, we compare DEAN with the existing decentralized Newton-type methods, including D-BFGS[26], NN-K, K [27], DQN-K, K [28], and NRC [29], in respect of their communication costs at each iteration. Note from Algorithm 1 that DEAN essentially requires every node to transmit vectors of dimension per iteration. Every node at each iteration in D-BFGS needs to transmit vectors of the same dimension. During one iteration of NN-K and DQN-K , transmissions of -dimensional vectors are executed by each node. In addition, NRC needs every node to transmit vectors in and matrices in at each iteration. Therefore, the communication cost of DEAN is the lowest among these Newton-type methods.

## IV Convergence Analysis

In this section, we analyze the convergence performance of the DEAN algorithms.

To this end, we utilize the Lyapunov function candidate given by

$$V(x)=\sum_{i\in\mathcal{V}}\Big(f_i(x^*)-f_i(x_i)-\nabla f_i(x_i)^T(x^*-x_i)\Big).\tag{11}$$

Due to Assumption 1, $V(x)\ge0$ and the equality holds if and only if $x_i=x^*$ $\forall i\in\mathcal{V}$. Hence, $V(x)$ can be viewed as a measure of the suboptimality of $x$. Further, we introduce the following notations based on $V$, which will be used to present the convergence results.
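For intuition, the function in (11) can be evaluated directly. The sketch below does so for assumed scalar quadratics of the form (10), with hypothetical values for $B_i$ and $b_i$; in that case each summand reduces to $B_i(x^*-x_i)^2/2$, which makes the nonnegativity easy to see.

```python
def lyapunov(x, x_star, f, grad):
    """Evaluate V(x) in (11) for scalar states; f and grad are lists of callables."""
    return sum(fi(x_star) - fi(xi) - gi(xi) * (x_star - xi)
               for fi, gi, xi in zip(f, grad, x))

B = [1.0, 2.0, 4.0, 0.5]     # assumed curvatures
b = [1.0, -1.0, 3.0, 0.0]    # assumed local minimizers
f = [lambda x, Bi=Bi, bi=bi: Bi * (x - bi) ** 2 / 2 for Bi, bi in zip(B, b)]
grad = [lambda x, Bi=Bi, bi=bi: Bi * (x - bi) for Bi, bi in zip(B, b)]
x_star = sum(Bi * bi for Bi, bi in zip(B, b)) / sum(B)

v_init = lyapunov(b, x_star, f, grad)                   # V at the initialization (9)
v_opt = lyapunov([x_star] * len(B), x_star, f, grad)    # V when every node sits at x*
print(v_init, v_opt)
```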

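First of all, for each $i\in\mathcal{V}$, let

$$\mathcal{C}_i=\{x\in\mathbb{R}^n: f_i(x^*)-f_i(x)-\nabla f_i(x)^T(x^*-x)\le V(x^0)\},$$

where $x^0$ is the initial state in the DEAN algorithms given by (9). Clearly, the $\mathcal{C}_i$'s are compact. Thus, there exist $\tilde{\theta}_i,\bar{\theta}_i>0$ such that

$$\nabla^2f_i(x)\succeq\tilde{\theta}_iI_n,\quad\forall x\in\operatorname{conv}\{\mathcal{C}_i\},\tag{12}$$

$$\nabla^2f_i(x)\succeq\bar{\theta}_iI_n,\quad\forall x\in\operatorname{conv}\{\cup_{j\in\mathcal{V}}\mathcal{C}_j\}.\tag{13}$$

In addition, for each $\{i,j\}\in\mathcal{E}$, define the compact set

$$\mathcal{C}_{\{i,j\}}=\operatorname{conv}\Big\{B\Big(x_i^0;2\sqrt{\tfrac{2V(x^0)}{\tilde{\theta}_i}}\Big)\cup B\Big(x_j^0;2\sqrt{\tfrac{2V(x^0)}{\tilde{\theta}_j}}\Big)\Big\}.\tag{14}$$

It follows from Assumption 2 that there exist $\Gamma_{\{i,j\}}\ge\gamma_{\{i,j\}}>0$ such that

$$\gamma_{\{i,j\}}\|x-y\|^2\le\big(\nabla g_{\{i,j\}}(x)-\nabla g_{\{i,j\}}(y)\big)^T(x-y)\le\Gamma_{\{i,j\}}\|x-y\|^2,\quad\forall x,y\in\mathcal{C}_{\{i,j\}}.\tag{15}$$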

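Arbitrarily pick an $\bar{\alpha}>0$ and suppose the weights $\alpha_{\{i,j\}}$ are selected from the interval $(0,\bar{\alpha}]$. Then, for each $i\in\mathcal{V}$, let

$$\delta_i=\sqrt{2V(x^0)}\bigg[\frac{2}{\sqrt{\tilde{\theta}_i}}+\frac{\bar{\alpha}}{\tilde{\theta}_i}\sum_{j\in\mathcal{N}_i}\Gamma_{\{i,j\}}\Big(\frac{1}{\sqrt{\tilde{\theta}_i}}+\frac{1}{\sqrt{\tilde{\theta}_j}}\Big)\bigg].\tag{16}$$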

Due again to Assumption 1, there exist and such that ,

$$\big(\nabla f_i(x)-\nabla f_i(y)\big)^T(x-y)\le\Theta_i\|x-y\|^2,\tag{17}$$

$$\|\nabla^2f_i(x)-\nabla^2f_i(y)\|\le L_i\|x-y\|.\tag{18}$$

Moreover, we let $\theta_i>0$ be such that

$$\nabla^2f_i(x)\succeq\theta_iI_n,\quad\forall x\in\operatorname{conv}\{\mathcal{C}_i\cup B(x_i^0;\delta_i)\}.\tag{19}$$

Note that . For convenience, denote and . If is (globally) strongly convex, then we can take as well as the above and all equal to the convexity parameter of over .

Our first result shows that $V(x^k)$ is non-increasing in $k$ and provides its drop at each iteration:

###### Lemma 1 (Monotonicity of Lyapunov function)

Suppose Assumptions 1 and 2 hold. Let $x^k$, $k\ge0$, be generated by DEAN described in Algorithm 1 with $\alpha_{\{i,j\}}\in(0,\bar{\alpha}]$ $\forall\{i,j\}\in\mathcal{E}$. If, in addition,

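$$\alpha_{\{i,j\}}<\frac{1}{2\Gamma_{\{i,j\}}}\min\bigg\{\frac{\theta_i^2}{|\mathcal{N}_i|}\Big(\Theta_i-\frac{\theta_i}{2}+\frac{L_i}{2}\sqrt{\frac{2V(x^0)}{\theta_i}}\Big)^{-1},\ \frac{\theta_j^2}{|\mathcal{N}_j|}\Big(\Theta_j-\frac{\theta_j}{2}+\frac{L_j}{2}\sqrt{\frac{2V(x^0)}{\theta_j}}\Big)^{-1}\bigg\},\quad\forall\{i,j\}\in\mathcal{E},\tag{20}$$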

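then for each $k\ge0$,

$$\cdot\bigg[\Big(\frac{\theta_i}{2}-\Theta_i-\frac{L_i}{2}\sqrt{\frac{2V(x^0)}{\theta_i}}\Big)\frac{|\mathcal{N}_i|\alpha_{\{i,j\}}}{\theta_i^2}+\frac{1}{2\Gamma_{\{i,j\}}}\bigg]\le0.\tag{21}$$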

Proof: See Appendix A.

###### Remark 2

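In Lemma 1 as well as the statements in the rest of the paper, the constant $\bar{\alpha}$ in the condition $\alpha_{\{i,j\}}\in(0,\bar{\alpha}]$ can be chosen as any positive scalar, which plays a role in $\delta_i$ given by (16) and, thus, affects the values of the constants defined through $\delta_i$, such as $\theta_i$ in (19).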

Lemma 1 implies that $V$ is a Lyapunov function which keeps strictly decreasing until the $x_i^k$'s become identical. Also, since $V$ is bounded from below, $\lim_{k\to\infty}V(x^k)$ exists. This leads to the theorem below, which says that all the nodes are able to reach a consensus:

###### Theorem 1 (Asymptotic convergence to consensus)

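Suppose Assumptions 1 and 2 hold. Let $x^k$, $k\ge0$, be generated by DEAN described in Algorithm 1 with $\alpha_{\{i,j\}}\in(0,\bar{\alpha}]$ $\forall\{i,j\}\in\mathcal{E}$. Suppose (20) holds. Then,

$$\lim_{k\to\infty}\|x_i^k-x_j^k\|=0,\quad\forall i,j\in\mathcal{V}.\tag{22}$$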

Proof: See Appendix B.

In the following theorem, we further show that the nodes not only attain a consensus as in Theorem 1, but also asymptotically achieve -accuracy in optimality for any given (i.e., ), provided that the weights are properly related to :

###### Theorem 2 (Asymptotic convergence to suboptimality)

Suppose Assumptions 1 and 2 hold. Let be generated by DEAN described in Algorithm 1 with . For each , let and . Given any , if

 (23)

then .

Proof: See Appendix C.

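Below, we explore the convergence rates of the DEAN algorithms. For simplicity, here we only consider DEAN with each $g_{\{i,j\}}(x)=\frac{1}{2}x^Tx$; the results can be extended to more general cases. To present the convergence rate results, we consider the Laplacian matrix $L_{\mathcal{G}}\in\mathbb{R}^{N\times N}$ of the graph $\mathcal{G}$:

$$[L_{\mathcal{G}}]_{ij}=\begin{cases}|\mathcal{N}_i|, & \text{if } i=j,\\ -1, & \text{if } \{i,j\}\in\mathcal{E},\\ 0, & \text{otherwise.}\end{cases}\tag{24}$$

Observe that $L_{\mathcal{G}}$ is symmetric positive semidefinite. Also, since $\mathcal{G}$ is connected, $L_{\mathcal{G}}$ has only one eigenvalue at zero. Its second smallest eigenvalue $\lambda_2(L_{\mathcal{G}})$ (i.e., the algebraic connectivity of $\mathcal{G}$) and its largest eigenvalue $\lambda_{\max}(L_{\mathcal{G}})$ are therefore positive. The following theorem shows that the nodes achieve a consensus at a linear rate, which depends on $\lambda_2(L_{\mathcal{G}})$ and $\lambda_{\max}(L_{\mathcal{G}})$, and provides a bound on the distance between the consensus and the optimum $x^*$: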
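The two eigenvalues that enter the rate can be computed numerically for a given graph. The sketch below builds the Laplacian (24) for a hypothetical 4-node path graph and estimates its largest and second smallest eigenvalues with plain power iteration; the helper names, the starting vector, and the shift (twice the maximum degree, an upper bound on the largest eigenvalue) are choices made for this illustration rather than anything prescribed by the paper.

```python
import math

def laplacian(n, edges):
    """Build the graph Laplacian in (24) as a dense list of rows."""
    L = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        L[i][i] += 1.0
        L[j][j] += 1.0
        L[i][j] -= 1.0
        L[j][i] -= 1.0
    return L

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rayleigh(M, v):
    return sum(a * c for a, c in zip(v, matvec(M, v))) / sum(a * a for a in v)

def power_iteration(M, v, steps=500, deflate_ones=False):
    """Dominant eigenvector of symmetric M; optionally stay orthogonal to the all-one vector."""
    for _ in range(steps):
        if deflate_ones:
            mean = sum(v) / len(v)
            v = [a - mean for a in v]
        v = matvec(M, v)
        norm = math.sqrt(sum(a * a for a in v))
        v = [a / norm for a in v]
    return v

n, edges = 4, [(0, 1), (1, 2), (2, 3)]
L = laplacian(n, edges)
lam_max = rayleigh(L, power_iteration(L, [1.0, 2.0, 3.0, 5.0]))
shift = 2.0 * max(L[i][i] for i in range(n))
M = [[(shift if i == j else 0.0) - L[i][j] for j in range(n)] for i in range(n)]
lam_2 = shift - rayleigh(M, power_iteration(M, [1.0, 2.0, 3.0, 5.0], deflate_ones=True))
print(lam_2, lam_max)
```

For larger graphs a dedicated symmetric eigenvalue solver would be the more practical choice; power iteration is used here only to keep the sketch dependency-free.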

###### Theorem 3 (Rate of convergence)

Suppose Assumption 1 holds. Let be generated by DEAN described in Algorithm 1 with and . Suppose (1) holds and . Then, there exists , such that

$$\|x^k-\tilde{x}\|\le\frac{\max_{\{i,j\}\in\mathcal{E}}\alpha_{\{i,j\}}\,\lambda_{\max}(L_{\mathcal{G}})}{\theta(1-q)}\|x^0\|q^k,$$

 ∥~x−x∗∥≤maxi∈VLi~ρi⋅√NV(x0)2∑i∈V¯θi,

where .

Proof: See Appendix D.

Theorem 3 says that the consensus error among the nodes vanishes at a linear rate. In addition, the consensus $\tilde{x}$ can be sufficiently close to $x^*$ if the weights $\alpha_{\{i,j\}}$ are sufficiently small. Following Theorems 2 and 3, below we present the iteration complexity of DEAN, which states that $\epsilon$-accuracy can be reached within $O\big(\frac{1}{\epsilon}\ln\frac{1}{\epsilon}\big)$ iterations:

###### Theorem 4 (Iteration complexity)

Suppose Assumption 1 holds. Let be generated by DEAN described in Algorithm 1 with . Given any , let

$$\alpha_{\{i,j\}}=\frac{\epsilon}{\bar{\zeta}_{\{i,j\}}+\tilde{\zeta}_{\{i,j\}}\epsilon},\quad\forall\{i,j\}\in\mathcal{E},\tag{25}$$

where and (with and defined in Theorem 2) are such that . Then, for all , where

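$$K_\epsilon=\frac{\Theta}{\lambda_2(L_{\mathcal{G}})}\Big(\frac{\max_{\{i,j\}\in\mathcal{E}}\bar{\zeta}_{\{i,j\}}}{\epsilon}+\max_{\{i,j\}\in\mathcal{E}}\tilde{\zeta}_{\{i,j\}}\Big)\cdot\ln\Bigg(\frac{2\lambda_{\max}(L_{\mathcal{G}})\|x^0\|\Theta\max_{\{i,j\}\in\mathcal{E}}(\bar{\zeta}_{\{i,j\}}+\tilde{\zeta}_{\{i,j\}}\epsilon)}{\epsilon\,\theta\,\lambda_2(L_{\mathcal{G}})\min_{\{i,j\}\in\mathcal{E}}\bar{\zeta}_{\{i,j\}}}\Bigg).$$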

Proof: See Appendix E.

Finally, recall from Section III that when each $f_i$ is a positive definite quadratic function in the form of (10), we guarantee that $\sum_{i\in\mathcal{V}}\nabla f_i(x_i^k)=0$ $\forall k\ge0$. This, along with (22) in Theorem 1, suggests that $\lim_{k\to\infty}x_i^k=x^*$ $\forall i\in\mathcal{V}$. Additionally, the rate of convergence to $x^*$ is derived in the following proposition:

###### Proposition 1

Suppose Assumption 2 holds. For each , let be given by (10), , and . Let be generated by DEAN described in Algorithm 1 with . Also suppose (1) holds. Then, for each ,

$$V(x^k)\le(1-\rho)^kV(x^0),\tag{26}$$

$$\sum_{i\in\mathcal{V}}\theta_i\|x_i^k-x^*\|^2\le(1-\rho)^k\sum_{i\in\mathcal{V}}\Theta_i\|x_i^0-x^*\|^2,\tag{27}$$

where , is a positive semidefinite matrix given by

 [R]ij=⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩(12−1N)Θi+12N2∑ℓ∈VΘℓ,ifi=j,−Θi+Θj2N+12N2∑ℓ∈