Decentralized Approximate Newton Methods
for In-Network Optimization
Abstract
This paper proposes a set of Decentralized Approximate Newton (DEAN) methods for addressing in-network convex optimization, where nodes in a network seek a consensus that minimizes the sum of their individual objective functions through local interactions only. The proposed DEAN algorithms allow each node to repeatedly take a local approximate Newton step, so that the nodes not only jointly emulate the (centralized) Newton method but also drive each other closer. Under weaker assumptions than those of most existing distributed Newton-type methods, the DEAN algorithms enable all the nodes to asymptotically reach a consensus that can be arbitrarily close to the optimum. Also, for a particular DEAN algorithm, the consensus error among the nodes vanishes at a linear rate, and the iteration complexity to achieve any given accuracy in optimality is provided. Furthermore, when the optimization problem reduces to a quadratic program, the DEAN algorithms are guaranteed to converge linearly to the exact optimal solution.
Hejie Wei, Zhihai Qu, Xuyang Wu, Hao Wang, and Jie Lu
This work has been supported by the National Natural Science Foundation of China under grant 61603254 and the Natural Science Foundation of Shanghai under grant 16ZR1422500.
Hejie Wei, Zhihai Qu, Xuyang Wu, Hao Wang, and Jie Lu are with the School of Information Science and Technology, ShanghaiTech University, 201210 Shanghai, China (email: {weihj, quzhh1, wuxy, wanghao1, lujie}@shanghaitech.edu.cn).
Index Terms

Distributed optimization, decentralized algorithm, Newton method
I Introduction
In many engineering applications such as learning by computer networks [1], coordination of multiagent systems [2], estimation by sensor networks [3], and resource allocation in communication networks [4], nodes in a networked system often need to cooperate with each other in order to minimize the sum of their individual objective functions.
There have been a large number of decentralized/distributed algorithms for such in-network optimization problems, which allow nodes in the network to address the problem by interacting with their neighbors only. Most of these algorithms are first-order methods, where the nodes utilize subgradients/gradients of their local objectives to update (e.g., [3, 5, 6, 7, 8, 9, 10, 4, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]). However, first-order algorithms may suffer from slow convergence, especially when the problem is ill-conditioned. This motivates the development of decentralized second-order methods, where the Hessian matrices of the local objectives, if available, are involved in computing the iterates. Such second-order methods can be roughly classified into the following two categories:
The first category consists of methods based on second-order approximations of certain dual-related objectives. For instance, the decentralized Exact Second-Order Method (ESOM) [24] considers a second-order approximation of an augmented Lagrangian function, and the Decentralized Quadratically Approximated ADMM (DQM) [25] introduces a quadratic approximation to a decentralized version of the Alternating Direction Method of Multipliers (ADMM).
The second category consists of Newton-type methods, such as the distributed Broyden-Fletcher-Goldfarb-Shanno (DBFGS) method [26], the Network Newton (NN) method [27], the Distributed Quasi-Newton (DQN) method [28], and the Newton-Raphson Consensus (NRC) method [29]. Among these methods, DBFGS, NN, and DQN relax the consensus constraint by adding a penalty to the objective function and approximate the Newton direction of the penalized objective in a decentralized manner. As a result, these methods are only guaranteed to converge to a suboptimal solution. NRC utilizes an average consensus scheme to approximate the Newton-Raphson direction in a distributed fashion. Although NRC may converge to the exact optimal solution, no explicit parameter condition that guarantees convergence is provided, which makes it difficult to implement in practice.
In this paper, we propose a family of Decentralized Approximate Newton methods, referred to as DEAN, for solving in-network convex optimization. The DEAN algorithms are developed by letting every node execute a local Newton-like step at each iteration, where the inverse of the Hessian of its own objective function is involved and the gradient term in the conventional Newton step is replaced by the sum of the differences between the node and each of its neighbors, which are measured by the gradients of a class of locally strongly convex functions associated with the corresponding links. This is intended to approximate the traditional (centralized) Newton method and, at the same time, drive all the nodes together. The DEAN algorithms are endowed with the following results and advantages:

The DEAN algorithms asymptotically drive all the nodes to a consensus that lies in an arbitrarily small neighborhood of the optimum. In addition, if the local objectives are positive definite quadratic functions, the nodes converge to the exact optimum at a linear rate.

With a particular choice of the functions associated with the links in DEAN, the disagreement among the nodes is shown to drop to zero at a linear rate. Further, for any given accuracy , we provide the iteration complexity (i.e., a bound on the number of iterations needed) to achieve suboptimality.

Compared to other Newton-type methods, DEAN has the lowest communication cost per iteration.

Simulation results illustrate the competitive performance of DEAN in comparison with several existing Newton-type methods.
The outline of this paper is as follows: Section II formulates the problem. Section III describes the proposed DEAN algorithms and Section IV is dedicated to the convergence analysis. Section V presents the simulation results. Concluding remarks are provided in Section VI. All the proofs are in the appendix.
I-A Notation and preliminaries
Throughout this paper, we use to denote the Euclidean norm, the unordered pair, the absolute value of a real number or the cardinality of a set, and the range of a matrix. In addition, is the dimensional all-one vector and is the identity matrix. For any , is the diagonal matrix whose diagonal entries are . For any , is the vector obtained by stacking . Given any and , represents the closed ball with center and radius . Also, for any set , is the convex hull of and is the projection of onto . For any differentiable function , denotes the gradient of at and, if is twice differentiable, represents the Hessian matrix of at . For any , means is positive semidefinite and means is positive definite. For any symmetric positive semidefinite matrix , we use to denote the th smallest eigenvalue of , the largest eigenvalue of , and the pseudoinverse of .
A differentiable function is said to be locally strongly convex if for any convex and compact set , there exists such that for any , where is called the convexity parameter of on . It is said to be (globally) strongly convex if there exists such that for any . A vector-valued or matrix-valued function is said to be locally Lipschitz continuous if for any compact set contained in the domain of , there exists such that the Lipschitz condition holds for all . Also, is said to be the Lipschitz constant of on .
II Problem Formulation
Consider an undirected, connected graph , where , is the node set and is the link set. For each node , the set of its neighbors is denoted by . The nodes are required to solve the following problem:
(1) 
where each is the local objective function of node and satisfies the following assumption:
Assumption 1
For each , is twice continuously differentiable and locally strongly convex, and has a minimizer. In addition, is locally Lipschitz continuous.
Assumption 1 guarantees that each has a unique minimizer , so that there is a unique optimal solution to problem (1). In addition, given any convex and compact set , there exist such that . Another implication of Assumption 1 is that is locally Lipschitz continuous.
Among the existing distributed second-order methods [24, 25, 26, 27, 28, 29], most assume the ’s to be (globally) strongly convex [24, 25, 27, 28, 29], which is more restrictive than the local strong convexity in Assumption 1. One example of functions that are locally strongly convex but not (globally) strongly convex is the objective of logistic regression [1], i.e., with given and , which often arises in machine learning. The DBFGS method [26] allows each to be a general convex function, yet it requires, like other second-order algorithms in [24, 27, 28, 29], each to be globally Lipschitz continuous, which is unnecessary for problem (1) under Assumption 1. Moreover, the local Lipschitz continuity of in Assumption 1 is weaker than the three times continuous differentiability of in [29] and the global Lipschitz continuity of in [24, 25, 27].
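The gap between local and global strong convexity can be checked numerically on a toy case. The sketch below assumes a hypothetical scalar logistic-type objective built from two samples with opposite labels, f(x) = log(1 + e^{-x}) + log(1 + e^{x}); this instance is chosen only for illustration and is not taken from the paper. Its second derivative is positive everywhere, so it is bounded away from zero on any compact interval, yet it decays to zero as |x| grows, so no single positive convexity parameter works globally.

```python
import math

def sigmoid(t):
    """Logistic function 1 / (1 + e^{-t})."""
    return 1.0 / (1.0 + math.exp(-t))

def curvature(x):
    """Second derivative of the assumed toy objective
    f(x) = log(1 + e^{-x}) + log(1 + e^{x}), i.e. 2 * s * (1 - s)."""
    s = sigmoid(x)
    return 2.0 * s * (1.0 - s)

# Curvature at the minimizer x = 0 and at two points farther away.
points = [0.0, 5.0, 20.0]
curvatures = [curvature(x) for x in points]

for x, c in zip(points, curvatures):
    print(f"x = {x:5.1f}  f''(x) = {c:.3e}")
```

On an interval [-R, R] the smallest curvature is attained at the endpoints, which gives a valid local convexity parameter for that interval.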
III Decentralized Approximate Newton Methods
In this section, we develop a class of decentralized Newton-type algorithms to address problem (1).
To do so, we first reformulate problem (1) as follows:
s.t. 
where and . If we directly apply the classic Newton method to solve the above problem, i.e.,
(2) 
where is the Newton stepsize, then generally the consensus constraint cannot be satisfied. To overcome this, we replace the gradient term in (2) by , where is given by
(3) 
with . In this way, if the th dimensional block of is assigned to node , we obtain
(4) 
This replacement can potentially drive all the nodes to a consensus. To see this, consider the special case where each , . In this case, problem (1) becomes an average consensus problem and (4) reduces to
This is indeed a classic linear consensus scheme [30], with which all the ’s asymptotically reach a consensus at if the ’s are appropriately selected.
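For intuition, a linear consensus iteration of this kind can be sketched in a few lines. The code below is a minimal illustration under stated assumptions: scalar states, a hypothetical 3-node path graph, and a single symmetric weight `w = 0.25` shared by all links. None of these values come from the paper. Because the weights are symmetric, each step preserves the sum of the states, so the common limit is the average of the initial values.

```python
def consensus_step(x, edges, w):
    """One linear consensus step: x_i <- x_i - sum_j w * (x_i - x_j)."""
    new_x = list(x)
    for i, j in edges:
        diff = x[i] - x[j]
        new_x[i] -= w * diff   # node i moves toward node j
        new_x[j] += w * diff   # node j moves toward node i
    return new_x

# Hypothetical setup: path graph 0 - 1 - 2 with scalar initial states.
edges = [(0, 1), (1, 2)]
x = [1.0, 5.0, 9.0]
target = sum(x) / len(x)       # average of the initial states

for _ in range(200):
    x = consensus_step(x, edges, 0.25)

print(x, target)
```

If the weight is chosen too large relative to the node degrees, the iteration can oscillate or diverge, which is why the weights must be selected appropriately.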
To better capture the behavior of the Newton method, we further extend the linear consensus term in (4) to a nonlinear one, i.e.,
(5) 
where associated with each link are a constant weight and a convex function . We assume to satisfy a similar but less restrictive assumption than :
Assumption 2
For each , is continuously differentiable and locally strongly convex. In addition, is locally Lipschitz continuous.
There are numerous choices of satisfying Assumption 2. For example, we may let
(6) 
or
(7) 
where can be any symmetric positive definite matrix known to both node and node . Observe that when each , (5) reduces exactly to (4).
Although (5) may enable the nodes to eventually attain a consensus, it is still unclear how far the ’s are from the optimum . Notice that for each ,
(8) 
in which the second equality follows from (5) that . This, together with (5) and the local Lipschitz continuity of each , implies that if remain bounded for all , then
for some constant . This indicates that once all the ’s become identical, the gradient sum would remain constant. Further, if the initial gradient sum is zero, by properly selecting the weights , we may be able to keep very small, so that the ’s, once agreeing with each other, would be sufficiently close to the optimum .
Based on the above observations, to make each approach the optimum , we set each to the unique minimizer of , i.e.,
(9) 
Hence, . Note from this and (8) that if each is a positive definite quadratic function, i.e.,
(10) 
then . Thus, if all the ’s reach a consensus, the consensus is the optimum .
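The quadratic case can be illustrated with a small numerical sketch. It assumes scalar local objectives f_i(x) = (a_i / 2)(x - b_i)^2 with hypothetical coefficients, a 3-node path graph, a common link weight, and the simplest link function, so that each node scales its weighted disagreement with its neighbors by the inverse of its own Hessian a_i. These choices are illustrative assumptions rather than the paper's experimental setup. Starting each node at its own minimizer b_i, the gradient sum stays at zero and the nodes agree on the minimizer of the total objective.

```python
def dean_step(x, a, edges, w):
    """One Newton-like step for quadratics:
    x_i <- x_i - (1 / a_i) * sum_j w * (x_i - x_j)."""
    disagreement = [0.0] * len(x)
    for i, j in edges:
        disagreement[i] += w * (x[i] - x[j])
        disagreement[j] += w * (x[j] - x[i])
    return [xi - di / ai for xi, di, ai in zip(x, disagreement, a)]

def gradient_sum(x, a, b):
    """Sum of local gradients a_i * (x_i - b_i)."""
    return sum(ai * (xi - bi) for xi, ai, bi in zip(x, a, b))

# Hypothetical data: Hessians a_i, local minimizers b_i, path graph 0 - 1 - 2.
a = [1.0, 2.0, 4.0]
b = [0.0, 3.0, -1.0]
edges = [(0, 1), (1, 2)]

x = list(b)                                             # start at local minimizers
x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(a)  # minimizer of the sum

max_grad_sum = 0.0
for _ in range(500):
    x = dean_step(x, a, edges, 0.25)
    max_grad_sum = max(max_grad_sum, abs(gradient_sum(x, a, b)))

print(x, x_star, max_grad_sum)
```

For non-quadratic objectives the gradient sum is no longer preserved exactly, which is why the general guarantee is a consensus near the optimum rather than at it.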
The initialization (9) and the update (5) together yield a class of Decentralized Approximate Newton methods, referred to as DEAN algorithms. As shown in Algorithm 1, the implementation of the DEAN algorithms is fully decentralized. The initialization (9) can be completed by each node on its own. The update (5) requires each node to evaluate the inverse of the Hessian of its local objective at its current estimate and to exchange with its neighbors.
Prior to implementing DEAN, each pair of neighboring nodes needs to agree on the selection of the function satisfying Assumption 2. For the option of given by (6), the update (5) can be executed if each node shares its local objective with all its neighbors. However, this could be prohibitively costly in some cases. Instead, the nodes may adopt the following scheme to avoid exchanging the ’s: For every , each node first sends and to all its neighbors. Upon receiving and , each node computes for every and sends it to neighbor . Through such local interactions, each node is able to update via (5) and (6) without exchanging its local objective with its neighbors. For another example of in (7), each pair of neighbors only needs to jointly determine , which can be done at negligible communication cost (e.g., we may simply set ).
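The two-round exchange described above can be sketched for a single link. The code assumes, for illustration only, that the link function is the sum of the two endpoint objectives, g_ij = f_i + f_j, and uses two hypothetical scalar objectives. Each node evaluates only its own gradient, at its own state and at the state received from its neighbor, and still recovers the gradient difference of the link function.

```python
import math

# Hypothetical private gradients; each function is known only to its owner.
def grad_f_i(x):
    return 2.0 * (x - 1.0)      # gradient of f_i(x) = (x - 1)^2

def grad_f_j(x):
    return math.tanh(x)         # gradient of f_j(x) = log(cosh(x))

x_i, x_j = 0.5, -2.0

# Round 1: each node sends its state and its own gradient at that state.
msg_i_to_j = (x_i, grad_f_i(x_i))
msg_j_to_i = (x_j, grad_f_j(x_j))

# Round 2: each node evaluates its own gradient at the received state
# and sends the result back.
reply_i_to_j = grad_f_i(msg_j_to_i[0])
reply_j_to_i = grad_f_j(msg_i_to_j[0])

# Node i assembles grad g_ij(x_i) - grad g_ij(x_j) from local and received data.
diff_at_i = (grad_f_i(x_i) + reply_j_to_i) - (reply_i_to_j + msg_j_to_i[1])

# Node j assembles grad g_ij(x_j) - grad g_ij(x_i) in the same way.
diff_at_j = (grad_f_j(x_j) + reply_i_to_j) - (reply_j_to_i + msg_i_to_j[1])

# Reference value computed with full knowledge of both objectives.
def grad_g(x):
    return grad_f_i(x) + grad_f_j(x)

direct = grad_g(x_i) - grad_g(x_j)
print(diff_at_i, diff_at_j, direct)
```

The two assembled differences are negatives of each other, which is what keeps the weighted sum over all links equal to zero when the link weights are symmetric.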
Remark 1
If all the weights are identical, the DEAN algorithms (5) and (9) can be viewed as a finite-difference discretization of the continuous-time zero-gradient-sum (ZGS) algorithms in the earlier work [31], for which the ZGS manifold is guaranteed to be positively invariant. Nevertheless, here we allow each to be distinct and determined only by the neighboring nodes . Moreover, unlike the ZGS algorithms that require global strong convexity of the ’s to establish convergence, the DEAN algorithms relax this condition to local strong convexity. Furthermore, the discrete-time nature of DEAN requires significantly different tools for convergence analysis.
Finally, we compare DEAN with the existing decentralized Newton-type methods, including DBFGS [26], NN-K, K [27], DQN-K, K [28], and NRC [29], with respect to their communication costs at each iteration. Note from Algorithm 1 that DEAN essentially requires every node to transmit vectors of dimension per iteration. Every node at each iteration in DBFGS needs to transmit vectors of the same dimension. During one iteration of NN-K and DQN-K , transmissions of dimensional vectors are executed by each node. In addition, NRC needs every node to transmit vectors in and matrices in at each iteration. Therefore, the communication cost of DEAN is the lowest among these Newton-type methods.
IV Convergence Analysis
In this section, we analyze the convergence performance of the DEAN algorithms.
To this end, we utilize the Lyapunov function candidate given by
(11) 
Due to Assumption 1, and the equality holds if and only if . Hence, can be viewed as a measure of the suboptimality of . Further, we introduce the following notations based on , which will be used to present the convergence results.
First of all, for each , let
where is the initial state in the DEAN algorithms given by (9). Clearly, are compact. Thus, there exist such that
(12)  
(13) 
In addition, for each , define the compact set
(14) 
It follows from Assumption 2 that there exist such that
(15) 
Arbitrarily pick an and suppose the weights are selected from the interval . Then, for each , let
(16) 
Due again to Assumption 1, there exist and such that ,
(17)  
(18) 
Moreover, we let be such that
(19) 
Note that . For convenience, denote and . If is (globally) strongly convex, then we can take as well as the above and all equal to the convexity parameter of over .
Our first result shows that is nonincreasing in and provides its drop at each iteration:
Lemma 1 (Monotonicity of Lyapunov function)
Proof: See Appendix A.
Remark 2
Lemma 1 implies that is a Lyapunov function which keeps strictly decreasing until become identical. Also, since is bounded from below, exists. This leads to the theorem below, which says that all the nodes are able to reach a consensus:
Theorem 1 (Asymptotic convergence to consensus)
Proof: See Appendix B.
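The behavior stated in Lemma 1 and Theorem 1 can be observed on a toy instance. The sketch below assumes a Lyapunov candidate of the Bregman-gap form used in the ZGS literature, V(x) = sum_i [ f_i(x*) - f_i(x_i) - f_i'(x_i)(x* - x_i) ], together with hypothetical scalar quadratic objectives, a 3-node path graph, and a common weight of 0.25. All of these are assumptions made for illustration, and the weight is simply one value for which the decrease holds on this instance.

```python
def f(x, a, b):
    """Assumed local objective f_i(x) = (a / 2) * (x - b)^2."""
    return 0.5 * a * (x - b) ** 2

def grad(x, a, b):
    return a * (x - b)

def lyapunov(x, x_star, a, b):
    """Sum of gaps f_i(x*) - f_i(x_i) - f_i'(x_i) * (x* - x_i)."""
    return sum(
        f(x_star, ai, bi) - f(xi, ai, bi) - grad(xi, ai, bi) * (x_star - xi)
        for xi, ai, bi in zip(x, a, b)
    )

def dean_step(x, a, edges, w):
    """x_i <- x_i - (1 / a_i) * sum_j w * (x_i - x_j)."""
    d = [0.0] * len(x)
    for i, j in edges:
        d[i] += w * (x[i] - x[j])
        d[j] += w * (x[j] - x[i])
    return [xi - di / ai for xi, di, ai in zip(x, d, a)]

a = [1.0, 2.0, 4.0]            # hypothetical Hessians
b = [0.0, 3.0, -1.0]           # hypothetical local minimizers
edges = [(0, 1), (1, 2)]
x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(a)

x = list(b)                    # initialize at the local minimizers
values = [lyapunov(x, x_star, a, b)]
for _ in range(300):
    x = dean_step(x, a, edges, 0.25)
    values.append(lyapunov(x, x_star, a, b))

spread = max(x) - min(x)       # disagreement among the nodes
print(values[0], values[-1], spread)
```

The recorded values decrease until the nodes agree, up to floating-point rounding once the iterates are numerically identical.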
In the following theorem, we further show that the nodes not only attain a consensus as in Theorem 1, but also asymptotically achieve accuracy in optimality for any given (i.e., ), provided that the weights are properly related to :
Theorem 2 (Asymptotic convergence to suboptimality)
Proof: See Appendix C.
Below, we explore the convergence rates of the DEAN algorithms. For simplicity, here we only consider DEAN with each , which indeed can be extended to more general cases. To present the convergence rate results, we consider the Laplacian matrix of the graph :
(24) 
Observe that is symmetric positive semidefinite. Also, since is connected, has only one eigenvalue at zero. Its second smallest eigenvalue (i.e., the algebraic connectivity of ) and its largest eigenvalue . The following theorem shows that the nodes achieve a consensus at a linear rate, which depends on and , and provides a bound on the distance between the consensus and the optimum :
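These properties are easy to verify on a small example. The sketch below assumes the standard unweighted graph Laplacian (degree matrix minus adjacency matrix) and a hypothetical 3-node path graph, for which the eigenvalues are known in closed form to be 0, 1, and 3. It checks the zero eigenvalue associated with the all-one vector, positive semidefiniteness through the quadratic form, and the eigenvectors for the algebraic connectivity and the largest eigenvalue.

```python
def laplacian(n, edges):
    """Unweighted graph Laplacian: degree matrix minus adjacency matrix."""
    L = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        L[i][i] += 1.0
        L[j][j] += 1.0
        L[i][j] -= 1.0
        L[j][i] -= 1.0
    return L

def matvec(L, v):
    return [sum(r * c for r, c in zip(row, v)) for row in L]

def quad_form(L, v):
    return sum(vi * wi for vi, wi in zip(v, matvec(L, v)))

edges = [(0, 1), (1, 2)]       # hypothetical path graph 0 - 1 - 2
L = laplacian(3, edges)

ones_image = matvec(L, [1.0, 1.0, 1.0])   # all-one vector lies in the null space

# v^T L v equals the sum of squared differences across links, hence >= 0.
v = [0.3, -1.2, 2.5]
qf = quad_form(L, v)
edge_sum = sum((v[i] - v[j]) ** 2 for i, j in edges)

# Known eigenpairs of the 3-node path graph: eigenvalue 1 and eigenvalue 3.
fiedler_image = matvec(L, [1.0, 0.0, -1.0])
largest_image = matvec(L, [1.0, -2.0, 1.0])

print(L, ones_image, qf, edge_sum)
```

A better-connected graph has a larger algebraic connectivity; for instance, adding the link (0, 2) to this graph raises the second smallest eigenvalue from 1 to 3.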
Theorem 3 (Rate of convergence)
Proof: See Appendix D.
Theorem 3 says that the consensus error among the nodes vanishes at a linear rate. In addition, the consensus can be sufficiently close to if the weights are sufficiently small. Following Theorems 2 and 3, below we present the iteration complexity of DEAN, which states that accuracy can be reached within iterations:
Theorem 4 (Iteration complexity)
Proof: See Appendix E.