Distributed Non-Convex First-Order Optimization and Information Processing: Lower Complexity Bounds and Rate Optimal Algorithms
Abstract
We consider a class of distributed nonconvex optimization problems that often arise in modern distributed signal and information processing, in which a number of agents connected by a network collectively optimize a sum of smooth (possibly nonconvex) local objective functions. We address the following fundamental question: for a class of unconstrained nonconvex problems with Lipschitz continuous gradient, using only local gradient information, what is the fastest rate that distributed algorithms can achieve, and how can that rate be achieved?
We develop a lower bound analysis that identifies difficult problem instances for any first-order method. We show that in the worst case it takes any first-order algorithm iterations to achieve certain solution, where is the network diameter, and is the Lipschitz constant of the gradient. Further, for a general problem class and a number of network classes, we propose optimal primal-dual gradient methods whose rates precisely match the lower bounds (up to a polylog factor). To the best of our knowledge, this is the first time that lower rate bounds and optimal methods have been developed for distributed nonconvex problems. Our results provide guidelines for the future design of distributed optimization algorithms, convex and nonconvex alike.
Keywords. Nonconvex distributed optimization; Optimal methods; Lower Complexity Bounds; Primal-dual methods.
1 Introduction
1.1 Problem and motivation
In this work, we consider the following distributed optimization problem over a network
(1) 
where is a smooth and possibly nonconvex function accessible to agent . There is no central controller, and the agents are connected by a network defined by an undirected and unweighted graph , with vertices and edges. Each agent can only communicate with its immediate neighbors, and it can access one component function .
A common way to reformulate problem (1) in the distributed setting is given below. Define the degree of node as ; Define the incidence matrix (IM) as follows: if and it connects vertex and with , then if , if and otherwise. Introduce local variables , and suppose the graph is connected, then the following formulation is equivalent to the global consensus problem
(2) 
The main benefit of the above formulation is that the objective function is now separable, and the linear constraint encodes the network connectivity pattern.
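To make the reformulation concrete, the following Python sketch builds a signed incidence matrix for a small path graph and checks that the linear constraint vanishes exactly when all local copies agree. The sign convention (+1 at the smaller-indexed endpoint, -1 at the other), the use of scalar local variables, and the helper names `incidence_matrix` and `consensus_residual` are assumptions made for illustration.

```python
def incidence_matrix(num_nodes, edges):
    """Return the |E| x |V| signed incidence matrix as a list of rows."""
    rows = []
    for (i, j) in edges:
        lo, hi = min(i, j), max(i, j)
        row = [0] * num_nodes
        row[lo] = 1    # assumed convention: +1 at the smaller index
        row[hi] = -1   # and -1 at the larger index
        rows.append(row)
    return rows


def consensus_residual(A, x):
    """Return A @ x; the entry for edge (i, j) equals x_i - x_j."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


edges = [(0, 1), (1, 2), (2, 3)]          # path graph on 4 nodes
A = incidence_matrix(4, edges)
agree = consensus_residual(A, [2.5, 2.5, 2.5, 2.5])      # all copies equal
disagree = consensus_residual(A, [1.0, 2.0, 2.0, 4.0])   # copies differ
print(agree, disagree)
```

On a connected graph, a zero residual on every edge forces all local variables to coincide, which is why the separable problem with the linear constraint is equivalent to the global consensus problem.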
1.2 Distributed nonconvex optimization
Nonconvex distributed optimization has gained considerable attention recently, and has found applications in training neural networks [1], distributed information processing and machine learning [2, 3, 4], and distributed signal processing [5].
Problems (1) and (2) have been studied extensively in the literature when the component functions are all convex; see for example [6, 7, 8, 9, 10, 11]. Primal methods such as the distributed subgradient method [6] and the EXTRA method [9], as well as primal-dual based methods such as the distributed augmented Lagrangian method [11] and the Alternating Direction Method of Multipliers (ADMM) [12, 13, 14, 15], have been proposed.
In contrast, only recently have there been works addressing the more challenging problems without assuming convexity of the component functions; see recent developments in [16, 17, 18, 4, 5, 19, 20, 21]. Reference [4] develops nonconvex ADMM-based methods (with the first known global sublinear convergence rate) for solving the distributed consensus problem (1). However, the network considered therein is a star network in which the local nodes are all connected to a central controller. Reference [20] proposes a primal-dual based method for unconstrained problems over a connected network, and derives the first global convergence rate for this setting. In [5] and follow-up works [22, 23], the authors utilize a gradient tracking idea to solve a constrained nonsmooth distributed problem over possibly time-varying networks. It is worth noting that the distributed algorithms proposed in all these works converge to first-order stationary solutions, which include local maxima, local minima and saddle points. Only recently, the authors of [24] developed first-order distributed algorithms that are capable of computing second-order stationary solutions (which under suitable conditions become locally optimal solutions).
1.3 Lower and upper rate bounds analysis
Despite all the recent interest and contributions in this field, one major question remains open:
(Q) What is the best convergence rate achievable by any distributed algorithm for the nonconvex problem (1)?
Question (Q) seeks to find a “best convergence rate”, which characterizes the smallest number of iterations required to achieve certain high-quality solutions, among all distributed algorithms. Clearly, understanding (Q) provides fundamental insights into distributed optimization and information processing. For example, the answer to (Q) can provide meaningful optimal estimates of the total amount of communication effort required to achieve a given level of accuracy. Further, the identified optimal strategies capable of attaining the best convergence rates will also help guide the practical design of distributed information processing algorithms.
Question (Q) is easy to state, but formulating it rigorously is quite involved, and a number of delicate issues have to be clarified. Below we provide a high-level discussion of some of these issues.
(1) Fix Problem and Network Classes. A class of problems and networks of interest should be fixed. Roughly speaking, in this work is the family of smooth unconstrained problem (1), and is defined over the set of connected and unweighted graphs with finite number of nodes.
(2) Characterize High-Quality Solutions. For a properly defined error constant , one needs to define a high-quality solution in the distributed and nonconvex setting. Unlike in the centralized case, the following questions have to be addressed: Should the solution quality be evaluated based on the averaged iterates among all the agents, or on the individual iterates? Should some consensus measure be included in the solution characterization? Different solution notions could potentially lead to different lower and upper rate bounds.
(3) Fix Algorithm Classes. A class of algorithms has to be fixed. In the classical complexity analysis in (centralized) optimization, it is common to define the class of algorithms by the information structures that they utilize [25]. In the distributed and nonconvex setting, it is necessary to specify both function information that can be used by individual nodes, as well as the communication protocols allowed.
(4) Develop Sharp Upper Bounds. It is necessary to develop algorithms within class , which possess provable and sharp global convergence rate for problem/network class . These algorithms provide achievable upper bounds on the global convergence rates.
(5) Identify Lower Bounds. It is important to characterize the worst rates achievable by any algorithm in class for problem/network class . This task involves identifying instances in that are difficult for algorithm class .
(6) Match Lower and Upper Bounds. The key task is to investigate whether the developed algorithms are rate optimal, in the sense that the rate upper bounds match the worst-case lower bounds. Roughly speaking, matching two bounds requires that for the class of problem and networks , the following quantities should be matched between the lower and upper bounds: i) the order of the error constants ; ii) the order of problem parameters such as , or that of network parameters such as the spectral gap, diameter, etc.
Convergence rate analysis (aka iteration complexity analysis) for convex problems dates back to Nesterov, Nemirovsky and Yudin [26, 27], in which lower bounds and optimal first-order algorithms have been developed; also see [28]. In recent years, many accelerated first-order algorithms achieving those lower bounds for different kinds of convex problems have been derived; see e.g., [29, 30, 31], including those developed for distributed convex optimization [32]. In those works, the optimality measure used is , and the lower bound can be expressed as [28, Theorem 2.2.2]
(3) 
where is the Lipschitz constant for ; (resp. ) is the global optimal solution (resp. the initial solution); is the iteration index. Therefore to achieve optimal solution in which , one needs iterations. Recently the above approach has been extended to distributed strongly convex optimization in [33]. In particular, the authors consider problem (1) in which each is strongly convex, and they provide lower and upper rate bounds for a class of algorithms in which the local agents can utilize both and its Fenchel conjugate . Unfortunately, this result is not directly related to the class of “first-order” methods of interest: beyond the first-order gradient information, the Fenchel conjugate is also needed, and computing this quantity requires an exact minimization, which itself involves solving a strongly convex optimization problem (with unknown complexity). Recent related works in this direction also include [34]. Therefore the optimal first-order distributed algorithms for strongly convex problems are still left open, not to mention those for general convex and nonconvex distributed problems.
Network Instances  Problem Classes  
Uniform Lipschitz  Nonuniform Lipschitz  Rate Achieving Algorithm  
Complete/Star Graph  DGPDA (proposed)  
Path+Star Graph  DGPDA+ (proposed)  
Path/Circle Graph  DGPDA+ (proposed)  
Centralized  Gradient Descent 
When the problem becomes nonconvex, the size of the gradient function can be used as a measure of solution quality. In particular, let , then it has been shown that the classical gradient descent (GD) method achieves the following rate [28, page 28]
It has been shown in [35] that the above rate is (almost) tight for GD. Recently, [36] has further shown that the above rate is optimal for any first-order method that only utilizes gradient information, when applied to problems with Lipschitz gradient. However, no lower bound analysis has been developed for the distributed nonconvex problem (14); there are not even many algorithms that provide achievable upper rate bounds (except for the recent works [4, 20, 37, 38]), let alone any analysis of the tightness/sharpness of these upper bounds.
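As a numerical illustration of the gradient-size measure, the sketch below runs GD with step size 1/L on a hypothetical one-dimensional nonconvex function and compares the smallest squared gradient seen so far with the standard descent-lemma bound 2L(f(x0) - f*)/(T+1). The test function x^2/(1+x^2), its gradient Lipschitz constant L = 2, the lower bound f* = 0, and the exact constant in the bound are assumptions chosen for illustration; they are not taken from [28].

```python
def f(x):
    """Hypothetical smooth nonconvex test function, bounded below by 0."""
    return x * x / (1.0 + x * x)


def grad(x):
    """Derivative of f; its own derivative is bounded by 2 in absolute value."""
    return 2.0 * x / (1.0 + x * x) ** 2


LIPSCHITZ = 2.0  # assumed gradient Lipschitz constant of the test function


def gd_min_grad_sq(x0, num_iters):
    """Run GD with step 1/L and return min_t |f'(x_t)|^2 over t = 0..num_iters."""
    x, best = x0, float("inf")
    for _ in range(num_iters + 1):
        g = grad(x)
        best = min(best, g * g)
        x -= g / LIPSCHITZ
    return best


x0, T = 3.0, 200
best = gd_min_grad_sq(x0, T)
bound = 2.0 * LIPSCHITZ * (f(x0) - 0.0) / (T + 1)
print(best, bound)
```

The bound follows from summing the per-step decrease f(x_{t+1}) <= f(x_t) - |f'(x_t)|^2 / (2L) over the iterations, so the smallest squared gradient decays at least as fast as 1/T.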
1.4 Contribution of this work
In this work, we address various issues that arise in answering (Q). Our main contributions are given below:
1) We identify a class of nonconvex problems and networks , a class of distributed firstorder algorithms , and rigorously define the optimality gap that measures the progress of the algorithms;
2) We develop the first lower rate bound for class to solve class : to achieve optimality, it is necessary for any to perform rounds of communication among all the nodes;
3) We design a class of algorithms belonging to [named distributed gradient primaldual algorithm (DGPDA), and its variants], which achieves optimality condition with provable global rates [in the order of ];
4) We show that the DGPDA class are optimal methods in for problem class as well as a number of its refinements, in that they precisely achieve the lower complexity bounds under all these situations (up to a polylog factor).
The main results of the paper are outlined in Table 1.
Notations. For a given symmetric matrix , we use , and to denote the maximum, the minimum and the minimum nonzero eigenvalues for a symmetric matrix ; We use to denote an identity matrix with size . We use to denote the set . For a vector we use to denote its th element. We use to denote where is the problem dimension. We use to denote two connected nodes and .
2 Preliminaries
2.1 The class , ,
We present the classes of problems, networks and algorithms to be studied, as well as some useful results. We parameterize these classes using a few key parameters so that we can specify their subclasses when needed.
Problem Class. A problem is in class if it satisfies the following conditions.

[A1] The objective is a sum of functions; see (1).
[A2] Each component function has a Lipschitz continuous gradient:
(4)
Define the matrix of Lipschitz constants as:
(5)
[A3] The function is lower bounded over , i.e.,
(6)
These assumptions are rather mild. For example, a function satisfying [A2-A3] is not required to be second-order differentiable. Below we provide a few nonconvex functions that satisfy Assumptions [A2-A3], each of which can serve as a component function. Note that the first four functions are of particular interest in learning neural networks, as they are commonly used as activation functions.
(1) The sigmoid function is given by We have , , therefore [A2A3] are true with .
(2) The function satisfies , . So [A2A3] hold with .
(3) The function satisfies so [A2A3] hold with .
(4) The logit function is related to the function as follows
then Assumptions [A2A3] are again satisfied.
(5) The function has applications in structured matrix factorization [39]. Clearly it is lower bounded. Its second order derivative is and it is also bounded.
(6) Other functions like , , are easy to verify. Consider where . This function is interesting because it is not secondorder differentiable; nonetheless we can verify that [A2A3] are satisfied with .
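The Lipschitz-gradient property of such activation functions can be checked numerically. The sketch below estimates the gradient Lipschitz constant of the sigmoid and of tanh by finite differences on a grid and compares the estimates with the suprema of the second derivatives obtained by calculus, 1/(6*sqrt(3)) for the sigmoid and 4/(3*sqrt(3)) for tanh. The choice of tanh as a second example, the grid, and the helper names are assumptions for illustration; the constants quoted in the list above are not reproduced here.

```python
import math


def sigmoid_grad(x):
    """First derivative of the sigmoid function."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def tanh_grad(x):
    """First derivative of tanh."""
    t = math.tanh(x)
    return 1.0 - t * t


def estimate_lipschitz(g, lo=-6.0, hi=6.0, steps=1200):
    """Largest finite-difference slope of g between adjacent grid points."""
    h = (hi - lo) / steps
    best = 0.0
    for k in range(steps):
        a = lo + k * h
        b = a + h
        best = max(best, abs(g(b) - g(a)) / (b - a))
    return best


sigmoid_est = estimate_lipschitz(sigmoid_grad)
tanh_est = estimate_lipschitz(tanh_grad)
sigmoid_sup = 1.0 / (6.0 * math.sqrt(3.0))   # sup of |sigmoid''|
tanh_sup = 4.0 / (3.0 * math.sqrt(3.0))      # sup of |tanh''|
print(sigmoid_est, sigmoid_sup, tanh_est, tanh_sup)
```

By the mean value theorem each finite-difference slope is bounded by the supremum of the second derivative, so the estimates approach the analytical constants from below as the grid is refined.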
Network Class. Consider the following descriptions.

The network is represented by an undirected and unweighted graph , with vertices and edges, and edge weights all being .

The graph is connected, with graph Laplacian matrix , and the normalized Laplacian (NL) matrix [40]
(7) 
The graph has diameter , which is the largest shortest-path distance between any pair of nodes:
(8)
We use to denote a class of network described in with nodes and being its diameter. Based on different definitions of Laplacian matrices, we define the spectral gap of the graph as
(9) 
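The diameter used to parameterize the network class can be computed with a breadth-first search from every node. The following sketch does so for three small graphs; the function name `diameter` and the edge-list representation are assumptions for illustration.

```python
from collections import deque


def diameter(num_nodes, edges):
    """Largest shortest-path distance between any two nodes of a connected graph."""
    adj = [[] for _ in range(num_nodes)]
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)

    def eccentricity(src):
        # Breadth-first search gives hop distances from src.
        dist = [-1] * num_nodes
        dist[src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if min(dist) < 0:
            raise ValueError("graph is not connected")
        return max(dist)

    return max(eccentricity(s) for s in range(num_nodes))


n = 6
path_edges = [(k, k + 1) for k in range(n - 1)]
star_edges = [(0, k) for k in range(1, n)]
complete_edges = [(i, j) for i in range(n) for j in range(i + 1, n)]
print(diameter(n, path_edges), diameter(n, star_edges), diameter(n, complete_edges))
```

The path graph attains the largest diameter among connected graphs on the same number of nodes, while the complete and star graphs have constant diameter.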
Algorithm Class. Define the neighbor set for node as
(10) 
We say that a distributed, firstorder algorithm is in class if it satisfies the following conditions.

At iteration , each node can obtain some network related constants, such as , , , , etc.

At iteration , the output of node is a linear combination of the historical output of its neighboring set, as well as historical local gradients, i.e.,
(11) The linear combination coefficients can be dependent on those constants obtained at iteration .
Clearly is a class of first-order methods because only historical gradient information is used in the computation. It is also a class of distributed algorithms because at each iteration each node is only allowed to communicate with its neighbors. Note that we do not specify the per-iteration communication patterns, therefore each can receive either one or multiple vectors of size , as long as they come from the set specified in (11). This adds flexibility to practical algorithm design; for example, one can allow asynchronous communication in which, at a given iteration, delayed messages from the past are also received (although we do not investigate these possibilities in the current work).
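One simple member of such a class is a decentralized gradient step in which every node mixes its neighbors' previous outputs and subtracts a multiple of its own local gradient. The sketch below is an illustration only and is not the DGPDA method proposed in this work: the mixing weight 1/3, the step size, and the local functions (only node 0 has a nonzero gradient) are hypothetical choices. It shows that, on a path graph, information originating at node 0 travels at most one hop per iteration.

```python
def neighbor_gradient_step(x, neighbors, grads, alpha, w=1.0 / 3.0):
    """One synchronous update: mix neighbor outputs, then take a local gradient step."""
    new_x = []
    for i, xi in enumerate(x):
        mix = (1.0 - w * len(neighbors[i])) * xi
        mix += sum(w * x[j] for j in neighbors[i])
        new_x.append(mix - alpha * grads[i](xi))
    return new_x


n = 6
# Path graph: node i talks to i-1 and i+1 when they exist.
neighbors = [[j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)]
# Hypothetical local gradients: only node 0 carries information.
grads = [lambda x: x - 1.0] + [lambda x: 0.0] * (n - 1)

x = [0.0] * n
for _ in range(3):
    x = neighbor_gradient_step(x, neighbors, grads, alpha=0.5)
print(x)
```

After three iterations only the first three nodes have moved away from zero, so a node at distance d from the informative node needs at least on the order of d iterations before its output can change, which is the mechanism through which network distances enter lower bound arguments.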
2.2 Solution Quality Measure
Next we provide definitions for the quality of the solution. Since we consider using first-order methods to solve nonconvex problems, it is expected that in the end some first-order stationary solution will be computed, in which the size of the gradient of is small.
Our first definition is related to a global variable . We say that is a globalsolution if the following holds
(12) 
This definition is conceptually simple and it is identical to the centralized criterion in Section 1.3. However it has the following issues. First, no global variable will be formed in the entire network, so criterion (12) is difficult to evaluate. Second, there is no characterization of how close the local variables ’s are. To see the second point, consider the following toy example that arises in the nonconvex setting.
Example 1: Consider a network with and and . Suppose local variable and . Then if we pick , we have
This suggests that at iteration , there exists one linear combination that makes measure (12) precisely zero. However one can hardly say that the current solution is a good solution for problem (2).
To address the above issues, we provide a second definition which is directly related to local variables . We say that is a local solution if the following holds
(13) 
Clearly this definition takes into consideration the consensus error as well as the size of the local gradients. When applied to Example 1, this measure will be large. We will use the shorthand notation to denote the left-hand side of (13).
In our work we will focus on answering the following question: For any given , how many iterations (as a function of ) does any algorithm in class need to achieve the local solution?
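The difference between the two measures can be seen on a small numerical instance. The sketch below uses two hypothetical nonconvex local functions (not the ones in Example 1) and a hypothetical local measure formed as the squared average of local gradients plus the squared disagreement across the single edge; the exact weighting used in (13) is not reproduced here, so this is only an illustration of the idea.

```python
def grad_f1(y):
    """Gradient of the hypothetical local function y^2 / (1 + y^2)."""
    return 2.0 * y / (1.0 + y * y) ** 2


def grad_f2(y):
    """Gradient of the same function shifted so that its minimizer is at y = 2."""
    return grad_f1(y - 2.0)


def global_measure(y):
    """Squared gradient of the averaged objective at a single global point y."""
    g = 0.5 * (grad_f1(y) + grad_f2(y))
    return g * g


def local_measure(x1, x2):
    """Assumed local measure: squared averaged local gradient plus consensus error."""
    g = 0.5 * (grad_f1(x1) + grad_f2(x2))
    return g * g + (x1 - x2) ** 2


x1, x2 = 0.0, 2.0            # each agent sits at a stationary point of its own function
y = 0.5 * (x1 + x2)          # a linear combination of the local variables
print(global_measure(y), local_measure(x1, x2))
```

Here the global measure evaluated at the averaged point is exactly zero, yet the two agents are far from agreeing; the consensus term makes the local measure large, which matches the motivation for the second definition.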
2.3 Some Useful Facts
Below we provide a few facts about the above classes.
On Lipschitz constants. Assume that each has Lipschitz continuous gradient with constant in (4). Then we have
(14) 
Then we have the following [the matrix is defined in (5)]
which implies
(15) 
On Quantities for Graph . Define the following matrices:
(16) 
Define where the absolute value is taken componentwise. Then we have
(17)  
where is the degree matrix defined in (7). The NL matrix defined in (7) can be expressed as
For two diagonal matrices and of appropriate sizes, the normalized generalized Laplacian (NGL) matrix is defined as
(18) 
and in particular
Below we list some useful results about NL [41] [40]. First, all eigenvalues of lie in the interval . The eigenvalues of NL for a number of special graphs are given below:
1) Complete Graph: The eigenvalues are and (with multiplicity ), so ;
2) Star Graph: The eigenvalues are and (with multiplicity ), and , so ;
3) Path Graph: The eigenvalues are for , and .
4) Cycle Graph: The eigenvalues are for , and .
5) Path+Star Graph: Consider the following graph. First construct a path graph with nodes, then divide the remaining nodes into groups, each with either or nodes. Starting from the two ends, connect each group to one of the path nodes using a star topology. It is easy to verify that the resulting graph has the following property:
(19) 
Then for this new graph we have . This fact is due to [40, Lemma 1.9], that
(20) 
Since for NL , we get the desired bound.
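The spectra of the normalized Laplacian for simple graphs can be checked directly. The sketch below assumes the standard closed forms from spectral graph theory, 1 - cos(pi k / (n - 1)) for the path graph and 1 - cos(2 pi k / n) for the cycle graph on n nodes, together with the corresponding cosine eigenvectors (scaled by the square roots of the degrees in the path case), and verifies that each pair satisfies the eigenvalue equation. The helper names and the graph size are assumptions for illustration.

```python
import math


def normalized_laplacian(n, edges):
    """Return (I - D^{-1/2} A D^{-1/2}, degrees) for an unweighted graph."""
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    lap = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        w = 1.0 / math.sqrt(deg[i] * deg[j])
        lap[i][j] -= w
        lap[j][i] -= w
    return lap, deg


def residual(lap, lam, v):
    """Largest entry of |lap @ v - lam * v|."""
    return max(
        abs(sum(a * b for a, b in zip(row, v)) - lam * vi)
        for row, vi in zip(lap, v)
    )


n = 8
path_lap, deg = normalized_laplacian(n, [(k, k + 1) for k in range(n - 1)])
path_res = max(
    residual(
        path_lap,
        1.0 - math.cos(math.pi * k / (n - 1)),
        [math.sqrt(deg[j]) * math.cos(math.pi * k * j / (n - 1)) for j in range(n)],
    )
    for k in range(n)
)

cycle_lap, _ = normalized_laplacian(n, [(k, (k + 1) % n) for k in range(n)])
cycle_res = max(
    residual(
        cycle_lap,
        1.0 - math.cos(2.0 * math.pi * k / n),
        [math.cos(2.0 * math.pi * k * j / n) for j in range(n)],
    )
    for k in range(n)
)
print(path_res, cycle_res)
```

Under these closed forms the smallest nonzero eigenvalue of the path graph is 1 - cos(pi / (n - 1)), which shrinks on the order of 1/n^2 as the path grows, whereas the complete and star graphs keep a spectral gap bounded away from zero.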
Below we show an extension of the result [40, Section 1.4], on the lower bound for the second smallest eigenvalue of .
Lemma 2.1
Suppose that is a connected graph. We have
(21) 
3 Lower Complexity Bounds
In this section we develop the lower complexity bounds for algorithms in class to solve problems over network . Our proof is for the case where ’s have uniform Lipschitz constants, i.e., , , but our analysis approach can be extended to nonuniform cases as well. Our proof combines ideas from the classical proof in Nesterov [25], as well as two recent constructions [36] (for centralized nonconvex problems) and [33] (for strongly convex distributed problems). Our construction differs from the previous works in a number of ways; in particular, the constructed functions are only first-order differentiable, but not second-order differentiable. Further, we use the local solution (13) to measure the quality of the solution, which makes the analysis more involved compared with the existing global error measures in [25, 36, 33].
To begin with, we construct the following two nonconvex functions
(22) 
as well as their average version
(23) 
Here we have , for all , , and . Later we make our construction so that functions and are easy to analyze, while and will be in the desired function class in . Without loss of generality, in the construction we will assume will be Lipschitz with constant , for all .
3.1 Path Graph ()
First we consider the extreme case in which the nodes form a path graph with nodes and each node has its own local function , shown in Figure 1.
For notational simplicity assume that is a multiple of , that is for some integer . Also assume that is an odd number without loss of generality.
Let us define the component functions ’s in (22) as follows.
(24) 
where we have defined the following functions
(25a)  
(25b) 
The component functions are given as below
Suppose , then the average function becomes:
Further for a given error constant and a given averaged Lipschitz constant , let us define
(26) 
Therefore we also have, if , then
(27) 
First we present some properties of the component functions ’s.
Lemma 3.1
The functions and satisfy the following.

For all , , .

We have the following bounds for the functions and their first and secondorder derivatives:

We have the following key property:
(28) 
The function is lower bounded as follows:

The first order derivative of (resp. ) is Lipschitz continuous with some constant (resp. , ).
Proof. Property 1) is obviously true.
To prove Property 2), note that the following holds for (note that is not second-order differentiable over ).
(29) 
Obviously, is an increasing function (), therefore the lower and upper bounds are ; is increasing on and decreasing on , where , therefore the lower and upper bounds are ; is decreasing on and increasing on , therefore the lower and upper bounds are , i.e.,
Further, for all , the following holds
(30) 
Similarly, we have
To show Property 3), note that for all and ,
where the first inequality is true because is strictly increasing and is strictly decreasing for all , and that .
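Properties of this kind are easy to sanity-check numerically. The sketch below assumes the forms Psi(w) = 1 - exp(-w^2) for w > 0 (and 0 otherwise) and Phi(w) = 4 arctan(w) + 2 pi; these forms are an assumption that is consistent with the monotonicity statements in the proof of Property 2), and they should be checked against definitions (25a) and (25b). Under this assumption, the code verifies on a grid that Psi(w) Phi'(v) exceeds 1 whenever w >= 1 and |v| < 1, and that the derivative of Psi peaks at the value sqrt(2/e).

```python
import math


def psi(w):
    """Assumed form of Psi: zero for w <= 0, 1 - exp(-w^2) otherwise."""
    return 1.0 - math.exp(-w * w) if w > 0 else 0.0


def psi_prime(w):
    """Derivative of the assumed Psi."""
    return 2.0 * w * math.exp(-w * w) if w > 0 else 0.0


def phi_prime(v):
    """Derivative of the assumed Phi(v) = 4 * atan(v) + 2 * pi."""
    return 4.0 / (1.0 + v * v)


ws = [1.0 + 0.01 * k for k in range(1000)]        # grid over w >= 1
vs = [-0.999 + 0.001 * k for k in range(1999)]    # grid over |v| < 1
# Both factors are positive, so the smallest product is the product of the minima.
min_product = min(psi(w) for w in ws) * min(phi_prime(v) for v in vs)
max_psi_prime = max(psi_prime(0.001 * k) for k in range(5000))
print(min_product, max_psi_prime, math.sqrt(2.0 / math.e))
```

With these assumed forms the smallest product on the grid is attained at w = 1 and |v| close to 1, where Psi(1) = 1 - 1/e and Phi' is slightly above 2, so the margin above 1 is comfortable.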
Next we show Property 4). Note that and . Therefore we have and using the construction in (24)
(31)  
(32) 
where the first inequality follows and second follows , we reach the conclusion.
Finally we show Property 5), using the fact that a continuous function is Lipschitz if it is piecewise smooth with bounded derivative. From construction (24), the first-order partial derivative of can be expressed as follows.
Case I) If is even, we have
(33) 
Case II) If is odd but not 1, we have
(34) 
Case III) If , we have
(35) 
Obviously, is a piecewise smooth function for any , either equals zero or separated at the nondifferentiable point because of the function .
Further, fix a point and a unit vector where . Define
to be the directional projection of on to the direction at point . We will show that there exists such that for all .
where we take and .
The second order partial derivative of () is given as follows when is even:
Applying Lemma 3.1 i) and ii), we have:
Note that the bound here is exactly the same when is odd but not , while for we have the following
Therefore we have