Distributed Non-Convex First-Order Optimization and Information Processing: Lower Complexity Bounds and Rate Optimal Algorithms

Haoran Sun and Mingyi Hong

H. Sun and M. Hong are with the Department of Electrical and Computer Engineering (ECE), University of Minnesota, Minneapolis, MN 55414, USA. Email: {sun00111,mhong}@umn.edu

Abstract

We consider a class of distributed non-convex optimization problems that often arise in modern distributed signal and information processing, in which a number of agents connected by a network collectively optimize a sum of smooth (possibly non-convex) local objective functions. We address the following fundamental question: for a class of unconstrained non-convex problems with Lipschitz continuous gradient, using only local gradient information, what is the fastest rate that distributed algorithms can achieve, and how can those rates be achieved?

We develop a lower bound analysis that identifies difficult problem instances for any first-order method. We show that in the worst case it takes any first-order algorithm $\mathcal{O}(DL/\epsilon)$ iterations to achieve a certain $\epsilon$-solution, where $D$ is the network diameter and $L$ is the Lipschitz constant of the gradient. Further, for a general problem class and a number of network classes, we propose optimal primal-dual gradient methods whose rates precisely match the lower bounds (up to a polylog factor). To the best of our knowledge, this is the first time that lower rate bounds and optimal methods have been developed for distributed non-convex problems. Our results provide guidelines for the future design of distributed optimization algorithms, convex and non-convex alike.

Keywords. Non-convex distributed optimization; Optimal methods; Lower Complexity Bounds; Primal-dual methods.

1 Introduction

1.1 Problem and motivation

In this work, we consider the following distributed optimization problem over a network

(1)

where is a smooth and possibly non-convex function accessible to agent . There is no central controller, and the agents are connected by a network defined by an undirected and unweighted graph , with vertices and edges. Each agent can only communicate with its immediate neighbors, and it can access one component function .

A common way to reformulate problem (1) in the distributed setting is given below. Define the degree of node as ; Define the incidence matrix (IM) as follows: if and it connects vertex and with , then if , if and otherwise. Introduce local variables , and suppose the graph is connected, then the following formulation is equivalent to the global consensus problem

(2)

The main benefit of the above formulation is that the objective function is now separable, and the linear constraint encodes the network connectivity pattern.
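To make the role of the linear constraint concrete, the following Python sketch builds the incidence matrix of a small path graph using the sign convention described above, and checks that the product of the matrix with the stacked local variables vanishes when all nodes agree and does not vanish otherwise. The helper names (`incidence_matrix`, `matvec`, `gram`), the 0-based node labels, the 4-node path, and the use of scalar local variables are all assumptions made for illustration. The final check relies on the standard fact that the Gram matrix of a signed incidence matrix is the graph Laplacian.

```python
def incidence_matrix(num_nodes, edges):
    """One row per edge: +1 at the larger endpoint, -1 at the smaller one."""
    rows = []
    for (u, v) in edges:
        i, j = max(u, v), min(u, v)  # convention from the text: i > j
        row = [0] * num_nodes
        row[i] = 1
        row[j] = -1
        rows.append(row)
    return rows


def matvec(mat, vec):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * x for a, x in zip(row, vec)) for row in mat]


def gram(mat):
    """Return mat^T mat as a list of rows."""
    n = len(mat[0])
    return [[sum(row[p] * row[q] for row in mat) for q in range(n)]
            for p in range(n)]


# Hypothetical example: a path graph 0 - 1 - 2 - 3 with scalar local variables.
path_edges = [(0, 1), (1, 2), (2, 3)]
A = incidence_matrix(4, path_edges)

consensus = [2.5, 2.5, 2.5, 2.5]   # all local copies agree
disagree = [2.5, 2.5, 1.0, 2.5]    # node 2 holds a different value

residual_consensus = matvec(A, consensus)  # every edge difference is zero
residual_disagree = matvec(A, disagree)    # edges touching node 2 are nonzero
laplacian = gram(A)                        # degree matrix minus adjacency
```

On a connected graph the only vectors annihilated by the incidence matrix are the consensus vectors, which is why the constrained formulation matches the global consensus problem.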

1.2 Distributed non-convex optimization

Non-convex distributed optimization has gained considerable attention recently, and has found applications in training neural networks [1], distributed information processing and machine learning [2, 3, 4], and distributed signal processing [5].

Problems (1) and (2) have been studied extensively in the literature when the component functions are all convex; see for example [6, 7, 8, 9, 10, 11]. Primal methods such as the distributed subgradient method [6] and the EXTRA method [9], as well as primal-dual based methods such as the distributed augmented Lagrangian method [11] and the Alternating Direction Method of Multipliers (ADMM) [12, 13, 14, 15], have been proposed.

In contrast, only recently have works addressed the more challenging problems that do not assume convexity of the component functions; see recent developments in [16, 17, 18, 4, 5, 19, 20, 21]. Reference [4] develops a non-convex ADMM-based method (with the first known global sublinear convergence rate) for solving the distributed consensus problem (1). However, the network considered therein is a star network in which the local nodes are all connected to a central controller. Reference [20] proposes a primal-dual based method for unconstrained problems over a connected network, and derives the first global convergence rate for this setting. In [5] and the follow-up works [22, 23], the authors use a gradient tracking idea to solve a constrained nonsmooth distributed problem over possibly time-varying networks. It is worth noting that the distributed algorithms proposed in all these works converge to first-order stationary solutions, which include local maxima, local minima, and saddle points. Only recently, the authors of [24] developed first-order distributed algorithms that are capable of computing second-order stationary solutions (which under suitable conditions become locally optimal solutions).

1.3 Lower and upper rate bounds analysis

Despite all the recent interests and contributions in this field, one major question remains open:

(Q)   What is the best convergence rate achievable by any distributed algorithms for the non-convex problem (1)?

Question (Q) seeks to find a “best convergence rate”, which is a characterization of the smallest number of iterations required to achieve certain high-quality solutions, among all distributed algorithms. Clearly, understanding (Q) provides fundamental insights into distributed optimization and information processing. For example, the answer to (Q) can provide meaningful optimal estimates of the total amount of communication effort required to achieve a given level of accuracy. Further, the identified optimal strategies capable of attaining the best convergence rates will also help guide the practical design of distributed information processing algorithms.

Question (Q) is easy to state, but formulating it rigorously is quite involved, and a number of delicate issues have to be clarified. Below we provide a high-level discussion of some of these issues.

(1) Fix Problem and Network Classes. A class of problems and networks of interest should be fixed. Roughly speaking, in this work is the family of smooth unconstrained problem (1), and is defined over the set of connected and unweighted graphs with finite number of nodes.

(2) Characterize High-Quality Solutions. For a properly defined error constant , one needs to define a high-quality solution in the distributed and non-convex setting. Unlike in the centralized case, the following questions have to be addressed: Should the solution quality be evaluated based on the averaged iterates among all the agents, or on the individual iterates? Should some consensus measure be included in the solution characterization? Different solution notions could potentially lead to different lower and upper rate bounds.

(3) Fix Algorithm Classes. A class of algorithms has to be fixed. In the classical complexity analysis in (centralized) optimization, it is common to define the class of algorithms by the information structures that they utilize [25]. In the distributed and non-convex setting, it is necessary to specify both function information that can be used by individual nodes, as well as the communication protocols allowed.

(4) Develop Sharp Upper Bounds. It is necessary to develop algorithms within class , which possess provable and sharp global convergence rate for problem/network class . These algorithms provide achievable upper bounds on the global convergence rates.

(5) Identify Lower Bounds. It is important to characterize the worst rates achievable by any algorithm in class for problem/network class . This task involves identifying instances in that are difficult for algorithm class .

(6) Match Lower and Upper Bounds. The key task is to investigate whether the developed algorithms are rate optimal, in the sense that rate upper bounds match the worst-case lower bounds. Roughly speaking, matching two bounds requires that for the class of problem and networks , the following quantities should be matched between the lower and upper bounds: i) the order of the error constants ; ii) the order of problem parameters such as , or that of network parameters such as the spectral gap, diameter, etc.

Convergence rate analysis (aka iteration complexity analysis) for convex problems dates back to Nesterov, Nemirovsky and Yudin [26, 27], in which lower bounds and optimal first-order algorithms have been developed; also see [28]. In recent years, many accelerated first-order algorithms achieving those lower bounds for different kinds of convex problems have been derived; see e.g., [29, 30, 31], including those developed for distributed convex optimization [32]. In those works, the optimality measure used is , and the lower bound can be expressed as [28, Theorem 2.2.2]

(3)

where is the Lipschitz constant for ; (resp. ) is the global optimal solution (resp. the initial solution); is the iteration index. Therefore to achieve -optimal solution in which , one needs iterations. Recently the above approach has been extended to distributed strongly convex optimization in [33]. In particular, the authors consider problem (1) in which each is strongly convex, and they provide lower and upper rate bounds for a class of algorithms in which the local agents can utilize both and its Fenchel conjugate . Unfortunately, this result is not directly related to the class of “first-order” methods of interest: beyond the first-order gradient information, the Fenchel conjugate is also needed, and computing this quantity requires an exact minimization, which itself involves solving a strongly convex optimization problem (with unknown complexity). Recent related work in this direction includes [34]. Therefore, optimal first-order distributed algorithms for strongly convex problems remain open, not to mention those for general convex and non-convex distributed problems.

Network Instances Problem Classes
Uniform Lipschitz Nonuniform Lipschitz Rate Achieving Algorithm
Complete/Star Graph D-GPDA (proposed)
Path+Star Graph D-GPDA+ (proposed)
Path/Circle Graph D-GPDA+ (proposed)
Centralized Gradient Descent
Table 1: The main results of the paper. The entries show the best rate bounds achieved by the proposed algorithms (either D-GPDA or D-GPDA+) for a number of specific graphs and problem classes; is the Lipschitz constant for [see (4)]; for the uniform case ; for the nonuniform case we assume that for some constant and ; is the diameter of graph . For the uniform Lipschitz case, the lower rate bounds derived for each particular graph match the upper rate bounds (we only show the latter in the table). The last row shows the rate achieved by the centralized gradient descent algorithm. The notation $\widetilde{\mathcal{O}}$ denotes big-$\mathcal{O}$ up to a polynomial in logarithms.

When the problem becomes non-convex, the size of the gradient function can be used as a measure of solution quality. In particular, let , then it has been shown that the classical gradient descent (GD) method achieves the following rate [28, page 28]

It has been shown in [35] that the above rate is (almost) tight for GD. Recently, [36] has further shown that the above rate is optimal for any first-order method that only utilizes gradient information, when applied to problems with Lipschitz gradient. However, no lower bound analysis has been developed for the distributed non-convex problem (1); there are not even many algorithms that provide achievable upper rate bounds (except for the recent works [4, 20, 37, 38]), not to mention any analysis of the tightness/sharpness of these upper bounds.
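The kind of guarantee discussed above can be checked numerically. The sketch below assumes the textbook bound for gradient descent with step size 1/L on an L-smooth function that is bounded below, namely that the smallest squared gradient norm over the first T iterates is at most 2L(f(x^0) - f*)/T. The constant 2 comes from the standard descent lemma and is not taken from the text; the test function cos(x), the starting point, and all names are hypothetical choices for illustration.

```python
import math


def gradient_descent(grad, x0, lipschitz, num_iters):
    """Run x <- x - (1/L) * grad(x) and return the iterates x^0, ..., x^T."""
    xs = [x0]
    for _ in range(num_iters):
        xs.append(xs[-1] - grad(xs[-1]) / lipschitz)
    return xs


# Hypothetical test function: f(x) = cos(x) is non-convex, satisfies
# |f''(x)| <= 1 (so L = 1), and is bounded below by f* = -1.
def f(x):
    return math.cos(x)


def grad_f(x):
    return -math.sin(x)


L, T, x0 = 1.0, 50, 1.0
iterates = gradient_descent(grad_f, x0, L, T)

# Smallest squared gradient norm among x^0, ..., x^{T-1}.
best_sq_grad = min(grad_f(x) ** 2 for x in iterates[:T])

# Descent-lemma bound (an assumption of this sketch, see the lead-in).
bound = 2.0 * L * (f(x0) - (-1.0)) / T
```

The point of the sketch is only that the measured quantity decays at least as fast as 1/T; it says nothing about the distributed setting studied in this work.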

1.4 Contribution of this work

In this work, we address various issues that arise in answering (Q). Our main contributions are given below:

1) We identify a class of non-convex problems and networks , a class of distributed first-order algorithms , and rigorously define the -optimality gap that measures the progress of the algorithms;

2) We develop the first lower rate bound for class to solve class : to achieve -optimality, it is necessary for any to perform rounds of communication among all the nodes;

3) We design a class of algorithms belonging to [named distributed gradient primal-dual algorithm (D-GPDA), and its variants], which achieves -optimality condition with provable global rates [in the order of ];

4) We show that the D-GPDA class are optimal methods in for problem class as well as a number of its refinements, in that they precisely achieve the lower complexity bounds under all these situations (up to a polylog factor).

The main results of the paper are outlined in Table 1.

Notations. For a given symmetric matrix , we use , and to denote its maximum, minimum, and minimum nonzero eigenvalues, respectively. We use to denote an identity matrix with size . We use to denote the set . For a vector we use to denote its th element. We use to denote where is the problem dimension. We use to denote two connected nodes and .

2 Preliminaries

2.1 The class , ,

We present the classes of problems, networks and algorithms to be studied, as well as some useful results. We parameterize these classes using a few key parameters so that we can specify their subclasses when needed.

Problem Class. A problem is in class if it satisfies the following conditions.

  • [A1] The objective is a sum of functions; see (1).

  • [A2] Each component function has a Lipschitz continuous gradient:

    (4)

    Define the matrix of Lipschitz constants as:

    (5)
  • [A3] The function is lower bounded over , i.e.,

    (6)

These assumptions are rather mild. For example, a function satisfying [A2-A3] is not required to be second-order differentiable. Below we provide a few non-convex functions that satisfy Assumptions [A2-A3], each of which can serve as a component function. Note that the first four functions are of particular interest in learning neural networks, as they are commonly used as activation functions.

(1) The sigmoid function is given by We have , , therefore [A2-A3] are true with .

(2) The function satisfies , . So [A2-A3] hold with .

(3) The function satisfies so [A2-A3] hold with .

(4) The logit function is related to the function as follows

then Assumptions [A2-A3] are again satisfied.

(5) The function has applications in structured matrix factorization [39]. Clearly it is lower bounded. Its second order derivative is and it is also bounded.

(6) Other functions like , , are easy to verify. Consider where . This function is interesting because it is not second-order differentiable; nonetheless we can verify that [A2-A3] are satisfied with .
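As a numerical sanity check of the first example, the following sketch estimates a Lipschitz constant for the gradient of the sigmoid function by taking the largest difference quotient of its derivative over a grid. The grid range, the step size, and the helper names are assumptions made for illustration. The closed-form value 1/(6*sqrt(3)) used for comparison is the supremum of the absolute second derivative of the sigmoid, stated here as a known calculus fact rather than as the constant used in the text; any larger constant is also a valid Lipschitz constant.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """First derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def estimate_gradient_lipschitz(grad, lo, hi, step):
    """Largest difference quotient of `grad` between adjacent grid points."""
    best = 0.0
    num_cells = int(round((hi - lo) / step))
    for k in range(num_cells):
        a = lo + k * step
        b = a + step
        best = max(best, abs(grad(b) - grad(a)) / (b - a))
    return best


# Grid-based estimate; by the mean value theorem it cannot exceed sup |s''|.
lipschitz_estimate = estimate_gradient_lipschitz(sigmoid_grad, -10.0, 10.0, 0.01)

# Known closed form of sup |s''(x)|, roughly 0.0962.
closed_form = 1.0 / (6.0 * math.sqrt(3.0))

# The sigmoid itself takes values in (0, 1), so it is bounded below by 0.
lower_bound_ok = all(sigmoid(-10.0 + 0.5 * k) > 0.0 for k in range(41))
```

The same grid check can be applied to the other examples by swapping in their derivatives.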

Network Class. Consider the following descriptions.

  • The network is represented by an undirected and unweighted graph , with vertices and edges, and edge weights all being .

  • The graph is connected, with graph Laplacian matrix , and the normalized Laplacian (NL) matrix [40]

    (7)
  • The graph has diameter , which is the largest shortest-path distance between any pair of nodes:

    (8)
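The diameter of a small graph can be computed directly from its definition as the largest shortest-path (hop) distance between any two nodes. The sketch below does this with a breadth-first search from every node; the function names, the 0-based node labels, and the choice of six nodes are assumptions for illustration, and the graph is assumed to be connected.

```python
from collections import deque


def adjacency(num_nodes, edges):
    """Neighbor lists of an undirected, unweighted graph."""
    nbrs = [[] for _ in range(num_nodes)]
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    return nbrs


def hop_distances(nbrs, source):
    """Breadth-first search: hop distance from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in nbrs[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter(num_nodes, edges):
    """Largest hop distance over all pairs (the graph must be connected)."""
    nbrs = adjacency(num_nodes, edges)
    return max(max(hop_distances(nbrs, s).values()) for s in range(num_nodes))


M = 6  # hypothetical number of nodes
path = [(i, i + 1) for i in range(M - 1)]
cycle = path + [(M - 1, 0)]
star = [(0, i) for i in range(1, M)]
complete = [(i, j) for i in range(M) for j in range(i + 1, M)]

diameters = {
    "path": diameter(M, path),
    "cycle": diameter(M, cycle),
    "star": diameter(M, star),
    "complete": diameter(M, complete),
}
```

For these four graphs the diameter ranges from 1 (complete graph) to M - 1 (path graph), which is why the path graph is the natural extreme case for a diameter-dependent bound.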

We use to denote a class of network described in with nodes and being its diameter. Based on different definitions of Laplacian matrices, we define the spectral gap of the graph as

(9)

Algorithm Class. Define the neighbor set for node as

(10)

We say that a distributed, first-order algorithm is in class if it satisfies the following conditions.

  • At iteration , each node can obtain some network related constants, such as , , , , etc.

  • At iteration , the output of node is a linear combination of the historical output of its neighboring set, as well as historical local gradients, i.e.,

    (11)

    The linear combination coefficients can be dependent on those constants obtained at iteration .

Clearly is a class of first-order methods because only historical gradient information is used in the computation. It is also a class of distributed algorithms because at each iteration each node is only allowed to communicate with its neighbors. Note that we do not specify the per-iteration communication patterns; therefore each node can receive either one or multiple vectors of size , as long as they come from the set specified in (11). This adds flexibility to practical algorithm design: for example, one can allow asynchronous communication in which, at a given iteration, delayed messages from the past are also received (although we do not investigate these possibilities in the current work).
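To illustrate what an update of this linear-combination form can look like, the sketch below implements decentralized gradient descent with Metropolis mixing weights on a path graph. This is a generic textbook method chosen for illustration, not the D-GPDA algorithm proposed in this work, and it assumes that a node's neighbor set includes the node itself. The quadratic local functions, the step size, and all names are hypothetical.

```python
def metropolis_weights(num_nodes, edges):
    """Symmetric, doubly stochastic mixing matrix supported on the graph edges."""
    deg = [0] * num_nodes
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    W = [[0.0] * num_nodes for _ in range(num_nodes)]
    for u, v in edges:
        w = 1.0 / (1 + max(deg[u], deg[v]))
        W[u][v] = w
        W[v][u] = w
    for i in range(num_nodes):
        W[i][i] = 1.0 - sum(W[i])  # diagonal entry makes each row sum to one
    return W


def dgd_step(x, W, local_grads, step):
    """Each node mixes its neighbors' values (W is zero for non-neighbors)
    and subtracts a multiple of its own local gradient."""
    n = len(x)
    return [
        sum(W[i][j] * x[j] for j in range(n)) - step * local_grads[i](x[i])
        for i in range(n)
    ]


# Hypothetical local objectives f_i(x) = 0.5 * (x - b_i)^2 on a 5-node path.
targets = [1.0, -2.0, 4.0, 0.5, 3.0]
local_grads = [(lambda x, b=b: x - b) for b in targets]
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
W = metropolis_weights(5, edges)

x = [0.0] * 5
for _ in range(400):
    x = dgd_step(x, W, local_grads, 0.1)

mean_x = sum(x) / len(x)                   # network average of the iterates
mean_target = sum(targets) / len(targets)  # minimizer of the summed objective
spread = max(x) - min(x)                   # remaining disagreement across nodes
```

With a constant step size the network average reaches the minimizer of the summed objective, while the individual iterates keep a nonzero disagreement, which is one reason solution quality measures for distributed methods need a consensus term.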

2.2 Solution Quality Measure

Next we define the quality of a solution. Since we consider using first-order methods to solve non-convex problems, it is expected that in the end some first-order stationary solution will be computed, in which the size of the gradient of is small.

Our first definition is related to a global variable . We say that is a global--solution if the following holds

(12)

This definition is conceptually simple and it is identical to the centralized criterion in Section 1.3. However, it has the following issues. First, no global variable will be formed in the entire network, so criterion (12) is difficult to evaluate. Second, there is no characterization of how close the local variables are to each other. To see the second point, consider the following toy example that arises in the non-convex setting.

Example 1: Consider a network with and and . Suppose local variable and . Then if we pick , we have

This suggests that at iteration , there exists one linear combination that makes measure (12) precisely zero. However one can hardly say that the current solution is a good solution for problem (2).

To address the above issues, we provide a second definition which is directly related to local variables . We say that is a local -solution if the following holds

(13)

Clearly this definition takes into consideration the consensus error as well as the size of the local gradients. When applied to Example 1, this measure will be large. We will use shorthand notation to denote the left-hand side of (13).

In our work we will focus on providing answers to the following question: for any given $\epsilon$, how many iterations (as a function of $\epsilon$) does any algorithm in the class defined in Section 2.1 need to achieve a local $\epsilon$-solution?
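The difference between a global and a local quality measure can be seen on a small numerical instance. The sketch below is a hypothetical stand-in and does not reproduce Example 1 or the exact form of (13): it uses two nodes that both hold the non-convex function cos(y), evaluates the global gradient at the average of the local variables, and compares it with an assumed local measure consisting of the squared averaged local gradient plus the averaged squared disagreement across edges.

```python
import math


def global_measure(avg_grad, y):
    """Squared gradient of the averaged objective at a single point y."""
    return avg_grad(y) ** 2


def local_measure(local_grads, x, edges):
    """Assumed local measure (not the paper's exact definition): squared
    averaged local gradient plus averaged squared disagreement over edges."""
    m = len(x)
    avg_local_grad = sum(g(xi) for g, xi in zip(local_grads, x)) / m
    consensus_error = sum((x[i] - x[j]) ** 2 for i, j in edges) / m
    return avg_local_grad ** 2 + consensus_error


# Hypothetical instance: f_1(y) = f_2(y) = cos(y), so the average is cos(y).
local_grads = [lambda y: -math.sin(y), lambda y: -math.sin(y)]


def avg_grad(y):
    return -math.sin(y)


x = [-1.0, 1.0]                # local variables that clearly disagree
y = sum(x) / len(x)            # one particular linear combination: the average

global_value = global_measure(avg_grad, y)               # zero at y = 0
local_value = local_measure(local_grads, x, [(0, 1)])    # dominated by disagreement
```

In this instance the global measure vanishes at the averaged point even though the two nodes hold very different values, whereas the assumed local measure stays large because of the consensus term.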

2.3 Some Useful Facts

Below we provide a few facts about the above classes.

On Lipschitz constants. Assume that each has Lipschitz continuous gradient with constant in (4). Then we have

(14)

Then we have the following [the matrix is defined in (5)]

which implies

(15)

On Quantities for Graph . Define the following matrices:

(16)

Define where the absolute value is taken component-wise. Then we have

(17)

where is the degree matrix defined in (7). The NL matrix defined in (7) can be expressed as

For two diagonal matrices and of appropriate sizes, the normalized generalized Laplacian (NGL) matrix is defined as

(18)

and in particular

Below we list some useful results about NL [40, 41]. First, all eigenvalues of lie in the interval . The eigenvalues of NL for a number of special graphs are given below:

1) Complete Graph: The eigenvalues are and (with multiplicity ), so ;

2) Star Graph: The eigenvalues are and (with multiplicity ), and , so ;

3) Path Graph: The eigenvalues are for , and .

4) Cycle Graph: The eigenvalues are for , and .

5) -Path+Star Graph: Consider the following graph. First construct a path graph with nodes, then divide the remaining nodes into groups, each with either or nodes. Starting from the two ends, connect each group to one of the path nodes using a star topology. It is easy to verify that the resulting graph has the following property:

(19)

Then for this new graph we have . This fact is due to [40, Lemma 1.9], that

(20)

Since for NL , we get the desired bound.
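The spectra of the normalized Laplacian for the first four special graphs can be checked numerically on a small instance. The sketch below assumes the classical closed forms from spectral graph theory: 0 and M/(M-1) for the complete graph; 0, 1 and 2 for the star graph; 1 - cos(pi k/(M-1)) for the path graph; and 1 - cos(2 pi k/M) for the cycle graph, with k = 0, ..., M-1. It confirms each value by testing that the shifted matrix is singular. The helper names, the choice M = 6, and the determinant-based test are illustrative assumptions; the test does not verify multiplicities.

```python
import math


def normalized_laplacian(num_nodes, edges):
    """I - D^{-1/2} (adjacency) D^{-1/2} for a graph without isolated nodes."""
    deg = [0] * num_nodes
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    nl = [[1.0 if i == j else 0.0 for j in range(num_nodes)]
          for i in range(num_nodes)]
    for u, v in edges:
        w = 1.0 / math.sqrt(deg[u] * deg[v])
        nl[u][v] = -w
        nl[v][u] = -w
    return nl


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0  # numerically singular
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for k in range(col, n):
                a[r][k] -= factor * a[col][k]
    return det


def is_eigenvalue(matrix, lam, tol=1e-9):
    """lam is an eigenvalue exactly when (matrix - lam * I) is singular."""
    n = len(matrix)
    shifted = [[matrix[i][j] - (lam if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    return abs(determinant(shifted)) < tol


M = 6  # hypothetical number of nodes
path = [(i, i + 1) for i in range(M - 1)]
cycle = path + [(M - 1, 0)]
star = [(0, i) for i in range(1, M)]
complete = [(i, j) for i in range(M) for j in range(i + 1, M)]

claimed = {
    "complete": (complete, [0.0, M / (M - 1)]),
    "star": (star, [0.0, 1.0, 2.0]),
    "path": (path, [1 - math.cos(math.pi * k / (M - 1)) for k in range(M)]),
    "cycle": (cycle, [1 - math.cos(2 * math.pi * k / M) for k in range(M)]),
}
all_confirmed = all(
    is_eigenvalue(normalized_laplacian(M, edges), lam)
    for edges, lams in claimed.values()
    for lam in lams
)
```

The smallest nonzero eigenvalue of the path graph in this list shrinks as the number of nodes grows, while that of the complete graph stays above one, which matches the intuition that poorly connected graphs are the hard cases.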

Below we show an extension of the result [40, Section 1.4], on the lower bound for the second smallest eigenvalue of .

Lemma 2.1

Suppose that is a connected graph. We have

(21)

3 Lower Complexity Bounds

In this section we develop the lower complexity bounds for algorithms in class to solve problems over network . Our proof is for the case where ’s have uniform Lipschitz constants, i.e., , , but our analysis approach can be extended to non-uniform cases as well. Our proof combines ideas from the classical proof in Nesterov [25], as well as two recent constructions [36] (for centralized non-convex problems) and [33] (for strongly convex distributed problems). Our construction differs from previous works in a number of ways; in particular, the constructed functions are only first-order differentiable, but not second-order differentiable. Further, we use the local $\epsilon$-solution (13) to measure the quality of the solution, which makes the analysis more involved compared with the existing global error measures in [25, 36, 33].

To begin with, we construct the following two non-convex functions

(22)

as well as their average version

(23)

Here we have , for all , , and . Later we make our construction so that functions and are easy to analyze, while and will be in the desired function class in . Without loss of generality, in the construction we will assume will be Lipschitz with constant , for all .

3.1 Path Graph ()

First we consider the extreme case in which the nodes form a path graph with nodes and each node has its own local function , shown in Figure 1.

Figure 1: The path graph used in our construction.
Figure 2: The functional value, and derivatives of .
Figure 3: The functional value, and derivatives of .

For notational simplicity assume that is a multiple of , that is for some integer . Also assume that is an odd number without loss of generality.

Let us define the component functions ’s in (22) as follows.

(24)

where we have defined the following functions

(25a)
(25b)

The component functions are given as below

Suppose , then the average function becomes:

Figure 4: The functional value for .

Further for a given error constant and a given averaged Lipschitz constant , let us define

(26)

Therefore we also have, if , then

(27)

First we present some properties of the component functions ’s.

Lemma 3.1

The functions and satisfy the following.

  1. For all , , .

  2. We have the following bounds for the functions and their first and second-order derivatives:

  3. We have the following key property:

    (28)
  4. The function is lower bounded as follows:

  5. The first order derivative of (resp. ) is Lipschitz continuous with some constant (resp. , ).

Proof. Property 1) is obviously true.

To prove Property 2), note that the following holds for (note, is not second order differentiable over ).

(29)

Obviously, is an increasing function (), therefore the lower and upper bounds are ; is increasing on and decreasing on , where , therefore the lower and upper bounds are ; is decreasing on and increasing on , therefore the lower and upper bounds are , i.e.,

Further, for all , the following holds

(30)

Similarly, we have

To show Property 3), note that for all and ,

where the first inequality is true because is strictly increasing and is strictly decreasing for all , and that .

Next we show Property 4). Note that and . Therefore we have and using the construction in (24)

(31)
(32)

where the first inequality follows and second follows , we reach the conclusion.

Finally we show Property 5), using the fact that a function is Lipschitz if it is piecewise smooth with bounded derivative. From construction (24), the first order partial derivative of can be expressed below.

Case I) If is even, we have

(33)

Case II) If is odd but not 1, we have

(34)

Case III) If , we have

(35)

Obviously, is a piecewise smooth function for any , either equals zero or separated at the non-differentiable point because of the function .

Further, fix a point and a unit vector where . Define

to be the directional projection of on to the direction at point . We will show that there exists such that for all .

where we take and .

The second order partial derivative of () is given as follows when is even:

Applying Properties 1) and 2) of Lemma 3.1, we have:

Note the bound here is exactly the same for is odd but not , while for we have following

Therefore we have