Gradient-Free Multi-Agent Nonconvex Nonsmooth Optimization


Davood Hajinezhad Department of Mechanical Engineering and Materials Science, Duke University, USA, Email: dhajinezhad@gmail.com    Michael M. Zavlanos Department of Mechanical Engineering and Materials Science, Duke University, USA, Email: michael.zavlanos@duke.edu
Abstract

In this paper we consider the problem of minimizing the sum of nonconvex and possibly nonsmooth functions over a connected multi-agent network, where the agents have only partial knowledge of the global cost function and can access only the zeroth-order information (i.e., the functional values) of their local cost functions. We propose and analyze a distributed primal-dual gradient-free algorithm for this challenging problem. We show that, by appropriately choosing the parameters, the proposed algorithm converges to the set of first-order stationary solutions with a provable global sublinear convergence rate. Numerical experiments demonstrate the effectiveness of the proposed method for nonconvex and nonsmooth optimization over a network.


1 Introduction

Consider a network of distributed agents that collectively solve the following optimization problem:

$\min_{x \in \mathbb{R}^d} \; f(x) := \sum_{i=1}^N f_i(x).$  (1)

Here, $f_i : \mathbb{R}^d \to \mathbb{R}$ is a possibly nonconvex and nonsmooth function that is available only to agent $i$. Such distributed optimization problems arise in many applications such as machine learning [1, 2], resource allocation [3], robotic networks [4], and signal processing [5]. See [6] for more applications.

Many distributed optimization methods have been proposed to solve problem (1). Most of them rely on consensus among the agents and assume that the cost functions are convex. One of the first such methods is the distributed subgradient (DSG) algorithm [7]. Subsequently, a number of similar consensus-based algorithms were proposed to solve distributed convex optimization problems of the form (1); see, e.g., [8, 9, 10, 11]. These methods converge only to a neighborhood of the solution set unless they use diminishing stepsizes, which often makes them slow. Faster algorithms using constant stepsizes include the incremental aggregated gradient (IAG) method [12], the exact first-order algorithm (EXTRA) [13], and the Accelerated Distributed Augmented Lagrangian (ADAL) algorithm. See also [14, 15, 16] for optimization of convex problems.

Optimization of nonconvex functions is a much more challenging problem. Only recently have a few distributed algorithms been developed for nonconvex problems, motivated by applications such as resource allocation in ad-hoc networks [17], sparse PCA [18], and flow control in communication networks [19]; see also [20, 21, 22, 23, 24, 25, 26, 27] for additional algorithms developed for nonconvex optimization problems.

Regardless of convexity and/or smoothness, all the aforementioned methods require that either the first-order (gradient/subgradient) information or the explicit form of the objective function be available to the agents. However, in many important practical situations such information can be expensive, or even impossible, to obtain. Examples include simulation-based optimization, where the objective function can only be evaluated using repeated simulation [28]; training deep neural networks, where the relationship between the variables and the cost function is too complicated to derive an explicit form of the gradient [29]; and bandit optimization, where a player optimizes a sequence of cost functions having only knowledge of a single function value each time [30]. In these cases, the zeroth-order information, i.e., the objective function values, is often readily available. Such zeroth-order information can be obtained through a stochastic zeroth-order oracle (SZO). In particular, when queried at a point $x$, the SZO at agent $i$ returns a noisy version of the functional value $f_i(x)$, denoted by $\mathcal{H}_i(x, \xi)$, that satisfies

$\mathbb{E}_\xi\left[ \mathcal{H}_i(x, \xi) \right] = f_i(x),$  (2)

where $\xi$ is a random variable representing the noise.

Recently, centralized zeroth-order optimization has received significant attention. In [31], Nesterov proposed a general framework for analyzing zeroth-order algorithms and provided global convergence rates for both convex and nonconvex problems. In [32], the authors established a stochastic zeroth-order method, which again can deal with both convex and nonconvex (but smooth) optimization problems. In [33], a mirror-descent-based zeroth-order algorithm was proposed for solving convex optimization problems. These zeroth-order algorithms are all centralized and cannot be implemented over a multi-agent network. A few recent works [34, 35, 36] considered zeroth-order distributed convex (possibly nonsmooth) problems, but none of these works can address nonconvex problems. For distributed nonconvex but smooth optimization using zeroth-order information, primal-dual algorithms have been proposed in [37, 38].

In this paper, we propose a new algorithm for distributed nonconvex and nonsmooth optimization with zeroth-order information. Specifically, we first show that this problem can be reformulated as a linearly constrained optimization problem over a connected multi-agent network. Then, we propose a nonconvex primal-dual algorithm, which requires only local communication among the agents and utilizes only local zeroth-order information. Theoretically, we show that the iterates converge to an approximate stationary solution of the problem at a sublinear rate. We also provide numerical results that corroborate the theoretical findings.

Notation. Given a vector $a$ and a matrix $A$, we use $\|a\|$ and $\|A\|$ to denote the Euclidean norm of the vector $a$ and the spectral norm of the matrix $A$, respectively. $A^\top$ represents the transpose of the matrix $A$. We define $[N] := \{1, 2, \dots, N\}$. The notation $\langle a, b \rangle$ is used to denote the inner product of two vectors $a$, $b$. For matrices $A$ and $B$, $A \otimes B$ is the Kronecker product of $A$ and $B$. To denote the identity matrix of size $d$ we use $I_d$. $\mathbb{E}[\cdot]$ denotes the expectation with respect to all random variables, and $\mathbb{E}_\xi[\cdot]$ denotes the expectation with respect to the random variable $\xi$.

2 Problem Definition and Proposed Algorithm

Let us define a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the node set with $|\mathcal{V}| = N$, and $\mathcal{E}$ is the edge set with $|\mathcal{E}| = E$. We assume that $\mathcal{G}$ is undirected, meaning that if $(i, j) \in \mathcal{E}$ then $(j, i) \in \mathcal{E}$. Moreover, every agent $i$ can only communicate with its direct neighbors in the set $N_i := \{ j \mid (i, j) \in \mathcal{E} \}$, and we let $d_i := |N_i|$ denote the degree of node $i$. We assume that the graph $\mathcal{G}$ is connected, meaning that there is a path, i.e., a sequence of nodes where consecutive nodes are neighbors, between any two nodes in $\mathcal{V}$.

In order to decompose problem (1), let us introduce new variables $x_i \in \mathbb{R}^d$ that are local to every agent $i$. Then, problem (1) can be reformulated as follows:

$\min_{\{x_i\}} \; \sum_{i=1}^N f_i(x_i), \quad \text{s.t.} \;\; x_i = x_j, \;\; \forall (i, j) \in \mathcal{E}.$  (3)

The set of constraints enforces consensus on the local variables $x_i$ and $x_j$ for all neighbors $i, j$. We stack all the local variables in a vector $x := [x_1^\top, \dots, x_N^\top]^\top \in \mathbb{R}^{Nd}$. Moreover, we define the degree matrix $\tilde D \in \mathbb{R}^{N \times N}$ to be a diagonal matrix with $\tilde D(i, i) = d_i$; let $D := \tilde D \otimes I_d$. For a given graph $\mathcal{G}$, the incidence matrix $\tilde A \in \mathbb{R}^{E \times N}$ is a matrix with $\tilde A(k, i) = 1$ and $\tilde A(k, j) = -1$ whenever $e_k = (i, j)$ is the $k$th edge of $\mathcal{G}$; the rest of the entries of $\tilde A$ are all zero. Let $A := \tilde A \otimes I_d$. Finally, we define the signed and the signless Laplacian matrices, denoted by $L_-$ and $L_+$, respectively, as

$L_- := A^\top A,$  (4)
$L_+ := 2D - A^\top A.$  (5)

Using the above notation, problem (3) can be written in the following compact form:

$\min_{x \in \mathbb{R}^{Nd}} \; f(x) := \sum_{i=1}^N f_i(x_i), \quad \text{s.t.} \;\; Ax = 0.$  (6)
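To make the matrix definitions concrete, the following minimal Python sketch builds $\tilde A$, $D$, $L_-$, and $L_+$ for a small hypothetical graph (a 3-node ring of our choosing, not one from the paper) and checks that $Ax = 0$ exactly when all local copies agree:

```python
import numpy as np

# hypothetical 3-node ring; edge orientation in the incidence matrix is arbitrary
edges = [(0, 1), (1, 2), (2, 0)]
N, E, d = 3, len(edges), 2                 # d = dimension of each local variable

A_tilde = np.zeros((E, N))
for k, (i, j) in enumerate(edges):         # k-th edge e_k = (i, j)
    A_tilde[k, i], A_tilde[k, j] = 1.0, -1.0

D_tilde = np.diag([sum(1 for e in edges if v in e) for v in range(N)])

A = np.kron(A_tilde, np.eye(d))            # A = A_tilde ⊗ I_d
D = np.kron(D_tilde, np.eye(d))            # D = D_tilde ⊗ I_d

L_minus = A.T @ A                          # signed Laplacian, cf. (4)
L_plus = 2 * D - L_minus                   # signless Laplacian, cf. (5)

# consensus check: A x = 0 iff all local copies x_i are equal (connected graph)
x_consensus = np.tile([1.0, -2.0], N)
assert np.allclose(A @ x_consensus, 0.0)
```

Note that $L_- + L_+ = 2D$, a fact used below to solve the primal update in closed form.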

2.1 Preliminaries

In this section, we first introduce some standard techniques, presented in [31], for approximating the gradient of a given function via smoothing. Suppose that $\phi \in \mathbb{R}^d$ is a standard Gaussian random vector and let $\mu > 0$ be a smoothing parameter. The smoothed version of a function $f$ is defined as

$f_\mu(x) := \mathbb{E}_\phi\left[ f(x + \mu \phi) \right].$  (7)

Then it can be shown that the function $f_\mu$ is differentiable and its gradient is given by Eq. (22) in [31]:

$\nabla f_\mu(x) = \mathbb{E}_\phi\left[ \frac{f(x + \mu \phi) - f(x)}{\mu}\, \phi \right].$  (8)

Further, assuming that the original function $f$ is Lipschitz continuous, i.e., there exists $L_0 > 0$ such that $|f(x) - f(y)| \le L_0 \|x - y\|$ for all $x, y$, it can be shown (see [31, Lemma 2]) that $\nabla f_\mu$ is also Lipschitz continuous with a constant $L_\mu$ proportional to $\sqrt{d}\, L_0 / \mu$. In other words, for all $x, y$ we have

$\|\nabla f_\mu(x) - \nabla f_\mu(y)\| \le L_\mu \|x - y\|.$  (9)

Let $\mathcal{H}(x, \xi)$ denote the noisy functional value of the function $f$ obtained from an associated SZO as in equation (2). In view of (8), the gradient of $f_\mu$ can be approximated as

$G_\mu(x, \phi, \xi) = \frac{\mathcal{H}(x + \mu \phi, \xi) - \mathcal{H}(x, \xi)}{\mu}\, \phi,$  (10)

where the constant $\mu > 0$ is the smoothing parameter. It can be easily checked that $G_\mu$ is an unbiased estimator of $\nabla f_\mu(x)$, i.e.,

$\mathbb{E}_{\phi, \xi}\left[ G_\mu(x, \phi, \xi) \right] = \nabla f_\mu(x).$  (11)

For simplicity we write $G_\mu(x)$ when the random variables are clear from the context. For a given number $J$ of independent samples $\{(\phi_j, \xi_j)\}_{j=1}^J$, we define the sample average $\bar G_\mu(x) := \frac{1}{J} \sum_{j=1}^J G_\mu(x, \phi_j, \xi_j)$. It is easy to see that for any $J \ge 1$, $\bar G_\mu(x)$ is also an unbiased estimator of $\nabla f_\mu(x)$. A minimal code sketch of this estimator follows.
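The following short Python sketch (ours, with a toy nonsmooth $f$ and a synthetic noisy oracle standing in for the SZO; all names are hypothetical) illustrates the estimator (10) and its $J$-sample average:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # toy nonsmooth Lipschitz function, chosen only for illustration
    return np.sum(np.abs(x)) + 0.1 * np.sum(np.cos(x))

def szo(x):
    # stands in for H(x, xi): the functional value plus zero-mean noise;
    # in (10) both evaluations share the same xi, so the independent noise
    # used here only inflates the variance, not the bias
    return f(x) + 0.01 * rng.standard_normal()

def grad_est(x, mu=1e-3, J=20):
    """J-sample average of two-point estimates (H(x+mu*phi) - H(x))/mu * phi."""
    g = np.zeros_like(x)
    for _ in range(J):
        phi = rng.standard_normal(x.shape)   # phi ~ N(0, I_d)
        g += (szo(x + mu * phi) - szo(x)) / mu * phi
    return g / J

print(grad_est(np.array([1.0, -0.5, 0.2])))  # approximates grad f_mu(x)
```

Averaging over $J$ samples reduces the variance of the estimate by a factor of $1/J$, which is exactly the mechanism behind the bound (22) used in the analysis.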

2.2 The Proposed Algorithm

In this part we propose a primal-dual algorithm for the distributed optimization problem (6). Let $\lambda_{ij}$ be the multiplier associated with the consensus constraint between agents $i$ and $j$, for each $(i, j) \in \mathcal{E}$. Moreover, stack all the $\lambda_{ij}$'s in a vector $\lambda \in \mathbb{R}^{Ed}$. Then, the augmented Lagrangian (AL) function for problem (6) is given by

$\mathcal{L}_\rho(x, \lambda) = f(x) + \langle \lambda, Ax \rangle + \frac{\rho}{2} \|Ax\|^2,$  (12)

where $\rho > 0$ is a constant. Moreover, as in (7), define the smoothed version $f_{i,\mu}$ of each local function $f_i$. At iteration $r$ of the algorithm we obtain an unbiased estimate of the gradient of each local smoothed function as follows. For every sample $j = 1, \dots, J$ we generate a random vector $\phi_{i,j}^r$ from an i.i.d. standard Gaussian distribution and calculate, similarly to (10),

$G_{i,j}^r = \frac{\mathcal{H}_i(x_i^r + \mu \phi_{i,j}^r, \xi_{i,j}^r) - \mathcal{H}_i(x_i^r, \xi_{i,j}^r)}{\mu}\, \phi_{i,j}^r.$  (13)

Define $\bar G_i^r := \frac{1}{J} \sum_{j=1}^J G_{i,j}^r$ and stack these into $\bar G^r := [(\bar G_1^r)^\top, \dots, (\bar G_N^r)^\top]^\top$. The following theorem bounds the second moment of the estimator $G_\mu$.

Theorem 1 (Theorem 4 [31])

If $f$ is a Lipschitz continuous function with constant $L_0$, then

$\mathbb{E}\left[ \|G_\mu(x, \phi, \xi)\|^2 \right] \le (d + 4)^2 L_0^2.$  (14)
1: Input: degree matrix $D$, penalty parameter $\rho$, total number of iterations $T$, number of samples $J$, smoothing parameter $\mu$
2: Initialize: primal variable $x^0$, dual variable $\lambda^0 = 0$
3: for $r = 0$ to $T - 1$ do
4: Update the primal variable and the dual variable by
$x^{r+1} = \arg\min_x \; \langle \bar G^r, x - x^r \rangle + \langle \lambda^r, Ax \rangle + \frac{\rho}{2} \|Ax\|^2 + \frac{\rho}{2} \|x - x^r\|_{L_+}^2$  (15)
$\lambda^{r+1} = \lambda^r + \rho A x^{r+1}$  (16)
5: end for
6: Choose $\bar r \in \{1, \dots, T\}$ uniformly at random
7: Output: $(x^{\bar r}, \lambda^{\bar r})$.
Algorithm 1 The Proposed Algorithm for problem (1)

Our proposed algorithm is summarized in Algorithm 1. In the primal step (15), an approximate gradient descent step is taken towards minimizing the augmented Lagrangian function with respect to $x$. In particular, in the first-order approximation of the AL function the true gradient is replaced by the noisy zeroth-order estimate $\bar G^r$, and a matrix-weighted quadratic penalty $\frac{\rho}{2}\|x - x^r\|_{L_+}^2$ is added. This term is critical both for the algorithm itself and for the analysis. The dual step (16) is then performed, which is a gradient ascent step on the dual variable $\lambda$.

To see how Algorithm 1 can be implemented in a distributed way, consider the optimality condition for (15):

$\bar G^r + A^\top \lambda^r + \rho A^\top A x^{r+1} + \rho L_+ (x^{r+1} - x^r) = 0.$  (17)

Utilizing (4) and (5), we have

$x^{r+1} = \frac{1}{2\rho}\, D^{-1} \left( \rho L_+ x^r - \bar G^r - A^\top \lambda^r \right).$

To implement this primal iteration, each agent $i$ only requires local information as well as the iterates $x_j^r$ of its neighbors $j \in N_i$. This is because $D$ is a diagonal matrix and the structure of the matrix $L_+$ ensures that the $i$th block vector of $L_+ x^r$ depends only on $x_i^r$ and $\{x_j^r\}_{j \in N_i}$. For the dual step, w.l.o.g. we assign the dual variable $\lambda_{ij}$ to node $i$ and therefore, from (16), we have

$\lambda_{ij}^{r+1} = \lambda_{ij}^r + \rho\, (x_i^{r+1} - x_j^{r+1}),$  (18)

which only requires the local information $x_i^{r+1}$ as well as the iterate $x_j^{r+1}$ of the neighbor $j \in N_i$.
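Under the closed-form update derived above, the whole method reduces to a few matrix-vector products. The sketch below (a centralized simulation of the distributed iterations, reusing A, D, L_plus and a stacked gradient estimator in the spirit of the earlier sketches; rho and T are arbitrary choices, not the paper's tuning) shows one way to realize updates (15)-(16):

```python
import numpy as np

def primal_dual_zo(A, D, L_plus, grad_est_stacked, x0, rho=10.0, T=500):
    """Runs the (reconstructed) primal-dual iteration; grad_est_stacked(x)
    must return the stacked zeroth-order gradient estimates bar{G}^r."""
    Dinv = np.linalg.inv(D)                  # D is diagonal, so this is cheap
    x, lam = x0.copy(), np.zeros(A.shape[0])
    for _ in range(T):
        G = grad_est_stacked(x)
        # primal step: x <- (1/(2 rho)) D^{-1} (rho L_+ x - G - A^T lam), cf. (17)
        x = Dinv @ (rho * (L_plus @ x) - G - A.T @ lam) / (2.0 * rho)
        # dual step: lam <- lam + rho A x, cf. (16)
        lam = lam + rho * (A @ x)
    return x, lam
```

In a true multi-agent deployment, each block row of these products is computed by one agent from its own and its neighbors' iterates, as explained above.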

3 The Convergence Analysis

In this section we study the convergence of Algorithm 1. We make the following assumptions.

Assumptions A. We assume that

  • Each function $f_i$ is Lipschitz continuous with constant $L_0$.

  • The function $f$ is lower bounded.

The above assumptions on the objective are quite standard in the analysis of first-order optimization algorithms. To simplify notation, let $\mathcal{F}^r$ be the $\sigma$-field generated by the entire history of the algorithm up to iteration $r$, let $\sigma_{\min}$ be the smallest nonzero eigenvalue of $A^\top A$, and let $w^{r+1} := x^{r+1} - x^r$ be the successive difference of the primal iterates. In the analysis that follows we will make use of the following relations:

  • For any given vectors $a$ and $b$ we have

    $\langle a - b, a \rangle = \frac{1}{2}\left( \|a\|^2 - \|b\|^2 + \|a - b\|^2 \right),$  (19)
    $\langle a, b \rangle \le \frac{\epsilon}{2} \|a\|^2 + \frac{1}{2\epsilon} \|b\|^2, \quad \forall\, \epsilon > 0.$  (20)

  • For any given vectors $a_1, \dots, a_n$ we have that

    $\Big\| \sum_{i=1}^n a_i \Big\|^2 \le n \sum_{i=1}^n \|a_i\|^2$  (21)

    (a numerical sanity check of (19)-(21) is sketched right after this list).
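As a quick sanity check of the reconstructed relations (19)-(21), the following snippet (ours) verifies them numerically on random vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.standard_normal(5), rng.standard_normal(5)

# (19): <a - b, a> = (||a||^2 - ||b||^2 + ||a - b||^2) / 2
assert np.isclose(np.dot(a - b, a),
                  0.5 * (a @ a - b @ b + (a - b) @ (a - b)))

# (20): Young's inequality, here with eps = 2
eps = 2.0
assert np.dot(a, b) <= eps / 2 * (a @ a) + 1 / (2 * eps) * (b @ b)

# (21): ||sum_i a_i||^2 <= n * sum_i ||a_i||^2, with n = 3 vectors
vs = [rng.standard_normal(5) for _ in range(3)]
assert np.sum(sum(vs) ** 2) <= 3 * sum(v @ v for v in vs)
```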

Our proof consists of a series of lemmas leading to the main convergence rate result. In our presentation we try to provide insights about the analysis steps, while the proofs of the results are relegated to the appendix.

First, we bound the difference between $\nabla f_\mu(x^r)$ and its unbiased estimate $\bar G^r$ as follows:

$\mathbb{E}\left[ \|\bar G^r - \nabla f_\mu(x^r)\|^2 \right] \le \frac{1}{J}\, \mathbb{E}\left[ \|G_\mu(x^r, \phi, \xi)\|^2 \right] \le \frac{(d + 4)^2 L_0^2}{J},$  (22)

where the last inequality follows from (14). The next lemma bounds the change of the dual variables by that of the primal variables. The proofs of the results that follow can be found in the appendix.

Lemma 1

Suppose Assumptions A hold true. Let $L_\mu$ denote the gradient Lipschitz constant of the function $f_\mu$. Then we have the following inequality:

(23)

The next step is key in our analysis. We define the smoothed version of the AL function in a similar way as in (7) and denote it by $\mathcal{L}^\mu_\rho(x, \lambda)$; for notational simplicity we write $\mathcal{L}^r := \mathcal{L}^\mu_\rho(x^r, \lambda^r)$. From equation (9) we know that this function is gradient Lipschitz continuous with constant $L_\mu$. Now let $c$ be some positive constant. Finally, we define the following potential function:

(24)

We study the behavior of the proposed potential function as the algorithm proceeds.

Lemma 2

Suppose Assumptions A hold true. Then we have

(25)

where

(26)

Note that the coefficients of the terms in (25) can be made positive as long as the penalty parameter $\rho$ and the constant $c$ are chosen large enough. In particular, the following conditions are sufficient:

(27)

The key insight obtained from Lemma 2 is that a proper combination of a primal quantity (i.e., the AL function) and the dual gap (i.e., the feasibility violation) can serve as a potential function that guides the progress of the algorithm.

In the next lemma we show that this potential function is lower bounded.

Lemma 3

Suppose Assumptions A hold true, and the constant $c$ is selected sufficiently large. Then there exists a constant $\underline{P}$, independent of the total number of iterations $T$, such that

(28)

To characterize the convergence rate of Algorithm 1, let us define the stationarity gap of the smoothed version of problem (6) as

$\Phi(x, \lambda) := \left\| \nabla_x \mathcal{L}^\mu_\rho(x, \lambda) \right\|^2 + \|Ax\|^2.$  (29)

It can be easily checked that $\Phi(x^*, \lambda^*) = 0$ if and only if $(x^*, \lambda^*)$ is a KKT point of the smoothed version of problem (6). For simplicity let us denote $\Phi^r := \Phi(x^r, \lambda^r)$.
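A hedged sketch of this gap as reconstructed in (29) (the function and argument names are ours): it sums the squared norm of the AL gradient with the squared consensus violation, both of which vanish at a KKT point.

```python
import numpy as np

def stationarity_gap(x, lam, A, grad_f_mu, rho):
    """Gap (29): ||grad_x L_rho^mu(x, lam)||^2 + ||A x||^2."""
    primal = grad_f_mu(x) + A.T @ lam + rho * (A.T @ (A @ x))
    feas = A @ x                              # consensus violation
    return primal @ primal + feas @ feas
```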

At this point we are ready to combine the previous results to obtain our main theorem.

Theorem 2

Suppose Assumptions A hold true, the penalty parameter $\rho$ satisfies the condition given in Lemma 2, and the constant $c$ satisfies the condition required by Lemma 3. Then, there exist constants $\alpha_1, \alpha_2 > 0$, independent of $T$ and $J$, such that

$\mathbb{E}\left[ \Phi^{\bar r} \right] \le \frac{\alpha_1}{T} + \frac{\alpha_2}{J}.$  (30)

From Theorem 2 we observe that a constant term, independent of $T$, always remains on the right-hand side of the stationarity-gap bound. Therefore, no matter how many iterations we run the algorithm, we only converge to a neighborhood of a stationary point. However, if we choose the number of samples $J = T$, we obtain the following bound:

$\mathbb{E}\left[ \Phi^{\bar r} \right] \le \frac{\alpha_1 + \alpha_2}{T} = \mathcal{O}\left( \frac{1}{T} \right),$  (31)

which verifies the sublinear convergence rate of the algorithm.

4 Numerical Results

In this section we illustrate the proposed algorithm through numerical simulations. For our experiments we study two nonconvex distributed optimization problems.

First, we consider a simple nonconvex nonsmooth distributed optimization problem defined below:

(32)

where for each agent $i$ we have

In (32), the problem dimension and the number of nodes in the network are fixed; the details of the underlying graph are discussed in [39]. We compare Algorithm 1, using a constant stepsize that satisfies (27), with the Randomized Gradient-Free (RGF) algorithm proposed in [34], which uses a diminishing stepsize. Note that, in theory, RGF only works for convex problems; we include it here for comparison purposes only. We compare the two algorithms in terms of the stationarity gap defined in (29) and the constraint violation $\|Ax^r\|^2$. The stopping criterion is a fixed number of iterations, and the results are averaged over independent trials. Figures (a) and (b) below show our comparative results. We observe that the stationarity gap and the consensus error vanish faster for our proposed algorithm than for the RGF algorithm.

(a) Comparison of the proposed Algorithm 1 and the RGF algorithm [34] in terms of the stationarity gap for the nonconvex nonsmooth distributed optimization problem (32).
(b) Comparison of the proposed Algorithm 1 and the RGF algorithm [34] in terms of the constraint violation $\|Ax^r\|^2$ for the nonconvex nonsmooth distributed optimization problem (32).

Next, we study a mini-batch binary classification problem with a nonconvex nonsmooth regularizer, where each node stores $B$ (the batch size) data points. For this problem the local function is given by

where $v_{ij}$ and $y_{ij}$ are the feature vector and the label for the $j$th data point of agent $i$ [40]. The nonconvex nonsmooth regularization term imposes sparsity on the decision vector; its definition involves a constant that controls the sparsity level and a second, small constant. The network consists of several nodes and each node contains randomly generated data points. Algorithm 1 and RGF run for a fixed number of iterations, and Figures (c) and (d) illustrate the stationarity gap and the constraint violation versus the iteration counter. From these plots we can observe that, as in the previous problem, Algorithm 1 is faster than RGF. Note again that, in theory, RGF is designed for convex problems only. To the best of our knowledge, our algorithm is the first provable distributed zeroth-order method for nonconvex and nonsmooth problems. A hedged illustration of one possible such local cost follows.
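Since the exact local cost was lost in extraction, the snippet below is only a plausible stand-in (our assumption, not the paper's model): a mini-batch logistic loss plus a capped-$\ell_1$ penalty, which is indeed nonconvex and nonsmooth; all names and constants are hypothetical.

```python
import numpy as np

def local_cost(z, V, y, lam=0.1, theta=1.0):
    """V: (B, d) features, y: (B,) labels in {-1, +1} for one agent's batch."""
    logistic = np.mean(np.log1p(np.exp(-y * (V @ z))))      # mini-batch loss
    penalty = lam * np.sum(np.minimum(np.abs(z), theta))    # capped-l1: nonconvex, nonsmooth
    return logistic + penalty
```

Only such zeroth-order evaluations of the local cost are needed by Algorithm 1, which matches the SZO model in (2).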

(c) Comparison of the proposed Algorithm 1 and the RGF algorithm [34] in terms of the stationarity gap for the binary classification problem.
(d) Comparison of the proposed Algorithm 1 and the RGF algorithm [34] in terms of the constraint violation $\|Ax^r\|^2$ for the binary classification problem.

5 Conclusion

In this work, we proposed a distributed gradient-free algorithm to solve nonconvex and nonsmooth optimization problems using only local zeroth-order information. We rigorously analyzed the convergence rate of the proposed algorithm and demonstrated its performance via simulation. To the best of our knowledge, this is the first distributed framework for nonconvex and nonsmooth optimization that also has a provable sublinear convergence rate. The proposed framework can be used to solve a variety of problems where access to first- or second-order information is very expensive or even impossible.

Appendix

5.1 Proof of Lemma 1

From equation (16) we have

$\lambda^{r+1} - \lambda^r = \rho A x^{r+1}.$  (33)

Equation (33) implies that $\lambda^{r+1} - \lambda^r$ lies in the column space of $A$, therefore we have

$\|\lambda^{r+1} - \lambda^r\|^2 \le \frac{1}{\sigma_{\min}}\, \|A^\top (\lambda^{r+1} - \lambda^r)\|^2,$  (34)

where $\sigma_{\min}$ denotes the smallest non-zero eigenvalue of $A^\top A$. Utilizing equation (33) and equation (17), we obtain

$A^\top \lambda^{r+1} = -\bar G^r - \rho L_+ (x^{r+1} - x^r).$  (35)

Replacing $r$ with $r - 1$ in equation (35) and then using the definition $w^{r+1} := x^{r+1} - x^r$, we obtain

$\|A^\top (\lambda^{r+1} - \lambda^r)\|^2 = \|\bar G^{r-1} - \bar G^r + \rho L_+ (w^r - w^{r+1})\|^2 \le 2 \|\bar G^{r-1} - \bar G^r\|^2 + 2 \rho^2 \|L_+\|^2 \|w^r - w^{r+1}\|^2,$  (36)

where the last inequality follows from (21). Adding and subtracting $\nabla f_\mu(x^{r-1})$ and $\nabla f_\mu(x^r)$ to the first term on the r.h.s. of (36) and taking the expectation on both sides gives

(37)

where the last inequality is true because of the variance bound (22), the fact that $f_\mu$ is gradient Lipschitz with constant $L_\mu$, and the inequality $\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$. The proof is complete.

5.2 Proof of Lemma 2

First we prove that the function $\mathcal{L}^\mu_\rho(x, \lambda)$ is strongly convex with respect to the variable $x$ when $\rho$ is chosen large enough. From Assumption A.1 and the structure of the quadratic penalty terms, it follows that $\mathcal{L}^\mu_\rho(\cdot, \lambda)$ is strongly convex with some modulus $\gamma > 0$. Using this fact, we can bound the successive difference of the AL function as follows:

(38)

where the last inequality holds true due to the strong convexity of $\mathcal{L}^\mu_\rho$ and (33). Now using (35) we have

here the last inequality follows from (20) with an appropriate choice of $\epsilon$. Taking expectation on both sides we have

(39)

where the first inequality follows from (36) and the second inequality follows from the variance bound (22). Now we bound the remaining inner-product term. From the optimality condition for problem (15) and the dual update (16) we have

Similarly, for the $(r-1)$th iteration, we have

Setting $x = x^r$ in the first equation and $x = x^{r+1}$ in the second equation, and adding them, we obtain

(40)

The l.h.s. of (40) can be expressed as follows:

(41)

where the first equality follows from (16) and the second equality follows from (19). For the r.h.s. of (40) we have

where the first inequality follows from (20). To get the second inequality we add and subtract the gradient $\nabla f_\mu(x^r)$ and use (21). Taking expectation on both sides, we have

(42)

where the inequality follows from the variance bound (22) and the last equality follows from (19). Combining (41) and (42), we obtain